
Hindawi Publishing Corporation, Mathematical Problems in Engineering, Volume 2016, Article ID 7819626, 12 pages, http://dx.doi.org/10.1155/2016/7819626

Research Article
Improved Feature Weight Algorithm and Its Application to Text Classification

Songtao Shang,1 Minyong Shi,1 Wenqian Shang,1 and Zhiguo Hong2

1 School of Computer Science, Communication University of China, Beijing 100024, China
2 School of Computer, Faculty of Science and Engineering, Communication University of China, Beijing 100024, China

Correspondence should be addressed to Songtao Shang; songtaoshang@foxmail.com

Received 2 November 2015; Revised 1 February 2016; Accepted 3 March 2016

Academic Editor: Andrzej Swierniak

Copyright © 2016 Songtao Shang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Text preprocessing is one of the key problems in pattern recognition and plays an important role in the process of text classification. Text preprocessing has two pivotal steps: feature selection and feature weighting. The preprocessing results can directly affect the classifiers' accuracy and performance. Therefore, choosing the appropriate algorithms for feature selection and feature weighting to preprocess the documents can greatly improve the performance of classifiers. According to the Gini Index theory, this paper proposes an Improved Gini Index algorithm. This algorithm constructs a new feature selection and feature weighting function. The experimental results show that this algorithm can improve the classifiers' performance effectively. At the same time, the algorithm is applied to a sensitive information identification system and achieves a good result: its precision and recall are higher than those of traditional algorithms, and it can identify sensitive information on the Internet effectively.

1. Introduction

The information in the real world is always in disorder. Usually, we need to classify the disordered information for our cognition and learning. In the field of information processing, a text is the most basic form to express information, such as news, websites, and online chat messages. Therefore, text classification is becoming an important task. Text classification (TC) [1] is the problem of automatically assigning predefined categories to free text documents. The Vector Space Model (VSM) [2, 3] is widely used to express text information. That is, a text $D$ can be expressed as $V(d) = \{\langle T_1, W_1 \rangle, \langle T_2, W_2 \rangle, \langle T_3, W_3 \rangle, \ldots, \langle T_n, W_n \rangle\}$, where $T_1, T_2, \ldots, T_n$ are features of the text and $W_1, W_2, \ldots, W_n$ are the weights of the features. Even a moderate-sized text collection contains tens or hundreds of thousands of feature terms. This is prohibitively high for many machine learning algorithms, and only a few neural network algorithms can handle such a large number of input features. Moreover, many of these features are redundant, which will influence the precision of TC algorithms. Hence, a central difficulty in TC research is how to reduce the dimensionality of the feature space.

Feature selection (FS) [4, 5] aims to find the minimum feature subset that can represent the original text; that is, FS can be used to reduce the redundancy of features, improve the comprehensibility of models, and identify the hidden structures in a high-dimensional feature space. FS is a key step in the process of TC, since it prepares the input for the classification algorithms, and the quality of FS has a direct impact on the performance of classification. The common methods of FS often use machine learning algorithms [6], such as Information Gain, Expected Cross Entropy, Mutual Information, Odds Ratio, and CHI. Each algorithm has its own advantages and disadvantages. On English data sets, Yang and Pedersen's experiments showed that the Information Gain, Expected Cross Entropy, and CHI algorithms achieve better performance [7]. On Chinese data sets, Liqing et al.'s experiments showed that the CHI and Information Gain algorithms achieve the best performance and the Mutual Information algorithm is the worst [8]. Many researchers employ other theories or algorithms for FS; references [9, 10] adopted the Gini Index for FS and achieved good performance. The Gini Index is a measurement to evaluate the impurity of a set, which describes the importance of features in classification. It is widely used for splitting attributes in decision tree algorithms [11]; that is to say, the Gini Index can identify the importance of features.


On the other hand, feature weighting (FW) is another important aspect of text preprocessing. TF-IDF [12] is a popular FW algorithm. However, TF-IDF is not well suited to FW for text classification. The basic idea of TF-IDF is that a feature term is more important if it has a higher frequency in a text, known as Term Frequency (TF), and a feature term is less important if it appears in many different text documents of the training set, known as Inverse Document Frequency (IDF). In TC, feature frequency is one of the most important aspects, regardless of whether the feature appears in one or more texts. Many researchers have improved the TF-IDF algorithm by replacing the IDF part with FS algorithms, such as TF-IG and TF-CHI [13–15]. This paper mainly focuses on an Improved Gini Index algorithm and its application and proposes an FW algorithm that uses the Improved Gini Index instead of the IDF part of TF-IDF, known as the TF-Gini algorithm. The experimental results show that TF-Gini is a promising algorithm. To test the performance of the Improved Gini Index and TF-Gini algorithms, we introduce kNN [16, 17], fkNN [18], and SVM [19, 20] as the benchmark TC algorithms.

As an application of the TF-Gini algorithm, this paper designs and implements a sensitive information monitoring system. Sensitive information on the Internet is a phenomenon that information experts are eager to investigate. Since the application of the Internet is developing very fast, many new words and phrases emerge on the Internet every day, and these new words and phrases pose challenges to sensitive information monitoring. At present, most network information monitoring and filtering algorithms are inflexible and inaccurate. Based on the previous research results, we use TF-Gini as the core algorithm and design and implement a system for sensitive information monitoring. This system can update sensitive words and phrases in time and has better performance in monitoring Internet information.

2. Text Classification

2.1. Text Classification Process. The text classification process is shown in Figure 1. The left part is the training process; the right part is the classification process. The function of each part can be described as follows.

2.1.1. The Training Process. The purpose of the training process is to build a classification model. New samples are assigned to the appropriate category by using this model.

(1) Words Segmentation. In English and other Western languages, there is a space delimiter between words. In Oriental languages, there are no space delimiters between words, and Chinese word segmentation is the basis for all Chinese text processing. There are many Chinese word segmentation algorithms [21, 22]; most of them use machine learning algorithms. ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) is the best Chinese lexical analyzer, designed by the ICT (Institute of Computing Technology, Chinese Academy of Sciences). In this paper, we use ICTCLAS for Chinese word segmentation.

[Figure 1: Text classification process. The training branch runs words segmentation (for Chinese texts), Stop Words or stemming, VSM, feature selection, and classification to build the classification model; the classification branch runs the same preprocessing on the test set, uses the model to classify new samples, and evaluates the results.]

(2) Stop Words or Stemming. A text document always contains hundreds or thousands of words. Many of the words appear with a very high frequency but are useless, such as "the" and "a", nonsense articles, and adverbs. These words are called Stop Words [23]. Stop Words, which are language-specific functional words, are frequent words that carry no information. The first step during text processing is to remove these Stop Words.

Stemming techniques are used to find the root or stem of a word. Stemming converts words to their stems, which incorporates a great deal of language-dependent linguistic knowledge. The hypothesis behind stemming is that words with the same stem or root mostly describe the same or closely related concepts in the document. Hence, words can be conflated by using their stems. For example, the words user, users, used, and using can all be stemmed to the word "USE". In Chinese text classification we only need Stop Words, because there is no concept of stemming in Chinese.

(3) VSM (Vector Space Model). VSM is very popular in natural language processing, especially in computing the similarity between documents. VSM maps a document to a vector. For example, if we view each feature word as a dimension and the word frequency as its value, the document can be represented as an $n$-dimensional vector, that is, $T = \{\langle t_1, w_1 \rangle, \langle t_2, w_2 \rangle, \ldots, \langle t_n, w_n \rangle\}$, where $t_i$ ($1 \le i \le n$) is the $i$th feature and $w_i$ ($1 \le i \le n$) is the feature's frequency, which is also called its weight. Table 1 is an example of a VSM of three documents with ten feature words: doc1, doc2, and doc3 are documents; $t_1, t_2, \ldots, t_{10}$ are feature words.

Table 1: Three documents represented in the VSM.

        t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
doc1     1    2         5         7         9
doc2          3         4         6    8
doc3    10        11        12             13   14   15

Therefore, we can represent doc1, doc2, and doc3 in the VSM as $T_1 = \{\langle t_1, 1 \rangle, \langle t_2, 2 \rangle, \langle t_4, 5 \rangle, \langle t_6, 7 \rangle, \langle t_8, 9 \rangle\}$, $T_2 = \{\langle t_2, 3 \rangle, \langle t_4, 4 \rangle, \langle t_6, 6 \rangle, \langle t_7, 8 \rangle\}$, and $T_3 = \{\langle t_1, 10 \rangle, \langle t_3, 11 \rangle, \langle t_5, 12 \rangle, \langle t_8, 13 \rangle, \langle t_9, 14 \rangle, \langle t_{10}, 15 \rangle\}$.
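To make the representation concrete, the following minimal Python sketch (our own illustration, not code from the paper) stores the three documents of Table 1 as sparse dictionaries and expands them over the ten-word vocabulary:

```python
# Sparse VSM vectors for the three documents of Table 1:
# a feature word that does not occur simply has weight 0.
doc1 = {"t1": 1, "t2": 2, "t4": 5, "t6": 7, "t8": 9}
doc2 = {"t2": 3, "t4": 4, "t6": 6, "t7": 8}
doc3 = {"t1": 10, "t3": 11, "t5": 12, "t8": 13, "t9": 14, "t10": 15}

def to_dense(doc, vocabulary):
    """Expand a sparse document vector over a fixed vocabulary."""
    return [doc.get(term, 0) for term in vocabulary]

vocabulary = [f"t{i}" for i in range(1, 11)]
print(to_dense(doc1, vocabulary))  # [1, 2, 0, 5, 0, 7, 0, 9, 0, 0]
```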

(4) Feature Words Selection. After Stop Words removal and stemming, there are still tens of thousands of words in a training set, and it is impossible for a classifier to handle so many words. The purpose of feature word selection is to reduce the dimension of the feature space. It is a very important step for text classification: choosing appropriate representative words can improve classifiers' performance. This paper mainly focuses on this stage.

(5) Classification and Building the Classification Model. After the above steps, all the texts of the training set have been mapped to the VSM. This step runs the classification algorithm on the VSM and builds the classification model, which can assign an appropriate category to new input samples.

2.1.2. The Classification Process. The steps of word segmentation, Stop Words removal or stemming, and VSM construction are the same as in the training process. The classification process assigns an appropriate category to new samples: when a new sample arrives, the classification model calculates its most likely class label and assigns it to the sample. Results evaluation is a very important step in the text classification process; it indicates whether the algorithm is good or not.

2.2. Feature Selection. Although the Stop Words and stemming step gets rid of some useless words, the VSM still contains many insignificant words, which may even have a negative effect on classification. It is difficult for many classification algorithms to handle high-dimensional data sets. Hence, we need to reduce the dimension of the feature space to improve the classifiers' performance. FS uses machine learning algorithms to choose appropriate words that can represent the original text, which reduces the dimensions of the feature space.

The definition of FS is as follows: select $m$ features from the original $n$ features ($m \le n$) such that the $m$ features represent the contents of the text more concisely and more efficiently. The commonly used FS algorithms can be described as follows.

2.2.1. Information Gain (IG). IG is widely used in the machine learning field as a criterion of the importance of a feature to a class. The value of IG equals the difference of information entropy before and after the appearance of a feature in a class. The greater the IG, the greater the amount of information the feature contains and the more important it is in TC. Therefore, we can filter the best feature subset according to the features' IG. The computing formula can be described as follows:

$$\text{InfGain}(w) = -\sum_{i=1}^{|C|} P(c_i) \log P(c_i) + P(w) \sum_{i=1}^{|C|} P(w \mid c_i) \log P(w \mid c_i) + P(\overline{w}) \sum_{i=1}^{|C|} P(\overline{w} \mid c_i) \log P(\overline{w} \mid c_i), \quad (1)$$

where $w$ is the feature word, $P(w)$ is the probability of text that contains $w$ appearing in the training set, $P(\overline{w})$ is the probability of text that does not contain $w$ appearing in the training set, $P(c_i)$ is the probability of class $c_i$ in the training set, $P(w \mid c_i)$ is the probability of text that contains $w$ in class $c_i$, $P(\overline{w} \mid c_i)$ is the probability of text that does not contain $w$ in class $c_i$, and $|C|$ is the total number of classes in the training set.

The shortcoming of the IG algorithm is that it takes into account features that do not appear in a text. Although in some circumstances such nonappearing features can contribute to the classification, their contribution is far less than that of the features that do appear in the training set. Moreover, if the training set is unbalanced and $P(\overline{w}) \gg P(w)$, the value of IG will be dominated by $P(\overline{w}) \sum_{i=1}^{|C|} P(\overline{w} \mid c_i) \log P(\overline{w} \mid c_i)$, and the performance of IG can be significantly reduced.
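As a concrete reading of formula (1), the sketch below estimates the probabilities from per-class document counts; the data layout and the toy numbers are our own illustration, and a small epsilon guards the logarithms against zero probabilities:

```python
import math

def information_gain(docs_with_w, docs_per_class, eps=1e-12):
    """Formula (1): docs_with_w[c] = number of documents of class c containing w,
    docs_per_class[c] = total number of documents of class c."""
    n_total = sum(docs_per_class.values())
    p_w = sum(docs_with_w.values()) / n_total      # P(w)
    p_not_w = 1.0 - p_w                            # P(w not present)
    gain = 0.0
    for c, n_c in docs_per_class.items():
        p_c = n_c / n_total                        # P(c_i)
        p_w_c = docs_with_w.get(c, 0) / n_c        # P(w | c_i)
        p_notw_c = 1.0 - p_w_c                     # P(w not present | c_i)
        gain -= p_c * math.log(p_c + eps)
        gain += p_w * p_w_c * math.log(p_w_c + eps)
        gain += p_not_w * p_notw_c * math.log(p_notw_c + eps)
    return gain

# Toy example: two classes with 10 documents each; w appears in 8 "sports"
# documents and 1 "finance" document.
print(information_gain({"sports": 8, "finance": 1}, {"sports": 10, "finance": 10}))
```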

2.2.2. Expected Cross Entropy (ECE). ECE is also known as the KL distance, and it reflects the distance between the probability of the theme class and the probability of the theme class under the condition of a specific feature. The computing formula can be described as follows:

$$\text{CrossEntropy}_{\text{Text}}(w) = P(w) \sum_{i=1}^{|C|} P(w \mid c_i) \log \frac{P(w \mid c_i)}{P(c_i)}, \quad (2)$$

where $w$ is the feature word, $P(w)$ is the probability of text that contains $w$ appearing in the training set, $P(c_i)$ is the probability of class $c_i$ in the training set, $P(w \mid c_i)$ is the probability of text that contains $w$ in class $c_i$, and $|C|$ is the total number of classes in the training set.

ECE has been widely used in feature selection for TC, and it also achieves good performance. Compared with the IG algorithm, ECE does not consider the case in which the feature does not appear, which reduces the interference of rare features and improves the performance of classification. However, it also has some limitations:

(1) If $P(w \mid c_i) > P(c_i)$, then $\log(P(w \mid c_i)/P(c_i)) > 0$. When $P(w \mid c_i)$ is larger and $P(c_i)$ is smaller, the logarithmic value is larger, which indicates that feature $w$ has a strong association with class $c_i$.

(2) If $P(w \mid c_i) < P(c_i)$, then $\log(P(w \mid c_i)/P(c_i)) < 0$. When $P(w \mid c_i)$ is smaller and $P(c_i)$ is larger, the logarithmic value is smaller, which indicates that feature $w$ has a weaker association with class $c_i$.

(3) When $P(w \mid c_i) = 0$, there is no association between feature $w$ and class $c_i$, and the logarithmic expression is undefined. Introducing a small parameter, for example, $\varepsilon = 0.0001$, that is, using $\log((P(w \mid c_i) + \varepsilon)/P(c_i))$, makes the logarithm well defined.

(4) If $P(w \mid c_i)$ and $P(c_i)$ are close to each other, then $\log(P(w \mid c_i)/P(c_i))$ is close to zero. When $P(w \mid c_i)$ is large, this indicates that feature $w$ has a strong association with class $c_i$ and should be retained. However, when $\log(P(w \mid c_i)/P(c_i))$ is close to zero, feature $w$ will be removed. In this case, we employ information entropy and add it to the ECE algorithm. The formula of information entropy is as follows:

$$\text{IE}(w) = -\sum_{i=1}^{|C|} P(w \mid c_i) \log\left(P(w \mid c_i)\right). \quad (3)$$

In summary, combining (2) and (3), the ECE formula becomes

$$\text{ECE}(w) = \frac{P(w)}{\text{IE}(w) + \varepsilon} \sum_{i=1}^{|C|} \left(P(w \mid c_i) + \varepsilon\right) \log \frac{P(w \mid c_i) + \varepsilon}{P(c_i)}. \quad (4)$$

If feature $w$ exists only in one class, the value of the information entropy is zero, that is, $\text{IE}(w) = 0$. Hence we introduce the small parameter $\varepsilon$ in the denominator as a regulator.
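A minimal sketch of formula (4), with the entropy regulator of formula (3); the argument layout and the toy probabilities are assumptions made for illustration, not values from the paper:

```python
import math

def expected_cross_entropy(p_w, p_w_given_c, p_c, eps=1e-4):
    """Formula (4): ECE with the information-entropy regulator IE(w) of formula (3).
    p_w_given_c and p_c are dicts keyed by class."""
    # IE(w) = -sum_i P(w|c_i) log P(w|c_i), formula (3)
    ie = -sum(p * math.log(p) for p in p_w_given_c.values() if p > 0)
    total = 0.0
    for c, p_wc in p_w_given_c.items():
        total += (p_wc + eps) * math.log((p_wc + eps) / p_c[c])
    return p_w / (ie + eps) * total

# Toy values: w occurs in 30% of all texts and is concentrated in class "c1".
print(expected_cross_entropy(0.3, {"c1": 0.55, "c2": 0.05}, {"c1": 0.5, "c2": 0.5}))
```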

2.2.3. Mutual Information (MI). In TC, MI expresses the association between feature words and classes. The MI between $w$ and $c_i$ is defined as

$$\text{MutualInfo}(w, c_i) = \log \frac{P(w \mid c_i)}{P(w)} = \log \frac{P(w, c_i)}{P(w)\, P(c_i)}, \quad (5)$$

where $w$ is the feature word, $P(w, c_i)$ is the probability of text that contains $w$ in class $c_i$, $P(w)$ is the probability of text that contains $w$ in the training set, and $P(c_i)$ is the probability of class $c_i$ in the training set.

It is necessary to calculate the MI between a feature and each category in the training set. The following two formulas can be used to combine these values:

$$\text{MI}_{\max}(w) = \max_{i=1}^{|C|} \text{MutualInfo}(w, c_i), \qquad \text{MI}_{\text{avg}}(w) = \sum_{i=1}^{|C|} P(c_i)\, \text{MutualInfo}(w, c_i), \quad (6)$$

where $P(c_i)$ is the probability of class $c_i$ in the training set and $|C|$ is the total number of classes in the training set. Generally, $\text{MI}_{\max}(w)$ is the more commonly used of the two.
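The following sketch computes formula (5) and the two combinations of formula (6) from precomputed probabilities; the dictionaries and numbers are illustrative assumptions:

```python
import math

def mutual_info(p_w_and_c, p_w, p_c):
    """Formula (5): MI(w, c_i) = log(P(w, c_i) / (P(w) * P(c_i)))."""
    return math.log(p_w_and_c / (p_w * p_c))

def mi_max(p_w_and_c, p_w, p_c):
    """Formula (6), max variant: the largest MI over all classes."""
    return max(mutual_info(p_w_and_c[c], p_w, p_c[c]) for c in p_c)

def mi_avg(p_w_and_c, p_w, p_c):
    """Formula (6), average variant: MI weighted by the class priors."""
    return sum(p_c[c] * mutual_info(p_w_and_c[c], p_w, p_c[c]) for c in p_c)

# Toy joint probabilities for one feature word over two classes.
p_c = {"c1": 0.6, "c2": 0.4}
p_w_and_c = {"c1": 0.15, "c2": 0.05}   # P(w, c_i)
p_w = sum(p_w_and_c.values())          # P(w) = 0.2
print(mi_max(p_w_and_c, p_w, p_c), mi_avg(p_w_and_c, p_w, p_c))
```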

2.2.4. Odds Ratio (OR). The computing formula can be described as follows:

$$\text{OddsRatio}(w, c_i) = \log \frac{P(w \mid c_i)\,\left(1 - P(w \mid \overline{c_i})\right)}{\left(1 - P(w \mid c_i)\right)\, P(w \mid \overline{c_i})}, \quad (7)$$

where $w$ is the feature word, $c_i$ represents the positive class, $\overline{c_i}$ represents the negative class, $P(w \mid c_i)$ is the probability of text that contains $w$ in class $c_i$, and $P(w \mid \overline{c_i})$ is the conditional probability of text that contains $w$ in all the classes except $c_i$.

The OR metric measures membership and nonmembership in a specific class with its numerator and denominator, respectively. Therefore, the numerator must be maximized and the denominator must be minimized to get the highest score. The formula is a one-sided metric, because the logarithm produces negative scores when the value of the fraction is between 0 and 1; features with negative values are pointed to as negative features. Thus, if we only want to identify the positive class and do not care about the negative class, Odds Ratio has a great advantage. Odds Ratio is suitable for binary classifiers.
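A small sketch of formula (7); the epsilon smoothing is our addition to keep the ratio defined when a probability is exactly 0 or 1, and the example numbers are invented:

```python
import math

def odds_ratio(p_w_pos, p_w_neg, eps=1e-6):
    """Formula (7): p_w_pos = P(w | c_i), p_w_neg = P(w | classes other than c_i)."""
    num = p_w_pos * (1.0 - p_w_neg) + eps
    den = (1.0 - p_w_pos) * p_w_neg + eps
    return math.log(num / den)

# w appears in 40% of the positive texts but only 2% of the negative ones.
print(odds_ratio(0.40, 0.02))   # clearly positive: w indicates the positive class
```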

2.2.5. $\chi^2$ (CHI). The $\chi^2$ statistical algorithm measures the relevance between the feature $w$ and the class $c_i$. The higher the score, the stronger the relevance between the feature $w$ and the class $c_i$, meaning that the feature makes a greater contribution to the class. The computing formula can be described as follows:

$$\chi^2(w, c_i) = \frac{|C|\,\left(A_1 \times A_4 - A_2 \times A_3\right)^2}{\left(A_1 + A_3\right)\left(A_2 + A_4\right)\left(A_1 + A_2\right)\left(A_3 + A_4\right)}, \quad (8)$$

where $w$ is the feature word, $A_1$ is the number of documents containing feature $w$ and belonging to class $c_i$, $A_2$ is the number of documents containing feature $w$ but not belonging to class $c_i$, $A_3$ is the number of documents belonging to class $c_i$ but not containing feature $w$, $A_4$ is the number of documents neither containing feature $w$ nor belonging to class $c_i$, and $|C|$ is the total number of text documents in class $c_i$.

When the feature $w$ and the class $c_i$ are independent, $\chi^2(w, c_i) = 0$, which means that the feature $w$ carries no identification information. We calculate the value for each category and then use formula (9) to compute the value for the entire training set:

$$\chi^2_{\max}(w) = \max_{1 \le i \le |C|} \chi^2(w, c_i). \quad (9)$$
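The sketch below evaluates formulas (8) and (9) from the four contingency counts $A_1, \ldots, A_4$; the counts themselves are made up for illustration:

```python
def chi_square(a1, a2, a3, a4, n):
    """Formula (8): a1..a4 are the four cells of the w / c_i contingency table,
    n is the document count used as the scaling factor."""
    num = n * (a1 * a4 - a2 * a3) ** 2
    den = (a1 + a3) * (a2 + a4) * (a1 + a2) * (a3 + a4)
    return num / den if den else 0.0

def chi_square_max(tables, n):
    """Formula (9): take the maximum chi-square score over all classes."""
    return max(chi_square(a1, a2, a3, a4, n) for (a1, a2, a3, a4) in tables)

# Toy contingency tables (A1, A2, A3, A4) of one word against two classes.
print(chi_square_max([(30, 5, 10, 55), (8, 27, 32, 33)], n=100))
```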

3. The Gini Feature Selection Algorithm

3.1. The Improved Gini Index Algorithm

3.1.1. Gini Index (GI). The GI is an impurity-based attribute splitting method. It is widely used in the CART algorithm [24], the SLIQ algorithm [25], and the SPRINT algorithm [26]. It chooses the splitting attribute and obtains very good classification accuracy. The GI algorithm can be described as follows.


Suppose that $Q$ is a set of $s$ samples and that these samples have $k$ different classes ($c_i$, $i = 1, 2, 3, \ldots, k$). According to the differences of the classes, we can divide $Q$ into $k$ subsets ($Q_i$, $i = 1, 2, 3, \ldots, k$). Suppose that $Q_i$ is the sample set that belongs to class $c_i$ and that $s_i$ is the number of samples in $Q_i$; then the GI of the set $Q$ is

$$\text{Gini}(Q) = 1 - \sum_{i=1}^{|C|} P_i^2, \quad (10)$$

where $P_i$ is the probability of $Q_i$ in class $c_i$, calculated as $s_i / s$. When $\text{Gini}(Q)$ is 0, that is, the minimum, all the members in the set belong to the same class, and the maximum useful information can be obtained. When all of the samples in the set are distributed equally over the classes, $\text{Gini}(Q)$ is the maximum, and the minimum useful information can be obtained.

For an attribute $A$ with $l$ distinct values, $Q$ is partitioned into $l$ subsets $Q_1, Q_2, \ldots, Q_l$. The GI of $Q$ with respect to the attribute $A$ is defined as

$$\text{Gini}_A(Q) = 1 - \sum_{i=1}^{l} \frac{s_i}{s}\, \text{Gini}(Q_i). \quad (11)$$

The main idea of the GI algorithm is that the attribute with the minimum value of GI is the best attribute, which is chosen for splitting.

3.1.2. The Improved Gini Index (IGI) Algorithm. The original form of the GI algorithm measures the impurity of attributes towards classification: the smaller the impurity, the better the attribute. However, many studies on GI theory employ a measure of purity instead: the larger the value of the purity, the better the attribute. Purity is more suitable for text classification. The formula is as follows:

$$\text{Gini}(Q) = \sum_{i=1}^{|C|} P_i^2. \quad (12)$$

This formula measures the purity of attributes towards categorization. The larger the value of purity, the better the attribute. In TC we usually emphasize the high-frequency words, because the high-frequency words contribute more to judging the class of a text. But when the distribution of the training set is highly unbalanced, the lower-frequency feature words still make some contribution to judging the class of texts, although their contribution is far less significant than that of the high-frequency feature words. Therefore, we define the IGI algorithm as

$$\text{IGI}(w) = \sum_{i=1}^{|C|} P(w \mid c_i)^2\, P(c_i \mid w)^2, \quad (13)$$

where $w$ is the feature word, $|C|$ is the total number of classes in the training set, $P(w \mid c_i)$ is the probability of text that contains $w$ in class $c_i$, and $P(c_i \mid w)$ is the posterior probability of class $c_i$ given that the text contains $w$.

This formula overcomes the shortcoming of the original GI, which considers only the features' conditional probability, by combining the conditional probability with the posterior probability. Hence, the IGI algorithm can reduce the adverse effect of an unbalanced training set.
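Formula (13) is straightforward to compute once the two probability estimates are available; the sketch below is a minimal illustration with assumed values, not the authors' implementation:

```python
def improved_gini_index(p_w_given_c, p_c_given_w):
    """Formula (13): IGI(w) = sum_i P(w|c_i)^2 * P(c_i|w)^2.
    Both arguments are dicts keyed by class."""
    return sum(p_w_given_c[c] ** 2 * p_c_given_w[c] ** 2 for c in p_w_given_c)

# Toy example: w is concentrated in class "c1".
p_w_given_c = {"c1": 0.6, "c2": 0.1}    # P(w | c_i)
p_c_given_w = {"c1": 0.9, "c2": 0.1}    # P(c_i | w)
print(improved_gini_index(p_w_given_c, p_c_given_w))
```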

3.2. TF-Gini Algorithm. The purpose of FS is to choose the most suitable representative features to represent the original text. In fact, the features' distinguishing power differs: some features can effectively distinguish a text from others, but other features cannot. Therefore, we use weights to reflect the different distinguishing power of the features; features with higher distinguishing power get higher weights.

The TF-IDF algorithm is a classical algorithm to calculate the features' weights. For a word $w$ in text $d$, the weight of $w$ in $d$ is

$$\text{TF-IDF}(w, d) = \text{TF}(w, d) \cdot \text{IDF}(w, d) = \text{TF}(w, d) \cdot \log\left(\frac{|D|}{\text{DF}(w)}\right), \quad (14)$$

where $\text{TF}(w, d)$ is the frequency with which $w$ appears in text $d$, $|D|$ is the number of text documents in the training set, and $\text{DF}(w)$ is the total number of texts containing $w$ in the training set.

The TF-IDF algorithm is not suitable for TC, mainly because of the shortcoming of its IDF part. TF-IDF considers a word that has a higher frequency in different texts of the training set to be less important; this consideration is not appropriate for TC. Therefore, we use the purity Gini to replace the IDF part of the TF-IDF formula. The purity Gini is

$$\text{Gini}_{\text{Txt}}(w) = \sum_{i=1}^{|C|} P(w \mid c_i)^{\delta}, \quad (15)$$

where $w$ is the feature word, $\delta$ is a nonzero value, $P(w \mid c_i)$ is the probability of text that contains $w$ in class $c_i$, and $|C|$ is the total number of classes in the training set. When $\delta = 2$, the formula reduces to the original (purity) Gini formula:

$$\text{Gini}_{\text{Txt}}(w) = \sum_{i=1}^{|C|} P(w \mid c_i)^2. \quad (16)$$

However, in TC the experimental results indicate that the classification performance is best when $\delta = -1/2$:

$$\text{Gini}_{\text{Txt}}(w) = \sum_{i=1}^{|C|} \frac{\sqrt{P(w \mid c_i)}}{P(w \mid c_i)}. \quad (17)$$

Therefore, the new feature weighting algorithm, known as TF-Gini, can be represented as

$$\text{TF-Gini}(w, d) = \text{TF}(w, d) \cdot \text{Gini}_{\text{Txt}}(w). \quad (18)$$
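Putting formulas (17) and (18) together, the following sketch computes a TF-Gini weight for one feature word in one document; the class-conditional probabilities are assumed toy values:

```python
def gini_txt(p_w_given_c, delta=-0.5):
    """Formulas (15)/(17): GiniTxt(w) = sum_i P(w|c_i)^delta, with delta = -1/2
    reported as the best setting in the paper."""
    return sum(p ** delta for p in p_w_given_c.values() if p > 0)

def tf_gini(tf, p_w_given_c, delta=-0.5):
    """Formula (18): TF-Gini(w, d) = TF(w, d) * GiniTxt(w)."""
    return tf * gini_txt(p_w_given_c, delta)

# Toy usage: w occurs 3 times in document d and has these class-conditional
# probabilities in the training set.
print(tf_gini(3, {"c1": 0.25, "c2": 0.04}))   # 3 * (2.0 + 5.0) = 21.0
```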

3.3. Experiments and Analysis

3.3.1. Data Sets Preparation. In this paper, we choose English and Chinese data sets to verify the performance of the FS and FW algorithms. The English data set we choose is Reuters-21578, from which we select the ten largest categories. The Chinese data set is the Fudan University Chinese text data set (19,637 texts); we use 3,522 texts for training and 2,148 texts for testing.

3.3.2. Classifiers Selection. This paper mainly focuses on the text preprocessing stage, and the IGI and TF-Gini algorithms are its main research contributions. To validate the performance of these two algorithms, we introduce some commonly used text classifiers, such as the kNN classifier, the fkNN classifier, and the SVM classifier. All the experiments are based on these classifiers.

(1) The kNN Classifier. The $k$ nearest neighbors (kNN) algorithm is a very popular, simple, and nonparametric text classification algorithm. The idea of the kNN algorithm is to choose for a new input sample the category that appears most often amongst its $k$ nearest neighbors. Its decision rule can be described as follows:

$$\mu_j(d) = \sum_{i=1}^{k} \mu_j(d_i)\, \text{sim}(d, d_i), \quad (19)$$

where $j$ ($j = 1, 2, 3, \ldots$) indexes the classes, $d$ is a new input sample with unknown category, and $d_i$ is a training text whose category is known. $\mu_j(d_i)$ is equal to 1 if sample $d_i$ belongs to class $j$; otherwise it is 0. $\text{sim}(d, d_i)$ indicates the similarity between the new input sample and the training text. The decision rule is

$$\text{if } \mu_j(d) = \max_i \mu_i(d), \text{ then } d \in c_j. \quad (20)$$
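A compact sketch of the decision rule in formulas (19) and (20), assuming the similarities to the training texts (for example, cosine similarities) have already been computed; the toy data are our own:

```python
def knn_scores(similarities, labels, classes, k=5):
    """Formula (19): mu_j(d) = sum over the k nearest neighbors of
    membership(d_i, j) * sim(d, d_i). similarities[i] = sim(d, d_i),
    labels[i] = class of training text d_i."""
    neighbors = sorted(zip(similarities, labels), reverse=True)[:k]
    scores = {c: 0.0 for c in classes}
    for sim, label in neighbors:
        scores[label] += sim          # membership is 1 for the neighbor's class
    return scores

def knn_predict(similarities, labels, classes, k=5):
    """Formula (20): assign d to the class with the largest mu_j(d)."""
    scores = knn_scores(similarities, labels, classes, k)
    return max(scores, key=scores.get)

sims = [0.9, 0.85, 0.4, 0.82, 0.1]
labs = ["sports", "sports", "finance", "finance", "finance"]
print(knn_predict(sims, labs, classes=["sports", "finance"], k=3))  # "sports"
```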

(2) The fkNN Classifier. The kNN classifier is a lazy classification algorithm without a training phase, and its performance is not ideal, especially when the distribution of the training set is unbalanced. Therefore, we improve the algorithm by using a fuzzy logic inference system. Its decision rule can be described as follows:

$$\mu_j(d) = \frac{\sum_{i=1}^{k} \mu_j(d_i)\, \text{sim}(d, d_i)\, \left(1 - \text{sim}(d, d_i)\right)^{2/(1-b)}}{\sum_{i=1}^{k} \left(1 - \text{sim}(d, d_i)\right)^{2/(1-b)}}, \quad (21)$$

where $j$ ($j = 1, 2, 3, \ldots$) indexes the classes and $\mu_j(d_i)\,\text{sim}(d, d_i)$ is the membership value of sample $d$ for class $j$. $\mu_j(d_i)$ is equal to 1 if sample $d_i$ belongs to class $j$; otherwise it is 0. From the formula it can be seen that fkNN is based on the kNN algorithm and endows the kNN formula with a distance weight. The parameter $b$, $b \in [0, 1]$, adjusts the degree of the weight. The fkNN decision rule is as follows:

$$\text{if } \mu_j(d) = \max_i \mu_i(d), \text{ then } d \in c_j. \quad (22)$$
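The same toy setup can be scored with the fuzzy rule; the exponent 2/(1 - b) follows formula (21) as reconstructed above, and the neighbor data are again illustrative:

```python
def fknn_scores(similarities, labels, classes, k=5, b=0.5, eps=1e-9):
    """Formula (21) as reconstructed: each of the k nearest neighbors is reweighted
    by (1 - sim)^(2/(1-b)) and the result is normalized by the sum of the weights."""
    neighbors = sorted(zip(similarities, labels), reverse=True)[:k]
    weights = [(1.0 - sim) ** (2.0 / (1.0 - b)) for sim, _ in neighbors]
    norm = sum(weights) + eps
    scores = {c: 0.0 for c in classes}
    for (sim, label), w in zip(neighbors, weights):
        scores[label] += sim * w / norm   # membership is 1 for the neighbor's class
    return scores

sims = [0.9, 0.85, 0.4, 0.82, 0.1]
labs = ["sports", "sports", "finance", "finance", "finance"]
print(fknn_scores(sims, labs, classes=["sports", "finance"], k=3, b=0.5))
```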

[Figure 2: Optimal hyperplane for the SVM classifier. The solid line $H$ is the optimal separating hyperplane $\omega \cdot x + b = 0$; the parallel hyperplanes $H_1$ ($\omega \cdot x + b = +1$) and $H_2$ ($\omega \cdot x + b = -1$) pass through the closest samples of each class.]

(3) The SVM Classifier. The support vector machine (SVM) is a classification algorithm based on statistical learning theory. SVM is highly accurate owing to its ability to model complex nonlinear decision boundaries. It uses a nonlinear mapping to transform the original data into a higher dimension, and within this new dimension it searches for the linear optimal separating hyperplane, that is, the decision boundary. With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. Originally, SVM mainly solves two-class classification problems; the multiclass problem can be addressed using multiple binary classifications.

Suppose that, in the feature space, a set of training vectors $x_i$ ($i = 1, 2, 3, \ldots$) belong to different classes $y_i \in \{-1, 1\}$. We wish to separate this training set with a hyperplane:

$$\omega \cdot x + b = 0. \quad (23)$$

Obviously, there are an infinite number of hyperplanes that are able to partition the data set into two sets. However, according to SVM theory, there is only one optimal hyperplane, which lies half-way within the maximal margin, defined by the sum of the distances of the hyperplane to the closest training sample of each class. As shown in Figure 2, the solid line represents the optimal separating hyperplane.

The problem of SVM is therefore to find the maximum-margin separating hyperplane, that is, the one with the maximum distance to the nearest training samples. In Figure 2, $H$ is the optimal hyperplane and $H_1$ and $H_2$ are the hyperplanes parallel to $H$ through the closest samples. The task of SVM is to maximize the margin between $H_1$ and $H_2$. The margin constraints can be written as follows:

$$\omega \cdot x_i + b \ge +1, \ y_i = +1; \qquad \omega \cdot x_i + b \le -1, \ y_i = -1 \;\Longrightarrow\; y_i(\omega \cdot x_i + b) - 1 \ge 0. \quad (24)$$

Therefore, from Figure 2, solving the SVM amounts to maximizing the margin between $H_1$ and $H_2$:

$$\min_{x_i \mid y_i = 1} \frac{\omega \cdot x_i + b}{\|\omega\|} - \max_{x_i \mid y_i = -1} \frac{\omega \cdot x_i + b}{\|\omega\|} = \frac{2}{\|\omega\|}. \quad (25)$$

Maximizing this margin is the same as minimizing $\|\omega\|^2 / 2$. Therefore, it becomes the following optimization problem:

$$\min \frac{\|\omega\|^2}{2} \quad \text{s.t. } y_i(\omega \cdot x_i + b) \ge 1. \quad (26)$$

The points that satisfy $y_i(\omega \cdot x_i + b) = 1$ are called support vectors; namely, the points on $H_1$ and $H_2$ in Figure 2 are all support vectors. In practice, not all samples can be classified correctly by the hyperplane. Hence, we introduce slack variables, and the above formula becomes

$$\min \frac{1}{2}\|\omega\|^2 + \eta \sum_{i} \xi_i \quad \text{s.t. } y_i(\omega \cdot x_i + b) \ge 1 - \xi_i, \quad (27)$$

where $\eta$ is a positive constant that balances the empirical error and the confidence.

where 120578 is a positive constant to balance the experienceand confidence and |119862| is the total number of classes in thetraining set

3.3.3. Performance Evaluation. The confusion matrix [27], also known as a contingency table or an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning algorithm. It is a table with two rows and two columns that reports the number of true positives, false positives, false negatives, and true negatives. Table 2 shows the confusion matrix.

Table 2: Confusion matrix.

                    Prediction as positive    Prediction as negative
Actual positive     TP                        FN
Actual negative     FP                        TN

True positive (TP) is the number of correct predictions that an instance is positive. False positive (FP) is the number of incorrect predictions that an instance is positive. False negative (FN) is the number of incorrect predictions that an instance is negative. True negative (TN) is the number of correct predictions that an instance is negative.

Precision is the proportion of the predicted positive cases that are correct, calculated as

$$\text{Precision}\ (P) = \frac{\text{TP}}{\text{TP} + \text{FP}}. \quad (28)$$

Recall is the proportion of the actual positive cases that are correctly predicted, calculated as

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}. \quad (29)$$

We would like both precision and recall to be high; however, the two are in tension. For example, if only one prediction is made and it is accurate, the precision is 100% but the recall is very low, because TP equals 1, FP equals 0, and FN is huge. On the other hand, if everything is predicted as positive (i.e., FN equals 0), the recall is 100% but the precision is low, because FP is large.

F1 [28] combines precision and recall (it is their harmonic mean). If F1 is high, the algorithm and the experimental method are ideal. F1 measures the performance of a classifier by considering the above two metrics together and was proposed by Van Rijsbergen [29] in 1979. It can be described as follows:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \quad (30)$$

Precision and recall only evaluate the performance of classification algorithms on a single category. It is often necessary to evaluate classification performance over all the categories. When dealing with multiple categories, there are two possible ways of averaging these measures, namely, macroaverage and microaverage. The macroaverage weights all classes equally, regardless of how many documents belong to each of them. The microaverage weights all documents equally, thus favoring the performance on common classes. The formulas of macroaverage and microaverage are as follows, where $|C|$ is the number of classes in the training set:

$$\text{Macro}_p = \frac{\sum_{i=1}^{|C|} \text{Precision}_i}{|C|}, \qquad \text{Macro}_r = \frac{\sum_{i=1}^{|C|} \text{Recall}_i}{|C|}, \qquad \text{Macro}_{F1} = \frac{\sum_{i=1}^{|C|} F1_i}{|C|},$$

$$\text{Micro}_p = \frac{\sum_{i=1}^{|C|} \text{TP}_i}{\sum_{i=1}^{|C|} (\text{TP}_i + \text{FP}_i)}, \qquad \text{Micro}_r = \frac{\sum_{i=1}^{|C|} \text{TP}_i}{\sum_{i=1}^{|C|} (\text{TP}_i + \text{FN}_i)}, \qquad \text{Micro}_{F1} = \frac{2 \times \text{Micro}_p \times \text{Micro}_r}{\text{Micro}_p + \text{Micro}_r}. \quad (31)$$
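The evaluation measures of formulas (28)-(31) can be computed from per-class TP/FP/FN counts as in the sketch below; the counts are invented for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Formulas (28)-(30) for a single class."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_micro(per_class):
    """Formula (31). per_class maps class -> (TP, FP, FN)."""
    prs = [precision_recall_f1(*counts) for counts in per_class.values()]
    macro_p = sum(p for p, _, _ in prs) / len(prs)
    macro_r = sum(r for _, r, _ in prs) / len(prs)
    macro_f1 = sum(f for _, _, f in prs) / len(prs)
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    micro_p, micro_r, micro_f1 = precision_recall_f1(tp, fp, fn)
    return (macro_p, macro_r, macro_f1), (micro_p, micro_r, micro_f1)

counts = {"sports": (80, 10, 20), "finance": (45, 15, 5)}
print(macro_micro(counts))
```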

3.3.4. Experimental Results and Analysis. Figures 3 and 4 show the results on the English data set. The performance of the classical GI is the lowest, and that of the IGI is the highest. Therefore, the IGI algorithm helps to improve the classifiers' performance.

[Figure 3: Macro F1 results on the Reuters-21578 data set, comparing GI, IGI, IG, ECE, CHI, OR, and MI with the kNN, fkNN, and SVM classifiers.]

[Figure 4: Micro F1 results on the Reuters-21578 data set, comparing GI, IGI, IG, ECE, CHI, OR, and MI with the kNN, fkNN, and SVM classifiers.]

From Figures 5 and 6 it can be seen that, on the Chinese data set, the classical GI still has poor performance. The best performance belongs to ECE, and the second best is IG; the performance of IGI is not the best. Since the Chinese language differs from English in the processing of Stop Words, the results of the Chinese word segmentation algorithm influence the classification accuracy. Meanwhile, the current experiments do not consider the feature words' weights; the TF-Gini algorithm described above offers a good solution to this problem.

Although the IGI algorithm does not achieve the best performance on the Chinese data set, its performance is very close to the highest. Hence, the IGI feature selection algorithm is suitable and acceptable for classification, and it promotes the classifiers' performance effectively.

[Figure 5: Macro F1 results on the Chinese data set, comparing GI, IGI, IG, ECE, CHI, OR, and MI with the kNN, fkNN, and SVM classifiers.]

[Figure 6: Micro F1 results on the Chinese data set, comparing GI, IGI, IG, ECE, CHI, OR, and MI with the kNN, fkNN, and SVM classifiers.]

Figures 7, 8, 9, and 10 show the feature weighting results on the English and Chinese data sets. To determine which algorithm performs best, we compare different TF-based weighting algorithms, such as TF-IDF and TF-Gini.

From Figures 7 and 8 we can see that the performance of the TF-Gini weighting algorithm on the English data set is good and acceptable; it is very close to the highest. From Figures 9 and 10 we can see that, on the Chinese data set, the TF-Gini algorithm overcomes the deficiencies of the IGI algorithm and achieves the best performance. These results show that the TF-Gini algorithm is a very promising feature weighting algorithm.

[Figure 7: Feature weights Macro F1 results on the Reuters-21578 data set, comparing TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR with the kNN, fkNN, and SVM classifiers.]

[Figure 8: Feature weights Micro F1 results on the Reuters-21578 data set, comparing TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR with the kNN, fkNN, and SVM classifiers.]

[Figure 9: Feature weights Macro F1 results on the Chinese data set, comparing TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR with the kNN, fkNN, and SVM classifiers.]

[Figure 10: Feature weights Micro F1 results on the Chinese data set, comparing TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR with the kNN, fkNN, and SVM classifiers.]

4. Sensitive Information Identification System Based on the TF-Gini Algorithm

4.1. Introduction. With the rapid development of network technologies, the Internet has become a huge repository of information. At the same time, much garbage information is poured into the Internet, such as viruses, violence, and gambling. All of this is called sensitive information, and it produces extremely adverse impacts and serious consequences.

The commonly used methods of Internet information filtering and monitoring are based on string matching. However, there are a large number of synonyms and ambiguous words in natural texts, and many new words and expressions appear on the Internet every day. Therefore, traditional sensitive information filtering and monitoring are not effective. In this paper, we propose a new sensitive information filtering system based on machine learning, which uses the TF-Gini text feature selection and weighting algorithm. This system can discover sensitive information in a fast, accurate, and intelligent way.

4.2. System Architecture. The sensitive system includes the text preprocessing, feature selection, text representation, text classification, and results evaluation modules. Figure 11 shows the system architecture.

[Figure 11: Sensitive system architecture. The training branch selects a training set, performs Chinese word segmentation, builds the VSM, applies feature selection and weighting (the TF-Gini algorithm), and trains the classification model; the recognition branch preprocesses new texts in the same way, loads the model, decides whether a text is sensitive or nonsensitive, and evaluates the results.]

4.2.1. The Chinese Word Segmentation. The function of this module is to segment Chinese words and remove Stop Words. In this module we use IKAnalyzer, a lightweight open source Chinese word segmentation tool, to complete Chinese word segmentation:

(1) IKAnalyzer uses a multiprocessor module and supports the processing of English letters, Chinese vocabulary, and individual characters.

(2) It optimizes the dictionary storage and supports user-defined dictionaries.

4.2.2. The Feature Selection. At the feature selection stage, the TF-Gini algorithm has good performance on Chinese data sets. Feature word selection is a very important preparation for classification; in this step we use the TF-Gini algorithm as the core feature selection algorithm.

4.2.3. The Text Classification. We use the Naïve Bayes algorithm [30] as the main classifier. In order to verify the performance of the Naïve Bayes algorithm adopted in this paper, we compare it with the kNN algorithm and the improved Naïve Bayes [31] algorithm.

The kNN algorithm is a typical lazy text classification algorithm that needs no training phase. To calculate the similarity between the test sample $q$ and the training samples $d_j$, we adopt the cosine similarity:

$$\cos(q, d_j) = \frac{\sum_{i=1}^{|V|} w_{ij} \times w_{iq}}{\sqrt{\sum_{i=1}^{|V|} w_{ij}^2} \times \sqrt{\sum_{i=1}^{|V|} w_{iq}^2}}, \quad (32)$$

where $w_{ij}$ is the weight of the $i$th feature word in $d_j$, $w_{iq}$ is the weight of the $i$th feature word in sample $q$, and $|V|$ is the size of the feature space.
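A minimal sketch of formula (32) on sparse weight vectors (dictionaries mapping feature words to weights); the vectors shown are illustrative:

```python
import math

def cosine_similarity(q, d):
    """Formula (32) on sparse weight vectors stored as dicts (term -> weight)."""
    shared = set(q) & set(d)
    dot = sum(q[t] * d[t] for t in shared)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

q  = {"storm": 2.0, "rain": 1.5}
dj = {"storm": 1.0, "wind": 0.5}
print(round(cosine_similarity(q, dj), 3))
```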

The Naïve Bayes classifier is based on Bayes' theorem. Assume $m$ classes $c_1, c_2, \ldots, c_m$ and a given unknown text $q$ with no class label. The classification algorithm predicts that $q$ belongs to the class with the highest posterior probability given $q$:

$$P(c_i \mid q) = \frac{P(q \mid c_i)\, P(c_i)}{P(q)}. \quad (33)$$

Naïve Bayes classification is based on a simple assumption: given the sample's class label, the attributes are conditionally independent. The probability $P(q)$ is constant, and $P(c_i)$ is also treated as constant, so the calculation of $P(c_i \mid q)$ can be simplified as follows:

$$P(c_i \mid q) = \prod_{j=1}^{n} P(x_j \mid c_i), \quad (34)$$

where $x_j$ is the $j$th feature word's weight of text $q$ in class $c_i$ and $n$ is the number of features in $q$.
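In the spirit of formula (34), the following sketch scores a new text against each class by multiplying the estimated conditional probabilities (a log-sum is used to avoid underflow); the probability table is an invented illustration, not the system's actual model:

```python
import math

def naive_bayes_class(q_features, cond_prob, classes, eps=1e-9):
    """Simplified scoring following formula (34): score(c) is the product of
    P(x_j | c) over the features of q. cond_prob[c][x] is the estimated
    P(x | c); unseen features fall back to a small eps."""
    best, best_score = None, float("-inf")
    for c in classes:
        score = sum(math.log(cond_prob[c].get(x, eps)) for x in q_features)
        if score > best_score:
            best, best_score = c, score
    return best

cond_prob = {
    "sensitive":    {"gamble": 0.08, "bet": 0.06, "news": 0.01},
    "nonsensitive": {"gamble": 0.002, "bet": 0.001, "news": 0.05},
}
print(naive_bayes_class(["gamble", "bet"], cond_prob, ["sensitive", "nonsensitive"]))
```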

4.3. Experimental Results and Analysis. In order to verify the performance of the system, we use two experiments to evaluate the sensitive information system, with tenfold and threefold cross validation on different data sets.

Experiment 1. The Chinese data set has 1000 articles, including 500 sensitive texts and 500 nonsensitive texts. The 1000 texts are randomly divided into 10 portions; one portion is used as the test set and the other 9 are used as the training set. The process is repeated 10 times, and the final result is the average of the 10 classification results. Table 3 shows the result of Experiment 1.

Table 3: The result of Experiment 1.

Classifier               Precision (%)   Recall (%)   F1 (%)
kNN                      69.2            72           70.6
Naïve Bayes              80.77           84           82.35
Improved Naïve Bayes     71.88           92           80.7

Experiment 2. The Chinese data set has 1500 articles, including 750 sensitive texts and 750 nonsensitive texts. All the texts are randomly divided into 3 parts; one part is used as the test set and the others as the training set. The final result is the average of the 3 classification results. Table 4 shows the result of Experiment 2.

Table 4: The result of Experiment 2.

Classifier               Precision (%)   Recall (%)   F1 (%)
kNN                      96.76           55.41        70.44
Naïve Bayes              96.6            96.6         96.6
Improved Naïve Bayes     83.61           100          91.08

From the results of Experiments 1 and 2, we can see that all three classifiers perform well, with the Naïve Bayes and Improved Naïve Bayes algorithms achieving the better performance. That is, the sensitive information system can satisfy the demands of Internet sensitive information recognition.

5. Conclusion

Feature selection is an important aspect of text preprocessing, and its performance directly affects the classifiers' accuracy. In this paper, we employ a feature selection algorithm, the IGI algorithm, and compare its performance with the IG, CHI, MI, and OR algorithms. The experimental results show that the IGI algorithm has the best performance on the English data sets and is very close to the best performance on the Chinese data sets. Therefore, the IGI algorithm is a promising algorithm for text preprocessing.

Feature weighting is another important aspect of text preprocessing, since the importance of a feature differs across categories. We endow different feature words with different weights: the more important a feature word is, the bigger its weight. Following the TF-IDF theory, we construct a feature weighting algorithm, the TF-Gini algorithm, by replacing the IDF part of the TF-IDF algorithm with the IGI algorithm. To test the performance of TF-Gini, we also construct the TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR algorithms in the same way. The experimental results show that TF-Gini has the best performance on the Chinese data sets and is very close to the best performance on the English data sets. Hence, the TF-Gini algorithm can improve the classifiers' performance, especially on Chinese data sets.

This paper also introduces a sensitive information identification system, which can monitor and filter sensitive information on the Internet. Considering the performance of TF-Gini on Chinese data sets, we choose this algorithm as the text preprocessing algorithm in the system. The core classifier we choose is the Naïve Bayes classifier. The experimental results show that the system achieves good performance.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper is partly supported by the engineering planning project "Research on Technology of Big Data-Based High Efficient Classification of News" funded by Communication University of China (3132015XNG1504), partly supported by "The Comprehensive Reform Project of Computer Science and Technology" funded by the Ministry of Education of China (ZL140103 and ZL1503), and partly supported by "The Excellent Young Teachers Training Project" funded by Communication University of China (YXJS201508 and XNG1526).

References

[1] S. Jinshu, Z. Bofeng, and X. Xin, "Advances in machine learning based text categorization," Journal of Software, vol. 17, no. 9, pp. 1848–1859, 2006.

[2] G. Salton and C. Yang, "On the specification of term values in automatic indexing," Journal of Documentation, vol. 29, no. 4, pp. 351–372, 1973.

[3] Z. Cheng, L. Qing, and L. Fujun, "Improved VSM algorithm and its application in FAQ," Computer Engineering, vol. 38, no. 17, pp. 201–204, 2012.

[4] X. Junling, Z. Yuming, C. Lin, and X. Baowen, "An unsupervised feature selection approach based on mutual information," Journal of Computer Research and Development, vol. 49, no. 2, pp. 372–382, 2012.

[5] Z. Zhenhai, L. Shining, and L. Zhigang, "Multi-label feature selection algorithm based on information entropy," Journal of Computer Research and Development, vol. 50, no. 6, pp. 1177–1184, 2013.

[6] L. Kousu and S. Caiqing, "Research on feature-selection in Chinese text classification," Computer Simulation, vol. 24, no. 3, pp. 289–291, 2007.

[7] Y. Yang and J. Q. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning, pp. 412–420, Nashville, Tenn, USA, July 1997.

[8] Q. Liqing, Z. Ruyi, Z. Gang, et al., "An extensive empirical study of feature selection for text categorization," in Proceedings of the 7th IEEE/ACIS International Conference on Computer and Information Science, pp. 312–315, IEEE, Washington, DC, USA, May 2008.

[9] S. Wenqian, D. Hongbin, Z. Haibin, and W. Yongbin, "A novel feature weight algorithm for text categorization," in Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE '08), pp. 1–7, Beijing, China, October 2008.

[10] L. Yongmin and Z. Weidong, "Using Gini-index for feature selection in text categorization," Computer Applications, vol. 27, no. 10, pp. 66–69, 2007.

[11] L. Xuejun, "Design and research of decision tree classification algorithm based on minimum Gini index," Software Guide, vol. 8, no. 5, pp. 56–57, 2009.

[12] R. Guofeng, L. Dehua, and P. Ying, "An improved algorithm for feature weight based on Gini index," Computer & Digital Engineering, vol. 38, no. 12, pp. 8–13, 2010.

[13] L. Yongmin, L. Zhengyu, and Z. Shuang, "Analysis and improvement of feature weighting method TF-IDF in text categorization," Computer Engineering and Design, vol. 29, no. 11, pp. 2923–2925, 2929, 2008.

[14] Q. Shian and L. Fayun, "Improved TF-IDF method in text classification," New Technology of Library and Information Service, vol. 238, no. 10, pp. 27–30, 2013.

[15] L. Yonghe and L. Yanfeng, "Improvement of text feature weighting method based on TF-IDF algorithm," Library and Information Service, vol. 57, no. 3, pp. 90–95, 2013.

[16] Y. Xu, Z. Li, and J. Chen, "Parallel recognition of illegal Web pages based on improved KNN classification algorithm," Journal of Computer Applications, vol. 33, no. 12, pp. 3368–3371, 2013.

[17] Y. Liu, Y. Jian, and J. Liping, "An adaptive large margin nearest neighbor classification algorithm," Journal of Computer Research and Development, vol. 50, no. 11, pp. 2269–2277, 2013.

[18] M. Wan, G. Yang, Z. Lai, and Z. Jin, "Local discriminant embedding with applications to face recognition," IET Computer Vision, vol. 5, no. 5, pp. 301–308, 2011.

[19] U. Maulik and D. Chakraborty, "Fuzzy preference based feature selection and semi-supervised SVM for cancer classification," IEEE Transactions on Nanobioscience, vol. 13, no. 2, pp. 152–160, 2014.

[20] X. Liang, "An effective method of pruning support vector machine classifiers," IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 26–38, 2010.

[21] P. Baomao and S. Haoshan, "Research on improved algorithm for Chinese word segmentation based on Markov chain," in Proceedings of the 5th International Conference on Information Assurance and Security (IAS '09), pp. 236–238, Xi'an, China, September 2009.

[22] S. Yan, C. Dongfeng, Z. Guiping, and Z. Hai, "Approach to Chinese word segmentation based on character-word joint decoding," Journal of Software, vol. 20, no. 9, pp. 2366–2375, 2009.

[23] J. Savoy, "A stemming procedure and stopword list for general French corpora," Journal of the American Society for Information Science, vol. 50, no. 10, pp. 944–952, 1999.

[24] L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski, "Decision trees for mining data streams based on the McDiarmid's bound," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1272–1279, 2013.

[25] N. Prasad, P. Kumar, and M. M. Naidu, "An approach to prediction of precipitation using Gini index in SLIQ decision tree," in Proceedings of the 4th International Conference on Intelligent Systems, Modelling and Simulation (ISMS '13), pp. 56–60, Bangkok, Thailand, January 2013.

[26] S. S. Sundhari, "A knowledge discovery using decision tree by Gini coefficient," in Proceedings of the International Conference on Business, Engineering and Industrial Applications (ICBEIA '11), pp. 232–235, IEEE, Kuala Lumpur, Malaysia, June 2011.

[27] S. V. Stehman, "Selecting and interpreting measures of thematic classification accuracy," Remote Sensing of Environment, vol. 62, no. 1, pp. 77–89, 1997.

[28] F. Guohe, "Review of performance of text classification," Journal of Intelligence, vol. 30, no. 8, pp. 66–69, 2011.

[29] C. J. Van Rijsbergen, Information Retrieval, Butterworths, London, UK, 1979.

[30] L. Jianghao, Y. Aimin, Y. Yongmei, et al., "Classification of microblog sentiment based on Naïve Bayesian," Computer Engineering & Science, vol. 34, no. 9, pp. 160–165, 2012.

[31] Z. Kenan, Y. Baolin, M. Yaming, and H. Yingnan, "Malware classification approach based on valid window and Naïve Bayes," Journal of Computer Research and Development, vol. 51, no. 2, pp. 373–381, 2014.


decision tree algorithms [11]. That is to say, the Gini Index can identify the importance of features.

On the other hand, feature weighting (FW) is another important aspect of text preprocessing. TF-IDF [12] is a popular FW algorithm; however, it is not well suited to text FW. The basic idea of TF-IDF is that a feature term is more important if it has a higher frequency in a text, known as Term Frequency (TF), and less important if it appears in many different documents of the training set, known as Inverse Document Frequency (IDF). In TC, feature frequency is one of the most important aspects, regardless of whether the feature appears in one text or in many. Many researchers have therefore improved the TF-IDF algorithm by replacing the IDF part with FS algorithms, such as TF-IG and TF-CHI [13-15]. This paper mainly focuses on an Improved Gini Index (IGI) algorithm and its application, and it proposes an FW algorithm that uses the Improved Gini Index instead of the IDF part of TF-IDF, known as the TF-Gini algorithm. The experimental results show that TF-Gini is a promising algorithm. To test the performance of the Improved Gini Index and TF-Gini algorithms, we introduce kNN [16, 17], fkNN [18], and SVM [19, 20] as the standard TC algorithms.

As an application of the TF-Gini algorithm, this paper designs and implements a sensitive information monitoring system. Sensitive information on the Internet is a phenomenon that information experts are eager to investigate. Since the Internet is developing very fast, many new words and phrases emerge on it every day, which poses challenges to sensitive information monitoring. At present, most network information monitoring and filtering algorithms are inflexible and inaccurate. Building on the research results above, we use TF-Gini as the core algorithm and design and implement a system for sensitive information monitoring. This system can update sensitive words and phrases in time and has better performance in monitoring Internet information.

2. Text Classification

2.1. Text Classification Process. The text classification process is shown in Figure 1. The left part is the training process and the right part is the classification process. The function of each part can be described as follows.

2.1.1. The Training Process. The purpose of the training process is to build a classification model; new samples are assigned to an appropriate category by using this model.

(1) Words Segmentation. In English and other Western languages there is a space delimiter between words; in Oriental languages there is no such delimiter, so Chinese word segmentation is the basis of all Chinese text processing. There are many Chinese word segmentation algorithms [21, 22], most of which use machine learning. ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), designed by the Institute of Computing Technology, Chinese Academy of Sciences, is among the best Chinese lexical analyzers, and it is the one we use for Chinese word segmentation in this paper.

[Figure 1: Text classification process. The left branch is the training pipeline (words segmentation for Chinese, Stop Words or stemming, VSM, feature selection, classification, classification model); the right branch is the classification pipeline (the same preprocessing, classification of new samples with the model, and results evaluation).]

(2) Stop Words or Stemming. A text document always contains hundreds or thousands of words, many of which appear with very high frequency but are useless, such as "the", "a", and other function words and adverbs. These words are called Stop Words [23]: language-specific functional words that are frequent but carry no information. The first step of text processing is to remove these Stop Words.

Stemming techniques are used to find the root or stem of a word. Stemming converts words to their stems, which incorporates a great deal of language-dependent linguistic knowledge. The hypothesis behind stemming is that words with the same stem or root mostly describe the same or closely related concepts, so such words can be conflated by using their stems. For example, the words "user", "users", "used", and "using" can all be stemmed to the word "USE". In Chinese text classification we only need Stop Words, because there is no concept of stemming in Chinese.

(3) VSM (Vector Space Model). VSM is very popular in natural language processing, especially in computing the similarity between documents. VSM maps a document to a vector. For example, if we view each feature word as a dimension and the word frequency as its value, a document can be represented as an $n$-dimensional vector, that is, $T = \{\langle t_1, w_1\rangle, \langle t_2, w_2\rangle, \ldots, \langle t_n, w_n\rangle\}$, where $t_i$ ($1 \le i \le n$) is the $i$th feature and $w_i$ ($1 \le i \le n$) is the feature's frequency, which is also called its weight. Table 1 is an example of a VSM of three documents with ten feature words: doc1, doc2, and doc3 are documents, and $t_1, t_2, \ldots, t_{10}$ are feature words.

Table 1: Three documents represented in the VSM.
        t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
doc1     1    2    -    5    -    7    -    9    -    -
doc2     -    3    -    4    -    6    8    -    -    -
doc3    10    -   11    -   12    -    -   13   14   15

Therefore, we can represent doc1, doc2, and doc3 in the VSM as $T_1 = \{\langle t_1, 1\rangle, \langle t_2, 2\rangle, \langle t_4, 5\rangle, \langle t_6, 7\rangle, \langle t_8, 9\rangle\}$, $T_2 = \{\langle t_2, 3\rangle, \langle t_4, 4\rangle, \langle t_6, 6\rangle, \langle t_7, 8\rangle\}$, and $T_3 = \{\langle t_1, 10\rangle, \langle t_3, 11\rangle, \langle t_5, 12\rangle, \langle t_8, 13\rangle, \langle t_9, 14\rangle, \langle t_{10}, 15\rangle\}$.
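As a concrete illustration of the mapping from documents to VSM vectors, the following minimal Python sketch builds term-frequency vectors in the spirit of Table 1. The tokenized toy corpus and the helper names (`build_vocabulary`, `to_vsm_vector`) are our own illustrative assumptions, not part of the paper's implementation.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Collect the feature words t_1 ... t_n over the whole training set."""
    vocab = sorted({token for doc in tokenized_docs for token in doc})
    return {term: idx for idx, term in enumerate(vocab)}

def to_vsm_vector(tokens, vocab):
    """Map one document to an n-dimensional term-frequency vector <t_i, w_i>."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in sorted(vocab, key=vocab.get)]

# Toy corpus: three already-segmented documents (Stop Words removed).
docs = [["price", "rise", "price", "market"],
        ["market", "fall", "trade"],
        ["goal", "match", "team", "goal"]]

vocab = build_vocabulary(docs)
for doc_id, doc in enumerate(docs, start=1):
    print(f"doc{doc_id}:", to_vsm_vector(doc, vocab))
```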

(4) Feature Words Selection. After Stop Words removal and stemming there are still tens of thousands of words in a training set, and it is impossible for a classifier to handle so many. The purpose of feature word selection is to reduce the dimension of the feature words. It is a very important step for text classification, since choosing appropriate representative words can improve the classifiers' performance. This paper mainly focuses on this stage.

(5) Classification and Building the Classification Model. After the above steps, all the texts of the training set have been mapped to the VSM. This step applies the classification algorithms to the VSM and builds the classification model, which can assign an appropriate category to new input samples.

2.1.2. The Classification Process. The steps of word segmentation, Stop Words or stemming, and VSM are the same as in the training process. The classification process assigns an appropriate category to each new sample: when a new sample arrives, the classification model calculates its most likely class label and assigns it to the sample. The evaluation of the results is a very important step of the text classification process, since it indicates whether the algorithm is good or not.

2.2. Feature Selection. Although the Stop Words and stemming steps get rid of some useless words, the VSM still contains many insignificant words, which may even have a negative effect on classification. It is difficult for many classification algorithms to handle high-dimensional data sets; hence we need to reduce the dimensionality of the feature space to improve the classifiers' performance. FS uses machine learning algorithms to choose appropriate words that can represent the original text, which reduces the dimensions of the feature space.

The definition of FS is as follows: select $m$ features from the original $n$ features ($m \le n$) such that the $m$ features represent the content of the text more concisely and more efficiently. The commonly used FS algorithms can be described as follows.

2.2.1. Information Gain (IG). IG is widely used in the machine learning field as a criterion of the importance of a feature to a class. The value of IG equals the difference of the information entropy before and after the appearance of the feature in the class. The greater the IG, the more information the feature contains and the more important it is in TC; therefore, we can filter the best feature subset according to the features' IG. The computing formula can be described as follows:

$$\mathrm{InfGain}(w) = -\sum_{i=1}^{|C|} P(c_i)\log P(c_i) + P(w)\sum_{i=1}^{|C|} P(w \mid c_i)\log P(w \mid c_i) + P(\bar{w})\sum_{i=1}^{|C|} P(\bar{w} \mid c_i)\log P(\bar{w} \mid c_i), \quad (1)$$

where $w$ is the feature word, $P(w)$ is the probability that a text containing $w$ appears in the training set, $P(\bar{w})$ is the probability that a text not containing $w$ appears in the training set, $P(c_i)$ is the probability of class $c_i$ in the training set, $P(w \mid c_i)$ is the probability that a text in class $c_i$ contains $w$, $P(\bar{w} \mid c_i)$ is the probability that a text in class $c_i$ does not contain $w$, and $|C|$ is the total number of classes in the training set.

The shortcoming of the IG algorithm is that it also takes into account features that do not appear, although in some circumstances the nonappearing features contribute little or nothing to the classification compared with the features that do appear in the training set. Moreover, if the training set is unbalanced and $P(\bar{w}) \gg P(w)$, the value of IG is dominated by $P(\bar{w})\sum_{i=1}^{|C|} P(\bar{w} \mid c_i)\log P(\bar{w} \mid c_i)$, and the performance of IG can be significantly reduced.
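To make formula (1) concrete, here is a small, self-contained Python sketch that estimates all probabilities by simple document counting. The function name `information_gain` and the toy corpus are illustrative assumptions rather than the authors' code.

```python
import math
from collections import Counter, defaultdict

def information_gain(docs, labels, word):
    """InfGain(w) of formula (1), with probabilities estimated from document counts."""
    n = len(docs)
    class_counts = Counter(labels)            # documents per class c_i
    with_w = defaultdict(int)                 # documents of class c_i that contain w
    for tokens, label in zip(docs, labels):
        if word in tokens:
            with_w[label] += 1

    p_w = sum(with_w.values()) / n            # P(w)
    p_not_w = 1.0 - p_w                       # P(w-bar)

    def xlogx(p):
        return p * math.log(p) if p > 0 else 0.0

    gain = -sum(xlogx(class_counts[c] / n) for c in class_counts)   # -sum P(c_i) log P(c_i)
    for c in class_counts:
        p_w_given_c = with_w[c] / class_counts[c]                   # P(w | c_i)
        gain += p_w * xlogx(p_w_given_c)
        gain += p_not_w * xlogx(1.0 - p_w_given_c)                  # P(w-bar | c_i)
    return gain

docs = [["cheap", "loan", "offer"], ["loan", "rate"], ["match", "goal"], ["team", "goal", "win"]]
labels = ["finance", "finance", "sports", "sports"]
print(information_gain(docs, labels, "loan"))
```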

2.2.2. Expected Cross Entropy (ECE). ECE is well known as the KL distance; it reflects the distance between the probability distribution of the theme classes and the probability distribution of the theme classes under the condition of a specific feature. The computing formula can be described as follows:

$$\mathrm{CrossEntropyText}(w) = P(w)\sum_{i=1}^{|C|} P(w \mid c_i)\log\frac{P(w \mid c_i)}{P(c_i)}, \quad (2)$$

where $w$ is the feature word, $P(w)$ is the probability that a text containing $w$ appears in the training set, $P(c_i)$ is the probability of class $c_i$ in the training set, $P(w \mid c_i)$ is the probability that a text in class $c_i$ contains $w$, and $|C|$ is the total number of classes in the training set.

ECE has been widely used in feature selection for TC, where it achieves good performance. Compared with the IG algorithm, ECE does not consider the case in which the feature does not appear, which reduces the interference of rare features and improves the performance of classification. However, it also has some limitations:

(1) If $P(w \mid c_i) > P(c_i)$, then $\log(P(w \mid c_i)/P(c_i)) > 0$. When $P(w \mid c_i)$ is larger and $P(c_i)$ is smaller, the logarithmic value is larger, which indicates that feature $w$ has a strong association with class $c_i$.

(2) If $P(w \mid c_i) < P(c_i)$, then $\log(P(w \mid c_i)/P(c_i)) < 0$. When $P(w \mid c_i)$ is smaller and $P(c_i)$ is larger, the logarithmic value is smaller, which indicates that feature $w$ has a weaker association with class $c_i$.

(3) When $P(w \mid c_i) = 0$, there is no association between feature $w$ and class $c_i$, and the logarithmic expression is undefined. Introducing a small parameter, for example, $\varepsilon = 0.0001$, that is, writing $\log((P(w \mid c_i) + \varepsilon)/P(c_i))$, makes the logarithm well defined.

(4) If $P(w \mid c_i)$ and $P(c_i)$ are close to each other, then $\log(P(w \mid c_i)/P(c_i))$ is close to zero. When $P(w \mid c_i)$ is large, this indicates that feature $w$ has a strong association with class $c_i$ and should be retained; however, when $\log(P(w \mid c_i)/P(c_i))$ is close to zero, feature $w$ will be removed. In this case we add information entropy to the ECE algorithm. The formula of information entropy is as follows:

$$\mathrm{IE}(w) = -\sum_{i=1}^{|C|} P(w \mid c_i)\log P(w \mid c_i). \quad (3)$$

In summary, combining (2) and (3), the ECE formula becomes

$$\mathrm{ECE}(w) = \frac{P(w)}{\mathrm{IE}(w) + \varepsilon}\sum_{i=1}^{|C|}\bigl(P(w \mid c_i) + \varepsilon\bigr)\log\frac{P(w \mid c_i) + \varepsilon}{P(c_i)}. \quad (4)$$

If feature $w$ exists in only one class, the value of the information entropy is zero, that is, $\mathrm{IE}(w) = 0$; hence we introduce the small parameter $\varepsilon$ in the denominator as a regulator.
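The regulated ECE score of formula (4) can be computed in the same counting style as above. This is a minimal sketch under the assumption that the probabilities are estimated by document frequencies, with the small constant playing the role of the regulator $\varepsilon$; the function name and toy data are ours.

```python
import math
from collections import Counter, defaultdict

EPS = 1e-4  # the small regulator epsilon of formula (4)

def expected_cross_entropy(docs, labels, word, eps=EPS):
    """ECE(w) of formula (4): P(w)/(IE(w)+eps) * sum_i (P(w|c_i)+eps) log((P(w|c_i)+eps)/P(c_i))."""
    n = len(docs)
    class_counts = Counter(labels)
    with_w = defaultdict(int)
    for tokens, label in zip(docs, labels):
        if word in tokens:
            with_w[label] += 1

    p_w = sum(with_w.values()) / n
    p_w_given_c = {c: with_w[c] / class_counts[c] for c in class_counts}

    # IE(w) = -sum_i P(w|c_i) log P(w|c_i), formula (3)
    ie = -sum(p * math.log(p) for p in p_w_given_c.values() if p > 0)

    score = 0.0
    for c, p in p_w_given_c.items():
        p_c = class_counts[c] / n
        score += (p + eps) * math.log((p + eps) / p_c)
    return p_w / (ie + eps) * score

docs = [["cheap", "loan"], ["loan", "rate"], ["match", "goal"], ["team", "win"]]
labels = ["finance", "finance", "sports", "sports"]
print(expected_cross_entropy(docs, labels, "loan"))
```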

2.2.3. Mutual Information (MI). In TC, MI expresses the association between feature words and classes. The MI between $w$ and $c_i$ is defined as

$$\mathrm{MutualInfo}(w, c_i) = \log\frac{P(w \mid c_i)}{P(w)} = \log\frac{P(w, c_i)}{P(w)\,P(c_i)}, \quad (5)$$

where $w$ is the feature word, $P(w, c_i)$ is the probability of text that contains $w$ in class $c_i$, $P(w)$ is the probability of text that contains $w$ in the training set, and $P(c_i)$ is the probability of class $c_i$ in the training set.

It is necessary to calculate the MI between a feature and each category of the training set. The following two formulas can be used to combine these values:

$$\mathrm{MI}_{\max}(w) = \max_{1 \le i \le |C|}\mathrm{MutualInfo}(w, c_i), \qquad \mathrm{MI}_{\mathrm{avg}}(w) = \sum_{i=1}^{|C|} P(c_i)\,\mathrm{MutualInfo}(w, c_i), \quad (6)$$

where $P(c_i)$ is the probability of class $c_i$ in the training set and $|C|$ is the total number of classes in the training set. Generally, $\mathrm{MI}_{\max}(w)$ is the more commonly used.

2.2.4. Odds Ratio (OR). The computing formula can be described as follows:

$$\mathrm{OddsRatio}(w, c_i) = \log\frac{P(w \mid c_i)\,\bigl(1 - P(w \mid \bar{c}_i)\bigr)}{\bigl(1 - P(w \mid c_i)\bigr)\,P(w \mid \bar{c}_i)}, \quad (7)$$

where $w$ is the feature word, $c_i$ represents the positive class, $\bar{c}_i$ represents the negative class, $P(w \mid c_i)$ is the probability of text that contains $w$ in class $c_i$, and $P(w \mid \bar{c}_i)$ is the conditional probability of text that contains $w$ in all the classes except $c_i$.

The OR metric measures membership and nonmembership of a specific class with its numerator and denominator, respectively; therefore, the numerator must be maximized and the denominator minimized to get the highest score. The formula is a one-sided metric, because the logarithm produces negative scores when the value of the fraction is between 0 and 1; in that case the negative values point to negative features. Thus, if we only want to identify the positive class and do not care about the negative class, Odds Ratio has a great advantage. Odds Ratio is suitable for binary classifiers.

2.2.5. $\chi^2$ (CHI). The $\chi^2$ statistic measures the relevance between the feature $w$ and the class $c_i$. The higher the score, the stronger the relevance between the feature $w$ and the class $c_i$, which means that the feature makes a greater contribution to the class. The computing formula can be described as follows:

$$\chi^2(w, c_i) = \frac{|C|\,(A_1 A_4 - A_2 A_3)^2}{(A_1 + A_3)(A_2 + A_4)(A_1 + A_2)(A_3 + A_4)}, \quad (8)$$

where $w$ is the feature word, $A_1$ is the number of documents containing feature $w$ and belonging to class $c_i$, $A_2$ is the number of documents containing feature $w$ but not belonging to class $c_i$, $A_3$ is the number of documents belonging to class $c_i$ but not containing feature $w$, $A_4$ is the number of documents neither containing feature $w$ nor belonging to class $c_i$, and $|C|$ here denotes the total number of text documents in the training set.

When the feature $w$ and the class $c_i$ are independent, $\chi^2(w, c_i) = 0$, which means that the feature $w$ carries no identification information. We calculate the value for each category and then use formula (9) to compute the value over the entire training set:

$$\chi^2_{\max}(w) = \max_{1 \le i \le |C|}\chi^2(w, c_i). \quad (9)$$
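A hedged sketch of the $\chi^2$ score follows: the counts A1..A4 are taken directly from the contingency definitions above, the leading factor is read as the total number of documents, and the function names and toy data are our own.

```python
def chi_square(docs, labels, word, target_class):
    """chi^2(w, c_i) of formula (8), computed from the A1..A4 contingency counts."""
    a1 = a2 = a3 = a4 = 0
    for tokens, label in zip(docs, labels):
        has_w = word in tokens
        in_c = label == target_class
        if has_w and in_c:        a1 += 1   # contains w, in c_i
        elif has_w and not in_c:  a2 += 1   # contains w, not in c_i
        elif not has_w and in_c:  a3 += 1   # lacks w, in c_i
        else:                     a4 += 1   # lacks w, not in c_i
    n = a1 + a2 + a3 + a4                   # total number of documents (the leading factor in (8))
    denom = (a1 + a3) * (a2 + a4) * (a1 + a2) * (a3 + a4)
    return n * (a1 * a4 - a2 * a3) ** 2 / denom if denom else 0.0

def chi_square_max(docs, labels, word):
    """chi^2_max(w) of formula (9): the maximum score over all classes."""
    return max(chi_square(docs, labels, word, c) for c in set(labels))

docs = [["cheap", "loan"], ["loan", "rate"], ["match", "goal"], ["team", "win"]]
labels = ["finance", "finance", "sports", "sports"]
print(chi_square_max(docs, labels, "loan"))
```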

3. The Gini Feature Selection Algorithm

3.1. The Improved Gini Index Algorithm

3.1.1. Gini Index (GI). The GI is an impurity-based attribute splitting method. It is widely used in the CART algorithm [24], the SLIQ algorithm [25], and the SPRINT algorithm [26]; it chooses the splitting attribute and obtains very good classification accuracy. The GI algorithm can be described as follows.


Suppose that $Q$ is a set of $s$ samples and that these samples have $k$ different classes ($c_i$, $i = 1, 2, 3, \ldots, k$). According to the class labels, we can divide $Q$ into $k$ subsets ($Q_i$, $i = 1, 2, 3, \ldots, k$). Let $Q_i$ be the sample set that belongs to class $c_i$ and $s_i$ be the number of samples in $Q_i$; then the GI of the set $Q$ is

$$\mathrm{Gini}(Q) = 1 - \sum_{i=1}^{|C|} P_i^2, \quad (10)$$

where $P_i$ is the probability of $Q_i$ in $c_i$, calculated as $s_i/s$.

When $\mathrm{Gini}(Q)$ is 0, that is, the minimum, all the members of the set belong to the same class, and the maximum useful information can be obtained. When all the samples of the set are distributed equally over the classes, $\mathrm{Gini}(Q)$ is the maximum, and the minimum useful information can be obtained.

For an attribute $A$ with $l$ distinct values, $Q$ is partitioned into $l$ subsets $Q_1, Q_2, \ldots, Q_l$. The GI of $Q$ with respect to the attribute $A$ is defined as

$$\mathrm{Gini}_A(Q) = 1 - \sum_{i=1}^{l}\frac{s_i}{s}\,\mathrm{Gini}(Q_i). \quad (11)$$

The main idea of the GI algorithm is that the attribute with the minimum value of GI is the best attribute, which is chosen to split on.

3.1.2. The Improved Gini Index (IGI) Algorithm. The original form of the GI algorithm measures the impurity of attributes towards classification: the smaller the impurity, the better the attribute. However, many studies on GI theory employ a measure of purity instead: the larger the purity, the better the attribute. Purity is more suitable for text classification. The formula is as follows:

$$\mathrm{Gini}(Q) = \sum_{i=1}^{|C|} P_i^2. \quad (12)$$

This formula measures the purity of attributes towards categorization; the larger the value of purity, the better the attribute. We usually emphasize the high-frequency words in TC, because the high-frequency words contribute more to judging the class of a text. However, when the distribution of the training set is highly unbalanced, the lower-frequency feature words still make some contribution to judging the class of a text, although far less than the high-frequency feature words do. Therefore, we define the IGI algorithm as

$$\mathrm{IGI}(w) = \sum_{i=1}^{|C|} P(w \mid c_i)^2\, P(c_i \mid w)^2, \quad (13)$$

where $w$ is the feature word, $|C|$ is the total number of classes in the training set, $P(w \mid c_i)$ is the probability of text that contains $w$ in class $c_i$, and $P(c_i \mid w)$ is the posterior probability of class $c_i$ for text that contains $w$.

This formula overcomes the shortcoming of the original GI, which considers only the features' conditional probability, by combining the conditional probability with the posterior probability. Hence, the IGI algorithm reduces the adverse effect of an unbalanced training set.
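The IGI score of formula (13) needs only two quantities per class, $P(w \mid c_i)$ and $P(c_i \mid w)$, both of which can be estimated from document counts. The following minimal Python sketch is our own illustration, not the authors' code.

```python
from collections import Counter, defaultdict

def improved_gini_index(docs, labels, word):
    """IGI(w) of formula (13): sum_i P(w|c_i)^2 * P(c_i|w)^2."""
    class_counts = Counter(labels)            # documents per class
    with_w = defaultdict(int)                 # documents of class c_i that contain w
    for tokens, label in zip(docs, labels):
        if word in tokens:
            with_w[label] += 1
    total_with_w = sum(with_w.values())
    if total_with_w == 0:
        return 0.0
    score = 0.0
    for c in class_counts:
        p_w_given_c = with_w[c] / class_counts[c]     # conditional probability P(w | c_i)
        p_c_given_w = with_w[c] / total_with_w        # posterior probability  P(c_i | w)
        score += (p_w_given_c ** 2) * (p_c_given_w ** 2)
    return score

docs = [["cheap", "loan"], ["loan", "rate"], ["match", "goal", "loan"], ["team", "win"]]
labels = ["finance", "finance", "sports", "sports"]
print(improved_gini_index(docs, labels, "loan"))
```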

3.2. TF-Gini Algorithm. The purpose of FS is to choose the most suitable representative features of the original text. In fact, the features differ in how well they distinguish one text from another: some features can effectively distinguish a text from others, while other features cannot. Therefore, we use weights to express the features' distinguishing power; the more distinguishing a feature is, the higher its weight.

The TF-IDF algorithm is a classical algorithm for calculating feature weights. For a word $w$ in text $d$, the weight of $w$ in $d$ is

$$\text{TF-IDF}(w, d) = \mathrm{TF}(w, d)\cdot\mathrm{IDF}(w, d) = \mathrm{TF}(w, d)\cdot\log\frac{|D|}{\mathrm{DF}(w)}, \quad (14)$$

where $\mathrm{TF}(w, d)$ is the frequency of $w$ in text $d$, $|D|$ is the number of text documents in the training set, and $\mathrm{DF}(w)$ is the total number of texts containing $w$ in the training set.

The TF-IDF algorithm is not suitable for TC, mainly because of the shortcoming of its IDF part. TF-IDF considers a word that appears with higher frequency in many different texts of the training set to be unimportant, which is not appropriate for TC. Therefore, we use the purity Gini to replace the IDF part of the TF-IDF formula. The purity Gini is

$$\mathrm{GiniTxt}(w) = \sum_{i=1}^{|C|} P(w \mid c_i)^{\delta}, \quad (15)$$

where $w$ is the feature word, $\delta$ is a nonzero exponent, $P(w \mid c_i)$ is the probability of text that contains $w$ in class $c_i$, and $|C|$ is the total number of classes in the training set. When $\delta = 2$, the formula reduces to the original Gini formula:

$$\mathrm{GiniTxt}(w) = \sum_{i=1}^{|C|} P(w \mid c_i)^2. \quad (16)$$

However, in TC the experimental results indicate that the classification performance is best when $\delta = -1/2$:

$$\mathrm{GiniTxt}(w) = \sum_{i=1}^{|C|}\frac{\sqrt{P(w \mid c_i)}}{P(w \mid c_i)}. \quad (17)$$

Therefore, the new feature weighting algorithm, known as TF-Gini, can be represented as

$$\text{TF-Gini}(w, d) = \mathrm{TF}(w, d)\cdot\mathrm{GiniTxt}(w). \quad (18)$$
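Putting formulas (15)-(18) together, a TF-Gini weight is simply the in-document term frequency multiplied by the corpus-level GiniTxt score. The sketch below is a simplified illustration with $\delta = -1/2$ as in formula (17); skipping classes with $P(w \mid c_i) = 0$ is our own assumption to avoid an undefined power, and the function names and toy data are ours.

```python
from collections import Counter, defaultdict

def gini_txt(docs, labels, word, delta=-0.5):
    """GiniTxt(w) of formula (15), with the empirically best exponent delta = -1/2 (formula (17))."""
    class_counts = Counter(labels)
    with_w = defaultdict(int)
    for tokens, label in zip(docs, labels):
        if word in tokens:
            with_w[label] += 1
    score = 0.0
    for c in class_counts:
        p = with_w[c] / class_counts[c]      # P(w | c_i)
        if p > 0:                            # classes where w never occurs are skipped
            score += p ** delta
    return score

def tf_gini_vector(tokens, docs, labels, vocab, delta=-0.5):
    """TF-Gini weights of formula (18) for one document: TF(w, d) * GiniTxt(w)."""
    tf = Counter(tokens)
    return [tf.get(w, 0) * gini_txt(docs, labels, w, delta) for w in vocab]

docs = [["cheap", "loan", "loan"], ["loan", "rate"], ["match", "goal"], ["team", "win", "goal"]]
labels = ["finance", "finance", "sports", "sports"]
vocab = sorted({w for d in docs for w in d})
print(tf_gini_vector(docs[0], docs, labels, vocab))
```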

3.3. Experiments and Analysis

3.3.1. Data Set Preparation. In this paper we choose English and Chinese data sets to verify the performance of the FS and FW algorithms. The English data set is Reuters-21578, from which we choose the ten largest categories. The Chinese data set is the Fudan University Chinese text data set (19,637 texts); we use 3,522 texts for training and 2,148 texts for testing.

3.3.2. Classifier Selection. This paper mainly focuses on the text preprocessing stage, and the IGI and TF-Gini algorithms are its main subject. To validate the two algorithms' performance, we introduce some commonly used text classifiers, namely, the kNN, fkNN, and SVM classifiers; all the experiments are based on these classifiers.

(1) The kNN Classifier. The $k$ nearest neighbors (kNN) algorithm is a very popular, simple, and nonparametric text classification algorithm. The idea of kNN is to choose for a new input sample the category that appears most often among its $k$ nearest neighbors. Its decision rule can be described as follows:

$$\mu_j(d) = \sum_{i=1}^{k}\mu_j(d_i)\,\mathrm{sim}(d, d_i), \quad (19)$$

where $j$ ($j = 1, 2, 3, \ldots$) indexes the classes, $d$ is a new input sample with unknown category, and $d_i$ is one of its $k$ nearest training texts, whose category is known; $\mu_j(d_i)$ is equal to 1 if sample $d_i$ belongs to class $j$ and 0 otherwise; and $\mathrm{sim}(d, d_i)$ is the similarity between the new input sample and the training text. The decision rule is

$$\text{if } \mu_j(d) = \max_i \mu_i(d), \text{ then } d \in c_j. \quad (20)$$

(2) The fkNN Classifier. The kNN classifier is a lazy classification algorithm that needs no explicit training phase, and its performance is not ideal, especially when the distribution of the training set is unbalanced. Therefore, we improve the algorithm by using a fuzzy logic inference system. Its decision rule can be described as follows:

$$\mu_j(d) = \frac{\sum_{i=1}^{k}\mu_j(d_i)\,\mathrm{sim}(d, d_i)\,\bigl(1 - \mathrm{sim}(d, d_i)\bigr)^{2(1-b)}}{\sum_{i=1}^{k}\bigl(1 - \mathrm{sim}(d, d_i)\bigr)^{2(1-b)}}, \quad (21)$$

where $j$ ($j = 1, 2, 3, \ldots$) indexes the classes and $\mu_j(d_i)\,\mathrm{sim}(d, d_i)$ is the membership value of sample $d$ for class $j$ contributed by neighbor $d_i$; $\mu_j(d_i)$ is equal to 1 if sample $d_i$ belongs to class $j$ and 0 otherwise. From the formula it can be seen that fkNN is based on the kNN algorithm and endows the kNN formula with a distance weight; the parameter $b \in [0, 1]$ adjusts the degree of the weight. The fkNN decision rule is as follows:

$$\text{if } \mu_j(d) = \max_i \mu_i(d), \text{ then } d \in c_j. \quad (22)$$
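The fuzzy membership of formula (21) can be implemented directly on top of any similarity function. The sketch below uses cosine similarity and reads the exponent in (21) as $2(1-b)$; the function names, toy vectors, and edge-case handling are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """sim(d, d_i): cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fknn_classify(query, train_vectors, train_labels, k=3, b=0.5):
    """Fuzzy kNN decision of formulas (21)-(22).

    Each of the k nearest neighbours contributes sim(d, d_i) * (1 - sim)^(2*(1-b)) to its
    own class; memberships are normalised by the sum of the distance weights, and the
    class with the largest membership wins.  Giving every neighbour weight 1 instead
    recovers the plain kNN voting rule of formulas (19)-(20).
    """
    sims = sorted(((cosine(query, v), c) for v, c in zip(train_vectors, train_labels)),
                  reverse=True)[:k]
    denom = sum((1.0 - s) ** (2 * (1 - b)) for s, _ in sims) or 1.0
    membership = {}
    for s, c in sims:
        w = s * (1.0 - s) ** (2 * (1 - b))
        membership[c] = membership.get(c, 0.0) + w / denom
    return max(membership, key=membership.get)

train = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
labels = ["finance", "finance", "sports", "sports"]
print(fknn_classify([1, 1, 0, 0], train, labels, k=3, b=0.5))
```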

[Figure 2: Optimal hyperplane for the SVM classifier. The solid line $H$ ($\omega\cdot x + b = 0$) is the optimal separating hyperplane; $H_1$ ($\omega\cdot x + b = +1$) and $H_2$ ($\omega\cdot x + b = -1$) are the margin hyperplanes on either side of it.]

(3) The SVM Classifier. The support vector machine (SVM) is a classification algorithm based on statistical learning theory. SVM is highly accurate owing to its ability to model complex nonlinear decision boundaries. It uses a nonlinear mapping to transform the original data into a higher dimension and, within this new dimension, searches for the linear optimal separating hyperplane, that is, the decision boundary: with an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. Originally, SVM solves the two-class classification problem; the multiclass problem can be addressed using multiple binary classifications.

Suppose that, in the feature space, a set of training vectors $x_i$ ($i = 1, 2, 3, \ldots$) belong to two different classes, $y_i \in \{-1, 1\}$. We wish to separate this training set with a hyperplane

$$\omega\cdot x + b = 0. \quad (23)$$

Obviously, there are an infinite number of hyperplanes that are able to partition the data set into two sets. However, according to SVM theory, there is only one optimal hyperplane, which lies half-way within the maximal margin, defined by the sum of the distances of the hyperplane to the closest training sample of each class. As shown in Figure 2, the solid line represents the optimal separating hyperplane.

One of the problems of SVM is to find the maximum separating hyperplane, that is, the one with the maximum distance to the nearest training samples. In Figure 2, $H$ is the optimal hyperplane, and $H_1$ and $H_2$ are the hyperplanes parallel to $H$. The task of SVM is to make the margin between $H_1$ and $H_2$ maximal. The margin constraints can be written as follows:

$$\omega\cdot x_i + b \ge +1 \text{ for } y_i = +1, \qquad \omega\cdot x_i + b \le -1 \text{ for } y_i = -1 \;\Longrightarrow\; y_i(\omega\cdot x_i + b) - 1 \ge 0. \quad (24)$$


Therefore, from Figure 2 we can see that solving the SVM amounts to maximizing the margin between $H_1$ and $H_2$:

$$\min_{x_i:\, y_i = +1}\frac{\omega\cdot x_i + b}{\|\omega\|} - \max_{x_i:\, y_i = -1}\frac{\omega\cdot x_i + b}{\|\omega\|} = \frac{2}{\|\omega\|}. \quad (25)$$

Maximizing $2/\|\omega\|$ is the same as minimizing $\|\omega\|^2/2$; therefore, it becomes the following optimization problem:

$$\min\ \frac{\|\omega\|^2}{2} \quad \text{s.t. } y_i(\omega\cdot x_i + b) \ge 1. \quad (26)$$

The points that satisfy $y_i(\omega\cdot x_i + b) = 1$ are called support vectors; namely, the points on $H_1$ and $H_2$ in Figure 2 are all support vectors. In practice, many samples cannot be classified correctly by a hard-margin hyperplane, so we introduce slack variables $\xi_i$, and the above formula becomes

$$\min\ \frac{1}{2}\|\omega\|^2 + \eta\sum_{i}\xi_i \quad \text{s.t. } y_i(\omega\cdot x_i + b) \ge 1 - \xi_i, \quad (27)$$

where $\eta$ is a positive constant that balances the empirical risk and the confidence, the sum runs over the training samples, and $\xi_i \ge 0$ are the slack variables.
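In practice the soft-margin problem (27) is handed to an off-the-shelf solver rather than solved by hand. The sketch below uses scikit-learn's LinearSVC as one such solver (an assumption on our part; the paper does not name its SVM implementation), with the regularization parameter C playing the role of the balance constant.

```python
from sklearn.svm import LinearSVC

# Toy term-frequency vectors for two classes (+1 / -1), standing in for VSM rows.
X = [[2, 1, 0, 0], [1, 2, 0, 0], [3, 1, 1, 0],
     [0, 0, 2, 1], [0, 1, 1, 2], [0, 0, 3, 1]]
y = [1, 1, 1, -1, -1, -1]

# C trades off margin maximisation (minimising ||w||^2/2) against the slack penalty.
clf = LinearSVC(C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)                 # the learned hyperplane w.x + b = 0
print(clf.predict([[1, 1, 0, 0], [0, 0, 1, 1]])) # classify two new samples
```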

3.3.3. Performance Evaluation. The confusion matrix [27], also known as a contingency table or an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning algorithm. It is a table with two rows and two columns that reports the numbers of false positives, false negatives, true positives, and true negatives. Table 2 shows the confusion matrix.

True positive (TP) is the number of positive instances correctly predicted as positive. False positive (FP) is the number of negative instances incorrectly predicted as positive. False negative (FN) is the number of positive instances incorrectly predicted as negative. True negative (TN) is the number of negative instances correctly predicted as negative.

Precision is the proportion of the predicted positive cases that are correct, as calculated using the equation

$$\mathrm{Precision}\ (P) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}. \quad (28)$$

Recall is the proportion of the actual positive cases that are correctly predicted, as calculated using the equation

$$\mathrm{Recall}\ (R) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}. \quad (29)$$

Table 2: Confusion matrix.
                   Prediction as positive   Prediction as negative
Actual positive    TP                       FN
Actual negative    FP                       TN

We hope that both precision and recall take high values; however, the two are in tension. For example, if only one prediction is made and it is correct, the precision is 100% but the recall is very low, because TP equals 1, FP equals 0, and FN is huge. On the other hand, if every instance is predicted as positive (i.e., FN equals 0), the recall is 100% but the precision is low, because FP is large.

$F_1$ [28] combines precision and recall as their harmonic mean. If $F_1$ is high, the algorithm and the experimental method are ideal. $F_1$ measures the performance of a classifier by taking both of the above measures into account and was proposed by Van Rijsbergen [29] in 1979. It can be described as follows:

$$F_1 = \frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \quad (30)$$

Precision and recall only evaluate the performance of a classification algorithm on a single category, but it is often necessary to evaluate classification performance over all the categories. When dealing with multiple categories, there are two possible ways of averaging these measures, namely, macroaverage and microaverage. The macroaverage weights all classes equally, regardless of how many documents belong to each; the microaverage weights all documents equally, thus favoring the performance on common classes. The formulas of the macroaverage and microaverage are as follows, where $|C|$ is the number of classes in the training set:

$$\mathrm{Macro}_p = \frac{\sum_{i=1}^{|C|}\mathrm{Precision}_i}{|C|}, \qquad \mathrm{Macro}_r = \frac{\sum_{i=1}^{|C|}\mathrm{Recall}_i}{|C|}, \qquad \mathrm{Macro}_{F_1} = \frac{\sum_{i=1}^{|C|}F_{1i}}{|C|},$$

$$\mathrm{Micro}_p = \frac{\sum_{i=1}^{|C|}\mathrm{TP}_i}{\sum_{i=1}^{|C|}(\mathrm{TP}_i + \mathrm{FP}_i)}, \qquad \mathrm{Micro}_r = \frac{\sum_{i=1}^{|C|}\mathrm{TP}_i}{\sum_{i=1}^{|C|}(\mathrm{TP}_i + \mathrm{FN}_i)}, \qquad \mathrm{Micro}_{F_1} = \frac{2\times\mathrm{Micro}_p\times\mathrm{Micro}_r}{\mathrm{Micro}_p + \mathrm{Micro}_r}. \quad (31)$$
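The per-class measures (28)-(30) and their macro and micro averages (31) are straightforward to compute from predicted and true labels. A minimal sketch follows; the function names and toy labels are our own.

```python
def per_class_counts(y_true, y_pred, cls):
    """TP, FP, FN for one class, as in Table 2."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp, fp, fn

def evaluate(y_true, y_pred):
    """Precision, recall, F1 per formulas (28)-(30) and macro/micro averages per (31)."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        tp, fp, fn = per_class_counts(y_true, y_pred, c)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p); recalls.append(r); f1s.append(f1)
        tp_sum += tp; fp_sum += fp; fn_sum += fn
    macro = (sum(precisions) / len(classes), sum(recalls) / len(classes), sum(f1s) / len(classes))
    micro_p = tp_sum / (tp_sum + fp_sum) if tp_sum + fp_sum else 0.0
    micro_r = tp_sum / (tp_sum + fn_sum) if tp_sum + fn_sum else 0.0
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    return macro, (micro_p, micro_r, micro_f1)

y_true = ["a", "a", "b", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c", "c"]
print(evaluate(y_true, y_pred))
```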

3.3.4. Experimental Results and Analysis. Figures 3 and 4 show the results on the English data set. The performance of the classical GI is the lowest and that of the IGI is the highest; therefore, the IGI algorithm helps to improve the classifiers' performance.

[Figure 3: Macro-F1 results of the feature selection algorithms (GI, IGI, IG, ECE, CHI, OR, and MI) with the kNN, fkNN, and SVM classifiers on the Reuters-21578 data set.]

[Figure 4: Micro-F1 results of the feature selection algorithms (GI, IGI, IG, ECE, CHI, OR, and MI) with the kNN, fkNN, and SVM classifiers on the Reuters-21578 data set.]

From Figures 5 and 6 it can be seen that, on the Chinese data set, the classical GI still performs poorly. The best performance belongs to ECE, and the second best is IG; the performance of IGI is not the best. Since the Chinese language differs from English in the handling of Stop Words, the result of the Chinese word segmentation algorithm influences the classification accuracy. Meanwhile, these experiments do not yet consider the feature words' weights; the TF-Gini algorithm below provides a good solution to this problem.

Although the IGI algorithm does not achieve the best performance on the Chinese data set, its performance is very close to the highest. Hence, the IGI feature selection algorithm is suitable and acceptable for the classification algorithms and promotes the classifiers' performance effectively.

[Figure 5: Macro-F1 results of the feature selection algorithms (GI, IGI, IG, ECE, CHI, OR, and MI) with the kNN, fkNN, and SVM classifiers on the Chinese data set.]

[Figure 6: Micro-F1 results of the feature selection algorithms (GI, IGI, IG, ECE, CHI, OR, and MI) with the kNN, fkNN, and SVM classifiers on the Chinese data set.]

Figures 7, 8, 9, and 10 show the results of the feature weighting algorithms on the English and Chinese data sets. To determine which algorithm performs best, we compare different TF-based weighting algorithms, such as TF-IDF and TF-Gini.

From Figures 7 and 8 we can see that the performance of the TF-Gini weighting algorithm on the English data set is good and acceptable; it is very close to the highest. From Figures 9 and 10 we can see that, on the Chinese data set, the TF-Gini algorithm overcomes the deficiencies of the IGI algorithm and achieves the best performance. These results show that TF-Gini is a very promising feature weighting algorithm.

[Figure 7: Feature weighting Macro-F1 results of TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR with the kNN, fkNN, and SVM classifiers on the Reuters-21578 data set.]

[Figure 8: Feature weighting Micro-F1 results of TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR with the kNN, fkNN, and SVM classifiers on the Reuters-21578 data set.]

4. Sensitive Information Identification System Based on the TF-Gini Algorithm

4.1. Introduction. With the rapid development of network technologies, the Internet has become a huge repository of information. At the same time, much garbage information is poured into the Internet, such as viruses, violence, and gambling. All of this is called sensitive information, and it produces extremely adverse impacts and serious consequences.

[Figure 9: Feature weighting Macro-F1 results of TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR with the kNN, fkNN, and SVM classifiers on the Chinese data set.]

[Figure 10: Feature weighting Micro-F1 results of TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR with the kNN, fkNN, and SVM classifiers on the Chinese data set.]

The commonly used methods of Internet information filtering and monitoring are based on string matching. However, there are a large number of synonyms and ambiguous words in natural texts, and many new words and expressions appear on the Internet every day; therefore, the traditional sensitive information filtering and monitoring methods are not effective. In this paper we propose a new sensitive information filtering system based on machine learning, with the TF-Gini text feature selection algorithm at its core. This system can discover sensitive information in a fast, accurate, and intelligent way.

[Figure 11: Sensitive system architecture. A training pipeline (select a training set, Chinese word segmentation, VSM, feature selection and weighting with the TF-Gini algorithm, classification, classification model) and a monitoring pipeline (Chinese word segmentation, VSM, load the model, decide whether a text is sensitive or nonsensitive, results evaluation).]

4.2. System Architecture. The sensitive system includes text preprocessing, feature selection, text representation, text classification, and result evaluation modules. Figure 11 shows the system architecture.

4.2.1. Chinese Word Segmentation. The function of this module is to segment Chinese words and get rid of Stop Words. In this module we use IKanalyzer, a lightweight open-source Chinese word segmentation tool, to complete the segmentation:

(1) IKanalyzer uses a multiprocessor module and supports the processing of English letters, Chinese vocabulary, and individual characters.

(2) It optimizes dictionary storage and supports user-defined dictionaries.

4.2.2. Feature Selection. Selecting feature words is a very important preparation step for classification, and the TF-Gini algorithm performs well on Chinese data sets; in this step we therefore use TF-Gini as the core feature selection and weighting algorithm.

4.2.3. Text Classification. We use the Naive Bayes algorithm [30] as the main classifier. To verify the performance of the Naive Bayes algorithm in this system, we compare it with the kNN algorithm and the improved Naive Bayes algorithm [31].

The kNN algorithm is a typical text classification algorithm that does not need an explicit training stage. To calculate the similarity between the test sample $q$ and the training samples $d_j$, we adopt the cosine similarity, which can be described as follows:

$$\cos(q, d_j) = \frac{\sum_{i=1}^{|V|} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{|V|} w_{ij}^2}\,\sqrt{\sum_{i=1}^{|V|} w_{iq}^2}}, \quad (32)$$

where $w_{ij}$ is the weight of the $i$th feature word in $d_j$, $w_{iq}$ is the weight of the $i$th feature word in sample $q$, and $|V|$ is the dimension of the feature space.
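A small sketch of formula (32) follows, ranking training documents by cosine similarity to a query vector as the kNN step of the system would; the toy weight vectors and names are illustrative assumptions.

```python
import math

def cosine_similarity(d_j, q):
    """Formula (32): cos(q, d_j) over the |V|-dimensional weight vectors."""
    dot = sum(w_ij * w_iq for w_ij, w_iq in zip(d_j, q))
    norm = math.sqrt(sum(w * w for w in d_j)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

# Rank training documents by similarity to a query vector (the kNN retrieval step).
train_vectors = {"doc1": [0.9, 0.1, 0.0], "doc2": [0.0, 0.8, 0.7], "doc3": [0.4, 0.4, 0.1]}
query = [0.7, 0.2, 0.0]
ranked = sorted(train_vectors,
                key=lambda name: cosine_similarity(train_vectors[name], query),
                reverse=True)
print(ranked)
```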

The Naive Bayes classifier is based on Bayes' theorem. Assume $m$ classes $c_1, c_2, \ldots, c_m$ and a given unknown text $q$ with no class label. The classification algorithm predicts that $q$ belongs to the class with the highest posterior probability given $q$:

$$P(c_i \mid q) = \frac{P(q \mid c_i)\,P(c_i)}{P(q)}. \quad (33)$$

Naive Bayes classification is based on a simple assumption: given the class label of a sample, the attributes are conditionally independent. The probability $P(q)$ is constant and the probability $P(c_i)$ is also treated as constant, so the calculation of $P(q \mid c_i)$ can be simplified as follows:

$$P(c_i \mid q) = \prod_{j=1}^{n} P(x_j \mid c_i), \quad (34)$$

where $x_j$ is the weight of the $j$th feature word of text $q$ and $n$ is the number of features in $q$.
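The following minimal Naive Bayes sketch illustrates formulas (33)-(34) for the sensitive/nonsensitive decision. Add-one smoothing, keeping the class prior, and working with log-probabilities are our own choices for numerical robustness; the paper does not specify these details, and the toy corpus is hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Estimate P(c_i) and per-class feature counts with add-one smoothing (an assumed choice)."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)          # per-class feature counts
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: class_docs[c] / len(docs) for c in class_docs}
    return priors, word_counts, vocab

def classify(tokens, priors, word_counts, vocab):
    """Pick the class maximising log P(c_i) + sum_j log P(x_j | c_i), the log form of (33)-(34)."""
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values()) + len(vocab)
        score = math.log(prior)
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / total)   # smoothed P(x_j | c_i)
        scores[c] = score
    return max(scores, key=scores.get)

docs = [["win", "money", "now"], ["cheap", "offer", "money"],
        ["meeting", "schedule"], ["project", "report", "meeting"]]
labels = ["sensitive", "sensitive", "nonsensitive", "nonsensitive"]
priors, counts, vocab = train_naive_bayes(docs, labels)
print(classify(["cheap", "money"], priors, counts, vocab))
```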

4.3. Experimental Results and Analysis. In order to verify the performance of the system, we run two experiments on different data sets, using tenfold cross validation in Experiment 1 and threefold cross validation in Experiment 2.

Experiment 1. The Chinese data set has 1000 articles, including 500 sensitive texts and 500 nonsensitive texts. The 1000 texts are randomly divided into 10 portions; one portion is selected as the test set and the other 9 are used as the training set. The process is repeated 10 times, and the final result is the average of the 10 classification results. Table 3 shows the result of Experiment 1.

Table 3: The result of Experiment 1.
Classifier              Precision (%)   Recall (%)   F1 (%)
kNN                     69.2            72           70.6
Naive Bayes             80.77           84           82.35
Improved Naive Bayes    71.88           92           80.7

Experiment 2. The Chinese data set has 1500 articles, including 750 sensitive texts and 750 nonsensitive texts. All the texts are randomly divided into 3 parts; one part is the test set and the other two are the training set. The final result is the average of the 3 classification results. Table 4 shows the result of Experiment 2.

Table 4: The result of Experiment 2.
Classifier              Precision (%)   Recall (%)   F1 (%)
kNN                     96.76           55.41        70.44
Naive Bayes             96.6            96.6         96.6
Improved Naive Bayes    83.61           100          91.08

From the results of Experiments 1 and 2 we can see that all three classifiers perform well, with the Naive Bayes and improved Naive Bayes algorithms performing best. That is, the sensitive information system can satisfy the demands of Internet sensitive information recognition.

5. Conclusion

Feature selection is an important aspect of text preprocessing, and its performance directly affects the classifiers' accuracy. In this paper we employ a feature selection algorithm, the IGI algorithm, and compare its performance with the IG, CHI, MI, and OR algorithms. The experimental results show that the IGI algorithm has the best performance on the English data set and is very close to the best performance on the Chinese data set; therefore, the IGI algorithm is a promising algorithm for text preprocessing.

Feature weighting is another important aspect of text preprocessing, since the importance of a feature differs across categories. We endow different feature words with different weights: the more important a feature word is, the larger its weight. Following the TF-IDF framework, we construct a feature weighting algorithm, the TF-Gini algorithm, by replacing the IDF part of the TF-IDF algorithm with the IGI algorithm. To test the performance of TF-Gini, we also construct the TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR algorithms in the same way. The experimental results show that TF-Gini performs best on the Chinese data set and is very close to the best performance on the English data set. Hence, the TF-Gini algorithm can improve a classifier's performance, especially on Chinese data sets.

This paper also introduces a sensitive information identification system, which can monitor and filter sensitive information on the Internet. Considering the performance of TF-Gini on Chinese data sets, we choose this algorithm for text preprocessing in the system, and we choose the Naive Bayes classifier as the core classifier. The experimental results show that the system achieves good performance.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper is partly supported by the engineering planning project "Research on Technology of Big Data-Based High Efficient Classification of News" funded by Communication University of China (3132015XNG1504), partly supported by "The Comprehensive Reform Project of Computer Science and Technology" funded by the Ministry of Education of China (ZL140103 and ZL1503), and partly supported by "The Excellent Young Teachers Training Project" funded by Communication University of China (YXJS201508 and XNG1526).

References

[1] S. Jinshu, Z. Bofeng, and X. Xin, "Advances in machine learning based text categorization," Journal of Software, vol. 17, no. 9, pp. 1848–1859, 2006.
[2] G. Salton and C. Yang, "On the specification of term values in automatic indexing," Journal of Documentation, vol. 29, no. 4, pp. 351–372, 1973.
[3] Z. Cheng, L. Qing, and L. Fujun, "Improved VSM algorithm and its application in FAQ," Computer Engineering, vol. 38, no. 17, pp. 201–204, 2012.
[4] X. Junling, Z. Yuming, C. Lin, and X. Baowen, "An unsupervised feature selection approach based on mutual information," Journal of Computer Research and Development, vol. 49, no. 2, pp. 372–382, 2012.
[5] Z. Zhenhai, L. Shining, and L. Zhigang, "Multi-label feature selection algorithm based on information entropy," Journal of Computer Research and Development, vol. 50, no. 6, pp. 1177–1184, 2013.
[6] L. Kousu and S. Caiqing, "Research on feature-selection in Chinese text classification," Computer Simulation, vol. 24, no. 3, pp. 289–291, 2007.
[7] Y. Yang and J. Q. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning, pp. 412–420, Nashville, Tenn, USA, July 1997.
[8] Q. Liqing, Z. Ruyi, Z. Gang, et al., "An extensive empirical study of feature selection for text categorization," in Proceedings of the 7th IEEE/ACIS International Conference on Computer and Information Science, pp. 312–315, IEEE, Washington, DC, USA, May 2008.
[9] S. Wenqian, D. Hongbin, Z. Haibin, and W. Yongbin, "A novel feature weight algorithm for text categorization," in Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE '08), pp. 1–7, Beijing, China, October 2008.
[10] L. Yongmin and Z. Weidong, "Using Gini-index for feature selection in text categorization," Computer Applications, vol. 27, no. 10, pp. 66–69, 2007.
[11] L. Xuejun, "Design and research of decision tree classification algorithm based on minimum Gini index," Software Guide, vol. 8, no. 5, pp. 56–57, 2009.
[12] R. Guofeng, L. Dehua, and P. Ying, "An improved algorithm for feature weight based on Gini index," Computer & Digital Engineering, vol. 38, no. 12, pp. 8–13, 2010.
[13] L. Yongmin, L. Zhengyu, and Z. Shuang, "Analysis and improvement of feature weighting method TF-IDF in text categorization," Computer Engineering and Design, vol. 29, no. 11, pp. 2923–2925, 2929, 2008.
[14] Q. Shian and L. Fayun, "Improved TF-IDF method in text classification," New Technology of Library and Information Service, vol. 238, no. 10, pp. 27–30, 2013.
[15] L. Yonghe and L. Yanfeng, "Improvement of text feature weighting method based on TF-IDF algorithm," Library and Information Service, vol. 57, no. 3, pp. 90–95, 2013.
[16] Y. Xu, Z. Li, and J. Chen, "Parallel recognition of illegal Web pages based on improved KNN classification algorithm," Journal of Computer Applications, vol. 33, no. 12, pp. 3368–3371, 2013.
[17] Y. Liu, Y. Jian, and J. Liping, "An adaptive large margin nearest neighbor classification algorithm," Journal of Computer Research and Development, vol. 50, no. 11, pp. 2269–2277, 2013.
[18] M. Wan, G. Yang, Z. Lai, and Z. Jin, "Local discriminant embedding with applications to face recognition," IET Computer Vision, vol. 5, no. 5, pp. 301–308, 2011.
[19] U. Maulik and D. Chakraborty, "Fuzzy preference based feature selection and semi-supervised SVM for cancer classification," IEEE Transactions on Nanobioscience, vol. 13, no. 2, pp. 152–160, 2014.
[20] X. Liang, "An effective method of pruning support vector machine classifiers," IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 26–38, 2010.
[21] P. Baomao and S. Haoshan, "Research on improved algorithm for Chinese word segmentation based on Markov chain," in Proceedings of the 5th International Conference on Information Assurance and Security (IAS '09), pp. 236–238, Xi'an, China, September 2009.
[22] S. Yan, C. Dongfeng, Z. Guiping, and Z. Hai, "Approach to Chinese word segmentation based on character-word joint decoding," Journal of Software, vol. 20, no. 9, pp. 2366–2375, 2009.
[23] J. Savoy, "A stemming procedure and stopword list for general French corpora," Journal of the American Society for Information Science, vol. 50, no. 10, pp. 944–952, 1999.
[24] L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski, "Decision trees for mining data streams based on the McDiarmid's bound," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1272–1279, 2013.
[25] N. Prasad, P. Kumar, and M. M. Naidu, "An approach to prediction of precipitation using Gini index in SLIQ decision tree," in Proceedings of the 4th International Conference on Intelligent Systems, Modelling and Simulation (ISMS '13), pp. 56–60, Bangkok, Thailand, January 2013.
[26] S. S. Sundhari, "A knowledge discovery using decision tree by Gini coefficient," in Proceedings of the International Conference on Business, Engineering and Industrial Applications (ICBEIA '11), pp. 232–235, IEEE, Kuala Lumpur, Malaysia, June 2011.
[27] S. V. Stehman, "Selecting and interpreting measures of thematic classification accuracy," Remote Sensing of Environment, vol. 62, no. 1, pp. 77–89, 1997.
[28] F. Guohe, "Review of performance of text classification," Journal of Intelligence, vol. 30, no. 8, pp. 66–69, 2011.
[29] C. J. Van Rijsbergen, Information Retrieval, Butterworths, London, UK, 1979.
[30] L. Jianghao, Y. Aimin, Y. Yongmei, et al., "Classification of microblog sentiment based on Naive Bayesian," Computer Engineering & Science, vol. 34, no. 9, pp. 160–165, 2012.
[31] Z. Kenan, Y. Baolin, M. Yaming, and H. Yingnan, "Malware classification approach based on valid window and Naive Bayes," Journal of Computer Research and Development, vol. 51, no. 2, pp. 373–381, 2014.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 3: Research Article Improved Feature Weight Algorithm and Its ...

Mathematical Problems in Engineering 3

Table 1 Three documents represented in the VSM

1199051

1199052

1199053

1199054

1199055

1199056

1199057

1199058

1199059

11990510

doc1

1 2 5 7 9doc2

3 4 6 8doc3

10 11 12 13 14 15

119894th feature and 119908119894

(1 le 119894 le 119899) is the featurersquos frequency whichis also called weights Table 1 is an example of a VSM of threedocuments with ten feature words doc

1 doc2 and doc

3are

documents 1199051 1199052 119905

10are feature words

Therefore we can represent doc1 doc2 and doc

3in

VSM as 1198791

= ⟨1199051 1⟩ ⟨119905

2 2⟩ ⟨119905

4 5⟩ ⟨119905

6 7⟩ ⟨119905

8 9⟩ 119879

2=

⟨1199052 3⟩ ⟨119905

4 4⟩ ⟨119905

6 6⟩ ⟨119905

7 8⟩ and 119879

3= ⟨119905

1 10⟩ ⟨119905

3 11⟩

⟨1199055 12⟩ ⟨119905

8 13⟩ ⟨119905

9 14⟩ ⟨119905

10 15⟩

(4) FeatureWords Selection After StopWords and stemmingthere are still tens of thousands of words in a training set Itis impossible for a classifier to handle so many words Thepurpose of feature words selection is to reduce the dimensionof features words It is a very important step for text classi-fication Choosing the appropriate representative words canimprove classifiersrsquo performance This paper mainly focuseson this stage

(5) Classification and Building Classification Model After theabove steps all the training set texts have beenmapped to theVSMThis step uses the classification algorithms onVSM andbuilds classification model which can assign an appropriatecategory to new input samples

212 The Classification Process The steps of word segmenta-tion Stop Words or stemming and VSM are similar to theclassification processThe classification process will assign anappropriate category to new samples When a new samplearrives using the classification model calculates the mostlikely class label of the new sample and assigns it to thesample The results evaluation is a very important step in thetext classification process It indicates whether the algorithmis good or not

22 Feature Selection Although the Stop Words and stem-ming step get rid of some useless words the VSM still hasmany insignificantwordswhich even have a negative effect onclassification It is difficult for many classification algorithmsto handle high-dimensional data sets Hence we need toreduce the dimensional space and improve the classifiersrsquoperformance FS uses machine learning algorithms to chooseappropriate words that can represent the original text whichcan reduce the dimensions of the feature space

The definition of FS is as followsSelect119898 features from the original 119899 features (119898 le 119899)The

119898 features can bemore concise andmore efficient to representthe contents of the textThe commonly used FS algorithm canbe described as follows

221 InformationGain (IG) IG iswidely used in themachinelearning field which is a criterion of the importance of the

feature to a class The value of IG equals the difference ofinformation entropy before and after the appearance of afeature in a classThe greater the IG the greater the amount ofinformation the feature contains the more important in theTC Therefore we can filter the best feature subset accordingto the featurersquos IG The computing formula can be describedas follows

$$\mathrm{InfGain}(w) = -\sum_{i=1}^{|C|} P(c_i)\log P(c_i) + P(w)\sum_{i=1}^{|C|} P(w \mid c_i)\log P(w \mid c_i) + P(\bar{w})\sum_{i=1}^{|C|} P(\bar{w} \mid c_i)\log P(\bar{w} \mid c_i), \tag{1}$$

where w is the feature word, P(w) is the probability of text that contains w appearing in the training set, P(w̄) is the probability of text that does not contain w appearing in the training set, P(c_i) is the probability of class c_i in the training set, P(w | c_i) is the probability of text that contains w in c_i, P(w̄ | c_i) is the probability of text that does not contain w in c_i, and |C| is the total number of classes in the training set.

The shortcoming of the IG algorithm is that it takes into account features which do not appear: although in some circumstances the nonappearing features can contribute to the classification, their contribution is far less than that of the features appearing in the training set. Moreover, if the training set is unbalanced and P(w̄) ≫ P(w), the value of IG is dominated by P(w̄) ∑_{i=1}^{|C|} P(w̄ | c_i) log P(w̄ | c_i), and the performance of IG can be significantly reduced.
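To make the computation concrete, the following is a minimal sketch of formula (1) under the assumption that the training corpus is summarized as per-class document counts and per-class counts of documents containing the feature word; the function name and data layout are illustrative and not part of the original paper.

```python
import math

def information_gain(docs_per_class, docs_with_w_per_class, total_docs):
    """Sketch of formula (1): IG of feature word w from document counts.

    docs_per_class[c]        -- number of training documents in class c
    docs_with_w_per_class[c] -- number of documents in class c containing w
    """
    n_with_w = sum(docs_with_w_per_class.values())
    p_w = n_with_w / total_docs               # P(w)
    p_not_w = 1.0 - p_w                       # P(w_bar)

    gain = 0.0
    for c, n_c in docs_per_class.items():
        p_c = n_c / total_docs                # P(c_i)
        gain -= p_c * math.log(p_c)           # entropy of the class prior

        p_w_c = docs_with_w_per_class.get(c, 0) / n_c   # P(w | c_i)
        p_not_w_c = 1.0 - p_w_c                          # P(w_bar | c_i)
        if p_w_c > 0:
            gain += p_w * p_w_c * math.log(p_w_c)
        if p_not_w_c > 0:
            gain += p_not_w * p_not_w_c * math.log(p_not_w_c)
    return gain
```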

2.2.2. Expected Cross Entropy (ECE). ECE is also known as KL distance, and it reflects the distance between the probability of the theme class and the probability of the theme class under the condition of a specific feature. The computing formula can be described as follows:

$$\mathrm{CrossEntropyTxt}(w) = P(w)\sum_{i=1}^{|C|} P(w \mid c_i)\log\frac{P(w \mid c_i)}{P(c_i)}, \tag{2}$$

where w is the feature word, P(w) is the probability of text that contains w appearing in the training set, P(c_i) is the probability of class c_i in the training set, P(w | c_i) is the probability of text that contains w in c_i, and |C| is the total number of classes in the training set.

ECE has been widely used in feature selection for TC, and it also achieves good performance. Compared with the IG algorithm, ECE does not consider the case in which a feature does not appear, which reduces the interference of rare features that occur seldom and improves the performance of classification. However, it also has some limitations:

(1) If P(w | c_i) > P(c_i), then log(P(w | c_i)/P(c_i)) > 0. When P(w | c_i) is larger and P(c_i) is smaller, the logarithmic value is larger, which indicates that feature w has a strong association with class c_i.

(2) If P(w | c_i) < P(c_i), then log(P(w | c_i)/P(c_i)) < 0. When P(w | c_i) is smaller and P(c_i) is larger, the logarithmic value is smaller, which indicates that feature w has a weaker association with class c_i.

(3) When P(w | c_i) = 0, there is no association between feature w and class c_i, and the logarithmic expression is undefined. Introducing a small parameter, for example, ε = 0.0001, that is, using log((P(w | c_i) + ε)/P(c_i)), makes the logarithm well defined.

(4) If P(w | c_i) and P(c_i) are close to each other, then log(P(w | c_i)/P(c_i)) is close to zero. When P(w | c_i) is large, this indicates that feature w has a strong association with class c_i and should be retained; however, when log(P(w | c_i)/P(c_i)) is close to zero, feature w will be removed. In this case we add information entropy to the ECE algorithm. The formula of information entropy is as follows:

$$\mathrm{IE}(w) = -\sum_{i=1}^{|C|} P(w \mid c_i)\log P(w \mid c_i). \tag{3}$$

In summary, combining (2) and (3), the modified ECE formula is as follows:

$$\mathrm{ECE}(w) = \frac{P(w)}{\mathrm{IE}(w) + \varepsilon}\sum_{i=1}^{|C|}\left(P(w \mid c_i) + \varepsilon\right)\log\frac{P(w \mid c_i) + \varepsilon}{P(c_i)}. \tag{4}$$

If feature w exists only in one class, the value of information entropy is zero, that is, IE(w) = 0. Hence, we introduce a small parameter in the denominator as a regulator.
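The following is a minimal sketch of the smoothed score in formulas (3) and (4), assuming the per-class probabilities have already been estimated from the training set; names and the input layout are illustrative.

```python
import math

def expected_cross_entropy(p_w, p_w_given_c, p_c, eps=1e-4):
    """Sketch of formulas (3) and (4): smoothed ECE score for feature word w.

    p_w          -- P(w), fraction of training documents containing w
    p_w_given_c  -- dict class -> P(w | c_i)
    p_c          -- dict class -> P(c_i)
    eps          -- small regulator (the paper uses 0.0001)
    """
    # Information entropy IE(w), formula (3)
    ie = -sum(p * math.log(p) for p in p_w_given_c.values() if p > 0)

    # Smoothed ECE, formula (4)
    total = 0.0
    for c in p_c:
        pwc = p_w_given_c.get(c, 0.0)
        total += (pwc + eps) * math.log((pwc + eps) / p_c[c])
    return p_w / (ie + eps) * total
```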

2.2.3. Mutual Information (MI). In TC, MI expresses the association between feature words and classes. The MI between w and c_i is defined as

$$\mathrm{MutualInfo}(w) = \log\frac{P(w \mid c_i)}{P(w)} = \log\frac{P(w, c_i)}{P(w)P(c_i)}, \tag{5}$$

where w is the feature word, P(w, c_i) is the probability of text that contains w in class c_i, P(w) is the probability of text that contains w in the training set, and P(c_i) is the probability of class c_i in the training set.

It is necessary to calculate the MI between each feature and each category in the training set. The following two formulas can be used to combine the per-class MI values:

$$\mathrm{MI}_{\max}(w) = \max_{i=1}^{|C|}\mathrm{MutualInfo}(w), \qquad \mathrm{MI}_{\mathrm{avg}}(w) = \sum_{i=1}^{|C|} P(c_i)\,\mathrm{MutualInfo}(w), \tag{6}$$

where P(c_i) is the probability of class c_i in the training set and |C| is the total number of classes in the training set. Generally, MI_max(w) is the more commonly used.
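A minimal sketch of the two combinations in formula (6); the small epsilon that guards the logarithm is our own addition, not part of the original formulas.

```python
import math

def mi_max(p_w, p_w_given_c, eps=1e-12):
    """Formula (6), first variant: MI_max(w) = max_i log(P(w|c_i)/P(w))."""
    return max(math.log((pwc + eps) / (p_w + eps)) for pwc in p_w_given_c.values())

def mi_avg(p_w, p_w_given_c, p_c, eps=1e-12):
    """Formula (6), second variant: MI_avg(w) = sum_i P(c_i) log(P(w|c_i)/P(w))."""
    return sum(p_c[c] * math.log((p_w_given_c.get(c, 0.0) + eps) / (p_w + eps))
               for c in p_c)
```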

2.2.4. Odds Ratio (OR). The computing formula can be described as follows:

$$\mathrm{OddsRatio}(w, c_i) = \log\frac{P(w \mid c_i)\left(1 - P(w \mid \bar{c}_i)\right)}{\left(1 - P(w \mid c_i)\right)P(w \mid \bar{c}_i)}, \tag{7}$$

where w is the feature word, c_i represents the positive class, and c̄_i represents the negative class; P(w | c_i) is the probability of text that contains w in class c_i, and P(w | c̄_i) is the conditional probability of text that contains w in all the classes except c_i.

The OR metric measures membership and nonmembership in a specific class with its numerator and denominator, respectively. Therefore, the numerator must be maximized and the denominator minimized to get the highest score. The formula is a one-sided metric, because the logarithm produces negative scores when the value of the fraction is between 0 and 1; in that case the features have negative values and point to negative features. Thus, if we only want to identify the positive class and do not care about the negative class, Odds Ratio has a great advantage. Odds Ratio is suitable for binary classifiers.
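A minimal sketch of formula (7) for one positive class; the epsilon smoothing is our assumption to keep the ratio finite when a probability equals 0 or 1.

```python
import math

def odds_ratio(p_w_pos, p_w_neg, eps=1e-6):
    """Sketch of formula (7): OR of word w for one positive class.

    p_w_pos -- P(w | c_i), probability of w in the positive class
    p_w_neg -- P(w | c_i_bar), probability of w in all other classes
    """
    num = (p_w_pos + eps) * (1.0 - p_w_neg + eps)
    den = (1.0 - p_w_pos + eps) * (p_w_neg + eps)
    return math.log(num / den)
```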

2.2.5. χ² (CHI). The χ² statistical algorithm measures the relevance between the feature w and the class c_i. The higher the score, the stronger the relevance between the feature w and the class c_i, meaning that the feature contributes more to the class. The computing formula can be described as follows:

$$\chi^2(w, c_i) = \frac{|C|\left(A_1 A_4 - A_2 A_3\right)^2}{\left(A_1 + A_3\right)\left(A_2 + A_4\right)\left(A_1 + A_2\right)\left(A_3 + A_4\right)}, \tag{8}$$

where w is the feature word, A_1 is the number of documents containing feature w and belonging to class c_i, A_2 is the number of documents containing feature w but not belonging to class c_i, A_3 is the number of documents belonging to class c_i but not containing feature w, A_4 is the number of documents neither containing feature w nor belonging to class c_i, and |C| is the total number of text documents in class c_i.

When the feature w and the class c_i are independent, χ²(w, c_i) = 0, which means that the feature w contains no identification information. We calculate the value for each category and then use formula (9) to compute the value over the entire training set:

$$\chi^2_{\max}(w) = \max_{1 \le i \le |C|}\chi^2(w, c_i). \tag{9}$$
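A minimal sketch of formulas (8) and (9) from the A1–A4 counts defined above. Note one assumption on our part: the multiplier is taken as the total number of documents N = A1 + A2 + A3 + A4, the standard form of the χ² statistic, rather than the |C| symbol printed in (8).

```python
def chi_square(a1, a2, a3, a4):
    """Sketch of formula (8) for one (feature, class) pair.

    a1 -- documents that contain w and belong to c_i
    a2 -- documents that contain w but do not belong to c_i
    a3 -- documents in c_i that do not contain w
    a4 -- documents that neither contain w nor belong to c_i
    """
    n = a1 + a2 + a3 + a4                     # total number of documents (assumption)
    num = n * (a1 * a4 - a2 * a3) ** 2
    den = (a1 + a3) * (a2 + a4) * (a1 + a2) * (a3 + a4)
    return num / den if den else 0.0

def chi_square_max(counts_per_class):
    """Formula (9): take the maximum score over all classes.

    counts_per_class -- iterable of (a1, a2, a3, a4) tuples, one per class
    """
    return max(chi_square(*c) for c in counts_per_class)
```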

3. The Gini Feature Selection Algorithm

3.1. The Improved Gini Index Algorithm

3.1.1. Gini Index (GI). The GI is an impurity-based attribute splitting method. It is widely used in the CART algorithm [24], the SLIQ algorithm [25], and the SPRINT algorithm [26] to choose the splitting attribute, and it obtains very good classification accuracy. The GI algorithm can be described as follows.

Suppose that Q is a set of s samples and that these samples have k different classes (c_i, i = 1, 2, 3, ..., k). According to the differences of classes, we can divide Q into k subsets (Q_i, i = 1, 2, 3, ..., k). Suppose that Q_i is the sample set that belongs to class c_i and that s_i is the total number of samples in Q_i; then the GI of set Q is

$$\mathrm{Gini}(Q) = 1 - \sum_{i=1}^{|C|} P_i^2, \tag{10}$$

where P_i is the probability of Q_i in c_i, calculated as s_i/s.

When Gini(Q) is 0, that is, at its minimum, all the members in the set belong to the same class, and the maximum useful information can be obtained. When all of the samples in the set are distributed equally over the classes, Gini(Q) reaches its maximum, and the minimum useful information can be obtained.

For an attribute A with l distinct values, Q is partitioned into l subsets Q_1, Q_2, ..., Q_l. The GI of Q with respect to the attribute A is defined as

$$\mathrm{Gini}_A(Q) = 1 - \sum_{i=1}^{l}\frac{s_i}{s}\,\mathrm{Gini}(Q_i). \tag{11}$$

The main idea of the GI algorithm is that the attribute with the minimum value of GI is the best attribute, which is chosen for splitting.

3.1.2. The Improved Gini Index (IGI) Algorithm. The original form of the GI algorithm was used to measure the impurity of attributes for classification: the smaller the impurity, the better the attribute. However, many studies on GI theory have instead employed a measure of purity: the larger the purity, the better the attribute. Purity is more suitable for text classification. The formula is as follows:

$$\mathrm{Gini}(Q) = \sum_{i=1}^{|C|} P_i^2. \tag{12}$$

This formula measures the purity of attributes for categorization: the larger the value of purity, the better the attribute. We usually emphasize the high-frequency words in TC, because the high-frequency words contribute more to judging the class of texts. But when the distribution of the training set is highly unbalanced, the lower-frequency feature words still make some contribution to judging the class of texts, although the contribution is far less significant than that of the high-frequency feature words. Therefore, we define the IGI algorithm as

$$\mathrm{IGI}(w) = \sum_{i=1}^{|C|} P(w \mid c_i)^2\, P(c_i \mid w)^2, \tag{13}$$

where w is the feature word, |C| is the total number of classes in the training set, P(w | c_i) is the probability of text that contains w in class c_i, and P(c_i | w) is the posterior probability of class c_i given that the text contains w.

This formula overcomes the shortcoming of the original GI, which considers only the features' conditional probability, by combining the conditional probability and the posterior probability. Hence, the IGI algorithm can reduce the adverse effect of an unbalanced training set.
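A minimal sketch of formula (13), assuming the training corpus is summarized as per-class document frequencies; the function name and input layout are illustrative.

```python
def improved_gini_index(df_w_per_class, docs_per_class):
    """Sketch of formula (13): IGI(w) from document-frequency counts.

    df_w_per_class[c] -- number of documents in class c that contain w
    docs_per_class[c] -- number of documents in class c
    """
    df_w_total = sum(df_w_per_class.values())
    if df_w_total == 0:
        return 0.0
    score = 0.0
    for c, n_c in docs_per_class.items():
        p_w_c = df_w_per_class.get(c, 0) / n_c          # P(w | c_i)
        p_c_w = df_w_per_class.get(c, 0) / df_w_total   # P(c_i | w)
        score += (p_w_c ** 2) * (p_c_w ** 2)
    return score
```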

3.2. TF-Gini Algorithm. The purpose of FS is to choose the most suitable representative features to represent the original text. In fact, the features' distinguishability differs: some features can effectively distinguish a text from others, but other features cannot. Therefore, we use weights to express the different features' distinguishability; features with higher distinguishability get higher weights.

The TF-IDF algorithm is a classical algorithm to calculate the features' weights. For a word w in text d, the weight of w in d is as follows:

$$\mathrm{TF\text{-}IDF}(w, d) = \mathrm{TF}(w, d)\cdot\mathrm{IDF}(w, d) = \mathrm{TF}(w, d)\cdot\log\frac{|D|}{\mathrm{DF}(w)}, \tag{14}$$

where TF(w, d) is the frequency of w's appearance in text d, |D| is the number of text documents in the training set, and DF(w) is the total number of texts containing w in the training set.

The TF-IDF algorithm is not well suited to TC, mainly because of the shortcoming of its IDF part: TF-IDF considers a word that appears with high frequency in many different texts of the training set to be unimportant, which is not an appropriate assumption for TC. Therefore, we use the purity Gini to replace the IDF part of the TF-IDF formula. The purity Gini is as follows:

$$\mathrm{GiniTxt}(w) = \sum_{i=1}^{|C|} P(w \mid c_i)^{\delta}, \tag{15}$$

where w is the feature word and δ is a nonzero value; P(w | c_i) is the probability of text that contains w in class c_i, and |C| is the total number of classes in the training set. When δ = 2, the formula reduces to the original purity Gini:

$$\mathrm{GiniTxt}(w) = \sum_{i=1}^{|C|} P(w \mid c_i)^{2}. \tag{16}$$

However, in TC the experimental results indicate that the classification performance is best when δ = −1/2:

$$\mathrm{GiniTxt}(w) = \sum_{i=1}^{|C|}\frac{\sqrt{P(w \mid c_i)}}{P(w \mid c_i)}. \tag{17}$$

Therefore, the new feature weighting algorithm, known as TF-Gini, can be represented as

$$\mathrm{TF\text{-}Gini} = \mathrm{TF}\cdot\mathrm{GiniTxt}(w). \tag{18}$$
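The following is a minimal sketch of the TF-Gini weight in formulas (15)–(18). Skipping classes in which the word never occurs is our own choice to keep the negative exponent finite; it is not prescribed by the paper.

```python
def gini_txt(p_w_given_c, delta=-0.5):
    """Sketch of formula (15): GiniTxt(w) = sum_i P(w|c_i)^delta.

    With delta = -1/2 this is formula (17), the setting reported as the
    best-performing one; delta = 2 gives the purity Gini of formula (16).
    Classes where the word never occurs are skipped (our assumption).
    """
    return sum(p ** delta for p in p_w_given_c.values() if p > 0)

def tf_gini(tf, p_w_given_c, delta=-0.5):
    """Formula (18): TF-Gini weight of word w in a document, TF * GiniTxt(w)."""
    return tf * gini_txt(p_w_given_c, delta)
```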

3.3. Experiments and Analysis

3.3.1. Data Sets Preparation. In this paper we choose English and Chinese data sets to verify the performance of the FS and FW algorithms. The English data set we choose is Reuters-21578, from which we select ten large categories. The Chinese data set is the Fudan University Chinese text data set (19,637 texts); we use 3,522 pieces as the training texts and 2,148 pieces as the testing texts.

3.3.2. Classifiers Selection. This paper mainly focuses on the text preprocessing stage, and the IGI and TF-Gini algorithms are its main research area. To validate the two algorithms' performance, we introduce some commonly used text classifiers, namely, the kNN classifier, the fkNN classifier, and the SVM classifier. All the experiments are based on these classifiers.

(1) The kNN Classifier. The k nearest neighbors (kNN) algorithm is a very popular, simple, and nonparametric text classification algorithm. The idea of the kNN algorithm is to choose for a new input sample the category that appears most often amongst its k nearest neighbors. Its decision rule can be described as follows:

$$\mu_j(d) = \sum_{i=1}^{k}\mu_j(d_i)\,\mathrm{sim}(d, d_i), \tag{19}$$

where j (j = 1, 2, 3, ...) indexes the categories, d is a new input sample with unknown category, and d_i is a text whose category is already known; μ_j(d_i) is equal to 1 if sample d_i belongs to class j and 0 otherwise, and sim(d, d_i) indicates the similarity between the new input sample and the known text. The decision rule is

$$\text{if } \mu_j(d) = \max_i \mu_i(d), \text{ then } d \in c_j. \tag{20}$$

(2) The fkNN Classifier. The kNN classifier is a lazy classification algorithm: it builds no model from the training set, and its performance is not ideal, especially when the distribution of the training set is unbalanced. Therefore, we improve the algorithm by using a fuzzy logic inference system. Its decision rule can be described as follows:

$$\mu_j(d) = \frac{\sum_{i=1}^{k}\mu_j(d_i)\,\mathrm{sim}(d, d_i)\left(1 - \mathrm{sim}(d, d_i)\right)^{2/(1-b)}}{\sum_{i=1}^{k}\left(1 - \mathrm{sim}(d, d_i)\right)^{2/(1-b)}}, \tag{21}$$

where j (j = 1, 2, 3, ...) again indexes the categories and μ_j(d) is the membership value of sample d for class j; μ_j(d_i) is equal to 1 if sample d_i belongs to class j and 0 otherwise. From the formula it can be seen that fkNN is based on the kNN algorithm and endows the kNN formula with a distance weight. The parameter b adjusts the degree of the weight, b ∈ [0, 1]. The fkNN decision rule is as follows:

$$\text{if } \mu_j(d) = \max_i \mu_i(d), \text{ then } d \in c_j. \tag{22}$$
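A minimal sketch of the fuzzy decision in formulas (21) and (22), implemented directly from the reconstruction above (including the exponent 2/(1−b), which is our reading of the extracted formula); the function name, parameter defaults, and input layout are illustrative.

```python
def fknn_predict(similarities, labels, classes, k=15, b=0.5):
    """Sketch of formulas (21) and (22): fuzzy kNN decision for one sample.

    similarities -- sim(d, d_i) against every labelled text d_i (e.g. cosine)
    labels       -- class label of each labelled text d_i
    classes      -- all possible class labels
    k, b         -- number of neighbours and weighting degree, b in [0, 1)
    """
    # take the k most similar labelled texts
    neighbours = sorted(zip(similarities, labels), reverse=True)[:k]
    expo = 2.0 / (1.0 - b)
    denom = sum((1.0 - s) ** expo for s, _ in neighbours)
    membership = {}
    for c in classes:
        num = sum(s * (1.0 - s) ** expo for s, lab in neighbours if lab == c)
        membership[c] = num / denom if denom else 0.0
    # formula (22): assign the class with the largest membership
    return max(membership, key=membership.get), membership
```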

Figure 2: Optimal hyperplane for the SVM classifier, showing the hyperplanes ω·x + b = −1, 0, +1 (H, H1, and H2) and the margin 1/‖ω‖.

(3) The SVM Classifier. The support vector machine (SVM) is a powerful classification algorithm based on statistical learning theory. SVM is highly accurate owing to its ability to model complex nonlinear decision boundaries. It uses a nonlinear mapping to transform the original data into a higher dimension; within this new dimension it searches for the linear optimal separating hyperplane, that is, the decision boundary. With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. Originally, SVM mainly solves the two-class classification problem; the multiclass problem can be addressed using multiple binary classifications.

Suppose that, in the feature space, a set of training vectors x_i (i = 1, 2, 3, ...) belong to different classes y_i ∈ {−1, 1}. We wish to separate this training set with a hyperplane

$$\omega\cdot x + b = 0. \tag{23}$$

Obviously, there are infinitely many hyperplanes that are able to partition the data set. However, according to SVM theory, there is only one optimal hyperplane, which lies halfway within the maximal margin, defined by the distances of the hyperplane to the closest training sample of each class. As shown in Figure 2, the solid line represents the optimal separating hyperplane.

The problem of SVM is thus to find the maximum-margin separating hyperplane, that is, the one with maximum distance to the nearest training samples. In Figure 2, H is the optimal hyperplane and H1 and H2 are the hyperplanes parallel to H. The task of SVM is to make the margin distance between H1 and H2 maximal. The margin constraints can be written as follows:

$$\omega\cdot x_i + b \ge +1 \text{ for } y_i = +1, \qquad \omega\cdot x_i + b \le -1 \text{ for } y_i = -1 \;\Longrightarrow\; y_i\left(\omega\cdot x_i + b\right) - 1 \ge 0. \tag{24}$$


Therefore, from Figure 2, we can see that solving the SVM amounts to maximizing the margin between H1 and H2:

$$\min_{x_i:\,y_i=+1}\frac{\omega\cdot x_i + b}{\|\omega\|} - \max_{x_i:\,y_i=-1}\frac{\omega\cdot x_i + b}{\|\omega\|} = \frac{2}{\|\omega\|}. \tag{25}$$

Maximizing this margin is the same as minimizing ‖ω‖²/2. Therefore, it becomes the following optimization problem:

$$\min\ \frac{\|\omega\|^2}{2} \quad \text{s.t.}\ \ y_i\left(\omega\cdot x_i + b\right) \ge 1. \tag{26}$$

The points that satisfy y_i(ω·x_i + b) = 1 are called support vectors; namely, the points on H1 and H2 in Figure 2 are all support vectors. In practice, many samples cannot be classified correctly by any single hyperplane, so we introduce slack variables, and the above formula becomes

$$\min\ \frac{1}{2}\|\omega\|^2 + \eta\sum_{i=1}^{|C|}\xi_i \quad \text{s.t.}\ \ y_i\left(\omega\cdot x_i + b\right) \ge 1 - \xi_i, \tag{27}$$

where η is a positive constant that balances the empirical risk and the confidence, and |C| is the total number of classes in the training set.
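In practice the soft-margin problem in (27) is solved by an off-the-shelf optimizer rather than by hand. The following is a minimal, illustrative sketch using scikit-learn's linear SVM on a matrix of feature weights; the tiny arrays, parameter values, and variable names are placeholders chosen only to make the sketch runnable, not part of the paper's experimental setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

# X: document-by-feature matrix of (e.g. TF-Gini) weights, y: class labels.
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y_train = np.array([0, 0, 1, 1])

clf = LinearSVC(C=1.0)      # C plays the role of the balance constant in (27)
clf.fit(X_train, y_train)
print(clf.predict(np.array([[0.85, 0.15]])))   # label for a new weighted document
```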

3.3.3. Performance Evaluation. A confusion matrix [27], also known as a contingency table or an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning algorithm. It is a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives. Table 2 shows the confusion matrix.

True positive (TP) is the number of correct predictions that an instance is positive. False positive (FP) is the number of incorrect predictions that an instance is positive. False negative (FN) is the number of incorrect predictions that an instance is negative. True negative (TN) is the number of correct predictions that an instance is negative.

Precision is the proportion of the predicted positive cases that are correct, as calculated using the equation

$$\mathrm{Precision}\ (P) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}. \tag{28}$$

Recall is the proportion of the actual positive cases that are correctly identified, as calculated using the equation

$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}. \tag{29}$$

Table 2: Confusion matrix.

                    Prediction as positive    Prediction as negative
Actual positive     TP                        FN
Actual negative     FP                        TN

We would like both precision and recall to have high values; however, the two are in tension. For example, if only one prediction is made and it is accurate, the precision is 100% but the recall is very low, because TP equals 1, FP equals 0, and FN is huge. On the other hand, if everything is predicted as positive (i.e., FN equals 0), the recall is 100% but the precision is low, because FP is large.

F1 [28] is the harmonic mean of precision and recall; if F1 is high, the algorithm and experimental method are ideal. F1 measures the performance of a classifier by considering the two metrics above jointly and was proposed by Van Rijsbergen [29] in 1979. It can be described as follows:

$$F1 = \frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{30}$$

Precision and recall only evaluate the performance of classification algorithms in a single category, and it is often necessary to evaluate classification performance across all the categories. When dealing with multiple categories, there are two possible ways of averaging these measures, namely, macroaverage and microaverage. The macroaverage weights all classes equally, regardless of how many documents belong to each; the microaverage weights all documents equally, thus favoring the performance on common classes. The formulas of macroaverage and microaverage are as follows, where |C| is the number of classes in the training set:

$$\mathrm{Macro}_p = \frac{\sum_{i=1}^{|C|}\mathrm{Precision}_i}{|C|}, \qquad \mathrm{Macro}_r = \frac{\sum_{i=1}^{|C|}\mathrm{Recall}_i}{|C|}, \qquad \mathrm{Macro}_{F1} = \frac{\sum_{i=1}^{|C|} F1_i}{|C|},$$

$$\mathrm{Micro}_p = \frac{\sum_{i=1}^{|C|}\mathrm{TP}_i}{\sum_{i=1}^{|C|}\left(\mathrm{TP}_i + \mathrm{FP}_i\right)}, \qquad \mathrm{Micro}_r = \frac{\sum_{i=1}^{|C|}\mathrm{TP}_i}{\sum_{i=1}^{|C|}\left(\mathrm{TP}_i + \mathrm{FN}_i\right)}, \qquad \mathrm{Micro}_{F1} = \frac{2\times\mathrm{Micro}_p\times\mathrm{Micro}_r}{\mathrm{Micro}_p + \mathrm{Micro}_r}. \tag{31}$$
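A minimal sketch of formulas (28)–(31), assuming per-class TP/FP/FN counts have already been collected; names and the input layout are illustrative.

```python
def f1(precision, recall):
    """Formula (30): the F1 measure."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_micro_f1(per_class_counts):
    """Sketch of formula (31): macro- and micro-averaged F1.

    per_class_counts -- list of (TP_i, FP_i, FN_i) tuples, one per class
    """
    precisions, recalls = [], []
    tp_sum = fp_sum = fn_sum = 0
    for tp, fp, fn in per_class_counts:
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)   # formula (28)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)      # formula (29)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro_f1 = sum(f1(p, r) for p, r in zip(precisions, recalls)) / len(per_class_counts)
    micro_p = tp_sum / (tp_sum + fp_sum) if tp_sum + fp_sum else 0.0
    micro_r = tp_sum / (tp_sum + fn_sum) if tp_sum + fn_sum else 0.0
    return macro_f1, f1(micro_p, micro_r)
```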

3.3.4. Experimental Results and Analysis. Figures 3 and 4 show the results on the English data set. The performance of the classical GI is the lowest and that of the IGI is the highest; therefore, the IGI algorithm helps to improve the classifiers' performance.

From Figures 5 and 6 it can be seen that on the Chinese data sets the classical GI still performs poorly. The best performance belongs to ECE, and the second best is IG; the performance of IGI is not the best.

Figure 3: Macro F1 results on the Reuters-21578 data set (kNN, fkNN, and SVM classifiers; feature selection by GI, IGI, IG, ECE, CHI, OR, and MI).

Figure 4: Micro F1 results on the Reuters-21578 data set (kNN, fkNN, and SVM classifiers; feature selection by GI, IGI, IG, ECE, CHI, OR, and MI).

Since the Chinese language is different from English in the processing of Stop Words, the results of the Chinese word segmentation algorithm influence the classification algorithm's accuracy. Meanwhile, the current experiments do not consider the feature words' weights; the TF-Gini algorithm below provides a good solution to this problem.

Although the IGI algorithm does not have the best performance on the Chinese data set, its performance is very close to the highest. Hence, the IGI feature selection algorithm is suitable and acceptable for the classification task and promotes the classifiers' performance effectively.

Figure 5: Macro F1 results on the Chinese data set (kNN, fkNN, and SVM classifiers; feature selection by GI, IGI, IG, ECE, CHI, OR, and MI).

Figure 6: Micro F1 results on the Chinese data set (kNN, fkNN, and SVM classifiers; feature selection by GI, IGI, IG, ECE, CHI, OR, and MI).

Figures 7, 8, 9, and 10 show the results on the English and Chinese data sets when different TF-based weighting algorithms, such as TF-IDF and TF-Gini, are used, so that we can determine which algorithm performs best.

From Figures 7 and 8 we can see that the TF-Gini weighting algorithm's performance is good and acceptable on the English data set; it is very close to the highest. From Figures 9 and 10 we can see that, on the Chinese data set, the TF-Gini algorithm overcomes the deficiencies of the IGI algorithm and achieves the best performance. These results show that TF-Gini is a very promising feature weighting algorithm.


Figure 7: Feature weights Macro F1 results on the Reuters-21578 data set (kNN, fkNN, and SVM classifiers; TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR weights).

Figure 8: Feature weights Micro F1 results on the Reuters-21578 data set (kNN, fkNN, and SVM classifiers; TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR weights).

Figure 9: Feature weights Macro F1 results on the Chinese data set (kNN, fkNN, and SVM classifiers; TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR weights).

Figure 10: Feature weights Micro F1 results on the Chinese data set (kNN, fkNN, and SVM classifiers; TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR weights).

4. Sensitive Information Identification System Based on TF-Gini Algorithm

4.1. Introduction. With the rapid development of network technologies, the Internet has become a huge repository of information. At the same time, much garbage information is poured into the Internet, such as viruses, violence, and gambling. All of these are called sensitive information and will produce extremely adverse impacts and serious consequences.

The commonly used methods of Internet information filtering and monitoring are based on string matching. However, there are a large number of synonyms and ambiguous words in natural texts, and many new words and expressions appear on the Internet every day, so traditional sensitive information filtering and monitoring are not effective. In this paper, we propose a new sensitive information filtering system based on machine learning, which uses the TF-Gini text feature selection algorithm. This system can discover sensitive information in a fast, accurate, and intelligent way.


Figure 11: Sensitive system architecture (training branch: select a training set → Chinese word segmentation → VSM → feature selection and weighting with the TF-Gini algorithm → classification → classification model; identification branch: Chinese word segmentation → VSM → load model → sensitive/nonsensitive decision → results evaluation).

4.2. System Architecture. The sensitive system includes text preprocessing, feature selection, text representation, text classification, and result evaluation modules. Figure 11 shows the sensitive system architecture.

4.2.1. The Chinese Word Segmentation. The function of this module is the segmentation of Chinese words and the removal of Stop Words. In this module we use IKanalyzer to complete the Chinese word segmentation. IKanalyzer is a lightweight Chinese word segmentation tool based on an open source package:

(1) IKanalyzer uses a multiprocessor module and supports the processing of English letters, Chinese vocabulary, and individual words.

(2) It optimizes the dictionary storage and supports user-defined dictionaries.

4.2.2. The Feature Selection. At the feature selection stage, the TF-Gini algorithm has good performance on Chinese data sets. Feature word selection is a very important preparation for classification, so in this step we use the TF-Gini algorithm as the core algorithm of feature selection.

4.2.3. The Text Classification. We use the Naïve Bayes algorithm [30] as the main classifier. In order to verify the performance of the Naïve Bayes algorithm used in this paper, we use the kNN algorithm and the improved Naïve Bayes [31] algorithm for comparison.

The kNN algorithm is a typical text classification algorithm and requires no training stage. In order to calculate the similarity between the test sample q and the training samples d_j, we adopt the cosine similarity, which can be described as follows:

$$\cos(q, d_j) = \frac{\sum_{i=1}^{|V|} w_{ij}\times w_{iq}}{\sqrt{\sum_{i=1}^{|V|} w_{ij}^2}\times\sqrt{\sum_{i=1}^{|V|} w_{iq}^2}}, \tag{32}$$

where w_ij is the ith feature word's weight in d_j, w_iq is the ith feature word's weight in sample q, and |V| is the dimension of the feature space.
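A minimal sketch of formula (32), assuming each text is represented as a sparse mapping from feature words to weights (for example, TF-Gini weights); the function name and data layout are illustrative.

```python
import math

def cosine_similarity(weights_q, weights_d):
    """Sketch of formula (32): cosine similarity of two weighted texts.

    weights_q, weights_d -- dicts mapping feature word -> weight
    """
    shared = set(weights_q) & set(weights_d)
    dot = sum(weights_q[w] * weights_d[w] for w in shared)
    norm_q = math.sqrt(sum(v * v for v in weights_q.values()))
    norm_d = math.sqrt(sum(v * v for v in weights_d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```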

The Naïve Bayes classifier is based on Bayes' theorem. Assume m classes c_1, c_2, ..., c_m and a given unknown text q with no class label. The classification algorithm predicts that q belongs to the class with the highest posterior probability given q, computed as

$$P(c_i \mid q) = \frac{P(q \mid c_i)P(c_i)}{P(q)}. \tag{33}$$

Naïve Bayes classification is based on a simple assumption: given the sample's class label, the attributes are conditionally independent. The probability P(q) is constant, and the prior P(c_i) can also be treated as constant, so the calculation can be simplified as follows:

$$P(c_i \mid q) = \prod_{j=1}^{n} P(x_j \mid c_i), \tag{34}$$

where x_j is the jth feature word's weight of text q in class c_i and n is the number of features in q.
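A minimal sketch of the simplified decision in formula (34), working in log space to avoid numerical underflow; the 1e-9 floor stands in for whatever smoothing the actual system uses, and all names are illustrative.

```python
import math

def naive_bayes_class(query_words, word_probs, classes):
    """Sketch of formula (34): pick the class maximizing prod_j P(x_j | c_i).

    query_words -- feature words of the new text q (after segmentation and FS)
    word_probs  -- dict class -> dict word -> P(word | class), estimated from
                   the training set (smoothing assumed, not shown here)
    classes     -- all class labels
    """
    best_class, best_score = None, -math.inf
    for c in classes:
        # sum of logs instead of a product
        score = sum(math.log(word_probs[c].get(w, 1e-9)) for w in query_words)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```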

4.3. Experimental Results and Analysis. In order to verify the performance of the system, we run two experiments on different data sets, one with tenfold cross validation and one with threefold cross validation.

Experiment 1. The Chinese data set has 1,000 articles, including 500 sensitive texts and 500 nonsensitive texts. The 1,000 texts are randomly divided into 10 portions; one portion is selected as the test set and the other 9 are used as the training set. The process is repeated 10 times, and the final result is the average of the 10 classification results. Table 3 shows the result of Experiment 1.

Table 3: The result of Experiment 1.

Classification          Precision (%)   Recall (%)   F1 (%)
kNN                     69.2            72           70.6
Naïve Bayes             80.77           84           82.35
Improved Naïve Bayes    71.88           92           80.7

Experiment 2. The Chinese data set has 1,500 articles, including 750 sensitive texts and 750 nonsensitive texts. All the texts are randomly divided into 3 parts; one part is the test set and the others are training sets. The final result is the average of the 3 classification results. Table 4 shows the result of Experiment 2.

Table 4: The result of Experiment 2.

Classification          Precision (%)   Recall (%)   F1 (%)
kNN                     96.76           55.41        70.44
Naïve Bayes             96.6            96.6         96.6
Improved Naïve Bayes    83.61           100          91.08

From the results of Experiments 1 and 2 we can see that all three classifiers perform well, with the Naïve Bayes and Improved Naïve Bayes algorithms performing better. That is, the sensitive information system can satisfy the demands of Internet sensitive information recognition.

5. Conclusion

Feature selection is an important aspect of text preprocessing, and its performance directly affects the classifiers' accuracy. In this paper we employ a feature selection algorithm, the IGI algorithm, and compare its performance with the IG, CHI, MI, and OR algorithms. The experimental results show that the IGI algorithm has the best performance on the English data sets and is very close to the best performance on the Chinese data sets. Therefore, the IGI algorithm is a promising algorithm for text preprocessing.

Feature weighting is another important aspect of text preprocessing, since the importance of a feature differs across categories. We endow different feature words with weights: the more important the feature word is, the bigger its weight. Following the TF-IDF scheme, we construct a feature weighting algorithm, the TF-Gini algorithm, by replacing the IDF part of the TF-IDF algorithm with the IGI algorithm. To test the performance of TF-Gini, we also construct TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR algorithms in the same way. The experimental results show that TF-Gini performs best on the Chinese data sets and is very close to the best performance on the English data sets. Hence, the TF-Gini algorithm can improve the classifiers' performance, especially on Chinese data sets.

This paper also introduces a sensitive information identification system, which can monitor and filter sensitive information on the Internet. Considering the performance of TF-Gini on Chinese data sets, we choose it as the text preprocessing algorithm in the system, and we choose the Naïve Bayes classifier as the core classifier. The experimental results show that the system achieves good performance.

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This paper is partly supported by the engineering planning project "Research on Technology of Big Data-Based High Efficient Classification of News" funded by Communication University of China (3132015XNG1504), partly supported by "The Comprehensive Reform Project of Computer Science and Technology" funded by the Ministry of Education of China (ZL140103 and ZL1503), and partly supported by "The Excellent Young Teachers Training Project" funded by Communication University of China (YXJS201508 and XNG1526).

References

[1] S. Jinshu, Z. Bofeng, and X. Xin, "Advances in machine learning based text categorization," Journal of Software, vol. 17, no. 9, pp. 1848–1859, 2006.
[2] G. Salton and C. Yang, "On the specification of term values in automatic indexing," Journal of Documentation, vol. 29, no. 4, pp. 351–372, 1973.
[3] Z. Cheng, L. Qing, and L. Fujun, "Improved VSM algorithm and its application in FAQ," Computer Engineering, vol. 38, no. 17, pp. 201–204, 2012.
[4] X. Junling, Z. Yuming, C. Lin, and X. Baowen, "An unsupervised feature selection approach based on mutual information," Journal of Computer Research and Development, vol. 49, no. 2, pp. 372–382, 2012.
[5] Z. Zhenhai, L. Shining, and L. Zhigang, "Multi-label feature selection algorithm based on information entropy," Journal of Computer Research and Development, vol. 50, no. 6, pp. 1177–1184, 2013.
[6] L. Kousu and S. Caiqing, "Research on feature-selection in Chinese text classification," Computer Simulation, vol. 24, no. 3, pp. 289–291, 2007.
[7] Y. Yang and J. Q. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning, pp. 412–420, Nashville, Tenn, USA, July 1997.
[8] Q. Liqing, Z. Ruyi, Z. Gang, et al., "An extensive empirical study of feature selection for text categorization," in Proceedings of the 7th IEEE/ACIS International Conference on Computer and Information Science, pp. 312–315, IEEE, Washington, DC, USA, May 2008.
[9] S. Wenqian, D. Hongbin, Z. Haibin, and W. Yongbin, "A novel feature weight algorithm for text categorization," in Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE '08), pp. 1–7, Beijing, China, October 2008.
[10] L. Yongmin and Z. Weidong, "Using Gini-index for feature selection in text categorization," Computer Applications, vol. 27, no. 10, pp. 66–69, 2007.
[11] L. Xuejun, "Design and research of decision tree classification algorithm based on minimum Gini index," Software Guide, vol. 8, no. 5, pp. 56–57, 2009.
[12] R. Guofeng, L. Dehua, and P. Ying, "An improved algorithm for feature weight based on Gini index," Computer & Digital Engineering, vol. 38, no. 12, pp. 8–13, 2010.
[13] L. Yongmin, L. Zhengyu, and Z. Shuang, "Analysis and improvement of feature weighting method TF-IDF in text categorization," Computer Engineering and Design, vol. 29, no. 11, pp. 2923–2925, 2929, 2008.
[14] Q. Shian and L. Fayun, "Improved TF-IDF method in text classification," New Technology of Library and Information Service, vol. 238, no. 10, pp. 27–30, 2013.
[15] L. Yonghe and L. Yanfeng, "Improvement of text feature weighting method based on TF-IDF algorithm," Library and Information Service, vol. 57, no. 3, pp. 90–95, 2013.
[16] Y. Xu, Z. Li, and J. Chen, "Parallel recognition of illegal Web pages based on improved KNN classification algorithm," Journal of Computer Applications, vol. 33, no. 12, pp. 3368–3371, 2013.
[17] Y. Liu, Y. Jian, and J. Liping, "An adaptive large margin nearest neighbor classification algorithm," Journal of Computer Research and Development, vol. 50, no. 11, pp. 2269–2277, 2013.
[18] M. Wan, G. Yang, Z. Lai, and Z. Jin, "Local discriminant embedding with applications to face recognition," IET Computer Vision, vol. 5, no. 5, pp. 301–308, 2011.
[19] U. Maulik and D. Chakraborty, "Fuzzy preference based feature selection and semi-supervised SVM for cancer classification," IEEE Transactions on Nanobioscience, vol. 13, no. 2, pp. 152–160, 2014.
[20] X. Liang, "An effective method of pruning support vector machine classifiers," IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 26–38, 2010.
[21] P. Baomao and S. Haoshan, "Research on improved algorithm for Chinese word segmentation based on Markov chain," in Proceedings of the 5th International Conference on Information Assurance and Security (IAS '09), pp. 236–238, Xi'an, China, September 2009.
[22] S. Yan, C. Dongfeng, Z. Guiping, and Z. Hai, "Approach to Chinese word segmentation based on character-word joint decoding," Journal of Software, vol. 20, no. 9, pp. 2366–2375, 2009.
[23] J. Savoy, "A stemming procedure and stopword list for general French corpora," Journal of the American Society for Information Science, vol. 50, no. 10, pp. 944–952, 1999.
[24] L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski, "Decision trees for mining data streams based on the McDiarmid's bound," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1272–1279, 2013.
[25] N. Prasad, P. Kumar, and M. M. Naidu, "An approach to prediction of precipitation using Gini index in SLIQ decision tree," in Proceedings of the 4th International Conference on Intelligent Systems, Modelling and Simulation (ISMS '13), pp. 56–60, Bangkok, Thailand, January 2013.
[26] S. S. Sundhari, "A knowledge discovery using decision tree by Gini coefficient," in Proceedings of the International Conference on Business, Engineering and Industrial Applications (ICBEIA '11), pp. 232–235, IEEE, Kuala Lumpur, Malaysia, June 2011.
[27] S. V. Stehman, "Selecting and interpreting measures of thematic classification accuracy," Remote Sensing of Environment, vol. 62, no. 1, pp. 77–89, 1997.
[28] F. Guohe, "Review of performance of text classification," Journal of Intelligence, vol. 30, no. 8, pp. 66–69, 2011.
[29] C. J. Van Rijsbergen, Information Retrieval, Butterworths, London, UK, 1979.
[30] L. Jianghao, Y. Aimin, Y. Yongmei, et al., "Classification of micro-blog sentiment based on Naïve Bayesian," Computer Engineering & Science, vol. 34, no. 9, pp. 160–165, 2012.
[31] Z. Kenan, Y. Baolin, M. Yaming, and H. Yingnan, "Malware classification approach based on valid window and Naïve Bayes," Journal of Computer Research and Development, vol. 51, no. 2, pp. 373–381, 2014.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 4: Research Article Improved Feature Weight Algorithm and Its ...

4 Mathematical Problems in Engineering

(2) If 119875(119908 | 119888119894) lt 119875(119888

119894) then log(119875(119908 | 119888

119894)119875(119888119894)) lt 0

When 119875(119908 | 119888119894) is smaller and 119875(119888

119894) is larger the

logarithmic value is smaller which indicates that fea-ture 119908 has weaker association with class 119888

119894

(3) When 119875(119908 | 119888119894) = 0 there is no association between

feature 119908 and the class 119888119894 In this case the logarithmic

expression has no significance Introducing a smallparameter for example 120576 = 00001 that is log((119875(119908 |

119888119894) + 120576)119875(119888

119894)) the logarithmic will have significance

(4) If 119875(119908 | 119888119894) and 119875(119888

119894) are close to each other

then log(119875(119908 | 119888119894)119875(119888119894)) is close to zero When

119875(119908 | 119888119894) is large this indicates that feature 119908 has a

strong associationwith class 119888119894and should be retained

However when log(119875(119908 | 119888119894)119875(119888119894)) is close to zero

feature 119908 will be removed In this case we employinformation entropy to add to the ECE algorithmTheformula of information entropy is as follows

IE (119908) = minus

|119862|

sum

119894=0

119875 (119908 | 119888119894) log (119875 (119908 | 119888

119894)) (3)

In a summary we combine (2) and (3) together the ECEformula is as follows

ECE (119908)

=119875 (119908)

IE (119908) + 120576

|119862|

sum

119894=1

(119875 (119908 | 119888119894) + 120576) log

119875 (119908 | 119888119894) + 120576

119875 (119888119894)

(4)

If feature 119908 exists only in one class the value of infor-mation entropy is zero that is IE(119908) = 0 Hence weshould introduce a small parameter in the denominator as aregulator

223 Mutual Information (MI) In TC MI expresses theassociation between feature words and classes The MIbetween 119908 and 119888

119894is defined as

MutualInfo (119908) = log119875 (119908 | 119888

119894)

119875 (119908)= log

119875 (119908 119888119894)

119875 (119908) 119875 (119888119894)

(5)

where 119908 is the feature word 119875(119908 119888119894) is the probability of text

that contains 119908 in class 119888119894 119875(119908) is the probability of text that

contains 119908 in the training set and 119875(119888119894) is the probability of

class 119888119894in the training set

It is necessary to calculate the MI between features andeach category in the training set The following two formulascan be used to calculate the MI

MImax (119908) =|119862|

max119894=1

MutualInfo (119908)

MIavg (119908) =

|119862|

sum

119894=1

119875 (119888119894)MutualInfo (119908)

(6)

where 119875(119888119894) is the probability of class 119888

119894in the training set and

|119862| is the total amount of classes in the training set GenerallyMImax(119908) is used commonly

224 Odds Ratio (OR) The computing formula can bedescribed as follows

OddsRatio (119908 119888119894) = log

119875 (119908 | 119888119894) (1 minus 119875 (119908 | 119888

119894))

(1 minus 119875 (119908 | 119888119894)) 119875 (119908 | 119888

119894)

(7)

where 119908 is the feature word 119888119894represents the positive class

and 119888119894represents the negative class119875(119908 | 119888

119894) is the probability

of text that contains 119908 in class 119888119894 and 119875(119908 | 119888

119894) is the

conditional probability of text that contains119908 in all the classesexcept 119888

119894

OR metric measures the membership and nonmember-ship to a specific class with its numerator and denominatorrespectively Therefore the numerator must be maximizedand the denominator must be minimized to get the highestscore according to the formula The formula is a one-sidedmetric because the logarithm function produces negativescores while the value of the fraction is between 0 and 1 Inthis case the features have negative values pointed to negativefeatures Thus if we only want to identify the positive classand do not care about the negative class Odds Ratio will havea great advantage Odds Ratio is suitable for binary classifiers

225 1205942 (CHI) 120594

2 statistical algorithm measures the rele-vance between the feature 119908 and the class 119888

119894 The higher the

score the stronger the relevance between the feature 119908 andthe class 119888

119894 It means that the feature has a greater contribution

to the class The computing formula can be described asfollows

1205942

(119908 119888119894)

=|119862| (119860

1times 1198604

minus 1198602

times 1198603)2

(1198601

+ 1198603) (1198602

+ 1198604) (1198601

+ 1198602) (1198603

+ 1198604)

(8)

where 119908 is the feature word 1198601is the frequency containing

feature 119908 and class 119888119894 1198602is the frequency containing feature

119908 but not belonging to class 119888119894 1198603is the frequency belonging

to class 119888119894but not containing the feature119908119860

4is the frequency

not containing the feature119908 and not belonging to class 119888119894 and

|119862| is the total number of text documents in class 119888119894

When the feature 119908 and class 119888119894are independent that

is 1205942(119908 119888119894) = 0 this means that the feature 119908 is not

containing identification informationWe calculate values foreach category and then use formula (9) to compute the valueof the entire training set

1205942

max (119908) = max1le119894le|119862|

1205942

(119908 119888119894) (9)

3 The Gini Feature Selection Algorithm

31 The Improved Gini Index Algorithm

311 Gini Index (GI) The GI is a kind of impurity attributesplitting method It is widely used in the CART algorithm[24] the SLIQ algorithm [25] and the SPRINT algorithm[26] It chooses the splitting attribute and obtains very goodclassification accuracy The GI algorithm can be described asfollows

Mathematical Problems in Engineering 5

Suppose that 119876 is a set of 119904 samples and these sampleshave 119896 different classes (119888

119894 119894 = 1 2 3 119896) According to the

differences of classes we can divide 119876 into 119896 subsets (119876119894 119894 =

1 2 3 119896) Suppose that 119876119894is a sample set that belongs to

class 119888119894and that 119904

119894is the total number of 119876

119894 then the GI of set

119876 is

Gini (119876) = 1 minus

|119862|

sum

119894=1

1198752

119894 (10)

where 119875119894is the probability of 119876

119894in 119888119894and calculated by 119904

119894119904

When Gini(119876) is 0 that is the minimum all the membersin the set belong to the same class that is the maximumuseful information can be obtained When all of the samplesin the set distribute equally for each class Gini(119876) is themaximum that is the minimum useful information can beobtained

For an attribute 119860 with 119897 distinct values 119876 is partitionedinto 119897 subsets 119876

1 1198762 119876

119897 The GI of 119876 with respect to the

attribute 119860 is defined as

Gini119860

(119876) = 1 minus

119897

sum

119894=1

119904119894

119904Gini (119876) (11)

The main idea of GI algorithm is that the attribute withtheminimum value of GI is the best attribute which is chosento split

312 The Improved Gini Index (IGI) Algorithm The originalform of the GI algorithm was used to measure the impurityof attributes towards classification The smaller the impurityis the better the attribute is However many studies on GItheory have employed the measure of purity the larger thevalue of the purity is the better the attribute is Purity is moresuitable for text classification The formula is as follows

Gini (119876) =

|119862|

sum

119894=1

1198752

119894 (12)

This formula measures the purity of attributes towardscategorization The larger the value of purity is the betterthe attribute is However we always emphasize the high-frequency words in TC because the high-frequency wordshave more contributions to judge the class of texts But whenthe distribution of the training set is highly unbalanced thelower-frequency feature words still have some contributionto judge the class of texts although the contribution is far lesssignificant than that the high-frequency feature words haveTherefore we define the IGI algorithm as

IGI (119908) =

|119862|

sum

119894=1

119875 (119908 | 119888119894)2

119875 (119888119894

| 119908)2

(13)

where 119908 is the feature word |119862| is the total number of classesin the training set 119875(119908 | 119888

119894) is the probability of text that

contains119908 in class 119888119894 and119875(119888

119894| 119908) is the posterior probability

of text that contains 119908 in class 119888119894

This formula overcomes the shortcoming of the originalGI which considers the featuresrsquo conditional probability

combining the conditional probability and posterior prob-ability Hence the IGI algorithm can depress the affectionwhen the training set is unbalanced

32 TF-Gini Algorithm The purpose of FS is to choose themost suitable representative features to represent the originaltext In fact the featuresrsquo distinguishability is different Somefeatures can effectively distinguish a text from others butother features cannot Therefore we use weights to identifydifferent featuresrsquo distinguishabilityThe higher distinguisha-bility of features will get higher weights

The TF-IDF algorithm is a classical algorithm to calculatethe featuresrsquo weights For a word 119908 in text 119889 the weights of 119908

in 119889 are as follows

TF-IDF (119908 119889) = TF (119908 119889) sdot IDF (119908 119889)

= TF (119908 119889) sdot log(|119863|

DF (119908))

(14)

where TF(119908 119889) is the frequency of 119908 appearance in text 119889|119863| is the number of text documents in the training setand DF(119908) is the total number of texts containing 119908 in thetraining set

The TF-IDF algorithm is not suitable for TC mainlybecause of the shortcoming of TF-IDF in the section IDFThe TF-IDF considers that a word with a higher frequencyin different texts in the training set is not important Thisconsideration is not appropriate for TC Therefore we usepurity Gini to replace the IDF part in the TF-IDF formulaThe purity Gini is as follows

GiniTxt (119908) =

|119862|

sum

119894=1

119875119894(119908 | 119888119894)120575

(15)

where119908 is the featureword and 120575 is a nonzero value119875119894(119908 | 119888119894)

is the probability of text that contains 119908 in class 119888119894 and |119862| is

the total number of classes in the training set When 120575 = 2the following formula is the original Gini formula

GiniTxt (119908) =

|119862|

sum

119894=1

119875 (119908 | 119888119894)2

(16)

However in TC the experimental results indicate thatwhen 120575 = minus12 the classification performance is the best

GiniTxt (119908) =

|119862|

sum

119894=1

radic119875 (119908 | 119888119894)

119875 (119908 | 119888119894)

(17)

Therefore the new feature weighting algorithm known asTF-Gini can be represented as

TF-Gini = TF sdot GiniTxt (119908) (18)

33 Experimental and Analysis

331 Data Sets Preparation In this paper we choose Englishand Chinese data sets to verify the performance of the FSand FW algorithms The English data set which we choose

6 Mathematical Problems in Engineering

is Reuters-21578 We choose ten big categories among themThe Chinese data set is Fudan University Chinese text dataset (19637 texts)We use 3522 pieces as the training texts and2148 pieces as the testing texts

332 Classifiers Selection This paper is mainly focused ontext preprocessing stage The IGI and TF-Gini algorithms arethe main research area in this paper To validate the abovetwo algorithmsrsquo performance we introduce some commonlyused text classifiers such as 119896NN classifier f119896NN classifierand SVM classifier All the experiments are based on theseclassifiers

(1) The kNN Classifier The 119896 nearest neighbors (119896NN)algorithm is a very popular simple and nonparameter textclassification algorithm The idea of 119896NN algorithm is tochoose the category for new input sample that appears mostoften amongst its 119896 nearest neighbors Its decision rule can bedescribed as follows

120583119895

(119889) =

119896

sum

119894=1

120583119895

(119889119894) sim (119889 119889

119894) (19)

where 119895 (119895 = 1 2 3 ) is the 119895th text in the training set 119889 is anew input sample with unknown category and 119889

119894is a testing

text whose category is clear 120583119895(119889119894) is equal to 1 if sample 119889

119894

belongs to class 119895 otherwise it is 0 sim(119889 119889119894) indicates the

similarity between the new input sample and the testing textThe decision rule is

if 120583119895

(119889) = max119894

120583119894(119889)

then 119889 isin 119888119895

(20)

(2) The fkNN Classifier The 119896NN classifier is a lazy clas-sification algorithm It does not need the training set Theclassifierrsquos performance is not ideal especially when thedistribution of the training set is unbalanced Therefore weimprove the algorithm by using the fuzzy logic inferencesystem Its decision rule can be described as follows

120583119895

(119889)

=

sum119896

119894=1120583119895

(119889119894) sim (119889 119889

119894) (1 minus sim (119889 119889

119894))2(1minus119887)

sum119896

119894=1(1 minus sim (119889 119889

119894))2(1minus119887)

(21)

where 119895 (119895 = 1 2 3 ) is the 119895th text in the training setand 120583119895(119889119894)sim(119889 119889

119894) is the value ofmemberships for sample 119889

belonging to class 119895 120583119895(119889119894) is equal to 1 if sample 119889

119894belongs to

class 119895 otherwise it is 0 From the formula it can be seen thatf119896NN is based on 119896NN algorithm which endows a distanceweight on 119896NN formula The parameter 119887 adjusts the degreeof the weight 119887 isin [0 1] The f119896NN decision rule is as follows

if 120583119895

(119889) = max119894

120583119894(119889)

then 119889 isin 119888119895

(22)

H2

H1

H

b

+

++

+

+

minusminus

minus

minus

minus

1120596

120596 middot x + b = minus1 120596 middot x + b = 0

120596 middot x + b = +1

Figure 2 Optimal hyperplane for SVM classifier

(3) The SVM Classifier Support vector machine (SVM) is apotential classification algorithmwhich is based on statisticallearning theory SVM is highly accurate owing to its abilityto model complex nonlinear decision boundary It uses anonlinear mapping to transform the original set into a higherdimensionWithin this newdimension it searches for the lin-ear optimal separating hyperplane that is decision boundarywith an appropriate nonlinear mapping to a sufficiently highdimension data from two classes can always be separatedby a hyperplane Originally SVM mainly solves the problemof two-class classification The multiclass problem can beaddressed using multiple binary classifications

Suppose that, in the feature space, a set of training vectors x_i (i = 1, 2, 3, ...) belongs to two different classes, y_i ∈ {−1, +1}. We wish to separate this training set with a hyperplane:

\omega \cdot x + b = 0. \quad (23)

Obviously, there are an infinite number of hyperplanes that are able to partition the data set into two sets. However, according to SVM theory, there is only one optimal hyperplane: the one lying half-way within the maximal margin, defined by the sum of the distances from the hyperplane to the closest training sample of each class. As shown in Figure 2, the solid line represents the optimal separating hyperplane.

One of the problems of SVM is to find the maximum separating hyperplane, that is, the one with the maximum distance to the nearest training samples. In Figure 2, H is the optimal hyperplane, and H1 and H2 are the hyperplanes parallel to H. The task of SVM is to make the margin between H1 and H2 maximal. The margin constraints can be written as follows:

\omega \cdot x_i + b \geq +1 \text{ for } y_i = +1, \qquad \omega \cdot x_i + b \leq -1 \text{ for } y_i = -1 \;\Longrightarrow\; y_i(\omega \cdot x_i + b) - 1 \geq 0. \quad (24)


Therefore, from Figure 2, we can see that solving the SVM amounts to maximizing the margin between H1 and H2:

\min_{x_i \mid y_i = +1} \frac{\omega \cdot x_i + b}{\|\omega\|} - \max_{x_i \mid y_i = -1} \frac{\omega \cdot x_i + b}{\|\omega\|} = \frac{2}{\|\omega\|}. \quad (25)

Maximizing 2/‖ω‖ is equivalent to minimizing ‖ω‖²/2. Therefore, the problem becomes the following optimization question:

\min \frac{\|\omega\|^2}{2} \quad \text{s.t.} \quad y_i(\omega \cdot x_i + b) \geq 1. \quad (26)

The points which satisfy y_i(ω · x_i + b) = 1 are called support vectors; namely, the points on H1 and H2 in Figure 2 are all support vectors. In practice, many samples cannot be classified correctly by such a hard-margin hyperplane. Hence, we introduce slack variables ξ_i, and the above formula becomes

\min \frac{1}{2}\|\omega\|^2 + \eta \sum_{i=1}^{|C|} \xi_i \quad \text{s.t.} \quad y_i(\omega \cdot x_i + b) \geq 1 - \xi_i, \quad (27)

where η is a positive constant that balances the empirical error and the confidence, and |C| is the total number of classes in the training set.
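For illustration, the following sketch trains a linear classifier by subgradient descent on a soft-margin objective of the form in (27), with the slack penalty added and summed over the training samples (the standard formulation); it is not the solver used in the paper's experiments, and the learning rate, epoch count, and function names are our own choices.

```python
import numpy as np

def train_linear_svm(X, y, eta=1.0, lr=0.01, epochs=200):
    """Subgradient descent on a soft-margin objective:
    (1/2)||w||^2 + eta * sum_i max(0, 1 - y_i (w . x_i + b)).
    X: (n_samples, n_features) array; y: array of labels in {-1, +1}."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            # The regularizer always contributes w; the hinge term only
            # contributes when the sample violates the margin (slack > 0).
            grad_w = w - (eta * y[i] * X[i] if margin < 1 else 0)
            grad_b = -eta * y[i] if margin < 1 else 0.0
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

def svm_predict(w, b, X):
    # Sign of the decision function w . x + b gives the predicted class.
    return np.sign(X @ w + b)
```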

3.3.3. Performance Evaluation. A confusion matrix [27], also known as a contingency table or an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning algorithm. It is a table with two rows and two columns that reports the number of false positives, false negatives, true positives, and true negatives. Table 2 shows the confusion matrix.

True positive (TP) is the number of correct predictions that an instance is positive. False positive (FP) is the number of incorrect predictions that an instance is positive. False negative (FN) is the number of incorrect predictions that an instance is negative. True negative (TN) is the number of correct predictions that an instance is negative.

Precision is the proportion of the predicted positive cases that were correct, as calculated using the equation

\mathrm{Precision}\ (P) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}. \quad (28)

Recall is the proportion of the actual positive cases that were correctly identified, as calculated using the equation

\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}. \quad (29)

We hope that both precision and recall take high values; however, the two are often in tension. For example, if only one prediction is made and it is correct, the precision is 100% but the recall is very low, because TP equals 1, FP equals 0, and FN is huge. On the other hand, if everything is predicted as positive (i.e., FN equals 0), the recall is 100% but the precision is low because FP is large.

Table 2: Confusion matrix.

                     Prediction as positive    Prediction as negative
Actual positive      TP                        FN
Actual negative      FP                        TN

F1 [28] is the harmonic mean of precision and recall; if F1 is high, the algorithm and experimental method are performing well. F1 measures the performance of a classifier by combining the above two measures and was proposed by Van Rijsbergen [29] in 1979. It can be described as follows:

F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \quad (30)
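A direct transcription of (28)–(30) into Python for a single category might look as follows; the zero-denominator guards and the example counts are our additions.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 as in (28)-(30) for a single category."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 80 true positives, 20 false positives, 10 false negatives.
print(precision_recall_f1(80, 20, 10))  # (0.8, 0.888..., 0.842...)
```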

Precision and recall only evaluate the performance of classification algorithms on a single category. It is often necessary to evaluate classification performance over all the categories. When dealing with multiple categories, there are two possible ways of averaging these measures, namely, macroaverage and microaverage. The macroaverage weights all classes equally, regardless of how many documents belong to each; the microaverage weights all documents equally, thus favoring the performance on common classes. The formulas of macroaverage and microaverage are as follows, where |C| is the number of classes in the training set:

\mathrm{Macro}_p = \frac{\sum_{i=1}^{|C|} \mathrm{Precision}_i}{|C|}, \qquad \mathrm{Macro}_r = \frac{\sum_{i=1}^{|C|} \mathrm{Recall}_i}{|C|}, \qquad \mathrm{Macro}_{F1} = \frac{\sum_{i=1}^{|C|} F1_i}{|C|},

\mathrm{Micro}_p = \frac{\sum_{i=1}^{|C|} \mathrm{TP}_i}{\sum_{i=1}^{|C|} (\mathrm{TP}_i + \mathrm{FP}_i)}, \qquad \mathrm{Micro}_r = \frac{\sum_{i=1}^{|C|} \mathrm{TP}_i}{\sum_{i=1}^{|C|} (\mathrm{TP}_i + \mathrm{FN}_i)}, \qquad \mathrm{Micro}_{F1} = \frac{2 \times \mathrm{Micro}_p \times \mathrm{Micro}_r}{\mathrm{Micro}_p + \mathrm{Micro}_r}. \quad (31)
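Reusing the precision_recall_f1 helper sketched above, a minimal implementation of the macro- and micro-averaged F1 of (31) could look as follows; the per-class count layout is assumed.

```python
def macro_micro_f1(per_class_counts):
    """Macro- and micro-averaged F1 as in (31).
    per_class_counts: list of (tp, fp, fn) tuples, one per class.
    Assumes the summed counts are nonzero."""
    f1s = [precision_recall_f1(tp, fp, fn)[2] for tp, fp, fn in per_class_counts]
    macro_f1 = sum(f1s) / len(per_class_counts)

    tp_sum = sum(tp for tp, _, _ in per_class_counts)
    fp_sum = sum(fp for _, fp, _ in per_class_counts)
    fn_sum = sum(fn for _, _, fn in per_class_counts)
    micro_p = tp_sum / (tp_sum + fp_sum)
    micro_r = tp_sum / (tp_sum + fn_sum)
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    return macro_f1, micro_f1
```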

3.3.4. Experimental Results and Analysis. Figures 3 and 4 show the results on the English data set. The performance of the classical GI is the lowest, and that of the IGI is the highest. Therefore, the IGI algorithm helps to improve the classifiers' performance.

From Figures 5 and 6, it can be seen that on the Chinese data sets the classical GI still performs poorly. The best performance belongs to ECE, and the second best is IG; the performance of IGI is not the best.


Figure 3: Macro F1 results on the Reuters-21578 data set (bars compare GI, IGI, IG, ECE, CHI, OR, and MI for the kNN, fkNN, and SVM classifiers; vertical axis in %).

Figure 4: Micro F1 results on the Reuters-21578 data set (bars compare GI, IGI, IG, ECE, CHI, OR, and MI for the kNN, fkNN, and SVM classifiers; vertical axis in %).

Since the Chinese language differs from the English language in the processing of Stop Words, the results of the Chinese word segmentation algorithm influence the classification algorithm's accuracy. Meanwhile, the current experiments do not consider the feature words' weights; the TF-Gini algorithm introduced below provides a good solution to this problem.

Although the IGI algorithm does not achieve the best performance on the Chinese data set, its performance is very close to the highest. Hence, the IGI feature selection algorithm is suitable and acceptable for the classification task and promotes the classifiers' performance effectively.

Figure 5: Macro F1 results on the Chinese data set (bars compare GI, IGI, IG, ECE, CHI, OR, and MI for the kNN, fkNN, and SVM classifiers; vertical axis in %).

Figure 6: Micro F1 results on the Chinese data set (bars compare GI, IGI, IG, ECE, CHI, OR, and MI for the kNN, fkNN, and SVM classifiers; vertical axis in %).

Figures 7, 8, 9, and 10 show the feature weighting results on the English and Chinese data sets. To determine which algorithm performs best, we compare different TF-based weighting algorithms, such as TF-IDF and TF-Gini.

From Figures 7 and 8, we can see that the TF-Gini weighting algorithm's performance on the English data set is good and acceptable; it is very close to the highest. From Figures 9 and 10, we can see that on the Chinese data set the TF-Gini algorithm overcomes the deficiencies of the IGI algorithm and achieves the best performance. These results show that the TF-Gini algorithm is a very promising feature weighting algorithm.


Figure 7: Feature weights Macro F1 results on the Reuters-21578 data set (bars compare TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR for the kNN, fkNN, and SVM classifiers; vertical axis in %).

Figure 8: Feature weights Micro F1 results on the Reuters-21578 data set (bars compare TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR for the kNN, fkNN, and SVM classifiers; vertical axis in %).

4. Sensitive Information Identification System Based on TF-Gini Algorithm

4.1. Introduction. With the rapid development of network technologies, the Internet has become a huge repository of information. At the same time, much garbage information is poured onto the Internet, such as viruses, violence, and gambling. All of this is called sensitive information, which will produce extremely adverse impacts and serious consequences.

Figure 9: Feature weights Macro F1 results on the Chinese data set (bars compare TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR for the kNN, fkNN, and SVM classifiers; vertical axis in %).

Figure 10: Feature weights Micro F1 results on the Chinese data set (bars compare TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR for the kNN, fkNN, and SVM classifiers; vertical axis in %).

The commonly used methods of Internet information filtering and monitoring are based on string matching. However, there are a large number of synonyms and ambiguous words in natural texts, and many new words and expressions appear on the Internet every day. Therefore, traditional sensitive information filtering and monitoring are not effective. In this paper, we propose a new sensitive information filtering system based on machine learning, in which the TF-Gini text feature selection algorithm is used. This system can discover sensitive information in a fast, accurate, and intelligent way.


Figure 11: Sensitive system architecture (training branch: select a training set → Chinese word segmentation → VSM → feature selection and weighting with the TF-Gini algorithm → classification → classification model; recognition branch: Chinese word segmentation → VSM → load model → sensitive/nonsensitive decision → results evaluation).

4.2. System Architecture. The sensitive information system includes the text preprocessing, feature selection, text representation, text classification, and result evaluation modules. Figure 11 shows the system architecture.

4.2.1. The Chinese Word Segmentation. The function of this module is to segment Chinese words and to remove Stop Words. In this module we use IKanalyzer to perform Chinese word segmentation; IKanalyzer is a lightweight, open source Chinese word segmentation tool.

(1) IKanalyzer uses a multiprocessor module and supports the processing of English letters, Chinese vocabulary, and individual words.

(2) It optimizes dictionary storage and supports user-defined dictionaries.

4.2.2. The Feature Selection. At the feature selection stage, the TF-Gini algorithm performs well on Chinese data sets. Selecting feature words is a very important preparation for classification, so in this step we use the TF-Gini algorithm as the core feature selection algorithm, as sketched below.
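For illustration, the sketch below computes TF-Gini style weights of the form TF(w, d) · GiniTxt(w), with GiniTxt(w) = Σ_i P(w | c_i)^δ as described in Section 3.2; the document-frequency estimate of P(w | c_i), the smoothing constant, and the data layout are our own assumptions, not part of the described system.

```python
from collections import Counter

def tf_gini_weights(doc_tokens, class_docs, delta=-0.5, smooth=1e-6):
    """Weight each term of a document by TF(w, d) * GiniTxt(w), where
    GiniTxt(w) = sum over classes of P(w | c_i)**delta (cf. Section 3.2).

    doc_tokens: list of tokens of the document to be weighted.
    class_docs: dict mapping class label -> list of tokenized training documents.
    delta = -1/2 is the exponent reported as best in Section 3.2.
    """
    # P(w | c_i): fraction of class-c_i documents that contain w.
    doc_freq = {c: Counter() for c in class_docs}
    for c, docs in class_docs.items():
        for tokens in docs:
            doc_freq[c].update(set(tokens))

    tf = Counter(doc_tokens)
    weights = {}
    for w, freq in tf.items():
        gini_txt = 0.0
        for c, docs in class_docs.items():
            # Smoothing keeps the probability strictly positive.
            p = (doc_freq[c][w] + smooth) / (len(docs) + smooth)
            gini_txt += p ** delta
        weights[w] = freq * gini_txt
    return weights
```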

4.2.3. The Text Classification. We use the Naïve Bayes algorithm [30] as the main classifier. To verify the performance of the Naïve Bayes classifier used in this paper, we compare it with the kNN algorithm and the improved Naïve Bayes algorithm [31].

The kNN algorithm is a typical text classification algorithm and does not require a separate training phase. To calculate the similarity between the test sample q and a training sample d_j, we adopt the cosine similarity, which can be described as follows:

\cos(q, d_j) = \frac{\sum_{i=1}^{|V|} w_{ij} \times w_{iq}}{\sqrt{\sum_{i=1}^{|V|} w_{ij}^2} \times \sqrt{\sum_{i=1}^{|V|} w_{iq}^2}}, \quad (32)

where w_ij is the ith feature word's weight in d_j, w_iq is the ith feature word's weight in sample q, and |V| is the dimension of the feature space.
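A minimal sketch of the cosine similarity in (32) over sparse weight vectors might look as follows; the dictionary representation of the weight vectors is our assumption.

```python
import math

def cosine_similarity(q_weights, d_weights):
    """Cosine similarity of (32) between two sparse weight vectors
    represented as {feature: weight} dictionaries."""
    shared = set(q_weights) & set(d_weights)
    dot = sum(q_weights[t] * d_weights[t] for t in shared)
    norm_q = math.sqrt(sum(w * w for w in q_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in d_weights.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```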

The Naïve Bayes classifier is based on Bayes' theorem. Assume m classes c_1, c_2, ..., c_m and a given unknown text q with no class label. The classification algorithm predicts that q belongs to the class with the highest posterior probability given q, which can be described as follows:

P(c_i \mid q) = \frac{P(q \mid c_i)\, P(c_i)}{P(q)}. \quad (33)

Naïve Bayes classification is based on a simple assumption: given the sample's class label, the attributes are conditionally independent.


Table 3: The result of Experiment 1.

Classification          Precision (%)    Recall (%)    F1 (%)
kNN                     69.2             72            70.6
Naïve Bayes             80.77            84            82.35
Improved Naïve Bayes    71.88            92            80.7

Table 4: The result of Experiment 2.

Classification          Precision (%)    Recall (%)    F1 (%)
kNN                     96.76            55.41         70.44
Naïve Bayes             96.6             96.6          96.6
Improved Naïve Bayes    83.61            100           91.08

Then the probability P(q) is constant, and the probability P(c_i) is also constant, so the calculation of P(q | c_i) can be simplified as follows:

P(c_i \mid q) = \prod_{j=1}^{n} P(x_j \mid c_i), \quad (34)

where x_j is the jth feature word's weight of text q in class c_i and n is the number of features in q.
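The following sketch applies the decision rule behind (33) and (34); it keeps the class prior and adds Laplace smoothing and log-probabilities, which are standard implementation details not spelled out in the text, and the data layout is our assumption.

```python
import math
from collections import Counter

def naive_bayes_predict(q_tokens, class_docs, alpha=1.0):
    """Pick the class maximizing P(c_i) * prod_j P(x_j | c_i), cf. (33)-(34),
    with Laplace smoothing (alpha) and log-probabilities for stability.
    class_docs: dict mapping class label -> list of tokenized training documents."""
    total_docs = sum(len(docs) for docs in class_docs.values())
    best_class, best_score = None, float("-inf")
    for c, docs in class_docs.items():
        # Per-class term statistics (recomputed here only to keep the sketch short).
        word_counts = Counter(t for tokens in docs for t in tokens)
        total_words = sum(word_counts.values())
        vocab = len(word_counts) or 1
        score = math.log(len(docs) / total_docs)   # log prior P(c_i)
        for t in q_tokens:                          # log-likelihoods P(x_j | c_i)
            score += math.log((word_counts[t] + alpha) / (total_words + alpha * vocab))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```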

4.3. Experimental Results and Analysis. To verify the performance of the system, we conduct two experiments on the sensitive information system, using tenfold cross validation and threefold cross validation on different data sets.

Experiment 1. The Chinese data set has 1000 articles, including 500 sensitive texts and 500 nonsensitive texts. The 1000 texts are randomly divided into 10 portions; one portion is selected as the test set and the other 9 are used as the training set. The process is repeated 10 times, and the final result is the average of the 10 classification results. Table 3 shows the result of Experiment 1.

Experiment 2. The Chinese data set has 1500 articles, including 750 sensitive texts and 750 nonsensitive texts. All the texts are randomly divided into 3 parts; one part is the test set and the others form the training set. The final result is the average of the 3 classification results. Table 4 shows the result of Experiment 2. A cross-validation loop of this kind is sketched below.
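A generic k-fold loop of the kind used in Experiments 1 and 2 (k = 10 and k = 3, respectively) might look as follows; the train_and_eval callback and the shuffling seed are placeholders, not part of the described system.

```python
import random

def k_fold_cross_validation(texts, labels, train_and_eval, k=10, seed=42):
    """Average a metric over k folds, as in the tenfold/threefold protocols.
    train_and_eval(train, test) must train a classifier and return a score."""
    indexed = list(zip(texts, labels))
    random.Random(seed).shuffle(indexed)
    folds = [indexed[i::k] for i in range(k)]  # k roughly equal portions

    scores = []
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        scores.append(train_and_eval(train, test))
    return sum(scores) / k
```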

From the results of Experiments 1 and 2, we can see that all three classifiers perform well, with the Naïve Bayes and Improved Naïve Bayes algorithms achieving the better performance. That is, the sensitive information system can satisfy the demands of Internet sensitive information recognition.

5. Conclusion

Feature selection is an important aspect of text preprocessing, and its performance directly affects the classifiers' accuracy. In this paper, we employ a feature selection algorithm, the IGI algorithm, and compare its performance with the IG, CHI, MI, and OR algorithms. The experimental results show that the IGI algorithm has the best performance on the English data sets and is very close to the best performance on the Chinese data sets. Therefore, the IGI algorithm is a promising algorithm for text preprocessing.

Feature weighting is another important aspect of text preprocessing, since the importance of features differs across categories. We endow different feature words with weights: the more important a feature word is, the larger its weight. Following the TF-IDF theory, we construct a feature weighting algorithm, the TF-Gini algorithm, by replacing the IDF part of the TF-IDF algorithm with the IGI algorithm. To test the performance of TF-Gini, we also construct the TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR algorithms in the same way. The experimental results show that TF-Gini performs best on the Chinese data sets and is very close to the best performance on the English data sets. Hence, the TF-Gini algorithm can improve the classifiers' performance, especially on Chinese data sets.

This paper also introduces a sensitive information identification system, which can monitor and filter sensitive information on the Internet. Considering the performance of TF-Gini on Chinese data sets, we choose this algorithm as the text preprocessing algorithm in the system, and the core classifier we choose is the Naïve Bayes classifier. The experimental results show that the system achieves good performance.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper is partly supported by the engineering planning project "Research on Technology of Big Data-Based High Efficient Classification of News" funded by Communication University of China (3132015XNG1504), partly supported by The Comprehensive Reform Project of Computer Science and Technology funded by the Ministry of Education of China (ZL140103 and ZL1503), and partly supported by The Excellent Young Teachers Training Project funded by Communication University of China (YXJS201508 and XNG1526).

References

[1] S. Jinshu, Z. Bofeng, and X. Xin, "Advances in machine learning based text categorization," Journal of Software, vol. 17, no. 9, pp. 1848–1859, 2006.

[2] G. Salton and C. Yang, "On the specification of term values in automatic indexing," Journal of Documentation, vol. 29, no. 4, pp. 351–372, 1973.

[3] Z. Cheng, L. Qing, and L. Fujun, "Improved VSM algorithm and its application in FAQ," Computer Engineering, vol. 38, no. 17, pp. 201–204, 2012.

[4] X. Junling, Z. Yuming, C. Lin, and X. Baowen, "An unsupervised feature selection approach based on mutual information," Journal of Computer Research and Development, vol. 49, no. 2, pp. 372–382, 2012.

[5] Z. Zhenhai, L. Shining, and L. Zhigang, "Multi-label feature selection algorithm based on information entropy," Journal of Computer Research and Development, vol. 50, no. 6, pp. 1177–1184, 2013.

[6] L. Kousu and S. Caiqing, "Research on feature-selection in Chinese text classification," Computer Simulation, vol. 24, no. 3, pp. 289–291, 2007.

[7] Y. Yang and J. Q. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning, pp. 412–420, Nashville, Tenn, USA, July 1997.

[8] Q. Liqing, Z. Ruyi, Z. Gang et al., "An extensive empirical study of feature selection for text categorization," in Proceedings of the 7th IEEE/ACIS International Conference on Computer and Information Science, pp. 312–315, IEEE, Washington, DC, USA, May 2008.

[9] S. Wenqian, D. Hongbin, Z. Haibin, and W. Yongbin, "A novel feature weight algorithm for text categorization," in Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE '08), pp. 1–7, Beijing, China, October 2008.

[10] L. Yongmin and Z. Weidong, "Using Gini-index for feature selection in text categorization," Computer Applications, vol. 27, no. 10, pp. 66–69, 2007.

[11] L. Xuejun, "Design and research of decision tree classification algorithm based on minimum Gini index," Software Guide, vol. 8, no. 5, pp. 56–57, 2009.

[12] R. Guofeng, L. Dehua, and P. Ying, "An improved algorithm for feature weight based on Gini index," Computer & Digital Engineering, vol. 38, no. 12, pp. 8–13, 2010.

[13] L. Yongmin, L. Zhengyu, and Z. Shuang, "Analysis and improvement of feature weighting method TF-IDF in text categorization," Computer Engineering and Design, vol. 29, no. 11, pp. 2923–2925, 2929, 2008.

[14] Q. Shian and L. Fayun, "Improved TF-IDF method in text classification," New Technology of Library and Information Service, vol. 238, no. 10, pp. 27–30, 2013.

[15] L. Yonghe and L. Yanfeng, "Improvement of text feature weighting method based on TF-IDF algorithm," Library and Information Service, vol. 57, no. 3, pp. 90–95, 2013.

[16] Y. Xu, Z. Li, and J. Chen, "Parallel recognition of illegal Web pages based on improved KNN classification algorithm," Journal of Computer Applications, vol. 33, no. 12, pp. 3368–3371, 2013.

[17] Y. Liu, Y. Jian, and J. Liping, "An adaptive large margin nearest neighbor classification algorithm," Journal of Computer Research and Development, vol. 50, no. 11, pp. 2269–2277, 2013.

[18] M. Wan, G. Yang, Z. Lai, and Z. Jin, "Local discriminant embedding with applications to face recognition," IET Computer Vision, vol. 5, no. 5, pp. 301–308, 2011.

[19] U. Maulik and D. Chakraborty, "Fuzzy preference based feature selection and semi-supervised SVM for cancer classification," IEEE Transactions on Nanobioscience, vol. 13, no. 2, pp. 152–160, 2014.

[20] X. Liang, "An effective method of pruning support vector machine classifiers," IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 26–38, 2010.

[21] P. Baomao and S. Haoshan, "Research on improved algorithm for Chinese word segmentation based on Markov chain," in Proceedings of the 5th International Conference on Information Assurance and Security (IAS '09), pp. 236–238, Xi'an, China, September 2009.

[22] S. Yan, C. Dongfeng, Z. Guiping, and Z. Hai, "Approach to Chinese word segmentation based on character-word joint decoding," Journal of Software, vol. 20, no. 9, pp. 2366–2375, 2009.

[23] J. Savoy, "A stemming procedure and stopword list for general French corpora," Journal of the American Society for Information Science, vol. 50, no. 10, pp. 944–952, 1999.

[24] L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski, "Decision trees for mining data streams based on the McDiarmid's bound," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1272–1279, 2013.

[25] N. Prasad, P. Kumar, and M. M. Naidu, "An approach to prediction of precipitation using Gini index in SLIQ decision tree," in Proceedings of the 4th International Conference on Intelligent Systems, Modelling and Simulation (ISMS '13), pp. 56–60, Bangkok, Thailand, January 2013.

[26] S. S. Sundhari, "A knowledge discovery using decision tree by Gini coefficient," in Proceedings of the International Conference on Business, Engineering and Industrial Applications (ICBEIA '11), pp. 232–235, IEEE, Kuala Lumpur, Malaysia, June 2011.

[27] S. V. Stehman, "Selecting and interpreting measures of thematic classification accuracy," Remote Sensing of Environment, vol. 62, no. 1, pp. 77–89, 1997.

[28] F. Guohe, "Review of performance of text classification," Journal of Intelligence, vol. 30, no. 8, pp. 66–69, 2011.

[29] C. J. Van Rijsbergen, Information Retrieval, Butterworths, London, UK, 1979.

[30] L. Jianghao, Y. Aimin, Y. Yongmei et al., "Classification of microblog sentiment based on Naïve Bayesian," Computer Engineering & Science, vol. 34, no. 9, pp. 160–165, 2012.

[31] Z. Kenan, Y. Baolin, M. Yaming, and H. Yingnan, "Malware classification approach based on valid window and Naïve Bayes," Journal of Computer Research and Development, vol. 51, no. 2, pp. 373–381, 2014.


Page 5: Research Article Improved Feature Weight Algorithm and Its ...

Mathematical Problems in Engineering 5

Suppose that 119876 is a set of 119904 samples and these sampleshave 119896 different classes (119888

119894 119894 = 1 2 3 119896) According to the

differences of classes we can divide 119876 into 119896 subsets (119876119894 119894 =

1 2 3 119896) Suppose that 119876119894is a sample set that belongs to

class 119888119894and that 119904

119894is the total number of 119876

119894 then the GI of set

119876 is

Gini (119876) = 1 minus

|119862|

sum

119894=1

1198752

119894 (10)

where 119875119894is the probability of 119876

119894in 119888119894and calculated by 119904

119894119904

When Gini(119876) is 0 that is the minimum all the membersin the set belong to the same class that is the maximumuseful information can be obtained When all of the samplesin the set distribute equally for each class Gini(119876) is themaximum that is the minimum useful information can beobtained

For an attribute 119860 with 119897 distinct values 119876 is partitionedinto 119897 subsets 119876

1 1198762 119876

119897 The GI of 119876 with respect to the

attribute 119860 is defined as

Gini119860

(119876) = 1 minus

119897

sum

119894=1

119904119894

119904Gini (119876) (11)

The main idea of GI algorithm is that the attribute withtheminimum value of GI is the best attribute which is chosento split

312 The Improved Gini Index (IGI) Algorithm The originalform of the GI algorithm was used to measure the impurityof attributes towards classification The smaller the impurityis the better the attribute is However many studies on GItheory have employed the measure of purity the larger thevalue of the purity is the better the attribute is Purity is moresuitable for text classification The formula is as follows

Gini (119876) =

|119862|

sum

119894=1

1198752

119894 (12)

This formula measures the purity of attributes towardscategorization The larger the value of purity is the betterthe attribute is However we always emphasize the high-frequency words in TC because the high-frequency wordshave more contributions to judge the class of texts But whenthe distribution of the training set is highly unbalanced thelower-frequency feature words still have some contributionto judge the class of texts although the contribution is far lesssignificant than that the high-frequency feature words haveTherefore we define the IGI algorithm as

IGI (119908) =

|119862|

sum

119894=1

119875 (119908 | 119888119894)2

119875 (119888119894

| 119908)2

(13)

where 119908 is the feature word |119862| is the total number of classesin the training set 119875(119908 | 119888

119894) is the probability of text that

contains119908 in class 119888119894 and119875(119888

119894| 119908) is the posterior probability

of text that contains 119908 in class 119888119894

This formula overcomes the shortcoming of the originalGI which considers the featuresrsquo conditional probability

combining the conditional probability and posterior prob-ability Hence the IGI algorithm can depress the affectionwhen the training set is unbalanced

32 TF-Gini Algorithm The purpose of FS is to choose themost suitable representative features to represent the originaltext In fact the featuresrsquo distinguishability is different Somefeatures can effectively distinguish a text from others butother features cannot Therefore we use weights to identifydifferent featuresrsquo distinguishabilityThe higher distinguisha-bility of features will get higher weights

The TF-IDF algorithm is a classical algorithm to calculatethe featuresrsquo weights For a word 119908 in text 119889 the weights of 119908

in 119889 are as follows

TF-IDF (119908 119889) = TF (119908 119889) sdot IDF (119908 119889)

= TF (119908 119889) sdot log(|119863|

DF (119908))

(14)

where TF(119908 119889) is the frequency of 119908 appearance in text 119889|119863| is the number of text documents in the training setand DF(119908) is the total number of texts containing 119908 in thetraining set

The TF-IDF algorithm is not suitable for TC mainlybecause of the shortcoming of TF-IDF in the section IDFThe TF-IDF considers that a word with a higher frequencyin different texts in the training set is not important Thisconsideration is not appropriate for TC Therefore we usepurity Gini to replace the IDF part in the TF-IDF formulaThe purity Gini is as follows

GiniTxt (119908) =

|119862|

sum

119894=1

119875119894(119908 | 119888119894)120575

(15)

where119908 is the featureword and 120575 is a nonzero value119875119894(119908 | 119888119894)

is the probability of text that contains 119908 in class 119888119894 and |119862| is

the total number of classes in the training set When 120575 = 2the following formula is the original Gini formula

GiniTxt (119908) =

|119862|

sum

119894=1

119875 (119908 | 119888119894)2

(16)

However in TC the experimental results indicate thatwhen 120575 = minus12 the classification performance is the best

GiniTxt (119908) =

|119862|

sum

119894=1

radic119875 (119908 | 119888119894)

119875 (119908 | 119888119894)

(17)

Therefore the new feature weighting algorithm known asTF-Gini can be represented as

TF-Gini = TF sdot GiniTxt (119908) (18)

33 Experimental and Analysis

331 Data Sets Preparation In this paper we choose Englishand Chinese data sets to verify the performance of the FSand FW algorithms The English data set which we choose

6 Mathematical Problems in Engineering

is Reuters-21578 We choose ten big categories among themThe Chinese data set is Fudan University Chinese text dataset (19637 texts)We use 3522 pieces as the training texts and2148 pieces as the testing texts

332 Classifiers Selection This paper is mainly focused ontext preprocessing stage The IGI and TF-Gini algorithms arethe main research area in this paper To validate the abovetwo algorithmsrsquo performance we introduce some commonlyused text classifiers such as 119896NN classifier f119896NN classifierand SVM classifier All the experiments are based on theseclassifiers

(1) The kNN Classifier The 119896 nearest neighbors (119896NN)algorithm is a very popular simple and nonparameter textclassification algorithm The idea of 119896NN algorithm is tochoose the category for new input sample that appears mostoften amongst its 119896 nearest neighbors Its decision rule can bedescribed as follows

120583119895

(119889) =

119896

sum

119894=1

120583119895

(119889119894) sim (119889 119889

119894) (19)

where 119895 (119895 = 1 2 3 ) is the 119895th text in the training set 119889 is anew input sample with unknown category and 119889

119894is a testing

text whose category is clear 120583119895(119889119894) is equal to 1 if sample 119889

119894

belongs to class 119895 otherwise it is 0 sim(119889 119889119894) indicates the

similarity between the new input sample and the testing textThe decision rule is

if 120583119895

(119889) = max119894

120583119894(119889)

then 119889 isin 119888119895

(20)

(2) The fkNN Classifier The 119896NN classifier is a lazy clas-sification algorithm It does not need the training set Theclassifierrsquos performance is not ideal especially when thedistribution of the training set is unbalanced Therefore weimprove the algorithm by using the fuzzy logic inferencesystem Its decision rule can be described as follows

120583119895

(119889)

=

sum119896

119894=1120583119895

(119889119894) sim (119889 119889

119894) (1 minus sim (119889 119889

119894))2(1minus119887)

sum119896

119894=1(1 minus sim (119889 119889

119894))2(1minus119887)

(21)

where 119895 (119895 = 1 2 3 ) is the 119895th text in the training setand 120583119895(119889119894)sim(119889 119889

119894) is the value ofmemberships for sample 119889

belonging to class 119895 120583119895(119889119894) is equal to 1 if sample 119889

119894belongs to

class 119895 otherwise it is 0 From the formula it can be seen thatf119896NN is based on 119896NN algorithm which endows a distanceweight on 119896NN formula The parameter 119887 adjusts the degreeof the weight 119887 isin [0 1] The f119896NN decision rule is as follows

if 120583119895

(119889) = max119894

120583119894(119889)

then 119889 isin 119888119895

(22)

H2

H1

H

b

+

++

+

+

minusminus

minus

minus

minus

1120596

120596 middot x + b = minus1 120596 middot x + b = 0

120596 middot x + b = +1

Figure 2 Optimal hyperplane for SVM classifier

(3) The SVM Classifier Support vector machine (SVM) is apotential classification algorithmwhich is based on statisticallearning theory SVM is highly accurate owing to its abilityto model complex nonlinear decision boundary It uses anonlinear mapping to transform the original set into a higherdimensionWithin this newdimension it searches for the lin-ear optimal separating hyperplane that is decision boundarywith an appropriate nonlinear mapping to a sufficiently highdimension data from two classes can always be separatedby a hyperplane Originally SVM mainly solves the problemof two-class classification The multiclass problem can beaddressed using multiple binary classifications

Suppose that in the feature space a set of training vectors119909119894(119894 = 1 2 3 ) belong to different classes 119910

119894isin minus1 1 We

wish to separate this training set with a hyperplane

120596 sdot 119909 + 119887 = 0 (23)

Obviously there are an infinite number of hyperplanesthat are able to partition the data set into sets Howeveraccording to the SVM theory there is only one optimalhyperplane which is lying half-way within the maximalmargin as the sumof distances of the hyperplane to the closesttraining sample of each class As shown in Figure 2 the solidline represents the optimal separating hyperplane

One of the problems of SVM is to find the maximumseparating hyperplane that is the one maximum distancebetween the nearest training samples In Figure 2 119867 is theoptimal hyperplane 119867

1and 119867

2are the parallel hyperplanes

to119867The task of SVM ismaking themargin distance between1198671and 119867

2maximum The maximum margin can be written

as follows

120596 sdot 119909119894

+ 119887 ge +1 119910119894

= +1

120596 sdot 119909119894

+ 119887 le minus1 119910119894

= minus1 997904rArr

119910119894(120596 sdot 119909

119894+ 119887) minus 1 ge 0

(24)

Mathematical Problems in Engineering 7

Therefore from Figure 2 we can get that the processof solving SVM is making the margin between 119867

1and 119867

2

maximum

min119909119894|119910119894=1

120596 sdot 119909119894

+ 119887

120596minus max119909119894119910119894=minus1

120596 sdot 119909119894

+ 119887

120596=

2

120596 (25)

The equation is also the same as minimizing 12059622

Therefore it becomes the following optimization question

min (1205962

2)

st 119910119894(120596 sdot 119909

119894+ 119887) ge 1

(26)

The formulameans the points which satisfy 119910119894(120596sdot119909119894+119887) =

1 called support vectors namely the points on 1198671and 119867

2in

Figure 2 are all support vectors In fact many samples are notclassified correctly by the hyperplane Hence we introduceslack variables Then the above formula becomes

min 1

21205962

minus 120578 sdot

|119862|

sum

119894=1

120585119894

st 119910119894(120596 sdot 119909

119894+ 119887) ge 1 minus 120585

119894

(27)

where 120578 is a positive constant to balance the experienceand confidence and |119862| is the total number of classes in thetraining set

333 Performance Evaluation Confusion matrix [27] alsoknown as a contingency table or an error matrix is a specifictable layout that allows visualization of the performance ofan algorithm typically a supervised learning algorithm Itis a table with two rows and two columns that reports thenumber of false positives false negatives true positives andtrue negatives Table 2 shows the confusion matrix

True positive (TP) is the number of correct predictionsthat an instance is positive False positive (FP) is the numberof incorrect predictions that an instance is positive Falsenegative (FN) is the number of incorrect predictions thatan instance is positive True negative (TN) is the number ofincorrect predictions that an instance is negative

Precision is the proportion of the predicted positive casesthat were correct as calculated using the equation

Precision (119875) =TP

TP + FP (28)

Recall is the proportion of the actual positive cases thatwere correct as calculated using the equation

Recall =TP

TP + FN (29)

We hope that it is better when the precision and recallhave higher values However both of them are contradictoryFor example if the result of prediction is only one andaccurate the precision is 100 and the recall is very lowbecause TP equals 1 FP equals 0 and FN is hugeOn the other

Table 2 Confusion matrix

Prediction as positive Prediction as negativeActual positive TP FNActual negative FP TN

hand if the prediction as positive contains all the results (ieFN equals 0) therefore the recall is 100The precision is lowbecause FP is large

1198651 [28] is the average of precision and recall If 1198651 is highthe algorithm and experimental methods are ideal 1198651 is tomeasure the performance of a classifier considering the abovetwo methods and was proposed by Van Rijsbergen [29] in1979 It can be described as follows

1198651 =2 times Precision times RecallPrecision + Recall

(30)

Precision and recall only evaluate the performance ofclassification algorithms in a certain category It is oftennecessary to evaluate classification performance in all the cat-egories When dealing with multiple categories there are twopossible ways of averaging these measures namely macro-average andmicroaverageThemacroaverage weights equallyall classes regardless of how many documents belong to itThe microaverage weights equally all the documents thusfavoring the performance in common classes The formulasof macroaverage and microaverage are as follows and |119862| isthe number of classes in the training set

Macro 119901 =sum|119862|

119894=1Precision

119894

|119862|

Macro 119903 =sum|119862|

119894=1Recall

119894

|119862|

Macro 1198651 =sum|119862|

119894=11198651119894

|119862|

Micro 119901 =sum|119862|

119894=1TP119894

sum|119862|

119894=1(TP119894

+ FP119894)

Micro 119903 =sum|119862|

119894=1TP119894

sum|119862|

119894=1(TP119894

+ FN119894)

Micro 1198651 =2 times Micro 119901 times Micro 119903

Micro 119901 + Micro 119903

(31)

334 Experimental Results and Analysis Figures 3 and 4are the results on the English data set The performance ofclassical GI is the lowest and the IGI is the highest There-fore the IGI algorithm is helpful to improve the classifiersrsquoperformance

From Figures 5 and 6 it can be seen that in Chinesedata sets the classical GI still has poor performanceThe bestperformance belongs to ECE and then the second is IG Theperformance of IGI is not the best Since theChinese languageis different from the English language at the processing of

8 Mathematical Problems in Engineering

GIIGIIGECE

CHIORMI

kNN fkNNSVMClassifiers

0

10

20

30

40

50

60

70

80

()

Figure 3 Macro 1198651 results on Reuters-21578 data set

GIIGIIGECE

CHIORMI

SVMClassifiers

0

10

20

30

40

50

60

70

80

()

kNN fkNN

Figure 4 Macro 1198651 results on Reuters-21578 data set

Stop Words the results of Chinese word segmentation algo-rithm will influence the classification algorithmrsquos accuracyMeanwhile the current experiments do not consider thefeature wordsrsquo weight The following TF-Gini algorithm hasa good solution to this problem

Although the IGI algorithm does not have the bestperformance in the Chinese data set its performance is veryclose to the highest Hence the IGI feature selection algo-rithm is suitable and acceptable in the classification algo-rithm It promotes the classifiersrsquo performance effectively

GIIGIIGECE

CHIORMI

SVMClassifiers

0

10

20

30

40

50

60

70

80

90

100

()

kNN fkNN

Figure 5 Macro 1198651 results on Chinese data set

GIIGIIGECE

CHIORMI

0

10

20

30

40

50

60

70

80

90

100

()

SVMClassifiers

kNN fkNN

Figure 6 Micro 1198651 results on Chinese data set

Figures 7 8 9 and 10 are results for the English and Chi-nese date sets To determine which algorithmrsquos performanceis the best we use different TF weights algorithms such asTF-IDF and TF-Gini

From Figures 7 and 8 we can see that the TF-Giniweights algorithmrsquos performance is good and acceptable forthe English data set its performance is very close to thehighest FromFigures 9 and 10 we can see that on theChinesedata set the TF-Gini algorithm overcomes the deficienciesof the IGI algorithm and gets the best performance Theseresults show that the TF-Gini algorithm is a very promisingfeature weight algorithm

Mathematical Problems in Engineering 9

TF-IDF

TF-Gini

TF-IG

TF-ECE

TF-CHI

TF-MI

TF-OR

72

74

76

78

80

82

84

86

88

90

92

()

SVMClassifiers

kNN fkNN

Figure 7 Feature weights Macro 1198651 results on Reuters-21578 dataset

TF-IDFTF-GiniTF-IGTF-ECE

TF-CHITF-MITF-OR

SVMClassifiers

70

72

74

76

78

80

82

84

86

88

90

92

()

kNN fkNN

Figure 8 Feature weights Micro 1198651 results on Reuters-21578 dataset

4 Sensitive Information Identification SystemBased on TF-Gini Algorithm

41 Introduction With the rapid development of networktechnologies the Internet has become a huge repository ofinformation At the same time much garbage informationis poured into the Internet such as a virus violence andgambling All these are called sensitive information that

TF-IDFTF-GiniTF-IGTF-ECE

TF-CHITF-MITF-OR

SVMClassifiers

0

20

40

60

80

100

()

10

30

50

70

90

kNN fkNN

Figure 9 Feature weights Macro 1198651 results on Chinese data set

TF-IDFTF-GiniTF-IGTF-ECE

TF-CHITF-MITF-OR

SVMClassifiers

0

20

40

60

80

100

()

10

30

50

70

90

kNN fkNN

Figure 10 Feature weights Micro 1198651 results on Chinese data set

will produce extremely adverse impacts and serious conse-quences

The commonly used methods of Internet informationfiltering andmonitoring are stringmatchingThere are a largenumber of synonyms and ambiguous words in natural textsMeanwhile a lot of new words and expressions appear onthe Internet every day Therefore the traditional sensitiveinformation filtering and monitoring are not effective Inthis paper we propose a new sensitive information filteringsystem based on machine learning The TF-Gini text featureselection algorithm is used in this system This system can

10 Mathematical Problems in Engineering

Start

Select a training set

Chinese word segmentation

VSM

Feature selection andweighting (TF-Gini

algorithm)

Classification

Classification model

Select a training set

Chinese word segmentation

VSM

Load model

Sensitive information

Sensitive Nonsensitive

Results evaluation

Yes No

Figure 11 Sensitive system architecture

discover the sensitive information in a fast accurate andintelligent way

42 System Architecture The sensitive system includes thetext preprocessing feature selection text representation textclassification and result evaluation modules Figure 11 showsthe sensitive system architecture

421 The Chinese Word Segmentation The function of thismodule is segmentation of Chinese words and getting rid ofStop Words In this module we use IKanalyzer to completeChinese words segmentation IKanalyzer is a lightweightChinese words segmentation tool based on an open sourcepackage

(1) IKanalyzer uses a multiprocessor module and sup-ports English letters Chinese vocabulary and indi-vidual words processing

(2) It optimizes the dictionary storage and supports theuser dictionary definition

422 The Feature Selection At the stage of feature selectionthe TF-Gini algorithm has a good performance on Chinesedata sets Feature words selection is very important prepara-tion for classification In this step we use TF-Gini algorithmas the core algorithm of feature selection

423 The Text Classification We use the Naıve Bayes algo-rithm [30] as the main classifier In order to verify the

performance of the Naıve Bayes algorithm proposed in thispaper we use the 119896NN algorithm and the improved NaıveBayes [31] algorithm to compare with it

The 119896NN algorithm is a typical algorithm in text clas-sification and does not need the training data set In orderto calculate the similarity between the test sample 119902 and thetraining samples 119889

119895 we adopt the cosine similarity It can be

described as follows

cos (119902 119889119895) =

sum|119881|

119894=1119908119894119895

times 119908119894119902

radicsum|119881|

119894=11199082

119894119895times radicsum

|119881|

119894=11199082

119894119902

(32)

where 119908119894119895is the 119894th feature wordrsquos weights in 119889

119895and 119908

119894119902is the

119894th feature wordrsquos weights in sample 119902 |119881| is the number offeature spaces

Naıve Bayes classifier is based on the principle of BayesAssume 119898 classes 119888

1 1198882 119888

119898and a given unknown text 119902

with no class label The classification algorithm will predict if119902 belongs to the class with a higher posterior probability inthe condition of 119902 It can be described as follows

119875 (119888119894

| 119902) =119875 (119902 | 119888

119894) 119875 (119888119894)

119875 (119902) (33)

Naıve Bayes classification is based on a simple assump-tion in a given sample class label the attributes are condi-tionally independent Then the probability 119875(119902) is constant

Mathematical Problems in Engineering 11

Table 3 The result of Experiment 1

Classification Precision () Recall () 1198651 ()119896NN 692 72 706Naıve Bayes 8077 84 8235Improved Naıve Bayes 7188 92 807

Table 4 The result of Experiment 2

Classification Precision () Recall () 1198651 ()119896NN 9676 5541 7044Naıve Bayes 966 966 966Improved Naıve Bayes 8361 100 9108

and the probability of 119875(119888119894) is also constant so the calculation

of 119875(119902 | 119888119894) can be simplified as follows

119875 (119888119894

| 119902) =

119899

prod

119895=1

119875 (119909119895

| 119888119894) (34)

where 119909119895is the 119895th feature wordrsquos weights of text 119902 in class 119888

119894

and 119899 is the number of features in 119902

43 Experimental Results and Analysis In order to verifythe performance of the system we use two experiments tomeasure the sensitive information system We use tenfoldcross validation and threefold cross validation in differentdata sets to verify the system

Experiment 1 TheChinese data set has 1000 articles includ-ing 500 sensitive texts and 500 nonsensitive texts The 1000texts are randomly divided into 10 portions One of theportions is selected as test set and the other 9 samples are usedas a training set The process is repeated 10 times The finalresult is the average values of 10 classification results Table 3is the result of Experiment 1

Experiment 2 TheChinese data set has 1500 articles includ-ing 750 sensitive texts and 750 nonsensitive texts All thetexts are randomly divided into 3 parts One of the parts istest set the others are training sets The final result is theaverage values of 3 classification results Table 4 is the resultof Experiment 2

From the results of Experiments 1 and 2 we can seethat all the three classifiers have a better performance TheNaıve Bayes and Improved Naıve Bayes algorithms get abetter performance That is the sensitive information systemcould satisfy the demand of the Internet sensitive informationrecognition

5 Conclusion

Feature selection is an important aspect in text preprocessingIts performance directly affects the classifiersrsquo accuracy Inthis paper we employ a feature selection algorithm that isthe IGI algorithmWe compare its performancewith IGCHIMI and OR algorithms The experiment results show that

the IGI algorithm has the best performance on the Englishdata sets and is very close to the best performance on theChinese data setsTherefore the IGI algorithm is a promisingalgorithm in the text preprocessing

Feature weighting is another important aspect in textpreprocessing since the importance degree of features is dif-ferent in different categories We endow weights on differentfeature words the more important the feature word is thebigger the weights the feature word has According to the TF-IDF theory we construct a feature weighting algorithm thatis TF-Gini algorithm We replace the IDF part of TF-IDFalgorithm with the IGI algorithm To test the performanceof TF-Gini we also construct TF-IG TF-ECE TF-CHI TF-MI and TF-OR algorithms in the same way The experimentresults show that the TF-Gini performance is the best in theChinese data sets and is very close to the best performance inthe English data sets Hence TF-Gini algorithm can improvethe classifierrsquos performance especially in Chinese data sets

This paper also introduces a sensitive information iden-tification system which can monitor and filter sensitiveinformation on the Internet Considering the performanceof TF-Gini on Chinese data sets we choose the algorithm astext preprocessing algorithm in the systemThe core classifierwhich we choose is Naıve Bayes classifier The experimentresults show that the system achieves good performance

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This paper is partly supported by the engineering planningproject Research on Technology of Big Data-Based HighEfficient Classification of News funded by CommunicationUniversity of China (3132015XNG1504) partly supported byThe Comprehensive Reform Project of Computer Scienceand Technology funded by Ministry of Education of China(ZL140103 and ZL1503) and partly supported by The Excel-lent Young Teachers Training Project funded by Communi-cation University of China (YXJS201508 and XNG1526)

References

[1] S Jinshu Z Bofeng and X Xin ldquoAdvances in machine learningbased text categorizationrdquo Journal of Software vol 17 no 9 pp1848ndash1859 2006

[2] G Salton and C Yang ldquoOn the specification of term values inautomatic indexingrdquo Journal of Documentation vol 29 no 4pp 351ndash372 1973

[3] Z Cheng L Qing and L Fujun ldquoImprovedVSMalgorithm andits application in FAQrdquoComputer Engineering vol 38 no 17 pp201ndash204 2012

[4] X Junling Z Yuming C Lin andX Baowen ldquoAnunsupervisedfeature selection approach based on mutual informationrdquo Jour-nal of Computer Research and Development vol 49 no 2 pp372ndash382 2012

[5] Z Zhenhai L Shining and L Zhigang ldquoMulti-label featureselection algorithm based on information entropyrdquo Journal of

12 Mathematical Problems in Engineering

Computer Research and Development vol 50 no 6 pp 1177ndash1184 2013

[6] L Kousu and S Caiqing ldquoResearch on feature-selection inChinese text classificationrdquo Computer Simulation vol 24 no3 pp 289ndash291 2007

[7] Y Yang and J Q Pedersen ldquoA comparative study on featureselection in text categorizationrdquo in Proceedings of the 14thInternational Conference on Machine Learning pp 412ndash420Nashville Tenn USA July 1997

[8] Q Liqing Z Ruyi Z Gang et al ldquoAn extensive empirical studyof feature selection for text categorizationrdquo in Proceedings ofthe 7th IEEEACIS International Conference on Computer andInformation Science pp 312ndash315 IEEE Washington DC USAMay 2008

[9] S Wenqian D Hongbin Z Haibin and W Yongbin ldquoA novelfeature weight algorithm for text categorizationrdquo in Proceedingsof the International Conference on Natural Language Processingand Knowledge Engineering (NLP-KE rsquo08) pp 1ndash7 BeijingChina October 2008

[10] L Yongmin and Z Weidong ldquoUsing Gini-index for featureselection in text categorizationrdquo Computer Applications vol 27no 10 pp 66ndash69 2007

[11] L Xuejun ldquoDesign and research of decision tree classificationalgorithm based on minimum Gini indexrdquo Software Guide vol8 no 5 pp 56ndash57 2009

[12] R Guofeng L Dehua and P Ying ldquoAn improved algorithmfor feature weight based on Gini indexrdquo Computer amp DigitalEngineering vol 38 no 12 pp 8ndash13 2010

[13] L Yongmin L Zhengyu andZ Shuang ldquoAnalysis and improve-ment of feature weighting method TF-IDF in text categoriza-tionrdquoComputer Engineering andDesign vol 29 no 11 pp 2923ndash2925 2929 2008

[14] Q Shian and L Fayun ldquoImproved TF-IDFmethod in text class-ificationrdquo New Technology of Library and Information Servicevol 238 no 10 pp 27ndash30 2013

[15] L Yonghe and L Yanfeng ldquoImprovement of text feature weight-ing method based on TF-IDF algorithmrdquo Library and Informa-tion Service vol 57 no 3 pp 90ndash95 2013


The English data set is Reuters-21578; we choose ten big categories among them. The Chinese data set is the Fudan University Chinese text data set (19,637 texts); we use 3,522 pieces as the training texts and 2,148 pieces as the testing texts.

3.3.2. Classifiers Selection. This paper mainly focuses on the text preprocessing stage, and the IGI and TF-Gini algorithms are its main research contributions. To validate the two algorithms' performance, we introduce some commonly used text classifiers, such as the kNN classifier, the fkNN classifier, and the SVM classifier. All the experiments are based on these classifiers.

(1) The kNN Classifier. The k-nearest neighbors (kNN) algorithm is a very popular, simple, and nonparametric text classification algorithm. The idea of the kNN algorithm is to assign a new input sample the category that appears most often among its k nearest neighbors. Its decision rule can be described as follows:

\mu_j(d) = \sum_{i=1}^{k} \mu_j(d_i)\,\operatorname{sim}(d, d_i), \quad (19)

where $j$ ($j = 1, 2, 3, \ldots$) indexes the categories, $d$ is a new input sample with unknown category, and $d_i$ is one of its $k$ nearest training texts, whose category is known. $\mu_j(d_i)$ is equal to 1 if sample $d_i$ belongs to class $j$; otherwise it is 0. $\operatorname{sim}(d, d_i)$ indicates the similarity between the new input sample and the training text. The decision rule is

\text{if } \mu_j(d) = \max_i \mu_i(d), \text{ then } d \in c_j. \quad (20)
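As a concrete illustration of rules (19) and (20), the sketch below scores each category by summing the similarities of a sample's k nearest training neighbors and picks the best-scoring category. It is a minimal NumPy sketch, not the authors' implementation; the cosine similarity, the function names, and the toy vectors are assumptions made for the example.

```python
import numpy as np

def cosine_sim(a, b):
    # sim(d, d_i): cosine similarity between two term-weight vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def knn_predict(d, train_X, train_y, k=3):
    # Equation (19): mu_j(d) = sum over the k nearest neighbors of
    # mu_j(d_i) * sim(d, d_i), where mu_j(d_i) is 1 iff d_i is in class j.
    sims = np.array([cosine_sim(d, x) for x in train_X])
    nearest = np.argsort(sims)[-k:]          # indices of the k most similar training texts
    scores = {}
    for i in nearest:
        scores[train_y[i]] = scores.get(train_y[i], 0.0) + sims[i]
    # Decision rule (20): d belongs to the class with the largest membership value.
    return max(scores, key=scores.get)

# toy example: four training vectors in a three-term feature space
train_X = np.array([[1.0, 0.0, 0.2], [0.9, 0.1, 0.0], [0.0, 1.0, 0.8], [0.1, 0.9, 1.0]])
train_y = ["sports", "sports", "finance", "finance"]
print(knn_predict(np.array([0.0, 0.8, 0.9]), train_X, train_y, k=3))  # -> "finance"
```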

(2) The fkNN Classifier. The kNN classifier is a lazy classification algorithm and does not require a training phase. Its performance is not ideal, especially when the distribution of the training set is unbalanced. Therefore, we improve the algorithm by using a fuzzy logic inference scheme. Its decision rule can be described as follows:

\mu_j(d) = \frac{\sum_{i=1}^{k} \mu_j(d_i)\,\operatorname{sim}(d, d_i)\,(1 - \operatorname{sim}(d, d_i))^{2(1-b)}}{\sum_{i=1}^{k} (1 - \operatorname{sim}(d, d_i))^{2(1-b)}}, \quad (21)

where $j$ ($j = 1, 2, 3, \ldots$) indexes the categories and $\mu_j(d_i)\operatorname{sim}(d, d_i)$ is the membership value of sample $d$ for class $j$ contributed by neighbor $d_i$; $\mu_j(d_i)$ is equal to 1 if sample $d_i$ belongs to class $j$ and 0 otherwise. From the formula, it can be seen that fkNN is based on the kNN algorithm and endows the kNN formula with a distance weight. The parameter $b \in [0, 1]$ adjusts the degree of the weighting. The fkNN decision rule is as follows:

\text{if } \mu_j(d) = \max_i \mu_i(d), \text{ then } d \in c_j. \quad (22)
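The fuzzy weighting in (21) can be added on top of the same neighbor search. The sketch below follows the exponent 2(1 - b) exactly as reconstructed above; the function name, the cosine similarity, and the toy data are assumptions for illustration, not the authors' code.

```python
import numpy as np

def fknn_memberships(d, train_X, train_y, k=3, b=0.5):
    # Equation (21): distance-weighted (fuzzy) class memberships over the k nearest neighbors.
    sims = np.array([float(d @ x / (np.linalg.norm(d) * np.linalg.norm(x) + 1e-12))
                     for x in train_X])
    nearest = np.argsort(sims)[-k:]
    memberships = {}
    for c in sorted(set(train_y)):
        num, den = 0.0, 0.0
        for i in nearest:
            w = (1.0 - sims[i]) ** (2.0 * (1.0 - b))   # weight term from (21); b in [0, 1]
            num += (1.0 if train_y[i] == c else 0.0) * sims[i] * w
            den += w
        memberships[c] = num / den if den > 0 else 0.0
    return memberships

train_X = np.array([[1.0, 0.0, 0.2], [0.9, 0.1, 0.0], [0.0, 1.0, 0.8], [0.1, 0.9, 1.0]])
train_y = ["sports", "sports", "finance", "finance"]
m = fknn_memberships(np.array([0.0, 0.8, 0.9]), train_X, train_y, k=3, b=0.5)
print(max(m, key=m.get))   # decision rule (22): class with the largest membership
```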

Figure 2: Optimal hyperplane for the SVM classifier. $H$ is the optimal separating hyperplane ($\omega \cdot x + b = 0$), and $H_1$ and $H_2$ are the parallel margin hyperplanes $\omega \cdot x + b = +1$ and $\omega \cdot x + b = -1$.

(3) The SVM Classifier. The support vector machine (SVM) is a powerful classification algorithm based on statistical learning theory. SVM is highly accurate owing to its ability to model complex nonlinear decision boundaries. It uses a nonlinear mapping to transform the original data into a higher-dimensional space and, within this new space, searches for the linear optimal separating hyperplane, that is, the decision boundary; with an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. Originally, SVM solves the two-class classification problem; the multiclass problem can be addressed using multiple binary classifications.

Suppose that, in the feature space, a set of training vectors $x_i$ ($i = 1, 2, 3, \ldots$) belong to different classes $y_i \in \{-1, 1\}$. We wish to separate this training set with a hyperplane

\omega \cdot x + b = 0. \quad (23)

Obviously, there are an infinite number of hyperplanes that can partition the data set into two sets. However, according to SVM theory, there is only one optimal hyperplane, which lies halfway within the maximal margin, defined by the sum of the distances from the hyperplane to the closest training sample of each class. As shown in Figure 2, the solid line represents the optimal separating hyperplane.

The problem of SVM is thus to find the maximum-margin separating hyperplane, that is, the one with the maximum distance to the nearest training samples. In Figure 2, $H$ is the optimal hyperplane, and $H_1$ and $H_2$ are the hyperplanes parallel to $H$. The task of SVM is to make the margin between $H_1$ and $H_2$ maximal. The margin constraints can be written as follows:

\omega \cdot x_i + b \ge +1, \quad y_i = +1,
\omega \cdot x_i + b \le -1, \quad y_i = -1
\;\Longrightarrow\; y_i(\omega \cdot x_i + b) - 1 \ge 0. \quad (24)


Therefore, from Figure 2, we can see that solving the SVM amounts to maximizing the margin between $H_1$ and $H_2$:

\min_{x_i \mid y_i = 1} \frac{\omega \cdot x_i + b}{\|\omega\|} - \max_{x_i \mid y_i = -1} \frac{\omega \cdot x_i + b}{\|\omega\|} = \frac{2}{\|\omega\|}. \quad (25)

Maximizing this margin is equivalent to minimizing $\|\omega\|^2 / 2$; therefore, it becomes the following optimization problem:

\min \frac{\|\omega\|^2}{2} \quad \text{s.t. } y_i(\omega \cdot x_i + b) \ge 1. \quad (26)

The points that satisfy $y_i(\omega \cdot x_i + b) = 1$ are called support vectors; namely, the points lying on $H_1$ and $H_2$ in Figure 2 are all support vectors. In practice, many samples cannot be classified correctly by a hard-margin hyperplane, so we introduce slack variables $\xi_i$, and the above formula becomes

\min \frac{1}{2}\|\omega\|^2 + \eta \sum_{i=1}^{n} \xi_i \quad \text{s.t. } y_i(\omega \cdot x_i + b) \ge 1 - \xi_i, \quad (27)

where $\eta$ is a positive constant that balances the empirical error against the confidence margin and $n$ is the number of training samples.
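To make the optimization in (26)-(27) concrete, the following sketch minimizes the soft-margin objective in its hinge-loss form by stochastic sub-gradient descent. The learning rate, the toy data, and the use of sub-gradient descent (rather than the quadratic-programming solvers usually applied to SVMs) are assumptions made purely for illustration.

```python
import numpy as np

def train_linear_svm(X, y, eta=1.0, lr=0.01, epochs=200):
    """Sub-gradient descent on (1/2)||w||^2 + eta * sum_i max(0, 1 - y_i (w.x_i + b)),
    the hinge-loss form of the soft-margin problem (26)-(27); labels must be +1 or -1."""
    n, dim = X.shape
    w, b = np.zeros(dim), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin >= 1.0:                  # outside the margin: only the regularizer acts
                w -= lr * w
            else:                              # inside the margin: hinge term is active
                w -= lr * (w - eta * y[i] * X[i])
                b += lr * eta * y[i]
    return w, b

# linearly separable toy set: class +1 around (2, 2), class -1 around (-2, -2)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
w, b = train_linear_svm(X, y)
print(np.mean(np.sign(X @ w + b) == y))   # training accuracy, close to 1.0
```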

3.3.3. Performance Evaluation. The confusion matrix [27], also known as a contingency table or an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning algorithm. It is a table with two rows and two columns that reports the numbers of true positives, false positives, false negatives, and true negatives. Table 2 shows the confusion matrix.

True positive (TP) is the number of positive instances correctly predicted as positive. False positive (FP) is the number of negative instances incorrectly predicted as positive. False negative (FN) is the number of positive instances incorrectly predicted as negative. True negative (TN) is the number of negative instances correctly predicted as negative.

Precision is the proportion of the predicted positive cases that are correct, calculated using the equation

\text{Precision } (P) = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}. \quad (28)

Recall is the proportion of the actual positive cases that are correctly identified, calculated using the equation

\text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}. \quad (29)

We hope that both precision and recall have high values; however, the two are often in tension. For example, if only one prediction is made and it is correct, the precision is 100% but the recall is very low, because TP equals 1, FP equals 0, and FN is huge. On the other hand, if every instance is predicted as positive (i.e., FN equals 0), the recall is 100% but the precision is low, because FP is large.

Table 2: Confusion matrix.

                      Prediction as positive    Prediction as negative
Actual positive       TP                        FN
Actual negative       FP                        TN

F1 [28], proposed by Van Rijsbergen [29] in 1979, combines precision and recall into a single measure of classifier performance (their harmonic mean); if F1 is high, the algorithm and the experimental method are effective. It can be described as follows:

F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \quad (30)
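Equations (28)-(30) translate directly into code. The short sketch below computes them from raw TP/FP/FN counts; the example counts are made up for illustration and only echo the extreme cases discussed above.

```python
def precision_recall_f1(tp, fp, fn):
    # Equations (28)-(30): precision, recall, and their harmonic mean F1.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1(tp=1, fp=0, fn=999))     # perfect precision, near-zero recall
print(precision_recall_f1(tp=500, fp=500, fn=0))   # perfect recall, 50% precision
```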

Precision and recall evaluate the performance of a classification algorithm only on a single category. It is often necessary to evaluate classification performance over all categories. When dealing with multiple categories, there are two ways of averaging these measures, namely, macroaverage and microaverage. The macroaverage weights all classes equally, regardless of how many documents belong to each; the microaverage weights all documents equally, thus favoring performance on common classes. The formulas of macroaverage and microaverage are as follows, where |C| is the number of classes in the training set:

\mathrm{Macro}_{p} = \frac{\sum_{i=1}^{|C|} \text{Precision}_i}{|C|}, \qquad
\mathrm{Macro}_{r} = \frac{\sum_{i=1}^{|C|} \text{Recall}_i}{|C|}, \qquad
\mathrm{Macro}_{F_1} = \frac{\sum_{i=1}^{|C|} F_{1i}}{|C|},

\mathrm{Micro}_{p} = \frac{\sum_{i=1}^{|C|} \mathrm{TP}_i}{\sum_{i=1}^{|C|} (\mathrm{TP}_i + \mathrm{FP}_i)}, \qquad
\mathrm{Micro}_{r} = \frac{\sum_{i=1}^{|C|} \mathrm{TP}_i}{\sum_{i=1}^{|C|} (\mathrm{TP}_i + \mathrm{FN}_i)}, \qquad
\mathrm{Micro}_{F_1} = \frac{2 \times \mathrm{Micro}_{p} \times \mathrm{Micro}_{r}}{\mathrm{Micro}_{p} + \mathrm{Micro}_{r}}. \quad (31)
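The macro- and microaveraged measures in (31) can be computed from per-class confusion counts as sketched below; the helper name and the per-class counts are invented for the example.

```python
def macro_micro(counts):
    """counts: list of (TP_i, FP_i, FN_i) tuples, one per class, as in equation (31)."""
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts:
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    n = len(counts)
    macro = (sum(precisions) / n, sum(recalls) / n, sum(f1s) / n)
    tp_sum = sum(c[0] for c in counts)
    micro_p = tp_sum / sum(c[0] + c[1] for c in counts)
    micro_r = tp_sum / sum(c[0] + c[2] for c in counts)
    micro = (micro_p, micro_r, 2 * micro_p * micro_r / (micro_p + micro_r))
    return macro, micro

# three classes; the common first class dominates the microaverage
print(macro_micro([(90, 10, 5), (8, 2, 4), (5, 5, 1)]))
```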

3.3.4. Experimental Results and Analysis. Figures 3 and 4 show the results on the English data set. The performance of the classical GI is the lowest, and that of the IGI is the highest. Therefore, the IGI algorithm helps to improve the classifiers' performance.

From Figures 5 and 6, it can be seen that on the Chinese data sets the classical GI still performs poorly. The best performance belongs to ECE, and the second best is IG; the performance of IGI is not the best.


Figure 3: Macro F1 results on the Reuters-21578 data set (feature selection with GI, IGI, IG, ECE, CHI, OR, and MI; kNN, fkNN, and SVM classifiers).

Figure 4: Micro F1 results on the Reuters-21578 data set (same feature selection algorithms and classifiers).

The Chinese language differs from the English language in the processing of Stop Words, so the results of the Chinese word segmentation algorithm influence the classification algorithm's accuracy. Meanwhile, the current experiments do not consider the feature words' weights; the TF-Gini algorithm discussed below addresses this problem.

Although the IGI algorithm does not have the best performance on the Chinese data set, its performance is very close to the highest. Hence, the IGI feature selection algorithm is suitable and acceptable for the classification task and promotes the classifiers' performance effectively.

Figure 5: Macro F1 results on the Chinese data set (feature selection with GI, IGI, IG, ECE, CHI, OR, and MI; kNN, fkNN, and SVM classifiers).

Figure 6: Micro F1 results on the Chinese data set (same feature selection algorithms and classifiers).

Figures 7, 8, 9, and 10 show the results for the English and Chinese data sets obtained with different TF-based weighting algorithms, such as TF-IDF and TF-Gini, in order to determine which algorithm performs best.

From Figures 7 and 8, we can see that the TF-Gini weighting algorithm's performance is good and acceptable on the English data set; it is very close to the highest. From Figures 9 and 10, we can see that on the Chinese data set the TF-Gini algorithm overcomes the deficiencies of the IGI algorithm and achieves the best performance. These results show that TF-Gini is a very promising feature weight algorithm.


Figure 7: Feature weights Macro F1 results on the Reuters-21578 data set (TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR with kNN, fkNN, and SVM classifiers).

Figure 8: Feature weights Micro F1 results on the Reuters-21578 data set (same weighting algorithms and classifiers).

4. Sensitive Information Identification System Based on TF-Gini Algorithm

4.1. Introduction. With the rapid development of network technologies, the Internet has become a huge repository of information. At the same time, a great deal of garbage information, such as viruses, violence, and gambling, is poured onto the Internet. All of this is called sensitive information.

Figure 9: Feature weights Macro F1 results on the Chinese data set (TF-IDF, TF-Gini, TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR with kNN, fkNN, and SVM classifiers).

Figure 10: Feature weights Micro F1 results on the Chinese data set (same weighting algorithms and classifiers).

Sensitive information produces extremely adverse impacts and serious consequences.

The commonly used methods of Internet information filtering and monitoring are based on string matching. However, natural-language texts contain a large number of synonyms and ambiguous words, and many new words and expressions appear on the Internet every day. Therefore, traditional sensitive information filtering and monitoring are not effective. In this paper, we propose a new sensitive information filtering system based on machine learning, which uses the TF-Gini text feature selection algorithm.


Figure 11: Sensitive system architecture. Training branch: select a training set, Chinese word segmentation, VSM, feature selection and weighting (TF-Gini algorithm), classification, classification model. Identification branch: Chinese word segmentation, VSM, load model, sensitive or nonsensitive decision, results evaluation.

The system can discover sensitive information in a fast, accurate, and intelligent way.

4.2. System Architecture. The sensitive system includes text preprocessing, feature selection, text representation, text classification, and result evaluation modules. Figure 11 shows the system architecture, and a minimal sketch of how the modules fit together is given below.
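The sketch below illustrates how the identification branch of Figure 11 might be wired together in code. The `segment`, `tf_gini_weights`, and `classifier` objects are hypothetical placeholders standing in for the IKanalyzer segmenter, the TF-Gini weighting described earlier, and the Naïve Bayes classifier; this is an illustration of the data flow, not the authors' implementation.

```python
def classify_document(raw_text, segment, stop_words, vocabulary, tf_gini_weights, classifier):
    """Sensitive-information pipeline sketch: segmentation -> VSM -> TF-Gini weighting -> classifier."""
    # 1. Chinese word segmentation and Stop Word removal (IKanalyzer in the real system).
    tokens = [w for w in segment(raw_text) if w not in stop_words]
    # 2. Represent the text in the Vector Space Model over the selected feature words.
    tf = {w: tokens.count(w) for w in vocabulary}
    # 3. Weight each feature by TF-Gini (term frequency times its Gini-based importance).
    vector = [tf[w] * tf_gini_weights.get(w, 0.0) for w in vocabulary]
    # 4. Load the trained model and decide: sensitive or nonsensitive.
    return classifier.predict([vector])[0]
```

In the full system of Figure 11, the training branch would first build `tf_gini_weights` and `classifier` from the labeled training set before this identification step runs.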

4.2.1. The Chinese Word Segmentation. This module segments Chinese words and removes Stop Words. We use IKanalyzer, a lightweight open-source Chinese word segmentation tool, to complete the segmentation.

(1) IKanalyzer uses a multiprocessor module and supports the processing of English letters, Chinese vocabulary, and individual words.

(2) It optimizes dictionary storage and supports user-defined dictionaries.

4.2.2. The Feature Selection. Feature word selection is a very important preparation step for classification, and the TF-Gini algorithm performs well on Chinese data sets. Therefore, we use the TF-Gini algorithm as the core feature selection algorithm at this stage.

4.2.3. The Text Classification. We use the Naïve Bayes algorithm [30] as the main classifier. To verify its performance in this setting, we compare it with the kNN algorithm and the improved Naïve Bayes algorithm [31].

The kNN algorithm is a typical text classification algorithm and does not require a training process. To calculate the similarity between the test sample $q$ and a training sample $d_j$, we adopt the cosine similarity, which can be described as follows:

\cos(q, d_j) = \frac{\sum_{i=1}^{|V|} w_{ij} \times w_{iq}}{\sqrt{\sum_{i=1}^{|V|} w_{ij}^2} \times \sqrt{\sum_{i=1}^{|V|} w_{iq}^2}}, \quad (32)

where $w_{ij}$ is the $i$th feature word's weight in $d_j$, $w_{iq}$ is the $i$th feature word's weight in sample $q$, and $|V|$ is the dimension of the feature space.
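Equation (32) is the ordinary cosine similarity over the |V|-dimensional weight vectors; a direct NumPy sketch follows, where the toy weight vectors are assumptions for the example.

```python
import numpy as np

def cosine(q, d):
    # Equation (32): cos(q, d_j) = sum_i w_ij * w_iq / (||q|| * ||d_j||)
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-12))

q = np.array([0.2, 0.0, 0.7, 0.1])   # TF-Gini-weighted test sample
d = np.array([0.3, 0.1, 0.6, 0.0])   # TF-Gini-weighted training text
print(round(cosine(q, d), 3))
```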

The Naïve Bayes classifier is based on Bayes' theorem. Assume $m$ classes $c_1, c_2, \ldots, c_m$ and a given unknown text $q$ with no class label. The classification algorithm predicts that $q$ belongs to the class with the highest posterior probability given $q$, which can be described as follows:

P(c_i \mid q) = \frac{P(q \mid c_i)\, P(c_i)}{P(q)}. \quad (33)

Naïve Bayes classification is based on a simple assumption: given the sample's class label, the attributes are conditionally independent.


Table 3: The result of Experiment 1.

Classification          Precision (%)   Recall (%)   F1 (%)
kNN                     69.2            72           70.6
Naïve Bayes             80.77           84           82.35
Improved Naïve Bayes    71.88           92           80.7

Table 4: The result of Experiment 2.

Classification          Precision (%)   Recall (%)   F1 (%)
kNN                     96.76           55.41        70.44
Naïve Bayes             96.6            96.6         96.6
Improved Naïve Bayes    83.61           100          91.08

The probability $P(q)$ is constant, and $P(c_i)$ is also treated as a constant, so the posterior $P(c_i \mid q)$ can be simplified as follows:

P(c_i \mid q) = \prod_{j=1}^{n} P(x_j \mid c_i), \quad (34)

where $x_j$ is the $j$th feature word of text $q$ and $n$ is the number of features in $q$.
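A minimal sketch of the classifier described by (33)-(34) is given below, working in log space to avoid underflow. The add-one smoothing and the toy vocabulary and corpus are assumptions made for the example and are not specified in the paper.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """docs: list of token lists; returns log-priors and per-class log-likelihoods."""
    vocab = {w for d in docs for w in d}
    class_docs = defaultdict(list)
    for d, c in zip(docs, labels):
        class_docs[c].append(d)
    log_prior, log_like = {}, {}
    for c, ds in class_docs.items():
        log_prior[c] = math.log(len(ds) / len(docs))          # log P(c_i)
        counts = Counter(w for d in ds for w in d)
        total = sum(counts.values())
        # log P(x_j | c_i) with add-one smoothing, cf. equation (34)
        log_like[c] = {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}
    return log_prior, log_like

def predict(doc, log_prior, log_like):
    scores = {c: log_prior[c] + sum(log_like[c].get(w, 0.0) for w in doc) for c in log_prior}
    return max(scores, key=scores.get)   # class with the highest posterior, equation (33)

docs = [["violence", "attack"], ["gamble", "bet"], ["news", "sports"], ["weather", "news"]]
labels = ["sensitive", "sensitive", "nonsensitive", "nonsensitive"]
lp, ll = train_naive_bayes(docs, labels)
print(predict(["attack", "gamble"], lp, ll))   # -> "sensitive"
```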

4.3. Experimental Results and Analysis. To verify the performance of the system, we carry out two experiments, using tenfold cross validation and threefold cross validation on different data sets.

Experiment 1. The Chinese data set has 1000 articles, including 500 sensitive texts and 500 nonsensitive texts. The 1000 texts are randomly divided into 10 portions; one portion is selected as the test set, and the other 9 are used as the training set. The process is repeated 10 times, and the final result is the average of the 10 classification results. Table 3 shows the result of Experiment 1.

Experiment 2. The Chinese data set has 1500 articles, including 750 sensitive texts and 750 nonsensitive texts. All the texts are randomly divided into 3 parts; one part is the test set, and the others are the training set. The final result is the average of the 3 classification results. Table 4 shows the result of Experiment 2.
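The tenfold and threefold protocols can be reproduced with a standard stratified splitter, as in the hedged sketch below. The synthetic data and the GaussianNB stand-in classifier are assumptions, since the sensitive/nonsensitive corpus used in these experiments is not publicly specified.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

def cross_validate(X, y, classifier_factory, n_splits):
    """Average precision and recall over n_splits stratified folds, as in Experiments 1 and 2."""
    precisions, recalls = [], []
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in splitter.split(X, y):
        clf = classifier_factory()
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        tp = np.sum((pred == 1) & (y[test_idx] == 1))
        fp = np.sum((pred == 1) & (y[test_idx] == 0))
        fn = np.sum((pred == 0) & (y[test_idx] == 1))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return float(np.mean(precisions)), float(np.mean(recalls))

# 1000 synthetic "documents": 500 sensitive (label 1) and 500 nonsensitive (label 0)
rng = np.random.default_rng(0)
y = np.array([1] * 500 + [0] * 500)
X = rng.normal(loc=y[:, None], scale=1.0, size=(1000, 20))   # class-dependent features
print(cross_validate(X, y, GaussianNB, n_splits=10))   # tenfold split, as in Experiment 1
```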

From the results of Experiments 1 and 2, we can see that all three classifiers perform well, with the Naïve Bayes and Improved Naïve Bayes algorithms performing best. That is, the sensitive information system can satisfy the demands of Internet sensitive information recognition.

5. Conclusion

Feature selection is an important aspect of text preprocessing, and its performance directly affects the classifiers' accuracy. In this paper, we employ a feature selection algorithm, namely, the IGI algorithm, and compare its performance with the IG, CHI, MI, and OR algorithms. The experimental results show that the IGI algorithm has the best performance on the English data sets and is very close to the best performance on the Chinese data sets. Therefore, the IGI algorithm is a promising algorithm for text preprocessing.

Feature weighting is another important aspect of text preprocessing, since the importance of a feature differs across categories. We endow different feature words with different weights: the more important a feature word is, the bigger its weight. Following the TF-IDF theory, we construct a feature weighting algorithm, the TF-Gini algorithm, by replacing the IDF part of the TF-IDF algorithm with the IGI algorithm. To test the performance of TF-Gini, we also construct the TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR algorithms in the same way. The experimental results show that TF-Gini performs best on the Chinese data sets and is very close to the best performance on the English data sets. Hence, the TF-Gini algorithm can improve the classifiers' performance, especially on Chinese data sets.

This paper also introduces a sensitive information identification system, which can monitor and filter sensitive information on the Internet. Considering the performance of TF-Gini on Chinese data sets, we choose it as the text preprocessing algorithm in the system, with the Naïve Bayes classifier as the core classifier. The experimental results show that the system achieves good performance.

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This paper is partly supported by the engineering planning project "Research on Technology of Big Data-Based High Efficient Classification of News" funded by Communication University of China (3132015XNG1504), partly by the Comprehensive Reform Project of Computer Science and Technology funded by the Ministry of Education of China (ZL140103 and ZL1503), and partly by the Excellent Young Teachers Training Project funded by Communication University of China (YXJS201508 and XNG1526).

References

[1] S. Jinshu, Z. Bofeng, and X. Xin, "Advances in machine learning based text categorization," Journal of Software, vol. 17, no. 9, pp. 1848–1859, 2006.
[2] G. Salton and C. Yang, "On the specification of term values in automatic indexing," Journal of Documentation, vol. 29, no. 4, pp. 351–372, 1973.
[3] Z. Cheng, L. Qing, and L. Fujun, "Improved VSM algorithm and its application in FAQ," Computer Engineering, vol. 38, no. 17, pp. 201–204, 2012.
[4] X. Junling, Z. Yuming, C. Lin, and X. Baowen, "An unsupervised feature selection approach based on mutual information," Journal of Computer Research and Development, vol. 49, no. 2, pp. 372–382, 2012.
[5] Z. Zhenhai, L. Shining, and L. Zhigang, "Multi-label feature selection algorithm based on information entropy," Journal of Computer Research and Development, vol. 50, no. 6, pp. 1177–1184, 2013.
[6] L. Kousu and S. Caiqing, "Research on feature-selection in Chinese text classification," Computer Simulation, vol. 24, no. 3, pp. 289–291, 2007.
[7] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning, pp. 412–420, Nashville, Tenn, USA, July 1997.
[8] Q. Liqing, Z. Ruyi, Z. Gang et al., "An extensive empirical study of feature selection for text categorization," in Proceedings of the 7th IEEE/ACIS International Conference on Computer and Information Science, pp. 312–315, IEEE, Washington, DC, USA, May 2008.
[9] S. Wenqian, D. Hongbin, Z. Haibin, and W. Yongbin, "A novel feature weight algorithm for text categorization," in Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE '08), pp. 1–7, Beijing, China, October 2008.
[10] L. Yongmin and Z. Weidong, "Using Gini-index for feature selection in text categorization," Computer Applications, vol. 27, no. 10, pp. 66–69, 2007.
[11] L. Xuejun, "Design and research of decision tree classification algorithm based on minimum Gini index," Software Guide, vol. 8, no. 5, pp. 56–57, 2009.
[12] R. Guofeng, L. Dehua, and P. Ying, "An improved algorithm for feature weight based on Gini index," Computer & Digital Engineering, vol. 38, no. 12, pp. 8–13, 2010.
[13] L. Yongmin, L. Zhengyu, and Z. Shuang, "Analysis and improvement of feature weighting method TF-IDF in text categorization," Computer Engineering and Design, vol. 29, no. 11, pp. 2923–2925, 2929, 2008.
[14] Q. Shian and L. Fayun, "Improved TF-IDF method in text classification," New Technology of Library and Information Service, vol. 238, no. 10, pp. 27–30, 2013.
[15] L. Yonghe and L. Yanfeng, "Improvement of text feature weighting method based on TF-IDF algorithm," Library and Information Service, vol. 57, no. 3, pp. 90–95, 2013.
[16] Y. Xu, Z. Li, and J. Chen, "Parallel recognition of illegal Web pages based on improved KNN classification algorithm," Journal of Computer Applications, vol. 33, no. 12, pp. 3368–3371, 2013.
[17] Y. Liu, Y. Jian, and J. Liping, "An adaptive large margin nearest neighbor classification algorithm," Journal of Computer Research and Development, vol. 50, no. 11, pp. 2269–2277, 2013.
[18] M. Wan, G. Yang, Z. Lai, and Z. Jin, "Local discriminant embedding with applications to face recognition," IET Computer Vision, vol. 5, no. 5, pp. 301–308, 2011.
[19] U. Maulik and D. Chakraborty, "Fuzzy preference based feature selection and semi-supervised SVM for cancer classification," IEEE Transactions on Nanobioscience, vol. 13, no. 2, pp. 152–160, 2014.
[20] X. Liang, "An effective method of pruning support vector machine classifiers," IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 26–38, 2010.
[21] P. Baomao and S. Haoshan, "Research on improved algorithm for Chinese word segmentation based on Markov chain," in Proceedings of the 5th International Conference on Information Assurance and Security (IAS '09), pp. 236–238, Xi'an, China, September 2009.
[22] S. Yan, C. Dongfeng, Z. Guiping, and Z. Hai, "Approach to Chinese word segmentation based on character-word joint decoding," Journal of Software, vol. 20, no. 9, pp. 2366–2375, 2009.
[23] J. Savoy, "A stemming procedure and stopword list for general French corpora," Journal of the American Society for Information Science, vol. 50, no. 10, pp. 944–952, 1999.
[24] L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski, "Decision trees for mining data streams based on the McDiarmid's bound," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1272–1279, 2013.
[25] N. Prasad, P. Kumar, and M. M. Naidu, "An approach to prediction of precipitation using Gini index in SLIQ decision tree," in Proceedings of the 4th International Conference on Intelligent Systems, Modelling and Simulation (ISMS '13), pp. 56–60, Bangkok, Thailand, January 2013.
[26] S. S. Sundhari, "A knowledge discovery using decision tree by Gini coefficient," in Proceedings of the International Conference on Business, Engineering and Industrial Applications (ICBEIA '11), pp. 232–235, IEEE, Kuala Lumpur, Malaysia, June 2011.
[27] S. V. Stehman, "Selecting and interpreting measures of thematic classification accuracy," Remote Sensing of Environment, vol. 62, no. 1, pp. 77–89, 1997.
[28] F. Guohe, "Review of performance of text classification," Journal of Intelligence, vol. 30, no. 8, pp. 66–69, 2011.
[29] C. J. Van Rijsbergen, Information Retrieval, Butterworths, London, UK, 1979.
[30] L. Jianghao, Y. Aimin, Y. Yongmei, et al., "Classification of microblog sentiment based on Naïve Bayesian," Computer Engineering & Science, vol. 34, no. 9, pp. 160–165, 2012.
[31] Z. Kenan, Y. Baolin, M. Yaming, and H. Yingnan, "Malware classification approach based on valid window and Naïve Bayes," Journal of Computer Research and Development, vol. 51, no. 2, pp. 373–381, 2014.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 7: Research Article Improved Feature Weight Algorithm and Its ...

Mathematical Problems in Engineering 7

Therefore from Figure 2 we can get that the processof solving SVM is making the margin between 119867

1and 119867

2

maximum

min119909119894|119910119894=1

120596 sdot 119909119894

+ 119887

120596minus max119909119894119910119894=minus1

120596 sdot 119909119894

+ 119887

120596=

2

120596 (25)

The equation is also the same as minimizing 12059622

Therefore it becomes the following optimization question

min (1205962

2)

st 119910119894(120596 sdot 119909

119894+ 119887) ge 1

(26)

The formulameans the points which satisfy 119910119894(120596sdot119909119894+119887) =

1 called support vectors namely the points on 1198671and 119867

2in

Figure 2 are all support vectors In fact many samples are notclassified correctly by the hyperplane Hence we introduceslack variables Then the above formula becomes

min 1

21205962

minus 120578 sdot

|119862|

sum

119894=1

120585119894

st 119910119894(120596 sdot 119909

119894+ 119887) ge 1 minus 120585

119894

(27)

where 120578 is a positive constant to balance the experienceand confidence and |119862| is the total number of classes in thetraining set

333 Performance Evaluation Confusion matrix [27] alsoknown as a contingency table or an error matrix is a specifictable layout that allows visualization of the performance ofan algorithm typically a supervised learning algorithm Itis a table with two rows and two columns that reports thenumber of false positives false negatives true positives andtrue negatives Table 2 shows the confusion matrix

True positive (TP) is the number of correct predictionsthat an instance is positive False positive (FP) is the numberof incorrect predictions that an instance is positive Falsenegative (FN) is the number of incorrect predictions thatan instance is positive True negative (TN) is the number ofincorrect predictions that an instance is negative

Precision is the proportion of the predicted positive casesthat were correct as calculated using the equation

Precision (119875) =TP

TP + FP (28)

Recall is the proportion of the actual positive cases thatwere correct as calculated using the equation

Recall =TP

TP + FN (29)

We hope that it is better when the precision and recallhave higher values However both of them are contradictoryFor example if the result of prediction is only one andaccurate the precision is 100 and the recall is very lowbecause TP equals 1 FP equals 0 and FN is hugeOn the other

Table 2 Confusion matrix

Prediction as positive Prediction as negativeActual positive TP FNActual negative FP TN

hand if the prediction as positive contains all the results (ieFN equals 0) therefore the recall is 100The precision is lowbecause FP is large

1198651 [28] is the average of precision and recall If 1198651 is highthe algorithm and experimental methods are ideal 1198651 is tomeasure the performance of a classifier considering the abovetwo methods and was proposed by Van Rijsbergen [29] in1979 It can be described as follows

1198651 =2 times Precision times RecallPrecision + Recall

(30)

Precision and recall only evaluate the performance ofclassification algorithms in a certain category It is oftennecessary to evaluate classification performance in all the cat-egories When dealing with multiple categories there are twopossible ways of averaging these measures namely macro-average andmicroaverageThemacroaverage weights equallyall classes regardless of how many documents belong to itThe microaverage weights equally all the documents thusfavoring the performance in common classes The formulasof macroaverage and microaverage are as follows and |119862| isthe number of classes in the training set

Macro 119901 =sum|119862|

119894=1Precision

119894

|119862|

Macro 119903 =sum|119862|

119894=1Recall

119894

|119862|

Macro 1198651 =sum|119862|

119894=11198651119894

|119862|

Micro 119901 =sum|119862|

119894=1TP119894

sum|119862|

119894=1(TP119894

+ FP119894)

Micro 119903 =sum|119862|

119894=1TP119894

sum|119862|

119894=1(TP119894

+ FN119894)

Micro 1198651 =2 times Micro 119901 times Micro 119903

Micro 119901 + Micro 119903

(31)

334 Experimental Results and Analysis Figures 3 and 4are the results on the English data set The performance ofclassical GI is the lowest and the IGI is the highest There-fore the IGI algorithm is helpful to improve the classifiersrsquoperformance

From Figures 5 and 6 it can be seen that in Chinesedata sets the classical GI still has poor performanceThe bestperformance belongs to ECE and then the second is IG Theperformance of IGI is not the best Since theChinese languageis different from the English language at the processing of

8 Mathematical Problems in Engineering

GIIGIIGECE

CHIORMI

kNN fkNNSVMClassifiers

0

10

20

30

40

50

60

70

80

()

Figure 3 Macro 1198651 results on Reuters-21578 data set

GIIGIIGECE

CHIORMI

SVMClassifiers

0

10

20

30

40

50

60

70

80

()

kNN fkNN

Figure 4 Macro 1198651 results on Reuters-21578 data set

Stop Words the results of Chinese word segmentation algo-rithm will influence the classification algorithmrsquos accuracyMeanwhile the current experiments do not consider thefeature wordsrsquo weight The following TF-Gini algorithm hasa good solution to this problem

Although the IGI algorithm does not have the bestperformance in the Chinese data set its performance is veryclose to the highest Hence the IGI feature selection algo-rithm is suitable and acceptable in the classification algo-rithm It promotes the classifiersrsquo performance effectively

GIIGIIGECE

CHIORMI

SVMClassifiers

0

10

20

30

40

50

60

70

80

90

100

()

kNN fkNN

Figure 5 Macro 1198651 results on Chinese data set

GIIGIIGECE

CHIORMI

0

10

20

30

40

50

60

70

80

90

100

()

SVMClassifiers

kNN fkNN

Figure 6 Micro 1198651 results on Chinese data set

Figures 7 8 9 and 10 are results for the English and Chi-nese date sets To determine which algorithmrsquos performanceis the best we use different TF weights algorithms such asTF-IDF and TF-Gini

From Figures 7 and 8 we can see that the TF-Giniweights algorithmrsquos performance is good and acceptable forthe English data set its performance is very close to thehighest FromFigures 9 and 10 we can see that on theChinesedata set the TF-Gini algorithm overcomes the deficienciesof the IGI algorithm and gets the best performance Theseresults show that the TF-Gini algorithm is a very promisingfeature weight algorithm

Mathematical Problems in Engineering 9

TF-IDF

TF-Gini

TF-IG

TF-ECE

TF-CHI

TF-MI

TF-OR

72

74

76

78

80

82

84

86

88

90

92

()

SVMClassifiers

kNN fkNN

Figure 7 Feature weights Macro 1198651 results on Reuters-21578 dataset

TF-IDFTF-GiniTF-IGTF-ECE

TF-CHITF-MITF-OR

SVMClassifiers

70

72

74

76

78

80

82

84

86

88

90

92

()

kNN fkNN

Figure 8 Feature weights Micro 1198651 results on Reuters-21578 dataset

4 Sensitive Information Identification SystemBased on TF-Gini Algorithm

41 Introduction With the rapid development of networktechnologies the Internet has become a huge repository ofinformation At the same time much garbage informationis poured into the Internet such as a virus violence andgambling All these are called sensitive information that

TF-IDFTF-GiniTF-IGTF-ECE

TF-CHITF-MITF-OR

SVMClassifiers

0

20

40

60

80

100

()

10

30

50

70

90

kNN fkNN

Figure 9 Feature weights Macro 1198651 results on Chinese data set

TF-IDFTF-GiniTF-IGTF-ECE

TF-CHITF-MITF-OR

SVMClassifiers

0

20

40

60

80

100

()

10

30

50

70

90

kNN fkNN

Figure 10 Feature weights Micro 1198651 results on Chinese data set

will produce extremely adverse impacts and serious conse-quences

The commonly used methods of Internet informationfiltering andmonitoring are stringmatchingThere are a largenumber of synonyms and ambiguous words in natural textsMeanwhile a lot of new words and expressions appear onthe Internet every day Therefore the traditional sensitiveinformation filtering and monitoring are not effective Inthis paper we propose a new sensitive information filteringsystem based on machine learning The TF-Gini text featureselection algorithm is used in this system This system can

10 Mathematical Problems in Engineering

Start

Select a training set

Chinese word segmentation

VSM

Feature selection andweighting (TF-Gini

algorithm)

Classification

Classification model

Select a training set

Chinese word segmentation

VSM

Load model

Sensitive information

Sensitive Nonsensitive

Results evaluation

Yes No

Figure 11 Sensitive system architecture

discover the sensitive information in a fast accurate andintelligent way

42 System Architecture The sensitive system includes thetext preprocessing feature selection text representation textclassification and result evaluation modules Figure 11 showsthe sensitive system architecture

421 The Chinese Word Segmentation The function of thismodule is segmentation of Chinese words and getting rid ofStop Words In this module we use IKanalyzer to completeChinese words segmentation IKanalyzer is a lightweightChinese words segmentation tool based on an open sourcepackage

(1) IKanalyzer uses a multiprocessor module and sup-ports English letters Chinese vocabulary and indi-vidual words processing

(2) It optimizes the dictionary storage and supports theuser dictionary definition

422 The Feature Selection At the stage of feature selectionthe TF-Gini algorithm has a good performance on Chinesedata sets Feature words selection is very important prepara-tion for classification In this step we use TF-Gini algorithmas the core algorithm of feature selection

423 The Text Classification We use the Naıve Bayes algo-rithm [30] as the main classifier In order to verify the

performance of the Naıve Bayes algorithm proposed in thispaper we use the 119896NN algorithm and the improved NaıveBayes [31] algorithm to compare with it

The 119896NN algorithm is a typical algorithm in text clas-sification and does not need the training data set In orderto calculate the similarity between the test sample 119902 and thetraining samples 119889

119895 we adopt the cosine similarity It can be

described as follows

cos (119902 119889119895) =

sum|119881|

119894=1119908119894119895

times 119908119894119902

radicsum|119881|

119894=11199082

119894119895times radicsum

|119881|

119894=11199082

119894119902

(32)

where 119908119894119895is the 119894th feature wordrsquos weights in 119889

119895and 119908

119894119902is the

119894th feature wordrsquos weights in sample 119902 |119881| is the number offeature spaces

Naıve Bayes classifier is based on the principle of BayesAssume 119898 classes 119888

1 1198882 119888

119898and a given unknown text 119902

with no class label The classification algorithm will predict if119902 belongs to the class with a higher posterior probability inthe condition of 119902 It can be described as follows

119875 (119888119894

| 119902) =119875 (119902 | 119888

119894) 119875 (119888119894)

119875 (119902) (33)

Naıve Bayes classification is based on a simple assump-tion in a given sample class label the attributes are condi-tionally independent Then the probability 119875(119902) is constant

Mathematical Problems in Engineering 11

Table 3 The result of Experiment 1

Classification Precision () Recall () 1198651 ()119896NN 692 72 706Naıve Bayes 8077 84 8235Improved Naıve Bayes 7188 92 807

Table 4 The result of Experiment 2

Classification Precision () Recall () 1198651 ()119896NN 9676 5541 7044Naıve Bayes 966 966 966Improved Naıve Bayes 8361 100 9108

and the probability of 119875(119888119894) is also constant so the calculation

of 119875(119902 | 119888119894) can be simplified as follows

119875 (119888119894

| 119902) =

119899

prod

119895=1

119875 (119909119895

| 119888119894) (34)

where 119909119895is the 119895th feature wordrsquos weights of text 119902 in class 119888

119894

and 119899 is the number of features in 119902

43 Experimental Results and Analysis In order to verifythe performance of the system we use two experiments tomeasure the sensitive information system We use tenfoldcross validation and threefold cross validation in differentdata sets to verify the system

Experiment 1 TheChinese data set has 1000 articles includ-ing 500 sensitive texts and 500 nonsensitive texts The 1000texts are randomly divided into 10 portions One of theportions is selected as test set and the other 9 samples are usedas a training set The process is repeated 10 times The finalresult is the average values of 10 classification results Table 3is the result of Experiment 1

Experiment 2 TheChinese data set has 1500 articles includ-ing 750 sensitive texts and 750 nonsensitive texts All thetexts are randomly divided into 3 parts One of the parts istest set the others are training sets The final result is theaverage values of 3 classification results Table 4 is the resultof Experiment 2

From the results of Experiments 1 and 2 we can seethat all the three classifiers have a better performance TheNaıve Bayes and Improved Naıve Bayes algorithms get abetter performance That is the sensitive information systemcould satisfy the demand of the Internet sensitive informationrecognition

5 Conclusion

Feature selection is an important aspect in text preprocessingIts performance directly affects the classifiersrsquo accuracy Inthis paper we employ a feature selection algorithm that isthe IGI algorithmWe compare its performancewith IGCHIMI and OR algorithms The experiment results show that

the IGI algorithm has the best performance on the Englishdata sets and is very close to the best performance on theChinese data setsTherefore the IGI algorithm is a promisingalgorithm in the text preprocessing

Feature weighting is another important aspect in textpreprocessing since the importance degree of features is dif-ferent in different categories We endow weights on differentfeature words the more important the feature word is thebigger the weights the feature word has According to the TF-IDF theory we construct a feature weighting algorithm thatis TF-Gini algorithm We replace the IDF part of TF-IDFalgorithm with the IGI algorithm To test the performanceof TF-Gini we also construct TF-IG TF-ECE TF-CHI TF-MI and TF-OR algorithms in the same way The experimentresults show that the TF-Gini performance is the best in theChinese data sets and is very close to the best performance inthe English data sets Hence TF-Gini algorithm can improvethe classifierrsquos performance especially in Chinese data sets

This paper also introduces a sensitive information iden-tification system which can monitor and filter sensitiveinformation on the Internet Considering the performanceof TF-Gini on Chinese data sets we choose the algorithm astext preprocessing algorithm in the systemThe core classifierwhich we choose is Naıve Bayes classifier The experimentresults show that the system achieves good performance

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This paper is partly supported by the engineering planningproject Research on Technology of Big Data-Based HighEfficient Classification of News funded by CommunicationUniversity of China (3132015XNG1504) partly supported byThe Comprehensive Reform Project of Computer Scienceand Technology funded by Ministry of Education of China(ZL140103 and ZL1503) and partly supported by The Excel-lent Young Teachers Training Project funded by Communi-cation University of China (YXJS201508 and XNG1526)

References

[1] S Jinshu Z Bofeng and X Xin ldquoAdvances in machine learningbased text categorizationrdquo Journal of Software vol 17 no 9 pp1848ndash1859 2006

[2] G Salton and C Yang ldquoOn the specification of term values inautomatic indexingrdquo Journal of Documentation vol 29 no 4pp 351ndash372 1973

[3] Z Cheng L Qing and L Fujun ldquoImprovedVSMalgorithm andits application in FAQrdquoComputer Engineering vol 38 no 17 pp201ndash204 2012

[4] X Junling Z Yuming C Lin andX Baowen ldquoAnunsupervisedfeature selection approach based on mutual informationrdquo Jour-nal of Computer Research and Development vol 49 no 2 pp372ndash382 2012

[5] Z Zhenhai L Shining and L Zhigang ldquoMulti-label featureselection algorithm based on information entropyrdquo Journal of

12 Mathematical Problems in Engineering

Computer Research and Development vol 50 no 6 pp 1177ndash1184 2013

[6] L Kousu and S Caiqing ldquoResearch on feature-selection inChinese text classificationrdquo Computer Simulation vol 24 no3 pp 289ndash291 2007

[7] Y Yang and J Q Pedersen ldquoA comparative study on featureselection in text categorizationrdquo in Proceedings of the 14thInternational Conference on Machine Learning pp 412ndash420Nashville Tenn USA July 1997

[8] Q Liqing Z Ruyi Z Gang et al ldquoAn extensive empirical studyof feature selection for text categorizationrdquo in Proceedings ofthe 7th IEEEACIS International Conference on Computer andInformation Science pp 312ndash315 IEEE Washington DC USAMay 2008

[9] S Wenqian D Hongbin Z Haibin and W Yongbin ldquoA novelfeature weight algorithm for text categorizationrdquo in Proceedingsof the International Conference on Natural Language Processingand Knowledge Engineering (NLP-KE rsquo08) pp 1ndash7 BeijingChina October 2008

[10] L Yongmin and Z Weidong ldquoUsing Gini-index for featureselection in text categorizationrdquo Computer Applications vol 27no 10 pp 66ndash69 2007

[11] L Xuejun ldquoDesign and research of decision tree classificationalgorithm based on minimum Gini indexrdquo Software Guide vol8 no 5 pp 56ndash57 2009

[12] R Guofeng L Dehua and P Ying ldquoAn improved algorithmfor feature weight based on Gini indexrdquo Computer amp DigitalEngineering vol 38 no 12 pp 8ndash13 2010

[13] L Yongmin L Zhengyu andZ Shuang ldquoAnalysis and improve-ment of feature weighting method TF-IDF in text categoriza-tionrdquoComputer Engineering andDesign vol 29 no 11 pp 2923ndash2925 2929 2008

[14] Q Shian and L Fayun ldquoImproved TF-IDFmethod in text class-ificationrdquo New Technology of Library and Information Servicevol 238 no 10 pp 27ndash30 2013

[15] L Yonghe and L Yanfeng ldquoImprovement of text feature weight-ing method based on TF-IDF algorithmrdquo Library and Informa-tion Service vol 57 no 3 pp 90ndash95 2013

[16] Y Xu Z Li and J Chen ldquoParallel recognition of illegal Webpages based on improved KNN classification algorithmrdquo Jour-nal of Computer Applications vol 33 no 12 pp 3368ndash3371 2013

[17] Y Liu Y Jian and J Liping ldquoAn adaptive large marginnearest neighbor classification algorithmrdquo Journal of ComputerResearch and Development vol 50 no 11 pp 2269ndash2277 2013

[18] MWan G Yang Z Lai and Z Jin ldquoLocal discriminant embed-ding with applications to face recognitionrdquo IET ComputerVision vol 5 no 5 pp 301ndash308 2011

[19] U Maulik and D Chakraborty ldquoFuzzy preference based featureselection and semi-supervised SVM for cancer classificationrdquoIEEE Transactions on Nanobioscience vol 13 no 2 pp 152ndash1602014

[20] X Liang ldquoAn effectivemethod of pruning support vectormach-ine classifiersrdquo IEEE Transactions on Neural Networks vol 21no 1 pp 26ndash38 2010

[21] P Baomao and S Haoshan ldquoResearch on improved algorithmfor Chinese word segmentation based onMarkov chainrdquo in Pro-ceedings of the 5th International Conference on InformationAssurance and Security (IAS rsquo09) pp 236ndash238 Xirsquoan ChinaSeptember 2009

[22] S Yan C Dongfeng Z Guiping and Z Hai ldquoApproach toChinese word segmentation based on character-word joint

decodingrdquo Journal of Software vol 20 no 9 pp 2366ndash23752009

[23] J Savoy ldquoA stemming procedure and stopword list for generalFrench corporardquo Journal of the American Society for InformationScience vol 50 no 10 pp 944ndash952 1999

[24] L Rutkowski L Pietruczuk P Duda and M Jaworski ldquoDeci-sion trees for mining data streams based on the McDiarmidrsquosboundrdquo IEEE Transactions on Knowledge and Data Engineeringvol 25 no 6 pp 1272ndash1279 2013

[25] N Prasad P Kumar andMM Naidu ldquoAn approach to predic-tion of precipitation using gini index in SLIQ decision treerdquoin Proceedings of the 4th International Conference on IntelligentSystems Modelling and Simulation (ISMS rsquo13) pp 56ndash60Bangkok Thailand January 2013

[26] S S Sundhari ldquoA knowledge discovery using decision tree byGini coefficientrdquo in Proceedings of the International Conferenceon Business Engineering and Industrial Applications (ICBEIArsquo11) pp 232ndash235 IEEE Kuala Lumpur Malaysia June 2011

[27] S V Stehman ldquoSelecting and interpretingmeasures of thematicclassification accuracyrdquo Remote Sensing of Environment vol 62no 1 pp 77ndash89 1997

[28] F Guohe ldquoReview of performance of text classificationrdquo Journalof Intelligence vol 30 no 8 pp 66ndash69 2011

[29] C J Van Rijsbergen Information Retrieval Butter-WorthsLondon UK 1979

[30] L Jianghao YAimin Y Yongmei et al ldquoClassification ofmicro-blog sentiment based onNaıve ByaesianrdquoComputer Engineeringamp Science vol 34 no 9 pp 160ndash165 2012

[31] Z Kenan Y Baolin M Yaming and H Yingnan ldquoMalwareclassification approach based on validwindowandNaıve bayesrdquoJournal of Computer Research andDevelopment vol 51 no 2 pp373ndash381 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 8: Research Article Improved Feature Weight Algorithm and Its ...

8 Mathematical Problems in Engineering

GIIGIIGECE

CHIORMI

kNN fkNNSVMClassifiers

0

10

20

30

40

50

60

70

80

()

Figure 3 Macro 1198651 results on Reuters-21578 data set

GIIGIIGECE

CHIORMI

SVMClassifiers

0

10

20

30

40

50

60

70

80

()

kNN fkNN

Figure 4 Macro 1198651 results on Reuters-21578 data set

Stop Words the results of Chinese word segmentation algo-rithm will influence the classification algorithmrsquos accuracyMeanwhile the current experiments do not consider thefeature wordsrsquo weight The following TF-Gini algorithm hasa good solution to this problem

Although the IGI algorithm does not have the bestperformance in the Chinese data set its performance is veryclose to the highest Hence the IGI feature selection algo-rithm is suitable and acceptable in the classification algo-rithm It promotes the classifiersrsquo performance effectively

GIIGIIGECE

CHIORMI

SVMClassifiers

0

10

20

30

40

50

60

70

80

90

100

()

kNN fkNN

Figure 5 Macro 1198651 results on Chinese data set

GIIGIIGECE

CHIORMI

0

10

20

30

40

50

60

70

80

90

100

()

SVMClassifiers

kNN fkNN

Figure 6 Micro 1198651 results on Chinese data set

Figures 7 8 9 and 10 are results for the English and Chi-nese date sets To determine which algorithmrsquos performanceis the best we use different TF weights algorithms such asTF-IDF and TF-Gini

From Figures 7 and 8 we can see that the TF-Giniweights algorithmrsquos performance is good and acceptable forthe English data set its performance is very close to thehighest FromFigures 9 and 10 we can see that on theChinesedata set the TF-Gini algorithm overcomes the deficienciesof the IGI algorithm and gets the best performance Theseresults show that the TF-Gini algorithm is a very promisingfeature weight algorithm

Mathematical Problems in Engineering 9

TF-IDF

TF-Gini

TF-IG

TF-ECE

TF-CHI

TF-MI

TF-OR

72

74

76

78

80

82

84

86

88

90

92

()

SVMClassifiers

kNN fkNN

Figure 7 Feature weights Macro 1198651 results on Reuters-21578 dataset

TF-IDFTF-GiniTF-IGTF-ECE

TF-CHITF-MITF-OR

SVMClassifiers

70

72

74

76

78

80

82

84

86

88

90

92

()

kNN fkNN

Figure 8 Feature weights Micro 1198651 results on Reuters-21578 dataset

4 Sensitive Information Identification SystemBased on TF-Gini Algorithm

41 Introduction With the rapid development of networktechnologies the Internet has become a huge repository ofinformation At the same time much garbage informationis poured into the Internet such as a virus violence andgambling All these are called sensitive information that

TF-IDFTF-GiniTF-IGTF-ECE

TF-CHITF-MITF-OR

SVMClassifiers

0

20

40

60

80

100

()

10

30

50

70

90

kNN fkNN

Figure 9 Feature weights Macro 1198651 results on Chinese data set

TF-IDFTF-GiniTF-IGTF-ECE

TF-CHITF-MITF-OR

SVMClassifiers

0

20

40

60

80

100

()

10

30

50

70

90

kNN fkNN

Figure 10 Feature weights Micro 1198651 results on Chinese data set

will produce extremely adverse impacts and serious conse-quences

The commonly used methods of Internet informationfiltering andmonitoring are stringmatchingThere are a largenumber of synonyms and ambiguous words in natural textsMeanwhile a lot of new words and expressions appear onthe Internet every day Therefore the traditional sensitiveinformation filtering and monitoring are not effective Inthis paper we propose a new sensitive information filteringsystem based on machine learning The TF-Gini text featureselection algorithm is used in this system This system can

10 Mathematical Problems in Engineering

Start

Select a training set

Chinese word segmentation

VSM

Feature selection andweighting (TF-Gini

algorithm)

Classification

Classification model

Select a training set

Chinese word segmentation

VSM

Load model

Sensitive information

Sensitive Nonsensitive

Results evaluation

Yes No

Figure 11 Sensitive system architecture

discover the sensitive information in a fast accurate andintelligent way

42 System Architecture The sensitive system includes thetext preprocessing feature selection text representation textclassification and result evaluation modules Figure 11 showsthe sensitive system architecture

4.2.1. The Chinese Word Segmentation. The function of this module is to segment Chinese text into words and to remove stop words. In this module we use IKanalyzer, a lightweight Chinese word segmentation tool based on an open source package, to complete the segmentation.

(1) IKanalyzer uses a multiprocessor module and supports the processing of English letters, Chinese vocabulary, and individual characters.

(2) It optimizes dictionary storage and supports user-defined dictionaries.

4.2.2. The Feature Selection. At the feature selection stage, the TF-Gini algorithm performs well on Chinese data sets. Selecting feature words is important preparation for classification, so in this step we use the TF-Gini algorithm as the core feature selection and weighting algorithm.
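To make this step concrete, the following is a minimal Python sketch of TF-Gini-style weighting, assuming a corpus of already segmented documents with class labels. The helper names (gini_score, tf_gini_weights) are illustrative, and the purity score sum_i P(c_i | t)^2 is a simplified stand-in; the exact Improved Gini Index formula defined earlier in the paper should be substituted for a faithful implementation.

```python
from collections import Counter

def gini_score(term, docs, labels):
    """Gini-style purity score of a term over the class labels.

    NOTE: simplified stand-in for the paper's Improved Gini Index:
    it scores a term by sum_i P(c_i | term)^2, i.e. how purely the
    term concentrates in one class.
    """
    class_counts = Counter(
        label for doc, label in zip(docs, labels) if term in doc
    )
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) ** 2 for n in class_counts.values())

def tf_gini_weights(doc, docs, labels):
    """Weight each term of `doc` by term frequency x Gini-style score,
    mirroring the idea of replacing the IDF factor of TF-IDF."""
    tf = Counter(doc)
    n = len(doc)
    return {t: (tf[t] / n) * gini_score(t, docs, labels) for t in tf}

# Toy usage: each document is a list of already segmented words.
docs = [["price", "market", "stock"], ["goal", "match", "team"], ["market", "trade"]]
labels = ["finance", "sports", "finance"]
print(tf_gini_weights(["market", "stock", "stock"], docs, labels))
```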

4.2.3. The Text Classification. We use the Naive Bayes algorithm [30] as the main classifier. In order to verify the performance of the Naive Bayes algorithm adopted in this paper, we compare it with the kNN algorithm and the improved Naive Bayes algorithm [31].

The kNN algorithm is a typical text classification algorithm and does not require an explicit training phase. To calculate the similarity between the test sample q and a training sample d_j, we adopt the cosine similarity. It can be described as follows:

\[
\cos(q, d_j) = \frac{\sum_{i=1}^{|V|} w_{ij} \times w_{iq}}{\sqrt{\sum_{i=1}^{|V|} w_{ij}^{2}} \times \sqrt{\sum_{i=1}^{|V|} w_{iq}^{2}}}, \tag{32}
\]

where w_ij is the ith feature word's weight in d_j, w_iq is the ith feature word's weight in sample q, and |V| is the size of the feature space.
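As an illustration of equation (32) and the subsequent voting step, here is a minimal sketch, assuming each document has already been mapped to a dictionary of feature weights (for example, the TF-Gini weights above); cosine and knn_classify are hypothetical helper names.

```python
import math
from collections import Counter

def cosine(q, d):
    """Cosine similarity of two term-weight dicts, as in equation (32)."""
    common = set(q) & set(d)
    num = sum(q[t] * d[t] for t in common)
    den = math.sqrt(sum(w * w for w in q.values())) * \
          math.sqrt(sum(w * w for w in d.values()))
    return num / den if den else 0.0

def knn_classify(q, train, k=3):
    """train: list of (weight_dict, label) pairs; majority vote over
    the k training samples most similar to q."""
    neighbors = sorted(train, key=lambda x: cosine(q, x[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```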

The Naive Bayes classifier is based on Bayes' theorem. Assume m classes c_1, c_2, ..., c_m and a given unknown text q with no class label. The classification algorithm predicts that q belongs to the class with the highest posterior probability given q. It can be described as follows:

\[
P(c_i \mid q) = \frac{P(q \mid c_i)\, P(c_i)}{P(q)}. \tag{33}
\]

Naive Bayes classification is based on a simple assumption: given the sample's class label, the attributes are conditionally independent. The probability P(q) is constant, and P(c_i) is also treated as a constant factor, so the posterior P(c_i | q) can be simplified as follows:

\[
P(c_i \mid q) = \prod_{j=1}^{n} P(x_j \mid c_i), \tag{34}
\]

where x_j is the jth feature word of text q and n is the number of features in q.
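The following is a minimal sketch of equations (33) and (34) for word-list documents. Log probabilities and add-one (Laplace) smoothing are used here for numerical robustness; the paper does not specify a smoothing scheme, so that part is an assumption, and the class and method names are illustrative.

```python
import math
from collections import Counter

class NaiveBayesText:
    """Multinomial Naive Bayes over segmented word lists, following
    equations (33)-(34). Add-one smoothing is an extra assumption."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, c in zip(docs, labels):
            self.word_counts[c].update(doc)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        best, best_score = None, float("-inf")
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            # log P(c) + sum_j log P(x_j | c), with add-one smoothing
            score = math.log(self.priors[c])
            for w in doc:
                p = (self.word_counts[c][w] + 1) / (total + len(self.vocab))
                score += math.log(p)
            if score > best_score:
                best, best_score = c, score
        return best
```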

4.3. Experimental Results and Analysis. In order to verify the performance of the system, we run two experiments on different data sets, using tenfold cross validation and threefold cross validation, respectively.
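A minimal sketch of this evaluation protocol is shown below, assuming a caller-supplied train_and_eval function that trains a classifier on the training folds and returns a score on the held-out fold; all names are illustrative.

```python
import random

def k_fold_cross_validate(samples, labels, train_and_eval, k=10, seed=0):
    """Shuffle the data, split it into k folds, train on k-1 folds and
    test on the held-out fold, and average the resulting scores.
    `train_and_eval(train, test)` must return a metric such as F1."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_idx = set(folds[i])
        train = [(samples[j], labels[j]) for j in idx if j not in test_idx]
        test = [(samples[j], labels[j]) for j in folds[i]]
        scores.append(train_and_eval(train, test))
    return sum(scores) / k
```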

Experiment 1. The Chinese data set has 1000 articles, including 500 sensitive texts and 500 nonsensitive texts. The 1000 texts are randomly divided into 10 portions; one portion is selected as the test set and the other 9 are used as the training set. The process is repeated 10 times, and the final result is the average of the 10 classification results. Table 3 shows the result of Experiment 1.

Table 3: The result of Experiment 1.

Classification          Precision (%)   Recall (%)   F1 (%)
kNN                     69.2            72           70.6
Naive Bayes             80.77           84           82.35
Improved Naive Bayes    71.88           92           80.7

Experiment 2. The Chinese data set has 1500 articles, including 750 sensitive texts and 750 nonsensitive texts. All the texts are randomly divided into 3 parts; one part is the test set and the others are the training set. The final result is the average of the 3 classification results. Table 4 shows the result of Experiment 2.

Table 4: The result of Experiment 2.

Classification          Precision (%)   Recall (%)   F1 (%)
kNN                     96.76           55.41        70.44
Naive Bayes             96.6            96.6         96.6
Improved Naive Bayes    83.61           100          91.08
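As a quick sanity check, the F1 column in Tables 3 and 4 is the harmonic mean of precision and recall; the snippet below reproduces the Table 3 values.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the F1 column of Table 3 (values in %).
print(round(f1(69.2, 72), 1))    # kNN                  -> 70.6
print(round(f1(80.77, 84), 2))   # Naive Bayes          -> 82.35
print(round(f1(71.88, 92), 1))   # Improved Naive Bayes -> 80.7
```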

From the results of Experiments 1 and 2 we can see that all three classifiers perform reasonably well, with the Naive Bayes and Improved Naive Bayes algorithms performing better. That is, the sensitive information system can satisfy the demands of Internet sensitive information recognition.

5. Conclusion

Feature selection is an important aspect of text preprocessing, and its performance directly affects the classifiers' accuracy. In this paper we employ a feature selection algorithm, the IGI algorithm, and compare its performance with the IG, CHI, MI, and OR algorithms. The experimental results show that the IGI algorithm has the best performance on the English data sets and is very close to the best performance on the Chinese data sets. Therefore, the IGI algorithm is a promising algorithm for text preprocessing.

Feature weighting is another important aspect of text preprocessing, since the importance of a feature differs across categories. We endow different feature words with weights: the more important a feature word is, the larger its weight. Following the TF-IDF framework, we construct a feature weighting algorithm, the TF-Gini algorithm, by replacing the IDF part of TF-IDF with the IGI algorithm. To test the performance of TF-Gini, we also construct the TF-IG, TF-ECE, TF-CHI, TF-MI, and TF-OR algorithms in the same way. The experimental results show that TF-Gini performs best on the Chinese data sets and is very close to the best performance on the English data sets. Hence, the TF-Gini algorithm can improve the classifiers' performance, especially on Chinese data sets.

This paper also introduces a sensitive information identification system, which can monitor and filter sensitive information on the Internet. Considering the performance of TF-Gini on Chinese data sets, we choose this algorithm as the text preprocessing algorithm of the system and the Naive Bayes classifier as its core classifier. The experimental results show that the system achieves good performance.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This paper is partly supported by the engineering planning project "Research on Technology of Big Data-Based High Efficient Classification of News" funded by Communication University of China (3132015XNG1504), partly supported by "The Comprehensive Reform Project of Computer Science and Technology" funded by the Ministry of Education of China (ZL140103 and ZL1503), and partly supported by "The Excellent Young Teachers Training Project" funded by Communication University of China (YXJS201508 and XNG1526).

References

[1] S. Jinshu, Z. Bofeng, and X. Xin, "Advances in machine learning based text categorization," Journal of Software, vol. 17, no. 9, pp. 1848–1859, 2006.

[2] G. Salton and C. Yang, "On the specification of term values in automatic indexing," Journal of Documentation, vol. 29, no. 4, pp. 351–372, 1973.

[3] Z. Cheng, L. Qing, and L. Fujun, "Improved VSM algorithm and its application in FAQ," Computer Engineering, vol. 38, no. 17, pp. 201–204, 2012.

[4] X. Junling, Z. Yuming, C. Lin, and X. Baowen, "An unsupervised feature selection approach based on mutual information," Journal of Computer Research and Development, vol. 49, no. 2, pp. 372–382, 2012.

[5] Z. Zhenhai, L. Shining, and L. Zhigang, "Multi-label feature selection algorithm based on information entropy," Journal of Computer Research and Development, vol. 50, no. 6, pp. 1177–1184, 2013.

[6] L. Kousu and S. Caiqing, "Research on feature-selection in Chinese text classification," Computer Simulation, vol. 24, no. 3, pp. 289–291, 2007.

[7] Y. Yang and J. Q. Pedersen, "A comparative study on feature selection in text categorization," in Proceedings of the 14th International Conference on Machine Learning, pp. 412–420, Nashville, Tenn, USA, July 1997.

[8] Q. Liqing, Z. Ruyi, Z. Gang et al., "An extensive empirical study of feature selection for text categorization," in Proceedings of the 7th IEEE/ACIS International Conference on Computer and Information Science, pp. 312–315, IEEE, Washington, DC, USA, May 2008.

[9] S. Wenqian, D. Hongbin, Z. Haibin, and W. Yongbin, "A novel feature weight algorithm for text categorization," in Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE '08), pp. 1–7, Beijing, China, October 2008.

[10] L. Yongmin and Z. Weidong, "Using Gini-index for feature selection in text categorization," Computer Applications, vol. 27, no. 10, pp. 66–69, 2007.

[11] L. Xuejun, "Design and research of decision tree classification algorithm based on minimum Gini index," Software Guide, vol. 8, no. 5, pp. 56–57, 2009.

[12] R. Guofeng, L. Dehua, and P. Ying, "An improved algorithm for feature weight based on Gini index," Computer & Digital Engineering, vol. 38, no. 12, pp. 8–13, 2010.

[13] L. Yongmin, L. Zhengyu, and Z. Shuang, "Analysis and improvement of feature weighting method TF-IDF in text categorization," Computer Engineering and Design, vol. 29, no. 11, pp. 2923–2925, 2929, 2008.

[14] Q. Shian and L. Fayun, "Improved TF-IDF method in text classification," New Technology of Library and Information Service, vol. 238, no. 10, pp. 27–30, 2013.

[15] L. Yonghe and L. Yanfeng, "Improvement of text feature weighting method based on TF-IDF algorithm," Library and Information Service, vol. 57, no. 3, pp. 90–95, 2013.

[16] Y. Xu, Z. Li, and J. Chen, "Parallel recognition of illegal Web pages based on improved KNN classification algorithm," Journal of Computer Applications, vol. 33, no. 12, pp. 3368–3371, 2013.

[17] Y. Liu, Y. Jian, and J. Liping, "An adaptive large margin nearest neighbor classification algorithm," Journal of Computer Research and Development, vol. 50, no. 11, pp. 2269–2277, 2013.

[18] M. Wan, G. Yang, Z. Lai, and Z. Jin, "Local discriminant embedding with applications to face recognition," IET Computer Vision, vol. 5, no. 5, pp. 301–308, 2011.

[19] U. Maulik and D. Chakraborty, "Fuzzy preference based feature selection and semi-supervised SVM for cancer classification," IEEE Transactions on Nanobioscience, vol. 13, no. 2, pp. 152–160, 2014.

[20] X. Liang, "An effective method of pruning support vector machine classifiers," IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 26–38, 2010.

[21] P. Baomao and S. Haoshan, "Research on improved algorithm for Chinese word segmentation based on Markov chain," in Proceedings of the 5th International Conference on Information Assurance and Security (IAS '09), pp. 236–238, Xi'an, China, September 2009.

[22] S. Yan, C. Dongfeng, Z. Guiping, and Z. Hai, "Approach to Chinese word segmentation based on character-word joint decoding," Journal of Software, vol. 20, no. 9, pp. 2366–2375, 2009.

[23] J. Savoy, "A stemming procedure and stopword list for general French corpora," Journal of the American Society for Information Science, vol. 50, no. 10, pp. 944–952, 1999.

[24] L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski, "Decision trees for mining data streams based on the McDiarmid's bound," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1272–1279, 2013.

[25] N. Prasad, P. Kumar, and M. M. Naidu, "An approach to prediction of precipitation using Gini index in SLIQ decision tree," in Proceedings of the 4th International Conference on Intelligent Systems, Modelling and Simulation (ISMS '13), pp. 56–60, Bangkok, Thailand, January 2013.

[26] S. S. Sundhari, "A knowledge discovery using decision tree by Gini coefficient," in Proceedings of the International Conference on Business, Engineering and Industrial Applications (ICBEIA '11), pp. 232–235, IEEE, Kuala Lumpur, Malaysia, June 2011.

[27] S. V. Stehman, "Selecting and interpreting measures of thematic classification accuracy," Remote Sensing of Environment, vol. 62, no. 1, pp. 77–89, 1997.

[28] F. Guohe, "Review of performance of text classification," Journal of Intelligence, vol. 30, no. 8, pp. 66–69, 2011.

[29] C. J. Van Rijsbergen, Information Retrieval, Butterworths, London, UK, 1979.

[30] L. Jianghao, Y. Aimin, Y. Yongmei et al., "Classification of micro-blog sentiment based on Naïve Bayesian," Computer Engineering & Science, vol. 34, no. 9, pp. 160–165, 2012.

[31] Z. Kenan, Y. Baolin, M. Yaming, and H. Yingnan, "Malware classification approach based on valid window and Naïve Bayes," Journal of Computer Research and Development, vol. 51, no. 2, pp. 373–381, 2014.


Page 9: Research Article Improved Feature Weight Algorithm and Its ...

Mathematical Problems in Engineering 9

TF-IDF

TF-Gini

TF-IG

TF-ECE

TF-CHI

TF-MI

TF-OR

72

74

76

78

80

82

84

86

88

90

92

()

SVMClassifiers

kNN fkNN

Figure 7 Feature weights Macro 1198651 results on Reuters-21578 dataset

TF-IDFTF-GiniTF-IGTF-ECE

TF-CHITF-MITF-OR

SVMClassifiers

70

72

74

76

78

80

82

84

86

88

90

92

()

kNN fkNN

Figure 8 Feature weights Micro 1198651 results on Reuters-21578 dataset

4 Sensitive Information Identification SystemBased on TF-Gini Algorithm

41 Introduction With the rapid development of networktechnologies the Internet has become a huge repository ofinformation At the same time much garbage informationis poured into the Internet such as a virus violence andgambling All these are called sensitive information that

TF-IDFTF-GiniTF-IGTF-ECE

TF-CHITF-MITF-OR

SVMClassifiers

0

20

40

60

80

100

()

10

30

50

70

90

kNN fkNN

Figure 9 Feature weights Macro 1198651 results on Chinese data set

TF-IDFTF-GiniTF-IGTF-ECE

TF-CHITF-MITF-OR

SVMClassifiers

0

20

40

60

80

100

()

10

30

50

70

90

kNN fkNN

Figure 10 Feature weights Micro 1198651 results on Chinese data set

will produce extremely adverse impacts and serious conse-quences

The commonly used methods of Internet informationfiltering andmonitoring are stringmatchingThere are a largenumber of synonyms and ambiguous words in natural textsMeanwhile a lot of new words and expressions appear onthe Internet every day Therefore the traditional sensitiveinformation filtering and monitoring are not effective Inthis paper we propose a new sensitive information filteringsystem based on machine learning The TF-Gini text featureselection algorithm is used in this system This system can

10 Mathematical Problems in Engineering

Start

Select a training set

Chinese word segmentation

VSM

Feature selection andweighting (TF-Gini

algorithm)

Classification

Classification model

Select a training set

Chinese word segmentation

VSM

Load model

Sensitive information

Sensitive Nonsensitive

Results evaluation

Yes No

Figure 11 Sensitive system architecture

discover the sensitive information in a fast accurate andintelligent way

42 System Architecture The sensitive system includes thetext preprocessing feature selection text representation textclassification and result evaluation modules Figure 11 showsthe sensitive system architecture

421 The Chinese Word Segmentation The function of thismodule is segmentation of Chinese words and getting rid ofStop Words In this module we use IKanalyzer to completeChinese words segmentation IKanalyzer is a lightweightChinese words segmentation tool based on an open sourcepackage

(1) IKanalyzer uses a multiprocessor module and sup-ports English letters Chinese vocabulary and indi-vidual words processing

(2) It optimizes the dictionary storage and supports theuser dictionary definition

422 The Feature Selection At the stage of feature selectionthe TF-Gini algorithm has a good performance on Chinesedata sets Feature words selection is very important prepara-tion for classification In this step we use TF-Gini algorithmas the core algorithm of feature selection

423 The Text Classification We use the Naıve Bayes algo-rithm [30] as the main classifier In order to verify the

performance of the Naıve Bayes algorithm proposed in thispaper we use the 119896NN algorithm and the improved NaıveBayes [31] algorithm to compare with it

The 119896NN algorithm is a typical algorithm in text clas-sification and does not need the training data set In orderto calculate the similarity between the test sample 119902 and thetraining samples 119889

119895 we adopt the cosine similarity It can be

described as follows

cos (119902 119889119895) =

sum|119881|

119894=1119908119894119895

times 119908119894119902

radicsum|119881|

119894=11199082

119894119895times radicsum

|119881|

119894=11199082

119894119902

(32)

where 119908119894119895is the 119894th feature wordrsquos weights in 119889

119895and 119908

119894119902is the

119894th feature wordrsquos weights in sample 119902 |119881| is the number offeature spaces

Naıve Bayes classifier is based on the principle of BayesAssume 119898 classes 119888

1 1198882 119888

119898and a given unknown text 119902

with no class label The classification algorithm will predict if119902 belongs to the class with a higher posterior probability inthe condition of 119902 It can be described as follows

119875 (119888119894

| 119902) =119875 (119902 | 119888

119894) 119875 (119888119894)

119875 (119902) (33)

Naıve Bayes classification is based on a simple assump-tion in a given sample class label the attributes are condi-tionally independent Then the probability 119875(119902) is constant

Mathematical Problems in Engineering 11

Table 3 The result of Experiment 1

Classification Precision () Recall () 1198651 ()119896NN 692 72 706Naıve Bayes 8077 84 8235Improved Naıve Bayes 7188 92 807

Table 4 The result of Experiment 2

Classification Precision () Recall () 1198651 ()119896NN 9676 5541 7044Naıve Bayes 966 966 966Improved Naıve Bayes 8361 100 9108

and the probability of 119875(119888119894) is also constant so the calculation

of 119875(119902 | 119888119894) can be simplified as follows

119875 (119888119894

| 119902) =

119899

prod

119895=1

119875 (119909119895

| 119888119894) (34)

where 119909119895is the 119895th feature wordrsquos weights of text 119902 in class 119888

119894

and 119899 is the number of features in 119902

43 Experimental Results and Analysis In order to verifythe performance of the system we use two experiments tomeasure the sensitive information system We use tenfoldcross validation and threefold cross validation in differentdata sets to verify the system

Experiment 1 TheChinese data set has 1000 articles includ-ing 500 sensitive texts and 500 nonsensitive texts The 1000texts are randomly divided into 10 portions One of theportions is selected as test set and the other 9 samples are usedas a training set The process is repeated 10 times The finalresult is the average values of 10 classification results Table 3is the result of Experiment 1

Experiment 2 TheChinese data set has 1500 articles includ-ing 750 sensitive texts and 750 nonsensitive texts All thetexts are randomly divided into 3 parts One of the parts istest set the others are training sets The final result is theaverage values of 3 classification results Table 4 is the resultof Experiment 2

From the results of Experiments 1 and 2 we can seethat all the three classifiers have a better performance TheNaıve Bayes and Improved Naıve Bayes algorithms get abetter performance That is the sensitive information systemcould satisfy the demand of the Internet sensitive informationrecognition

5 Conclusion

Feature selection is an important aspect in text preprocessingIts performance directly affects the classifiersrsquo accuracy Inthis paper we employ a feature selection algorithm that isthe IGI algorithmWe compare its performancewith IGCHIMI and OR algorithms The experiment results show that

the IGI algorithm has the best performance on the Englishdata sets and is very close to the best performance on theChinese data setsTherefore the IGI algorithm is a promisingalgorithm in the text preprocessing

Feature weighting is another important aspect in textpreprocessing since the importance degree of features is dif-ferent in different categories We endow weights on differentfeature words the more important the feature word is thebigger the weights the feature word has According to the TF-IDF theory we construct a feature weighting algorithm thatis TF-Gini algorithm We replace the IDF part of TF-IDFalgorithm with the IGI algorithm To test the performanceof TF-Gini we also construct TF-IG TF-ECE TF-CHI TF-MI and TF-OR algorithms in the same way The experimentresults show that the TF-Gini performance is the best in theChinese data sets and is very close to the best performance inthe English data sets Hence TF-Gini algorithm can improvethe classifierrsquos performance especially in Chinese data sets

This paper also introduces a sensitive information iden-tification system which can monitor and filter sensitiveinformation on the Internet Considering the performanceof TF-Gini on Chinese data sets we choose the algorithm astext preprocessing algorithm in the systemThe core classifierwhich we choose is Naıve Bayes classifier The experimentresults show that the system achieves good performance

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This paper is partly supported by the engineering planningproject Research on Technology of Big Data-Based HighEfficient Classification of News funded by CommunicationUniversity of China (3132015XNG1504) partly supported byThe Comprehensive Reform Project of Computer Scienceand Technology funded by Ministry of Education of China(ZL140103 and ZL1503) and partly supported by The Excel-lent Young Teachers Training Project funded by Communi-cation University of China (YXJS201508 and XNG1526)

References

[1] S Jinshu Z Bofeng and X Xin ldquoAdvances in machine learningbased text categorizationrdquo Journal of Software vol 17 no 9 pp1848ndash1859 2006

[2] G Salton and C Yang ldquoOn the specification of term values inautomatic indexingrdquo Journal of Documentation vol 29 no 4pp 351ndash372 1973

[3] Z Cheng L Qing and L Fujun ldquoImprovedVSMalgorithm andits application in FAQrdquoComputer Engineering vol 38 no 17 pp201ndash204 2012

[4] X Junling Z Yuming C Lin andX Baowen ldquoAnunsupervisedfeature selection approach based on mutual informationrdquo Jour-nal of Computer Research and Development vol 49 no 2 pp372ndash382 2012

[5] Z Zhenhai L Shining and L Zhigang ldquoMulti-label featureselection algorithm based on information entropyrdquo Journal of

12 Mathematical Problems in Engineering

Computer Research and Development vol 50 no 6 pp 1177ndash1184 2013

[6] L Kousu and S Caiqing ldquoResearch on feature-selection inChinese text classificationrdquo Computer Simulation vol 24 no3 pp 289ndash291 2007

[7] Y Yang and J Q Pedersen ldquoA comparative study on featureselection in text categorizationrdquo in Proceedings of the 14thInternational Conference on Machine Learning pp 412ndash420Nashville Tenn USA July 1997

[8] Q Liqing Z Ruyi Z Gang et al ldquoAn extensive empirical studyof feature selection for text categorizationrdquo in Proceedings ofthe 7th IEEEACIS International Conference on Computer andInformation Science pp 312ndash315 IEEE Washington DC USAMay 2008

[9] S Wenqian D Hongbin Z Haibin and W Yongbin ldquoA novelfeature weight algorithm for text categorizationrdquo in Proceedingsof the International Conference on Natural Language Processingand Knowledge Engineering (NLP-KE rsquo08) pp 1ndash7 BeijingChina October 2008

[10] L Yongmin and Z Weidong ldquoUsing Gini-index for featureselection in text categorizationrdquo Computer Applications vol 27no 10 pp 66ndash69 2007

[11] L Xuejun ldquoDesign and research of decision tree classificationalgorithm based on minimum Gini indexrdquo Software Guide vol8 no 5 pp 56ndash57 2009

[12] R Guofeng L Dehua and P Ying ldquoAn improved algorithmfor feature weight based on Gini indexrdquo Computer amp DigitalEngineering vol 38 no 12 pp 8ndash13 2010

[13] L Yongmin L Zhengyu andZ Shuang ldquoAnalysis and improve-ment of feature weighting method TF-IDF in text categoriza-tionrdquoComputer Engineering andDesign vol 29 no 11 pp 2923ndash2925 2929 2008

[14] Q Shian and L Fayun ldquoImproved TF-IDFmethod in text class-ificationrdquo New Technology of Library and Information Servicevol 238 no 10 pp 27ndash30 2013

[15] L Yonghe and L Yanfeng ldquoImprovement of text feature weight-ing method based on TF-IDF algorithmrdquo Library and Informa-tion Service vol 57 no 3 pp 90ndash95 2013

[16] Y Xu Z Li and J Chen ldquoParallel recognition of illegal Webpages based on improved KNN classification algorithmrdquo Jour-nal of Computer Applications vol 33 no 12 pp 3368ndash3371 2013

[17] Y Liu Y Jian and J Liping ldquoAn adaptive large marginnearest neighbor classification algorithmrdquo Journal of ComputerResearch and Development vol 50 no 11 pp 2269ndash2277 2013

[18] MWan G Yang Z Lai and Z Jin ldquoLocal discriminant embed-ding with applications to face recognitionrdquo IET ComputerVision vol 5 no 5 pp 301ndash308 2011

[19] U Maulik and D Chakraborty ldquoFuzzy preference based featureselection and semi-supervised SVM for cancer classificationrdquoIEEE Transactions on Nanobioscience vol 13 no 2 pp 152ndash1602014

[20] X Liang ldquoAn effectivemethod of pruning support vectormach-ine classifiersrdquo IEEE Transactions on Neural Networks vol 21no 1 pp 26ndash38 2010

[21] P Baomao and S Haoshan ldquoResearch on improved algorithmfor Chinese word segmentation based onMarkov chainrdquo in Pro-ceedings of the 5th International Conference on InformationAssurance and Security (IAS rsquo09) pp 236ndash238 Xirsquoan ChinaSeptember 2009

[22] S Yan C Dongfeng Z Guiping and Z Hai ldquoApproach toChinese word segmentation based on character-word joint

decodingrdquo Journal of Software vol 20 no 9 pp 2366ndash23752009

[23] J Savoy ldquoA stemming procedure and stopword list for generalFrench corporardquo Journal of the American Society for InformationScience vol 50 no 10 pp 944ndash952 1999

[24] L Rutkowski L Pietruczuk P Duda and M Jaworski ldquoDeci-sion trees for mining data streams based on the McDiarmidrsquosboundrdquo IEEE Transactions on Knowledge and Data Engineeringvol 25 no 6 pp 1272ndash1279 2013

[25] N Prasad P Kumar andMM Naidu ldquoAn approach to predic-tion of precipitation using gini index in SLIQ decision treerdquoin Proceedings of the 4th International Conference on IntelligentSystems Modelling and Simulation (ISMS rsquo13) pp 56ndash60Bangkok Thailand January 2013

[26] S S Sundhari ldquoA knowledge discovery using decision tree byGini coefficientrdquo in Proceedings of the International Conferenceon Business Engineering and Industrial Applications (ICBEIArsquo11) pp 232ndash235 IEEE Kuala Lumpur Malaysia June 2011

[27] S V Stehman ldquoSelecting and interpretingmeasures of thematicclassification accuracyrdquo Remote Sensing of Environment vol 62no 1 pp 77ndash89 1997

[28] F Guohe ldquoReview of performance of text classificationrdquo Journalof Intelligence vol 30 no 8 pp 66ndash69 2011

[29] C J Van Rijsbergen Information Retrieval Butter-WorthsLondon UK 1979

[30] L Jianghao YAimin Y Yongmei et al ldquoClassification ofmicro-blog sentiment based onNaıve ByaesianrdquoComputer Engineeringamp Science vol 34 no 9 pp 160ndash165 2012

[31] Z Kenan Y Baolin M Yaming and H Yingnan ldquoMalwareclassification approach based on validwindowandNaıve bayesrdquoJournal of Computer Research andDevelopment vol 51 no 2 pp373ndash381 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 10: Research Article Improved Feature Weight Algorithm and Its ...

10 Mathematical Problems in Engineering

Start

Select a training set

Chinese word segmentation

VSM

Feature selection andweighting (TF-Gini

algorithm)

Classification

Classification model

Select a training set

Chinese word segmentation

VSM

Load model

Sensitive information

Sensitive Nonsensitive

Results evaluation

Yes No

Figure 11 Sensitive system architecture

discover the sensitive information in a fast accurate andintelligent way

42 System Architecture The sensitive system includes thetext preprocessing feature selection text representation textclassification and result evaluation modules Figure 11 showsthe sensitive system architecture

421 The Chinese Word Segmentation The function of thismodule is segmentation of Chinese words and getting rid ofStop Words In this module we use IKanalyzer to completeChinese words segmentation IKanalyzer is a lightweightChinese words segmentation tool based on an open sourcepackage

(1) IKanalyzer uses a multiprocessor module and sup-ports English letters Chinese vocabulary and indi-vidual words processing

(2) It optimizes the dictionary storage and supports theuser dictionary definition

422 The Feature Selection At the stage of feature selectionthe TF-Gini algorithm has a good performance on Chinesedata sets Feature words selection is very important prepara-tion for classification In this step we use TF-Gini algorithmas the core algorithm of feature selection

423 The Text Classification We use the Naıve Bayes algo-rithm [30] as the main classifier In order to verify the

performance of the Naıve Bayes algorithm proposed in thispaper we use the 119896NN algorithm and the improved NaıveBayes [31] algorithm to compare with it

The 119896NN algorithm is a typical algorithm in text clas-sification and does not need the training data set In orderto calculate the similarity between the test sample 119902 and thetraining samples 119889

119895 we adopt the cosine similarity It can be

described as follows

cos (119902 119889119895) =

sum|119881|

119894=1119908119894119895

times 119908119894119902

radicsum|119881|

119894=11199082

119894119895times radicsum

|119881|

119894=11199082

119894119902

(32)

where 119908119894119895is the 119894th feature wordrsquos weights in 119889

119895and 119908

119894119902is the

119894th feature wordrsquos weights in sample 119902 |119881| is the number offeature spaces

Naıve Bayes classifier is based on the principle of BayesAssume 119898 classes 119888

1 1198882 119888

119898and a given unknown text 119902

with no class label The classification algorithm will predict if119902 belongs to the class with a higher posterior probability inthe condition of 119902 It can be described as follows

119875 (119888119894

| 119902) =119875 (119902 | 119888

119894) 119875 (119888119894)

119875 (119902) (33)

Naıve Bayes classification is based on a simple assump-tion in a given sample class label the attributes are condi-tionally independent Then the probability 119875(119902) is constant

Mathematical Problems in Engineering 11

Table 3 The result of Experiment 1

Classification Precision () Recall () 1198651 ()119896NN 692 72 706Naıve Bayes 8077 84 8235Improved Naıve Bayes 7188 92 807

Table 4 The result of Experiment 2

Classification Precision () Recall () 1198651 ()119896NN 9676 5541 7044Naıve Bayes 966 966 966Improved Naıve Bayes 8361 100 9108

and the probability of 119875(119888119894) is also constant so the calculation

of 119875(119902 | 119888119894) can be simplified as follows

119875 (119888119894

| 119902) =

119899

prod

119895=1

119875 (119909119895

| 119888119894) (34)

where 119909119895is the 119895th feature wordrsquos weights of text 119902 in class 119888

119894

and 119899 is the number of features in 119902

43 Experimental Results and Analysis In order to verifythe performance of the system we use two experiments tomeasure the sensitive information system We use tenfoldcross validation and threefold cross validation in differentdata sets to verify the system

Experiment 1 TheChinese data set has 1000 articles includ-ing 500 sensitive texts and 500 nonsensitive texts The 1000texts are randomly divided into 10 portions One of theportions is selected as test set and the other 9 samples are usedas a training set The process is repeated 10 times The finalresult is the average values of 10 classification results Table 3is the result of Experiment 1

Experiment 2 TheChinese data set has 1500 articles includ-ing 750 sensitive texts and 750 nonsensitive texts All thetexts are randomly divided into 3 parts One of the parts istest set the others are training sets The final result is theaverage values of 3 classification results Table 4 is the resultof Experiment 2

From the results of Experiments 1 and 2 we can seethat all the three classifiers have a better performance TheNaıve Bayes and Improved Naıve Bayes algorithms get abetter performance That is the sensitive information systemcould satisfy the demand of the Internet sensitive informationrecognition

5 Conclusion

Feature selection is an important aspect in text preprocessingIts performance directly affects the classifiersrsquo accuracy Inthis paper we employ a feature selection algorithm that isthe IGI algorithmWe compare its performancewith IGCHIMI and OR algorithms The experiment results show that

the IGI algorithm has the best performance on the Englishdata sets and is very close to the best performance on theChinese data setsTherefore the IGI algorithm is a promisingalgorithm in the text preprocessing

Feature weighting is another important aspect in textpreprocessing since the importance degree of features is dif-ferent in different categories We endow weights on differentfeature words the more important the feature word is thebigger the weights the feature word has According to the TF-IDF theory we construct a feature weighting algorithm thatis TF-Gini algorithm We replace the IDF part of TF-IDFalgorithm with the IGI algorithm To test the performanceof TF-Gini we also construct TF-IG TF-ECE TF-CHI TF-MI and TF-OR algorithms in the same way The experimentresults show that the TF-Gini performance is the best in theChinese data sets and is very close to the best performance inthe English data sets Hence TF-Gini algorithm can improvethe classifierrsquos performance especially in Chinese data sets

This paper also introduces a sensitive information iden-tification system which can monitor and filter sensitiveinformation on the Internet Considering the performanceof TF-Gini on Chinese data sets we choose the algorithm astext preprocessing algorithm in the systemThe core classifierwhich we choose is Naıve Bayes classifier The experimentresults show that the system achieves good performance

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This paper is partly supported by the engineering planningproject Research on Technology of Big Data-Based HighEfficient Classification of News funded by CommunicationUniversity of China (3132015XNG1504) partly supported byThe Comprehensive Reform Project of Computer Scienceand Technology funded by Ministry of Education of China(ZL140103 and ZL1503) and partly supported by The Excel-lent Young Teachers Training Project funded by Communi-cation University of China (YXJS201508 and XNG1526)

References

[1] S Jinshu Z Bofeng and X Xin ldquoAdvances in machine learningbased text categorizationrdquo Journal of Software vol 17 no 9 pp1848ndash1859 2006

[2] G Salton and C Yang ldquoOn the specification of term values inautomatic indexingrdquo Journal of Documentation vol 29 no 4pp 351ndash372 1973

[3] Z Cheng L Qing and L Fujun ldquoImprovedVSMalgorithm andits application in FAQrdquoComputer Engineering vol 38 no 17 pp201ndash204 2012

[4] X Junling Z Yuming C Lin andX Baowen ldquoAnunsupervisedfeature selection approach based on mutual informationrdquo Jour-nal of Computer Research and Development vol 49 no 2 pp372ndash382 2012

[5] Z Zhenhai L Shining and L Zhigang ldquoMulti-label featureselection algorithm based on information entropyrdquo Journal of

12 Mathematical Problems in Engineering

Computer Research and Development vol 50 no 6 pp 1177ndash1184 2013

[6] L Kousu and S Caiqing ldquoResearch on feature-selection inChinese text classificationrdquo Computer Simulation vol 24 no3 pp 289ndash291 2007

[7] Y Yang and J Q Pedersen ldquoA comparative study on featureselection in text categorizationrdquo in Proceedings of the 14thInternational Conference on Machine Learning pp 412ndash420Nashville Tenn USA July 1997

[8] Q Liqing Z Ruyi Z Gang et al ldquoAn extensive empirical studyof feature selection for text categorizationrdquo in Proceedings ofthe 7th IEEEACIS International Conference on Computer andInformation Science pp 312ndash315 IEEE Washington DC USAMay 2008

[9] S Wenqian D Hongbin Z Haibin and W Yongbin ldquoA novelfeature weight algorithm for text categorizationrdquo in Proceedingsof the International Conference on Natural Language Processingand Knowledge Engineering (NLP-KE rsquo08) pp 1ndash7 BeijingChina October 2008

[10] L Yongmin and Z Weidong ldquoUsing Gini-index for featureselection in text categorizationrdquo Computer Applications vol 27no 10 pp 66ndash69 2007

[11] L Xuejun ldquoDesign and research of decision tree classificationalgorithm based on minimum Gini indexrdquo Software Guide vol8 no 5 pp 56ndash57 2009

[12] R Guofeng L Dehua and P Ying ldquoAn improved algorithmfor feature weight based on Gini indexrdquo Computer amp DigitalEngineering vol 38 no 12 pp 8ndash13 2010

[13] L Yongmin L Zhengyu andZ Shuang ldquoAnalysis and improve-ment of feature weighting method TF-IDF in text categoriza-tionrdquoComputer Engineering andDesign vol 29 no 11 pp 2923ndash2925 2929 2008

[14] Q Shian and L Fayun ldquoImproved TF-IDFmethod in text class-ificationrdquo New Technology of Library and Information Servicevol 238 no 10 pp 27ndash30 2013

[15] L Yonghe and L Yanfeng ldquoImprovement of text feature weight-ing method based on TF-IDF algorithmrdquo Library and Informa-tion Service vol 57 no 3 pp 90ndash95 2013

[16] Y Xu Z Li and J Chen ldquoParallel recognition of illegal Webpages based on improved KNN classification algorithmrdquo Jour-nal of Computer Applications vol 33 no 12 pp 3368ndash3371 2013

[17] Y Liu Y Jian and J Liping ldquoAn adaptive large marginnearest neighbor classification algorithmrdquo Journal of ComputerResearch and Development vol 50 no 11 pp 2269ndash2277 2013

[18] MWan G Yang Z Lai and Z Jin ldquoLocal discriminant embed-ding with applications to face recognitionrdquo IET ComputerVision vol 5 no 5 pp 301ndash308 2011

[19] U Maulik and D Chakraborty ldquoFuzzy preference based featureselection and semi-supervised SVM for cancer classificationrdquoIEEE Transactions on Nanobioscience vol 13 no 2 pp 152ndash1602014

[20] X Liang ldquoAn effectivemethod of pruning support vectormach-ine classifiersrdquo IEEE Transactions on Neural Networks vol 21no 1 pp 26ndash38 2010

[21] P Baomao and S Haoshan ldquoResearch on improved algorithmfor Chinese word segmentation based onMarkov chainrdquo in Pro-ceedings of the 5th International Conference on InformationAssurance and Security (IAS rsquo09) pp 236ndash238 Xirsquoan ChinaSeptember 2009

[22] S Yan C Dongfeng Z Guiping and Z Hai ldquoApproach toChinese word segmentation based on character-word joint

decodingrdquo Journal of Software vol 20 no 9 pp 2366ndash23752009

[23] J Savoy ldquoA stemming procedure and stopword list for generalFrench corporardquo Journal of the American Society for InformationScience vol 50 no 10 pp 944ndash952 1999

[24] L Rutkowski L Pietruczuk P Duda and M Jaworski ldquoDeci-sion trees for mining data streams based on the McDiarmidrsquosboundrdquo IEEE Transactions on Knowledge and Data Engineeringvol 25 no 6 pp 1272ndash1279 2013

[25] N Prasad P Kumar andMM Naidu ldquoAn approach to predic-tion of precipitation using gini index in SLIQ decision treerdquoin Proceedings of the 4th International Conference on IntelligentSystems Modelling and Simulation (ISMS rsquo13) pp 56ndash60Bangkok Thailand January 2013

[26] S S Sundhari ldquoA knowledge discovery using decision tree byGini coefficientrdquo in Proceedings of the International Conferenceon Business Engineering and Industrial Applications (ICBEIArsquo11) pp 232ndash235 IEEE Kuala Lumpur Malaysia June 2011

[27] S V Stehman ldquoSelecting and interpretingmeasures of thematicclassification accuracyrdquo Remote Sensing of Environment vol 62no 1 pp 77ndash89 1997

[28] F Guohe ldquoReview of performance of text classificationrdquo Journalof Intelligence vol 30 no 8 pp 66ndash69 2011

[29] C J Van Rijsbergen Information Retrieval Butter-WorthsLondon UK 1979

[30] L Jianghao YAimin Y Yongmei et al ldquoClassification ofmicro-blog sentiment based onNaıve ByaesianrdquoComputer Engineeringamp Science vol 34 no 9 pp 160ndash165 2012

[31] Z Kenan Y Baolin M Yaming and H Yingnan ldquoMalwareclassification approach based on validwindowandNaıve bayesrdquoJournal of Computer Research andDevelopment vol 51 no 2 pp373ndash381 2014

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 11: Research Article Improved Feature Weight Algorithm and Its ...

Mathematical Problems in Engineering 11

Table 3 The result of Experiment 1

Classification Precision () Recall () 1198651 ()119896NN 692 72 706Naıve Bayes 8077 84 8235Improved Naıve Bayes 7188 92 807

Table 4 The result of Experiment 2

Classification Precision () Recall () 1198651 ()119896NN 9676 5541 7044Naıve Bayes 966 966 966Improved Naıve Bayes 8361 100 9108

and the probability of 119875(119888119894) is also constant so the calculation

of 119875(119902 | 119888119894) can be simplified as follows

119875 (119888119894

| 119902) =

119899

prod

119895=1

119875 (119909119895

| 119888119894) (34)

where 119909119895is the 119895th feature wordrsquos weights of text 119902 in class 119888

119894

and 119899 is the number of features in 119902

43 Experimental Results and Analysis In order to verifythe performance of the system we use two experiments tomeasure the sensitive information system We use tenfoldcross validation and threefold cross validation in differentdata sets to verify the system

Experiment 1 TheChinese data set has 1000 articles includ-ing 500 sensitive texts and 500 nonsensitive texts The 1000texts are randomly divided into 10 portions One of theportions is selected as test set and the other 9 samples are usedas a training set The process is repeated 10 times The finalresult is the average values of 10 classification results Table 3is the result of Experiment 1

Experiment 2 TheChinese data set has 1500 articles includ-ing 750 sensitive texts and 750 nonsensitive texts All thetexts are randomly divided into 3 parts One of the parts istest set the others are training sets The final result is theaverage values of 3 classification results Table 4 is the resultof Experiment 2

From the results of Experiments 1 and 2 we can seethat all the three classifiers have a better performance TheNaıve Bayes and Improved Naıve Bayes algorithms get abetter performance That is the sensitive information systemcould satisfy the demand of the Internet sensitive informationrecognition

5 Conclusion

Feature selection is an important aspect in text preprocessingIts performance directly affects the classifiersrsquo accuracy Inthis paper we employ a feature selection algorithm that isthe IGI algorithmWe compare its performancewith IGCHIMI and OR algorithms The experiment results show that

the IGI algorithm has the best performance on the Englishdata sets and is very close to the best performance on theChinese data setsTherefore the IGI algorithm is a promisingalgorithm in the text preprocessing

Feature weighting is another important aspect in textpreprocessing since the importance degree of features is dif-ferent in different categories We endow weights on differentfeature words the more important the feature word is thebigger the weights the feature word has According to the TF-IDF theory we construct a feature weighting algorithm thatis TF-Gini algorithm We replace the IDF part of TF-IDFalgorithm with the IGI algorithm To test the performanceof TF-Gini we also construct TF-IG TF-ECE TF-CHI TF-MI and TF-OR algorithms in the same way The experimentresults show that the TF-Gini performance is the best in theChinese data sets and is very close to the best performance inthe English data sets Hence TF-Gini algorithm can improvethe classifierrsquos performance especially in Chinese data sets

This paper also introduces a sensitive information iden-tification system which can monitor and filter sensitiveinformation on the Internet Considering the performanceof TF-Gini on Chinese data sets we choose the algorithm astext preprocessing algorithm in the systemThe core classifierwhich we choose is Naıve Bayes classifier The experimentresults show that the system achieves good performance

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This paper is partly supported by the engineering planningproject Research on Technology of Big Data-Based HighEfficient Classification of News funded by CommunicationUniversity of China (3132015XNG1504) partly supported byThe Comprehensive Reform Project of Computer Scienceand Technology funded by Ministry of Education of China(ZL140103 and ZL1503) and partly supported by The Excel-lent Young Teachers Training Project funded by Communi-cation University of China (YXJS201508 and XNG1526)

References

[1] S Jinshu Z Bofeng and X Xin ldquoAdvances in machine learningbased text categorizationrdquo Journal of Software vol 17 no 9 pp1848ndash1859 2006

[2] G Salton and C Yang ldquoOn the specification of term values inautomatic indexingrdquo Journal of Documentation vol 29 no 4pp 351ndash372 1973

[3] Z Cheng L Qing and L Fujun ldquoImprovedVSMalgorithm andits application in FAQrdquoComputer Engineering vol 38 no 17 pp201ndash204 2012

[4] X Junling Z Yuming C Lin andX Baowen ldquoAnunsupervisedfeature selection approach based on mutual informationrdquo Jour-nal of Computer Research and Development vol 49 no 2 pp372ndash382 2012

[5] Z Zhenhai L Shining and L Zhigang ldquoMulti-label featureselection algorithm based on information entropyrdquo Journal of

12 Mathematical Problems in Engineering

Computer Research and Development vol 50 no 6 pp 1177ndash1184 2013

[6] L Kousu and S Caiqing ldquoResearch on feature-selection inChinese text classificationrdquo Computer Simulation vol 24 no3 pp 289ndash291 2007

[7] Y Yang and J Q Pedersen ldquoA comparative study on featureselection in text categorizationrdquo in Proceedings of the 14thInternational Conference on Machine Learning pp 412ndash420Nashville Tenn USA July 1997

[8] Q Liqing Z Ruyi Z Gang et al ldquoAn extensive empirical studyof feature selection for text categorizationrdquo in Proceedings ofthe 7th IEEEACIS International Conference on Computer andInformation Science pp 312ndash315 IEEE Washington DC USAMay 2008

[9] S Wenqian D Hongbin Z Haibin and W Yongbin ldquoA novelfeature weight algorithm for text categorizationrdquo in Proceedingsof the International Conference on Natural Language Processingand Knowledge Engineering (NLP-KE rsquo08) pp 1ndash7 BeijingChina October 2008

[10] L Yongmin and Z Weidong ldquoUsing Gini-index for featureselection in text categorizationrdquo Computer Applications vol 27no 10 pp 66ndash69 2007

[11] L Xuejun ldquoDesign and research of decision tree classificationalgorithm based on minimum Gini indexrdquo Software Guide vol8 no 5 pp 56ndash57 2009

[12] R Guofeng L Dehua and P Ying ldquoAn improved algorithmfor feature weight based on Gini indexrdquo Computer amp DigitalEngineering vol 38 no 12 pp 8ndash13 2010

[13] L Yongmin L Zhengyu andZ Shuang ldquoAnalysis and improve-ment of feature weighting method TF-IDF in text categoriza-tionrdquoComputer Engineering andDesign vol 29 no 11 pp 2923ndash2925 2929 2008

[14] Q Shian and L Fayun ldquoImproved TF-IDFmethod in text class-ificationrdquo New Technology of Library and Information Servicevol 238 no 10 pp 27ndash30 2013

[15] L Yonghe and L Yanfeng ldquoImprovement of text feature weight-ing method based on TF-IDF algorithmrdquo Library and Informa-tion Service vol 57 no 3 pp 90ndash95 2013

[16] Y Xu Z Li and J Chen ldquoParallel recognition of illegal Webpages based on improved KNN classification algorithmrdquo Jour-nal of Computer Applications vol 33 no 12 pp 3368ndash3371 2013

[17] Y Liu Y Jian and J Liping ldquoAn adaptive large marginnearest neighbor classification algorithmrdquo Journal of ComputerResearch and Development vol 50 no 11 pp 2269ndash2277 2013

[18] MWan G Yang Z Lai and Z Jin ldquoLocal discriminant embed-ding with applications to face recognitionrdquo IET ComputerVision vol 5 no 5 pp 301ndash308 2011

[19] U Maulik and D Chakraborty ldquoFuzzy preference based featureselection and semi-supervised SVM for cancer classificationrdquoIEEE Transactions on Nanobioscience vol 13 no 2 pp 152ndash1602014

[20] X Liang ldquoAn effectivemethod of pruning support vectormach-ine classifiersrdquo IEEE Transactions on Neural Networks vol 21no 1 pp 26ndash38 2010

[21] P Baomao and S Haoshan ldquoResearch on improved algorithmfor Chinese word segmentation based onMarkov chainrdquo in Pro-ceedings of the 5th International Conference on InformationAssurance and Security (IAS rsquo09) pp 236ndash238 Xirsquoan ChinaSeptember 2009

[22] S Yan C Dongfeng Z Guiping and Z Hai ldquoApproach toChinese word segmentation based on character-word joint

decodingrdquo Journal of Software vol 20 no 9 pp 2366ndash23752009

[23] J Savoy ldquoA stemming procedure and stopword list for generalFrench corporardquo Journal of the American Society for InformationScience vol 50 no 10 pp 944ndash952 1999

[24] L Rutkowski L Pietruczuk P Duda and M Jaworski ldquoDeci-sion trees for mining data streams based on the McDiarmidrsquosboundrdquo IEEE Transactions on Knowledge and Data Engineeringvol 25 no 6 pp 1272ndash1279 2013

[25] N Prasad P Kumar andMM Naidu ldquoAn approach to predic-tion of precipitation using gini index in SLIQ decision treerdquoin Proceedings of the 4th International Conference on IntelligentSystems Modelling and Simulation (ISMS rsquo13) pp 56ndash60Bangkok Thailand January 2013

[26] S S Sundhari ldquoA knowledge discovery using decision tree byGini coefficientrdquo in Proceedings of the International Conferenceon Business Engineering and Industrial Applications (ICBEIArsquo11) pp 232ndash235 IEEE Kuala Lumpur Malaysia June 2011

[27] S V Stehman ldquoSelecting and interpretingmeasures of thematicclassification accuracyrdquo Remote Sensing of Environment vol 62no 1 pp 77ndash89 1997

[28] F Guohe ldquoReview of performance of text classificationrdquo Journalof Intelligence vol 30 no 8 pp 66ndash69 2011

[29] C J Van Rijsbergen Information Retrieval Butter-WorthsLondon UK 1979

[30] L Jianghao YAimin Y Yongmei et al ldquoClassification ofmicro-blog sentiment based onNaıve ByaesianrdquoComputer Engineeringamp Science vol 34 no 9 pp 160ndash165 2012

[31] Z Kenan Y Baolin M Yaming and H Yingnan ldquoMalwareclassification approach based on validwindowandNaıve bayesrdquoJournal of Computer Research andDevelopment vol 51 no 2 pp373ndash381 2014
