Research Article
A Data Mining Classification Approach for Behavioral Malware Detection

Monire Norouzi,1 Alireza Souri,2 and Majid Samad Zamini3

1 Young Researchers and Elite Club, Islamic Azad University, Hadishahr Branch, Hadishahr, Iran
2 Department of Computer Engineering, Islamic Azad University, Hadishahr Branch, Hadishahr, Iran
3 Department of Computer Engineering, Islamic Azad University, Sardroud Branch, Sardroud, Iran

Correspondence should be addressed to Alireza Souri; [email protected]

Received 29 November 2015; Revised 21 June 2016; Accepted 28 June 2016

Academic Editor: Zhiyong Xu

Hindawi Publishing Corporation, Journal of Computer Networks and Communications, Volume 2016, Article ID 8069672, 9 pages, http://dx.doi.org/10.1155/2016/8069672

Copyright © 2016 Monire Norouzi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Data mining techniques have numerous applications in malware detection, and classification is one of the most popular of these techniques. In this paper we present a data mining classification approach to detect malware behavior. We apply different classification methods in order to detect malware based on the features and behavior of each malware. A dynamic analysis method is presented for identifying the malware features. A converter program is presented for transforming a malware behavior executive history XML file into a suitable WEKA tool input. To illustrate the performance efficiency on training and test data, we apply the proposed approaches to a real case study data set using the WEKA tool. The evaluation results demonstrate the feasibility of the proposed data mining approach. The proposed approach is also efficient for detecting malware, and behavioral classification of malware can be useful for detecting malware in a behavioral antivirus.

1. Introduction

Malicious code, called malware, is one of the serious threats on the internet platform [1]. Malware is a malicious application that is deliberately designed to damage networks and computers [2]. Traditional malware detection depends on a signature database [3, 4]. For example, a file can be examined by comparing its bytes against the signature database. If the bytes match a signature, the suspicious file is recognized as a malicious file [5, 6]. Some studies consider signature-based malware detection less than entirely dependable, because it cannot handle dynamic modification of malware behavior and cannot identify hidden malware. In contrast, behavior-based malware detection can find the real behavior of a malicious file [7, 8].

The objectives of data mining include improving marketing capabilities, detecting irregular patterns, and predicting the future based on past experience [9]. These capabilities can be used to identify suspicious programs that have destructive content for computer systems, such as Virus, Worm, and Trojan [10]. The word malware is used in [11, 12] to denote a destructive file. Data mining techniques rely on data sets that contain individual characteristics of malicious files and benign software in order to construct classification methods for malware detection [13, 14].

Because the amount of malware keeps growing, protection against unknown malware is an essential topic in malware detection with machine learning methods. Generally, data mining approaches take both malicious executables and benign software programs as the set of programs in the wild [13, 15, 16]. Data mining algorithms can usually be categorized into two forms: supervised and unsupervised learning procedures. The supervised learning methods are called classification algorithms and need a training data set [13, 17]. In contrast, the unsupervised learning methods are called clustering algorithms, which attempt to organize data into different clusters [18, 19].

Usually, malware programs are classified into types such as Worm, Virus, Trojan, Spyware, Backdoor, and Rootkit [10, 20–22]. The base of typical and traditional
approaches to identifying malware is the use of signature-based techniques. In recent years, the failure of older methods to detect unrecognized or polymorphic malicious files frustrated researchers, who attempted to present more dependable approaches that detect malware by its behavior [23]. Malware detection is performed with two types of analysis: static analysis and dynamic analysis. Among software analysis methods, analysis without running the code is called static analysis; it can detect malicious code and assign it to one of the available categories based on different learning methods [24]. In static analysis, malicious files and malware are detected based on binary code. The main disadvantage of static analysis is that the source code of the program is unavailable. It is worth noting that extracting binary code is a relatively complex task.

In contrast, dynamic analysis detects malicious code according to its runtime behavior [10]. Runtime code analysis is called dynamic analysis, also known as behavior analysis, and consists of observing the behavior and system interaction of a program [23]. The dynamic analysis mechanism needs to execute the infected files in a virtual machine [21]. Dynamic analysis can be used with classification and clustering methods to cope with the increasing volume and range of malware. Malware classification methods help to assign unknown malware to recognized families [7, 20]. Therefore, malware classification is used to filter unknown cases and thus decreases the cost of analysis [8, 25–29].

The contributions of this paper are as follows:

(i) Proposing a behavioral analysis mechanism for malware detection.

(ii) Presenting a converter program for transforming a malware behavior executive history XML file into a suitable WEKA input.

(iii) Discussing some classification methods on a real case study of malware.

(iv) Comparing experimental results such as Correctly Classified Instances, mean absolute error, and true optimistic ratio on the real data set using the WEKA tool.

(v) Testing the best classification method based on the important features in malware detection in order to develop a behavioral antivirus.

The rest of this paper is organized as follows. Section 2 discusses background and related work on malware detection and data mining techniques. Section 3 describes the malware behavioral analysis. In that section we propose a new approach for analyzing malware behavior and translating malicious files into data mining files using a real case study, describe the classification and prediction approaches on a data mining platform, and apply some popular classification methods to our real case study using the WEKA tool. The evaluation and experimental results are reported in Section 4. Section 5 concludes the discussion and outlines future work.

2. Related Works

This section gives a brief background and discusses some related works on malware detection with data mining methods. First, we briefly review data mining approaches based on classification methods in malware and other systems. Recently, researchers have presented different approaches to malware analysis. Schultz et al. [30] proposed a data mining method to recognize new malicious files at runtime. Their method was based on three types of DLL information: the list of DLLs used by the binary, the list of DLL function calls, and the number of different system calls used within each DLL. They also examined byte sequences extracted from the hex dump (a hexadecimal representation of computer data) of an executable file using signature methods. The main structure of this method is based on the Naive Bayes (NB) algorithm. They compared their experimental results with traditional signature-based methods.

Kolter and Maloof [31] also presented a data mining approach with n-gram analysis to identify malicious executable files based on a signature approach. They used a hex-dump utility to translate each executable file into hexadecimal codes in an ASCII format. Their main data set consisted of clean programs and malicious programs. They analyzed the proposed approach with popular classification methods such as an instance-based learner, TFIDF, Naive Bayes, support vector machines, decision trees, boosted Naive Bayes, and boosted decision trees. In other research, Siddiqui et al. [32] proposed data mining techniques for recognizing malware programs such as Worms. They considered variable-length instruction sequences in their approach. Their main data set includes Windows files and Worms. In their experimental results, sequence reduction removed 97% of the sequences, and the random forest decision tree model performed slightly better than the others.
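As a rough illustration of the n-gram analysis mentioned above, the following sketch counts overlapping byte n-grams in a hex-dump string. The function name `hex_ngrams` and the sample bytes are assumptions made for this example; they are not taken from the tools of Kolter and Maloof [31].

```python
from collections import Counter


def hex_ngrams(hex_dump: str, n: int = 4) -> Counter:
    """Count overlapping byte n-grams in a whitespace-separated hex dump."""
    tokens = hex_dump.split()
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


# Hypothetical sample bytes, for illustration only.
sample = "4d 5a 90 00 4d 5a 90 00"
counts = hex_ngrams(sample, n=2)
```

Each distinct n-gram can then serve as one candidate feature for a classifier, with its count (or presence) as the feature value.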

Some research works have presented data mining methodologies for other domains. For example, in [33] the researchers presented various data mining methods that have been developed for cancer diagnosis. This research focused on capturing clinical information that can be obtained without surgery as a substitute for the pathology report. They sought to discover the association between the clinical information and the pathology report in order to support pathologic staging diagnosis of lung cancer using data mining techniques. In other research [1, 34], the authors proposed a data mining approach to analyze students' careers. Their approach is based on clustering and sequential methods with the aim of identifying strategies for improving the performance of students and the scheduling of exams. They analyzed a real case study using K-means clustering techniques in the WEKA tool. Likewise, [26] presented a new data mining method for the problem of detecting phishing websites using a developed associative classification method called multilabel classifier, which generates rules with multiple labels. They analyzed the experimental results with various patterns in the WEKA software. Also, the researchers in [35] analyzed several decision tree models to classify patients in hospital surveillance data as a real case study. The experimental results of their analysis showed that their approach achieved a more even distribution of instances in each class. Other related work [36] used a neuro-fuzzy data mining approach for classification with generalized bell-shaped membership functions. They applied the proposed technique to ten real standard data sets from the UCI machine learning repository and evaluated the classification using the Kappa statistic. They simulated the proposed technique in MATLAB.

Some researches focused on other approaches that consist of host behavior classification methods [37–40]. For example, [29] presented a novel supervised discretization technique for analyzing multivariate time series, which uses frequent temporal patterns as features for time series classification and is geared toward improving classification accuracy. This paper used a temporal abstraction classification approach and time intervals mining for multivariate time series. Also, [38] presented a novel mechanism based on Artificial Neural Networks (ANN) for detecting computer Worms from behavioral computer events. Based on measurements of different parameters of the infected computers, the ANN, decision tree, and K-nearest neighbors classification techniques were compared. In other research [41], the authors presented a mechanism that extracts computer measurements for identifying unknown computer Worm activity in the operating system using support vector approaches. That paper ran a series of experiments to test the new technique under several computer configurations and activities.

[Diagram: XML malware behavior executive history → converter (based on behavioral features) → nonsparse matrix → WEKA (DataSet1.arff) → training with classification algorithms → best classification algorithm → test → malware detection; new malware is added to a behavioral antivirus platform.]

Figure 1: The behavioral analysis of malware detection mechanism.

To the best of our knowledge, there is no approach that analyzes malware behavior precisely on a data mining platform, and there is no approach for converting a malware behavior XML executive history file into a suitable WEKA tool input. To address this gap, we present a new approach for translating a malicious file to the data mining platform, which can serve as the basis of a behavioral antivirus. We then consider some classification methods for evaluating our approach based on malware behavior.

Figure 2: A snapshot of the XML converter to a nonsparse matrix.

3. Malware Behavior Analysis

In this section we propose a malware behavioral analysis mechanism, as shown in Figure 1. In this mechanism, an XML file of the malware behavior executive history is converted to a nonsparse matrix using a suggested application. This application was written in the VB.Net language. Figure 2 shows a snapshot of the conversion of XML to a nonsparse matrix using our suggested application. The procedure of converting each XML file into a suitable WEKA input considers two elements: the number of calls to the library files that are attacked by the malware and their volume. For example, in Box 1, the library file ntdll.dll has been called 16 times by the malware, which are between (0, 2). Then we translate this matrix to a WEKA input data set. Training is carried out with several classification algorithms. The classification algorithm with the best performance is chosen for the test platform with a new data set of malware. Finally, this procedure can be used for developing a behavioral antivirus.

    <?xml version="1.0"?>
    <!-- This analysis was created by CWSandbox (c) CWSE GmbH / Sunbelt Software -->
    <analysis cwsversion="2.1.12" time="08.08.2009 05:22:19"
        file="c:\260589951029048b3e6d93316b3c2507"
        md5="260589951029048b3e6d93316b3c2507"
        sha1="0089453df77890ae95ce7d9130a4ef85eaea36e8"
        logpath="c:\cwsandbox\log\260589951029048b3e6d93316b3c2507\run_1"
        analysisid="647702" sampleid="431657">
      <calltree>
        <process_call index="1" pid="1940"
            filename="c:\260589951029048b3e6d93316b3c2507" starttime="00:01.922"
            startreason="AnalysisTarget">
          <calltree>
            <process_call index="2" pid="2084"
                filename="C:\Programme\Internet Explorer\iexplore.exe"
                starttime="00:05.343" startreason="CreateProcess" />
          </calltree>
        </process_call>
        <process_call index="3" pid="948"
            filename="C:\WINDOWS\system32\svchost.exe" starttime="00:07.062"
            startreason="DCOMService" />
      </calltree>
      <processes>
        <process index="1" pid="1940"
            filename="c:\260589951029048b3e6d93316b3c2507" filesize="761856"
            md5="260589951029048b3e6d93316b3c2507"
            sha1="0089453df77890ae95ce7d9130a4ef85eaea36e8" username="Administrator"
            parentindex="0" starttime="00:01.922" terminationtime="00:07.484"
            startreason="AnalysisTarget" terminationreason="NormalTermination"
            executionstatus="OK" applicationtype="Win32Application">
          <dll_handling_section>
            <load_image filename="c:\260589951029048b3e6d93316b3c2507" successful="1"
                address="$400000" end_address="$4C1000" size="790528" />
            <load_dll filename="C:\WINDOWS\system32\ntdll.dll" successful="1"
                address="$7C910000" end_address="$7C9C9000" size="757760" quantity="16" />
            <load_dll filename="C:\WINDOWS\system32\kernel32.dll" successful="1"
                address="$7C800000" end_address="$7C908000" size="1081344" quantity="2" />
            <load_dll filename="C:\WINDOWS\system32\gdi32.dll" successful="1"
                address="$77EF0000" end_address="$77F39000" size="299008" quantity="2" />
            <load_dll filename="C:\WINDOWS\system32\USER32.dll" successful="1"
                address="$7E360000" end_address="$7E3F1000" size="593920" quantity="2" />
          </dll_handling_section>
          <filesystem_section>

Box 1: A sample part of an XML file that contains a malware behavior.

To describe the behavioral model of malware, we download the XML files that are available in PIL (http://dws.informatik.uni-mannheim.de) [38–40]. We use 7155 XML files as data set 1 and data set 2. Data set 1 contains 4024 XML files and data set 2 contains 3131 XML files. Data set 1 has 89 properties and data set 2 has 91 properties for each malware.
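The first step of the conversion described above, reading the library calls from a behavior report, can be sketched as follows. This is a minimal sketch and not the VB.Net converter used in this paper: the function name `dll_call_counts` and the simplified `SAMPLE` report are assumptions for illustration, and only the `load_dll` element with its `filename` and `quantity` attributes is taken from Box 1.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified report fragment modeled on Box 1.
SAMPLE = """<analysis>
  <processes>
    <process index="1">
      <dll_handling_section>
        <load_dll filename="C:\\WINDOWS\\system32\\ntdll.dll" quantity="16"/>
        <load_dll filename="C:\\WINDOWS\\system32\\kernel32.dll" quantity="2"/>
      </dll_handling_section>
    </process>
  </processes>
</analysis>"""


def dll_call_counts(xml_text: str) -> Counter:
    """Sum the `quantity` attribute of every load_dll element, per DLL name."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for node in root.iter("load_dll"):
        # Keep only the lowercased file name so that paths differing in case merge.
        path = node.get("filename", "").replace("\\", "/")
        name = path.rsplit("/", 1)[-1].lower()
        # Assumption: a missing quantity attribute counts as a single call.
        counts[name] += int(node.get("quantity", "1"))
    return counts


counts = dll_call_counts(SAMPLE)
```

The resulting counts per library file are the kind of raw values that a converter could then weight and write out as matrix entries.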

Then we convert each XML file to a nonsparse matrix using our suggested program. Each entry of the nonsparse matrix consists of two numbers: the first number identifies the property, and the second number shows its importance. The first row of this matrix is as follows:

(0 1.068, 2 0.534, 8 0.534, 11 0.534, 12 0.534, 23 0.534, 32 0.534, 33 0.534, 35 0.534, 36 0.534, 40 0.534, 45 1.068, 46 1.603, 47 1.068, 48 1.068, 49 1.068, 50 1.068, 51 1.068, 52 1.068, 53 1.068, 54 2.137, 55 1.068, 56 0.534, 57 1.068, 58 2.137, 61 0.534, 62 0.534, 63 2.137, 65 0.534, 66 0.534, 73 1.603, 83 22, 84 16, 85 4, 86 8, 87 6, 88 T1)

The last entry of this row, 88 T1, shows the kind of malware.
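To make the row layout concrete, the sketch below splits such a row into a feature dictionary and a class label. It assumes that the index/value pairs are separated by commas and that the only non-numeric value is the class label; the function name `parse_sparse_row` is hypothetical.

```python
def parse_sparse_row(row: str):
    """Split '(index value, index value, ...)' into a feature dict and a class label."""
    features = {}
    label = None
    for pair in row.strip("() ").split(","):
        index, value = pair.split()
        try:
            features[int(index)] = float(value)
        except ValueError:
            # A non-numeric value (such as 'T1') is treated as the class label.
            label = value
    return features, label


# A shortened version of the row shown above.
features, label = parse_sparse_row("(0 1.068, 2 0.534, 83 22, 88 T1)")
```

Properties that do not appear in the row are simply absent from the dictionary, which keeps the representation compact when a sample touches only a few of the properties.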

Finally, we analyze the executive history of malware in the WEKA environment. The malware executive history can be produced by applications such as the SandBox tool and a virtual machine, which allow safe execution of malware in computer systems and prevent the malware from spreading [28, 38–41]. The XML file includes useful information such as system library file calls; creating, searching, and changing files; modifying the registry; main process information; creating a mutex (a mutex is an application object that permits multiple program threads to share the same resource); modifying virtual memory; sending email; registry operations; and switches communications. Using the suggested program, all of this information is read and saved as a nonsparse matrix.

The matrix is then converted to the standard form of WEKA tool input as an arff file for data set 1 and data set 2. This standard form is shown in Box 2.

    @RELATION TEST                  % file name
    @ATTRIBUTE dll1 numeric         % property
    @ATTRIBUTE dll2 numeric         % property
    @ATTRIBUTE dll3 numeric         % property
    @ATTRIBUTE dll4 numeric         % property
    ...                             % property
    @ATTRIBUTE param88 numeric      % property
    @ATTRIBUTE class                % Answer - property
    @DATA
    {0 1.068, 2 0.534, 8 0.534, 11 0.534, 12 0.534, 23 0.534, 32 0.534, 33 0.534, 35

Box 2: An example of the standard form for WEKA input.
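Writing the matrix in this form can be sketched as below. The helper `to_sparse_arff`, the attribute names, and the class values `{T1,T2}` are assumptions for illustration; the output follows the sparse ARFF convention in which each data line lists `index value` pairs inside braces.

```python
def to_sparse_arff(relation, attributes, rows):
    """Render rows of {attribute_index: value} dicts as sparse ARFF text."""
    lines = [f"@RELATION {relation}", ""]
    for name, kind in attributes:
        lines.append(f"@ATTRIBUTE {name} {kind}")
    lines += ["", "@DATA"]
    for row in rows:
        # Attribute indices must appear in increasing order within a row.
        cells = ", ".join(f"{i} {row[i]}" for i in sorted(row))
        lines.append("{" + cells + "}")
    return "\n".join(lines)


# Hypothetical attribute names and values, for illustration only.
attrs = [("dll1", "numeric"), ("dll2", "numeric"), ("class", "{T1,T2}")]
arff = to_sparse_arff("TEST", attrs, [{0: 1.068, 2: "T1"}])
```

In the sparse format, a numeric attribute that is missing from a data line is read as zero, so only the properties that a sample actually exhibits need to be written.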

3.1. Classification and Prediction Approaches. This section describes the classification methods in two real case studies, data set 1 and data set 2. First, we analyze the data mining results on data set 1 and data set 2 with WEKA classification algorithms. To specify the performance of the classification methods in WEKA, we briefly describe some performance measures [27]. The Correctly Classified Instances (CCI) measure is the percentage of test cases that were correctly classified. The Incorrectly Classified Instances (ICI) measure is the percentage of test cases that were incorrectly classified.

The relative absolute error (RAE) is measured relative to the error of a simple predictor that always predicts the average of the actual values. In the RAE, the error is the total absolute error rather than the total squared error.

Definition 1. A relative absolute error is a 3-tuple $\mathrm{RAE}_i = (F_{(ij)}, V_j, \bar{V})$ in formula (1), where $F_{(ij)}$ is the value predicted by the individual program $i$ for sample case $j$ (out of $k$ sample cases), $V_j$ is the objective value for sample case $j$, and $\bar{V}$ is given by formula (2):

$$\mathrm{RAE}_i = \frac{\sum_{j=1}^{k} \left| F_{(ij)} - V_j \right|}{\sum_{j=1}^{k} \left| V_j - \bar{V} \right|} \tag{1}$$

$$\bar{V} = \frac{1}{k} \sum_{j=1}^{k} V_j \tag{2}$$
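Formulas (1) and (2) can be checked on a small example with the sketch below. The function name and the sample values are hypothetical; WEKA reports the same quantity multiplied by 100 as a percentage.

```python
def relative_absolute_error(predicted, actual):
    """RAE as in formula (1): total absolute error over the error of predicting the mean."""
    mean_actual = sum(actual) / len(actual)  # formula (2)
    numerator = sum(abs(f - v) for f, v in zip(predicted, actual))
    denominator = sum(abs(v - mean_actual) for v in actual)
    return numerator / denominator


# Hypothetical values, for illustration only.
rae = relative_absolute_error([1.0, 2.0, 4.0], [1.0, 3.0, 5.0])
```

A value below 1 means that the predictor does better than always predicting the mean of the objective values, and a value above 1 means that it does worse.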

The mean absolute error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction. The MAE measures the accuracy of continuous variables in the prediction procedure. It is the average of the absolute differences between each forecast and the corresponding observation. The MAE is a linear score, which means that all the individual differences are weighted equally in the average [42–44].

Definition 2. A mean absolute error is a 2-tuple $\mathrm{MAE}_i = (P_j, T_j)$ in formula (3), where $P_j$ is the predicted value and $T_j$ is the true value for sample case $j$ (out of $k$ sample cases). This measure specifies the average error in the classification procedure:

$$\mathrm{MAE}_i = \frac{1}{k} \sum_{j=1}^{k} \left| P_j - T_j \right| \tag{3}$$
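A direct transcription of formula (3) is sketched below; the function name and the sample values are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """MAE as in formula (3): the average absolute difference over k sample cases."""
    return sum(abs(p - t) for p, t in zip(predicted, actual)) / len(actual)


# Hypothetical values, for illustration only: the absolute errors are 0, 1, and 2.
mae = mean_absolute_error([1.0, 2.0, 4.0], [1.0, 3.0, 6.0])
```

Because every absolute difference enters the sum with the same weight, doubling one error adds exactly that error to the numerator, which is the sense in which the MAE is a linear score.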

We can also measure the proficiency of the classifiers using the true optimistic ratio (TOR) in (4), where NC is the number of correctly detected malware programs and NI is the number of incorrectly detected malware programs. The TOR reflects the cost of the estimated classification, which is significant for setting the cost of malware classification [45]:

$$\mathrm{TOR} = \frac{NC}{NC + NI} \tag{4}$$
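As a small worked example of formula (4), the sketch below evaluates the ratio for illustrative counts of 88 correctly and 12 incorrectly detected programs. The function name is hypothetical.

```python
def true_optimistic_ratio(nc: int, ni: int) -> float:
    """TOR as in formula (4): correctly detected cases over all detected cases."""
    return nc / (nc + ni)


# Illustrative counts: 88 correct and 12 incorrect detections out of 100 programs.
tor = true_optimistic_ratio(88, 12)
```

Multiplying the result by 100 gives the ratio as a percentage.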

There are also two error rates for measuring the classification performance. The False Acceptance Rate (FAR) is the ratio of the number of test cases that are incorrectly accepted by a given model to the total number of cases; that is, it shows the percentage of invalid inputs that are incorrectly accepted. The False Rejection Rate (FRR) is the ratio of the number of test cases that are incorrectly rejected by a given model to the total number of cases; that is, it shows the percentage of valid inputs that are incorrectly rejected [46]. Using these factors, we can calculate the Total Error Rate (TER) as follows [47]:

$$\mathrm{TER} = \frac{\mathrm{FAR} + \mathrm{FRR}}{NC + NI} \tag{5}$$

In the classification process we use the NaiveBayes, BayesNet, IB1, J48, and classification via regression algorithms. NaiveBayes and BayesNet are probabilistic learning algorithms based on the supervised learning method, which require a small amount of training data to estimate their parameters. The IB1 data mining algorithm is based on lazy approaches.

Table 1: The statistical analysis of data set 1 for specified classification methods.

| Algorithm | Correctly Classified Instances (number) | Correctly Classified Instances (%) | Incorrectly Classified Instances (number) | Incorrectly Classified Instances (%) | Mean absolute error | Relative absolute error (%) | Kappa statistic | Root mean squared error | Root relative squared error (%) | Total number of instances |
|---|---|---|---|---|---|---|---|---|---|---|
| NaiveBayes | 1107 | 27.5099 | 2917 | 72.4901 | 0.0069 | 90.0871 | 0.2526 | 0.0754 | 122.8107 | 4024 |
| BayesNet | 2662 | 66.1531 | 1362 | 33.8469 | 0.0032 | 42.4047 | 0.5979 | 0.0479 | 78.1282 | 4024 |
| IB1 | 2802 | 69.6322 | 1222 | 30.3678 | 0.0028 | 37.2325 | 0.6199 | 0.0533 | 86.8274 | 4024 |
| J48 | 2908 | 72.2664 | 1116 | 27.7336 | 0.0032 | 41.6312 | 0.6379 | 0.0454 | 73.9957 | 4024 |
| Regression | 3051 | 75.8201 | 973 | 24.1799 | 0.0011 | 21.0201 | 0.6859 | 0.0392 | 63.9686 | 4024 |
| SVM | 2251 | 64.1571 | 1773 | 35.8429 | 0.0039 | 42.0019 | 0.5743 | 0.4758 | 84.9596 | 4024 |

The J48 data mining algorithm is based on decision tree methods. Finally, the classification via regression algorithm is based on the Meta approach, which is a newer approach in data mining methods. Regression analysis is a statistical method used for data analysis, and it is usually applied together with correlation analysis. Correlation analysis evaluates the degree of association between two quantitative data sets [37]. For example, Figure 3 shows the classification result of the NaiveBayes algorithm in the WEKA tool. The following section describes the experimental results of the classification algorithms in WEKA. Measures such as Correctly Classified Instances, Incorrectly Classified Instances, mean absolute error, and relative absolute error are compared with each other in order to find the best classification algorithm for developing a behavioral antivirus.
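To illustrate what a lazy approach such as IB1 means in practice, the sketch below applies a plain 1-nearest-neighbor rule to sparse feature vectors: no model is built in advance, and each query is compared against the stored training samples. This is a minimal sketch under stated assumptions and not WEKA's IB1 implementation; the function name, the feature vectors, and the labels are hypothetical.

```python
import math


def nearest_neighbor_label(train, query):
    """Return the label of the training vector closest to `query` (Euclidean distance)."""
    def distance(a, b):
        # Indices missing from a sparse vector are treated as zero.
        keys = set(a) | set(b)
        return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

    _, best_label = min(train, key=lambda item: distance(item[0], query))
    return best_label


# Hypothetical sparse feature vectors ({attribute_index: value}) and class labels.
train = [({0: 1.068, 2: 0.534}, "T1"), ({5: 2.137}, "T2")]
label = nearest_neighbor_label(train, {0: 1.0, 2: 0.5})
```

The cost of this design is paid at prediction time, since every query scans the whole training set, whereas a decision tree such as J48 pays the cost once during training.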

4. Experimental Results and Discussion

In this section we implement our approach using the WEKA tool. We use a system with an Intel Core i3 2.13 GHz CPU and 4 GB of RAM for the classification methods. The analysis has been done with classification algorithms such as NaiveBayes, BayesNet, IB1, J48, Support Vector Machine (SVM), and the logistic regression method. We compared the performance of the classification methods on two malware data sets.

Tables 1 and 2 give the statistical analysis of data sets 1 and 2 for the proposed classification methods. The compared factors are Correctly Classified Instances, Incorrectly Classified Instances, Kappa statistic, mean absolute error, relative absolute error, root mean squared error, and root relative squared error. This comparison shows that the classification via regression method has the best performance in malware detection. For example, in data set 1 the number of correctly classified malware programs is 3051 out of a total of 4024 malware programs. In data set 2 the number of correctly classified malware programs is 3069 out of a total of 3131 malware programs.

Figure 3: The snapshot of the NaiveBayes classification algorithm in WEKA.

According to Tables 1 and 2, the percentage of Correctly Classified Instances of the logistic regression algorithm is higher than that of the other classification methods in each of data sets 1 and 2. Also, the percentage of Incorrectly Classified Instances of the logistic regression algorithm is lower than that of the other classification methods in each of data sets 1 and 2.
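The percentages of correctly and incorrectly classified instances follow directly from the instance counts. The sketch below recomputes them for the regression row of data set 1 (3051 correct out of 4024); the function name is hypothetical.

```python
def classification_percentages(correct: int, total: int):
    """Return (CCI %, ICI %) rounded to four decimals, from instance counts."""
    cci = 100.0 * correct / total
    return round(cci, 4), round(100.0 - cci, 4)


# Regression on data set 1: 3051 correctly classified out of 4024 instances.
cci, ici = classification_percentages(3051, 4024)
```

For these counts the function returns 75.8201 and 24.1799, the two percentages reported for this row in Table 1.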

After the data mining process, we test new malware cases with the regression classification algorithm. 100 binary malware programs were downloaded from NetLux (http://vxheaven.org), their behaviors were analyzed using the CWSandbox tool, and the XML file of each was obtained [38]. Then we added these 100 malware programs to the new data set and computed the quality of their classification as the true optimistic ratio. As expected, 88 malware programs were detected by classification via regression. So we can use classification via regression to develop a behavioral antivirus.

Table 2: The statistical analysis of data set 2 for specified classification methods.

| Algorithm | Correctly Classified Instances (number) | Correctly Classified Instances (%) | Incorrectly Classified Instances (number) | Incorrectly Classified Instances (%) | Mean absolute error | Relative absolute error (%) | Kappa statistic | Root mean squared error | Root relative squared error (%) | Total number of instances |
|---|---|---|---|---|---|---|---|---|---|---|
| NaiveBayes | 2678 | 85.5318 | 453 | 14.4682 | 0.012 | 15.3329 | 0.8459 | 0.1026 | 51.8792 | 3131 |
| BayesNet | 2874 | 91.7918 | 257 | 8.2082 | 0.0073 | 9.3575 | 0.9127 | 0.0747 | 37.7504 | 3131 |
| IB1 | 3028 | 96.7103 | 103 | 3.2897 | 0.0027 | 3.5032 | 0.965 | 0.0524 | 26.472 | 3131 |
| J48 | 3008 | 96.0715 | 123 | 3.9285 | 0.0043 | 5.5353 | 0.9581 | 0.0527 | 26.652 | 3131 |
| Regression | 3069 | 98.321 | 62 | 1.679 | 0.0021 | 2.2102 | 0.9578 | 0.0543 | 27.4333 | 3131 |
| SVM | 1698 | 54.2319 | 1433 | 45.7681 | 0.0046 | 57.993 | 0.5011 | 0.1942 | 98.1954 | 3131 |

[Bar chart for the new data set; x-axis: classification algorithms (BayesNet, Regression, J48, SVM, NaiveBayes, IB1); y-axis: true optimistic ratio (TOR) (%), from 0 to 100.]

Figure 4: The true optimistic ratio for the classifications test in the new data set.

Figure 4 depicts the true optimistic ratio percentage for malware detection in the new data set. The true optimistic ratio percentage of the regression method is higher than that of the other classification methods in the new data set.

After testing our new case study with 100 malware programs, Table 3 gives the statistical results for the number of False Acceptance Rate (FAR) cases and the number of False Rejection Rate (FRR) cases. There are platforms such as STAC (http://tec.citius.usc.es/stac) [48] for statistical comparison of the tested algorithms, but we use the WEKA tool for the statistical and experimental results on our data sets.

According to Table 3, no valid input is incorrectly rejected by our approach with the regression method, whereas the NaiveBayes method incorrectly rejects 6 valid inputs.

Also, in this test case we find one FAR case, that is, one input incorrectly accepted as malware. Figure 5 shows the Total Error Rate (TER) for our new test case using our approach by the regression method.

Figure 5: The Total Error Rate (TER) for the classifications test in the new data set. [Bar chart; x-axis: classification algorithms (IB1, BayesNet, J48, Regression, SVM, NaiveBayes); y-axis: Total Error Rate (TER) (%), from 0 to 20; series: new data set.]

Table 3: The statistical analysis of the FAR and FRR number of cases in the new test case study.

Algorithm    Number of FAR cases  Number of FRR cases  Total number of instances
NaiveBayes   5                    6                    100
BayesNet     4                    2                    100
IB1          2                    1                    100
J48          3                    2                    100
Regression   1                    0                    100
SVM          3                    2                    100


5. Conclusion and Future Work

In this paper, we proposed a new data mining approach based on classification methodologies for detecting malware behavior. First, a malware behavior executive history XML file is converted to a nonsparse matrix using our suggested application. Then this matrix is translated to a WEKA input data set. To illustrate the performance efficiency, we applied the proposed approaches to a real case study data set using the WEKA tool. Training was carried out with classification algorithms such as NaiveBayes, BayesNet, IB1, J48, and regression algorithms. The regression classification method had the best performance for malware detection. We also analyzed the new data set with the regression classification method. The evaluation results demonstrated the availability of the proposed data mining approach and showed that the proposed mechanism is more efficient for detecting malware. According to the experimental results, classification of malware behavioral features can be a convenient method for developing a behavioral antivirus. In future work, we will try to develop and analyze a real behavioral antivirus platform based on the classification via regression algorithm.

Competing Interests

The authors declare that they have no competing interests.

References

[1] D. B. Ekta Gandotra and S. Sofat, "Malware analysis and classification: a survey," Journal of Information Security, vol. 5, pp. 56–64, 2014.

[2] P. Wang and Y.-S. Wang, "Malware behavioural detection and vaccine development by using a support vector model classifier," Journal of Computer and System Sciences, vol. 81, no. 6, pp. 1012–1026, 2015.

[3] G. Ollmann, "The evolution of commercial malware development kits and colour-by-numbers custom malware," Computer Fraud and Security, vol. 2008, no. 9, pp. 4–7, 2008.

[4] M. Ghiasi, A. Sami, and Z. Salehi, "Dynamic VSA: a framework for malware detection based on register contents," Engineering Applications of Artificial Intelligence, vol. 44, pp. 111–122, 2015.

[5] D. Bruschi, L. Martignoni, and M. Monga, "Detecting self-mutating malware using control-flow graph matching," in Detection of Intrusions and Malware & Vulnerability Assessment, R. Buschkes and P. Laskov, Eds., vol. 4064, pp. 129–143, Springer, Berlin, Germany, 2006.

[6] M. R. Chouchane and A. Lakhotia, "Using engine signature to detect metamorphic malware," in Proceedings of the 4th ACM Workshop on Recurring Malcode, Alexandria, Va, USA, 2006.

[7] N. Kuzurin, A. Shokurov, N. Varnovsky, and V. Zakharov, "On the concept of software obfuscation in computer security," in Information Security, J. Garay, A. Lenstra, M. Mambo, and R. Peralta, Eds., vol. 4779, pp. 281–298, Springer, Berlin, Germany, 2007.

[8] M. Christodorescu and S. Jha, "Testing malware detectors," SIGSOFT Software Engineering Notes, vol. 29, no. 4, pp. 34–44, 2004.

[9] L. K. Mehedy Masud and B. Thuraisingham, Data Mining Tools for Malware Detection, vol. 1, CRC Press, 2012.

[10] M. Egele, T. Scholte, E. Kirda, and C. Kruegel, "A survey on automated dynamic malware-analysis techniques and tools," ACM Computing Surveys, vol. 44, pp. 1–42, 2008.

[11] S. P. Monire Norouzi and A. Mahjur, "A new approach for formal behavioral modeling of protection services in antivirus systems," International Journal in Foundations of Computer Science & Technology, vol. 4, pp. 57–67, 2014.

[12] A. Safarkhanlou, A. Souri, M. Norouzi, and S. E. H. Sardroud, "Formalizing and verification of an antivirus protection service using model checking," Procedia Computer Science, vol. 57, pp. 1324–1331, 2015.

[13] I. Santos, F. Brezo, X. Ugarte-Pedrero, and P. G. Bringas, "Opcode sequences as representation of executables for data-mining-based unknown malware detection," Information Sciences, vol. 231, pp. 64–82, 2013.

[14] N. Abdelhamid, A. Ayesh, and F. Thabtah, "Phishing detection based associative classification data mining," Expert Systems with Applications, vol. 41, no. 13, pp. 5948–5959, 2014.

[15] G. Jacob, H. Debar, and E. Filiol, "Behavioral detection of malware: from a survey towards an established taxonomy," Journal in Computer Virology, vol. 4, no. 3, pp. 251–266, 2008.

[16] A. Shabtai, R. Moskovitch, Y. Elovici, and C. Glezer, "Detection of malicious code by applying machine learning classifiers on static features: a state-of-the-art survey," Information Security Technical Report, vol. 14, no. 1, pp. 16–29, 2009.

[17] S. B. Kotsiantis, "Supervised machine learning: a review of classification techniques," in Proceedings of the Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, pp. 3–24, IOS Press, 2007.

[18] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.

[19] L. Chen, Q. Jiang, and S. Wang, "Model-based method for projective clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 7, pp. 1291–1305, 2012.

[20] C. Ravi and R. Manoharan, "Malware detection using Windows Api sequence and machine learning," International Journal of Computer Applications, vol. 43, no. 17, pp. 12–16, 2012.

[21] R. Rizwan, G. C. Hazarika, and G. Chetia, "Malware threats and mitigation strategies: a survey," Journal of Theoretical and Applied Information Technology, vol. 29, no. 2, pp. 69–73, 2011.

[22] K. Mathur and H. Saroj, "A survey on techniques in detection and analyzing malware executables," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 44, no. 2, 2012.

[23] N. F. Doherty, L. Anastasakis, and H. Fulford, "The information security policy unpacked: a critical study of the content of university policies," International Journal of Information Management, vol. 29, no. 6, pp. 449–457, 2009.

[24] G. Tahan, L. Rokach, and Y. Shahar, "Automatic malware detection using common segment analysis and meta-features," Journal of Machine Learning Research, vol. 13, pp. 949–979, 2012.

[25] M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario, "Automated classification and analysis of internet malware," in Recent Advances in Intrusion Detection, C. Kruegel, R. Lippmann, and A. Clark, Eds., vol. 4637, pp. 178–197, Springer, Berlin, Germany, 2007.

[26] U. Bayer, A. Moser, C. Kruegel, and E. Kirda, "Dynamic analysis of malicious code," Journal in Computer Virology, vol. 2, no. 1, pp. 67–77, 2006.


[27] J. Z. Kolter and M. A. Maloof, "Learning to detect and classify malicious executables in the wild," Journal of Machine Learning Research, vol. 7, pp. 2721–2744, 2006.

[28] P. Trinius, C. Willems, T. Holz, and K. Rieck, A Malware Instruction Set for Behavior-Based Analysis, 2009.

[29] R. Moskovitch and Y. Shahar, "Classification-driven temporal discretization of multivariate time series," Data Mining and Knowledge Discovery, vol. 29, no. 4, pp. 871–913, 2015.

[30] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, "Data mining methods for detection of new malicious executables," in Proceedings of the IEEE Symposium on Security and Privacy (S&P), pp. 38–49, Oakland, Calif, USA, 2001.

[31] J. Z. Kolter and M. A. Maloof, "Learning to detect malicious executables in the wild," in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), pp. 470–478, ACM, Seattle, Wash, USA, August 2004.

[32] M. Siddiqui, M. C. Wang, and J. Lee, "Detecting internet worms using data mining techniques," Journal of Systemics, Cybernetics and Informatics, vol. 6, pp. 48–53, 2008.

[33] H. Yang and Y.-P. P. Chen, "Data mining in lung cancer pathologic staging diagnosis: correlation between clinical and pathology information," Expert Systems with Applications, vol. 42, no. 15-16, pp. 6168–6176, 2015.

[34] R. Campagni, D. Merlini, R. Sprugnoli, and M. C. Verri, "Data mining models for student careers," Expert Systems with Applications, vol. 42, no. 13, pp. 5508–5521, 2015.

[35] R. M. Rahman and F. R. Md. Hasan, "Using and comparing different decision tree classification techniques for mining ICDDRB Hospital Surveillance data," Expert Systems with Applications, vol. 38, no. 9, pp. 11421–11436, 2011.

[36] S. Ghosh, S. Biswas, D. Sarkar, and P. P. Sarkar, "A novel Neuro-fuzzy classification technique for data mining," Egyptian Informatics Journal, vol. 15, no. 3, pp. 129–147, 2014.

[37] R. Moskovitch and Y. Shahar, "Fast time intervals mining using the transitivity of temporal relations," Knowledge and Information Systems, vol. 42, no. 1, pp. 21–48, 2015.

[38] D. Stopel, Z. Boger, R. Moskovitch, Y. Shahar, and Y. Elovici, "Application of artificial neural networks techniques to computer worm detection," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '06), pp. 2362–2369, July 2006.

[39] D. Stopel, R. Moskovitch, Z. Boger, Y. Shahar, and Y. Elovici, "Using artificial neural networks to detect unknown computer worms," Neural Computing and Applications, vol. 18, no. 7, pp. 663–674, 2009.

[40] R. Moskovitch, I. Gus, S. Pluderman, et al., "Detection of unknown computer worms activity based on computer behavior using data mining," in Proceedings of the 1st IEEE Symposium on Computational Intelligence and Data Mining (CIDM '07), pp. 202–209, IEEE, Honolulu, Hawaii, USA, April 2007.

[41] N. Nissim, R. Moskovitch, L. Rokach, and Y. Elovici, "Detecting unknown computer worm activity via support vector machines and active learning," Pattern Analysis and Applications, vol. 15, no. 4, pp. 459–475, 2012.

[42] N. Karthik, R. Arul, and M. J. H. Prasad, "Modeling of wind turbine power curves using firefly algorithm," in Power Electronics and Renewable Energy Systems, C. Kamalakannan, L. P. Suresh, S. S. Dash, and B. K. Panigrahi, Eds., vol. 326, pp. 1407–1414, Springer, New Delhi, India, 2015.

[43] F. Galton, Finger Prints, Macmillan and Company, 1892.

[44] B. D. Eugenio and M. Glass, "The kappa statistic: a second look," Computational Linguistics, vol. 30, no. 1, pp. 95–101, 2004.

[45] M. N. Mohammad, N. Sulaiman, and O. A. Muhsin, "A novel intrusion detection system by using intelligent data mining in weka environment," Procedia Computer Science, vol. 3, pp. 1237–1242, 2011.

[46] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2002.

[47] M. Deshmukh and M. N. K. Prasad, "Partial segmentation and matching technique for iris recognition," in Computational Intelligence in Data Mining - Volume 1, L. C. Jain, H. S. Behera, J. K. Mandal, and D. P. Mohapatra, Eds., vol. 31, pp. 77–86, Springer India, 2015.

[48] I. Rodríguez-Fdez, A. Canosa, M. Mucientes, and A. Bugarín, "STAC: a web platform for the comparison of algorithms using statistical tests," in Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 1–8, Istanbul, Turkey, August 2015.


approaches to identify the malware is using signature-based techniques. In recent years, the failure of older methods to detect unrecognized malware or polymorphic malicious files has frustrated researchers, who have attempted to present more dependable approaches for malware detection based on the behavior of the malware [23]. The procedure of detecting and finding malware has been done by two types of analysis: static analysis and dynamic analysis. Among software analysis methods, analyzing without running the code is called static analysis, which can detect malicious code and put it in one of the available collections based on different learning methods [24]. In static analysis, malicious files and malware are detected based on binary codes. The main disadvantage of static analysis is the unavailability of the source code of the program. It is worth noting that extracting binary codes is a relatively complex task.

In contrast, dynamic analysis detects malicious code according to its runtime behavior [10]. Runtime code analysis is called dynamic analysis; it is also known as behavior analysis, since it observes behavior and system interaction [23]. A dynamic analysis mechanism needs to execute the infected files in a virtual machine [21]. Dynamic analysis can be used with classification and clustering methods to cope with the increasing volume and range of malware. Malware classification methods help to assign unknown malware to recognized families [7, 20]. Therefore, malware classification is used to filter unknown cases and thus decreases the cost of analysis [8, 25–29].

The contributions of this paper are as follows:

(i) Proposing a behavioral analysis mechanism for malware detection.

(ii) Presenting a converter program for transforming a malware behavior executive history XML file to a suitable WEKA input.

(iii) Discussing some classification methods on a real case study of malware.

(iv) Comparing experimental results such as Correctly Classified Instances, mean absolute error, and accurate optimistic ratio on the real data set using the WEKA tool.

(v) Testing the best classification method based on the important features in malware detection in order to develop a behavioral antivirus.

The structure of this paper is organized as follows. In Section 2, we discuss background and related works in malware detection and data mining techniques. Section 3 presents the malware behavioral analysis. In that section, we propose a new approach for analyzing malware behavior and translating the malicious files to data mining files using a real case study. That section also describes the classification and prediction approaches using a data mining platform. Then we apply some popular classification methods to our real case study using the WEKA tool. The evaluation and experimental results are reported in Section 4. Section 5 concludes the discussion and outlines future work.

2. Related Works

This section discusses a brief background and some related works on malware detection with data mining methods. First, we briefly review data mining approaches based on classification methods in malware and other systems. Recently, several researchers have presented different approaches to malware analysis. Schultz et al. [30] proposed a data mining method to recognize new malicious files at runtime. Their method was based on three types of DLL calls: the list of DLLs used by the binary, the list of DLL function calls, and the number of different system calls used within each DLL. They also examined byte sequences extracted from the hex-dump (a hexadecimal representation of computer data) of an executable file using signature methods. The main structure of this method is based on the Naive-Bayes (NB) algorithm. They compared the experimental results with traditional signature-based methods.

Also, Kolter and Maloof [31] presented a data mining approach with n-gram analysis to identify malicious executable files based on a signature approach. They used a hex-dump utility to translate each executable file to hexadecimal code in an ASCII format. Their main data set consisted of clean programs and malicious programs. They analyzed the proposed approach with several popular classification methods, such as instance-based learner, TFIDF, Naive-Bayes, support vector machines, decision tree, boosted Naive-Bayes, and boosted decision tree. In other research, Siddiqui et al. [32] proposed data mining techniques for recognizing malware programs such as Worms. They considered variable length instruction sequences in their approach. Their main data set includes some Windows files and Worms. In their experimental results, sequence reduction was executed, 97% of the sequences were removed, and the random forest decision tree model performed slightly better than the others.

Also, some research works have applied data mining methodologies to other domains. For example, in [33] the researchers presented various data mining methods that have been developed for cancer diagnosis. This research focused on capturing the clinical information that can be obtained without surgery in order to replace the pathology report. They sought to discover the association between the clinical information and the pathology report in order to support lung cancer pathologic staging diagnosis using data mining techniques. In other research [1, 34], the authors proposed a data mining approach to analyze students' careers. Their approach is based on clustering and sequential methods, with the aim of identifying strategies for improving the performance of students and the scheduling of exams. They analyzed a real case study using K-means clustering techniques in the WEKA tool. Likewise, [26] presented a new data mining method for the problem of detecting phishing websites using a developed associative classification method, called multilabel classifier, that generates multiple-label rules. They analyzed the experimental results with various patterns in the WEKA software. Also, the researchers in [35] analyzed several decision tree models to classify patients in hospital surveillance data as a real case study. The experimental results of their analysis showed that their approach improved the even distribution of instances in each class.

Other related work [36] used a neurofuzzy data mining approach for classification with generalized bell-shaped membership functions. They applied the proposed technique to ten real standard data sets from the UCI machine learning repository and evaluated the classification using the Kappa statistic. They simulated the proposed technique in MATLAB. Also, some research has focused on other approaches that consist of host behavior classification methods [37–40]. For example, [29] presented a novel supervised discretization technique for analyzing multivariate time series, which uses frequent temporal patterns as features for classification of time series and is geared toward improving classification accuracy. This paper used a temporal abstraction classification approach and time intervals mining for the presented multivariate time series. Also, [38] presented a novel Artificial Neural Networks (ANN) based mechanism for discovering computer Worms based on behavioral computer events. Based on estimates of different parameters of the infected computers, the ANN, decision tree, and K-nearest neighbors classification techniques are compared. In other research [41], the authors presented a mechanism based on extracted computer measurements for identifying unknown computer Worm activity in the operating system using support vector approaches. This paper performs a series of trials to check the new technique under several computer configuration activities.

Figure 1: The behavioral analysis of malware detection mechanism. [Flow diagram; labels in order of appearance: XML malware behavior executive history; Converter (based on behavioral features); Nonsparse matrix; WEKA (DataSet1.arff); Training; Classification algorithms; Best classification algorithm; Test; Malware detection; Adding new malware; A behavioral antivirus platform.]

To the best of our knowledge, no existing approach analyzes malware behavior in a data mining platform in this way, and no existing approach converts a malware behavior XML executive history file to a suitable WEKA tool input. To address this gap, we present a new approach to translate a malicious file to the data mining platform. Then we consider some classification methods for evaluating our approach based on malware behavior. Our approach can serve as the basis of a behavioral antivirus.

Figure 2: A snapshot of the XML converter to nonsparse matrix.

3. Malware Behavior Analysis

In this section, we propose a malware behavioral analysis mechanism, as shown in Figure 1. In this mechanism, an XML file of malware behavior executive history is converted to a nonsparse matrix using a suggested application, which is written in VB.NET. Figure 2 shows a snapshot of the XML converter to a nonsparse matrix using our suggested application. The procedure of converting each XML file to a suitable WEKA input considers two elements: the number of calls to library files that are attacked by the malware, and their volume. For example, in Box 1, the library file ntdll.dll has been called 16 times by the malware, which are between (0, 2). Then we translate this matrix to a WEKA input data set. Training is carried out with several classification algorithms. The classification algorithm with the best performance is chosen for the test platform with the new data set malware. Finally, this procedure can be used for developing a behavioral antivirus. To describe the behavioral model of malware, we download the XML files, which are available from PIL (http://dws.informatik.uni-mannheim.de) [38–40]. We use 7155 XML files as data set 1 and data set 2. Data set 1 contains 4024 XML files and data set 2 contains 3131 XML files. Data set 1 has 89 properties and data set 2 has 91 properties for each malware.

<?xml version="1.0" ?>
<!-- This analysis was created by CWSandbox (c) CWSE GmbH/Sunbelt Software -->
<analysis cwsversion="2.1.12" time="08.08.2009 05:22:19"
    file="c:\260589951029048b3e6d93316b3c2507"
    md5="260589951029048b3e6d93316b3c2507"
    sha1="0089453df77890ae95ce7d9130a4ef85eaea36e8"
    logpath="c:\cwsandbox\log\260589951029048b3e6d93316b3c2507\run_1"
    analysisid="647702" sampleid="431657">
  <calltree>
    <process_call index="1" pid="1940"
        filename="c:\260589951029048b3e6d93316b3c2507" starttime="00:01.922"
        startreason="AnalysisTarget">
      <calltree>
        <process_call index="2" pid="2084" filename="C:\Programme\Internet Explorer\iexplore.exe"
            starttime="00:05.343" startreason="CreateProcess" />
      </calltree>
    </process_call>
    <process_call index="3" pid="948"
        filename="C:\WINDOWS\system32\svchost.exe" starttime="00:07.062"
        startreason="DCOMService" />
  </calltree>
  <processes>
    <process index="1" pid="1940"
        filename="c:\260589951029048b3e6d93316b3c2507" filesize="761856"
        md5="260589951029048b3e6d93316b3c2507"
        sha1="0089453df77890ae95ce7d9130a4ef85eaea36e8" username="Administrator"
        parentindex="0" starttime="00:01.922" terminationtime="00:07.484"
        startreason="AnalysisTarget" terminationreason="NormalTermination"
        executionstatus="OK" applicationtype="Win32Application">
      <dll_handling_section>
        <load_image filename="c:\260589951029048b3e6d93316b3c2507" successful="1"
            address="$400000" end_address="$4C1000" size="790528" />
        <load_dll filename="C:\WINDOWS\system32\ntdll.dll" successful="1"
            address="$7C910000" end_address="$7C9C9000" size="757760" quantity="16" />
        <load_dll filename="C:\WINDOWS\system32\kernel32.dll" successful="1"
            address="$7C800000" end_address="$7C908000" size="1081344" quantity="2" />
        <load_dll filename="C:\WINDOWS\system32\gdi32.dll" successful="1"
            address="$77EF0000" end_address="$77F39000" size="299008" quantity="2" />
        <load_dll filename="C:\WINDOWS\system32\USER32.dll" successful="1"
            address="$7E360000" end_address="$7E3F1000" size="593920" quantity="2" />
      </dll_handling_section>
      <filesystem_section>

Box 1: A sample part of an XML file that contains a malware behavior.
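The first step of the converter, reading library file calls from a behavior report, can be illustrated with a short Python sketch. This is not the authors' VB.NET application. It assumes the element name `load_dll` and the attributes `filename` and `quantity` as they appear in Box 1; the function name `dll_call_counts` and the embedded sample are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import PureWindowsPath

# Hypothetical, shortened report modeled on the layout of Box 1.
SAMPLE = r"""<analysis>
  <processes>
    <process index="1" pid="1940">
      <dll_handling_section>
        <load_dll filename="C:\WINDOWS\system32\ntdll.dll" successful="1" size="757760" quantity="16"/>
        <load_dll filename="C:\WINDOWS\system32\kernel32.dll" successful="1" size="1081344" quantity="2"/>
      </dll_handling_section>
    </process>
  </processes>
</analysis>"""


def dll_call_counts(xml_text):
    """Return {dll name: number of calls} summed over all load_dll elements."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for node in root.iter("load_dll"):
        # Keep only the file name, lowercased, so paths do not split the count.
        name = PureWindowsPath(node.get("filename", "")).name.lower()
        # Assumption: a missing quantity attribute means a single call.
        counts[name] += int(node.get("quantity", "1"))
    return dict(counts)


print(dll_call_counts(SAMPLE))
```

Each DLL name would then be mapped to a fixed property index to form one row of the matrix; that mapping is not specified in the text and is left out here.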

Then we convert this XML file to a nonsparse matrix using our suggested program. Each entry of the nonsparse matrix consists of two numbers: the first number is the number (index) of the property and the second number shows its importance. The first row of this matrix is shown as follows:

(0 1.068, 2 0.534, 8 0.534, 11 0.534, 12 0.534, 23 0.534, 32 0.534, 33 0.534, 35 0.534, 36 0.534, 40 0.534, 45 1.068, 46 1.603, 47 1.068, 48 1.068, 49 1.068, 50 1.068, 51 1.068, 52 1.068, 53 1.068, 54 2.137, 55 1.068, 56 0.534, 57 1.068, 58 2.137, 61 0.534, 62 0.534, 63 2.137, 65 0.534, 66 0.534, 73 1.603, 83 22, 84 16, 85 4, 86 8, 87 6, 88 T1)

The last entry of this row, 88 T1, shows the kind of malware.
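As an illustration of this row layout, the following Python sketch expands one row of index and value pairs into a dense feature vector plus a class label. It assumes that decimal points are intact, that properties are numbered from 0, and that the entry at index 88 carries the label; the function name `parse_row` and the parameter `n_features` are hypothetical.

```python
def parse_row(row, n_features=88):
    """Expand '(idx value, idx value, ..., n_features label)' into (dense, label)."""
    tokens = row.strip("() \n").replace(",", " ").split()
    dense = [0.0] * n_features
    label = None
    # Tokens alternate between a property index and its value.
    for idx, val in zip(tokens[0::2], tokens[1::2]):
        i = int(idx)
        if i == n_features:
            label = val          # last entry holds the kind of malware
        else:
            dense[i] = float(val)
    return dense, label


dense, label = parse_row("(0 1.068, 2 0.534, 83 22, 88 T1)")
print(label, dense[0], dense[2], dense[83])
```

Properties that do not appear in the row are left at 0.0, which is one plausible reading of how unlisted properties are treated.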

Finally, we analyze the executive history of malware in the WEKA environment. The malware executive history can be produced by applications such as the SandBox tool and a virtual machine, which allow safe execution of malware in computer systems and prevent malware spread [28, 38–41]. The XML file includes useful information such as system library file calls; creating, searching, and changing files; modifying the registry; main process information; creating a mutex (a mutex is an application object that permits multiple program threads to share the same resource); modifying virtual memory; sending email; registry operations; and switch communications. Using the suggested program, all of this information is read and saved as a nonsparse matrix.

Now the matrix is converted to the standard form of WEKA tool input, an .arff file, for data set 1 and data set 2. This standard form is shown in Box 2.

@RELATION TEST  % file name
@ATTRIBUTE dll1 numeric  % property
@ATTRIBUTE dll2 numeric  % property
@ATTRIBUTE dll3 numeric  % property
@ATTRIBUTE dll4 numeric  % property
...  % property
@ATTRIBUTE param88 numeric  % property
@ATTRIBUTE class  % Answer - property
@DATA
0 1.068, 2 0.534, 8 0.534, 11 0.534, 12 0.534, 23 0.534, 32 0.534, 33 0.534, 35

Box 2: An example of the standard form for WEKA input.
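The conversion to WEKA input can be sketched in Python as follows. This is a minimal sketch, not the authors' converter. It writes WEKA's sparse ARFF layout, in which each data row is a brace-enclosed list of zero-based index and value pairs; whether the authors used this exact layout is an assumption. The function name `to_sparse_arff`, the attribute names `f0`, `f1`, and so on, and the class values are hypothetical.

```python
def to_sparse_arff(relation, n_features, class_values, rows):
    """Build sparse ARFF text from rows of ({index: value}, label)."""
    lines = [f"@RELATION {relation}", ""]
    for i in range(n_features):
        lines.append(f"@ATTRIBUTE f{i} NUMERIC")
    # The class attribute is nominal and comes last, at index n_features.
    lines.append("@ATTRIBUTE class {" + ",".join(class_values) + "}")
    lines += ["", "@DATA"]
    for pairs, label in rows:
        # Zero values are omitted, as the sparse layout expects.
        cells = [f"{i} {v:g}" for i, v in sorted(pairs.items()) if v != 0]
        cells.append(f"{n_features} {label}")
        lines.append("{" + ", ".join(cells) + "}")
    return "\n".join(lines) + "\n"


print(to_sparse_arff("TEST", 3, ["T1", "T2"], [({0: 1.068, 2: 0.534}, "T1")]))
```

Writing the class value explicitly on every row avoids relying on the sparse-format default for omitted entries.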

3.1. Classification and Prediction Approaches. This section describes the classification methods in two real case studies, data set 1 and data set 2. First, we analyze the data mining results on data set 1 and data set 2 with WEKA classification algorithms. To specify the performance of classification methods in WEKA, we briefly describe some effective features [27]. The Correctly Classified Instances (CCI) give the percentage of test cases that were correctly classified. The Incorrectly Classified Instances (ICI) give the percentage of test cases that were incorrectly classified.

The relative absolute error (RAE) is measured relative to the error of a simple predictor, which always predicts the average of the actual values. In the RAE, the error is the total absolute error rather than the total squared error.

Definition 1. A relative absolute error is a 3-tuple $\mathrm{RAE}_i = (F_{(ij)}, V_j, \bar{V})$ in formula (1), where $F_{(ij)}$ is the value predicted by the individual program $i$ for sample case $j$ (out of $k$ sample cases), $V_j$ is the objective value for sample case $j$, and $\bar{V}$ is given by formula (2):

$$\mathrm{RAE}_i = \frac{\sum_{j=1}^{k} \left| F_{(ij)} - V_j \right|}{\sum_{j=1}^{k} \left| V_j - \bar{V} \right|} \tag{1}$$

$$\bar{V} = \frac{1}{k} \sum_{j=1}^{k} V_j \tag{2}$$
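Formulas (1) and (2) can be checked numerically with a small Python sketch. The function name `relative_absolute_error` and the sample values are hypothetical, and the result is a ratio rather than a percentage.

```python
def relative_absolute_error(predicted, actual):
    """RAE per formulas (1) and (2): total absolute error over the baseline error."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equally long")
    mean_actual = sum(actual) / len(actual)                  # formula (2)
    numerator = sum(abs(f - v) for f, v in zip(predicted, actual))
    denominator = sum(abs(v - mean_actual) for v in actual)
    if denominator == 0:
        raise ValueError("all actual values are equal, so RAE is undefined")
    return numerator / denominator                           # formula (1)


print(relative_absolute_error([1, 2, 3, 4], [1, 2, 3, 5]))
```

A value below 1 means the predictor does better than always predicting the mean of the actual values.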

Also, the mean absolute error (MAE) measures the average magnitude of the errors in a set of predictions without considering their direction. The MAE measures the accuracy of continuous variables in the prediction procedure. It is the average of the absolute differences between each forecast and the corresponding observation. The MAE is a linear score, which means that all the individual differences are weighted equally in the average [42–44].

Definition 2. A mean absolute error is a 2-tuple $\mathrm{MAE}_i = (P_j, T_j)$ in formula (3), where $P_j$ is the predicted value and $T_j$ is the true value for sample case $j$ (out of $k$ sample cases). This feature specifies the average error in the classification procedure:

$$\mathrm{MAE}_i = \frac{1}{k} \sum_{j=1}^{k} \left| P_j - T_j \right| \tag{3}$$
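A direct Python transcription of formula (3) is given below as a minimal sketch; the function name `mean_absolute_error` and the sample values are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """MAE per formula (3): average of |P_j - T_j| over the k sample cases."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equally long")
    return sum(abs(p - t) for p, t in zip(predicted, actual)) / len(actual)


print(mean_absolute_error([1, 2, 3, 4], [1, 2, 3, 5]))
```

Because every difference enters with the same weight, one large error affects the MAE less than it would affect a squared-error measure.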

Also, we can measure the classifier's proficiency using the true optimistic ratio (TOR) in (4), where NC is the number of correctly detected malware programs and NI is the number of incorrectly detected malware programs. The TOR provides the cost of the estimated classification, which is significant for setting the cost of malware classification [45]:

$$\mathrm{TOR} = \frac{\mathrm{NC}}{\mathrm{NC} + \mathrm{NI}} \tag{4}$$

Also, there are two error rates for measuring classification performance. The False Acceptance Rate (FAR) is the ratio of the number of test cases that are incorrectly accepted by a given model to the total number of cases; that is, it gives the percentage of invalid inputs that are incorrectly accepted. The False Rejection Rate (FRR) is the ratio of the number of test cases that are incorrectly rejected by a given model to the total number of cases; that is, it gives the percentage of valid inputs that are incorrectly rejected [46]. Using these factors, we can calculate the Total Error Rate (TER) as follows [47]:

$$\mathrm{TER} = \frac{\mathrm{FAR} + \mathrm{FRR}}{\mathrm{NC} + \mathrm{NI}} \tag{5}$$
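Formulas (4) and (5) can be applied to the FAR and FRR case counts reported in Table 3, as in the Python sketch below. It assumes that FAR and FRR in (5) are case counts and that NC + NI equals the total number of test instances (100); the function names and the dictionary `TABLE3` are hypothetical.

```python
def true_optimistic_ratio(nc, ni):
    """TOR per formula (4): correctly detected over all detected programs."""
    return nc / (nc + ni)


def total_error_rate(far_cases, frr_cases, total_cases):
    """TER per formula (5), with NC + NI taken as the total number of cases."""
    return (far_cases + frr_cases) / total_cases


# (FAR cases, FRR cases) per algorithm, copied from Table 3.
TABLE3 = {
    "NaiveBayes": (5, 6),
    "BayesNet": (4, 2),
    "IB1": (2, 1),
    "J48": (3, 2),
    "Regression": (1, 0),
    "SVM": (3, 2),
}

ter = {name: total_error_rate(far, frr, 100) for name, (far, frr) in TABLE3.items()}
for name, value in sorted(ter.items(), key=lambda item: item[1]):
    print(f"{name}: {100 * value:.0f}%")
```

Under these assumptions, the regression method has the lowest TER and NaiveBayes the highest.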

In the classification process, we use the NaiveBayes, BayesNet, IB1, J48, and classification via regression algorithms. NaiveBayes and BayesNet are probabilistic learning algorithms based on supervised learning, which require only a small amount of training data to estimate their parameters. The IB1 data mining algorithm is based on lazy approaches.


Table 1: The statistical analysis of data set 1 for specified classification methods.
(CCI: Correctly Classified Instances; ICI: Incorrectly Classified Instances; MAE: mean absolute error; RAE: relative absolute error; RMSE: root mean squared error; RRSE: root relative squared error.)

Algorithm    CCI (number)  CCI (%)   ICI (number)  ICI (%)   MAE     RAE (%)   Kappa   RMSE    RRSE (%)   Total instances
NaiveBayes   1107          27.5099   2917          72.4901   0.0069  90.0871   0.2526  0.0754  122.8107   4024
BayesNet     2662          66.1531   1362          33.8469   0.0032  42.4047   0.5979  0.0479  78.1282    4024
IB1          2802          69.6322   1222          30.3678   0.0028  37.2325   0.6199  0.0533  86.8274    4024
J48          2908          72.2664   1116          27.7336   0.0032  41.6312   0.6379  0.0454  73.9957    4024
Regression   3051          75.8201   973           24.1799   0.0011  21.0201   0.6859  0.0392  63.9686    4024
SVM          2251          64.1571   1773          35.8429   0.0039  42.0019   0.5743  0.4758  84.9596    4024

Also, the J48 data mining algorithm is based on decision tree methods. Finally, the classification via regression algorithm is based on the Meta approach, which is a newer approach among data mining methods. Regression analysis is a statistical method used for data analysis, and it is usually applied together with correlation analysis, which evaluates the degree of association between two quantitative data sets [37]. For example, Figure 3 shows the classification result of the NaiveBayes algorithm in the WEKA tool. The following section describes the experimental results of the classification algorithms in WEKA. Some effective features such as Correctly Classified Instances, Incorrectly Classified Instances, mean absolute error, and relative absolute error are compared with each other in order to find the best classification algorithm for developing a behavioral antivirus.
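To make the idea of a lazy, instance-based learner concrete, the following sketch classifies a query by copying the label of its nearest training row. It is a plain 1-nearest-neighbour illustration in the spirit of IB1, not WEKA's implementation: attribute normalization and tie handling are not reproduced, and the feature vectors and labels are hypothetical.

```python
import math


def nearest_neighbor_predict(train_rows, train_labels, query):
    """Return the label of the training row closest to `query` (1-NN)."""
    best_label, best_dist = None, math.inf
    for row, label in zip(train_rows, train_labels):
        dist = math.dist(row, query)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical behavioral feature vectors with made-up class labels.
train_rows = [(16, 2, 2), (15, 3, 2), (1, 9, 7), (0, 8, 8)]
train_labels = ["T1", "T1", "T2", "T2"]
label = nearest_neighbor_predict(train_rows, train_labels, (14, 2, 3))
```

No model is built ahead of time; all work happens at prediction time, which is what "lazy" refers to.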

4. Experimental Results and Discussion

In this section we implement our approach using the WEKA tool. We use a system with an Intel Core i3 2.13 GHz CPU and 4 GB RAM for the classification methods. The analysis is carried out with classification algorithms such as NaiveBayes, BayesNet, IB1, J48, Support Vector Machine (SVM), and the logistic regression method. We compare the performance of the classification methods on two malware data sets.

Tables 1 and 2 give the statistical analysis of data sets 1 and 2 for the proposed classification methods. The compared factors are Correctly Classified Instances, Incorrectly Classified Instances, Kappa statistic, mean absolute error, relative absolute error, root mean squared error, and root relative squared error. This comparison shows that the classification via regression method has the best performance in malware detection. For example, in data set 1 the number of correctly classified malware programs is 3051 out of 4024 malware programs. Also, in data set 2 the number of correctly classified malware programs is 3069 out of 3131 malware programs.

Figure 3: The snapshot of the NaiveBayes classification algorithm in WEKA.

According to Tables 1 and 2, the percentage of Correctly Classified Instances of the logistic regression algorithm is higher than that of the other classification methods in each of data sets 1 and 2. Also, the percentage of Incorrectly Classified Instances of the logistic regression algorithm is lower than that of the other classification methods in each of data sets 1 and 2.
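The comparison can be reproduced from the counts alone. The sketch below ranks the algorithms by the number of correctly classified instances reported for data set 1 (4024 instances) and converts the best count to a percentage. The variable and function names are assumptions made for illustration.

```python
TOTAL_DS1 = 4024

# Correctly classified instance counts for data set 1, as listed in Table 1.
correct_ds1 = {
    "NaiveBayes": 1107,
    "BayesNet": 2662,
    "IB1": 2802,
    "J48": 2908,
    "Regression": 3051,
    "SVM": 2251,
}


def accuracy_percent(correct, total):
    """Percentage of correctly classified instances."""
    return 100.0 * correct / total


# Sort algorithm names from most to fewest correct classifications.
ranking = sorted(correct_ds1, key=correct_ds1.get, reverse=True)
best = ranking[0]
best_accuracy = accuracy_percent(correct_ds1[best], TOTAL_DS1)
```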

After the data mining process we test new malware cases with the regression classification algorithm. We downloaded 100 binary malware programs from NetLux (http://vxheaven.org), analyzed their behaviors using the CW-Sandbox tool, and obtained their XML files [38]. Then we add these 100 malware programs to the new data set and compute the quality of their classification as the true optimistic ratio. As expected, classification via regression detects 88 of the malware programs. So we can use classification via regression to develop a behavioral antivirus.


Table 2: The statistical analysis of data set 2 for specified classification methods.

| Algorithm | Correctly classified instances, number (%) | Incorrectly classified instances, number (%) | Mean absolute error | Relative absolute error (%) | Kappa statistic | Root mean squared error | Root relative squared error (%) | Total number of instances |
|---|---|---|---|---|---|---|---|---|
| NaiveBayes | 2678 (85.5318) | 453 (14.4682) | 0.012 | 15.3329 | 0.8459 | 0.1026 | 51.8792 | 3131 |
| BayesNet | 2874 (91.7918) | 257 (8.2082) | 0.0073 | 9.3575 | 0.9127 | 0.0747 | 37.7504 | 3131 |
| IB1 | 3028 (96.7103) | 103 (3.2897) | 0.0027 | 3.5032 | 0.965 | 0.0524 | 26.472 | 3131 |
| J48 | 3008 (96.0715) | 123 (3.9285) | 0.0043 | 5.5353 | 0.9581 | 0.0527 | 26.652 | 3131 |
| Regression | 3069 (98.321) | 62 (1.679) | 0.0021 | 2.2102 | 0.9578 | 0.0543 | 27.4333 | 3131 |
| SVM | 1698 (54.2319) | 1433 (45.7681) | 0.0046 | 5.7993 | 0.5011 | 0.1942 | 98.1954 | 3131 |

Figure 4: The true optimistic ratio (TOR, %) for the classifications test in the new data set, plotted per classification algorithm (BayesNet, Regression, J48, SVM, NaiveBayes, IB1).

Figure 4 depicts the true optimistic ratio percentage for malware detection in the new data set. The true optimistic ratio percentage of the regression method is higher than that of the other classification methods in the new data set.

After testing our new case study with 100 malware programs, Table 3 gives the statistical results for the number of False Acceptance Rate (FAR) cases and the number of False Rejection Rate (FRR) cases. There are platforms such as STAC (http://tec.citius.usc.es/stac) [48] for statistical comparison of the tested algorithms, but we use the WEKA tool for the statistical and experimental results on our data sets.

According to Table 3, no valid input is incorrectly rejected when using our approach with the regression method, whereas the NaiveBayes method incorrectly rejected 6 valid inputs.

Also, in this test case we find one FAR case (an input incorrectly accepted as malware) for the regression method. Figure 5 shows the Total Error Rate (TER) for our new test case using our approach by the regression method.

Figure 5: The Total Error Rate (TER, %) for the classifications test in the new data set, plotted per classification algorithm (IB1, BayesNet, J48, Regression, SVM, NaiveBayes).

Table 3: The statistical analysis of the number of FAR and FRR cases in the new test case study.

| Algorithm | Number of FAR cases | Number of FRR cases | Total number of instances |
|---|---|---|---|
| NaiveBayes | 5 | 6 | 100 |
| BayesNet | 4 | 2 | 100 |
| IB1 | 2 | 1 | 100 |
| J48 | 3 | 2 | 100 |
| Regression | 1 | 0 | 100 |
| SVM | 3 | 2 | 100 |
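The Total Error Rate of formula (5) can be worked out from the FAR and FRR case counts in Table 3. The sketch below assumes that NC + NI equals the 100 tested instances and that the FAR and FRR terms are the case counts listed in the table; the names used are illustrative.

```python
TOTAL_INSTANCES = 100  # assumed to equal NC + NI in formula (5)

# (number of FAR cases, number of FRR cases) per algorithm, from Table 3.
table3 = {
    "NaiveBayes": (5, 6),
    "BayesNet": (4, 2),
    "IB1": (2, 1),
    "J48": (3, 2),
    "Regression": (1, 0),
    "SVM": (3, 2),
}


def total_error_rate_percent(far_cases, frr_cases, total):
    """TER as a percentage: 100 * (FAR + FRR) / (NC + NI)."""
    return 100 * (far_cases + frr_cases) / total


ter_percent = {
    name: total_error_rate_percent(far, frr, TOTAL_INSTANCES)
    for name, (far, frr) in table3.items()
}
lowest = min(ter_percent, key=ter_percent.get)
```

Under these assumptions the regression method has the lowest TER and NaiveBayes the highest.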


5. Conclusion and Future Work

In this paper we proposed a new data mining approach based on classification methodologies for detecting malware behavior. First, a malware behavior executive history XML file is converted to a nonsparse matrix using our suggested application. Then this matrix is translated to a WEKA input data set. To illustrate the performance efficiency, we applied the proposed approaches to a real case study data set using the WEKA tool. Training proceeded using classification algorithms such as NaiveBayes, BayesNet, IB1, J48, and regression algorithms. The regression classification method had the best performance for malware detection. We also analyzed the new data set with the regression classification method. The evaluation results demonstrated the availability of the proposed data mining approach, and the proposed mechanism is more efficient for detecting malware. According to the experimental results, classification of malware behavioral features can be a convenient method for developing a behavioral antivirus. In future work we will try to develop and analyze a real behavioral antivirus platform based on the classification via regression algorithm.

Competing Interests

The authors declare that they have no competing interests.

References

[1] D. B. Ekta Gandotra and S. Sofat, “Malware analysis and classification: a survey,” Journal of Information Security, vol. 5, pp. 56–64, 2014.

[2] P. Wang and Y.-S. Wang, “Malware behavioural detection and vaccine development by using a support vector model classifier,” Journal of Computer and System Sciences, vol. 81, no. 6, pp. 1012–1026, 2015.

[3] G. Ollmann, “The evolution of commercial malware development kits and colour-by-numbers custom malware,” Computer Fraud and Security, vol. 2008, no. 9, pp. 4–7, 2008.

[4] M. Ghiasi, A. Sami, and Z. Salehi, “Dynamic VSA: a framework for malware detection based on register contents,” Engineering Applications of Artificial Intelligence, vol. 44, pp. 111–122, 2015.

[5] D. Bruschi, L. Martignoni, and M. Monga, “Detecting self-mutating malware using control-flow graph matching,” in Detection of Intrusions and Malware & Vulnerability Assessment, R. Buschkes and P. Laskov, Eds., vol. 4064, pp. 129–143, Springer, Berlin, Germany, 2006.

[6] M. R. Chouchane and A. Lakhotia, “Using engine signature to detect metamorphic malware,” in Proceedings of the 4th ACM Workshop on Recurring Malcode, Alexandria, Va, USA, 2006.

[7] N. Kuzurin, A. Shokurov, N. Varnovsky, and V. Zakharov, “On the concept of software obfuscation in computer security,” in Information Security, J. Garay, A. Lenstra, M. Mambo, and R. Peralta, Eds., vol. 4779, pp. 281–298, Springer, Berlin, Germany, 2007.

[8] M. Christodorescu and S. Jha, “Testing malware detectors,” SIGSOFT Software Engineering Notes, vol. 29, no. 4, pp. 34–44, 2004.

[9] L. K. Mehedy Masud and B. Thuraisingham, Data Mining Tools for Malware Detection, vol. 1, CRC Press, 2012.

[10] M. Egele, T. Scholte, E. Kirda, and C. Kruegel, “A survey on automated dynamic malware-analysis techniques and tools,” ACM Computing Surveys, vol. 44, pp. 1–42, 2008.

[11] S. P. Monire Norouzi and A. Mahjur, “A new approach for formal behavioral modeling of protection services in antivirus systems,” International Journal in Foundations of Computer Science & Technology, vol. 4, pp. 57–67, 2014.

[12] A. Safarkhanlou, A. Souri, M. Norouzi, and S. E. H. Sardroud, “Formalizing and verification of an antivirus protection service using model checking,” Procedia Computer Science, vol. 57, pp. 1324–1331, 2015.

[13] I. Santos, F. Brezo, X. Ugarte-Pedrero, and P. G. Bringas, “Opcode sequences as representation of executables for data-mining-based unknown malware detection,” Information Sciences, vol. 231, pp. 64–82, 2013.

[14] N. Abdelhamid, A. Ayesh, and F. Thabtah, “Phishing detection based associative classification data mining,” Expert Systems with Applications, vol. 41, no. 13, pp. 5948–5959, 2014.

[15] G. Jacob, H. Debar, and E. Filiol, “Behavioral detection of malware: from a survey towards an established taxonomy,” Journal in Computer Virology, vol. 4, no. 3, pp. 251–266, 2008.

[16] A. Shabtai, R. Moskovitch, Y. Elovici, and C. Glezer, “Detection of malicious code by applying machine learning classifiers on static features: a state-of-the-art survey,” Information Security Technical Report, vol. 14, no. 1, pp. 16–29, 2009.

[17] S. B. Kotsiantis, “Supervised machine learning: a review of classification techniques,” in Proceedings of the Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, pp. 3–24, IOS Press, 2007.

[18] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: a review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.

[19] L. Chen, Q. Jiang, and S. Wang, “Model-based method for projective clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 7, pp. 1291–1305, 2012.

[20] C. Ravi and R. Manoharan, “Malware detection using Windows Api sequence and machine learning,” International Journal of Computer Applications, vol. 43, no. 17, pp. 12–16, 2012.

[21] R. Rizwan, G. C. Hazarika, and G. Chetia, “Malware threats and mitigation strategies: a survey,” Journal of Theoretical and Applied Information Technology, vol. 29, no. 2, pp. 69–73, 2011.

[22] K. Mathur and H. Saroj, “A survey on techniques in detection and analyzing malware executables,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 44, no. 2, 2012.

[23] N. F. Doherty, L. Anastasakis, and H. Fulford, “The information security policy unpacked: a critical study of the content of university policies,” International Journal of Information Management, vol. 29, no. 6, pp. 449–457, 2009.

[24] G. Tahan, L. Rokach, and Y. Shahar, “Automatic malware detection using common segment analysis and meta-features,” Journal of Machine Learning Research, vol. 13, pp. 949–979, 2012.

[25] M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario, “Automated classification and analysis of internet malware,” in Recent Advances in Intrusion Detection, C. Kruegel, R. Lippmann, and A. Clark, Eds., vol. 4637, pp. 178–197, Springer, Berlin, Germany, 2007.

[26] U. Bayer, A. Moser, C. Kruegel, and E. Kirda, “Dynamic analysis of malicious code,” Journal in Computer Virology, vol. 2, no. 1, pp. 67–77, 2006.


[27] J. Z. Kolter and M. A. Maloof, “Learning to detect and classify malicious executables in the wild,” Journal of Machine Learning Research, vol. 7, pp. 2721–2744, 2006.

[28] P. Trinius, C. Willems, T. Holz, and K. Rieck, A Malware Instruction Set for Behavior-Based Analysis, 2009.

[29] R. Moskovitch and Y. Shahar, “Classification-driven temporal discretization of multivariate time series,” Data Mining and Knowledge Discovery, vol. 29, no. 4, pp. 871–913, 2015.

[30] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, “Data mining methods for detection of new malicious executables,” in Proceedings of the IEEE Symposium on Security and Privacy (S&P), pp. 38–49, Oakland, Calif, USA, 2001.

[31] J. Z. Kolter and M. A. Maloof, “Learning to detect malicious executables in the wild,” in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’04), pp. 470–478, ACM, Seattle, Wash, USA, August 2004.

[32] M. Siddiqui, M. C. Wang, and J. Lee, “Detecting internet worms using data mining techniques,” Journal of Systemics, Cybernetics and Informatics, vol. 6, pp. 48–53, 2008.

[33] H. Yang and Y.-P. P. Chen, “Data mining in lung cancer pathologic staging diagnosis: correlation between clinical and pathology information,” Expert Systems with Applications, vol. 42, no. 15-16, pp. 6168–6176, 2015.

[34] R. Campagni, D. Merlini, R. Sprugnoli, and M. C. Verri, “Data mining models for student careers,” Expert Systems with Applications, vol. 42, no. 13, pp. 5508–5521, 2015.

[35] R. M. Rahman and F. R. Md. Hasan, “Using and comparing different decision tree classification techniques for mining ICDDR,B Hospital Surveillance data,” Expert Systems with Applications, vol. 38, no. 9, pp. 11421–11436, 2011.

[36] S. Ghosh, S. Biswas, D. Sarkar, and P. P. Sarkar, “A novel Neuro-fuzzy classification technique for data mining,” Egyptian Informatics Journal, vol. 15, no. 3, pp. 129–147, 2014.

[37] R. Moskovitch and Y. Shahar, “Fast time intervals mining using the transitivity of temporal relations,” Knowledge and Information Systems, vol. 42, no. 1, pp. 21–48, 2015.

[38] D. Stopel, Z. Boger, R. Moskovitch, Y. Shahar, and Y. Elovici, “Application of artificial neural networks techniques to computer worm detection,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN ’06), pp. 2362–2369, July 2006.

[39] D. Stopel, R. Moskovitch, Z. Boger, Y. Shahar, and Y. Elovici, “Using artificial neural networks to detect unknown computer worms,” Neural Computing and Applications, vol. 18, no. 7, pp. 663–674, 2009.

[40] R. Moskovitch, I. Gus, S. Pluderman, et al., “Detection of unknown computer worms activity based on computer behavior using data mining,” in Proceedings of the 1st IEEE Symposium on Computational Intelligence and Data Mining (CIDM ’07), pp. 202–209, IEEE, Honolulu, Hawaii, USA, April 2007.

[41] N. Nissim, R. Moskovitch, L. Rokach, and Y. Elovici, “Detecting unknown computer worm activity via support vector machines and active learning,” Pattern Analysis and Applications, vol. 15, no. 4, pp. 459–475, 2012.

[42] N. Karthik, R. Arul, and M. J. H. Prasad, “Modeling of wind turbine power curves using firefly algorithm,” in Power Electronics and Renewable Energy Systems, C. Kamalakannan, L. P. Suresh, S. S. Dash, and B. K. Panigrahi, Eds., vol. 326, pp. 1407–1414, Springer, New Delhi, India, 2015.

[43] F. Galton, Finger Prints, Macmillan and Company, 1892.

[44] B. D. Eugenio and M. Glass, “The kappa statistic: a second look,” Computational Linguistics, vol. 30, no. 1, pp. 95–101, 2004.

[45] M. N. Mohammad, N. Sulaiman, and O. A. Muhsin, “A novel intrusion detection system by using intelligent data mining in weka environment,” Procedia Computer Science, vol. 3, pp. 1237–1242, 2011.

[46] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2002.

[47] M. Deshmukh and M. N. K. Prasad, “Partial segmentation and matching technique for iris recognition,” in Computational Intelligence in Data Mining - Volume 1, L. C. Jain, H. S. Behera, J. K. Mandal, and D. P. Mohapatra, Eds., vol. 31, pp. 77–86, Springer India, 2015.

[48] I. Rodríguez-Fdez, A. Canosa, M. Mucientes, and A. Bugarín, “STAC: a web platform for the comparison of algorithms using statistical tests,” in Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 1–8, Istanbul, Turkey, August 2015.



Definition 2 A mean absolute error is a 2-tuple MAE119894=

(119875119894 119879119894) in formula (3) where 119875

119894is the prediction of value and

119879119894is the true value This feature specifies the average error in

the classification procedure in

MAE119894=

1

119896

119896

sum

119895=1

10038161003816100381610038161003816119875119895minus 119879119895

10038161003816100381610038161003816 (3)

Also we can measure the classifiers proficiency using atrue optimistic ratio (TOR) where NC is the number ofcorrectly detected malware programs and NI is the numberof incorrectly detected malware programs in (4) The AORcreates the cost of estimated classification that is significantto setting the cost of malware classification [45]

TOR = NCNC +NI

(4)

Also there are two error rates for measuring the classificationperformance The False Acceptance Rate (FAR) is the ratioof the number of test cases that are incorrectly accepted bya given model to the total number of cases This means thatthis ratio shows the percentage of invalid inputs which areincorrectly accepted The False Rejection Rate (FRR) is theratio of the number of test cases that are incorrectly rejectedby a given model to the total number of cases This meansthat this ratio shows the percentage of valid inputs whichare incorrectly rejected [46] By using these factors we cancalculate the Total Error Rate (TER) as follows [47]

TER = FAR + FRRNC +NI

(5)

In the classification process we use NaiveBayse BayseNetIB1 J48 and classification via regression algorithms TheNaiveBayes and BayesNet are a probabilistic learning algo-rithms based on supervised learning method which requirea small number of training data to estimate the constraintsThe IB1 data mining algorithm is based on lazy approaches

6 Journal of Computer Networks and Communications

Table 1 The statistical analysis of data set 1 for specified classification methods

Algorithms

ResultsCorrectlyClassifiedInstancesNumber

IncorrectlyClassifiedInstancesNumber

Meanabsolute error

Relativeabsolute error

Kappastatistic

Root meansquared error

Root relativesquared error

Total numberof instances

NaiveBayes 1107275099

2917724901 00069 900871 02526 00754 1228107 4024

BayesNet 2662661531

1362338469 00032 424047 05979 00479 781282 4024

IB1 2802696322

1222303678 00028 372325 06199 00533 868274 4024

J48 2908722664 1116 277336 00032 416312 06379 00454 739957 4024

Regression 3051758201 973 241799 00011 210201 06859 00392 639686 4024

SVM 2251641571

1773358429 00039 420019 05743 04758 849596 4024

Also J48 data mining algorithm is based on decision treemethods Finally classification via regression algorithm isbased on Meta approach that is the new approach in datamining methods In other words regression analysis is astatistical method which is used to achieve data analysisRegression is applied with correlation analysis usually Thecorrelation analysis evaluates the association degree betweentwo quantitative data sets [37] For example Figure 3 showsthe classification result of NaiveBayse algorithm in WEKAtoolThe following section describes the experimental resultsof classification algorithms inWEKA Some effective featuressuch as Correctly Classified Instances Incorrectly ClassifiedInstances mean absolute error and relative absolute errorare compared with each other in order to achieve the bestclassification algorithm for developing a behavioral antivirus

4. Experimental Results and Discussion

In this section, we implement our approach using the WEKA tool. We run the classification methods on a system with an Intel Core i3 2.13 GHz CPU and 4 GB of RAM. The analysis uses the NaiveBayes, BayesNet, IB1, J48, Support Vector Machine (SVM), and logistic regression classification algorithms. We compare the performance of the classification methods on two malware data sets.

Tables 1 and 2 give the statistical analysis of data sets 1 and 2 for the proposed classification methods. The compared factors are Correctly Classified Instances, Incorrectly Classified Instances, Kappa statistic, mean absolute error, relative absolute error, root mean squared error, and root relative squared error. This comparison shows that the classification via regression method has the best performance in malware detection. For example, in data set 1, it correctly classifies 3051 of the 4024 malware programs. Also, in data set 2, it correctly classifies 3069 of the 3131 malware programs.

Figure 3: A snapshot of the NaiveBayes classification algorithm in WEKA.

According to Tables 1 and 2, the percentage of Correctly Classified Instances of the logistic regression algorithm is higher than that of the other classification methods in both data sets 1 and 2. Also, the percentage of Incorrectly Classified Instances of the logistic regression algorithm is lower than that of the other classification methods in both data sets.
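The percentages in Tables 1 and 2 follow directly from the instance counts. The following minimal Python sketch shows the arithmetic for one row; the function name `classification_percentages` is our own illustrative choice and is not part of WEKA.

```python
def classification_percentages(correct: int, total: int) -> tuple[float, float]:
    """Return (CCI %, ICI %) from the number of correctly classified instances."""
    if total <= 0:
        raise ValueError("total must be positive")
    cci = 100.0 * correct / total
    return cci, 100.0 - cci


# Counts taken from the Regression row of Table 1 (data set 1).
cci, ici = classification_percentages(3051, 4024)
print(f"CCI = {cci:.4f}%, ICI = {ici:.4f}%")
```

The same call with the counts of any other row gives the percentage implied by that row's counts.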

After the data mining process, we test new malware cases with the regression classification algorithm. We downloaded 100 binary malware programs from NetLux (http://vxheaven.org), analyzed their behavior with the CWSandbox tool, and obtained the XML file for each [38]. Then we add these 100 malware programs to the new data set and measure the quality of their classification as the true optimistic ratio. As expected, classification via regression detects 88 of the malware programs. So, we can use classification via regression to develop a behavioral antivirus.


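Table 2: The statistical analysis of data set 2 for the specified classification methods.

| Algorithm | Correctly classified instances (number) | Correctly classified instances (%) | Incorrectly classified instances (number) | Incorrectly classified instances (%) | Mean absolute error | Relative absolute error (%) | Kappa statistic | Root mean squared error | Root relative squared error (%) | Total number of instances |
|---|---|---|---|---|---|---|---|---|---|---|
| NaiveBayes | 2678 | 85.5318 | 453 | 14.4682 | 0.012 | 15.3329 | 0.8459 | 0.1026 | 51.8792 | 3131 |
| BayesNet | 2874 | 91.7918 | 257 | 8.2082 | 0.0073 | 9.3575 | 0.9127 | 0.0747 | 37.7504 | 3131 |
| IB1 | 3028 | 96.7103 | 103 | 3.2897 | 0.0027 | 3.5032 | 0.965 | 0.0524 | 26.472 | 3131 |
| J48 | 3008 | 96.0715 | 123 | 3.9285 | 0.0043 | 5.5353 | 0.9581 | 0.0527 | 26.652 | 3131 |
| Regression | 3069 | 98.321 | 62 | 1.679 | 0.0021 | 2.2102 | 0.9578 | 0.0543 | 27.4333 | 3131 |
| SVM | 1698 | 54.2319 | 1433 | 45.7681 | 0.0046 | 57.993 | 0.5011 | 0.1942 | 98.1954 | 3131 |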

Figure 4: The true optimistic ratio (TOR, %) for the classification tests in the new data set. (Bar chart; x-axis: classification algorithms BayesNet, Regression, J48, SVM, NaiveBayes, and IB1; y-axis: TOR from 0 to 100%.)

Figure 4 depicts the true optimistic ratio percentage for malware detection in the new data set. The true optimistic ratio percentage of the regression method is higher than that of the other classification methods in the new data set.

After testing our new case study with 100 malware programs, Table 3 reports the number of False Acceptance Rate (FAR) cases and the number of False Rejection Rate (FRR) cases. There are platforms such as STAC (http://tec.citius.usc.es/stac) [48] for statistical comparison of the tested algorithms, but we use the WEKA tool for the statistical and experimental results on our data sets.

According to Table 3, the regression method incorrectly rejects no valid input. By contrast, the NaiveBayes method incorrectly rejects 6 valid inputs.

Also, in this test case, the regression method produces one FAR case, that is, one input incorrectly accepted as malware. Figure 5 shows the Total Error Rate (TER) of each classification method for our new test case.

Figure 5: The Total Error Rate (TER, %) for the classification tests in the new data set. (Bar chart; x-axis: classification algorithms IB1, BayesNet, J48, Regression, SVM, and NaiveBayes; y-axis: TER from 0 to 20%.)

Table 3: The statistical analysis of the number of FAR and FRR cases in the new test case study.

| Algorithm | Number of FAR cases | Number of FRR cases | Total number of instances |
|---|---|---|---|
| NaiveBayes | 5 | 6 | 100 |
| BayesNet | 4 | 2 | 100 |
| IB1 | 2 | 1 | 100 |
| J48 | 3 | 2 | 100 |
| Regression | 1 | 0 | 100 |
| SVM | 3 | 2 | 100 |
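As a worked illustration of formula (5), the following Python sketch computes the TER of each classifier from the counts in Table 3. It assumes that FAR and FRR in formula (5) are taken as the numbers of cases and that NC + NI equals the total number of instances (100); both readings are our assumptions, and the names `TABLE_3` and `total_error_rate` are illustrative only.

```python
# (FAR cases, FRR cases) per classifier, copied from Table 3.
TABLE_3 = {
    "NaiveBayes": (5, 6),
    "BayesNet": (4, 2),
    "IB1": (2, 1),
    "J48": (3, 2),
    "Regression": (1, 0),
    "SVM": (3, 2),
}
TOTAL_INSTANCES = 100  # assumed to equal NC + NI for the new test set


def total_error_rate(far_cases: int, frr_cases: int, total: int) -> float:
    """TER = (FAR + FRR) / (NC + NI), returned as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (far_cases + frr_cases) / total


ter_percent = {
    name: total_error_rate(far, frr, TOTAL_INSTANCES)
    for name, (far, frr) in TABLE_3.items()
}
for name, ter in sorted(ter_percent.items(), key=lambda item: item[1]):
    print(f"{name:<12} TER = {ter:.1f}%")
```

Under these assumptions, every value falls inside the 0 to 20% range of the Figure 5 axis.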


5. Conclusion and Future Work

In this paper, we proposed a new data mining approach based on classification methodologies for detecting malware behavior. First, a malware behavior executive history XML file is converted to a nonsparse matrix using our suggested application. Then this matrix is translated to a WEKA input data set. To illustrate the performance efficiency, we applied the proposed approaches to a real case study data set using the WEKA tool. Training was carried out with classification algorithms such as NaiveBayes, BayesNet, IB1, J48, and regression algorithms. The regression classification method had the best performance for malware detection. We also analyzed the new data set with the regression classification method. The evaluation results demonstrated the applicability of the proposed data mining approach, and showed that the proposed mechanism is more efficient for detecting malware. The experimental results indicate that classification of malware behavioral features can be a convenient method for developing a behavioral antivirus. In future work, we will try to develop and analyze a real behavioral antivirus platform based on the classification via regression algorithm.

Competing Interests

The authors declare that they have no competing interests.

References

[1] D. B. Ekta Gandotra and S. Sofat, "Malware analysis and classification: a survey," Journal of Information Security, vol. 5, pp. 56–64, 2014.

[2] P. Wang and Y.-S. Wang, "Malware behavioural detection and vaccine development by using a support vector model classifier," Journal of Computer and System Sciences, vol. 81, no. 6, pp. 1012–1026, 2015.

[3] G. Ollmann, "The evolution of commercial malware development kits and colour-by-numbers custom malware," Computer Fraud and Security, vol. 2008, no. 9, pp. 4–7, 2008.

[4] M. Ghiasi, A. Sami, and Z. Salehi, "Dynamic VSA: a framework for malware detection based on register contents," Engineering Applications of Artificial Intelligence, vol. 44, pp. 111–122, 2015.

[5] D. Bruschi, L. Martignoni, and M. Monga, "Detecting self-mutating malware using control-flow graph matching," in Detection of Intrusions and Malware & Vulnerability Assessment, R. Buschkes and P. Laskov, Eds., vol. 4064, pp. 129–143, Springer, Berlin, Germany, 2006.

[6] M. R. Chouchane and A. Lakhotia, "Using engine signature to detect metamorphic malware," in Proceedings of the 4th ACM Workshop on Recurring Malcode, Alexandria, Va, USA, 2006.

[7] N. Kuzurin, A. Shokurov, N. Varnovsky, and V. Zakharov, "On the concept of software obfuscation in computer security," in Information Security, J. Garay, A. Lenstra, M. Mambo, and R. Peralta, Eds., vol. 4779, pp. 281–298, Springer, Berlin, Germany, 2007.

[8] M. Christodorescu and S. Jha, "Testing malware detectors," SIGSOFT Software Engineering Notes, vol. 29, no. 4, pp. 34–44, 2004.

[9] L. K. Mehedy Masud and B. Thuraisingham, Data Mining Tools for Malware Detection, vol. 1, CRC Press, 2012.

[10] M. Egele, T. Scholte, E. Kirda, and C. Kruegel, "A survey on automated dynamic malware-analysis techniques and tools," ACM Computing Surveys, vol. 44, pp. 1–42, 2008.

[11] S. P. Monire Norouzi and A. Mahjur, "A new approach for formal behavioral modeling of protection services in antivirus systems," International Journal in Foundations of Computer Science & Technology, vol. 4, pp. 57–67, 2014.

[12] A. Safarkhanlou, A. Souri, M. Norouzi, and S. E. H. Sardroud, "Formalizing and verification of an antivirus protection service using model checking," Procedia Computer Science, vol. 57, pp. 1324–1331, 2015.

[13] I. Santos, F. Brezo, X. Ugarte-Pedrero, and P. G. Bringas, "Opcode sequences as representation of executables for data-mining-based unknown malware detection," Information Sciences, vol. 231, pp. 64–82, 2013.

[14] N. Abdelhamid, A. Ayesh, and F. Thabtah, "Phishing detection based associative classification data mining," Expert Systems with Applications, vol. 41, no. 13, pp. 5948–5959, 2014.

[15] G. Jacob, H. Debar, and E. Filiol, "Behavioral detection of malware: from a survey towards an established taxonomy," Journal in Computer Virology, vol. 4, no. 3, pp. 251–266, 2008.

[16] A. Shabtai, R. Moskovitch, Y. Elovici, and C. Glezer, "Detection of malicious code by applying machine learning classifiers on static features: a state-of-the-art survey," Information Security Technical Report, vol. 14, no. 1, pp. 16–29, 2009.

[17] S. B. Kotsiantis, "Supervised machine learning: a review of classification techniques," in Proceedings of the Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, pp. 3–24, IOS Press, 2007.

[18] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.

[19] L. Chen, Q. Jiang, and S. Wang, "Model-based method for projective clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 7, pp. 1291–1305, 2012.

[20] C. Ravi and R. Manoharan, "Malware detection using Windows Api sequence and machine learning," International Journal of Computer Applications, vol. 43, no. 17, pp. 12–16, 2012.

[21] R. Rizwan, G. C. Hazarika, and G. Chetia, "Malware threats and mitigation strategies: a survey," Journal of Theoretical and Applied Information Technology, vol. 29, no. 2, pp. 69–73, 2011.

[22] K. Mathur and H. Saroj, "A survey on techniques in detection and analyzing malware executables," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 44, no. 2, 2012.

[23] N. F. Doherty, L. Anastasakis, and H. Fulford, "The information security policy unpacked: a critical study of the content of university policies," International Journal of Information Management, vol. 29, no. 6, pp. 449–457, 2009.

[24] G. Tahan, L. Rokach, and Y. Shahar, "Automatic malware detection using common segment analysis and meta-features," Journal of Machine Learning Research, vol. 13, pp. 949–979, 2012.

[25] M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario, "Automated classification and analysis of internet malware," in Recent Advances in Intrusion Detection, C. Kruegel, R. Lippmann, and A. Clark, Eds., vol. 4637, pp. 178–197, Springer, Berlin, Germany, 2007.

[26] U. Bayer, A. Moser, C. Kruegel, and E. Kirda, "Dynamic analysis of malicious code," Journal in Computer Virology, vol. 2, no. 1, pp. 67–77, 2006.


[27] J. Z. Kolter and M. A. Maloof, "Learning to detect and classify malicious executables in the wild," Journal of Machine Learning Research, vol. 7, pp. 2721–2744, 2006.

[28] P. Trinius, C. Willems, T. Holz, and K. Rieck, A Malware Instruction Set for Behavior-Based Analysis, 2009.

[29] R. Moskovitch and Y. Shahar, "Classification-driven temporal discretization of multivariate time series," Data Mining and Knowledge Discovery, vol. 29, no. 4, pp. 871–913, 2015.

[30] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, "Data mining methods for detection of new malicious executables," in Proceedings of the IEEE Symposium on Security and Privacy (S&P), pp. 38–49, Oakland, Calif, USA, 2001.

[31] J. Z. Kolter and M. A. Maloof, "Learning to detect malicious executables in the wild," in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), pp. 470–478, ACM, Seattle, Wash, USA, August 2004.

[32] M. Siddiqui, M. C. Wang, and J. Lee, "Detecting internet worms using data mining techniques," Journal of Systemics, Cybernetics and Informatics, vol. 6, pp. 48–53, 2008.

[33] H. Yang and Y.-P. P. Chen, "Data mining in lung cancer pathologic staging diagnosis: correlation between clinical and pathology information," Expert Systems with Applications, vol. 42, no. 15-16, pp. 6168–6176, 2015.

[34] R. Campagni, D. Merlini, R. Sprugnoli, and M. C. Verri, "Data mining models for student careers," Expert Systems with Applications, vol. 42, no. 13, pp. 5508–5521, 2015.

[35] R. M. Rahman and F. R. Md. Hasan, "Using and comparing different decision tree classification techniques for mining ICDDR,B Hospital Surveillance data," Expert Systems with Applications, vol. 38, no. 9, pp. 11421–11436, 2011.

[36] S. Ghosh, S. Biswas, D. Sarkar, and P. P. Sarkar, "A novel Neuro-fuzzy classification technique for data mining," Egyptian Informatics Journal, vol. 15, no. 3, pp. 129–147, 2014.

[37] R. Moskovitch and Y. Shahar, "Fast time intervals mining using the transitivity of temporal relations," Knowledge and Information Systems, vol. 42, no. 1, pp. 21–48, 2015.

[38] D. Stopel, Z. Boger, R. Moskovitch, Y. Shahar, and Y. Elovici, "Application of artificial neural networks techniques to computer worm detection," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '06), pp. 2362–2369, July 2006.

[39] D. Stopel, R. Moskovitch, Z. Boger, Y. Shahar, and Y. Elovici, "Using artificial neural networks to detect unknown computer worms," Neural Computing and Applications, vol. 18, no. 7, pp. 663–674, 2009.

[40] R. Moskovitch, I. Gus, S. Pluderman et al., "Detection of unknown computer worms activity based on computer behavior using data mining," in Proceedings of the 1st IEEE Symposium on Computational Intelligence and Data Mining (CIDM '07), pp. 202–209, IEEE, Honolulu, Hawaii, USA, April 2007.

[41] N. Nissim, R. Moskovitch, L. Rokach, and Y. Elovici, "Detecting unknown computer worm activity via support vector machines and active learning," Pattern Analysis and Applications, vol. 15, no. 4, pp. 459–475, 2012.

[42] N. Karthik, R. Arul, and M. J. H. Prasad, "Modeling of wind turbine power curves using firefly algorithm," in Power Electronics and Renewable Energy Systems, C. Kamalakannan, L. P. Suresh, S. S. Dash, and B. K. Panigrahi, Eds., vol. 326, pp. 1407–1414, Springer, New Delhi, India, 2015.

[43] F. Galton, Finger Prints, Macmillan and Company, 1892.

[44] B. D. Eugenio and M. Glass, "The kappa statistic: a second look," Computational Linguistics, vol. 30, no. 1, pp. 95–101, 2004.

[45] M. N. Mohammad, N. Sulaiman, and O. A. Muhsin, "A novel intrusion detection system by using intelligent data mining in weka environment," Procedia Computer Science, vol. 3, pp. 1237–1242, 2011.

[46] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2002.

[47] M. Deshmukh and M. N. K. Prasad, "Partial segmentation and matching technique for iris recognition," in Computational Intelligence in Data Mining, Volume 1, L. C. Jain, H. S. Behera, J. K. Mandal, and D. P. Mohapatra, Eds., vol. 31, pp. 77–86, Springer India, 2015.

[48] I. Rodríguez-Fdez, A. Canosa, M. Mucientes, and A. Bugarín, "STAC: a web platform for the comparison of algorithms using statistical tests," in Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 1–8, Istanbul, Turkey, August 2015.


    <?xml version="1.0" ?>
    <!-- This analysis was created by CWSandbox (c) CWSE GmbH / Sunbelt Software -->
    <analysis cwsversion="2.1.12" time="08.08.2009 05:22:19"
        file="c:\260589951029048b3e6d93316b3c2507"
        md5="260589951029048b3e6d93316b3c2507"
        sha1="0089453df77890ae95ce7d9130a4ef85eaea36e8"
        logpath="c:\cwsandbox\log\260589951029048b3e6d93316b3c2507\run_1\"
        analysisid="647702" sampleid="431657">
      <calltree>
        <process_call index="1" pid="1940"
            filename="c:\260589951029048b3e6d93316b3c2507" starttime="00:01.922"
            startreason="AnalysisTarget">
          <calltree>
            <process_call index="2" pid="2084"
                filename="C:\Programme\Internet Explorer\iexplore.exe"
                starttime="00:05.343" startreason="CreateProcess" />
          </calltree>
        </process_call>
        <process_call index="3" pid="948"
            filename="C:\WINDOWS\system32\svchost.exe" starttime="00:07.062"
            startreason="DCOMService" />
      </calltree>
      <processes>
        <process index="1" pid="1940"
            filename="c:\260589951029048b3e6d93316b3c2507" filesize="761856"
            md5="260589951029048b3e6d93316b3c2507"
            sha1="0089453df77890ae95ce7d9130a4ef85eaea36e8" username="Administrator"
            parentindex="0" starttime="00:01.922" terminationtime="00:07.484"
            startreason="AnalysisTarget" terminationreason="NormalTermination"
            executionstatus="OK" applicationtype="Win32Application">
          <dll_handling_section>
            <load_image filename="c:\260589951029048b3e6d93316b3c2507" successful="1"
                address="$400000" end_address="$4C1000" size="790528" />
            <load_dll filename="C:\WINDOWS\system32\ntdll.dll" successful="1"
                address="$7C910000" end_address="$7C9C9000" size="757760" quantity="16" />
            <load_dll filename="C:\WINDOWS\system32\kernel32.dll" successful="1"
                address="$7C800000" end_address="$7C908000" size="1081344" quantity="2" />
            <load_dll filename="C:\WINDOWS\system32\gdi32.dll" successful="1"
                address="$77EF0000" end_address="$77F39000" size="299008" quantity="2" />
            <load_dll filename="C:\WINDOWS\system32\USER32.dll" successful="1"
                address="$7E360000" end_address="$7E3F1000" size="593920" quantity="2" />
          </dll_handling_section>
          <filesystem_section>

Box 1: A sample part of an XML file that contains a malware behavior.
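To illustrate how behavioral features could be read from such a report, the following Python sketch counts the DLL loads per library with the standard `xml.etree.ElementTree` module. It is a minimal sketch and not the authors' conversion program: the element and attribute names (`load_dll`, `filename`, `quantity`) are assumed from the excerpt in Box 1, the shortened `SAMPLE` document is hypothetical, and `loaded_dlls` is our own helper name.

```python
import xml.etree.ElementTree as ET

# A shortened, well-formed stand-in for a CWSandbox report (hypothetical sample).
SAMPLE = """<analysis>
  <processes>
    <process index="1" pid="1940">
      <dll_handling_section>
        <load_dll filename="C:\\WINDOWS\\system32\\ntdll.dll" successful="1" quantity="16"/>
        <load_dll filename="C:\\WINDOWS\\system32\\kernel32.dll" successful="1" quantity="2"/>
      </dll_handling_section>
    </process>
  </processes>
</analysis>"""


def loaded_dlls(xml_text: str) -> dict[str, int]:
    """Map each loaded DLL (lower-cased base name) to its total load quantity."""
    root = ET.fromstring(xml_text)
    counts: dict[str, int] = {}
    for node in root.iter("load_dll"):
        # Keep only the file name after the last backslash of the Windows path.
        name = node.get("filename", "").rsplit("\\", 1)[-1].lower()
        counts[name] = counts.get(name, 0) + int(node.get("quantity", "1"))
    return counts


dll_counts = loaded_dlls(SAMPLE)
print(dll_counts)
```

Counts of this kind are one plausible source for the numeric `dll` attributes of the WEKA input.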

malware. Finally, this procedure can be used for developing a behavioral antivirus. To describe the behavioral model of malware, we download the XML files that are available from PIL (http://dws.informatik.uni-mannheim.de) [38–40]. We use 7155 XML files in total, split into data set 1 and data set 2: data set 1 contains 4024 XML files and data set 2 contains 3131 XML files. Data set 1 has 89 properties and data set 2 has 91 properties for each malware program.

Then we convert each XML file to a nonsparse matrix by using our suggested program. Each entry of the matrix is a pair of numbers: the first number is the property number and the second number is the importance of that property. The first row of this matrix is shown as follows:

(0 1.068, 2 0.534, 8 0.534, 11 0.534, 12 0.534, 23 0.534, 32 0.534, 33 0.534, 35 0.534, 36 0.534, 40 0.534, 45 1.068, 46 1.603, 47 1.068, 48 1.068, 49 1.068, 50 1.068, 51 1.068, 52 1.068, 53 1.068, 54 2.137, 55 1.068, 56 0.534, 57 1.068, 58 2.137, 61 0.534, 62 0.534, 63 2.137, 65 0.534, 66 0.534, 73 1.603, 83 22, 84 16, 85 4, 86 8, 87 6, 88 T1)

The last entry of this row, 88 T1, gives the kind of malware.
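The row format above can be expanded into a dense feature vector plus a class label. The following Python sketch shows one way to do this; it assumes that the pairs are separated by commas, that properties are numbered from 0, and that the last property (index 88 of the 89 properties in data set 1) holds the class. The helper `parse_sparse_row` is hypothetical and is not the authors' program.

```python
def parse_sparse_row(row: str, n_attributes: int) -> tuple[list[float], str]:
    """Expand 'index value' pairs into a dense vector and a class label."""
    class_index = n_attributes - 1      # assumed position of the class attribute
    dense = [0.0] * class_index         # properties not listed default to zero
    label = ""
    for pair in row.strip("() ").split(","):
        index_text, value_text = pair.split()
        index = int(index_text)
        if index == class_index:
            label = value_text          # e.g. "T1", the kind of malware
        else:
            dense[index] = float(value_text)
    return dense, label


# A shortened version of the first row shown above.
row = "(0 1.068, 2 0.534, 83 22, 87 6, 88 T1)"
features, label = parse_sparse_row(row, n_attributes=89)
print(len(features), features[0], features[83], label)
```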

Finally, we analyze the executive history of malware in the WEKA environment. The malware executive history can be produced by applications such as a sandbox tool or a virtual machine, which execute malware safely on a computer system and prevent it from spreading [28, 38–41]. The XML file includes useful information such as system library file calls; creating, searching, and changing files; modifying the registry; main process information; creating a mutex (a mutex is an application object which permits multiple program threads to share the same resource); modifying virtual memory; sending email; registry operations; and switches communications. Using the suggested program, all of this information is read and saved as a nonsparse matrix.

Now the matrix is converted to the standard WEKA input form, an arff file, for data set 1 and data set 2. This standard form is shown in Box 2.

    @RELATION TEST                  file name
    @ATTRIBUTE dll1 numeric         property
    @ATTRIBUTE dll2 numeric         property
    @ATTRIBUTE dll3 numeric         property
    @ATTRIBUTE dll4 numeric         property
    ...                             property
    @ATTRIBUTE param88 numeric      property
    @ATTRIBUTE class Answer         - property
    @DATA
    0 1.068, 2 0.534, 8 0.534, 11 0.534, 12 0.534, 23 0.534, 32 0.534, 33 0.534, 35

Box 2: An example of the standard form for WEKA input.
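The conversion to the arff form can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the authors' program: the attribute names `param0` to `param87`, the class labels `T1` and `T2`, and the function `build_arff` are hypothetical. It uses the sparse instance syntax of the ARFF format, in which each data row is written as comma-separated "index value" pairs inside braces.

```python
def build_arff(relation: str, numeric_names: list[str],
               class_labels: list[str], sparse_rows: list[str]) -> str:
    """Return the text of a sparse ARFF file for WEKA."""
    lines = [f"@RELATION {relation}"]
    lines += [f"@ATTRIBUTE {name} numeric" for name in numeric_names]
    # The class is a nominal attribute listing every allowed label.
    lines.append("@ATTRIBUTE class {" + ",".join(class_labels) + "}")
    lines.append("@DATA")
    # Sparse ARFF rows are enclosed in braces.
    lines += ["{" + row + "}" for row in sparse_rows]
    return "\n".join(lines) + "\n"


arff_text = build_arff(
    relation="TEST",
    numeric_names=[f"param{i}" for i in range(88)],  # hypothetical names
    class_labels=["T1", "T2"],                       # hypothetical labels
    sparse_rows=["0 1.068, 2 0.534, 88 T1"],
)
print(arff_text.splitlines()[0])
```

With 88 numeric attributes, the class attribute has index 88, which matches the last entry of the matrix row.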

3.1. Classification and Prediction Approaches. This section describes the classification methods in two real case studies, data set 1 and data set 2. First, we analyze the data mining results on data set 1 and data set 2 with the WEKA classification algorithms. To specify the performance of the classification methods in WEKA, we briefly describe some effective features [27]. The Correctly Classified Instances (CCI) measure gives the percentage of test cases that were correctly classified. Also, the Incorrectly Classified Instances (ICI) measure gives the percentage of test cases that were incorrectly classified.

The relative absolute error (RAE) is measured relative to the error of a simple predictor, which always predicts the average of the actual values. In the RAE, the error is the total absolute error rather than the total squared error.

Definition 1. A relative absolute error is a 3-tuple $\mathrm{RAE}_i = (F_{(ij)}, V_j, \bar{V})$ in formula (1), where $F_{(ij)}$ is the value predicted by the individual program $i$ for sample case $j$ (out of $k$ sample cases), $V_j$ is the objective value for sample case $j$, and $\bar{V}$ is given by formula (2):

$$\mathrm{RAE}_i = \frac{\sum_{j=1}^{k} \left| F_{(ij)} - V_j \right|}{\sum_{j=1}^{k} \left| V_j - \bar{V} \right|} \tag{1}$$

$$\bar{V} = \frac{1}{k} \sum_{j=1}^{k} V_j \tag{2}$$
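Formulas (1) and (2) translate directly into code. The following Python sketch is a plain illustration of the definition; the function name and the sample values are our own, and WEKA reports the RAE as a percentage, that is, this ratio multiplied by 100.

```python
from typing import Sequence


def relative_absolute_error(predicted: Sequence[float],
                            actual: Sequence[float]) -> float:
    """RAE: total absolute error divided by that of the mean predictor."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)                          # formula (2)
    numerator = sum(abs(f - v) for f, v in zip(predicted, actual))
    denominator = sum(abs(v - mean_actual) for v in actual)
    if denominator == 0:
        raise ValueError("all actual values are equal; RAE is undefined")
    return numerator / denominator                                   # formula (1)


# Illustrative values only (not taken from the data sets).
rae = relative_absolute_error([1.5, 2.0, 2.5, 4.0], [1.0, 2.0, 3.0, 4.0])
print(rae)
```

A value below 1 means that the classifier makes a smaller total absolute error than always predicting the average.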

Also, the mean absolute error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction. The MAE measures the accuracy of continuous variables in the prediction procedure. The MAE is the average of the absolute differences between each forecast and the corresponding observation. The MAE is a linear score, which means that all the individual differences are weighted equally in the average [42–44].

Definition 2. A mean absolute error is a 2-tuple $\mathrm{MAE}_i = (P_j, T_j)$ in formula (3), where $P_j$ is the predicted value and $T_j$ is the true value for sample case $j$. This feature specifies the average error in the classification procedure:

$$\mathrm{MAE}_i = \frac{1}{k} \sum_{j=1}^{k} \left| P_j - T_j \right| \tag{3}$$
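Formula (3) can be written as a short Python function. This is a sketch with illustrative input values; the function name is our own.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float],
                        actual: Sequence[float]) -> float:
    """MAE: average of |P_j - T_j| over the k sample cases, as in formula (3)."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, actual)) / len(actual)


# Illustrative values only: one of the three predictions is off by 1.
mae = mean_absolute_error([0.0, 1.0, 1.0], [0.0, 0.0, 1.0])
print(mae)
```

Because every difference enters with the same weight, a single large error raises the MAE less than it would raise a squared-error measure.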

Also, we can measure the proficiency of the classifiers using the true optimistic ratio (TOR) in (4), where NC is the number of correctly detected malware programs and NI is the number of incorrectly detected malware programs. The TOR gives the cost of the estimated classification, which is significant for setting the cost of malware classification [45].

$$\mathrm{TOR} = \frac{\mathrm{NC}}{\mathrm{NC} + \mathrm{NI}} \tag{4}$$
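As a small worked example of formula (4), the following Python sketch computes the TOR from two counts. The counts used here (88 correctly detected and 12 incorrectly detected programs out of 100) are an illustrative reading of the new test case reported in Section 4, and the function name is our own.

```python
def true_optimistic_ratio(nc: int, ni: int) -> float:
    """TOR = NC / (NC + NI), following formula (4)."""
    if nc < 0 or ni < 0 or nc + ni == 0:
        raise ValueError("counts must be non-negative and not both zero")
    return nc / (nc + ni)


# Illustrative counts: 88 of 100 programs detected correctly.
tor = true_optimistic_ratio(nc=88, ni=12)
print(f"TOR = {100 * tor:.1f}%")
```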

Also, there are two error rates for measuring the classification performance. The False Acceptance Rate (FAR) is the ratio of the number of test cases that are incorrectly accepted by a given model to the total number of cases. That is, this ratio shows the percentage of invalid inputs which are incorrectly accepted. The False Rejection Rate (FRR) is the ratio of the number of test cases that are incorrectly rejected by a given model to the total number of cases. That is, this ratio shows the percentage of valid inputs which are incorrectly rejected [46]. By using these factors, we can calculate the Total Error Rate (TER) as follows [47]:

$$\mathrm{TER} = \frac{\mathrm{FAR} + \mathrm{FRR}}{\mathrm{NC} + \mathrm{NI}} \tag{5}$$

In the classification process we use NaiveBayse BayseNetIB1 J48 and classification via regression algorithms TheNaiveBayes and BayesNet are a probabilistic learning algo-rithms based on supervised learning method which requirea small number of training data to estimate the constraintsThe IB1 data mining algorithm is based on lazy approaches

6 Journal of Computer Networks and Communications

Table 1 The statistical analysis of data set 1 for specified classification methods

Algorithms

ResultsCorrectlyClassifiedInstancesNumber

IncorrectlyClassifiedInstancesNumber

Meanabsolute error

Relativeabsolute error

Kappastatistic

Root meansquared error

Root relativesquared error

Total numberof instances

NaiveBayes 1107275099

2917724901 00069 900871 02526 00754 1228107 4024

BayesNet 2662661531

1362338469 00032 424047 05979 00479 781282 4024

IB1 2802696322

1222303678 00028 372325 06199 00533 868274 4024

J48 2908722664 1116 277336 00032 416312 06379 00454 739957 4024

Regression 3051758201 973 241799 00011 210201 06859 00392 639686 4024

SVM 2251641571

1773358429 00039 420019 05743 04758 849596 4024

Also J48 data mining algorithm is based on decision treemethods Finally classification via regression algorithm isbased on Meta approach that is the new approach in datamining methods In other words regression analysis is astatistical method which is used to achieve data analysisRegression is applied with correlation analysis usually Thecorrelation analysis evaluates the association degree betweentwo quantitative data sets [37] For example Figure 3 showsthe classification result of NaiveBayse algorithm in WEKAtoolThe following section describes the experimental resultsof classification algorithms inWEKA Some effective featuressuch as Correctly Classified Instances Incorrectly ClassifiedInstances mean absolute error and relative absolute errorare compared with each other in order to achieve the bestclassification algorithm for developing a behavioral antivirus

4 Experimental Results and Discussion

In this section we implemented our approach using WEKAtool We use a system by Intel Core i3 213 GHz CPU 4GBRAM for the classification methods This analysis has beendone by some classification algorithms such as NaiveBayseBayseNet IB1 J48 System Vector Machine (SVM) andlogistic regression method We compared performance ofclassification methods in two malware data sets

In Tables 1 and 2 the statistical analysis of data sets 1and 2 is specified for proposed classification methods Thecompared factors in the classification methods are CorrectlyClassified Instances Incorrectly Classified Instances Kappastatistic mean absolute error relative absolute error rootmean squared error and root relative squared error In thiscomparison we show that the classification via regressionmethod has best performance in malware detection Forexample in data set 1 the number of correctly classifiedmalware programs is 3051 from total 4024malware programsAlso in data set 2 the number of correctly classified malwareprograms is 3069 from total 3131 malware programs

Figure 3 The snapshot of NaiveBayse classification algorithm inWEKA

According to Tables 1 and 2 the percentage of CorrectlyClassified Instances of the logistic regression algorithm ishigher than the other classification methods in each of datasets 1 and 2 Also the percentage of Incorrectly ClassifiedInstances of the logistic regression algorithm is lower than theother classification methods in each of data sets 1 and 2

After data mining process we test a new malware caseby the regression classification algorithm 100 binarymalwareprograms are downloaded from NetLux (httpvxheavenorg)and we analyzed their behaviors by using CW-Sandbox tooland we get its XML file [38] Then we add these 100 malwareprograms to the new data set and compute the quality of theirclassification as true optimistic ratio As we expect by classi-fication via regression 88 malware programs are detected Sowe can use the classification via regression to develop a behav-aioral antivirus

Journal of Computer Networks and Communications 7

Table 2 The statistical analysis of data set 2 for specified classification methods

Algorithms

ResultsCorrectlyClassifiedInstancesNumber

IncorrectlyClassifiedInstancesNumber

Meanabsolute error

Relativeabsolute error

Kappastatistic

Root meansquared error

Root relativesquared error

Total numberof instances

NaiveBayes 2678855318

453144682 0012 153329 08459 01026 518792 3131

BayesNet 2874917918 257 82082 00073 93575 09127 00747 377504 3131

IB1 3028967103 103 32897 00027 35032 0965 00524 26472 3131

J48 3008960715 123 39285 00043 55353 09581 00527 26652 3131

Regression 3069 98321 62 1679 00021 22102 09578 00543 274333 3131

SVM 1698542319

1433457681 00046 57993 05011 01942 981954 3131

New data set

0102030405060708090

100

True

opt

imist

ic ra

tio (T

OR)

()

Baye

sNet

Regr

essio

n

J48

SVM

Nai

veBa

yes

IB1

Classification algorithms

Figure 4 The true optimistic ratio for the classifications test in thenew data set

Figure 4 depicts the true optimistic ratio percentage formalware detection in the new data sets The true optimisticratio percentage of regressionmethod is higher than the otherclassification methods in the new data set

After testing our new case study by 100 malware pro-grams Table 3 describes a statistical result for the FalseAcceptance Rate (FAR) number of cases and the FalseRejection Rate (FRR) number of cases Of course there aresome platforms such as STAC (httpteccitiususcesstac)[48] for statistical comparison of the tested algorithms Butwe use theWEKA tool for statistical and experimental resultsfor our data sets

According to Table 3 there is no valid input whichis incorrectly rejected using our approach by regressionmethod Also NaiveBayes method rejected 6 valid inputsincorrectly

Also, in this test case we find one FAR case, that is, one input incorrectly accepted as malware. Figure 5 shows the Total Error Rate (TER) for our new test case using our approach by the regression method.

Figure 5: The Total Error Rate (TER) for the classifications test in the new data set (x-axis: classification algorithms IB1, BayesNet, J48, Regression, SVM, and NaiveBayes; y-axis: Total Error Rate (TER), 0 to 20%).

Table 3: The statistical analysis of the number of FAR and FRR cases in the new test case study.

| Algorithm | Number of FAR cases | Number of FRR cases | Total number of instances |
|---|---|---|---|
| NaiveBayes | 5 | 6 | 100 |
| BayesNet | 4 | 2 | 100 |
| IB1 | 2 | 1 | 100 |
| J48 | 3 | 2 | 100 |
| Regression | 1 | 0 | 100 |
| SVM | 3 | 2 | 100 |
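
As a worked illustration of how the Table 3 counts relate to the TER, the following Python sketch divides the sum of the FAR and FRR cases by the total number of instances for each algorithm. It assumes that the TER is computed from the case counts in exactly this way; the dictionary and function names are hypothetical and are not part of the authors' tooling.

```python
# FAR cases, FRR cases, and total instances per algorithm, copied from Table 3.
table3 = {
    "NaiveBayes": (5, 6, 100),
    "BayesNet": (4, 2, 100),
    "IB1": (2, 1, 100),
    "J48": (3, 2, 100),
    "Regression": (1, 0, 100),
    "SVM": (3, 2, 100),
}


def total_error_rate(far_cases, frr_cases, total):
    """Return the TER as a percentage (assumed: error cases / all cases)."""
    if total <= 0:
        raise ValueError("total number of instances must be positive")
    return 100.0 * (far_cases + frr_cases) / total


ter_percent = {name: total_error_rate(*counts) for name, counts in table3.items()}
best = min(ter_percent, key=ter_percent.get)  # algorithm with the lowest TER

for name, ter in ter_percent.items():
    print(f"{name}: {ter:.1f}%")
print("lowest TER:", best)
```

Under this assumption the regression method has the lowest TER of the six algorithms, which agrees with the discussion of Table 3.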

8 Journal of Computer Networks and Communications

5. Conclusion and Future Work

In this paper, we proposed a new data mining approach based on classification methodologies for detecting malware behavior. First, a malware behavior executive history XML file is converted to a nonsparse matrix using our suggested application. This matrix is then translated to a WEKA input data set. To illustrate the performance efficiency, we applied the proposed approaches to a real case study data set using the WEKA tool. Training was carried out with classification algorithms such as NaiveBayes, BayesNet, IB1, J48, and regression algorithms. The regression classification method had the best performance for malware detection. We also analyzed the new data set with the regression classification method. The evaluation results demonstrated the applicability of the proposed data mining approach. Our proposed data mining mechanism is also more efficient for detecting malware. According to the experimental results, classification of malware behavioral features can be a convenient method for developing a behavioral antivirus. In future work, we will try to develop and analyze a real behavioral antivirus platform based on the classification via regression algorithm.

Competing Interests

The authors declare that they have no competing interests.

References

[1] D. B. Ekta Gandotra and S. Sofat, “Malware analysis and classification: a survey,” Journal of Information Security, vol. 5, pp. 56–64, 2014.

[2] P. Wang and Y.-S. Wang, “Malware behavioural detection and vaccine development by using a support vector model classifier,” Journal of Computer and System Sciences, vol. 81, no. 6, pp. 1012–1026, 2015.

[3] G. Ollmann, “The evolution of commercial malware development kits and colour-by-numbers custom malware,” Computer Fraud and Security, vol. 2008, no. 9, pp. 4–7, 2008.

[4] M. Ghiasi, A. Sami, and Z. Salehi, “Dynamic VSA: a framework for malware detection based on register contents,” Engineering Applications of Artificial Intelligence, vol. 44, pp. 111–122, 2015.

[5] D. Bruschi, L. Martignoni, and M. Monga, “Detecting self-mutating malware using control-flow graph matching,” in Detection of Intrusions and Malware & Vulnerability Assessment, R. Buschkes and P. Laskov, Eds., vol. 4064, pp. 129–143, Springer, Berlin, Germany, 2006.

[6] M. R. Chouchane and A. Lakhotia, “Using engine signature to detect metamorphic malware,” in Proceedings of the 4th ACM Workshop on Recurring Malcode, Alexandria, Va, USA, 2006.

[7] N. Kuzurin, A. Shokurov, N. Varnovsky, and V. Zakharov, “On the concept of software obfuscation in computer security,” in Information Security, J. Garay, A. Lenstra, M. Mambo, and R. Peralta, Eds., vol. 4779, pp. 281–298, Springer, Berlin, Germany, 2007.

[8] M. Christodorescu and S. Jha, “Testing malware detectors,” SIGSOFT Software Engineering Notes, vol. 29, no. 4, pp. 34–44, 2004.

[9] L. K. Mehedy Masud and B. Thuraisingham, Data Mining Tools for Malware Detection, vol. 1, CRC Press, 2012.

[10] M. Egele, T. Scholte, E. Kirda, and C. Kruegel, “A survey on automated dynamic malware-analysis techniques and tools,” ACM Computing Surveys, vol. 44, pp. 1–42, 2008.

[11] S. P. Monire Norouzi and A. Mahjur, “A new approach for formal behavioral modeling of protection services in antivirus systems,” International Journal in Foundations of Computer Science & Technology, vol. 4, pp. 57–67, 2014.

[12] A. Safarkhanlou, A. Souri, M. Norouzi, and S. E. H. Sardroud, “Formalizing and verification of an antivirus protection service using model checking,” Procedia Computer Science, vol. 57, pp. 1324–1331, 2015.

[13] I. Santos, F. Brezo, X. Ugarte-Pedrero, and P. G. Bringas, “Opcode sequences as representation of executables for data-mining-based unknown malware detection,” Information Sciences, vol. 231, pp. 64–82, 2013.

[14] N. Abdelhamid, A. Ayesh, and F. Thabtah, “Phishing detection based associative classification data mining,” Expert Systems with Applications, vol. 41, no. 13, pp. 5948–5959, 2014.

[15] G. Jacob, H. Debar, and E. Filiol, “Behavioral detection of malware: from a survey towards an established taxonomy,” Journal in Computer Virology, vol. 4, no. 3, pp. 251–266, 2008.

[16] A. Shabtai, R. Moskovitch, Y. Elovici, and C. Glezer, “Detection of malicious code by applying machine learning classifiers on static features: a state-of-the-art survey,” Information Security Technical Report, vol. 14, no. 1, pp. 16–29, 2009.

[17] S. B. Kotsiantis, “Supervised machine learning: a review of classification techniques,” in Proceedings of the Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, pp. 3–24, IOS Press, 2007.

[18] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: a review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.

[19] L. Chen, Q. Jiang, and S. Wang, “Model-based method for projective clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 7, pp. 1291–1305, 2012.

[20] C. Ravi and R. Manoharan, “Malware detection using Windows Api sequence and machine learning,” International Journal of Computer Applications, vol. 43, no. 17, pp. 12–16, 2012.

[21] R. Rizwan, G. C. Hazarika, and G. Chetia, “Malware threats and mitigation strategies: a survey,” Journal of Theoretical and Applied Information Technology, vol. 29, no. 2, pp. 69–73, 2011.

[22] K. Mathur and H. Saroj, “A survey on techniques in detection and analyzing malware executables,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 44, no. 2, 2012.

[23] N. F. Doherty, L. Anastasakis, and H. Fulford, “The information security policy unpacked: a critical study of the content of university policies,” International Journal of Information Management, vol. 29, no. 6, pp. 449–457, 2009.

[24] G. Tahan, L. Rokach, and Y. Shahar, “Automatic malware detection using common segment analysis and meta-features,” Journal of Machine Learning Research, vol. 13, pp. 949–979, 2012.

[25] M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario, “Automated classification and analysis of internet malware,” in Recent Advances in Intrusion Detection, C. Kruegel, R. Lippmann, and A. Clark, Eds., vol. 4637, pp. 178–197, Springer, Berlin, Germany, 2007.

[26] U. Bayer, A. Moser, C. Kruegel, and E. Kirda, “Dynamic analysis of malicious code,” Journal in Computer Virology, vol. 2, no. 1, pp. 67–77, 2006.

[27] J. Z. Kolter and M. A. Maloof, “Learning to detect and classify malicious executables in the wild,” Journal of Machine Learning Research, vol. 7, pp. 2721–2744, 2006.

[28] P. Trinius, C. Willems, T. Holz, and K. Rieck, A Malware Instruction Set for Behavior-Based Analysis, 2009.

[29] R. Moskovitch and Y. Shahar, “Classification-driven temporal discretization of multivariate time series,” Data Mining and Knowledge Discovery, vol. 29, no. 4, pp. 871–913, 2015.

[30] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, “Data mining methods for detection of new malicious executables,” in Proceedings of the IEEE Symposium on Security and Privacy (S&P), pp. 38–49, Oakland, Calif, USA, 2001.

[31] J. Z. Kolter and M. A. Maloof, “Learning to detect malicious executables in the wild,” in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’04), pp. 470–478, ACM, Seattle, Wash, USA, August 2004.

[32] M. Siddiqui, M. C. Wang, and J. Lee, “Detecting internet worms using data mining techniques,” Journal of Systemics, Cybernetics and Informatics, vol. 6, pp. 48–53, 2008.

[33] H. Yang and Y.-P. P. Chen, “Data mining in lung cancer pathologic staging diagnosis: correlation between clinical and pathology information,” Expert Systems with Applications, vol. 42, no. 15-16, pp. 6168–6176, 2015.

[34] R. Campagni, D. Merlini, R. Sprugnoli, and M. C. Verri, “Data mining models for student careers,” Expert Systems with Applications, vol. 42, no. 13, pp. 5508–5521, 2015.

[35] R. M. Rahman and F. R. Md. Hasan, “Using and comparing different decision tree classification techniques for mining ICDDR,B Hospital Surveillance data,” Expert Systems with Applications, vol. 38, no. 9, pp. 11421–11436, 2011.

[36] S. Ghosh, S. Biswas, D. Sarkar, and P. P. Sarkar, “A novel Neuro-fuzzy classification technique for data mining,” Egyptian Informatics Journal, vol. 15, no. 3, pp. 129–147, 2014.

[37] R. Moskovitch and Y. Shahar, “Fast time intervals mining using the transitivity of temporal relations,” Knowledge and Information Systems, vol. 42, no. 1, pp. 21–48, 2015.

[38] D. Stopel, Z. Boger, R. Moskovitch, Y. Shahar, and Y. Elovici, “Application of artificial neural networks techniques to computer worm detection,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN ’06), pp. 2362–2369, July 2006.

[39] D. Stopel, R. Moskovitch, Z. Boger, Y. Shahar, and Y. Elovici, “Using artificial neural networks to detect unknown computer worms,” Neural Computing and Applications, vol. 18, no. 7, pp. 663–674, 2009.

[40] R. Moskovitch, I. Gus, S. Pluderman et al., “Detection of unknown computer worms activity based on computer behavior using data mining,” in Proceedings of the 1st IEEE Symposium on Computational Intelligence and Data Mining (CIDM ’07), pp. 202–209, IEEE, Honolulu, Hawaii, USA, April 2007.

[41] N. Nissim, R. Moskovitch, L. Rokach, and Y. Elovici, “Detecting unknown computer worm activity via support vector machines and active learning,” Pattern Analysis and Applications, vol. 15, no. 4, pp. 459–475, 2012.

[42] N. Karthik, R. Arul, and M. J. H. Prasad, “Modeling of wind turbine power curves using firefly algorithm,” in Power Electronics and Renewable Energy Systems, C. Kamalakannan, L. P. Suresh, S. S. Dash, and B. K. Panigrahi, Eds., vol. 326, pp. 1407–1414, Springer, New Delhi, India, 2015.

[43] F. Galton, Finger Prints, Macmillan and Company, 1892.

[44] B. D. Eugenio and M. Glass, “The kappa statistic: a second look,” Computational Linguistics, vol. 30, no. 1, pp. 95–101, 2004.

[45] M. N. Mohammad, N. Sulaiman, and O. A. Muhsin, “A novel intrusion detection system by using intelligent data mining in weka environment,” Procedia Computer Science, vol. 3, pp. 1237–1242, 2011.

[46] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2002.

[47] M. Deshmukh and M. N. K. Prasad, “Partial segmentation and matching technique for iris recognition,” in Computational Intelligence in Data Mining - Volume 1, L. C. Jain, H. S. Behera, J. K. Mandal, and D. P. Mohapatra, Eds., vol. 31, pp. 77–86, Springer India, 2015.

[48] I. Rodríguez-Fdez, A. Canosa, M. Mucientes, and A. Bugarín, “STAC: a web platform for the comparison of algorithms using statistical tests,” in Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 1–8, Istanbul, Turkey, August 2015.

@RELATION TEST (file name)
@ATTRIBUTE dll1 numeric (property)
@ATTRIBUTE dll2 numeric (property)
@ATTRIBUTE dll3 numeric (property)
@ATTRIBUTE dll4 numeric (property)
... (property)
@ATTRIBUTE param88 numeric (property)
@ATTRIBUTE class Answer - (property)
@DATA
0 1068 2 0534 8 0534 11 0534 12 0534 23 0534 32 0534 33 0534 35

Box 2: An example of standard form for WEKA input.

systems and preventing malware spread [28, 38–41]. The XML file includes useful information such as system library file calls; creating, searching, and changing files; modifying the registry; main process information; creating a mutex (a mutex is an application object that permits multiple program threads to share the same resource); modifying virtual memory; sending email; registry operations; and switches communications. Using the suggested program, all of this information is read and saved as a nonsparse matrix.

The matrix is then converted to the standard WEKA input form, an .arff file, for data set 1 and data set 2. This standard form is shown in Box 2.
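
As a rough illustration of this conversion step, the following Python sketch writes a dense (nonsparse) feature matrix as ARFF text. It is not the authors' program: the `to_arff` helper, the two attribute names, and the class labels `worm` and `trojan` are hypothetical, and only the `@RELATION` / `@ATTRIBUTE` / `@DATA` layout follows the ARFF format.

```python
def to_arff(relation, attribute_names, class_labels, rows):
    """Render a dense feature matrix as ARFF text.

    rows is a list of (feature_values, class_label) pairs; every
    feature_values list must have one entry per attribute name.
    """
    lines = [f"@RELATION {relation}", ""]
    for name in attribute_names:
        lines.append(f"@ATTRIBUTE {name} numeric")
    # Nominal class attribute that lists the allowed labels.
    lines.append("@ATTRIBUTE class {" + ",".join(class_labels) + "}")
    lines += ["", "@DATA"]
    for values, label in rows:
        if len(values) != len(attribute_names):
            raise ValueError("row length does not match attribute count")
        if label not in class_labels:
            raise ValueError(f"unknown class label: {label}")
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"


# Hypothetical example: two behavioral features and two class labels.
arff_text = to_arff(
    "TEST",
    ["dll1", "dll2"],
    ["worm", "trojan"],
    [([1, 0], "worm"), ([0, 2], "trojan")],
)
print(arff_text)
```

Each behavioral feature becomes one numeric attribute and each analyzed program becomes one data row, so the resulting text can be saved with an .arff extension.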

3.1. Classification and Prediction Approaches. This section describes the classification methods in two real case studies, data set 1 and data set 2. First, we analyze the data mining results on data set 1 and data set 2 obtained with the WEKA classification algorithms. To specify the performance of the classification methods in WEKA, we briefly describe some effective features [27]. The Correctly Classified Instances (CCI) give the percentage of test cases that were correctly classified, and the Incorrectly Classified Instances (ICI) give the percentage of test cases that were incorrectly classified.

The relative absolute error (RAE) is measured relative to the error of a simple predictor that always predicts the average of the actual values. In the RAE, the error is the total absolute error rather than the total squared error.

Definition 1. A relative absolute error is a 3-tuple $\mathrm{RAE}_i = (F_{(ij)}, V_j, \bar{V})$ in formula (1), where $F_{(ij)}$ is the value predicted by the individual program $i$ for sample case $j$ (out of $k$ sample cases), $V_j$ is the objective value for sample case $j$, and $\bar{V}$ is given by formula (2):

$$\mathrm{RAE}_i = \frac{\sum_{j=1}^{k} \left| F_{(ij)} - V_j \right|}{\sum_{j=1}^{k} \left| V_j - \bar{V} \right|}, \tag{1}$$

$$\bar{V} = \frac{1}{k} \sum_{j=1}^{k} V_j. \tag{2}$$
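
Formulas (1) and (2) can be sketched in a few lines of Python. The function name and the sample values below are made up for illustration; WEKA reports this quantity itself, so the sketch only shows how the numerator and denominator are formed.

```python
def relative_absolute_error(predicted, actual):
    """RAE: total absolute error divided by the total absolute error of
    a predictor that always outputs the mean of the actual values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)  # formula (2)
    numerator = sum(abs(f - v) for f, v in zip(predicted, actual))
    denominator = sum(abs(v - mean_actual) for v in actual)
    if denominator == 0:
        raise ValueError("RAE is undefined when all actual values are equal")
    return numerator / denominator  # formula (1)


# Made-up values for illustration only.
actual = [1.0, 0.0, 1.0, 0.0]
predicted = [0.5, 0.0, 1.0, 0.5]
rae = relative_absolute_error(predicted, actual)
print(rae)
```

An RAE below 1 means the predictions have a smaller total absolute error than the mean predictor; WEKA prints the same ratio as a percentage.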

The mean absolute error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction. It measures accuracy for continuous variables in the prediction procedure. The MAE is the average of the absolute differences between each forecast and the corresponding observation. It is a linear score, which means that all the individual differences are weighted equally in the average [42–44].

Definition 2. A mean absolute error is a 2-tuple $\mathrm{MAE}_i = (P_j, T_j)$ in formula (3), where $P_j$ is the predicted value and $T_j$ is the true value for sample case $j$. This feature specifies the average error in the classification procedure:

$$\mathrm{MAE}_i = \frac{1}{k} \sum_{j=1}^{k} \left| P_j - T_j \right|. \tag{3}$$
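
A minimal Python sketch of formula (3) follows, again with a hypothetical function name and made-up values. The second call illustrates the linear-score property: doubling every individual error doubles the MAE.

```python
def mean_absolute_error(predicted, actual):
    """MAE: average of the absolute differences |P_j - T_j| over k cases."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, actual)) / len(actual)


# Made-up values for illustration only.
true_values = [1.0, 0.0, 1.0, 0.0]
mae = mean_absolute_error([0.5, 0.0, 1.0, 0.5], true_values)

# Doubling each individual error (0.5 -> 1.0) doubles the MAE.
mae_doubled = mean_absolute_error([0.0, 0.0, 1.0, 1.0], true_values)
print(mae, mae_doubled)
```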

We can also measure classifier proficiency using the true optimistic ratio (TOR) in (4), where NC is the number of correctly detected malware programs and NI is the number of incorrectly detected malware programs. The TOR gives the cost of the estimated classification, which is significant for setting the cost of malware classification [45]:

$$\mathrm{TOR} = \frac{\mathrm{NC}}{\mathrm{NC} + \mathrm{NI}}. \tag{4}$$
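
Formula (4) amounts to a single division, as the Python sketch below shows. The example uses the 88 detected programs out of 100 reported in the experiments; treating the remaining 12 programs as NI is an assumption made here for illustration, and the function name is hypothetical.

```python
def true_optimistic_ratio(num_correct, num_incorrect):
    """TOR = NC / (NC + NI), as in formula (4)."""
    total = num_correct + num_incorrect
    if total <= 0:
        raise ValueError("at least one detection result is required")
    return num_correct / total


# 88 of 100 programs detected; NI = 12 is assumed for this example.
tor = true_optimistic_ratio(88, 12)
print(f"TOR = {100 * tor:.0f}%")
```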

There are also two error rates for measuring classification performance. The False Acceptance Rate (FAR) is the ratio of the number of test cases that are incorrectly accepted by a given model to the total number of cases; that is, it shows the percentage of invalid inputs that are incorrectly accepted. The False Rejection Rate (FRR) is the ratio of the number of test cases that are incorrectly rejected by a given model to the total number of cases; that is, it shows the percentage of valid inputs that are incorrectly rejected [46]. Using these factors, we can calculate the Total Error Rate (TER) as follows [47]:

$$\mathrm{TER} = \frac{\mathrm{FAR} + \mathrm{FRR}}{\mathrm{NC} + \mathrm{NI}}. \tag{5}$$
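
The following Python sketch shows one way these rates could be derived from per-case results. It assumes that each test case is marked as valid or invalid and as accepted or rejected, and that FAR and FRR in (5) stand for the numbers of incorrectly accepted and incorrectly rejected cases, so that dividing by the total number of cases gives the rate. The function name and the sample data are hypothetical.

```python
def error_rates(is_valid, is_accepted):
    """Return FAR, FRR, and TER as fractions of all test cases."""
    if len(is_valid) != len(is_accepted) or not is_valid:
        raise ValueError("inputs must be non-empty and of equal length")
    total = len(is_valid)
    pairs = list(zip(is_valid, is_accepted))
    # Invalid inputs that the model accepted.
    false_accepts = sum(1 for valid, accepted in pairs if not valid and accepted)
    # Valid inputs that the model rejected.
    false_rejects = sum(1 for valid, accepted in pairs if valid and not accepted)
    return {
        "FAR": false_accepts / total,
        "FRR": false_rejects / total,
        "TER": (false_accepts + false_rejects) / total,
    }


# Made-up results for five test cases.
is_valid = [True, True, True, False, False]
is_accepted = [True, False, True, True, False]
rates = error_rates(is_valid, is_accepted)
print(rates)
```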

In the classification process, we use the NaiveBayes, BayesNet, IB1, J48, and classification via regression algorithms. NaiveBayes and BayesNet are probabilistic learning algorithms based on a supervised learning method, which require a small amount of training data to estimate the parameters. The IB1 data mining algorithm is based on lazy approaches, and the J48 data mining algorithm is based on decision tree methods. Finally, the classification via regression algorithm is based on the meta approach, which is a newer approach among data mining methods. In other words, regression analysis is a statistical method used for data analysis. Regression is usually applied together with correlation analysis, which evaluates the degree of association between two quantitative data sets [37]. For example, Figure 3 shows the classification result of the NaiveBayes algorithm in the WEKA tool. The following section describes the experimental results of the classification algorithms in WEKA. Some effective features, such as Correctly Classified Instances, Incorrectly Classified Instances, mean absolute error, and relative absolute error, are compared with each other in order to find the best classification algorithm for developing a behavioral antivirus.

Table 1: The statistical analysis of data set 1 for specified classification methods.

| Algorithm | Correctly Classified Instances (number) | Correctly Classified Instances (%) | Incorrectly Classified Instances (number) | Incorrectly Classified Instances (%) | Mean absolute error | Relative absolute error (%) | Kappa statistic | Root mean squared error | Root relative squared error (%) | Total number of instances |
|---|---|---|---|---|---|---|---|---|---|---|
| NaiveBayes | 1107 | 27.5099 | 2917 | 72.4901 | 0.0069 | 90.0871 | 0.2526 | 0.0754 | 122.8107 | 4024 |
| BayesNet | 2662 | 66.1531 | 1362 | 33.8469 | 0.0032 | 42.4047 | 0.5979 | 0.0479 | 78.1282 | 4024 |
| IB1 | 2802 | 69.6322 | 1222 | 30.3678 | 0.0028 | 37.2325 | 0.6199 | 0.0533 | 86.8274 | 4024 |
| J48 | 2908 | 72.2664 | 1116 | 27.7336 | 0.0032 | 41.6312 | 0.6379 | 0.0454 | 73.9957 | 4024 |
| Regression | 3051 | 75.8201 | 973 | 24.1799 | 0.0011 | 21.0201 | 0.6859 | 0.0392 | 63.9686 | 4024 |
| SVM | 2251 | 64.1571 | 1773 | 35.8429 | 0.0039 | 42.0019 | 0.5743 | 0.4758 | 84.9596 | 4024 |
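
The lazy approach behind IB1 can be sketched as a one-nearest-neighbor lookup: no model is built in advance, and a new sample simply receives the label of the closest training sample. The Python sketch below uses plain Euclidean distance and made-up feature vectors and labels; WEKA's implementation may differ in details such as attribute normalization, so this is an illustration of the idea rather than of WEKA itself.

```python
import math


def predict_1nn(training_set, query):
    """Return the label of the training vector closest to the query.

    training_set is a list of (feature_vector, label) pairs.
    """
    if not training_set:
        raise ValueError("training set must not be empty")
    _, nearest_label = min(
        training_set, key=lambda item: math.dist(item[0], query)
    )
    return nearest_label


# Made-up behavioral feature vectors with hypothetical class labels.
training_set = [
    ([1.0, 0.0, 0.0], "worm"),
    ([0.0, 1.0, 1.0], "trojan"),
]
label = predict_1nn(training_set, [0.9, 0.1, 0.0])
print(label)
```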

4. Experimental Results and Discussion

In this section, we implemented our approach using the WEKA tool. We used a system with an Intel Core i3 2.13 GHz CPU and 4 GB of RAM for the classification methods. The analysis was carried out with classification algorithms such as NaiveBayes, BayesNet, IB1, J48, Support Vector Machine (SVM), and the logistic regression method. We compared the performance of the classification methods on two malware data sets.

Tables 1 and 2 give the statistical analysis of data sets 1 and 2 for the proposed classification methods. The compared factors are Correctly Classified Instances, Incorrectly Classified Instances, Kappa statistic, mean absolute error, relative absolute error, root mean squared error, and root relative squared error. This comparison shows that the classification via regression method has the best performance in malware detection. For example, in data set 1, 3051 of the 4024 malware programs are correctly classified, and in data set 2, 3069 of the 3131 malware programs are correctly classified.
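
The percentage columns of the tables follow directly from the instance counts. As a small worked example, the Python sketch below recomputes the Correctly and Incorrectly Classified Instances percentages for the regression method on data set 1 from the counts quoted above (3051 of 4024); the function name is hypothetical.

```python
def classified_percentages(num_correct, total):
    """Return (CCI %, ICI %) for a given number of correct classifications."""
    if total <= 0 or not 0 <= num_correct <= total:
        raise ValueError("counts must satisfy 0 <= num_correct <= total, total > 0")
    cci = 100.0 * num_correct / total
    return cci, 100.0 - cci


# Regression method on data set 1: 3051 of 4024 instances correct.
cci, ici = classified_percentages(3051, 4024)
print(f"{cci:.4f}% correct, {ici:.4f}% incorrect")
```

Rounded to four decimals, the two values agree with the regression row of Table 1.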

Figure 3: The snapshot of the NaiveBayes classification algorithm in WEKA.

According to Tables 1 and 2, the percentage of Correctly Classified Instances of the logistic regression algorithm is higher than that of the other classification methods in both data sets 1 and 2. Likewise, its percentage of Incorrectly Classified Instances is lower than that of the other classification methods in both data sets.

After the data mining process, we test a new malware case with the regression classification algorithm. We downloaded 100 binary malware programs from NetLux (http://vxheaven.org), analyzed their behaviors using the CW-Sandbox tool, and obtained their XML files [38]. We then add these 100 malware programs to the new data set and compute the quality of their classification as the true optimistic ratio. As expected, classification via regression detects 88 malware programs. We can therefore use classification via regression to develop a behavioral antivirus.

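Table 2: The statistical analysis of data set 2 for specified classification methods.

| Algorithm | Correctly Classified Instances (number) | Correctly Classified Instances (%) | Incorrectly Classified Instances (number) | Incorrectly Classified Instances (%) | Mean absolute error | Relative absolute error (%) | Kappa statistic | Root mean squared error | Root relative squared error (%) | Total number of instances |
|---|---|---|---|---|---|---|---|---|---|---|
| NaiveBayes | 2678 | 85.5318 | 453 | 14.4682 | 0.012 | 15.3329 | 0.8459 | 0.1026 | 51.8792 | 3131 |
| BayesNet | 2874 | 91.7918 | 257 | 8.2082 | 0.0073 | 9.3575 | 0.9127 | 0.0747 | 37.7504 | 3131 |
| IB1 | 3028 | 96.7103 | 103 | 3.2897 | 0.0027 | 3.5032 | 0.965 | 0.0524 | 26.472 | 3131 |
| J48 | 3008 | 96.0715 | 123 | 3.9285 | 0.0043 | 5.5353 | 0.9581 | 0.0527 | 26.652 | 3131 |
| Regression | 3069 | 98.321 | 62 | 1.679 | 0.0021 | 2.2102 | 0.9578 | 0.0543 | 27.4333 | 3131 |
| SVM | 1698 | 54.2319 | 1433 | 45.7681 | 0.0046 | 57.993 | 0.5011 | 0.1942 | 98.1954 | 3131 |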


[48] I Rodrıguez-Fdez A Canosa M Mucientes and A BugarınldquoSTAC a web platform for the comparison of algorithmsusing statistical testsrdquo in Proceedings of the IEEE InternationalConference on Fuzzy Systems pp 1ndash8 Istanbul Turkey August2015



Table 1: The statistical analysis of data set 1 for specified classification methods.

| Algorithm | Correctly classified instances, number (%) | Incorrectly classified instances, number (%) | Mean absolute error | Relative absolute error (%) | Kappa statistic | Root mean squared error | Root relative squared error (%) | Total number of instances |
|---|---|---|---|---|---|---|---|---|
| NaiveBayes | 1107 (27.5099) | 2917 (72.4901) | 0.0069 | 90.0871 | 0.2526 | 0.0754 | 122.8107 | 4024 |
| BayesNet | 2662 (66.1531) | 1362 (33.8469) | 0.0032 | 42.4047 | 0.5979 | 0.0479 | 78.1282 | 4024 |
| IB1 | 2802 (69.6322) | 1222 (30.3678) | 0.0028 | 37.2325 | 0.6199 | 0.0533 | 86.8274 | 4024 |
| J48 | 2908 (72.2664) | 1116 (27.7336) | 0.0032 | 41.6312 | 0.6379 | 0.0454 | 73.9957 | 4024 |
| Regression | 3051 (75.8201) | 973 (24.1799) | 0.0011 | 21.0201 | 0.6859 | 0.0392 | 63.9686 | 4024 |
| SVM | 2251 (64.1571) | 1773 (35.8429) | 0.0039 | 42.0019 | 0.5743 | 0.4758 | 84.9596 | 4024 |

Also, the J48 data mining algorithm is based on decision tree methods. Finally, the classification via regression algorithm is based on the meta approach, a newer approach among data mining methods. Regression analysis is a statistical method for data analysis and is usually applied together with correlation analysis, which evaluates the degree of association between two quantitative data sets [37]. For example, Figure 3 shows the classification result of the NaiveBayes algorithm in the WEKA tool. The following section describes the experimental results of the classification algorithms in WEKA. Measures such as Correctly Classified Instances, Incorrectly Classified Instances, mean absolute error, and relative absolute error are compared in order to find the best classification algorithm for developing a behavioral antivirus.
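The meta approach behind classification via regression can be illustrated with a small Python sketch. It assumes the usual formulation in which one regression model is fitted per class on a 0/1 indicator target and the class with the largest regression output is predicted. Ordinary least squares is used here only as a stand-in base learner, and the feature values, the class labels "worm" and "trojan", and all function names are hypothetical; none of them come from the data sets of this paper.

```python
from typing import Dict, List, Sequence


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_regression(rows: Sequence[Sequence[float]], targets: Sequence[float],
                   ridge: float = 1e-9) -> List[float]:
    """Least-squares weights (bias first); the tiny ridge term keeps the system solvable."""
    design = [[1.0] + list(r) for r in rows]
    d = len(design[0])
    xtx = [[sum(x[i] * x[j] for x in design) for j in range(d)] for i in range(d)]
    for i in range(d):
        xtx[i][i] += ridge
    xty = [sum(x[i] * t for x, t in zip(design, targets)) for i in range(d)]
    return solve(xtx, xty)


def train(rows: Sequence[Sequence[float]], labels: Sequence[str]) -> Dict[str, List[float]]:
    """Fit one regression model per class on a 0/1 indicator of that class."""
    return {
        cls: fit_regression(rows, [1.0 if y == cls else 0.0 for y in labels])
        for cls in sorted(set(labels))
    }


def predict(models: Dict[str, List[float]], row: Sequence[float]) -> str:
    """Return the class whose regression model gives the largest output."""
    features = [1.0] + list(row)
    scores = {cls: sum(w * f for w, f in zip(ws, features)) for cls, ws in models.items()}
    return max(scores, key=scores.get)


# Hypothetical behavior counts per program: [network actions, registry actions].
ROWS = [[5, 0], [4, 1], [6, 1], [0, 5], [1, 4], [1, 6]]
LABELS = ["worm", "worm", "worm", "trojan", "trojan", "trojan"]

models = train(ROWS, LABELS)
print(predict(models, [5, 1]))  # worm
print(predict(models, [0, 5]))  # trojan
```

In WEKA the base regression learner of the meta classifier is configurable, so the least-squares fit above should be read as one possible choice rather than the configuration used in the experiments.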

4. Experimental Results and Discussion

In this section, we implement our approach using the WEKA tool. The classification methods were run on a system with an Intel Core i3 2.13 GHz CPU and 4 GB of RAM. The analysis uses the classification algorithms NaiveBayes, BayesNet, IB1, J48, Support Vector Machine (SVM), and the logistic regression method. We compared the performance of the classification methods on two malware data sets.

Tables 1 and 2 give the statistical analysis of data sets 1 and 2 for the proposed classification methods. The compared factors are Correctly Classified Instances, Incorrectly Classified Instances, Kappa statistic, mean absolute error, relative absolute error, root mean squared error, and root relative squared error. This comparison shows that the classification via regression method has the best performance in malware detection. For example, in data set 1 it correctly classifies 3051 of the 4024 malware programs, and in data set 2 it correctly classifies 3069 of the 3131 malware programs.
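Two of the compared factors can be reproduced by hand. The Python sketch below computes the percentage of correctly classified instances from the counts quoted for data set 1, and the Kappa statistic from a confusion matrix using the standard definition (observed agreement corrected for chance agreement). The function names and the small confusion matrix are illustrative assumptions; the confusion matrices of the experiments are not given in the text.

```python
from typing import Sequence


def percent_correct(correct: int, total: int) -> float:
    """Percentage of correctly classified instances."""
    return 100.0 * correct / total


def kappa_statistic(confusion: Sequence[Sequence[int]]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / total
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)  # row total * column total
        for i in range(k)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Regression on data set 1: 3051 of 4024 instances correctly classified.
print(round(percent_correct(3051, 4024), 4))  # 75.8201

# Toy 2-class confusion matrix (rows: actual class, columns: predicted class).
TOY_CONFUSION = [[20, 5], [10, 15]]
print(round(kappa_statistic(TOY_CONFUSION), 4))  # 0.4
```

A Kappa value near 1 indicates agreement well beyond chance, which is why it is reported alongside the raw percentage of correct classifications.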

Figure 3: The snapshot of the NaiveBayes classification algorithm in WEKA.

According to Tables 1 and 2, the percentage of Correctly Classified Instances of the logistic regression algorithm is higher than that of the other classification methods in both data sets 1 and 2. Likewise, its percentage of Incorrectly Classified Instances is lower than that of the other classification methods in both data sets.

After the data mining process, we test a new malware case with the regression classification algorithm. We downloaded 100 binary malware programs from NetLux (http://vxheaven.org), analyzed their behaviors with the CWSandbox tool, and obtained the resulting XML files [38]. We then add these 100 malware programs to the new data set and compute the quality of their classification as the true optimistic ratio. As expected, classification via regression detects 88 of the malware programs. We can therefore use classification via regression to develop a behavioral antivirus.


Table 2: The statistical analysis of data set 2 for specified classification methods.

| Algorithm | Correctly classified instances, number (%) | Incorrectly classified instances, number (%) | Mean absolute error | Relative absolute error (%) | Kappa statistic | Root mean squared error | Root relative squared error (%) | Total number of instances |
|---|---|---|---|---|---|---|---|---|
| NaiveBayes | 2678 (85.5318) | 453 (14.4682) | 0.012 | 15.3329 | 0.8459 | 0.1026 | 51.8792 | 3131 |
| BayesNet | 2874 (91.7918) | 257 (8.2082) | 0.0073 | 9.3575 | 0.9127 | 0.0747 | 37.7504 | 3131 |
| IB1 | 3028 (96.7103) | 103 (3.2897) | 0.0027 | 3.5032 | 0.965 | 0.0524 | 26.472 | 3131 |
| J48 | 3008 (96.0715) | 123 (3.9285) | 0.0043 | 5.5353 | 0.9581 | 0.0527 | 26.652 | 3131 |
| Regression | 3069 (98.321) | 62 (1.679) | 0.0021 | 2.2102 | 0.9578 | 0.0543 | 27.4333 | 3131 |
| SVM | 1698 (54.2319) | 1433 (45.7681) | 0.0046 | 5.7993 | 0.5011 | 0.1942 | 98.1954 | 3131 |

Figure 4: The true optimistic ratio for the classifications test in the new data set. (Bar chart; x-axis: classification algorithms BayesNet, Regression, J48, SVM, NaiveBayes, and IB1; y-axis: true optimistic ratio (TOR), 0 to 100%.)

Figure 4 depicts the true optimistic ratio percentage for malware detection in the new data set. The true optimistic ratio of the regression method is higher than that of the other classification methods in the new data set.

After testing our new case study with 100 malware programs, Table 3 reports the number of False Acceptance Rate (FAR) cases and the number of False Rejection Rate (FRR) cases. There are platforms such as STAC (http://tec.citius.usc.es/stac) [48] for statistical comparison of the tested algorithms, but we use the WEKA tool for the statistical and experimental results on our data sets.

According to Table 3, no valid input is incorrectly rejected by the regression method, whereas the NaiveBayes method incorrectly rejected 6 valid inputs.

Also, in this test case the regression method produces one FAR case, that is, one input incorrectly accepted as malware. Figure 5 shows the Total Error Rate (TER) for our new test case using our approach by the regression method.

Figure 5: The Total Error Rate (TER) for the classifications test in the new data set. (Bar chart; x-axis: classification algorithms IB1, BayesNet, J48, Regression, SVM, and NaiveBayes; y-axis: Total Error Rate (TER), 0 to 20%.)

Table 3: The statistical analysis of the FAR and FRR number of cases in the new test case study.

| Algorithm | Number of FAR cases | Number of FRR cases | Total number of instances |
|---|---|---|---|
| NaiveBayes | 5 | 6 | 100 |
| BayesNet | 4 | 2 | 100 |
| IB1 | 2 | 1 | 100 |
| J48 | 3 | 2 | 100 |
| Regression | 1 | 0 | 100 |
| SVM | 3 | 2 | 100 |
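The counts in Table 3 can be turned into a Total Error Rate with a few lines of Python. The sketch below assumes that TER is the number of FAR cases plus the number of FRR cases, divided by the total number of instances. The text does not state the formula, so this definition, as well as the names TABLE_3 and total_error_rate, should be read as illustrative assumptions.

```python
from typing import Dict, Tuple

# (FAR cases, FRR cases, total instances) per algorithm, copied from Table 3.
TABLE_3: Dict[str, Tuple[int, int, int]] = {
    "NaiveBayes": (5, 6, 100),
    "BayesNet": (4, 2, 100),
    "IB1": (2, 1, 100),
    "J48": (3, 2, 100),
    "Regression": (1, 0, 100),
    "SVM": (3, 2, 100),
}


def total_error_rate(far_cases: int, frr_cases: int, total: int) -> float:
    """Assumed definition: percentage of instances that are either FAR or FRR cases."""
    return 100.0 * (far_cases + frr_cases) / total


ter = {name: total_error_rate(*counts) for name, counts in TABLE_3.items()}
best = min(ter, key=ter.get)  # algorithm with the lowest assumed TER

for name, value in sorted(ter.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.1f}%")
print("lowest TER:", best)  # lowest TER: Regression
```

Under this assumed definition the regression method has the lowest error rate and NaiveBayes the highest, which matches the ordering described in the text.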


5. Conclusion and Future Work

In this paper, we proposed a new data mining approach based on classification methodologies for detecting malware behavior. First, a malware behavior executive history XML file is converted to a nonsparse matrix using our suggested application. Then this matrix is translated to a WEKA input data set. To illustrate the performance efficiency, we applied the proposed approaches to a real case study data set using the WEKA tool. Training was carried out with classification algorithms such as NaiveBayes, BayesNet, IB1, J48, and regression algorithms. The regression classification method had the best performance for malware detection. We also analyzed the new data set with the regression classification method. The evaluation results demonstrated the availability of the proposed data mining approach. Also, our proposed data mining mechanism is more efficient for detecting malware. Based on the experimental results, classification of malware behavioral features can be a convenient method for developing a behavioral antivirus. In future work, we will try to develop and analyze a real behavioral antivirus platform based on the classification via regression algorithm.
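The conversion pipeline summarized above (XML behavior report, then nonsparse matrix, then WEKA input) can be sketched in Python as follows. The XML layout used here, with one sample element per program and one call element per observed action, is purely hypothetical and is not the CWSandbox report format; the function names are likewise illustrative. Only the ARFF structure of the output (the @relation, @attribute, and @data sections) follows the WEKA input format.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical report layout, invented for this sketch.
SAMPLE_REPORTS = """
<reports>
  <sample label="worm">
    <call name="create_file"/><call name="open_socket"/><call name="open_socket"/>
  </sample>
  <sample label="trojan">
    <call name="set_registry_key"/><call name="create_file"/>
  </sample>
</reports>
"""


def reports_to_matrix(xml_text: str) -> Tuple[List[str], List[List[int]], List[str]]:
    """Count each action per sample; every cell is stored explicitly (nonsparse)."""
    root = ET.fromstring(xml_text.strip())
    samples = root.findall("sample")
    features = sorted({c.get("name") for s in samples for c in s.findall("call")})
    rows, labels = [], []
    for s in samples:
        counts = dict.fromkeys(features, 0)
        for c in s.findall("call"):
            counts[c.get("name")] += 1
        rows.append([counts[f] for f in features])
        labels.append(s.get("label"))
    return features, rows, labels


def matrix_to_arff(relation: str, features: List[str],
                   rows: List[List[int]], labels: List[str]) -> str:
    """Serialize the matrix as ARFF text with one numeric attribute per action."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in features]
    lines.append("@attribute class {" + ",".join(sorted(set(labels))) + "}")
    lines += ["", "@data"]
    for row, label in zip(rows, labels):
        lines.append(",".join(str(v) for v in row) + "," + label)
    return "\n".join(lines) + "\n"


features, rows, labels = reports_to_matrix(SAMPLE_REPORTS)
arff_text = matrix_to_arff("malware_behavior", features, rows, labels)
print(arff_text)
```

Counting occurrences is only one way to fill the matrix; a 0/1 presence flag per action would fit the same ARFF layout.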

Competing Interests

The authors declare that they have no competing interests.

References

[1] D. B. Ekta Gandotra and S. Sofat, “Malware analysis and classification: a survey,” Journal of Information Security, vol. 5, pp. 56–64, 2014.

[2] P. Wang and Y.-S. Wang, “Malware behavioural detection and vaccine development by using a support vector model classifier,” Journal of Computer and System Sciences, vol. 81, no. 6, pp. 1012–1026, 2015.

[3] G. Ollmann, “The evolution of commercial malware development kits and colour-by-numbers custom malware,” Computer Fraud and Security, vol. 2008, no. 9, pp. 4–7, 2008.

[4] M. Ghiasi, A. Sami, and Z. Salehi, “Dynamic VSA: a framework for malware detection based on register contents,” Engineering Applications of Artificial Intelligence, vol. 44, pp. 111–122, 2015.

[5] D. Bruschi, L. Martignoni, and M. Monga, “Detecting self-mutating malware using control-flow graph matching,” in Detection of Intrusions and Malware & Vulnerability Assessment, R. Büschkes and P. Laskov, Eds., vol. 4064, pp. 129–143, Springer, Berlin, Germany, 2006.

[6] M. R. Chouchane and A. Lakhotia, “Using engine signature to detect metamorphic malware,” in Proceedings of the 4th ACM Workshop on Recurring Malcode, Alexandria, Va, USA, 2006.

[7] N. Kuzurin, A. Shokurov, N. Varnovsky, and V. Zakharov, “On the concept of software obfuscation in computer security,” in Information Security, J. Garay, A. Lenstra, M. Mambo, and R. Peralta, Eds., vol. 4779, pp. 281–298, Springer, Berlin, Germany, 2007.

[8] M. Christodorescu and S. Jha, “Testing malware detectors,” SIGSOFT Software Engineering Notes, vol. 29, no. 4, pp. 34–44, 2004.

[9] L. K. Mehedy Masud and B. Thuraisingham, Data Mining Tools for Malware Detection, vol. 1, CRC Press, 2012.

[10] M. Egele, T. Scholte, E. Kirda, and C. Kruegel, “A survey on automated dynamic malware-analysis techniques and tools,” ACM Computing Surveys, vol. 44, pp. 1–42, 2008.

[11] S. P. Monire Norouzi and A. Mahjur, “A new approach for formal behavioral modeling of protection services in antivirus systems,” International Journal in Foundations of Computer Science & Technology, vol. 4, pp. 57–67, 2014.

[12] A. Safarkhanlou, A. Souri, M. Norouzi, and S. E. H. Sardroud, “Formalizing and verification of an antivirus protection service using model checking,” Procedia Computer Science, vol. 57, pp. 1324–1331, 2015.

[13] I. Santos, F. Brezo, X. Ugarte-Pedrero, and P. G. Bringas, “Opcode sequences as representation of executables for data-mining-based unknown malware detection,” Information Sciences, vol. 231, pp. 64–82, 2013.

[14] N. Abdelhamid, A. Ayesh, and F. Thabtah, “Phishing detection based associative classification data mining,” Expert Systems with Applications, vol. 41, no. 13, pp. 5948–5959, 2014.

[15] G. Jacob, H. Debar, and E. Filiol, “Behavioral detection of malware: from a survey towards an established taxonomy,” Journal in Computer Virology, vol. 4, no. 3, pp. 251–266, 2008.

[16] A. Shabtai, R. Moskovitch, Y. Elovici, and C. Glezer, “Detection of malicious code by applying machine learning classifiers on static features: a state-of-the-art survey,” Information Security Technical Report, vol. 14, no. 1, pp. 16–29, 2009.

[17] S. B. Kotsiantis, “Supervised machine learning: a review of classification techniques,” in Proceedings of the Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, pp. 3–24, IOS Press, 2007.

[18] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: a review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.

[19] L. Chen, Q. Jiang, and S. Wang, “Model-based method for projective clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 7, pp. 1291–1305, 2012.

[20] C. Ravi and R. Manoharan, “Malware detection using Windows Api sequence and machine learning,” International Journal of Computer Applications, vol. 43, no. 17, pp. 12–16, 2012.

[21] R. Rizwan, G. C. Hazarika, and G. Chetia, “Malware threats and mitigation strategies: a survey,” Journal of Theoretical and Applied Information Technology, vol. 29, no. 2, pp. 69–73, 2011.

[22] K. Mathur and H. Saroj, “A survey on techniques in detection and analyzing malware executables,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 44, no. 2, 2012.

[23] N. F. Doherty, L. Anastasakis, and H. Fulford, “The information security policy unpacked: a critical study of the content of university policies,” International Journal of Information Management, vol. 29, no. 6, pp. 449–457, 2009.

[24] G. Tahan, L. Rokach, and Y. Shahar, “Automatic malware detection using common segment analysis and meta-features,” Journal of Machine Learning Research, vol. 13, pp. 949–979, 2012.

[25] M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario, “Automated classification and analysis of internet malware,” in Recent Advances in Intrusion Detection, C. Kruegel, R. Lippmann, and A. Clark, Eds., vol. 4637, pp. 178–197, Springer, Berlin, Germany, 2007.

[26] U. Bayer, A. Moser, C. Kruegel, and E. Kirda, “Dynamic analysis of malicious code,” Journal in Computer Virology, vol. 2, no. 1, pp. 67–77, 2006.


[27] J. Z. Kolter and M. A. Maloof, “Learning to detect and classify malicious executables in the wild,” Journal of Machine Learning Research, vol. 7, pp. 2721–2744, 2006.

[28] P. Trinius, C. Willems, T. Holz, and K. Rieck, A Malware Instruction Set for Behavior-Based Analysis, 2009.

[29] R. Moskovitch and Y. Shahar, “Classification-driven temporal discretization of multivariate time series,” Data Mining and Knowledge Discovery, vol. 29, no. 4, pp. 871–913, 2015.

[30] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, “Data mining methods for detection of new malicious executables,” in Proceedings of the IEEE Symposium on Security and Privacy (S&P), pp. 38–49, Oakland, Calif, USA, 2001.

[31] J. Z. Kolter and M. A. Maloof, “Learning to detect malicious executables in the wild,” in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’04), pp. 470–478, ACM, Seattle, Wash, USA, August 2004.

[32] M. Siddiqui, M. C. Wang, and J. Lee, “Detecting internet worms using data mining techniques,” Journal of Systemics, Cybernetics and Informatics, vol. 6, pp. 48–53, 2008.

[33] H. Yang and Y.-P. P. Chen, “Data mining in lung cancer pathologic staging diagnosis: correlation between clinical and pathology information,” Expert Systems with Applications, vol. 42, no. 15-16, pp. 6168–6176, 2015.

[34] R. Campagni, D. Merlini, R. Sprugnoli, and M. C. Verri, “Data mining models for student careers,” Expert Systems with Applications, vol. 42, no. 13, pp. 5508–5521, 2015.

[35] R. M. Rahman and F. R. Md. Hasan, “Using and comparing different decision tree classification techniques for mining ICDDR,B Hospital Surveillance data,” Expert Systems with Applications, vol. 38, no. 9, pp. 11421–11436, 2011.

[36] S. Ghosh, S. Biswas, D. Sarkar, and P. P. Sarkar, “A novel Neuro-fuzzy classification technique for data mining,” Egyptian Informatics Journal, vol. 15, no. 3, pp. 129–147, 2014.

[37] R. Moskovitch and Y. Shahar, “Fast time intervals mining using the transitivity of temporal relations,” Knowledge and Information Systems, vol. 42, no. 1, pp. 21–48, 2015.

[38] D. Stopel, Z. Boger, R. Moskovitch, Y. Shahar, and Y. Elovici, “Application of artificial neural networks techniques to computer worm detection,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN ’06), pp. 2362–2369, July 2006.

[39] D. Stopel, R. Moskovitch, Z. Boger, Y. Shahar, and Y. Elovici, “Using artificial neural networks to detect unknown computer worms,” Neural Computing and Applications, vol. 18, no. 7, pp. 663–674, 2009.

[40] R. Moskovitch, I. Gus, S. Pluderman, et al., “Detection of unknown computer worms activity based on computer behavior using data mining,” in Proceedings of the 1st IEEE Symposium on Computational Intelligence and Data Mining (CIDM ’07), pp. 202–209, IEEE, Honolulu, Hawaii, USA, April 2007.

[41] N. Nissim, R. Moskovitch, L. Rokach, and Y. Elovici, “Detecting unknown computer worm activity via support vector machines and active learning,” Pattern Analysis and Applications, vol. 15, no. 4, pp. 459–475, 2012.

[42] N. Karthik, R. Arul, and M. J. H. Prasad, “Modeling of wind turbine power curves using firefly algorithm,” in Power Electronics and Renewable Energy Systems, C. Kamalakannan, L. P. Suresh, S. S. Dash, and B. K. Panigrahi, Eds., vol. 326, pp. 1407–1414, Springer, New Delhi, India, 2015.

[43] F. Galton, Finger Prints, Macmillan and Company, 1892.

[44] B. D. Eugenio and M. Glass, “The kappa statistic: a second look,” Computational Linguistics, vol. 30, no. 1, pp. 95–101, 2004.

[45] M. N. Mohammad, N. Sulaiman, and O. A. Muhsin, “A novel intrusion detection system by using intelligent data mining in weka environment,” Procedia Computer Science, vol. 3, pp. 1237–1242, 2011.

[46] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2002.

[47] M. Deshmukh and M. N. K. Prasad, “Partial segmentation and matching technique for iris recognition,” in Computational Intelligence in Data Mining - Volume 1, L. C. Jain, H. S. Behera, J. K. Mandal, and D. P. Mohapatra, Eds., vol. 31, pp. 77–86, Springer India, 2015.

[48] I. Rodríguez-Fdez, A. Canosa, M. Mucientes, and A. Bugarín, “STAC: a web platform for the comparison of algorithms using statistical tests,” in Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 1–8, Istanbul, Turkey, August 2015.



[38] D Stopel Z Boger R Moskovitch Y Shahar and Y ElovicildquoApplication of artificial neural networks techniques to com-puter worm detectionrdquo in Proceedings of the International JointConference onNeural Networks (IJCNN rsquo06) pp 2362ndash2369 July2006

[39] D Stopel R Moskovitch Z Boger Y Shahar and Y ElovicildquoUsing artificial neural networks to detect unknown computerwormsrdquo Neural Computing and Applications vol 18 no 7 pp663ndash674 2009

[40] R Moskovitch I Gus S Pluderman et al ldquoDetection ofunknown computer worms activity based on computer behav-ior using dataminingrdquo in Proceedings of the 1st IEEE Symposiumon Computational Intelligence and Data Mining (CIDM rsquo07) pp202ndash209 IEEE Honolulu Hawaii USA April 2007

[41] N Nissim R Moskovitch L Rokach and Y Elovici ldquoDetectingunknown computer worm activity via support vector machinesand active learningrdquo Pattern Analysis and Applications vol 15no 4 pp 459ndash475 2012

[42] N Karthik R Arul and M J H Prasad ldquoModeling ofwind turbine power curves using firefly algorithmrdquo in PowerElectronics and Renewable Energy Systems C KamalakannanL P Suresh S S Dash and B K Panigrahi Eds vol 326 pp1407ndash1414 Springer New Delhi India 2015

[43] F Galton Finger Prints Macmillan and Company 1892

[44] B D Eugenio andMGlass ldquoThe kappa statistic a second lookrdquoComputational Linguistics vol 30 no 1 pp 95ndash101 2004

[45] M N Mohammad N Sulaiman and O A Muhsin ldquoA novelintrusion detection system by using intelligent data mining inweka environmentrdquo Procedia Computer Science vol 3 pp 1237ndash1242 2011

[46] M Kantardzic Data Mining Concepts Models Methods andAlgorithms John Wiley amp Sons 2002

[47] M Deshmukh and M N K Prasad ldquoPartial segmentationand matching technique for iris recognitionrdquo in ComputationalIntelligence in Data MiningmdashVolume 1 L C Jain H S BeheraJ K Mandal and D P Mohapatra Eds vol 31 pp 77ndash86Springer India 2015

[48] I Rodrıguez-Fdez A Canosa M Mucientes and A BugarınldquoSTAC a web platform for the comparison of algorithmsusing statistical testsrdquo in Proceedings of the IEEE InternationalConference on Fuzzy Systems pp 1ndash8 Istanbul Turkey August2015

8 Journal of Computer Networks and Communications

5. Conclusion and Future Work

In this paper, we proposed a new data mining approach based on classification methodologies for detecting malware behavior. First, a malware behavior executive history XML file is converted to a nonsparse matrix using our suggested application. This matrix is then translated into a WEKA input data set. To illustrate the performance efficiency, we applied the proposed approaches to a real case study data set using the WEKA tool. Training was carried out with several classification algorithms: NaiveBayes, BayesNet, IB1, J48, and regression algorithms. The regression classification method had the best performance for malware detection. We also analyzed the new data set with the regression classification method. The evaluation results demonstrated the applicability of the proposed data mining approach and indicate that the proposed mechanism detects malware efficiently. According to the experimental results, classification of malware behavioral features can be a convenient method for developing a behavioral antivirus. In future work, we will try to develop and analyze a real behavioral antivirus platform based on the classification via regression algorithm.
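The conversion pipeline summarized above (XML execution history, then a nonsparse matrix, then a WEKA input data set) can be illustrated with a minimal Python sketch. It is not the authors' application: the XML layout (`<report>` with `<call name="..."/>` children), the sample reports, and the helper names `extract_features`, `build_matrix`, and `to_arff` are all assumptions made for illustration. Only the ARFF keywords (`@relation`, `@attribute`, `@data`) follow the WEKA file format.

```python
import xml.etree.ElementTree as ET

# Hypothetical reports: the paper does not specify its XML schema, so this
# layout (one <call name="..."/> element per observed behavior) is assumed.
SAMPLE_REPORTS = {
    "sample_a": ("malware", "<report><call name='CreateFile'/><call name='RegSetValue'/></report>"),
    "sample_b": ("benign", "<report><call name='CreateFile'/></report>"),
}


def extract_features(xml_text):
    """Return the set of behavior names recorded in one XML report."""
    root = ET.fromstring(xml_text)
    return {call.get("name") for call in root.iter("call")}


def build_matrix(reports):
    """Build a dense 0/1 matrix: one row per sample, one column per behavior."""
    per_sample = {
        sample_id: (label, extract_features(xml_text))
        for sample_id, (label, xml_text) in reports.items()
    }
    # Column order is the sorted union of all behaviors seen in any report.
    columns = sorted(set().union(*(feats for _, feats in per_sample.values())))
    rows = []
    for sample_id in sorted(per_sample):
        label, feats = per_sample[sample_id]
        # Every cell is written explicitly, so the matrix is nonsparse.
        rows.append(([1 if name in feats else 0 for name in columns], label))
    return columns, rows


def to_arff(relation, columns, rows):
    """Serialize the matrix as ARFF text with a nominal class attribute."""
    labels = sorted({label for _, label in rows})
    lines = ["@relation " + relation, ""]
    # Attribute names containing spaces would need quoting; none do here.
    lines += ["@attribute " + name + " {0,1}" for name in columns]
    lines.append("@attribute class {" + ",".join(labels) + "}")
    lines += ["", "@data"]
    for values, label in rows:
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"


columns, rows = build_matrix(SAMPLE_REPORTS)
arff_text = to_arff("malware_behavior", columns, rows)
print(arff_text)
```

The resulting text can be saved with an `.arff` extension and opened in WEKA, where classifiers such as those named above could be trained on it. The binary presence/absence encoding is one possible choice; the paper's matrix may encode behaviors differently.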

Competing Interests

The authors declare that they have no competing interests.

References

[1] D. B. Ekta Gandotra and S. Sofat, "Malware analysis and classification: a survey," Journal of Information Security, vol. 5, pp. 56–64, 2014.

[2] P. Wang and Y.-S. Wang, "Malware behavioural detection and vaccine development by using a support vector model classifier," Journal of Computer and System Sciences, vol. 81, no. 6, pp. 1012–1026, 2015.

[3] G. Ollmann, "The evolution of commercial malware development kits and colour-by-numbers custom malware," Computer Fraud and Security, vol. 2008, no. 9, pp. 4–7, 2008.

[4] M. Ghiasi, A. Sami, and Z. Salehi, "Dynamic VSA: a framework for malware detection based on register contents," Engineering Applications of Artificial Intelligence, vol. 44, pp. 111–122, 2015.

[5] D. Bruschi, L. Martignoni, and M. Monga, "Detecting self-mutating malware using control-flow graph matching," in Detection of Intrusions and Malware & Vulnerability Assessment, R. Büschkes and P. Laskov, Eds., vol. 4064, pp. 129–143, Springer, Berlin, Germany, 2006.

[6] M. R. Chouchane and A. Lakhotia, "Using engine signature to detect metamorphic malware," in Proceedings of the 4th ACM Workshop on Recurring Malcode, Alexandria, Va, USA, 2006.

[7] N. Kuzurin, A. Shokurov, N. Varnovsky, and V. Zakharov, "On the concept of software obfuscation in computer security," in Information Security, J. Garay, A. Lenstra, M. Mambo, and R. Peralta, Eds., vol. 4779, pp. 281–298, Springer, Berlin, Germany, 2007.

[8] M. Christodorescu and S. Jha, "Testing malware detectors," SIGSOFT Software Engineering Notes, vol. 29, no. 4, pp. 34–44, 2004.

[9] L. K. Mehedy Masud and B. Thuraisingham, Data Mining Tools for Malware Detection, vol. 1, CRC Press, 2012.

[10] M. Egele, T. Scholte, E. Kirda, and C. Kruegel, "A survey on automated dynamic malware-analysis techniques and tools," ACM Computing Surveys, vol. 44, pp. 1–42, 2008.

[11] S. P. Monire Norouzi and A. Mahjur, "A new approach for formal behavioral modeling of protection services in antivirus systems," International Journal in Foundations of Computer Science & Technology, vol. 4, pp. 57–67, 2014.

[12] A. Safarkhanlou, A. Souri, M. Norouzi, and S. E. H. Sardroud, "Formalizing and verification of an antivirus protection service using model checking," Procedia Computer Science, vol. 57, pp. 1324–1331, 2015.

[13] I. Santos, F. Brezo, X. Ugarte-Pedrero, and P. G. Bringas, "Opcode sequences as representation of executables for data-mining-based unknown malware detection," Information Sciences, vol. 231, pp. 64–82, 2013.

[14] N. Abdelhamid, A. Ayesh, and F. Thabtah, "Phishing detection based associative classification data mining," Expert Systems with Applications, vol. 41, no. 13, pp. 5948–5959, 2014.

[15] G. Jacob, H. Debar, and E. Filiol, "Behavioral detection of malware: from a survey towards an established taxonomy," Journal in Computer Virology, vol. 4, no. 3, pp. 251–266, 2008.

[16] A. Shabtai, R. Moskovitch, Y. Elovici, and C. Glezer, "Detection of malicious code by applying machine learning classifiers on static features: a state-of-the-art survey," Information Security Technical Report, vol. 14, no. 1, pp. 16–29, 2009.

[17] S. B. Kotsiantis, "Supervised machine learning: a review of classification techniques," in Proceedings of the Conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, pp. 3–24, IOS Press, 2007.

[18] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.

[19] L. Chen, Q. Jiang, and S. Wang, "Model-based method for projective clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 7, pp. 1291–1305, 2012.

[20] C. Ravi and R. Manoharan, "Malware detection using Windows Api sequence and machine learning," International Journal of Computer Applications, vol. 43, no. 17, pp. 12–16, 2012.

[21] R. Rizwan, G. C. Hazarika, and G. Chetia, "Malware threats and mitigation strategies: a survey," Journal of Theoretical and Applied Information Technology, vol. 29, no. 2, pp. 69–73, 2011.

[22] K. Mathur and H. Saroj, "A survey on techniques in detection and analyzing malware executables," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 44, no. 2, 2012.

[23] N. F. Doherty, L. Anastasakis, and H. Fulford, "The information security policy unpacked: a critical study of the content of university policies," International Journal of Information Management, vol. 29, no. 6, pp. 449–457, 2009.

[24] G. Tahan, L. Rokach, and Y. Shahar, "Automatic malware detection using common segment analysis and meta-features," Journal of Machine Learning Research, vol. 13, pp. 949–979, 2012.

[25] M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario, "Automated classification and analysis of internet malware," in Recent Advances in Intrusion Detection, C. Kruegel, R. Lippmann, and A. Clark, Eds., vol. 4637, pp. 178–197, Springer, Berlin, Germany, 2007.

[26] U. Bayer, A. Moser, C. Kruegel, and E. Kirda, "Dynamic analysis of malicious code," Journal in Computer Virology, vol. 2, no. 1, pp. 67–77, 2006.

[27] J. Z. Kolter and M. A. Maloof, "Learning to detect and classify malicious executables in the wild," Journal of Machine Learning Research, vol. 7, pp. 2721–2744, 2006.

[28] P. Trinius, C. Willems, T. Holz, and K. Rieck, A Malware Instruction Set for Behavior-Based Analysis, 2009.

[29] R. Moskovitch and Y. Shahar, "Classification-driven temporal discretization of multivariate time series," Data Mining and Knowledge Discovery, vol. 29, no. 4, pp. 871–913, 2015.

[30] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, "Data mining methods for detection of new malicious executables," in Proceedings of the IEEE Symposium on Security and Privacy (S&P), pp. 38–49, Oakland, Calif, USA, 2001.

[31] J. Z. Kolter and M. A. Maloof, "Learning to detect malicious executables in the wild," in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), pp. 470–478, ACM, Seattle, Wash, USA, August 2004.

[32] M. Siddiqui, M. C. Wang, and J. Lee, "Detecting internet worms using data mining techniques," Journal of Systemics, Cybernetics and Informatics, vol. 6, pp. 48–53, 2008.

[33] H. Yang and Y.-P. P. Chen, "Data mining in lung cancer pathologic staging diagnosis: correlation between clinical and pathology information," Expert Systems with Applications, vol. 42, no. 15-16, pp. 6168–6176, 2015.

[34] R. Campagni, D. Merlini, R. Sprugnoli, and M. C. Verri, "Data mining models for student careers," Expert Systems with Applications, vol. 42, no. 13, pp. 5508–5521, 2015.

[35] R. M. Rahman and F. R. Md. Hasan, "Using and comparing different decision tree classification techniques for mining ICDDR,B Hospital Surveillance data," Expert Systems with Applications, vol. 38, no. 9, pp. 11421–11436, 2011.

[36] S. Ghosh, S. Biswas, D. Sarkar, and P. P. Sarkar, "A novel Neuro-fuzzy classification technique for data mining," Egyptian Informatics Journal, vol. 15, no. 3, pp. 129–147, 2014.

[37] R. Moskovitch and Y. Shahar, "Fast time intervals mining using the transitivity of temporal relations," Knowledge and Information Systems, vol. 42, no. 1, pp. 21–48, 2015.

[38] D. Stopel, Z. Boger, R. Moskovitch, Y. Shahar, and Y. Elovici, "Application of artificial neural networks techniques to computer worm detection," in Proceedings of the International Joint Conference on Neural Networks (IJCNN '06), pp. 2362–2369, July 2006.

[39] D. Stopel, R. Moskovitch, Z. Boger, Y. Shahar, and Y. Elovici, "Using artificial neural networks to detect unknown computer worms," Neural Computing and Applications, vol. 18, no. 7, pp. 663–674, 2009.

[40] R. Moskovitch, I. Gus, S. Pluderman et al., "Detection of unknown computer worms activity based on computer behavior using data mining," in Proceedings of the 1st IEEE Symposium on Computational Intelligence and Data Mining (CIDM '07), pp. 202–209, IEEE, Honolulu, Hawaii, USA, April 2007.

[41] N. Nissim, R. Moskovitch, L. Rokach, and Y. Elovici, "Detecting unknown computer worm activity via support vector machines and active learning," Pattern Analysis and Applications, vol. 15, no. 4, pp. 459–475, 2012.

[42] N. Karthik, R. Arul, and M. J. H. Prasad, "Modeling of wind turbine power curves using firefly algorithm," in Power Electronics and Renewable Energy Systems, C. Kamalakannan, L. P. Suresh, S. S. Dash, and B. K. Panigrahi, Eds., vol. 326, pp. 1407–1414, Springer, New Delhi, India, 2015.

[43] F. Galton, Finger Prints, Macmillan and Company, 1892.

[44] B. D. Eugenio and M. Glass, "The kappa statistic: a second look," Computational Linguistics, vol. 30, no. 1, pp. 95–101, 2004.

[45] M. N. Mohammad, N. Sulaiman, and O. A. Muhsin, "A novel intrusion detection system by using intelligent data mining in weka environment," Procedia Computer Science, vol. 3, pp. 1237–1242, 2011.

[46] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2002.

[47] M. Deshmukh and M. N. K. Prasad, "Partial segmentation and matching technique for iris recognition," in Computational Intelligence in Data Mining - Volume 1, L. C. Jain, H. S. Behera, J. K. Mandal, and D. P. Mohapatra, Eds., vol. 31, pp. 77–86, Springer India, 2015.

[48] I. Rodríguez-Fdez, A. Canosa, M. Mucientes, and A. Bugarín, "STAC: a web platform for the comparison of algorithms using statistical tests," in Proceedings of the IEEE International Conference on Fuzzy Systems, pp. 1–8, Istanbul, Turkey, August 2015.
