Research Article - Hindawi Publishing...

Research ArticleAutomatic Identification of Honeypot Server Using MachineLearning Techniques

Cheng Huang 1 Jiaxuan Han1 Xing Zhang2 and Jiayong Liu 1

1College of Cybersecurity Sichuan University Chengdu 610065 China2NSFOCUS Beijing 100089 China

Correspondence should be addressed to Jiayong Liu jylscugmailcom

Received 28 April 2019 Revised 14 July 2019 Accepted 23 August 2019 Published 22 September 2019

Academic Editor Emanuele Maiorana

Copyright copy 2019 Cheng Huang et alis is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Traditional security strategies are powerless when facing novel attacks in the complex network environment such as advancedpersistent threat (APT) Compared with traditional security detection strategies the honeypot system especially on the Internet ofthings research area is intended to be attacked and automatically monitor potential attacks by analyzing network packages or logles e researcher can extract exactly threat actor tactics techniques and procedures from these data and then generate moreeective defense strategies But for normal security researchers it is an urgent topic how to improve the honeypot mechanismwhich could not be recognized by attackers and silently capture their behaviors So they need awesome intelligent techniques toautomatically check remotely whether the server runs honeypot service or not As the rapid progress in honeypot detection usingmachine learning technologies the paper proposed a new automatic identication model based on random forest algorithm withthree group features application-layer feature network-layer feature and other system-layer feature e experiment datasets arecollected from public known platforms and designed to prove the eectiveness of the proposed model e experiment resultsshowed that the presented model achieved a high area under curve (AUC) value with 093 (area under the receiver operatingcharacteristic curve) which is better than other machine learning algorithms

1 Introduction

Many traditional security detection strategies such as rewallor intrusion detection systems (IDS) have been invented toprotect the systemrsquos security but there are still many criticalissues which are reported every day [1 2] e situation hasbecome more complex with the development of Internettechnologies Honeypots are dened as a new deceptiontechnology of cybersecurity defense in recent years [3]which can be used to strengthen the companyrsquos securitydetection level In other words honeypots are used to lureattackers into interacting with them and collect the in-formation which will be used to analyze and study the attackway of attackers [4ndash6]

Depending on the interaction level of honeypots theyare divided into three categories low-interaction honey-pots medium-interaction honeypots and high-interactionhoneypots [3 7ndash9] Low-interaction honeypots and

medium-interaction honeypots will not provide anyphysical environment for attackers us they only run afew sets of web services Especially for low-interactionhoneypots services that are simulated through port lis-tening could be detected easily and remotely But for high-interaction honeypots the services are fully deployed withnative programs [10]

With the widespread use of deception technologyhoneypot detection methods also need improvement whichmeans researchers need to modify their honeypots morerealistic to cheat potential attackers instead of easily dis-tinguished us many researchers start to focus on how toautomatically detect honeypots before publishing theirhoneypot framework [11ndash14] e authors [15] discussedseveral honeypot detection features by ngerprinting Holzand Raynal [16] proposed other methods based on detectingUser-mode Linux or VMWare system With the rapid de-velopment of honeypots the poundaws of previous detection

HindawiSecurity and Communication NetworksVolume 2019 Article ID 2627608 8 pageshttpsdoiorg10115520192627608

methods are exposed gradually)emain problem is that thesimulation of high-interaction honeypots is too similar toreal systems making it difficult to detect with a singlefeature )erefore other researchers start to extract otherfeatures and introduce the latest algorithms [17ndash19] Butthere are still many shortcomings at present this paperexplores honeypot detection technology to improve theeffectiveness of deception technologies which is helpful topromote the current development of honeypot technology)e contributions of this paper are summarized as follows

(i) Based on the analysis of the existing honeypot de-tection technology this paper proposed three groupfeatures which can distinguish ordinary servers andhoneypot servers remotely and accurately )esefeatures could be summarized as application-layernetwork-layer and other system-layer

(ii) )e paper presented an automatic detection modelusing multiple features and machine learning al-gorithms to identify honeypots In order to gethigher accuracy four machine learning algorithmsare introduced to build separate models

(iii) In the interest of highlighting the effectiveness of theproposed model the paper compared receiver op-erating characteristic (ROC) curves of four machinelearning algorithms based on 10-fold cross valida-tion methods According to the experiment resultthe model achieved a higher detection rate with theAUC value of 093

)e construction of this paper is as follows )e firstsection introduces the issue Section 2 is the related workwhich presents current research progress in honeypotdetection Section 3 is the proposed framework and itdescribes the detail of our system Section 4 is the exper-iment and evaluation result Section 5 is the conclusion andfuture work

2 Related Works

With the development of Internet technology honeypotsplay a significant role in network security However theattackers could easily distinguish whether the server hasdeployed honeypot services or not In response to solve sucha threat researchers need to make their honeypot servicemore realistic and improve the inner mechanism and outerinterface of the honeypot framework Recently many re-searchers start to focus on how to automatically detect ahoneypot server [20ndash24]

For example Fu et al [18] proposed a method based onexamining the link latency of honeyd honeypots Authorssaid that because the stimulative accuracy of the link latencyof honeypots such as honeyd is always 1ms or 10ms the linklatency in some virtual network is around a multiple of 1msor 10ms According to this characteristic they designed amethod based on NeymanndashPearson decision theory andachieved a high detecting rate Wenda and Ning [25] sug-gested that people can determine honeypots by analyzing thenetwork monitoring characteristic or detecting the virtual

environment Because many high-interactive honeypots arealways deployed with firewalls and IDS such features couldbe used to detect quickly Defibaugh-Chavez et al [17]proposed another method which focused on detectingnetwork-level activities and services of honeypots )enMukkamala et al [19] proposed another method based onthe SVM algorithm [26] )ey added a different featuregroup TCPIP fingerprint and designed an experiment toprove their method effective )e result proved their ideawas correct Except for these methods Holz and Raynal [16]proposed a method based on the feedback information ofhoneypots )ey presented a honeypot detection frameworkby extracting data of User-mode Linux and VMWare [27]For UML people could use the information of commandldquodmesgrdquo in fingerprinting or view the file etcfstab ForVMWare they suggested capturing the hardware and MACaddress According to IEEE Standards a MAC address suchas ldquo00-0E-23-xx-xx-xxrdquo is assigned for VMWare BesidesSend-Safe Honeypot Hunter is the first commercial anti-honeypot technology [28]

Although many detection methods have been proposedthere is still a big problem with the rapid development ofcloud and virtual technologies many of them may not be aseffective as they used to be So we studied previous re-searchersrsquo methods on honeypot detection and then pro-posed a novel honeypot detection framework based onmachine learning technologies

3 Framework

31 Architecture In order to detect honeypots more ef-fectively the paper presents an automatic identificationframework which is shown in Figure 1 )ere are threeimportant parts in the framework feature extraction labelmethods and detection model Feature extraction is pri-marily responsible for collecting features from the targetserver In this part all features are split into three groupfeatures the application layer the network layer and thesystem layer )ese extracted feature values will be cal-culated with string or integer meaning )e second im-portant part is the label methods As the name suggests thepurpose of this part is to collect label data which containhoneypots or normal host IP addresses Cyberspace searchengine such as Shodan FOFA and other public platformsare introduced to collect honeypots servers Manual checkmethods are used after crawling real data from theseplatforms via the application programming interface (API)After obtaining the feature data and label data both ofthem will be transferred to the detection method as thetraining data Each data has a corresponding label data intraining steps More in detail each data record in thetraining data can be described as (feature data and labeldata) In the detection steps the training data will be loadedfirst and then the machine learning algorithms includingrandom forest SVM kNN and Naive_Bayes will be used totrain the data for building machine learning models )ebest model will be selected to do parameter adjusting )ehoneypot detection model will be used to predict theresults

2 Security and Communication Networks

32 Feature Extraction Good features are the basis oftraining an excellent machine learning model In this paperfeatures were divided into three groups application-layerfeature network-layer feature and other system-layerfeature

321 Application-Layer Feature According to Veenarsquos in-vestigation [29] HTTP FTP and SMTP services are threecommon protocols that attackers most attack )e resultshows such simulated services could cheat majority ofhackers Moreover low-interaction honeypots do not im-plement a complete service which could be easily recognizedby simple interaction commands [17] For example bothHTTP-GET and HTTP-OPTIONS are provided in a realsystem but only HTTP-GET method is available in thehoneypot In addition referring to ldquoservice exercisingrdquo ofDefibaugh-Chavezrsquos research [17] and the complexity ofhoneypots services proposed in [28] the application-layerfeature is summarized in Table 1

322 Network-Layer Feature )is TCPIP fingerprints canhelp people recognize different devices including honeypotsSome common TCPIP fingerprints are TCP FIN flag TCPinitial window and ACK value [30] After analysis we finallydecided to use average_received_bytes average_received_ttlaverage_received_ack_flags average_received_push_flagsaverage_received_syn_flags average_received_fin_flagsand average_received_window_size as the network-layerfeature )e network-layer feature is extracted as shown inTable 2 Each value of network-layer feature is computed

from received packets by a special equation For examplethe average_received_ttl comes from the followingequation

average_received_ttl received_ttl1 + received_ttl2 + received_ttl3

3 (1)

In this equation the value of received_ttl1 is obtainedfrom the packet received time from port 21 of the remoteserver )e value of received_ttl2 is calculated from the

packet received time from port 25 of the remote server )evalue of received_ttl3 is obtained from the packet returnedtime from port 80 of the remote server

Public server

Feature data

Application layer

Network layer

System layer

Random forestSVMkNN

Detection modelLabel methods

Feature extraction

Shodan API

FOFA API

Manual Check Label dataHoneypot

detection model

Parameteradjusting Non-honeypot

server

Honeypot server

Machine learning algorithm

Results

Naiumlve_bayes

(i)(ii)

(iii)(iv)

Figure 1 )e framework of proposed automatic identification methods

Table 1 Application-layer feature

No Feature name1 HTTP_GET2 HTTP_OPTIONS3 HTTP_HEAD4 HTTP_TRACE5 FTP_USER6 FTP_PASS7 FTP_MODE8 FTP_RETR9 SMTP_HELO10 SMTP_MAIL11 SMTP_DATA12 SMTP_VRFY13 SMTP_ETRN

Table 2 Network-layer feature

No Feature name1 average_received_bytes2 average_received_ttl3 average_received_ack_flags4 average_received_push_flags5 average_received_syn_flags6 average_received_fin_flags7 Average_received_window_size

Security and Communication Networks 3

323 Other System-Layer Feature Table 3 shows the featuresextracted in the other system-layer feature)ey are port1 port2port3 System_Fingerprint and ICMP_ECHO_response_time

(1) Port1 Port2 and Port3 Since low-interactive honeypotsonly simulate a special service the ports open on them aredifferent from real systems )erefore we chose the portnumber as one feature in the other system-layer feature )emeanings and values of port1 port2 and port3 are listedbelow

(i) Port1 indicates whether the port 21 of target islistening If port 21 is open the value of port1 is 21Otherwise the value is minus 1

(ii) Port2 indicates whether the port 25 of target islistening If port 25 is open the value of port1 is 25Otherwise the value is minus 2

(iii) Port3 indicates whether the port 80 of target islistening If port 80 is open the value of port1 is 80Otherwise the value is minus 3

(2) System_Fingerprint It means the version of operatingsystem running a honeypot such as ldquoMicrosoft WindowsServer 2016rdquo or ldquoUbuntu 1604rdquo According to the manualcheck result it indicates that many honeypots have the sameSystem_Fingerprint so it is considered to be a separatefeature that distinguishes honeypots from real systems

(3) ICMP_ECHO_response_time Mukkamala et al [19]claim that it is common to run several virtual honeypots on asingle machine So when handling ICMP ECHO requesthoneypots are always slower than real systems Utilizing thischaracteristic ICMP_ECHO_response_time was extractedfor the other system-layer feature

33 Classification Model After extracting three groupfeatures it needs to consider choosing an effective andaccurate algorithm and constructing an intelligent modelEnsemble learning is a method of accomplishing the taskby structuring and combining several learners [31] Firstlywe need to explain an important concept named hy-potheses in the ensemble learning area Hypotheses is thecollection of functions which can be described as thecollection of h1 h2 hn As for traditional machinelearning algorithms their purpose is to find the mostappropriate classifier hk for f (f is a function describing therelationship between input X and label y) But ensemblelearning is different all h in hypotheses will be combinedby a special fashion (such as voting) to decide the label ofinput X Specifically the process of ensemble learning canbe described as two steps One is to generate some indi-vidual learners Another is to combine these learners by aspecial fashion By combining individual learners en-semble learning can achieve more superior generalizationperformance than single learning Random forest is themost representative algorithm in ensemble learning

)e definition of random forest is ldquoA random forest is aclassifier consisting of a collection of tree-structured

classifier h (X Θk) k 1 where the Θk are in-dependent identically distributed random vectors and eachtree casts a unit vote for the most popular class at input Xrdquo[32] In fact random forest is the collection of decision treesEach decision tree is a classifier which gives a predictedresult So there are n results from different n classifiersAccording to the principle of random forest algorithm thefinal result will be decided by themajorities of all results [33]Random forest has the ability to give a more accurate resultand it can also handle thousands of input variables withoutvariable deletion [34] this is the best choice for the proposedmachine learning model

34Model Evaluation Model evaluation is a significant partof model training )e purpose of model evaluation is toassess the performance of final model In this paper weintroduced recall precision F1-score (F1) ROC curve andAUC metrics to evaluate the proposed model )eir defi-nitions are as follows

recall TP

TP + FN

precision TP

TP + FP

F1 2 times TP

n + TP minus TN

AUC 12

1113944

nminus 1

i1xi+1 minus xi( 1113857 middot yi+1 + yi( 1113857

(2)

where n is the number of samplesTrue positive (TP) is the number of honeypot servers

classified as honeypot servers True negative (TN) is thenumber of normal servers classified as normal servers Falsepositive (FP) is the number of normal servers classified ashoneypot servers False negative (FN) is the number ofhoneypot servers classified as normal servers Recall rate isthe ratio of all samples correctly classified as honeypotservers to all honeypot servers Precision rate is the ratio ofall samples that are correctly classified as honeypot servers toall samples that are predicted to be honeypot servers F1-score is a comprehensive measure of recall and precision

ROC means receiver operating characteristic It isconstructed by true positive rate (TPR) and false positiverate (FPR) TPR is the ratio of all samples correctly clas-sified as honeypot servers to all honeypot servers FPR is theratio of the sample incorrectly predicted to be honeypotservers to all the normal servers In the coordinates of a

Table 3 Other system-layer feature

No Feature name1 Port12 Port23 Port32 System_Fingerprint3 ICMP_ECHO_response_time


ROC curve X-axis is FPR and Y-axis is TPR TPR and FPRare defined as follows

TPR TP

TP + FN

FPR FP

TN + FP

(3)

Each (FPR TPR) is a point on a ROC curve If a ROCcurve is above the diagonal line between (0 0) and (1 1) itmeans the classification performance of the model is ac-ceptable If a ROC curve is below this line there may besome mistakes in the label of the dataset If a ROC curve isnear to this line the classification performance of themodel is poor In other words the model could predictwildly In order to evaluate the model comprehensivelyresearchers usually use the AUC index to measure theoverall efficiency of classification model )e closer thevalue of AUC is to 1 the better the classification effect of themodel

4 Experiment

In order to make the predicted result more authentic andaccurate the experiment focuses on the real honeypotservers and normal servers based on remote scanningtechnologies In the experiment scikit-learn library is usedto accomplish machine learning model training )e cal-culate procedure ran on the computer with i7-7700HQCPU16GB of memory and GTX 1050ti GPU

41 Dataset Shodan and Fofa are well-known cyberspacesearch engines which are used to search for network devicesMany known honeypot servers have been marked by thesetwo platforms so we can firstly get some IP addresses whichhave been marked as honeypots We also randomly selectmany IP addresses from the Internet Secondly we manuallycheck all IP addresses to determine whether they are hon-eypots or not )en we scan all IP addresses via raw sockettechnology to collect the feature data Finally we collect 2413IP records as the experiment dataset )e number of hon-eypot IP addresses is 807 )e number of real systems IPaddress is 1606 )e dataset is shown in Table 4

42 Experiment Design In order to improve the accuracy ofthe proposed model this paper introduced three steps in thedetail experiment dimension unification parameter ad-justment and cross validation

421 Dimension Unifying In some cases some large di-mension features will affect the final result when training amodel For example there are three records in Table 5 )edimension of feature2 is larger than the dimension offeature1 So when training a model the large dimensionfeature may dominate the predicted result )erefore thelabel of the 4th record may be predicted as 0 However thetruth is 1 )at is why dimension unifying is needed )epaper chose popular normalization method to achieve

dimension unifying )e equation of normalization is asfollows

yi xi minus 1113954x

δ

1113954x 1k

1113944

k

i1xi

δ

1k minus 1

1113944

k

i1xi minus 1113954x( 1113857

2

11139741113972

(4)

x1 x2 xk is a raw data record Each value xi will betransformed into another value yi by the equation )en thenew data record y1 y2 yk is composed of all values yFinally the standard deviation of the new data record is 0and the variance is 1 In this way the problem that largedimension features affect the training result is solved

422 Parameters Adjusting Choosing the most suitableparameters for the model is a necessary part of the modeltraining)ere are two important parameters in random forestalgorithm which will determine the accuracy of the modelOne is ldquon_estimatorsrdquo Another is ldquomax_depthrdquo ldquon_estima-torsrdquo is the number of decision trees and ldquomax_depthrdquo is themaximum allowable depth for each tree If ldquon_estimatorsrdquo istoo big the result may be overfitting However if ldquon_esti-matorsrdquo is too small the result may be underfitting So asuitable value of ldquon_estimatorsrdquo is important for the predictedaccuracy of the finalmodel 100 120 200 300 500 800 1200was prepared for ldquon_estimatorsrdquo and 5 8 15 25 30 100 wasprepared for ldquomax_depthrdquo We tested each of these two col-lections separately to choose the most appropriate value forthese two parameters

423 Cross Validation We introduce 10-fold cross vali-dation methods to avoid overfitting issues Cross validationis a common method of model evaluation [35] )e crossvalidation structure of this paper is shown in Figure 2 )ecross validation structure is composed of three partsFirstly the data D was divided into ten similar mutuallyexclusive subsets in the data split (D was the collectionD1 cupD2 cup cupD10 Di capDj ϕ) )en the divided data

Table 4 Experimental dataset

Data type SizeHoneypot records 807Real system records 1606

Table 5 Records for example

Feature1 Feature2 Feature3 Label001 3000 15 10051 45000 58 00001 12785 15 1002 80000 22


could be transferred to training and testing In the trainingand testing steps nine subsets were used for training everytime and the last one was used for validation So after thetraining and testing there were ten classifiers Each clas-sifier would give a predicted result )e last result wascomputed from these ten results

424 Comparing Experiment )e result without compar-ison is not convincing So we designed this comparingexperiment to highlight the advantages of the method in thispaper Reviewing some researches about machine learningwe chose SVM kNN and Naive_Bayes as the counterpartsof random forest )en these four machine learning algo-rithms would be used to train four models on the samedataset respectively Finally for eachmodel we plotted theirrespective ROC curves in the same coordinate systemAccording to these ROC curves the performance of modelsis clear at a glance

43 Experiment Result )is section shows the result of theexperiment Firstly we tested the values in the two pa-rameter collections respectively )e effect of differentvalues of ldquon_estimatorsrdquo on the accuracy is shown in Fig-ure 3 )e effect of different values of ldquomax_depthrdquo on theaccuracy is shown in Figure 4 )e blue line represents thechange in the training data and the green line represents thechange in the testing data

Analyzing these two figures when ldquon_estimatorsrdquo was200 the accuracy of the blue line and the green line were thehighest So the value of ldquon_estimatorsrdquo was decided to use200 When ldquomax_depthrdquo was 30 the accuracy of blue lineand green line were the highest So we chose 30 to be thevalue of ldquomax_depthrdquo

Table 6 is the recall precision F1-score and accuracy ofthe final model Figure 5 is the ROC curves of the fouralgorithms All of them were concluded after the 10-foldcross validation In Figure 5 the blue line is the ROC curve ofthe random forest model the green line is the ROC curve ofthe SVMmodel the yellow line is the ROC curve of the kNNmodel and the purple line is the ROC curve of theNaive_Bayes model Analyzing the figure results it can beconcluded that the performance of Naive_Bayes is the worstHowever there is little difference in the AUC values betweenrandom forest SVM and kNN But the performance of

random forest is better In Table 6 the recall rate is 082 theprecision rate is 082 the F1-score is 082 and the accuracy is090 )e ROC curve of random forest and these indexes

D

D1 D2

Result1

D10

D1 D2 D10

Result2D10 D2 D1

Result10D1 D2 D9

Training data

Testing data

Final result

Data split

Training and testing

Figure 2 10-fold cross validation method for the model

084

086

088

090

092

094

096

098

100

Accu

racy

200 400 1000800600 1200n_estimators

Training accuracyTesting accuracy

Figure 3 )e effect of different values of ldquon_estimatorsrdquo on theaccuracy

20 40 8060 100max_depth

084

086

088

090

092

094

096

098

100

Accu

racy


Figure 4 )e effect of different values of ldquomax_depthrdquo on theaccuracy


prove the final model has a high detection rate and a goodability of generalization

5 Conclusion

As an active cybersecurity defense strategy the emergenceof honeypots further reduces the attackerrsquos activity spaceAlthough honeypot technology is constantly improvingattackers are constantly searching for weaknesses inhoneypots to identify them)ere is an old saying in Chinathe authorities are fascinated bystanders clear For hon-eypot researchers in order to promote the development ofhoneypot technology it is necessary to learn and discoverthe method of identifying honeypot remotely

Based on the research of early honeypot and honeypotdetection technology this paper proposed a honeypot de-tection technology based onmachine learning algorithms)etechnology used machine learning technology to achieveaccurate honeypot detection through extraction and collectiondifferent layer honeypot features )e experimental resultproved that honeypots are still insufficient in the degree ofsimulating real systems such as the integrity of the simulatedservice At the same time the experimental result provided areference for the improvement of honeypots and promotedthe development of honeypot technology

Data Availability

)e experiment data used to support the findings of thisstudy are available from the corresponding author uponrequest

Conflicts of Interest

)e authors declare that they have no conflicts of interest

Acknowledgments

)e authors of this work were supported by the CCF-NSFOCUS KunPeng Research Fund (2018008) the SichuanUniversity Postdoc Research Foundation (19XJ0002) andthe Frontier Science and Technology Innovation Projects ofNational Key Research and Development Program(2019QY1405) )e authors would like to thank BaimaohuiSecurity Research Institute for providing some experimentdata

References

[1] M B Salem S Hershkop and S J Stolfo ldquoA survey ofinsider attack detection researchrdquo in Insider Attack andCyber Security pp 69ndash90 Springer Berlin Germany 2008

[2] F Sabahi and A Movaghar ldquoIntrusion detection a surveyrdquo inProceedings of the 2008 gtird International Conference onSystems and Networks Communications pp 23ndash26 IEEESliema Malta October 2008

[3] I Mokube andM Adams ldquoHoneypots concepts approachesand challengesrdquo in Proceedings of the 45th Annual SoutheastRegional Conference pp 321ndash326 ACMWinston-Salem NCUSA March 2007

[4] L Spitzner ldquo)e honeynet project trapping the hackersrdquoIEEE Security amp Privacy vol 1 no 2 pp 15ndash23 2003

[5] O )onnard and M Dacier ldquoA framework for attack pat-ternsrsquo discovery in honeynet datardquoDigital Investigation vol 5pp 128ndash139 2008

[6] W Fan Z Du M Smith-Creasey and D FernandezldquoHoneyDOC an efficient honeypot architecture enabling all-round designrdquo IEEE Journal on Selected Areas in Commu-nications vol 37 no 3 pp 683ndash697 2019

[7] K Sadasivam B Samudrala and T A Yang ldquoDesign ofnetwork security projects using honeypotsrdquo Journal ofComputing Sciences in Colleges vol 20 pp 282ndash293 2005

[8] M Mansoori O Zakaria and A Gani ldquoImproving exposureof intrusion deception system through implementation ofhybrid honeypotrdquo gte International Arab Journal of In-formation Technology vol 9 no 5 pp 436ndash444 2012

[9] G Portokalidis and H Bos ldquoSweetBait zero-hour wormdetection and containment using low- and high-interactionhoneypotsrdquo Computer Networks vol 51 no 5 pp 1256ndash12742007

[10] W Fan Z Du D Fernandez and V A Villagra ldquoEnabling ananatomic view to investigate honeypot systems a surveyrdquoIEEE Systems Journal vol 12 no 4 pp 3906ndash3919 2017

[11] M L Bringer C A Chelmecki and H Fujinoki ldquoA surveyrecent advances and future trends in honeypot researchrdquoInternational Journal of Computer Network and InformationSecurity vol 4 no 10 pp 63ndash75 2012

[12] P Wang L Wu R Cunningham and C C Zou ldquoHoneypotdetection in advanced botnet attacksrdquo International Journal ofInformation and Computer Security vol 4 no 1 pp 30ndash512010

[13] K Papazis and N Chilamkurti ldquoDetecting indicators ofdeception in emulated monitoring systemsrdquo Service OrientedComputing and Applications vol 13 no 1 pp 17ndash29 2019

Table 6 Recall precision F1-score and accuracy of the proposedmodel

Recall Precision F1-score Accuracy Data size082 083 082 090 2413

00 02 04 06 08 10

00

02

04

06

08

10

False positive rate

ROC curves

True

pos

itive

rate

Random forest (AUC = 093)SVM (AUC = 090)

kNN (AUC = 090)Naive_bayes(AUC = 060)

Figure 5 ROC curves about four algorithms


[14] W Fan and D Fernandez ldquoA novel SDN based stealthy TCPconnection handover mechanism for hybrid honeypot sys-temsrdquo in Proceedings of the 2017 IEEE Conference on NetworkSoftwarization (NetSoft) pp 1ndash9 IEEE Bologna Italy July2017

[15] H Artail H Safa M Sraj I Kuwatly and Z Al-Masri ldquoAhybrid honeypot framework for improving intrusion detectionsystems in protecting organizational networksrdquo Computers ampSecurity vol 25 no 4 pp 274ndash288 2006

[16] T Holz and F Raynal ldquoDetecting honeypots and othersuspicious environmentsrdquo in Proceedings from the SixthAnnual IEEE SMC Information Assurance Workshoppp 29ndash36 IEEE West Point NY USA June 2005

[17] P Defibaugh-Chavez R Veeraghattam M KannappaS Mukkamala and A Sung ldquoNetwork based detection ofvirtual environments and low interaction honeypotsrdquo inProceedings of the 2006 IEEE SMC Information AssuranceWorkshop pp 283ndash289 IEEE West Point NY USA June2006

[18] X Fu W Yu D Cheng X Tan K Streff and S Graham ldquoOnrecognizing virtual honeypots and countermeasuresrdquo inProceedings of the 2006 2nd IEEE International Symposium onDependable Autonomic and Secure Computing pp 211ndash218IEEE Indianapolis IN USA September 2006

[19] S Mukkamala K Yendrapalli R Basnet M K Shankarapaniand A H Sung ldquoDetection of virtual environments and lowinteraction honeypotsrdquo in Proceedings of the 2007 IEEE SMCInformation Assurance and Security Workshop pp 92ndash98IEEE West Point NY USA June 2007

[20] J Papalitsas S Rauti and V Leppanen ldquoA comparison ofrecord and play honeypot designsrdquo in Proceedings of the 18thInternational Conference on Computer Systems and Technol-ogies pp 133ndash140 ACM Ruse Bulgaria June 2017

[21] R M Campbell K Padayachee and T Masombuka ldquoAsurvey of honeypot research trends and opportunitiesrdquo inProceedings of the 2015 10th International Conference forInternet Technology and Secured Transactions (ICITST)pp 208ndash212 IEEE London UK December 2015

[22] F Y-S Lin Y-S Wang and M-Y Huang ldquoEffective pro-active and reactive defense strategies against malicious attacksin a virtualized honeynetrdquo Journal of Applied Mathematicsvol 2013 Article ID 518213 11 pages 2013

[23] O Surnin F Hussain R Hussain et al ldquoProbabilistic esti-mation of honeypot detection in Internet of things envi-ronmentrdquo in Proceedings of the 2019 International Conferenceon Computing Networking and Communications (ICNC)pp 191ndash196 IEEE Honolulu HI USA February 2019

[24] Z Wang X Feng Y Niu C Zhang and J Su ldquoTSMWD ahigh-speed malicious web page detection system based ontwo-step classifiersrdquo in Proceedings of the 2017 InternationalConference on Networking and Network Applications (NaNA)pp 170ndash175 IEEE Kathmandu City Nepal October 2017

[25] D Wenda and D Ning ldquoA honeypot detection method basedon characteristic analysis and environment detectionrdquo in 2011International Conference in Electrics Communication andAutomatic Control Proceedings pp 201ndash206 Springer BerlinGermany 2012

[26] N Cristianini and J Shawe-TaylorAn Introduction to SupportVector Machines and Other Kernel-Based Learning MethodsCambridge University Press Cambridge UK 2000

[27] N Provos ldquoA virtual honeypot frameworkrdquo in Proceedings ofthe USENIX Security Symposium vol 173 pp 1ndash14 SanDiego CA USA August 2004

[28] N Krawetz ldquoAnti-honeypot technologyrdquo IEEE Security ampPrivacy Magazine vol 2 no 1 pp 76ndash79 2004

[29] K Veena and K Meena ldquoImplementing file and real timebased intrusion detections in secure direct method usingadvanced honeypotrdquo Cluster Computing pp 1ndash8 2018

[30] R Tyagi T Paul B S Manoj and B )anudas ldquoPacketinspection for unauthorized OS detection in enterprisesrdquoIEEE Security amp Privacy vol 13 no 4 pp 60ndash65 2015

[31] Y Liu and X Yao ldquoEnsemble learning via negative corre-lationrdquo Neural Networks vol 12 no 10 pp 1399ndash1404 1999

[32] L Breiman ldquoRandom forestsrdquo Machine Learning vol 45no 1 pp 5ndash32 2001

[33] A Liaw and M Wiener ldquoClassification and regression byrandomForestrdquo R News vol 2 no 3 pp 18ndash22 2002

[34] R Genuer J-M Poggi and C Tuleau-Malot ldquoVariable se-lection using random forestsrdquo Pattern Recognition Lettersvol 31 no 14 pp 2225ndash2236 2010

[35] T Fushiki ldquoEstimation of prediction error by using K-foldcross-validationrdquo Statistics and Computing vol 21 no 2pp 137ndash146 2011


International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018


Active and Passive Electronic Components

VLSI Design



Shock and Vibration


Civil EngineeringAdvances in

Acoustics and VibrationAdvances in



Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of



Journal ofEngineeringVolume 2018

SensorsJournal of



RotatingMachinery


Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018


Chemical EngineeringInternational Journal of Antennas and

Propagation




Navigation and Observation


Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

methods are exposed gradually)emain problem is that thesimulation of high-interaction honeypots is too similar toreal systems making it difficult to detect with a singlefeature )erefore other researchers start to extract otherfeatures and introduce the latest algorithms [17ndash19] Butthere are still many shortcomings at present this paperexplores honeypot detection technology to improve theeffectiveness of deception technologies which is helpful topromote the current development of honeypot technology)e contributions of this paper are summarized as follows

(i) Based on the analysis of the existing honeypot de-tection technology this paper proposed three groupfeatures which can distinguish ordinary servers andhoneypot servers remotely and accurately )esefeatures could be summarized as application-layernetwork-layer and other system-layer

(ii) )e paper presented an automatic detection modelusing multiple features and machine learning al-gorithms to identify honeypots In order to gethigher accuracy four machine learning algorithmsare introduced to build separate models

(iii) In the interest of highlighting the effectiveness of theproposed model the paper compared receiver op-erating characteristic (ROC) curves of four machinelearning algorithms based on 10-fold cross valida-tion methods According to the experiment resultthe model achieved a higher detection rate with theAUC value of 093

)e construction of this paper is as follows )e firstsection introduces the issue Section 2 is the related workwhich presents current research progress in honeypotdetection Section 3 is the proposed framework and itdescribes the detail of our system Section 4 is the exper-iment and evaluation result Section 5 is the conclusion andfuture work

2 Related Works

With the development of Internet technology honeypotsplay a significant role in network security However theattackers could easily distinguish whether the server hasdeployed honeypot services or not In response to solve sucha threat researchers need to make their honeypot servicemore realistic and improve the inner mechanism and outerinterface of the honeypot framework Recently many re-searchers start to focus on how to automatically detect ahoneypot server [20ndash24]

For example Fu et al [18] proposed a method based onexamining the link latency of honeyd honeypots Authorssaid that because the stimulative accuracy of the link latencyof honeypots such as honeyd is always 1ms or 10ms the linklatency in some virtual network is around a multiple of 1msor 10ms According to this characteristic they designed amethod based on NeymanndashPearson decision theory andachieved a high detecting rate Wenda and Ning [25] sug-gested that people can determine honeypots by analyzing thenetwork monitoring characteristic or detecting the virtual

environment Because many high-interactive honeypots arealways deployed with firewalls and IDS such features couldbe used to detect quickly Defibaugh-Chavez et al [17]proposed another method which focused on detectingnetwork-level activities and services of honeypots )enMukkamala et al [19] proposed another method based onthe SVM algorithm [26] )ey added a different featuregroup TCPIP fingerprint and designed an experiment toprove their method effective )e result proved their ideawas correct Except for these methods Holz and Raynal [16]proposed a method based on the feedback information ofhoneypots )ey presented a honeypot detection frameworkby extracting data of User-mode Linux and VMWare [27]For UML people could use the information of commandldquodmesgrdquo in fingerprinting or view the file etcfstab ForVMWare they suggested capturing the hardware and MACaddress According to IEEE Standards a MAC address suchas ldquo00-0E-23-xx-xx-xxrdquo is assigned for VMWare BesidesSend-Safe Honeypot Hunter is the first commercial anti-honeypot technology [28]

Although many detection methods have been proposedthere is still a big problem with the rapid development ofcloud and virtual technologies many of them may not be aseffective as they used to be So we studied previous re-searchersrsquo methods on honeypot detection and then pro-posed a novel honeypot detection framework based onmachine learning technologies

3 Framework

31 Architecture In order to detect honeypots more ef-fectively the paper presents an automatic identificationframework which is shown in Figure 1 )ere are threeimportant parts in the framework feature extraction labelmethods and detection model Feature extraction is pri-marily responsible for collecting features from the targetserver In this part all features are split into three groupfeatures the application layer the network layer and thesystem layer )ese extracted feature values will be cal-culated with string or integer meaning )e second im-portant part is the label methods As the name suggests thepurpose of this part is to collect label data which containhoneypots or normal host IP addresses Cyberspace searchengine such as Shodan FOFA and other public platformsare introduced to collect honeypots servers Manual checkmethods are used after crawling real data from theseplatforms via the application programming interface (API)After obtaining the feature data and label data both ofthem will be transferred to the detection method as thetraining data Each data has a corresponding label data intraining steps More in detail each data record in thetraining data can be described as (feature data and labeldata) In the detection steps the training data will be loadedfirst and then the machine learning algorithms includingrandom forest SVM kNN and Naive_Bayes will be used totrain the data for building machine learning models )ebest model will be selected to do parameter adjusting )ehoneypot detection model will be used to predict theresults







3 (1)



Public server

Feature data

Application layer

Network layer

System layer

Random forestSVMkNN


Feature extraction

Shodan API

FOFA API


detection model


server

Honeypot server


Results

Naiumlve_bayes

(i)(ii)

(iii)(iv)


















recall TP

TP + FN

precision TP

TP + FP

F1 2 times TP

n + TP minus TN

AUC 12

1113944

nminus 1


(2)








TPR TP

TP + FN

FPR FP

TN + FP

(3)


4 Experiment







δ

1113954x 1k

1113944

k

i1xi

δ

1k minus 1

1113944

k

i1xi minus 1113954x( 1113857

2

11139741113972

(4)















D

D1 D2

Result1

D10

D1 D2 D10

Result2D10 D2 D1

Result10D1 D2 D9

Training data

Testing data

Final result

Data split



084

086

088

090

092

094

096

098

100

Accu

racy

200 400 1000800600 1200n_estimators



20 40 8060 100max_depth

084

086

088

090

092

094

096

098

100

Accu

racy





5 Conclusion



Data Availability




Acknowledgments


References
















00 02 04 06 08 10

00

02

04

06

08

10

False positive rate

ROC curves

True

pos

itive

rate






























RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia







3 (1)



Public server

Feature data

Application layer

Network layer

System layer

Random forestSVMkNN


Feature extraction

Shodan API

FOFA API


detection model


server

Honeypot server


Results

Naiumlve_bayes

(i)(ii)

(iii)(iv)


















recall TP

TP + FN

precision TP

TP + FP

F1 2 times TP

n + TP minus TN

AUC 12

1113944

nminus 1


(2)








TPR TP

TP + FN

FPR FP

TN + FP

(3)


4 Experiment







δ

1113954x 1k

1113944

k

i1xi

δ

1k minus 1

1113944

k

i1xi minus 1113954x( 1113857

2

11139741113972

(4)















D

D1 D2

Result1

D10

D1 D2 D10

Result2D10 D2 D1

Result10D1 D2 D9

Training data

Testing data

Final result

Data split



084

086

088

090

092

094

096

098

100

Accu

racy

200 400 1000800600 1200n_estimators



20 40 8060 100max_depth

084

086

088

090

092

094

096

098

100

Accu

racy





5 Conclusion



Data Availability




Acknowledgments


References
















00 02 04 06 08 10

00

02

04

06

08

10

False positive rate

ROC curves

True

pos

itive

rate






























RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia













recall TP

TP + FN

precision TP

TP + FP

F1 2 times TP

n + TP minus TN

AUC 12

1113944

nminus 1


(2)








TPR TP

TP + FN

FPR FP

TN + FP

(3)


4 Experiment







δ

1113954x 1k

1113944

k

i1xi

δ

1k minus 1

1113944

k

i1xi minus 1113954x( 1113857

2

11139741113972

(4)















D

D1 D2

Result1

D10

D1 D2 D10

Result2D10 D2 D1

Result10D1 D2 D9

Training data

Testing data

Final result

Data split



084

086

088

090

092

094

096

098

100

Accu

racy

200 400 1000800600 1200n_estimators



20 40 8060 100max_depth

084

086

088

090

092

094

096

098

100

Accu

racy





5 Conclusion



Data Availability




Acknowledgments


References
















00 02 04 06 08 10

00

02

04

06

08

10

False positive rate

ROC curves

True

pos

itive

rate






























RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia



TPR TP

TP + FN

FPR FP

TN + FP

(3)


4 Experiment







δ

1113954x 1k

1113944

k

i1xi

δ

1k minus 1

1113944

k

i1xi minus 1113954x( 1113857

2

11139741113972

(4)















D

D1 D2

Result1

D10

D1 D2 D10

Result2D10 D2 D1

Result10D1 D2 D9

Training data

Testing data

Final result

Data split



084

086

088

090

092

094

096

098

100

Accu

racy

200 400 1000800600 1200n_estimators



20 40 8060 100max_depth

084

086

088

090

092

094

096

098

100

Accu

racy





5 Conclusion



Data Availability




Acknowledgments


References
















00 02 04 06 08 10

00

02

04

06

08

10

False positive rate

ROC curves

True

pos

itive

rate






























RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia








D

D1 D2

Result1

D10

D1 D2 D10

Result2D10 D2 D1

Result10D1 D2 D9

Training data

Testing data

Final result

Data split



084

086

088

090

092

094

096

098

100

Accu

racy

200 400 1000800600 1200n_estimators



20 40 8060 100max_depth

084

086

088

090

092

094

096

098

100

Accu

racy





5 Conclusion



Data Availability




Acknowledgments


References
















00 02 04 06 08 10

00

02

04

06

08

10

False positive rate

ROC curves

True

pos

itive

rate






























RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia



5 Conclusion



Data Availability




Acknowledgments


References
















00 02 04 06 08 10

00

02

04

06

08

10

False positive rate

ROC curves

True

pos

itive

rate






























RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia



























RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia


Research Article - Hindawi Publishing...

Documents

Transcript of Research Article - Hindawi Publishing...