
[IEEE The International Conference on Emerging Security Information, Systems, and Technologies (SECURWARE 2007) - Valencia, Spain (2007.10.14-2007.10.20)]

Increasing Detection Rate of User-to-Root Attacks Using Genetic Algorithms

Zorana Banković (1), Slobodan Bojanić (1), Octavio Nieto-Taladriz (1) and Atta Badii (2)

(1) Universidad Politécnica de Madrid, {zorana, slobodan, nieto}@die.upm.es

(2) University of Reading, [email protected]

Abstract

An extensive set of machine learning and pattern classification techniques trained and tested on the KDD dataset has failed to detect most user-to-root attacks. This paper presents an approach for mitigating the negative aspects of that dataset, which have led to low detection rates. A genetic algorithm is employed to evolve rules for detecting various types of attacks. The rules are formed from the features of the dataset identified as the most important ones for each attack type. In this way we introduce a high level of generality and thus achieve high detection rates, while also greatly reducing the system training time. We then re-check the decisions of the user-to-root rules with the rules that detect other types of attacks, which decreases the false-positive rate. The model was verified on KDD99, demonstrating higher detection rates than those reported by the state of the art while maintaining a low false-positive rate.

1. Introduction

Along with bringing a revolution in communication and information exchange, the Internet has also provided greater opportunity for disruption and sabotage of data that was previously considered secure. Computer networks are protected against attacks by a number of access restriction policies that act as a coarse-grain filter (anti-virus software, firewalls, message encryption, secured network protocols, password protection). Intrusion detection systems (IDS) are the fine-grain filter placed inside the protected network, looking for known or potential threats in network traffic and/or audit data recorded by hosts.

Intrusion detection systems have three common problems: speed, accuracy and adaptability. The problem of speed arises from the extensive amount of data that intrusion detection systems need to monitor in order to perceive the entire situation. To cope with this problem, the starting point of this work is the extraction of the most important pieces of information that can be deployed for efficient detection of attacks.

Incorporation of learning algorithms provides a potential solution for the adaptation and accuracy issues of the intrusion detection problem [1]. Since the introduction of the Knowledge Discovery in Databases (KDD) 1999 dataset [2] for the 1999 KDD Cup contest, intrusion detection systems based on pattern recognition and machine learning algorithms have been extensively developed. The dataset contains four main attack categories: Probing, Denial of Service (DoS), User-to-Root (U2R), and Remote-to-Local (R2L). Pattern recognition and machine learning algorithms trained with the KDD training data subset and tested on the KDD testing data subset failed to detect the majority of U2R and R2L attacks within the context of misuse detection, as reported in the literature [3], [4]. As explained in Section 2, the low detection rates of U2R and R2L attacks are due to deficiencies of the dataset.

This work investigates the possibility of increasing the detection rate of U2R attacks in the misuse detection context when training on the KDD dataset. A User-to-Root (U2R) attack is a process in which a normal system user illegally gains super-user privileges. Our idea is to achieve a high detection rate of U2R attacks by introducing a high level of generality, forming the rules from only a subset of the most important features of the dataset. As this also results in a high false-positive rate, we deploy additional sets of rules in order to re-check the decision of the rule set for detecting U2R attacks. In this way we manage to greatly reduce the false-positive rate.

We deploy a genetic algorithm (GA) approach for offline training of the rules that classify different types of intrusions. Genetic algorithms are one of the up-and-coming fields in computer security and have only recently been recognized as having potential in the intrusion detection field [5], [6].

International Conference on Emerging Security Information, Systems and Technologies, 0-7695-2989-5/07 $25.00 © 2007 IEEE, DOI 10.1109/SECURWARE.2007.33

In the following text, Section 2 gives an overview of U2R attacks within the KDD dataset and proposes solutions to overcome the deficiencies of the dataset. Section 3 gives an overview of genetic algorithms and the benefits of using them in the intrusion detection field. Section 4 details the implementation of the system. Section 5 presents the benchmark KDD99 dataset deployed for training and testing, evaluates the performance of the system on the benchmark dataset and discusses the results. Finally, the conclusions are drawn in Section 6.

2. U2R attacks within KDD Dataset – Issues and Proposed Solutions

Learning algorithms have a training phase in which they mathematically “learn” the patterns in the input dataset. The input dataset, also called the training set, should contain sufficient and representative instances of the patterns being discovered. A dataset instance is composed of features, which describe the instance. Learned patterns can be used to make predictions on a new dataset instance based on its deviation from normal patterns, its similarity to known attack patterns, or a combination of both.

In order to promote the comparison of advanced research in the area of intrusion detection, the Lincoln Laboratory at MIT, under DARPA sponsorship, conducted the 1998 and 1999 evaluation of intrusion detection [7]. Based on binary TCP dump data provided by DARPA evaluation, millions of connection statistics are collected and generated to form the training and test data in the Classifier Learning Contest organized in conjunction with the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1999 (KDD-99) [2]. The learning task was to build a detector (i.e. a classifier) capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” or normal connections.

The dataset contains 5,000,000 network connection records. A connection is a sequence of TCP packets starting and ending at some well-defined moments of time, between which data flows from a source IP address to a target IP address under some well-defined protocol [8]. The training portion of the dataset (labelled “kdd_10_percent”) contains 494,021 connections, of which about 20% (97,277) are normal. The testing dataset (labelled “corrected”) has a significantly different statistical distribution than the training dataset and contains 14 additional attacks. It contains 311,029 connections, of which 60,593 are normal.

The KDD dataset is composed of records. Each record consists of 41 features [2], of which 38 are numeric and 3 are symbolic, defined to characterize individual TCP sessions. Features for different connections were formulated using data mining techniques and domain knowledge on TCP packets [9]. Each record is also labelled, i.e. it states whether the record represents an attack or a normal connection, which makes supervised learning possible.
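As a minimal sketch of the record layout described above, the following Python snippet splits one comma-separated record into its 38 numeric features, its 3 symbolic features and its label. It assumes the column order of the public kddcup.names file, in which the symbolic features are protocol_type, service and flag at positions 1 to 3, and it assumes that labels carry a trailing period; neither detail is stated in this paper. The function name `parse_record` and the example record are hypothetical.

```python
# Column positions of the three symbolic features, assuming the column order
# of the public kddcup.names file (not stated in the paper itself).
SYMBOLIC_COLUMNS = {1: "protocol_type", 2: "service", 3: "flag"}
N_FEATURES = 41


def parse_record(line):
    """Split one comma-separated KDD record into numeric features,
    symbolic features and the class label."""
    # Assumption: the label ends with a period (e.g. "normal.").
    fields = line.strip().rstrip(".").split(",")
    if len(fields) != N_FEATURES + 1:  # 41 features plus the label
        raise ValueError(f"expected {N_FEATURES + 1} fields, got {len(fields)}")
    *values, label = fields
    symbolic = {name: values[i] for i, name in SYMBOLIC_COLUMNS.items()}
    numeric = [float(v) for i, v in enumerate(values) if i not in SYMBOLIC_COLUMNS]
    return numeric, symbolic, label


# Hypothetical record: every numeric field is zero except src_bytes (column 4).
example = ["0"] * N_FEATURES
example[1:4] = ["tcp", "http", "SF"]
example[4] = "181"
numeric, symbolic, label = parse_record(",".join(example) + ",normal.")
print(len(numeric), symbolic["service"], label)
```

The label returned here is what makes supervised training possible: a rule can be scored by comparing its decision with the label of every training record.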

A User-to-Root (U2R) attack is a process where any normal system user illegally gains access to the super-user privileges. Generally, a system defect or bug is exploited to execute a successful privilege transition from user level to root level. Buffer overflows are the most common type of attack mechanisms in this category [2].

The KDD training dataset has 52 U2R records, while the testing dataset has 228 of them. Furthermore, four new U2R attacks are present only in the KDD testing data subset, and records for these new attacks constitute around 80% of all U2R records in the testing data subset. Because so many attacks and their records are present only in the testing data, misuse detection is quite difficult to perform. In addition, it is demonstrated in [10] that the transformation model used for transforming DARPA’s raw network data into the KDD feature records is ‘poor’. Here ‘poor’ refers to the fact that some attribute values are the same in different data items that have different class labels. As a result, some of the U2R attacks in the testing dataset are very similar to normal connections, so misclassification between these two groups is very likely to occur.

As the first step in our work, in order to cope with the speed problem mentioned above, we extract the most relevant features of the data using the results of our previous work [11], where we deployed Principal Component Analysis (PCA), and the results obtained in [12] with Multi Expression Programming (MEP). In this way the total amount of data to be processed is greatly reduced. An important benefit is fast training of the system, which allows the rule set to be refreshed at a high rate.

Subsequently, these features are used to form rules for detecting various types of intrusions. This permits a higher level of generality and thus higher detection rates. The problem that arises from discarding features is a certain increase in the false-positive rate. Our experiments confirmed that most of the connections incorrectly reported as U2R attacks were either normal connections or denial-of-service (DoS) attacks. In order to mitigate the increase in the false-positive rate, we implement two more rule sets. One is deployed for detecting DoS attacks and the other for detecting normal connections. These rules are composed of the features identified as the most important ones for detecting DoS attacks and normal connections, respectively. By deploying these rules to confirm or negate the decisions of the rules for detecting U2R attacks, we manage to greatly reduce the false-positive rate while maintaining a high detection rate.

3. Genetic Algorithm Approach

Genetic algorithms (GAs) are search algorithms based on the principles of natural selection and genetics [13]. They have been deployed to solve a wide range of problems in computer science, engineering, economics, mathematics and many other fields. The most important idea behind the initial creation of GAs is the aim of developing a system as robust and as adaptable to its environment as natural systems are.

3.1. Genetic Algorithm Overview

A GA evolves a population of initial individuals into a population of high-quality individuals, where each individual represents a solution to the problem to be solved. Each individual is called a chromosome and is composed of a predetermined number of genes. The quality of each individual (here, each rule) is measured by a fitness function, a quantitative representation of the individual’s adaptation to a certain environment. The algorithm flow is presented in Figure 1.

The procedure starts from an initial population of randomly generated individuals. The population is then evolved for a number of generations, gradually improving the quality of the individuals in the sense of increasing their fitness value. During each generation, three basic genetic operators are sequentially applied to each individual with certain probabilities: selection, crossover and mutation. Crossover consists of exchanging genes between two chromosomes in a certain way, while mutation consists of randomly changing the value of a randomly chosen gene of a chromosome. Both crossover and mutation are performed with a certain probability, called the crossover rate and the mutation rate, respectively.
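The two operators can be illustrated with a short Python sketch. The text only says that genes are exchanged "in a certain way", so the one-point crossover used here is an assumption, as are the function names and the `gene_values` argument (a list giving the admissible values for each gene position). Chromosomes are assumed to have at least two genes.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Exchange the gene tails of two equal-length chromosomes."""
    point = rng.randrange(1, len(parent_a))  # cut strictly inside the chromosome
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def mutate(chromosome, gene_values, rng):
    """Return a copy with one randomly chosen gene set to a random admissible value."""
    child = list(chromosome)
    pos = rng.randrange(len(child))
    child[pos] = rng.choice(gene_values[pos])
    return child


def breed(parent_a, parent_b, gene_values, rng, crossover_rate, mutation_rate):
    """Apply crossover and mutation, each with its own probability."""
    child_a, child_b = list(parent_a), list(parent_b)
    if rng.random() < crossover_rate:
        child_a, child_b = one_point_crossover(child_a, child_b, rng)
    if rng.random() < mutation_rate:
        child_a = mutate(child_a, gene_values, rng)
    if rng.random() < mutation_rate:
        child_b = mutate(child_b, gene_values, rng)
    return child_a, child_b
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing different crossover and mutation rates.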

The following factors have a crucial impact on the efficiency of the algorithm: the selection of the fitness function, the representation of individuals and the values of the GA parameters (crossover and mutation rate, population size, number of generations). Their choice usually depends on the application.

Figure 1. Genetic algorithm flow

3.2. The Advantages of Deploying GA to Intrusion Detection

Deployment of GAs in the intrusion detection field offers a number of benefits, namely:

• GAs are intrinsically parallel: since they have multiple offspring, they can explore the solution space in multiple directions at once. If one path turns out to be a dead end, they can easily eliminate it and continue working on more promising avenues, giving them a greater chance of finding the optimal solution on each run.

• Due to the parallelism that allows them to implicitly evaluate many schemas at once, GAs are particularly well suited to problems where the space of all potential solutions is truly huge, too vast to search exhaustively in any reasonable amount of time, as is the case with network data.

• A system based on a GA can easily be re-trained, which makes it possible to evolve new rules for intrusion detection. This property provides the adaptability of a GA-based system, an imperative quality of an intrusion detection system given the high rate at which new attacks emerge.

Figure 2. Block diagram of the implemented system

4. System Implementation

The implemented system for intrusion detection consists of a combination of three different parts. The block diagram of the system is presented in Figure 2. The first block is a rule-based system for detecting U2R attacks. It is followed by two rule-based systems that re-check its decision. One of them contains the rules for detecting DoS attacks and the other the rules for detecting normal connections. A connection is declared a U2R attack only after both of these systems have confirmed that it is neither a normal connection nor a DoS attack. In this way we have managed to greatly reduce the false-positive rate of the first block of the system while maintaining its high detection rate.

All the detection systems presented in Figure 2 are rule-based systems, in which simple if-then rules for distinguishing different attacks and normal connections are evolved. To this end, the most important features of normal connections and of specific types of attacks are identified using Multi Expression Programming [12] and Principal Component Analysis [11]. The features and their explanations are listed in Table 1, Table 2 and Table 3.

Table 1. The features used to describe U2R attacks

| Name of the feature | Explanation |
| --- | --- |
| root_shell | 1 if root shell is obtained; 0 otherwise |
| dst_host_srv_count | count of connections having the same destination host and using the same service |

Some examples of rules are the following:

if (root_shell="1" and dst_host_srv_count>0 and dst_host_srv_count<3) then U2R;

if (service="http" and hot="0" and logged_in="0") then normal;

if (duration="0" and src_bytes="1032" and dst_host_srv_serror_rate="0") then DoS;

Thus, the process of detection is the following:

1. The U2R rules check whether a connection is a U2R attack.

2. If so, the DoS rules and the rules for detecting normal connections re-check the connection.

3. If neither of them has detected it as a DoS attack or a normal connection, the connection is reported to be a U2R attack.
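The three steps above can be sketched in Python, using the example rules quoted in the text as stand-ins for the evolved rule sets. The dictionary representation of a connection, the function names, and the convention that a rule set "detects" a connection when any one of its rules matches are assumptions made for illustration.

```python
def u2r_rule(c):
    # if (root_shell = 1 and 0 < dst_host_srv_count < 3) then U2R
    return c["root_shell"] == 1 and 0 < c["dst_host_srv_count"] < 3


def normal_rule(c):
    # if (service = http and hot = 0 and logged_in = 0) then normal
    return c["service"] == "http" and c["hot"] == 0 and c["logged_in"] == 0


def dos_rule(c):
    # if (duration = 0 and src_bytes = 1032 and dst_host_srv_serror_rate = 0) then DoS
    return (c["duration"] == 0 and c["src_bytes"] == 1032
            and c["dst_host_srv_serror_rate"] == 0)


def is_u2r(conn, u2r_rules, dos_rules, normal_rules):
    """Report U2R only if a U2R rule fires and no DoS or normal rule vetoes it."""
    if not any(rule(conn) for rule in u2r_rules):   # step 1
        return False
    if any(rule(conn) for rule in dos_rules):       # step 2: DoS re-check
        return False
    if any(rule(conn) for rule in normal_rules):    # step 2: normal re-check
        return False
    return True                                     # step 3


# Hypothetical connection that matches the U2R rule and neither of the others.
suspicious = {"root_shell": 1, "dst_host_srv_count": 2, "service": "telnet",
              "hot": 0, "logged_in": 1, "duration": 5, "src_bytes": 100,
              "dst_host_srv_serror_rate": 0}
print(is_u2r(suspicious, [u2r_rule], [dos_rule], [normal_rule]))
```

The re-check stages can only veto a U2R decision, never create one, so they can lower the false-positive rate but cannot raise the detection rate of the U2R rules.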

Each rule set was trained with a separate steady-state GA. An individual (a chromosome) of each population consisted of genes, where each gene represented a certain feature and the value of the gene represented the value of that feature.

Table 2. The features used to describe DoS attacks

| Name of the feature | Explanation |
| --- | --- |
| duration | length (number of seconds) of the connection |
| src_bytes | number of data bytes from source to destination |
| dst_host_srv_serror_rate | percentage of connections that have “SYN” errors |

The GAs for training the different rule systems differ only in the number of individuals in each population. The GAs for training the rules for detecting U2R attacks and normal connections each contain 500 individuals, while the GA for training the rules for detecting DoS attacks contains 1000 individuals due to the greater number and diversity of DoS attacks in both the training and the testing dataset. Each GA is run for 300 generations, and in each generation the 100 worst-performing individuals are replaced with newly generated ones. New individuals are created from the existing ones in the process of breeding, using the crossover and mutation operators. Crossover and mutation are performed with probabilities of 0.9 and 0.1, respectively.
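A steady-state GA with these parameters can be sketched as follows. This is a plain Python illustration, not the GAlib-based C++ implementation used in the paper. The function name, the `gene_values` argument (admissible values per gene), the uniform choice of parents among the survivors and the one-point crossover are all assumptions; the defaults mirror the values quoted above for the U2R and normal rule sets.

```python
import random


def steady_state_ga(fitness, gene_values, pop_size=500, generations=300,
                    n_replace=100, crossover_rate=0.9, mutation_rate=0.1,
                    seed=0):
    """Evolve a population, replacing the n_replace worst individuals per generation.

    Returns the final population sorted from best to worst.
    """
    rng = random.Random(seed)
    n_genes = len(gene_values)
    population = [[rng.choice(values) for values in gene_values]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)       # best individuals first
        survivors = population[:pop_size - n_replace]    # drop the worst ones
        offspring = []
        while len(offspring) < n_replace:
            # Assumption: parents are drawn uniformly from the survivors.
            parent_a, parent_b = rng.sample(survivors, 2)
            child = list(parent_a)
            if n_genes > 1 and rng.random() < crossover_rate:
                point = rng.randrange(1, n_genes)        # one-point crossover
                child = parent_a[:point] + parent_b[point:]
            if rng.random() < mutation_rate:
                pos = rng.randrange(n_genes)
                child[pos] = rng.choice(gene_values[pos])
            offspring.append(child)
        population = survivors + offspring
    return sorted(population, key=fitness, reverse=True)
```

Because the best individuals always survive into the next generation, the best fitness in the population never decreases. Taking the first 50, 75 or 100 entries of the returned list corresponds to keeping that many best-performing rules.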

The result of the training process is a certain number of best-performing rules. We have performed experiments using the 50, 75 and 100 best-performing rules for detecting U2R attacks. In all of the experiments, the 100 best-performing individuals for detecting DoS attacks and normal connections are deployed. All the numbers specified in this section were chosen on the basis of a large number of experiments.

Table 3. The features used to describe normal connections

| Name of the feature | Explanation |
| --- | --- |
| service | destination service (e.g. telnet, ftp) |
| hot | number of hot indicators |
| logged_in | 1 if successfully logged in; 0 otherwise |

The performance measure (the fitness function) in all of these cases was the following:

$$\mathit{fitness} = 2 + 3\,\frac{\alpha}{A} - 2\,\frac{\beta}{B}$$

where α is the number of correctly classified connections (either attacks or normal connections, depending on the rule set); A is the total number of attacks (when rules are trained for detecting attacks) or the total number of normal connections (when rules are trained for detecting normal connections) in the training dataset; β is the number of incorrectly classified connections, i.e. false positives; and B is the total number of normal connections (when rules are trained for detecting attacks) or the total number of attacks (when rules are trained for detecting normal connections) in the training dataset. A high detection rate (α/A) and a low false-positive rate (β/B) result in a high fitness value. Conversely, a low detection rate and a high false-positive rate result in a low fitness value. Thus this fitness function rewards a high detection rate together with a low false-positive rate. A similar equation is deployed in [6]. We have modified it by giving different weights to the detection rate and the false-positive rate. The greater coefficient on the detection rate favours detection over false positives, in pursuit of higher detection rates. The coefficients were chosen after a number of experiments in which they assumed different values.
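As a small worked example, the fitness can be computed in Python. The sketch reads the formula as fitness = 2 + 3·α/A − 2·β/B, i.e. a weight of 3 on the detection rate, a weight of 2 on the false-positive rate and a constant offset of 2; this reading is an assumption based on the description of the coefficients in the text. The function and parameter names are hypothetical, and using the 52 U2R records and 97,277 normal connections of the training set as A and B is likewise only an illustration.

```python
def rule_fitness(alpha, total_target, beta, total_other):
    """Fitness of one rule, read as 2 + 3 * (alpha / A) - 2 * (beta / B).

    alpha        -- connections of the target class the rule classifies correctly
    total_target -- A, all connections of the target class in the training set
    beta         -- connections of the other class the rule wrongly matches
    total_other  -- B, all connections of the other class in the training set
    """
    return 2 + 3 * alpha / total_target - 2 * beta / total_other


# Illustrative counts only: 52 U2R records and 97,277 normal connections.
print(rule_fitness(52, 52, 0, 97277))      # every attack caught, no false positives
print(rule_fitness(26, 52, 0, 97277))      # half of the attacks caught
print(rule_fitness(0, 52, 97277, 97277))   # nothing caught, everything misflagged
```

Under this reading the fitness ranges from 0 (worst case) to 5 (perfect rule), so it never becomes negative, and one percentage point of detection rate is worth 1.5 times as much as one percentage point of false-positive rate.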

The algorithm is performed as presented in Section 2. The system presented here is implemented in the C++ programming language. The software uses the GAlib genetic algorithm package, written by Matthew Wall at the Massachusetts Institute of Technology [14].

5. Results

The system was trained using “kdd_10_percent” and tested on the “corrected” dataset. The results are summarized in Table 4. As reported in [3], the highest detection rate of U2R attacks achieved by a machine learning technique applied to the entire KDD dataset (rather than to a randomly chosen subset) within the misuse detection context is 29.4%, with a false-positive rate of 0.4%. The model deployed was a multi-classifier. Thus, our system outperforms the best-performing model reported in the literature. Moreover, our previous statement that deploying additional rule systems for detecting DoS attacks and normal connections reduces the false-positive rate is confirmed, as the false-positive rate has decreased in each of the cases.

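Table 4. Performance of the implemented system

| Number of rules | Detection rate (%), total system | Detection rate (%), U2R rule system | False-positive rate (%), total system | False-positive rate (%), U2R rule system |
| --- | --- | --- | --- | --- |
| 50 | 50 | 46.3 | 0.0055 | 0.007 |
| 75 | 77.8 | 77.8 | 7.2 | 10.2 |
| 100 | 100 | 100 | 16.54 | 27.4 |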

The final results surpass the best one presented in [3]. Thus, as far as we know, our results outperform the best result reported in the misuse detection context. On the other hand, techniques within the anomaly detection context achieve higher detection rates of U2R attacks. The best result within the anomaly detection context known to us is a detection rate of 96.93% with a false-positive rate of 2.19% [10]. However, anomaly detection systems are not capable of identifying the exact type of an intrusion, i.e. they detect that an intrusion has occurred, but without specifying its type.

Unfortunately, the set of 100 rules for detecting U2R attacks (Table 4), which exhibits the highest detection rate, also exhibits a very high false-positive rate, which constrains its usage. Still, most of the connections falsely reported as U2R attacks belong to the DoS category of attacks. This problem could be overcome by implementing a rule set for detecting DoS attacks that performs better than the one deployed here.

6. Conclusions

In this work a combination of GA-based IDSes for detecting different types of attacks is introduced. The proposed combination is shown to be very favorable for mitigating the negative aspects of the system for detecting U2R attacks. As our system uses only eight features to describe the data, its training time is considerably reduced, which allows a higher refresh rate of the rule set.

The effectiveness of the resulting system is illustrated by experimental results. High U2R attack detection and low false-positive rate demonstrate advantages of applying this system to intrusion detection.

This approach could be adopted for detecting other groups of attacks in the KDD dataset. It would be especially interesting to apply this model to the detection of R2L attacks, as they are also reported to exhibit a low detection rate. The adaptation would consist of identifying the most important features of the dataset that participate in R2L attacks and then determining to which groups of attacks the false positives belong. From this point on, the process follows the steps indicated in Section 4.

In conclusion, we have demonstrated that the detection rate of U2R attacks within the misuse detection context can be increased by applying a heuristic design rather than a single machine-learning technique.

7. Acknowledgements

This work has been partially funded by the Spanish Ministry of Education and Science under project TEC2006-13067-C03-03 and by the European Commission under the FastMatch project FP6 IST 27095.

8. References

[1] C. Elkan, “Results of the KDD’99 Classifier Learning”, ACM SIGKDD Explorations, 1, 2000, 63-64.

[2] KDD Cup 1999 data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

[3] M. Sabhnani and G. Serpen, “Application of Machine Learning Algorithms to KDD Intrusion Detection Dataset within Misuse Detection Context”. In Proc. of International Conference on Machine Learning: Models, Technologies and Applications. June 2003.

[4] P. Laskov, P. Düssel, C. Schäfer, K. Rieck, “Learning Intrusion Detection: Supervised or Unsupervised?”, CIAP: international conference on image analysis and processing No13, Cagliari, Italy, vol. 3617, pp. 50-57, September 2005.

[5] R. H. Gong, M. Zulkernine, P. Abolmaesumi, “A Software Implementation of a Genetic Algorithm Based Approach to Network Intrusion Detection”, Proceedings of the SNPD/SAWN'05, 2005.

[6] A. Chittur, “Model Generation for an Intrusion Detection System Using Genetic Algorithms”, http://www1.cs.columbia.edu/ids/publications/gaids-thesis01.pdf, accessed in 2006.

[7] R.P. Lippmann, D.J. Fried, I. Graf, J.W. Haines, K.R. Kendall, D. McClung, D. Weber, S.E. Webster, D. Wyschogrod, R.K. Cunningham, M.A. Zissman, “Evaluating Intrusion Detection Systems: the 1998 DARPA Off-Line Intrusion Detection Evaluation”, Proceedings of the 2000 DARPA Information Survivability Conference and Exposition, 2 (2000).

[8] S. Hettich, S. D. Bay, The UCI KDD Archive. Irvine, CA: University of California, Department of Information and Computer Science, 1999. http://kdd.ics.uci.edu.

[9] W. Lee, S.J. Stolfo and K.W. Mok, “Mining in a Data-Flow Environment: Experience in Network Intrusion Detection”, in Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, 1999, pp. 114-124.

[10] Y. Bouzida and F. Cuppens, “Detecting known and novel network intrusion”, IFIP/SEC 2006 21st IFIP TC-11 International Information Security Conference, Karlstad University, Sweden. May 2006.

[11] Z. Banković, D. Stepanović, S. Bojanić and O. Nieto-Taladriz, “Improving Network Security Using Genetic Algorithm Approach”, to be published in Computers & Electrical Engineering, Elsevier.

[12] C. Grosan, A. Abraham, and M. Chis, “Computational Intelligence for light weight intrusion detection systems”, International Conference on Applied Computing (IADIS'06), San Sebastian, Spain, 2006, pp. 538-542.

[13] D. E. Goldberg, Genetic algorithms for search, optimization, and machine learning. Addison-Wesley (1989).

[14] GAlib: A C++ Library of Genetic Algorithm Components, http://lancet.mit.edu/ga/
