Genetic Programming in Classifying Large -Scale Data: An … · 2015-12-11 · experimental study...
Genetic Programming in Classifying Large-Scale Data: An Ensemble Method
Yifeng Zhang
Siddhartha Bhattacharyya1
Information and Decision Sciences College of Business Administration
University of Illinois at Chicago 601 S. Morgan Street (MC 294)
Chicago, IL 60607
[email protected], [email protected]
Abstract
This study demonstrates the potential of genetic programming (GP) as a base classifier algorithm in building ensembles in the context of large-scale data classification. An ensemble built upon base classifiers that were trained with GP was found to significantly outperform its counterparts built upon base classifiers that were trained with decision trees and logistic regression. The superiority of GP ensembles is attributed to the higher diversity among the base classifiers, both in the functional form of the models and in the variables that define them.
Keywords: Genetic Programming, Ensemble, Classification, Large Scale Data
Please do not quote without permission.
1 Author for correspondence.
1. INTRODUCTION
The classification problem of assigning observations into one of several groups
plays a key role in decision-making. The binary classification problem, where the
data are restricted to one of two groups, has wide applicability in problems
ranging from credit scoring, default prediction and direct marketing to
applications in biology and medical domains. It has been studied extensively by
statisticians, and in more recent years a number of machine learning and data
mining approaches have been proposed. Amongst the latter group of techniques
are those that can broadly be categorized as “soft computing” methods, noted to
trade off optimality and precision for advantages of representational power and
applicability under wider conditions. The form of the solution varies with the
technique employed, and is expressed in some form of classification rule or
function based on the multivariate data defining each observation.
Techniques from the statistics realm beginning with Fisher’s seminal work
include linear, quadratic and logistic discriminant models, and are amongst the
most commonly used [1]. Logistic regression remains, even today, amongst the
most widely used data mining methods in industry. The major drawback of
these methods arises from the fact that real-world data often do not satisfy their
underlying assumptions. Nonparametric methods are less restrictive, and popular
amongst them are the k-nearest neighbor and various mathematical programming
methods [2] [3]. Bradley et al. provide a comprehensive review of various
mathematical programming approaches for data mining [4]. Various inductive
learning approaches have been developed for classification, including decision
trees, neural networks, and support vector machines. More recently, evolutionary
computation techniques like genetic algorithms have been shown to be attractive for
classification and data mining.
Genetic algorithms (GAs), modeled loosely on principles of natural evolution,
provide a powerful general-purpose search approach, and have been noted to offer
advantages for data mining. These advantages stem from the representational
flexibility allowed on the form that a classification function may take, and from
the largely open formulation of the search objective (fitness function).
Classification functions studied range from a set of data-attribute weights as in
traditional regression models [5] [6], condition-action type rules with
conjunction/disjunction of terms [7] [8], to the parse-tree expressions of genetic
programming [9]. The flexibility in fitness function formulation allows the
development of classification models that are tailored to specific domain
constraints and objectives. Bhattacharyya [10] [6], for example, shows how GA-
based models developed with the fitness function explicitly seeking the direct
marketing objective of maximizing lifts at specific mailing-depths, can yield
significant improvements over the traditional classification objective of
minimizing errors. Various recent studies report on the application of
evolutionary search in data mining [11] [12] [13].
Given the massive amounts of data accumulated by organizations today, a
key challenge facing the data-mining field is the development of methods for
large-scale data mining. A focus on problems arising from large data is, in fact, a
defining characteristic of data mining (being defined as the process of gleaning
useful and actionable knowledge from large data [14]). Most traditional
classification methods, however, require the simultaneous use of all the training
data, thereby posing a severe challenge in their direct application to large data
problems – it is, after all, desirable that all available data be used in developing
the classifier, even when all the data cannot be accommodated in memory.
Further, many learning algorithms are of an iterative nature, necessitating several
passes through the data – attempting to develop models on the entire data may
then take too long. Further, it may not always be possible to have all the data in the
same place at the same time due to, for example, security or privacy reasons, or
because the data may be present in naturally distributed settings. In short,
traditional classification methods are not readily amenable to handling large-scale
distributed data.
In recent years, a number of approaches have been suggested for overcoming
such large-data limitations. While some seek to incrementally develop a model
from subsets of the data [15], others obtain classification from a team or ensemble
of classifiers. Various studies have shown that ensembling is a useful method
for addressing the problem of large-scale and potentially distributed data [16] [17]
[18] [19]. Ideally, different base classifiers in an ensemble capture different
patterns or aspects of a pattern embedded in the whole data. Through
ensembling, these different patterns or aspects are then incorporated into a final
prediction. A number of studies have examined techniques like bagging and
boosting for combining predictions from multiple classifiers, and these have been
shown to improve the performance of ensembles over individual classifiers.
Comparative evaluations of different variants of these are given in [20] and [21].
Note that data may be ‘large’ not only from the perspective of number of
observations, but also in carrying a large number of attributes. Various
dimensionality reduction and feature selection methods are applicable for this.
The focus of this paper is on the classification of data having a large number of
observations, and the term large-scale is used only in this context.
This paper proposes the use of genetic programming (GP) as a technique for
developing base classifiers for an ensemble. A key advantage of GP arises from
the ability of its parse-tree representation to model almost arbitrary non-linear
forms. A GP model can thus seek to capture various non-linear relationships in
the data. The genetic search procedure is also ideally suited to explore the large,
and, given the usually scant prior knowledge on data relationships, often poorly
understood search spaces that typify real-life data-mining problems. Different
models developed on data subsets can thus incorporate a diversity of data-patterns,
which can then be combined as an ensemble for final classification. The utility of
GP as an ensembling approach is examined using a real-life large dataset of about
five million observations from the KDD-Cup 1999 competition2. The
experimental study compares the performance of GP-ensembling to ensembles
based on decision trees as well as logistic regression. Results demonstrate the
advantages of the proposed GP based ensembling scheme. As expected, the GP
models are found to capture diverse patterns in the data, while the developed
decision tree models look similar. The ensemble of GP models is seen to yield
significantly superior performance over the decision tree ensemble. Our findings
also show that GP is able to discern data patterns using a small subset of variables
in the data, indicating its utility for feature selection and exploratory data analysis.
2 The data is available at http://www.kdnuggets.com/datasets/.
The paper is organized as follows: the next section provides a brief
background on genetic search and the non-linear representations of genetic
programming, and presents some relevant literature on ensembling approaches for
data mining. Section 3 details the data used for the experimental study, and
presents the experiments and results. Concluding remarks and future research
directions are given in the last section.
2. BACKGROUND
2.1 Ensembles for Large Scale Classification
The notion of aggregating information from multiple models has been
addressed in the statistics and decision-making literature (see, for example, [22]
[23] [24]), mostly in the context of combining forecasts from multiple models
based on different methods, but using the same data. For data mining on large-
scale data, the focus is on using smaller data subsets so as to ease the
computational burden. The essence of ensembling is to take small subsets from a
large data set to build a number of base classifiers, and then combine these base
classifiers into an ensemble. Since base classifiers are built using a small part of
the whole data, learning requires less memory, as compared to traditional
approaches of considering all the data together, and also proceeds faster in terms
of computational speed. Besides, it can readily be implemented in a parallel
fashion.
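The subset-then-train step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy data, and the stand-in learner `train_majority_stub` are hypothetical and are not the learners used in the study.

```python
import random

def draw_subsets(data, n_subsets, subset_size, seed=0):
    """Draw n_subsets random samples (with replacement) from data."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in range(subset_size)]
            for _ in range(n_subsets)]

def train_majority_stub(subset):
    """Hypothetical stand-in learner: always predicts the subset's most common label."""
    labels = [label for _, label in subset]
    most_common = max(set(labels), key=labels.count)
    return lambda x: most_common

def build_base_classifiers(data, n_subsets, subset_size, train, seed=0):
    """Train one base classifier per subset; each call sees only its own subset."""
    return [train(s) for s in draw_subsets(data, n_subsets, subset_size, seed)]

# Toy data: (attribute vector, label) pairs, invented for illustration.
toy_data = [((i,), "attack" if i % 5 else "normal") for i in range(1000)]
base_classifiers = build_base_classifiers(toy_data, n_subsets=10, subset_size=50,
                                          train=train_majority_stub)
```

Because each call to `train` touches only one subset, the list comprehension in `build_base_classifiers` could be replaced by a parallel map without changing the result.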
Bagging [25] and boosting [26] are popular methods for combining the
predictions from multiple base classifiers in creating ensembles [27]. In bagging,
different training sets for learning the base classifiers are obtained by resampling
the training data with replacement, and the classifiers are combined using a voting
scheme. While a simple average of the predictions of individual classifiers has
been found to yield a good ensemble, more complex voting mechanisms have also
been proposed; while these can show improved performance, they can also result
in an overfitted ensemble [28]. Boosting is based on weighted resampling where
classifiers are obtained as a series, with the later classifiers being trained on data
containing more of the observations that were incorrectly classified by the earlier
classifiers. Bagging has been found to outperform individual classifiers in several
studies, while boosting, though showing improved performance over bagging in
certain datasets, is noted to be susceptible to noise in the data [21]. It is noted in
[25] that bagging is effective with learning algorithms that are “unstable” in the
sense that small changes in the data can produce large differences in classifier
performance, and mentions decision tree and neural network learning procedures
as examples of such algorithms.
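To make the idea of weighted resampling concrete, the following sketch shows one common reweighting rule, the AdaBoost-style update. The text above does not say which boosting variant is meant, so this particular rule is an assumption chosen for illustration.

```python
import math

def reweight(weights, misclassified):
    """One AdaBoost-style update: up-weight misclassified observations, then renormalize.

    weights: current observation weights (summing to 1).
    misclassified: booleans, True where the current classifier was wrong.
    """
    err = sum(w for w, m in zip(weights, misclassified) if m)
    if not 0 < err < 0.5:
        raise ValueError("weighted error must lie in (0, 0.5) for this update")
    alpha = 0.5 * math.log((1 - err) / err)
    updated = [w * math.exp(alpha if m else -alpha)
               for w, m in zip(weights, misclassified)]
    total = sum(updated)
    return [w / total for w in updated]

weights = [0.25, 0.25, 0.25, 0.25]
new_weights = reweight(weights, [True, False, False, False])
```

After the update, the single misclassified observation carries half of the total weight, so a resample drawn with these weights (for example with `random.choices(data, weights=new_weights, k=n)`) contains more of the hard cases.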
Although the speedup advantage of ensembling is obvious and easy to
obtain, keeping classification accuracy from decreasing significantly,
compared to that of a single model trained on the whole data, is not an
easy task. However, [19] and [16] showed that reaching such a goal is possible.
[19] examined the number of subsets that an original dataset can be divided into
before significant performance decrease occurs. They found that, although the
number varies from dataset to dataset, most of the 15 datasets they studied could
be divided into four or more subsets with less than 2% decrease in classification
accuracy. Given that the 15 datasets studied by [19] are relatively small (all less
than 43,500 examples), one would expect that much larger datasets (millions of
examples or more) can be divided into many more subsets, with each still not being
too small. [16] also studied the impact of number of subsets on an ensemble’s
performance. However, instead of dividing the original dataset, [16] took subsets
from it with replacement, which makes it possible to take an arbitrarily large number of
subsets. [16] showed that an ensemble’s performance increases with number of
subsets. The increase is fast in the beginning and after a certain point reaches an
asymptotic value close to the performance of a single model built on the entire
data. In the five datasets that [16] studied, the ensemble’s performance reaches
the asymptotic value within 100 subsets. [16] also showed that the bigger the
subset (i.e. 800 examples vs. 100 examples), the faster the performance of the
ensemble approaches the asymptotic value.
Besides number and size of subsets, studies have shown that the
combining of predictions from an ensemble tends to be effective when the
individual predictors have errors that are independent of each other [29] [30] [31].
Diversity amongst the individual classifiers is thus beneficial to an ensemble’s
performance. In [30], diversity among base classifiers (neural network models)
was increased by constructing each model on a different data subset. In [31],
diversity among base classifiers (neural network models) was increased by using
a different subset of attributes in constructing each model. Thus each model only
captures patterns embedded in the subset of attributes on which it was built. [31]
calls this method “ensemble feature selection”. The notion of diversity is also
intuitive, considering that combining a number of very similar models is not
likely to help much in performance. [29], lending theoretical support for this
notion, demonstrated that integrating independent inputs produces better results
than integrating dependent inputs.
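The benefit of independent errors can be illustrated with a short calculation. The sketch below assumes an idealized setting that is not taken from the cited studies: n base classifiers, each correct with the same probability p, with errors fully independent, combined by majority vote.

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Probability that more than half of n independent classifiers, each correct
    with probability p, are correct on a given example (n assumed odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

independent = majority_vote_accuracy(11, 0.7)   # errors independent across classifiers
identical = 0.7                                  # 11 copies of one classifier vote as one
```

With three such classifiers the vote is already correct with probability 0.784, whereas an ensemble of identical classifiers stays at 0.7 no matter how many copies vote. Real base classifiers fall somewhere between these two extremes.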
Other aspects of ensembling have also been studied. [17] examined
further mechanisms for combining base classifiers. They proposed two “meta-
learning” strategies that learn how to combine base classifiers, just as base
classifiers are learned from raw data. Both the strategies, called arbitration and
combining, use the base classifiers’ predictions as input for “meta-learning”, with
the arbitration strategy also taking the prediction of an arbiter as input.
Ensembling has also been studied on a special type of large-scale data: streaming
data, where new data continuously flows into a dataset at a high speed. The
strategy proposed by [18] is to train a new base classifier on a new chunk of data
of a certain size and evaluate the new classifier against all those in a committee of
classifiers. If the new classifier performs better than some classifier in the
committee, one classifier in the committee is replaced by the new classifier. In
this way, the classifier committee is kept updated as new data arrives. A similar
strategy was proposed by [32] to update a decision tree.
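A minimal sketch of such a committee update is given below. The description above does not state which member is replaced, so replacing the weakest one is an assumption, as are the function and variable names.

```python
def update_committee(committee, candidate, evaluate):
    """Replace the weakest committee member if the candidate scores higher.

    committee: list of classifiers; evaluate: classifier -> score on recent data.
    Returns a new list; the input list is not modified.
    """
    scores = [evaluate(c) for c in committee]
    worst = min(range(len(committee)), key=scores.__getitem__)
    if evaluate(candidate) > scores[worst]:
        return committee[:worst] + [candidate] + committee[worst + 1:]
    return list(committee)

# Toy usage: "classifiers" are just names with made-up scores.
toy_scores = {"a": 0.90, "b": 0.70, "c": 0.85, "new": 0.80}
committee = update_committee(["a", "b", "c"], "new", toy_scores.get)
```

In the toy call, the candidate scores higher than member "b" only, so "b" is the one replaced.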
2.2 Genetic Search and Genetic Programming
The term genetic algorithm describes a class of stochastic search procedures
inspired by principles of natural genetics and survival of the fittest. They operate
through a simulated evolution process on a population of solution structures that
represent candidate solutions in the search space. Evolution occurs through (1) a
selection mechanism that implements a survival of the fittest strategy, and (2)
genetic recombination of the selected solutions to produce offspring for the next
generation. GAs are considered suitable for application to complex search spaces
not easily amenable to traditional techniques, and are noted to provide an
effective tradeoff between exploitation of currently known solutions and a robust
exploration of the entire search space. The selection scheme operationalizes
exploitation and recombination effects exploration. Crossover and mutation are
the two basic recombination operators. Crossover implements a mating scheme
between pairs of “parents” to produce “progeny” that carry characteristics of both
parents. Mutation is a random operator applied to insure against premature
convergence of the population; mutation also maintains the possibility that any
population representative can be ultimately generated. A detailed account of
genetic search may be found in [33].
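The selection, crossover, and mutation loop can be sketched on a toy problem. Everything specific in this example is an assumption made for illustration: the bit-string encoding, the fitness (number of 1-bits), tournament selection, one-point crossover, and keeping the best member unchanged in each generation (elitism).

```python
import random

def run_ga(n_bits=20, pop_size=30, generations=40, p_mut=0.02, seed=1):
    """Maximize the number of 1-bits (a toy fitness) with a basic GA."""
    rng = random.Random(seed)
    fitness = sum

    def select(pop):
        # Tournament selection: the fitter of two random members is chosen.
        return max(rng.sample(pop, 2), key=fitness)

    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = [max(map(fitness, pop))]
    for _ in range(generations):
        next_pop = [max(pop, key=fitness)]            # elitism: keep the current best
        while len(next_pop) < pop_size:
            mum, dad = select(pop), select(pop)
            cut = rng.randrange(1, n_bits)            # one-point crossover
            child = mum[:cut] + dad[cut:]
            # Bit-flip mutation keeps every bit pattern reachable.
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            next_pop.append(child)
        pop = next_pop
        history.append(max(map(fitness, pop)))
    return history

history = run_ga()
```

`history` records the best fitness per generation. Because of the elitism step it never decreases, which makes the exploitation side of the tradeoff easy to see.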
In the context of classification, each population member can specify a
symbolic classification rule as in [7], or can represent a weight vector defining a
traditional regression-like function [5], or general non-linear expressions as in
genetic programming (GP) [9]. In this paper, we consider the non-linear GP
representation for classifiers.
Genetic programming [9] is a GA-variant that uses hierarchical
representations (Figure 1a gives an example). Solutions in GP can be represented
as parse trees where internal nodes define some function (with its subtree
branches as function arguments) and leaf nodes define some problem-related
constants (terminals). Each population member thus represents a function f(x) of
the data attributes (independent variables) that can be depicted as a parse tree, thus
allowing arbitrarily complex classifiers. The function f(x) can be interpreted as a
classifier in the usual way:
f(x) ≤ 0 implies Group 1,
f(x) > 0 implies Group 2.
The functional form of the classifier is determined by (1) a function set F defining
the functions to be used (for example, the arithmetic operators (+, -, *, /), together
with any allowed functions like log(.), exp(.), sin(.), etc., and logical operators (And,
Or)), and (2) a terminal set T defining the variables and values allowed at the
leaf nodes of the tree (in this case, the independent variables in the data, together
with numerical constants).
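A parse tree and the decision rule above can be sketched with nested tuples. The representation, the small function set, and the attribute names `duration` and `count` are assumptions made for this example only.

```python
import math
import operator

# A small assumed subset of the function set F.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "sin": math.sin}

def evaluate(tree, x):
    """Evaluate a parse tree on one observation x (a dict of attribute values).

    Internal nodes are tuples (function_name, child, ...); leaves are either
    attribute names (strings) or numeric constants.
    """
    if isinstance(tree, tuple):
        name, *children = tree
        return FUNCTIONS[name](*(evaluate(c, x) for c in children))
    if isinstance(tree, str):
        return x[tree]
    return tree

def classify(tree, x):
    """Apply the decision rule: f(x) <= 0 -> Group 1, f(x) > 0 -> Group 2."""
    return "Group 1" if evaluate(tree, x) <= 0 else "Group 2"

# f(x) = (duration * 0.5) - count   (attribute names are hypothetical)
tree = ("-", ("*", "duration", 0.5), "count")
```

For instance, `classify(tree, {"duration": 4.0, "count": 1.0})` evaluates f(x) = 1.0 and returns "Group 2".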
Crossover is defined as an operator that exchanges randomly chosen
subtrees of two parents to create two new offspring. This is illustrated in Figure
1b. Mutation randomly changes a subtree. Since the regular GP operators are
inadequate for learning numeric constants [34], a non-uniform mutation is used
here [35].
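Subtree crossover can be sketched on tuple-based trees as follows. The tree encoding and helper names are assumptions for illustration, and the non-uniform mutation of numeric constants mentioned above is not shown.

```python
import random

def subtree_paths(tree, path=()):
    """Yield the path (tuple of child indices) to every node in a tuple-based tree."""
    yield path
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (i,))

def get_subtree(tree, path):
    """Follow a path of child indices down from the root."""
    for i in path:
        tree = tree[i]
    return tree

def replace_subtree(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(a, b, rng):
    """Swap one randomly chosen subtree of a with one of b; returns two offspring."""
    pa = rng.choice(list(subtree_paths(a)))
    pb = rng.choice(list(subtree_paths(b)))
    return (replace_subtree(a, pa, get_subtree(b, pb)),
            replace_subtree(b, pb, get_subtree(a, pa)))

def size(tree):
    """Number of nodes (functions plus terminals) in the tree."""
    return sum(1 for _ in subtree_paths(tree))

parent_a = ("+", ("*", "x1", 2.0), "x2")
parent_b = ("-", "x3", ("sin", "x4"))
child_a, child_b = crossover(parent_a, parent_b, random.Random(0))
```

Since the two subtrees are exchanged rather than copied, the total node count of the offspring equals that of the parents. A subtree mutation could reuse `replace_subtree` with a randomly generated replacement.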
With its ability to model various non-linear terms, GP holds much promise
for data mining. [9] illustrates the use of GP for ‘symbolic regression’ showing
how a GP can learn a non-linear function to fit given data points. The use of
genetic search in data mining problems is, however, relatively new. [5] examined
the use of genetic algorithms for learning linear discriminant functions, and the
comparative study in [36] includes both linear-GA and GP. [37] examined the impact of
different GP settings on classification accuracy; [11] proposed the use of genetic
search for feature selection; [10] [6] applied genetic algorithms to directly
optimize lifts, and [38] examined genetic algorithms and GP for multi-objective
data mining. [39] proposed a method for using GP in the context of a relational
database, where a GP tree and its fitness function (classification accuracy) were
converted into a relational query.
(insert figure 1a here)
(insert figure 1b here)
3. EXPERIMENTS AND RESULTS
3.1 Data and Experiments
The dataset used in this study is from the KDD Cup’99 competition,
which consists of connection data of a simulated U.S. Air Force LAN. The
dataset is large, with the training dataset containing about five million examples and
the test dataset about 300,000 examples. Each example is labeled “normal” or one of
25 attack types. Since multi-class classification is not the focus of this study,
different attack types are not differentiated. That is, each example is only
classified as “normal” or “attack”. Besides the class variable, each example is
described by 41 attributes, which can be classified into three groups: basic
features (e.g., duration of the connection), content features (e.g., number of
failed login attempts within a connection), and traffic features (e.g., number
of connections to the same host as the current connection in the past two seconds).
Most examples in the datasets are of “attack” type, which accounts for 80%
of the training data and 80.52% of the test data.
To construct base classifiers, ten relatively small (0.6% of the training
data/28,000 examples) subsets were randomly taken with replacement from the
training dataset. Subsets of larger size were examined in pilot experiments, but
these did not show advantages in terms of base classifiers’ classification accuracy
over the chosen size.
A post hoc experiment showed that the number of subsets chosen in this
study (10) is also appropriate. Specifically, the relationship between number of
subsets and performance of the GP ensemble on test data was examined. The
results showed that the GP ensemble’s classification accuracy increases
significantly when the number of subsets goes from 1 to 10. However, the
performance improvement is marginal when the number of subsets is more than
10 (see figure 2).
(insert figure 2 here)
Prior to training, two preparation measures were conducted: a) non-
numerical attributes were converted to numeric values, and b) all attributes were
standardized to zero mean and unit standard deviation.
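The two preparation steps can be sketched as follows. How non-numerical attributes were mapped to numbers is not stated, so the integer coding below is an assumption, as is the use of the population standard deviation.

```python
from statistics import fmean, pstdev

def encode_categories(values):
    """Map each distinct non-numeric value to an integer code (one simple assumed scheme)."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [float(codes[v]) for v in values]

def standardize(values):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]      # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]

protocol = encode_categories(["tcp", "udp", "tcp", "icmp"])   # toy attribute values
z = standardize(protocol)
```

In practice the mean and standard deviation would be computed on the training data and then reused for the test data, so that both are on the same scale.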
On each subset, one GP model, together with one decision tree and one
logistic regression model, was trained. The decision tree and logistic regression
models were built for comparison purposes. A uniform cost model was adopted in
all model training. That is, the same unit cost is assigned to both false positives
(“normal” being classified as “attack”) and false negatives (“attack” being
classified as “normal”). The decision tree and logistic models were developed
using the standard implementations in the SAS Enterprise Miner data-mining tool.
Since the software automatically prunes decision trees based on performance on a
validation dataset, a separate (different from the ten subsets used for training the
base classifiers) data subset was used for this purpose.
Running time of the ten individual GP models varies greatly because
complexities of the models are very different. On a PC with a 400 MHz CPU and
about 200 MB of RAM, the running time varies from about 10 to about 40
minutes. Running times of individual decision tree and logistic regression models
are only a couple of minutes on the same PC. Since computation cost nowadays is
very low and will be increasingly lower, the heavier computation burden in
training GP models is considered tolerable.
Regularly used parameter values and GP settings, as found in the literature,
were used in the experimental study. The function-set for GP included four
arithmetic (+, -, *, /), two comparisons (>=, <=), and six mathematical (exp, ln,
log, sin, cos, tan) operators. The result of a comparison operation is 1 or 0, so
comparisons can be combined with arithmetic and mathematical operations without
adopting strongly typed GP. The terminal set included the independent variables in
the data, together with numeric real constants in the [-1, 1] range. Classification
accuracy was set to be the fitness function. Crossover and mutation probabilities
were set to 1 and 0.2, respectively. A population size of 51 was used. All
models were trained for 100 generations; monitoring of the training process
showed that all models converged within 100 generations. The GP settings were
kept unchanged across all ten subsets, except for different random seeds used for
the subsets.
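The settings listed above can be collected in one place, together with a sketch of a 1/0 comparison operator and an accuracy-based fitness. The dictionary keys, function names, and toy examples are hypothetical. The labels 1 and 2 stand for the two groups of the decision rule, since the mapping of "normal" and "attack" to groups is not specified.

```python
GP_SETTINGS = {
    "functions": ["+", "-", "*", "/", ">=", "<=", "exp", "ln", "log", "sin", "cos", "tan"],
    "constant_range": (-1.0, 1.0),
    "crossover_probability": 1.0,
    "mutation_probability": 0.2,
    "population_size": 51,
    "generations": 100,
}

def greater_equal(a, b):
    """Comparison returning 1.0 or 0.0 so it composes with arithmetic nodes."""
    return 1.0 if a >= b else 0.0

def accuracy_fitness(f, examples):
    """Fraction of (x, label) examples that f classifies correctly.

    Labels are 1 or 2; f(x) <= 0 is read as Group 1, f(x) > 0 as Group 2.
    """
    correct = sum((1 if f(x) <= 0 else 2) == label for x, label in examples)
    return correct / len(examples)

# Toy check with a hand-written expression instead of an evolved tree.
examples = [(0.2, 1), (0.9, 2), (0.4, 1), (0.7, 1)]
fitness = accuracy_fitness(lambda x: greater_equal(x, 0.5) - 0.5, examples)
```

The hand-written expression misclassifies only the last toy example, giving a fitness of 0.75. Because `greater_equal` returns a number, it can feed directly into the subtraction without any type checking.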
After base classifiers were built, they were combined into ensembles: one
GP ensemble, one decision tree ensemble and one logistic regression ensemble.
The combining mechanism used is simple majority voting, that is, predictions of
the majority of base classifiers were set as the ensemble’s prediction.
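Simple majority voting can be sketched as below. With ten voters a 5-5 tie is possible, and the study does not say how ties were resolved, so the `tie_break` default here is purely an assumption.

```python
from collections import Counter

def majority_vote(predictions, tie_break="attack"):
    """Return the label predicted by most base classifiers.

    With an even number of voters a tie is possible; the tie_break label
    used here is an assumption, not something the study specifies.
    """
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_break
    return counts[0][0]

votes = ["attack"] * 6 + ["normal"] * 4     # ten base classifiers, one example
ensemble_label = majority_vote(votes)
```

Using an odd number of base classifiers would avoid ties altogether in the binary case.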
3.2 Results
The GP ensemble was found to perform much better than the decision tree
and logistic regression ensembles, with classification accuracies of the three
ensembles being 90.55%, 80.52%, and 80.52%, respectively. Table 1 gives the
confusion matrix of the GP ensemble on the test data. Considering that 80.52% of the
examples in the test data are of “attack” type, the decision tree and logistic regression
ensembles simply classified all examples as “attack”. Amongst the 10 base
decision-tree classifiers, only one was observed to show better performance on the
test data (see Table 2); the ensemble, as a whole, however, performs rather poorly.
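The link between the 80.52% accuracy and an all-“attack” predictor can be checked with a short calculation. The test set below is hypothetical: 10,000 examples with the reported 80.52% attack share, not the actual KDD Cup data.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical test set of 10,000 examples with the reported 80.52% attack share.
actual = ["attack"] * 8052 + ["normal"] * 1948
always_attack = ["attack"] * len(actual)
baseline = accuracy(always_attack, actual)
```

A classifier that ignores its inputs and always answers “attack” thus scores exactly the majority-class share, which is the natural baseline against which the 90.55% of the GP ensemble should be read.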
(insert table 1 here)
The poor performance of the decision tree ensemble is surprising because two
of the three winning entries of KDD Cup’99 used decision trees as the base algorithm.
However, results of the current study cannot be directly compared to those of the
three winning entries of KDD Cup’99 because, as mentioned in the beginning of
section 3.1, the original problem was simplified in this study. Specifically, the
original five-class (one normal and four attack classes) classification problem is
simplified into a binary classification problem by aggregating the four attack
classes into a single class called “attack”. This simplification is appropriate
because multi-class classification is not the focus of this study.
The difference in the three ensembles’ performance can be attributed to
differences in their base classifiers. Two obvious differences can be observed
between the GP and decision tree classifiers: a) as expected, GP classifiers
exhibited greater diversity; b) in general GP classifiers exhibited less overfitting.
Diversity of GP classifiers was exhibited in two aspects: performance and
structure.
Table 2 shows the performance of the individual classifiers in the GP,
decision tree, and logistic regression ensembles, on both the training and the test
data. While the individual decision tree classifiers attained uniformly higher
accuracies than GP classifiers on training data subsets, only four of them kept this
superiority on test data subsets. A paired t-test showed that the average
classification accuracy of individual decision tree classifiers on training data
(99.5%) is significantly higher than that of GP classifiers (94.5%, p = 0.01), but
the average accuracies on test data of the two groups of classifiers are not
significantly different. This indicates that the decision tree classifiers overfit
more. Another observation that we can make from table 2 is the variation of the
individual classifiers’ performances. All three groups of individual classifiers
showed greater performance variation on test data than on training data. And GP
classifiers showed higher performance variation than decision tree and logistic
regression classifiers, on both training and test data. This can be verified by the
disparate variances of classification accuracy of the three types of classifiers (see
table 2).
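For reference, the paired t statistic used in such a comparison can be computed as sketched below. The accuracies in the example are made up for illustration and are not the values in Table 2.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """t statistic for paired samples a and b (degrees of freedom = len(a) - 1)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Made-up accuracies (%) for two learners on the same three subsets.
tree_acc = [99.0, 99.5, 99.8]
gp_acc = [98.0, 97.5, 96.8]
t = paired_t(tree_acc, gp_acc)
```

The pairing matters because both learners are trained on the same subsets, so the test works on the per-subset differences rather than on two independent samples. The spread of each group can be compared separately with `statistics.variance`.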
Notice that in table 2, the classification accuracies of two GP base
classifiers (subsets 7 and 9) on the test data were exceptionally low: 19.48% and
41.15%. However, these two classifiers did not have a significant detrimental
effect on the final ensemble’s performance. When these two classifiers were
excluded from the ensemble, the ensemble’s performance only increased from
90.55% to 91.06%. This reinforces that ensembling is a robust method, especially
given the fact that only ten base classifiers were constructed in this study. This
robustness can be valuable when using a higher number of base classifiers; that is,
the ensemble’s performance will not be greatly influenced by a few poorly
performed base classifiers.
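The robustness argument can be illustrated with a toy example. The sketch below assumes simple majority voting, which may differ from the combination rule used in this study, and it uses synthetic labels rather than the study's data. All names are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Most frequent label among the base classifiers' predictions for one case."""
    return Counter(labels).most_common(1)[0][0]


def ensemble_accuracy(classifier_outputs, truth):
    """Accuracy of the majority vote over a list of per-classifier prediction lists."""
    per_case = zip(*classifier_outputs)  # regroup predictions case by case
    voted = [majority_vote(case) for case in per_case]
    return sum(v == t for v, t in zip(voted, truth)) / len(truth)


# Synthetic labels (1 = attack, 0 = normal); not data from this study.
truth = [1, 0, 1, 1, 0, 0, 1, 0]

# Eight reasonable classifiers, each wrong on exactly one (different) case.
good = []
for i in range(8):
    preds = list(truth)
    preds[i] = 1 - preds[i]
    good.append(preds)

# Two poor classifiers that misclassify every case.
poor = [[1 - t for t in truth] for _ in range(2)]

acc_all_ten = ensemble_accuracy(good + poor, truth)
acc_without_poor = ensemble_accuracy(good, truth)
print(acc_all_ten, acc_without_poor)
```

In this constructed case at most three of the ten votes are wrong for any observation, so the two poor classifiers never overturn the majority.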
(insert table 2 here)
The high diversity of GP base classifiers can also be observed in the
structure of the models obtained. An important aspect of a GP model’s structure
is its complexity, which is defined here as the number of operators and terminal
terms that are contained in the model. As mentioned earlier, the terminal terms of
a GP model comprise data attributes and constants. To compare
the diversity of an ensemble’s base classifiers, the complexity of a
decision tree model is represented here by its depth and number of leaves.
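The complexity measure defined above, the number of operators plus terminal terms in a GP model, can be sketched as a node count over an expression tree. The nested-tuple representation and the example trees below are assumptions made for illustration and are not models from this study.

```python
def complexity(node):
    """Count operators and terminals in a GP tree.

    A tree is a nested tuple (operator, child, child, ...); anything that is
    not a tuple is a terminal (a data attribute name or a constant).
    """
    if isinstance(node, tuple):
        children = node[1:]
        return 1 + sum(complexity(child) for child in children)
    return 1


# Hypothetical trees: operators and constants are made up for illustration.
small_tree = ("+", "v30", "v32")                  # 1 operator, 2 terminals
larger_tree = ("+", ("*", "v12", 0.5), "v23")     # 2 operators, 3 terminals

print(complexity(small_tree), complexity(larger_tree))
```

Under this definition, the smallest model that combines two attributes with a binary operator has complexity 3, which is the minimum value reported in Table 3.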
The complexities of the ten GP and decision tree base classifiers are
presented in Table 3. It can be seen from the table that the complexity of GP base
classifiers varies much more than that of decision tree classifiers. For instance,
the complexity of GP classifiers varies widely from 3 to 23, while the depths of the
decision tree base classifiers were uniformly 3, and the number of leaves varied
within a narrow range of 11 to 16. The structural diversity of GP base classifiers
can also be observed in the attributes that are included in each model; these are
shown in Table 4. It can be seen from the table that attributes included in
different GP models overlap with each other to a much lesser extent than the
attributes included in the different decision tree models. For instance, no attribute
appeared in more than four GP base classifiers, but three attributes (v4, v23, v37)
appeared in all ten decision-tree base classifiers.
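The attribute overlap can be quantified directly from Table 4. The sketch below counts how many base classifiers use each attribute and also computes the mean pairwise Jaccard similarity of the attribute sets. The Jaccard measure is our illustrative choice and is not a statistic reported in the study. The attribute sets are transcribed from Table 4, and the split between the GP and decision tree columns follows our reading of the table.

```python
from collections import Counter
from itertools import combinations

# Attribute sets per subset (S1..S10), transcribed from Table 4.
gp_models = [
    {"v1", "v5", "v12", "v23", "v32", "v37"},
    {"v30", "v32"},
    {"v37"},
    {"v12", "v22", "v28"},
    {"v14", "v23", "v31", "v38", "v41"},
    {"v6", "v40"},
    {"v1", "v12", "v32"},
    {"v12", "v23"},
    {"v19", "v32", "v34"},
    {"v2", "v30"},
]
dt_models = [
    {"v3", "v4", "v23", "v37"},
    {"v4", "v5", "v23", "v37"},
    {"v4", "v6", "v23", "v34", "v37"},
    {"v3", "v4", "v23", "v37"},
    {"v2", "v4", "v23", "v34", "v37"},
    {"v3", "v4", "v6", "v23", "v37"},
    {"v3", "v4", "v6", "v23", "v37"},
    {"v4", "v23", "v37"},
    {"v3", "v4", "v6", "v23", "v37"},
    {"v4", "v23", "v34", "v37"},
]


def attribute_counts(models):
    """Number of base classifiers in which each attribute appears."""
    return Counter(attr for model in models for attr in model)


def mean_pairwise_jaccard(models):
    """Average |A & B| / |A | B| over all pairs of attribute sets."""
    pairs = list(combinations(models, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


gp_counts = attribute_counts(gp_models)
dt_counts = attribute_counts(dt_models)
in_all_dt = {a for a, n in dt_counts.items() if n == len(dt_models)}

print("max GP attribute count:", max(gp_counts.values()))
print("attributes in every decision tree:", sorted(in_all_dt))
print("mean Jaccard, GP vs DT:",
      round(mean_pairwise_jaccard(gp_models), 3),
      round(mean_pairwise_jaccard(dt_models), 3))
```

A lower mean Jaccard similarity indicates that the models draw on more distinct sets of attributes, which is the sense of structural diversity discussed here.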
Previous studies have indicated that slight overfit in base classifiers is
beneficial to ensembling, as this overfit is neutralized when base classifiers are
combined [28]. However, an opposite result was observed in our study. In
general, the decision tree base classifiers exhibited greater overfit than GP base
classifiers. Classification accuracies of the ten decision tree base classifiers on
training data were all very high (>99%) and higher than those of the corresponding
GP base classifiers (see Table 2). However, on the test data, the majority of GP
base classifiers have higher classification accuracies than the corresponding
decision tree base classifiers. This contrary result arises
because the decision tree base classifiers in this study are too similar to each other,
which prevents their overfit from being neutralized during combining.
(insert table 3 here)
(insert table 4 here)
4. CONCLUSION
This study demonstrates the utility of GP as a method for creating and
using ensembles in data mining. Given its representational power in modeling
complex non-linearities in the data, GP is seen to be effective at learning diverse
patterns in the data. With different models capturing varied data relationships, GP
models are ideally suited for combination in ensembles. Experimental results
show that different GP models are dissimilar both in terms of the functional form
as well as with respect to the variables defining the models.
The observed diversity in different GP models can be attributed to a local
search conducted in different invocations of the genetic search. Note that
relatively small population sizes (of 51 population members) were used. Genetic
search with small populations is known to lead to local search; given different
(here, random) initial points, small populations effect local search in different
regions of the search space. While this study, with its focus on large data,
considered different GP models obtained from different data subsets, a diversity
of models can also be sought from the same dataset, by initiating genetic search
using different random number seeds. In this way, GP can be generally useful as
a modeling technique for use with various bagging and boosting schemes, and this
forms a topic of our continuing investigation.
Results obtained in this study also indicate that GP models are able to
‘select’ a subset of relevant attributes – notice that many of the GP models are
comprised of a handful of the total variables in the dataset (Table 4). This points
to the potential usefulness of GP as a method for variable selection and feature
extraction. Related studies by the authors show promise in that GP models are
seen to yield comparable, if not better, classification performance (over traditional
data mining methods) with fewer variables; GP here is able to discern useful non-
linear interactions consisting of fewer variables. Future studies can explore this
capability of GP more thoroughly. For example, datasets that contain a large
number of attributes are most suitable for this kind of study. GP can be used to
select a small subset of useful attributes out of all attributes first; other methods
can then be used to work on the attribute subset.
Notice that the size of the data subsets used for building the base
classifiers in this study was relatively small, compared to the entire dataset. It is
valuable to find that GP is able to learn informative models from these small data
subsets. While the resulting GP-ensemble was found to significantly outperform
the other techniques, the use of a higher number of data-subsets to consider more
base classifiers for a GP-ensemble is an interesting topic for further study. Given
the positive results of GP’s usefulness in ensembling, as evidenced in this study, it
will also be worthwhile to consider extending this work towards multi-group
classification (the same data as used in this study, which define, in general,
multiple “attack” types, can be useful for this purpose).
Decision-makers often seek models that are interpretable. Interpretability
of obtained models presents an advantage of GP over techniques like neural
networks, where interpretation of different weights and network structures is
usually not obvious. However, large tree-structured GP models may also not be
easily interpretable. GP trees often exhibit ‘bloat’, with repeating and redundant
terms in the tree [9] [40] [41]. While it is recognized that some redundancy can
aid in the search (introns) and may be desirable [40], this tendency of GP to
develop large trees is usually addressed by limiting the trees to some specified
depth and size (as done in this study); alternative methods have also been
proposed for controlling such bloat (for example, [41] [42]). Another
representation-related issue with regular GP arises from the closure requirement
over the function set. The closure property requires that all elements of a tree
return the same data type, so as to allow arbitrary sub-trees to be recombined by
the crossover and mutation operators. Where such closure is enforced by treating,
say, Boolean and Real values identically, the resulting trees can make
interpretation harder. To overcome this, [43] describes strongly typed GP, where
elements of a tree can be of any pre-defined data type, with the tree initialization
routines and search operators generating only syntactically correct tree structures.
In the context of data mining, the typing mechanism can be defined to handle
binary, nominal and real-valued attributes suitably, and also to specify appropriate
combinations of different function types so that the resulting model is readily
interpretable.
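The idea of typing can be sketched as follows. This is a minimal illustration of type-checking a tree, not the strongly typed GP system of [43]. The primitive set, the type names, and the nested-tuple tree representation are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Primitive:
    returns: str            # type produced by this node
    args: Tuple[str, ...]   # types required of its children (empty for terminals)


# Hypothetical primitive set mixing Boolean and real-valued elements.
PRIMS = {
    "and": Primitive("bool", ("bool", "bool")),
    "gt": Primitive("bool", ("real", "real")),
    "add": Primitive("real", ("real", "real")),
    "v4": Primitive("real", ()),
    "v23": Primitive("real", ()),
    "0.5": Primitive("real", ()),
}


def well_typed(tree, expected):
    """True if the tree returns `expected` and every child fits its argument slot."""
    if isinstance(tree, tuple):
        name, children = tree[0], tree[1:]
    else:
        name, children = tree, ()
    prim = PRIMS[name]
    if prim.returns != expected or len(children) != len(prim.args):
        return False
    return all(well_typed(c, t) for c, t in zip(children, prim.args))


# A rule-like classifier: (v4 > 0.5) and (v4 + v23 > 0.5).
ok = ("and", ("gt", "v4", "0.5"), ("gt", ("add", "v4", "v23"), "0.5"))
# Rejected under typing: adds a Boolean comparison result to a real attribute.
bad = ("add", ("gt", "v4", "0.5"), "v23")

print(well_typed(ok, "bool"), well_typed(bad, "real"))
```

Under plain closure the second tree would be legal, with the Boolean silently treated as a number. A typed initialization or crossover routine would use a check of this kind to generate only trees of the first form.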
The use of GP as a data mining technique has not received much attention
in the literature. For practitioners, too, no GP-based software tool designed for
data mining is currently available (in fact, no genetic-search-based tool was
found in the software section of the KDD website, www.kdnuggets.com).
The findings in this study provide only an initial step in exploring the data-mining
potential of GP. Various issues remain for further research. The design of fitness
functions tailored to different data-mining tasks and objectives, the structuring of
functional primitives and the model representation, susceptibility to and
avoidance of overfit and noise, are among the important issues for investigation.
Figure 1a: GP tree example
Figure 1b: Crossover in GP
Figure 2: Number of subsets and classification accuracy of GP ensembles (x-axis: number of subsets, from 1 to 50; y-axis: classification accuracy, from 0.74 to 0.94)

                   Predicted
                   Normal     Attack      Total
Actual   Normal    44,591     16,002     60,593
         Attack    13,405    237,031    250,436
Total              57,996    253,033    311,029
Table 1: Confusion Matrix of GP Ensemble
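The ensemble accuracy reported in the text can be recomputed from the counts in Table 1. The sketch below also derives the detection rate and false alarm rate. These two rates, and all variable names, are additions for illustration and are not quantities reported in the study.

```python
# Counts transcribed from Table 1 (rows: actual class, columns: predicted class).
normal_as_normal, normal_as_attack = 44_591, 16_002
attack_as_normal, attack_as_attack = 13_405, 237_031

total = normal_as_normal + normal_as_attack + attack_as_normal + attack_as_attack

# Overall accuracy: correctly classified cases over all test cases.
accuracy = (normal_as_normal + attack_as_attack) / total

# Detection rate: share of actual attacks that were flagged as attacks.
detection_rate = attack_as_attack / (attack_as_normal + attack_as_attack)

# False alarm rate: share of normal connections flagged as attacks.
false_alarm_rate = normal_as_attack / (normal_as_normal + normal_as_attack)

# Share of attack cases in the test data, i.e. the accuracy obtained by
# labelling every case as an attack.
attack_share = (attack_as_normal + attack_as_attack) / total

print(f"total cases:      {total}")
print(f"accuracy:         {accuracy:.4f}")
print(f"detection rate:   {detection_rate:.4f}")
print(f"false alarm rate: {false_alarm_rate:.4f}")
print(f"attack share:     {attack_share:.4f}")
```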
                                      Accuracy on Test Data (%)     Accuracy on Training Data (%)
Training Subsets                      GP        DT        LR        GP        DT        LR
S1                                    90.48     80.52     80.52     98.88     99.39     79.89
S2                                    89.26     80.52     80.52     91.52     99.50     80.53
S3                                    88.81     81.19     80.52     89.88     99.56     79.91
S4                                    90.64     80.52     80.52     94.51     99.40     80.36
S5                                    80.46     80.52     80.52     99.30     99.60     80.17
S6                                    74.64     81.93     80.52     84.97     99.40     79.80
S7                                    19.48     81.32     80.52     97.13     99.58     80.16
S8                                    90.67     80.52     80.52     98.70     99.26     80.30
S9                                    41.15     90.95     80.52     91.69     99.82     80.21
S10                                   91.85     78.66     80.52     99.03     99.48     79.59
Standard Deviation                    25        3.4       0         4.9       0.15      0.29
Classification Accuracy of Ensemble   90.55     80.52     80.52
GP = GP Classifiers; DT = Decision Tree Classifiers; LR = Logistic Regression Classifiers
Table 2: Classification Accuracy of Base Classifiers
           GP Trees      Decision Trees
Subsets    Complexity    Depth    Number of Leaves
S1         13            3        11
S2         3             3        15
S3         5             3        12
S4         5             3        12
S5         23            3        12
S6         3             3        12
S7         15            3        12
S8         7             3        11
S9         13            3        13
S10        3             3        16
Table 3: Complexity of GP and Decision Tree Base Classifiers
Subsets    GP Trees                        Decision Trees
S1         v1, v5, v12, v23, v32, v37      v3, v4, v23, v37
S2         v30, v32                        v4, v5, v23, v37
S3         v37                             v4, v6, v23, v34, v37
S4         v12, v22, v28                   v3, v4, v23, v37
S5         v14, v23, v31, v38, v41         v2, v4, v23, v34, v37
S6         v6, v40                         v3, v4, v6, v23, v37
S7         v1, v12, v32                    v3, v4, v6, v23, v37
S8         v12, v23                        v4, v23, v37
S9         v19, v32, v34                   v3, v4, v6, v23, v37
S10        v2, v30                         v4, v23, v34, v37
Table 4: Attributes Included in GP and Decision Tree Base Classifiers
References:
[1] R.-A. Fisher, The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(1936), 179-188.
[2] N. Freed, and F. Glover. A Linear Programming Approach to the Discriminant Problem. Decision Sciences, 12(1981), 68-74.
[3] D.-J. Hand, Discrimination and Classification. New York: John Wiley and Sons. 1981.
[4] P.-S. Bradley, U.-M. Fayyad, and O.-L. Mangasarian, Mathematical programming for data mining: Formulations and challenges. INFORMS Journal on Computing, 11(1999), 217-238.
[5] G.-J. Koehler, Linear Discriminant Functions Determined through Genetic Search. ORSA Journal on Computing, 3(1991), 345-357.
[6] S. Bhattacharyya, Direct Marketing Performance Modeling using Genetic Algorithms, INFORMS Journal on Computing, 11 (1999).
[7] D.-P. Greene, and S.F. Smith, A Genetic System for Learning Models of Consumer Choice, in J.J. Grefenstette: Genetic Algorithms and their Applications: Proceedings of the Second International Conference on Genetic Algorithms, Hillsdale, NJ: L. Erlbaum Associates. 1987.
[8] K. DeJong, W.-M. Spears, and D.-F. Gordon. Using Genetic Algorithms for Concept Learning. Machine Learning, 13(1993), 161-188.
[9] J.-R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
[10] S. Bhattacharyya, Direct Marketing Response Models using Genetic Search, Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York, AAAI Press, 1998.
[11] Y. Kim, W. Street, F. Menczer. Feature selection in unsupervised learning via evolutionary search. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), 2000, 365-369.
[12] Y. Kim, W. Street, F. Menczer. An evolutionary multi-objective local selection algorithm for customer targeting. Congress on Evolutionary Computing (CEC2001), Seoul, Korea. 2001, 759-766.
[13] S. Bhattacharyya, Evolutionary Algorithms in Data Mining: Multi-Objective Performance Modeling for Direct Marketing. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), Boston, AAAI Press, 2000.
[14] U.-M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.
[15] P. Domingos, and G. Hulten. Mining High-Speed Data Streams. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000), Boston, AAAI Press, 2000.
[16] L. Breiman, Pasting bites together for prediction in large data sets and on-line. Machine Learning, 36(1999), 85-103.
[17] P. Chan, S. Stolfo. On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Information Systems, 8(1997), 5-28.
[18] W. Street, Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001), San Francisco, CA. 2001, 377-382.
[19] L. Hall, K. Bowyer, W. Kegelmeyer, T. Moore, C. Chao. Distributed learning on very large data sets. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), Boston, AAAI Press. 2000.
[20] E. Bauer, and R. Kohavi. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, 36(1999), 105-142.
[21] D. Opitz, and R. Maclin. Popular Ensemble Methods: An Empirical Study. Journal of Artificial Intelligence Research, 11(1999), 169-198.
[22] J.-M. Bates, and C.-J Granger. The Combination of Forecasts. Operations Research Quarterly. 20(1969), 451-468.
[23] R.-L. Winkler, and S. Makridakis. The Combination of Forecasts. Journal of the Royal Statistical Society, A, 146(1983), 150-157.
[24] R. Clemen, Combining forecasts: A review and annotated bibliography. Journal of Forecasting, 5(1989), 559-583.
[25] L. Breiman, Bagging predictors. Machine Learning, 24(1996), 123-140.
[26] R. Schapire, The Strength of Weak Learnability. Machine Learning, 5(1990), 197-227.
[27] M. Kaufmann. Bagging, boosting, and bloating in genetic programming. Proceedings of the Genetic and Evolutionary Computation Conference, 1999, 1053-1060.
[28] P. Sollich, A. Krogh. Learning with ensembles: How overfitting can be useful, in: D. Touretzky, M. Mozer, M. Hasselmo, Advances in Neural Information Processing Systems, 8(1996). MIT Press, Cambridge, MA.
[29] R. Jacobs. Methods for combining experts’ probability assessments. Neural Computation, 7(1995), 867-888.
[30] A. Krogh, J. Vedelsby. Neural network ensembles, cross validation, and active learning, in: G. Tesauro, D. Touretzky, T. Leen, Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA., 7(1995), 231-238.
[31] D. Opitz, Feature selection for ensembles. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI), 1999, 379-384.
[32] J. Gehrke, R. Ganti, BOAT – Optimistic decision tree construction. Proceedings of the 1999 SIGMOD conference, Philadelphia, PA, 1999.
[33] M. Mitchell, An Introduction to Genetic Algorithms. MIT Press. Cambridge, Massachusetts. 1996.
[34] M. Evett, and T. Fernandez. Numeric Mutation Improves the Discovery of Numeric Constants in Genetic Programming. In J.R. Koza: Proceedings of the Third Annual Genetic Programming Conference, Wisconsin, Madison, Morgan Kaufmann, 1998.
[35] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, 2nd Edition, Springer-Verlag, 1994.
[36] S. Bhattacharyya, and P. Pendharkar. Inductive, Evolutionary, and Neural Techniques for Discrimination: A Comparative Study. Decision Sciences, 9(1998), 871-900.
[37] J. Eggermont, A. Eiben, J. Hemert. A comparison of genetic programming variants for data classification, in: D. Hand, J. Kok, M. Berthold, Advances in Intelligent Data Analysis (Third International Symposium). Amsterdam, The Netherlands, 1999.
[38] S. Bhattacharyya, Multi-Objective Data-Mining using Genetic Search, Proceedings of the GECCO-2000 Workshop on Data Mining with Evolutionary Algorithms, AAAI press, 2000.
[39] A. Freitas, A genetic programming framework for two data mining tasks: Classification and generalized rule induction. Genetic Programming 1997: Proc. 2nd Annual Conf., Morgan Kaufmann, 1997, 96-101.
[40] L. Altenberg, Emergent phenomena in genetic programming, in: A. V. Sebald and L. J. Fogel (Eds.), Evolutionary Programming: Proceedings of the Third Annual Conference, World Scientific Pub., 1994, 233-241.
[41] P.-J. Angeline, Subtree crossover causes bloat, in: John R. Koza et al. (Eds.), Genetic Programming 1998: Proceedings of the Third Annual Conference, Morgan Kaufmann Pub., 1998, 745-752.
[42] W.-B. Langdon, Size fair and homologous tree genetic programming crossovers. Genetic Programming and Evolvable Machines, 1(2000), 95-119.
[43] D.-J. Montana, Strongly Typed Genetic Programming. Evolutionary Computation, 3(1995), 199-230.