ORIGINAL PAPER
Towards incremental learning of nonstationary imbalanced datastream: a multiple selectively recursive approach
Sheng Chen • Haibo He
Received: 1 August 2010 / Accepted: 4 November 2010
© Springer-Verlag 2010
Abstract Difficulties in learning from nonstationary data streams are generally twofold. First, a dynamically structured learning framework is required to catch up with the evolution of unstable class concepts, i.e., concept drifts. Second, an imbalanced class distribution over the data stream demands a mechanism to intensify the underrepresented class concepts for improved overall performance. To alleviate the challenges brought by these issues, we propose the recursive ensemble approach (REA) in this paper. To battle the imbalanced learning problem in the training data chunk received at any timestamp t, i.e., S_t, REA adaptively pushes into S_t part of the minority class examples received within [0, t - 1] to balance its skewed class distribution. Hypotheses are then progressively developed over time for all balanced training data chunks and combined as an ensemble classifier in a dynamically weighted manner, which addresses the concept drift issue in time. Theoretical analysis proves that REA can provide less erroneous prediction results than a comparative algorithm. Besides that, an empirical study on both synthetic benchmarks and a real-world data set validates the effectiveness of REA as compared with other algorithms, in terms of evaluation metrics consisting of overall prediction accuracy and ROC curves.
Keywords Incremental learning · Nonstationary data · Imbalanced learning · Stream data · Ensemble learning · Concept drift
1 Introduction
Learning from data streams has been featured in many practical applications such as network traffic monitoring and credit fraud identification (Babcock et al. 2002). Generally speaking, a data stream is a sequence of unbounded, real-time data items arriving at a very high rate that can be read only once by an application
(Gaber et al. 2003). The restriction placed at the end of this definition is also called the one-pass constraint (Aggarwal 2007), as is also noted in other literature (Sharma 1998; Lange and Grieser 2002; Muhlbaier et al. 2009). Studies of learning from data streams have flourished for quite a few years. To name a few, Domingos and Hulten (2000) proposed the very fast decision tree (VFDT) to address data mining from high-speed data streams such as Web access data. By using the Hoeffding bound, it can offer approximately the same performance as a conventional learner on a static data set. Learn++ (Polikar et al. 2001) approaches learning from data streams through an aggressive ensemble-of-ensembles learning paradigm. Briefly speaking, Learn++ processes the data stream in units of data chunks. For each data chunk, Learn++ applies a multi-layer perceptron (MLP) base learner to create multiple ensemble hypotheses upon it. He and Chen (2008) proposed the IMORL framework to address learning from video data streams. It calculates the Euclidean distance in feature space between examples within consecutive data chunks to transmit sampling weights in a biased manner, i.e., hard-to-learn examples are gradually assigned higher weights for learning, which resembles AdaBoost's weight-updating mechanism (Freund and Schapire 1997) to some degree.
H. He (corresponding author)
Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881, USA
e-mail: [email protected]

S. Chen
Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA
e-mail: [email protected]

Evolving Systems, DOI 10.1007/s12530-010-9021-y
In Angelov and Zhou (2006), an approach to real-time generation of fuzzy rule-based systems of the eXtended Takagi-Sugeno (xTS) type from data streams was proposed; it applies an incremental clustering procedure to generate the clusters that form the fuzzy rule-based systems. Georgieva and Filev (2009) proposed the Gustafson-Kessel algorithm for incremental clustering of data streams. It applies an adaptive-distance metric to identify clusters with different shapes and orientations. As a follow-up, Filev and Georgieva (2010) extended the Gustafson-Kessel algorithm to enable real-time clustering of data streams. In Dovzan and Skrjanc (2010), a recursive version of the fuzzy identification method and a predictive functional model were proposed for the control of a nonlinear, time-varying process.
The incapability of storing all data in memory for learning, as is done by traditional approaches, is not the only challenge the data stream has presented to the community. Concept drift, also recognized as the time-evolving nature of streams (Aggarwal 2003), means that, undesirably yet inevitably, class concepts evolve most of the time as the data stream moves forward. This property, combined with the virtually unbounded volume of a data stream, accounts for the so-called 'stability-plasticity' dilemma (Grossberg 1988). One may be trapped in an endless loop of pondering whether to reserve just the most recent knowledge to battle concept drift or to keep track of as much knowledge as possible in avoidance of 'catastrophic forgetting'. With regard to this, many
works have been recorded that strike a balance between the two ends of the 'stability-plasticity' dilemma. Marked as an effort to adapt the ensemble approach to time-evolving data streams, SEA (Street and Kim 2001) maintains an ensemble pool of C4.5 hypotheses with a fixed size, each of which is built upon a data chunk with a unique time stamp. When a request to insert a new hypothesis is made but the ensemble pool is fully occupied, a criterion is introduced to evaluate whether the new hypothesis is qualified enough to be accommodated at the expense of popping an existing hypothesis. Directly targeting the choice between new and old data, Fan (2004) examines the necessity of referring to old data at all: if it is unnecessary, reserving the most recent data suffices to yield a hypothesis with satisfying performance; otherwise, cross validation is applied to locate the portion of old data that best complements the most recent data for building an optimal hypothesis. The potential problem of this approach is the choice of granularity for cross validation. A finer granularity locates the desirable portion of old data more accurately, but the increased performance comes with extra overhead. When the granularity is tuned down to the scale of a single example, cross validation degenerates into a brute-force method, which may be intractable for speed-sensitive applications. Other ways of countering
concept drift include the sliding window method (Last 2002), which maintains a sliding window with either a fixed or an adaptively adjustable size to determine the timeframe of the knowledge that should be reserved, and the fading factor method (Law and Zaniolo 2005), which assigns a time-decaying factor (usually in the form of an inverse exponential) to each hypothesis built over time. In this way, old knowledge gradually becomes obsolete and can be removed once the corresponding factor drops below a threshold.
Despite the popularity of data stream study, learning from nonstationary data streams with skewed class distributions is a relatively uncharted area, and its difficulty resides in the streaming context. In the static context, the counterpart of this problem is recognized as 'imbalanced learning', which corresponds to domains where certain types of data distribution dominate the instance space over others (He and Garcia 2009). It is a recently emerged area that has attracted significantly growing attention in the community (Fan et al. 1999; Chawla et al. 2002, 2003; Hong et al. 2007; Masnadi-Shirazi and Vasconcelos 2007). However, the same story does not hold for the same problem in the data stream context, where the number of solutions is rather limited. Those on record include Gao et al. (2007), which accommodates all previous minority class examples into the current training data set to compensate for the skewed class distribution, upon which an ensemble of hypotheses is built. In lieu of this aggressive accommodation mechanism, our previous work SERA (Chen and He 2009) selects a portion of the previous minority class examples into the current training data chunk based on their similarity. The accumulation of previous minority class examples is of limited volume due to the skewed class distribution; therefore, this should not be considered a violation of the one-pass constraint.
In this paper, we propose the recursive ensemble approach (REA) in an effort to provide a solution for handling imbalanced data streams with nonstationary class concepts. Like SERA, and different from Gao et al. (2007), REA incorporates part of the previous minority class examples into the current training data chunk. However, instead of relying only on hypotheses built on the current training data chunk, as in SERA and in Gao et al. (2007), REA combines all hypotheses built over time in a dynamically weighted manner to make predictions on the testing data set.
The proposed REA framework in this work is mainly
motivated by our recent approach of MuSeRA (Chen and He
2010). Specifically, in this paper we investigate a different
strategy of estimating the similarity between previous minority
class examples and the current minority class set. Furthermore,
based on the success of SERA (Chen and He 2009) and
MuSeRA (Chen and He 2010), in this work we significantly
extend simulations of REA to both synthetic benchmarks and
real-world data sets. We also further design various simulations
to test the robustness of REA under different parameter
settings. Such empirical results together with the theoretical
analysis provide a more comprehensive justification of the
effectiveness of the proposed REA framework.
The rest of this paper is organized as follows. Section 2 discusses the technical details of the REA algorithm. Section 3 gives a theoretical analysis of the prediction error rate of REA and compares it with that of an existing algorithm. Section 4 introduces the configuration and assessment metrics applied to the simulations; after that, two synthetic benchmarks and a real-world data set are used to evaluate the effectiveness of the proposed REA in comparison with other existing algorithms. Section 5 concludes the paper and briefly introduces potential improvements that can be made to REA in the future.
2 The proposed algorithm for nonstationary
imbalanced data stream
2.1 Preliminaries for REA
Before elaborating the algorithm-level framework of REA, we introduce some preliminary knowledge to facilitate its understanding.
2.1.1 The recursive approach for imbalanced learning
Sampling-based methods form a very important category in the imbalanced learning family. Generally speaking, they consist of over-sampling approaches and under-sampling approaches (He and Garcia 2009).
Over-sampling approaches, such as SMOTE/SMOTEBoost (Chawla et al. 2002, 2003) and DataBoost-IM (Hong et al. 2007), create synthetic minority class instances based upon existing minority class examples to balance the skewed class distribution. REA also seeks to amplify the number of minority class examples in the current training data chunk. But instead of creating synthetic minority class instances, REA collects minority examples from previous training data chunks over time and selectively accommodates those with high similarity to the current minority class set into the current training data chunk.
We would like to note that the approach proposed in Gao et al. (2007) also collects previous minority class examples to amplify the current training data chunk. The difference, however, is that it adopts a 'take-in-all' mechanism that puts all previous minority class examples into the current training data chunk, no matter how many have been accumulated. Besides, that method takes an under-sampling-like approach to partition the majority class examples, without replacement, into several disjoint subsets. Hypotheses are built on each of these subsets plus a replica of the amplified minority class set, and the averaged combination of these hypotheses is used to make predictions on the current testing data set. We will see later in this section that REA uses a different ensemble approach.
2.1.2 The k-nearest neighbors selective accommodation
mechanism
Similar to Chen and He (2009), REA selectively accommodates a certain amount of previous minority examples into the current training data chunk to balance the skewed class distribution. This is different from the mechanism of Gao et al. (2007), which amplifies the current training data chunk by incorporating all previous minority examples regardless of their degree of similarity to the current minority example set. The empirical study in Chen and He (2009) showed that the performance of SERA was competitive with the take-in-all mechanism employed by Gao et al. (2007). REA inherits the selective accommodation mechanism of SERA, which gives it a good chance of receiving similar benefits.
In Chen and He (2009), similarity was estimated based on the Mahalanobis distance defined by:

d = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}   (1)

where x is the feature vector of a previous minority class example, and \mu and \Sigma are the mean vector and covariance matrix of the current minority class set, respectively. As shown in Fig. 1a, each
previous minority class example calculates its Mahalanobis
distance to the current minority class set, based on which
SERA determines which part of previous minority class
examples should be added into the current training data chunk.
This method, however, may exhibit a potential flaw: it assumes that there are no disjoint sub-concepts within the minority class concept. Otherwise, there may exist several sub-concepts for the minority class, i.e., D1 and D2 in Fig. 1b instead of D in Fig. 1a. REA addresses this flaw by adopting the k-nearest neighbors paradigm to estimate the similarity degree. Specifically, each previous minority class example takes the number of current minority examples among its k nearest neighbors in the current training data chunk as its similarity degree to the current minority class set. This is illustrated in Fig. 1c. The highlighted areas surrounded by dashed circles represent the k-nearest-neighbor search areas for the previous minority class examples S1, S2, S3, S4, and S5. The search area of Si is the region in which the k nearest neighbors of Si in the current training data chunk fall; it contains both majority class and minority class examples. Since majority class examples do not affect the similarity estimation, they are not shown in Fig. 1. Current minority class examples are represented by bold circles; the numbers of them falling in the respective 'search areas' are 3, 1, 2, 1, and 0. Therefore, REA ranks the similarities of S1, S2, S3, S4, and S5 to the current minority example set as S1 > S3 > S2 = S4 > S5.
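To make this estimate concrete, a minimal Python sketch follows; the function name, the use of plain Euclidean distance for the neighbor search, and the array layout are our assumptions, not the authors' code.

import numpy as np

def knn_minority_count(prev_minority, chunk_X, chunk_y, k=10, minority_label=1):
    # Similarity degree of each previous minority example: the number of
    # current minority examples among its k nearest neighbors in the chunk.
    sims = []
    for s in prev_minority:
        dists = np.linalg.norm(chunk_X - s, axis=1)   # distance to every chunk example
        knn = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
        sims.append(int(np.sum(chunk_y[knn] == minority_label)))
    return np.array(sims)

Sorting the previous minority examples by the returned counts in descending order reproduces the ranking S1 > S3 > S2 = S4 > S5 above.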
2.1.3 The ensemble approach for concept drifts
Both Gao et al. (2007) and Chen and He (2009) handle concept drifts by relying solely on the current training data chunk (inflated by all or part of the previous minority class examples). This makes sense since, due to concept drifts, only the current training data chunk represents exactly accurate information about the class concepts. However, maintaining only a hypothesis or hypotheses on the current training data chunk is more or less equal to discarding a significant part of previous knowledge, since knowledge of previous majority class examples can never be accessed again, either explicitly or implicitly, once they have been processed. This situation could partially account for the 'catastrophic forgetting' (Grossberg 1988) that a qualified online learning system should manage to avoid by not disconnecting itself from previous knowledge.
To address this issue, REA maintains all hypotheses built on training data chunks over time. Concerning the combination of hypotheses, Gao et al. (2007) employs a uniform voting mechanism, since it claims that, in practice, the class concepts of the testing data set may not necessarily evolve consistently with the streaming training data sets. Putting aside this debatable subject, in this work we assume the class distribution of the testing data set stays tuned to the evolution of the training data chunks. Therefore, REA weighs hypotheses according to their classification accuracy on the current training data chunk, and the weighted combination of all the hypotheses makes predictions on the current testing data set.
As stated in Sect. 2.1.1, Gao et al. (2007) also employs an ensemble approach. Nonetheless, the difference is that
REA dynamically associates all knowledge acquired over
time in a weighted manner which captures the very essence
of incremental learning, while the way by Gao et al. (2007)
is more like an exploitation process on the current training
data chunk using random under-sampling. We will explore
the performances of these two methods in our experiments.
2.2 The REA learning algorithm
The general incremental learning scenario is training data
chunk St with labeled examples and testing data set T t
with unlabeled instances always come in a pairwise manner
at any timestamp t. The task of REA at any timestamp t is
to make predictions on Tt as accurately as possible based
on knowledge learned on ðS1;S2; . . .;StÞ: With loss of
generality, it is assumed in REA that the imbalanced ratios
of all training data chunks are the same. One can easily
generalize it to the case when the training data chunks do
have different imbalanced ratio.
Pseudo-code of the proposed REA for incremental
learning of the nonstationary imbalanced data stream at
timestamp t is thus formulated as follows:
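The pseudo-code itself appears as a figure in the original layout and is not reproduced here; the Python sketch below reconstructs the per-timestamp procedure from the description of Fig. 2 that follows. Since Eqs. (2) and (3) are likewise only given in that figure, the plain chunk error rate and the inverse-error weighting used below, together with all identifier names, are stand-in assumptions rather than the authors' exact formulas.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class REASketch:
    def __init__(self, k=10, f=0.5, minority_label=1):
        self.k, self.f, self.mino = k, f, minority_label
        self.G = None                        # minority examples received within [0, t-1]
        self.H, self.w = [], np.array([])    # hypotheses built over time and their weights

    def train_chunk(self, X, y):
        Xb, yb = X, y
        need = int((self.f - np.mean(y == self.mino)) * len(X))   # (f - gamma) x m
        if self.G is not None and need > 0:
            # similarity degree: current minority cases among the k nearest neighbors
            sims = np.array([np.sum(y[np.argsort(
                np.linalg.norm(X - s, axis=1))[:self.k]] == self.mino) for s in self.G])
            picked = self.G[np.argsort(sims)[-need:]]     # the `need` most similar examples
            Xb = np.vstack([X, picked])
            yb = np.concatenate([y, np.full(len(picked), self.mino)])
        self.H.append(DecisionTreeClassifier().fit(Xb, yb))  # CART-style base learner
        # stand-ins for Eqs. (2)-(3): error rate on S_t and inverse-error weighting
        errs = np.array([1.0 - h.score(X, y) for h in self.H])
        self.w = 1.0 / np.maximum(errs, 1e-3)   # guard against e_j -> 0 (overfitting)
        mino_X = X[y == self.mino]              # archive this chunk's minority examples
        self.G = mino_X if self.G is None else np.vstack([self.G, mino_X])

    def predict_proba(self, X):
        # dynamically weighted combination of all hypotheses built over time;
        # assumes every chunk contains both classes so the classes_ orders align
        return sum(w * h.predict_proba(X) for w, h in zip(self.w, self.H)) / self.w.sum()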
Fig. 1 The selective accommodation mechanism: circles denote current minority class examples; stars represent previous minority class examples. a Intuitive approach to decide similarities based on Euclidean distance. b Potential dilemma in applying this intuitive approach. c Proposed approach using the number of current minority class cases within the k nearest neighbors of each previous minority example to decide similarities
Figure 2 shows the system-level framework of the proposed REA algorithm. The underlying principle of this framework is similar to Chen and He (2010). Briefly speaking, the data set G contains all minority training data prior to the current time. At time t = n, a certain amount ((f - γ) × m) of minority examples in G are chosen based on the criterion that the number of current minority class cases among their k nearest neighbors within the training data chunk S_n is as large as possible. These examples are then appended to S_n such that the ratio of minority examples in the post-balanced training data chunk S'_n is equal to f. A hypothesis h_n is built upon S'_n and then added into the hypothesis set H_n to obtain H_{n+1}. Each of the hypotheses in H_{n+1} is applied on S_n to calculate the error rates {e_j} using Eq. (2), which are then used to calculate the weights {w_j} for each of them by Eq. (3). A large e_j means poor performance of h_j on S_n; w_j would therefore be very small, which makes h_j's impact on classifying unlabeled instances in T_n negligible. A small e_j generally means that h_j generalizes well on S_n and should be given a larger weight. However, one should be cautious when e_j becomes extremely small, e.g., approaching 0: this means h_j has a great chance of overfitting S_n, which would result in poor generalization performance on T_n. When this situation does happen, one should refrain from adding h_j into the ensemble classifier h_final^(n) for classifying the testing data set T_n.

Finally, the hypotheses in H_{n+1} are weighted by {w_j} to obtain the final hypothesis h_final^(n), which makes predictions on the current testing data set T_n.
3 The theoretical analysis of prediction accuracy
In this section, we present a brief discussion of the theo-
retical analysis of the REA framework. Since the proof can
be done in a similar way to our previous work of MuSeRA
(Chen and He 2010), here we only highlight several major
steps of the analysis while directing the interested readers
to Chen and He (2010) for further details.
Assume the majority class examples can be decomposed into K subsets, each of approximately the same size as the minority class set (Gao et al. 2007); then each of these subsets can be combined with a replica of the minority example set, on which a hypothesis h_i (i = 1, ..., K) is developed. We further assume the probability output of hypothesis h_i that the testing instance x belongs to class c is f_i^c(x); then the corresponding probability output of the ensemble classifier is (we refer to this framework as 'Uncorrelated Bagging', abbreviated 'UB', in the rest of this paper):

f_E^c(x) = \frac{1}{K} \sum_{i=1}^{K} f_i^c(x)   (5)
According to Tumer and Ghosh (1996), the probability output of a soft-type classifier for an instance x can be expressed as:

f^c(x) = p(c|x) + \eta^c(x)   (6)

where p(c|x) is the a posteriori probability that instance x belongs to class c, and \eta^c(x) is the error associated with the output for class c.
Fig. 2 The system-level REA framework
Based on Eq. (6), and given that we are targeting a binary classification problem with classes c_i and c_j, it was proved in Tumer and Ghosh (1996) that the expected error can be reasonably approximated by:

Error = \frac{\sigma_{\eta^c}^2}{(p(c_j|x) - p(c_i|x))/2}   (7)

where p(c_j|x) and p(c_i|x) are the a posteriori probabilities, under the true Bayes model, that instance x belongs to class c_j and class c_i, respectively; they are irrelevant to the training algorithm itself. Therefore, the expected error is proportional to the variance of \eta^c(x) up to a constant, i.e., Error ∝ \sigma_{\eta^c}^2.
Given the independence of the hypotheses developed on each consolidated subset, the boundary variance of UB can be reduced by a factor of K^2 (Gao et al. 2007), i.e.,

\sigma_{b_E}^2 = \frac{1}{K^2} \sum_{i=1}^{K} \sigma_{b_i}^2   (8)
In our proposed REA framework, since the weights are determined inversely proportional to the errors of the single classifiers on the current training data chunk, they can approximately be described by:

w_i = \frac{C}{\sigma_{\eta_c^i}^2}   (9)

where C is a constant shared by all {w_i}.
Based on Eq. (6), the error term \eta_c^E(x) of REA's probability output can be represented as the weighted sum of the single classifiers' error terms, i.e.,

\eta_c^E(x) = \frac{\sum_{i=1}^{N} w_i \eta_c^i(x)}{\sum_{i=1}^{N} w_i}   (10)
If one makes an assumption similar to Gao et al. (2007), namely that the single classifiers are independent of each other, the variance of \eta_c^E(x) can be represented by:

\sigma_{\eta_c^E}^2 = \frac{\sum_{i=1}^{N} w_i^2 \sigma_{\eta_c^i}^2}{\left(\sum_{i=1}^{N} w_i\right)^2}   (11)
Taking Eq. (9) into consideration, Eq. (11) can be simplified into:

\sigma_{\eta_c^E}^2 = \frac{1}{\sum_{i=1}^{N} 1/\sigma_{\eta_c^i}^2}   (12)
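Writing out the substitution of Eq. (9) into Eq. (11) makes the simplification explicit:

\sigma_{\eta_c^E}^2
  = \frac{\sum_{i=1}^{N} (C/\sigma_{\eta_c^i}^2)^2 \, \sigma_{\eta_c^i}^2}{\left(\sum_{i=1}^{N} C/\sigma_{\eta_c^i}^2\right)^2}
  = \frac{C^2 \sum_{i=1}^{N} 1/\sigma_{\eta_c^i}^2}{C^2 \left(\sum_{i=1}^{N} 1/\sigma_{\eta_c^i}^2\right)^2}
  = \frac{1}{\sum_{i=1}^{N} 1/\sigma_{\eta_c^i}^2}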
With the estimates in Eqs. (8) and (12), one can follow an analysis similar to Chen and He (2010) to prove:

\sigma_{\eta_c^E}^2 \le \sigma_{b_E}^2   (13)

According to the previous discussion that Error ∝ \sigma_{\eta^c}^2, we can conclude that the REA framework provides less erroneous prediction results than UB (Gao et al. 2007).
4 Simulation and discussion
Our previous work (Chen and He 2009) is based on a single hypothesis built upon the current amplified training data chunk. In this section we show that, by combining all hypotheses built upon the amplified training data chunks over time in a properly weighted manner, the performance of REA in predicting class labels of the testing data sets can be considerably improved. Furthermore, we also compare our proposed approach with a dedicated static imbalanced learning approach, SMOTE (Chawla et al. 2002), to demonstrate that our approach can effectively handle dynamic imbalanced data streams.
In our current study, we adopted the classification and regression tree (CART) as the base classifier. The strategy for making CART output the likelihood that an input instance belongs to each class is twofold. (1) The leaf node that the instance under testing falls in is located; (2)
inside that leaf, proportions of training examples belonging
to each class are calculated as the likelihood of the instance
under testing for each class. A toy example would be a leaf
node with 3 majority class examples and 2 minority class
examples falling in it during training. Then as long as an
unlabeled instance reaches this leaf node during testing, it
would be assigned probabilities of 3/5 and 2/5 belonging to
majority class and minority class, respectively.
The tree generated by CART should be pruned thereafter; otherwise CART would always achieve perfect classification performance on the training data set, which is undesirable due to the risk of overfitting. We apply cost-complexity pruning to the tree created by CART. Briefly speaking, the pruning process generates a series of trees T_0 ⊃ T_1 ⊃ ... ⊃ T_m, where T_0 is the whole tree and T_m is the root node (a decision stump). T_i is created by replacing a subtree of T_{i-1} that satisfies a certain condition with a leaf node. The tree T_j with the maximum accuracy on the training data set is then chosen as the pruned tree. A detailed description of the cost-complexity pruning process can be found in Breiman et al. (1984).
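As a concrete illustration of both the soft output and the pruning sequence, the following sketch uses scikit-learn's CART implementation; the synthetic data set is our assumption, and since selecting by training accuracy alone would always favor the unpruned tree T_0, ties are broken here toward the most pruned tree.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1200, weights=[0.95], random_state=0)

# cost-complexity pruning path: each ccp_alpha yields one tree of the
# nested sequence T0 > T1 > ... > Tm described above
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
         for a in path.ccp_alphas]
best = max(reversed(trees), key=lambda t: t.score(X, y))  # prefer pruned trees on ties

# soft output: class proportions inside the leaf an instance falls into, e.g., a
# leaf holding 3 majority and 2 minority training examples yields 3/5 and 2/5
print(best.predict_proba(X[:3]))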
The reason for choosing CART as the base learner is that it provides the desired trade-off between speed and performance. Base learners such as logistic regression or decision stumps are not strong enough to efficiently learn knowledge from data chunks with unnatural class distributions. Other base learners, such as multi-layer perceptron (MLP) neural networks and support vector machines (SVMs), are strong enough to effectively learn from streamed data chunks. The problem, however, is that they generally require much more training time, which makes them poor choices for designing an online learning system. Besides, they usually tend to output learning models of high variance, which could result in low diversity among the hypotheses in the ensemble pool, which is exactly what an ensemble classifier should be designed to avoid.
The configurations of the comparative algorithms and their notations are summarized as follows.
• The REA approach uses k-nearest neighbors, with k decided through cross-validation, to weigh the similarity between the previous minority class examples and the current minority class set. The post-balance ratio f is set to 0.5.
• The SERA approach uses the hypothesis built on the amplified current training data chunk to evaluate the current testing instance set. The post-balance ratio f is set to 0.5.
• The approach proposed in Gao et al. (2007), which is denoted as 'UB' in this section.
• The SMOTE approach (Chawla et al. 2002) employs the synthetic minority over-sampling technique to balance the class distribution of the current training data chunk, upon which a hypothesis is built to predict on the current testing data set. The number of synthetic minority class instances plus current minority class examples should be half the number of majority class examples in the current amplified training data chunk. In other words, the post-balance ratio for SMOTE is also 0.5, insofar as that concept applies.
• Learning directly upon the training data chunk, which is denoted as 'Normal' in the simulation results.
One may wonder how to decide the parameter k of the k-nearest neighbor similarity estimation mechanism of REA. In the context of online learning, grid search for an optimized parameter with cross validation may not be applicable. Conceptually, k should not be set larger than the number of minority class examples in the current training data chunk; otherwise the search range of 'nearest neighbors' would go far beyond the local area of the previous minority class example under consideration and thus make the distinction among different previous minority examples less obvious. In this study, we uniformly set k of REA to 10 for all benchmarks, which is consistently less than or equal to the number of minority class examples in the training data chunks.
4.1 Evaluation metrics
Following the routine of imbalanced learning study, the
minority class data and the majority class data belong to
positive and negative classes, respectively. Let {p, n}
denote the positive and negative true class labels and
{Y, N} denote the predicted positive and negative class
labels, the confusion matrix for binary classification
problem can be defined in Fig. 3.
By manipulating the confusion matrix, the overall prediction accuracy (OA) can be defined as:

OA = \frac{TP + TN}{TP + TN + FP + FN}   (14)
OA is usually adopted in the traditional learning scenario, i.e., a static data set with a balanced class distribution, to evaluate the performance of algorithms. However, when the context changes to imbalanced learning, it is wise to apply other metrics for such evaluation (He and Garcia 2009), among which the Receiver Operating Characteristic (ROC) curve and the area under the ROC curve are the most strongly recommended (Fawcett 2003).
Based on the confusion matrix defined in Fig. 3, one can calculate the TP_rate and FP_rate as follows:

TP_rate = \frac{TP}{P_R} = \frac{TP}{TP + FN}   (15)

FP_rate = \frac{FP}{N_R} = \frac{FP}{FP + TN}   (16)
The ROC space is established by plotting TP_rate over FP_rate. Generally speaking, hard-type classifiers (those that only output discrete class labels) correspond to points (FP_rate, TP_rate) in ROC space. Soft-type classifiers (those that output a likelihood that an instance belongs to either class) correspond to curves in ROC space. Such curves are formulated by adjusting the decision threshold to generate a series of points in ROC space. For example, suppose an unlabeled instance x_k has likelihoods 0.3 and 0.7 of belonging to the minority class and the majority class, respectively. The natural decision threshold d = 0.5 would classify x_k as a majority class example, since 0.3 < d. However, d could also be set to other values, e.g., d = 0.2, in which case x_k would be classified as a minority class example, since 0.3 > d. By tuning d from 0 to 1 with a small step Δ, e.g., Δ = 0.01, a series of pairwise points (FP_rate, TP_rate) can be created in ROC space. To assess different classifiers' performance in this case, one generally uses the Area under the ROC Curve (AUROC) as an evaluation criterion; it is defined as the area between the ROC curve and the horizontal axis (the axis representing FP_rate).
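A small sketch of this threshold sweep and of the resulting AUROC (the trapezoidal estimate and all names are our choices; both classes are assumed present):

import numpy as np

def roc_points(scores, labels, step=0.01):
    # scores: minority-class likelihoods; labels: 1 = minority (positive class)
    pts = []
    for d in np.arange(0.0, 1.0 + step, step):     # tune threshold d from 0 to 1
        pred = (scores >= d).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        tp_rate = tp / np.sum(labels == 1)         # Eq. (15)
        fp_rate = fp / np.sum(labels == 0)         # Eq. (16)
        pts.append((fp_rate, tp_rate))
    return np.array(pts)

def auroc(pts):
    order = np.argsort(pts[:, 0])                  # integrate TP_rate over FP_rate
    return np.trapz(pts[order, 1], pts[order, 0])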
Fig. 3 Confusion matrix for binary classification
To reflect the ROC curve characteristics over all random runs, we adopt the vertical averaging approach of Fawcett (2003) to plot the averaged ROC curves. Our implementation of the vertical averaging method is illustrated in Fig. 4. Assume one would like to average two ROC curves, l1 and l2, both formed by a series of points in ROC space. The first step is to evenly divide the range of FP_rate into a set of intervals. Then, at each interval, find the corresponding TP_rate values of each ROC curve and average them. In Fig. 4, X1 and Y1 are the points from l1 and l2 corresponding to the interval FP_rate_1. By averaging their TP_rate values, the corresponding point Z1 on the averaged ROC curve is obtained. However, some ROC curves do not have points on certain intervals. In this case, one can use linear interpolation to obtain the averaged ROC points. For instance, in Fig. 4, the point X̄ (corresponding to FP_rate_2) is calculated by linear interpolation of the two neighboring points X2 and X3. Once X̄ is obtained, it can be averaged with Y2 to get the corresponding point Z2 on the averaged ROC curve.
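The same procedure takes only a few lines of Python, with np.interp supplying the linear interpolation; the grid resolution is an arbitrary choice of this sketch:

import numpy as np

def vertical_average(curves, n_intervals=100):
    # curves: list of (m, 2) arrays of (FP_rate, TP_rate) points, one per random run
    grid = np.linspace(0.0, 1.0, n_intervals + 1)  # fixed FP_rate intervals
    tprs = []
    for c in curves:
        c = c[np.argsort(c[:, 0])]                 # np.interp needs increasing x values
        tprs.append(np.interp(grid, c[:, 0], c[:, 1]))
    return np.column_stack([grid, np.mean(tprs, axis=0)])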
4.2 SEA data set
4.2.1 Data preparation
The SEA data set (Street and Kim 2001) is a popular artificial benchmark for assessing the performance of stream data mining algorithms. It has three features randomized in [0, 10], where whether the sum of the first two features surpasses a
where whether the sum of the first two features surpasses a
defined threshold determines the class label. The third
feature is irrelevant and can be considered as noise to test
the robustness of the algorithm under simulation. The
concept drifts are designed to adjust the threshold period-
ically such that the algorithm under simulation would be
confronted with an abrupt change in class concepts after it
lives with a stable concept for several data chunks.
Following the original design of the SEA data set, we categorize the whole data stream into 4 blocks. Inside each of these blocks, the threshold value is fixed, i.e., the class concepts are unchanged. However, whenever a new block begins, the threshold value is changed and retained till the end of the block. The threshold values of the 4 blocks are set to 8, 9, 7, and 9.5, respectively, which again adopts the configuration of Street and Kim (2001). Each block consists of 10 data chunks, each of which has 1,000 examples as the training data set and 200 instances as the testing data set. Examples with the sum of the first two features greater than the threshold belong to the majority class, and the others reside in the minority class. The number of generated minority class data is restricted to 1/20 of the total number of data in the corresponding data chunk; in other words, the imbalanced ratio is set to 0.05 in our simulation. To introduce some uncertainty/noise into the data set, 1% of the examples inside each training data set are randomly sampled and have their class labels reversed. In this way, approximately 1/6 of the minority examples are erroneously labeled, which raises a challenge in handling noise for all comparative algorithms learning from this data set.
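A generator for one such chunk might look as follows; the rejection sampling and the function names are our assumptions under the setup above:

import numpy as np

def sea_chunk(threshold, n_train=1000, n_test=200, imb=0.05, noise=0.01, seed=None):
    rng = np.random.default_rng(seed)

    def sample(n, minority):
        out = []
        while len(out) < n:                         # rejection sampling per class
            x = rng.uniform(0, 10, size=3)          # feature 3 is pure noise
            if (x[0] + x[1] <= threshold) == minority:
                out.append(x)
        return np.array(out)

    def chunk(n):
        n_min = int(n * imb)                        # imbalanced ratio 0.05
        X = np.vstack([sample(n - n_min, False), sample(n_min, True)])
        y = np.concatenate([np.zeros(n - n_min), np.ones(n_min)])  # 1 = minority
        return X, y

    X_tr, y_tr = chunk(n_train)
    flip = rng.random(n_train) < noise              # 1% of training labels reversed
    y_tr[flip] = 1 - y_tr[flip]
    return (X_tr, y_tr), chunk(n_test)

Cycling threshold through 8, 9, 7, and 9.5 for ten chunks each reproduces the four blocks described above.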
Fig. 4 Vertical averaging approach for multiple ROC curves
4.2.2 Results and discussion
The simulation results for the SEA data set are averaged over 10 random runs. During each random run, the data set is regenerated from scratch using the scheme described in Sect. 4.2.1. To view the performance of the comparative algorithms over the whole learning life, we installed 'observation points' at chunks 5, 10, 15, 20, 25, 30, 35, and 40. The presentation and discussion of the simulation results on the SEA data set cover all or a subset of these observation points.
The tendency lines of the averaged prediction accuracy over the observation points are plotted in Fig. 5a. One can conclude from this figure that: (1) REA provides higher prediction accuracy on testing data over time than UB, which is consistent with the theoretical conclusion of Sect. 3; (2) REA does not perform superiorly to the other comparative algorithms in terms of overall prediction accuracy over time. In fact, it is learning without any added ingredients ('Normal') that provides the most competitive overall prediction accuracy on testing data most of the time. However, as discussed previously, overall prediction accuracy is not the first thing to care about in an imbalanced learning scenario; it is metrics like ROC/AUROC that determine how well an algorithm performs on imbalanced data sets.
The AUROC values of the comparative algorithms at the observation points are given in Fig. 5b, complemented by the corresponding ROC curves on data chunks 10 (Fig. 6a), 20 (Fig. 6b), 30 (Fig. 6c), and 40 (Fig. 6d), as well as the corresponding numeric AUROC values on these data chunks shown in Table 1. One can easily see that, in terms of AUROC, REA shows very competitive performance against the other comparative algorithms when learning from the SEA data set with its imbalanced class distribution.

Fig. 5 OA and AUROC for SEA data set
Fig. 6 ROC curves for selected data chunks of SEA data set

Table 1 AUROC values for selected data chunks of SEA data set

Data chunk   Normal   SMOTE    UB       SERA     REA
10           0.9600   0.9749   0.9681   0.9637   1.0000
20           0.9349   0.9397   0.9276   0.9373   0.9966
30           0.9702   0.9602   0.9565   0.9415   0.9964
40           0.9154   0.9770   0.9051   0.9497   1.0000
4.3 Real-world data set
4.3.1 Data preparation
The electricity market data set (ELEC data set) (Harries 1999) is used in this study to validate the effectiveness of the proposed algorithm in real-world applications. The data were collected from the Australian New South Wales Electricity Market and reflect the electricity price fluctuations (up/down) driven by market demand and supply. Since the way the market influences the electricity price evolves unpredictably in the real world, the concrete representation of the concept drifts embedded in the data set is inaccessible, which enables us to gain another insight into the proposed algorithm as compared to artificial benchmarks with predefined concept drifts.
The original data set contains 45,312 examples dated from May 1996 to December 1998. We only retain examples after May 11, 1997 for our simulation, since several features are missing from the examples before that date and we do not intend to investigate learning from incomplete feature sets in this study. Each example consists of 8 features. Features 1-2 represent the date and the day of the week (1-7) on which the example was collected, respectively. Each example is sampled within a timeframe of 30 min, i.e., a period; thus there are altogether 48 examples collected for each day, corresponding to 48 periods a day. Feature 3 stands for the period (1-48) in which the example was collected, and is thus a purely periodic number. Features 1-3 are excluded from the feature set since they only carry the timestamp information of the data. According to the data sheet instructions, feature 4 should also be excluded from the learning process. Therefore, the remaining features are the NSW electricity demand, the VIC price, the VIC electricity demand, and the scheduled transfer between states, respectively. In summary, 27,549 examples with the last 4 features are extracted from the ELEC data set for simulation.
With the order of the examples undisturbed, the extracted data set is evenly sliced into 40 data chunks. Inside each data chunk, examples representing the electricity price going down are designated as the majority class data, while the remaining examples, representing the electricity price going up, are randomly under-sampled as the minority class data. The imbalanced ratio is set to 0.05, which means only 5% of the examples inside each data chunk belong to the minority class. To conclude the preparation of this data set, 80% of the majority class data and of the minority class data inside each data chunk are randomly sampled and merged as the training data, and the remainder is used to assess the performance of the correspondingly trained hypotheses.
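A sketch of this preparation (chunking, random under-sampling of the 'up' examples, and the per-class 80/20 split); variable names and rounding are our assumptions:

import numpy as np

def make_elec_chunks(X, up, n_chunks=40, imb=0.05, frac=0.8, seed=0):
    # X: retained features in temporal order; up: boolean, True if the price went up
    rng = np.random.default_rng(seed)
    out = []
    for Xc, upc in zip(np.array_split(X, n_chunks), np.array_split(up, n_chunks)):
        majo = Xc[~upc]                             # 'down' examples: majority class
        n_min = int(len(majo) * imb / (1 - imb))    # so the minority is 5% of the chunk
        mino = Xc[upc][rng.choice(int(upc.sum()), size=n_min, replace=False)]

        def split(A):                               # 80% train / 20% test within a class
            idx = rng.permutation(len(A))
            cut = int(frac * len(A))
            return A[idx[:cut]], A[idx[cut:]]

        ma_tr, ma_te = split(majo)
        mi_tr, mi_te = split(mino)
        train = (np.vstack([ma_tr, mi_tr]),
                 np.concatenate([np.zeros(len(ma_tr)), np.ones(len(mi_tr))]))
        test = (np.vstack([ma_te, mi_te]),
                np.concatenate([np.zeros(len(ma_te)), np.ones(len(mi_te))]))
        out.append((train, test))
    return out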
4.3.2 Results and discussion
The results of the simulation are based upon 10 random runs, where the randomness comes from the random under-sampling of the minority class data. As for the SEA data set, observation points are set up at data chunks 5, 10, 15, 20, 25, 30, 35, and 40, and we present and discuss the simulation results only at these points.
Figure 7a plots the averaged overall prediction accuracy of the comparative algorithms. The conclusions that can be drawn are similar to those in Sect. 4.2.2. In brief, in terms of overall prediction accuracy, REA is consistently better than UB but inferior to some of the other comparative algorithms over time. However, since the data set under study has an imbalanced class distribution, it is ROC/AUROC rather than overall prediction accuracy that really decides the performance of the algorithms.
Figure 7b shows the averaged AUROC of the comparative algorithms. As complements, Fig. 8 shows the ROC curves averaged over 10 random runs of the comparative algorithms on data chunks 10 (Fig. 8a), 20 (Fig. 8b), 30 (Fig. 8c), and 40 (Fig. 8d). Table 2 gives the numerical AUROC values of all comparative algorithms on the selected data chunks. One can see that, as time goes by, REA achieves very competitive AUROC results against the other comparative algorithms, which leads to the conclusion that REA can provide satisfying performance when learning from the real-world ELEC data set with imbalanced class distribution in temporally streamed format.

Fig. 7 OA and AUROC for ELEC data set
Fig. 8 ROC curves for selected data chunks of ELEC data set

Table 2 AUROC values for selected data chunks of ELEC data set

Data chunk   Normal   SMOTE    UB       SERA     REA
10           0.6763   0.6608   0.7273   0.7428   0.8152
20           0.5605   0.6715   0.6954   0.7308   0.6429
30           0.6814   0.7007   0.5654   0.6339   0.8789
40           0.7102   0.6284   0.6297   0.7516   0.9462
4.4 Spinning hyperplane data set
4.4.1 Data preparation
Proposed in Wang et al. (2003), the spinning hyperplane (SHP) data set defines the class boundary as a hyperplane in n dimensions with coefficients a_1, a_2, ..., a_n. An example x = (x_1, x_2, ..., x_n) is created by randomizing each feature in the range [0, 1], i.e., x_i ∈ [0, 1], i = 1, ..., n. A constant bias is defined as:

a_0 = \frac{1}{2} \sum_{i=1}^{n} a_i   (17)

Then the class label y of the example x is determined by:

y = 1 if \sum_{i=1}^{n} a_i x_i \ge a_0, and y = 0 if \sum_{i=1}^{n} a_i x_i < a_0   (18)
In contrast to the abrupt concept drifts in the SEA data set, the SHP data set embraces a gradual concept drift scheme in which the class concepts undergo a 'shift' whenever a new
example is created. Specifically, a randomly sampled subset of the coefficients a_1, ..., a_n has a small increment Δ added whenever a new example has been created, where Δ is defined as:

Δ = s × t / N   (19)

where t is the magnitude of change for every N examples, and s ∈ {-1, 1} specifies the direction of change and has a 20% chance of being reversed every N examples. a_0 is also modified thereafter using Eq. (17). In this way, the class boundary behaves like a spinning hyperplane as the data are created. A data set with gradual concept drifts requires the learning algorithm to adaptively tune its inner parameters constantly in order to keep up with the continuous change of the class concepts.
Following the procedure in Wang et al. (2003), the dimensionality of the SHP feature set is set to 10, and the magnitude of change t is set to 0.1. The number of chunks is set to 100 instead of 40 as for the previous two data sets, since we would like to investigate REA on a longer series of streaming data chunks. Each data chunk has 1,000 examples as the training data set and 200 instances as the testing data set, i.e., N = 1,200. In addition to the normal setup with imbalanced ratio 0.05 and noise level 0.01, we also generate the data set with imbalanced ratio 0.01 and noise level 0.01, and another with imbalanced ratio 0.05 and noise level 0.03. In this way, we can evaluate the robustness of REA on a more imbalanced data set and on a noisier data set. In the rest of this section, we refer to these three setups of imbalanced ratio and noise level as 'setup 1' (imbalanced ratio = 0.05, noise level = 0.01), 'setup 2' (imbalanced ratio = 0.01, noise level = 0.01), and 'setup 3' (imbalanced ratio = 0.05, noise level = 0.03), respectively.
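A sketch of the SHP generator (Eqs. 17-19); which coefficients drift at each step, and the subsequent imbalancing and label-noise post-processing, are our assumptions:

import numpy as np

def shp_stream(n_dim=10, t_mag=0.1, N=1200, n_chunks=100, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.uniform(0, 1, n_dim)                 # hyperplane coefficients a_1..a_n
    s = 1.0                                      # drift direction
    for _ in range(n_chunks):
        X = rng.uniform(0, 1, (N, n_dim))
        y = np.empty(N, dtype=int)
        for i in range(N):
            a0 = 0.5 * a.sum()                   # Eq. (17)
            y[i] = int(a @ X[i] >= a0)           # Eq. (18)
            drift = rng.random(n_dim) < 0.5      # a random subset of the coefficients
            a[drift] += s * t_mag / N            # Eq. (19): delta = s * t / N
        if rng.random() < 0.2:                   # 20% chance of reversal every N examples
            s = -s
        yield X, y                               # still to be imbalanced and noised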
4.4.2 Results and discussion
As for the previous two data sets, the results of all comparative algorithms on the SHP data set are based on the average of 10 random runs. The observation points are placed at data chunks 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100.
Figures 9, 10, and 11 plot the tendency lines of overall prediction accuracy and AUROC for the comparative algorithms across the observation points under setup 1, setup 2, and setup 3, respectively. One can see from these figures that, in terms of either overall prediction accuracy or area under the ROC curve, REA consistently performs competitively against the other comparative algorithms on the SHP data set under different configurations.

Fig. 9 OA and AUROC for SHP data set under setup 1
Fig. 10 OA and AUROC for SHP data set under setup 2
Fig. 11 OA and AUROC for SHP data set under setup 3
4.4.3 Study of hypotheses removal
In the scenario of long-term learning from a data stream, retaining all hypotheses in memory over time may not be a decent strategy. Besides the concern of memory occupation, hypotheses built in the distant past may hinder the classification performance on the current testing data set, and therefore should somehow be pruned/removed from the hypothesis set H. We explore this issue in an empirical manner.

Imagine H is a FIFO queue. The original design of REA effectively sets the capacity of H to infinity, since from the time of its creation any hypothesis stays in memory until the end of the data stream. Now let us assume H has a smaller capacity. Should the number of stored hypotheses exceed the capacity of H, the oldest hypothesis is automatically removed from H. In this way, it is guaranteed that H always maintains the 'freshest' subset of the generated hypotheses in memory.
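Such a FIFO view of H is nearly a one-liner with Python's deque; maxlen=None recovers the original |H| = ∞ behaviour (a sketch, with hypothetical names):

from collections import deque

class HypothesisPool:
    def __init__(self, capacity=None):       # capacity=None means |H| = infinity
        self.H = deque(maxlen=capacity)      # deque evicts the oldest item when full
        self.w = deque(maxlen=capacity)

    def add(self, hypothesis, weight):
        self.H.append(hypothesis)
        self.w.append(weight)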
Figure 12 shows the AUROC performance of REA when learning from SHP data sets with 100 data chunks, with the size of H, i.e., |H|, set to ∞, 60, 40, and 20, respectively. One can see that REA initially improves its AUROC performance as |H| shrinks. However, when |H| = 20, the performance of REA deteriorates and becomes worse than in the case |H| = ∞. Based on these observations, one can conclude that there exists a trade-off between |H| and REA's performance. A heuristic would be to set |H| to approximately half the total number of data chunks received over time, which is impractical since the number of data chunks is usually unknown in real-world applications. Another method is to assign each hypothesis a factor that decays from the time it is created; when the factor decreases through a threshold η, the corresponding hypothesis is removed from H, which behaves much like a queue with dynamic capacity. The challenge raised by this method is how to determine η, which can hardly be determined by cross validation in an online learning scenario.

Fig. 12 Performance comparison of hypotheses removal
4.5 Time and space complexity

Time and space complexity are of critical importance when designing a real-time online learning algorithm. We expect the algorithm to learn from the data stream as quickly as possible so that it can keep pace with the stream, which could arrive at very high speed. It is also desirable that the algorithm not occupy significantly large memory, due to scalability concerns. Viewed at a slightly higher level, the time and space complexity of REA are related to (1) the difference between the post-balance ratio f and the imbalanced ratio γ, i.e., f - γ; (2) the parameter k of the k-nearest neighbors; and (3) the capacity of the hypothesis set H, i.e., |H|.
To get quantified insights into the time and space complexity of REA and the other comparative algorithms, we record their time and space consumption when learning from the SEA (40 chunks), ELEC (40 chunks), and SHP (100 chunks) data sets in Tables 3, 4, and 5, respectively. The hardware configuration for the simulation is an Intel Core i5 processor with 4 GB RAM.
One can conclude from these results that: (1) REA does not take significantly more time than the other comparative algorithms to conduct training on the data stream, which makes it a qualified candidate for a real-time online learning system. (2) REA does take much more time for testing, probably because it has to combine multiple
hypotheses for the classification of testing instances. (3) REA does not take significantly more memory than the other comparative algorithms, which makes it possible to scale REA up to handle databases of large or even huge size.

Table 3 Time and space complexity for SEA data set

Complexity          Normal   SMOTE   UB      SERA    REA
Training time (s)   2.632    4.092   7.668   3.952   5.356
Testing time (s)    0.088    0.148   0.236   0.156   1.904
Space (kb)          1,142    1,185   1,362   1,266   1,633

Table 4 Time and space complexity for ELEC data set

Complexity          Normal   SMOTE   UB      SERA    REA
Training time (s)   1.376    1.712   2.844   1.824   2.028
Testing time (s)    0.084    0.148   0.244   0.140   1.744
Space (kb)          129      142     155     151     248

Table 5 Time and space complexity for SHP data set

Complexity          Normal   SMOTE   UB      SERA    REA
Training time (s)   10.29    17.89   60.10   16.98   21.62
Testing time (s)    0.23     0.46    0.53    0.43    11.58
Space (kb)          6,881    7,061   7,329   9,858   10,274
5 Conclusion
In this paper, we propose REA as a framework for learning from nonstationary imbalanced data streams. The key idea of this approach is to estimate the similarity between previous minority class examples and the current minority class set based on k-nearest neighbors, and then selectively accumulate a certain amount of previous minority class examples into the current data chunk to compensate for the skewed class distribution. After that, a base classifier is built on the amplified data set to develop the decision boundary, which contributes to the final decision-making process through weighted combination. We present a brief theoretical analysis and a detailed empirical study on both synthetic benchmarks and a real-world data set to show the effectiveness of this approach. There are several interesting issues to work on for REA in the future. Currently, the parameter k of the k-nearest neighbors in REA is decided through a heuristic; how to adaptively choose the most suitable k for data streams of different characteristics would be of critical importance for strengthening REA's robustness. Another important issue for REA is the design of a hypothesis pruning mechanism. We showed in our simulations that the performance of REA can be enhanced by discarding some of the old hypotheses; however, how to adaptively decide the appropriate proportion of old and new hypotheses in memory still demands a better strategy. Motivated by our preliminary results in this work, we believe this framework may provide important insights into incremental learning from nonstationary imbalanced data streams, and also useful techniques for a wide range of real-world applications.
References

Aggarwal C (2003) A framework for diagnosing changes in evolving data streams. In: ACM SIGMOD conference, pp 575-586
Aggarwal C (2007) Data streams: models and algorithms. Springer, New York
Angelov P, Zhou X (2006) Evolving fuzzy systems from data streams in real-time. In: IEEE symposium on evolving fuzzy systems. IEEE Press, Ambleside, pp 29-35
Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of PODS
Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth International, Belmont, CA
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321-357
Chawla NV, Lazarevic A, Hall LO, Bowyer KW (2003) SMOTEBoost: improving prediction of the minority class in boosting. In: Proceedings of the principles of knowledge discovery in databases, PKDD-2003, pp 107-119
Chen S, He H (2009) SERA: selectively recursive approach towards nonstationary imbalanced stream data mining. In: IEEE-INNS-ENNS international joint conference on neural networks, pp 522-529
Chen S, He H (2010) MuSeRA: multiple selectively recursive approach towards imbalanced stream data mining. In: Proceedings of the world congress on computational intelligence
Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of international conference KDD. ACM Press, pp 71-80
Dovzan D, Skrjanc I (2010) Predictive functional control based on an adaptive fuzzy model of a hybrid semi-batch reactor. Control Eng Practice 18(8):979-989
Fan W (2004) Systematic data selection to mine concept-drifting data streams. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, pp 128-137
Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: misclassification cost-sensitive boosting. In: Proceedings of the 16th international conference on machine learning, pp 97-105
Fawcett T (2003) ROC graphs: notes and practical considerations for data mining researchers. Technical Report HPL-2003-4
Filev D, Georgieva O (2010) An extended version of the Gustafson-Kessel algorithm for evolving data stream clustering. In: Angelov P, Filev D, Kasabov N (eds) Evolving intelligent systems: methodology and applications. IEEE Press Series on Computational Intelligence, Wiley, pp 273-300
Freund Y, Schapire R (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119-139
Gaber MM, Krishnaswamy S, Zaslavsky A (2003) Adaptive mining techniques for data streams using algorithm output granularity. In: Australasian data mining workshop (AusDM 2003), held in conjunction with the 2003 congress on evolutionary computation (CEC 2003)
Gao J, Fan W, Han J (2007) On appropriate assumptions to mine data streams: analysis and practice. In: Proceedings of the international conference on data mining, Washington, DC, USA, pp 143-152
Gao J, Fan W, Han J, Yu PS (2007) A general framework for mining concept-drifting streams with skewed distribution. In: Proceedings of the SIAM international conference on data mining
Georgieva O, Filev D (2009) Gustafson-Kessel algorithm for evolving data stream clustering. In: Proceedings of the international conference on computer systems and technologies for PhD students in computing
Grossberg S (1988) Nonlinear neural networks: principles, mechanisms, and architectures. Neural Netw 1(1):17-61
Harries M (1999) SPLICE-2 comparative evaluation: electricity pricing. Technical report, The University of New South Wales
He H, Chen S (2008) IMORL: incremental multiple-object recognition and localization. IEEE Trans Neural Netw 19(10):1727-1738
He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263-1284
Hong X, Chen S, Harris CJ (2007) A kernel-based two-class classifier for imbalanced data sets. IEEE Trans Neural Netw 18(1):28-41
Lange S, Grieser G (2002) On the power of incremental learning. Theor Comput Sci 288(2):277-307
Last M (2002) Online classification of nonstationary data streams. Intell Data Anal 6(2):129-147
Law Y, Zaniolo C (2005) An adaptive nearest neighbor classification algorithm for data streams. In: Proceedings of the European conference PKDD
Masnadi-Shirazi H, Vasconcelos N (2007) Asymmetric boosting. In: Proceedings of the international conference on machine learning
Muhlbaier MD, Topalis A, Polikar R (2009) Learn++.NC: combining ensemble of classifiers with dynamically weighted consult-and-vote for efficient incremental learning of new classes. IEEE Trans Neural Netw 20(1):152-168
Polikar R, Udpa L, Udpa S, Honavar V (2001) Learn++: an incremental learning algorithm for supervised neural networks. IEEE Trans Syst Man Cybern C Spec Issue Knowl Manage 31:497-508
Sharma A (1998) A note on batch and incremental learnability. J Comput Syst Sci 56(3):272-276
Street WN, Kim Y (2001) A streaming ensemble algorithm (SEA) for large-scale classification. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, pp 377-382
Tumer K, Ghosh J (1996) Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognit 29:341-348
Tumer K, Ghosh J (1996) Error correlation and error reduction in ensemble classifiers. Connect Sci 8(3-4):385-403
Wang H, Fan W, Yu PS, Han J (2003) Mining concept-drifting data streams using ensemble classifiers. In: KDD '03: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, pp 226-235