Data Min Knowl Disc, DOI 10.1007/s10618-017-0500-7
Classification of high-dimensional evolving data streams
via a resource-efficient online ensemble
Tingting Zhai1 · Yang Gao1 · Hao Wang1 · Longbing Cao2
Received: 30 March 2016 / Accepted: 27 February 2017 / © The Author(s) 2017
Abstract A novel online ensemble strategy, ensemble BPegasos (EBPegasos), is proposed to solve the problems simultaneously caused by concept drifting and the curse of dimensionality in classifying high-dimensional evolving data streams, which has not been addressed in the literature. First, EBPegasos uses BPegasos, an online kernelized SVM-based algorithm, as the component classifier to address the scalability and sparsity of high-dimensional data. Second, EBPegasos takes full advantage of the characteristics of BPegasos to cope with various types of concept drifts. Specifically, EBPegasos constructs diverse component classifiers by controlling the budget size of BPegasos; it also equips each component with a drift detector to monitor and evaluate its performance, and modifies the ensemble structure only when large performance degradation occurs. Such a conditional structural modification strategy lets EBPegasos strike a good balance between exploiting and forgetting old knowledge. Lastly, we first show experimentally that EBPegasos is more effective and resource-efficient than the tree ensembles on high-dimensional data. Then comprehensive experiments on synthetic and real-life datasets also show that EBPegasos copes with various types of concept drifts significantly better than the state-of-the-art ensemble frameworks when all ensembles use BPegasos as the base learner.
Responsible editor: Thomas Gärtner, Mirco Nanni, Andrea Passerini and Céline Robardet.
✉ Yang Gao [email protected]
Tingting Zhai [email protected]
Longbing Cao [email protected]
1 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
2 Advanced Analytics Institute, University of Technology Sydney, Sydney, Australia
Keywords High dimensionality · Concept drift · Data stream classification · Online ensemble
1 Introduction
Mining evolving data streams has attracted considerable research attention recently (Zliobaite et al. 2015; Krempl et al. 2014; Zliobaite and Gabrys 2014; Zhang et al. 2014). In particular, mining high-dimensional evolving data streams is a challenging task, which aims to capture the latest functional relation between the observed variables and the target variable and then accurately predict the labels of the upcoming observations. Compared with static data, streaming data has the following characteristics:

– Infinite amount Data is potentially endless, so storing all data for multi-pass processing is impractical. To process infinite data, classifiers should work in resource-limited environments. Typically, the space cost of saving the models should be independent of the number of observations.
– High velocity Data arrives so quickly that real-time response is required. Algorithms relying on retraining new classifiers on the latest data from scratch may not be useful in such environments.

In addition to the above general characteristics, the data considered in this work also possesses the following special characteristics:

– Concept drifting Data is not distributed independently and identically, but has evolving distributions (Gama et al. 2013). The relation between the observed variables and the target variable is expected to change with time unpredictably, while the distribution of the observed variables may either change or remain constant. Concept drifts affect decision-making; to handle them well, classifiers should be able to forget the old data and stay consistent with the most recent data.
– High dimensionality Data is described by hundreds or thousands of attributes, as in texts, images, sounds, videos, hyper-spectral data, financial data, etc. With so many attributes, the challenge is well known as the curse of dimensionality. One effect is related to scalability: some algorithms become intractable in both time and space with increasing dimensionality. Another effect is connected with the scarcity of the available data, because the number of instances required to represent any distribution grows exponentially with the number of attributes (Tomasev et al. 2014), while each concept in the stream may not contain that much data in many applications. With few instances in a high-dimensional space, many classification models are very likely to overfit the data, thereby leading to poor generalization performance (Pappu and Pardalos 2014).

The above characteristics appear in many real applications. For example, in sentiment classification on Twitter streams (Bifet and Frank 2010), since millions of users utter their opinions and make comments now and then, the messages that need to be
processed flow in endlessly and rapidly; meanwhile, the vocabularies used to convey positive and negative feelings tend to evolve with different contexts. In visual object tracking (Liu and Zhou 2014), surveillance cameras produce a stream of images at a rate of many frames per second. With the variation of illumination or background in adjacent frames, both the features used to describe target and non-target objects and the class boundary change. In a personalized email filtering system (Katakis et al. 2009), a user receives a stream of emails over time. Whether an email is spam depends on the user's interest, which may drift from time to time. In all the above scenarios, the data is high-dimensional.
Although many methods have been proposed for data stream classification, they generally focus on dealing with either concept drifting or high dimensionality; few works consider both. Methods handling only the concept drifting of streaming data can be classified into single-model and multi-model approaches (Hosseini et al. 2015). Single-model approaches are mainly incremental counterparts of conventional batch learning methods, equipped with certain strategies for coping with concept drifts. For example, methods based on decision trees (Gama et al. 2006; Bifet et al. 2010c; Rutkowski et al. 2013), k-nearest neighbors (Bifet et al. 2013), and support vector machines (SVM) (Wang et al. 2012) have been explored. Multi-model approaches are adaptive ensembles, which maintain multiple incremental models in memory for making combined decisions. Compared with single models, ensemble methods are more popular since they can easily forget outdated data and adapt to the most recent data simply by removing outdated component classifiers and creating new ones. Moreover, they generally have stronger generalization ability. There are mainly two categories of such methods (Brzezinski and Stefanowski 2014a). The first category consists of block-based ensembles such as Learn++.NSE (Elwell and Polikar 2011), AUE1 (Brzezinski and Stefanowski 2011), AUE2 (Brzezinski and Stefanowski 2014b), etc., which are designed for situations where data arrives in batches or blocks. The second category includes online ensembles, which process each instance once on its arrival without the requirement for storage and reprocessing. Representative methods are ASHTBag and ADWINBag (Bifet et al. 2009), LevBag (Bifet et al. 2010b), OAUE (Brzezinski and Stefanowski 2014a), DDD (Minku and Yao 2012), etc. Among the ensembles mentioned above, Hoeffding trees (Domingos and Hulten 2000) are popular base learners due to the fact that, given sufficient samples, they are guaranteed to asymptotically approach the trees generated by batch learners. However, with few samples in each concept of a high-dimensional data stream, the Hoeffding tree, as an incremental decision tree induction algorithm, tends to overfit.

There is much less work on handling the high dimensionality of data streams. Aggarwal and Yu (2008) developed an instance-centered subspace classification approach, which constructs discriminative subspaces specific to the locality of the particular test instance and makes decisions by combining the results of different subspace samples. Wang et al. (2014) proposed a framework of sparse online classification, which overcomes the scalability problem by introducing sparsity in the learnt weights. McCallum et al. (1998) considered two different naive Bayes models for document classification. These methods, however, are not designed to handle concept drifts.

In addition to the above methods, some improved random forests are reported to perform well on very high-dimensional data (Do et al. 2010; Ye et al. 2013). These
methods, however, cannot easily be modified into online methods. There are also some works focusing on building online versions of random forests. Some of them, e.g., Denil et al. (2013) and Lakshminarayanan et al. (2014), are not designed to handle concept drifts. Others are aware of concept drifts, but rely on certain prior knowledge, e.g., the range of each attribute (Abdulsalam et al. 2007; Saffari et al. 2009; Abdulsalam et al. 2011), which does not fit our online environment requirement that the information related to the data needs to be calculated or evaluated online rather than known in advance. Moreover, whether these methods are robust to very high-dimensional data is also unknown.
In this paper, we simultaneously tackle concept drifting and high dimensionality in mining streaming data. First, in order to weaken the effect caused by the curse of dimensionality, we apply BPegasos (Wang et al. 2012), an online kernelized SVM method, as the component classifier of the ensemble. Second, to better handle various types of concept drifts, a new specialized ensemble strategy is designed for BPegasos. BPegasos is selected for the following reasons:

1. BPegasos has good scalability. It works in O(Bd) space and its time cost for processing one instance is O(Bd), where d is the number of attributes and B, whose value is independent of d, is the budget for controlling the number of support vectors. By introducing the predefined parameter B, BPegasos can control its resource consumption according to the actual requirement, which makes it memory-efficient.
2. BPegasos inherits the theoretical advantages of SVM, so it can generalize well on small data in high-dimensional spaces through capacity control via margin maximization (Abe 2005; Pappu and Pardalos 2014). The high dimensionality of the feature spaces induced by Mercer kernels is even considered a bonus, since it allows building more powerful classifiers.

While a single BPegasos possesses the above merits, it is limited in handling various types of concept drifts. Meanwhile, existing ensemble strategies have mainly been designed for tree classifiers, which are very different from BPegasos in stability and adaptability. Hence, directly applying these strategies to BPegasos may produce unsatisfactory results. Therefore, we design a new ensemble method for BPegasos to compensate for the above constraints.

Our main contributions are threefold:

1. We formalize the problem of online classification on data streams embedded with both high dimensionality and concept drifts. To the best of our knowledge, this problem has not been addressed in the literature.
2. A novel solution for the above problem is developed. The solution uses BPegasos as the online component classifier, injects diversity among components by controlling their budget sizes, and modifies the ensemble structure by taking the data characteristics into account.
3. Extensive experiments are conducted and demonstrate that our solution is superior to the typical Hoeffding tree-based ensembles in addressing the proposed problem. The superiority lies not only in accuracy but also in resource-efficiency in terms of time and space. When the competitor ensembles also use BPegasos as
base learners, our solution still shows the best capability to deal with various types of drifts.

The organization of this paper is as follows. Section 2 presents the problem definition and related work. Section 3 introduces BPegasos. Our proposed ensemble strategy is presented in detail in Sect. 4. Experiments are conducted and the corresponding results are analyzed in Sect. 5. Finally, Sect. 6 draws conclusions and discusses future work.
2 Problem definition and related work

2.1 Problem definition

Here, we define the problem of classifying high-dimensional evolving data streams. For readability, Table 1 summarizes the main notation used in this paper.

Table 1 Notation

Symbols                 Description
x_t ∈ R^d, y_t ∈ R      A sample and its label at time t
c_t : R^d → R           The true concept at time t
P_t                     A fixed distribution at time t

Assume that only one instance is available at a time and data arrives sequentially in the following way: {(x_1, y_1), (x_2, y_2), . . . , (x_t, y_t), . . .}, where x_t is drawn from distribution P_t and y_t = c_t(x_t). In our case, the dimensionality of the variables may vary from a few dozen to many thousands. If the concept c_t keeps constant for the whole stream, the stream is said to be stationary; otherwise, it contains concept drifts. Depending on how the time-adjacent concepts change, concept drifts are often categorized as sudden, gradual, and incremental drifts (Gama et al. 2014; Brzezinski and Stefanowski 2014a). Sudden drifts happen when the transition between concepts takes place within a very limited amount of time. For gradual drifts, the transition may last longer and, during this time, the probability that examples come from the first concept decreases with time while the probability that examples come from the second concept increases with time. Incremental drifting refers to the case where the concept changes slightly but continuously; the change is often unnoticeable until sufficient difference has accumulated. When the data flows in, up to time t, the available instance sequence is {(x_1, y_1), (x_2, y_2), . . . , (x_t, y_t)} and the corresponding concept sequence is {c_1, c_2, . . . , c_t}. Using the available instances, the task of the learner at time t is to obtain a good estimate ĉ_{t+1} of the true concept c_{t+1}, under the time constraint that computing ĉ_{t+1} and using it for prediction must be completed before the true label y_{t+1} arrives, and the space constraint that the learner can only use limited memory resources. In the long run, the learner aims to minimize the number of wrong predictions regardless of the type of drifts.
Table 2 A summary of online ensemble methods

Methods   | ①              | ②                               | ③     | ④   | ⑤ | ⑥
OzaBag    | Any OL         | Resampling                      | −     | M   | × | ×
DWM       | Any OL         | Periodically create members     | ED    | WM  | × | √
ASHTBag   | Hoeffding tree | Resampling + control tree size  | EWMA  | WM  | × | ×
ADWINBag  | Any OL         | Resampling                      | ADWIN | M   | √ | √
LevBag    | Any OL         | Resampling using Poisson(λ)     | ADWIN | M   | √ | √
OAUE      | Any OL         | Periodically create members     | MSE   | WM  | × | √
DDD       | Any ensemble   | Control resampling weights      | PA    | SWM | √ | ×

"OL" = "online learner"; "−" = "not required"; "ED" = "exponentially decreasing"; "EWMA" = "exponentially weighted moving average"; "MSE" = "mean squared error"; "PA" = "prequential accuracy". "M" denotes the voting mechanism $y \leftarrow \arg\max_{y \in \mathcal{Y}} \sum_{i=1}^{k} f_{i,y}$, where $f_{i,y}$ is the decision of the i-th base learner on class y and $f_{i,y} \in \{0, 1\}$ or $f_{i,y} \in [0, 1]$; "WM" denotes $y \leftarrow \arg\max_{y \in \mathcal{Y}} \sum_{i=1}^{k} a_i f_{i,y}$, where $a_i$ is the weight assigned to the i-th base learner; "SWM" denotes that not all components, but only some, are used in the "WM" vote
In the literature, there exist other definitions of concept drift that are slightly different from ours. Gama et al. (2014) define a concept as a joint probability distribution p(X, y), where X is the observed variable set and y is the target variable. A drift happens between times $t_0$ and $t_1$ if $\exists X : p_{t_0}(X, y) \neq p_{t_1}(X, y)$. In particular, they distinguish two different types of drifts: real concept drifts and virtual ones. The former refer to changes of p(y|X), either with or without changes in p(X); the latter happen if only p(X) changes. It can be seen that our definition corresponds to real concept drifts.
2.2 Related work

Online ensembles proposed in recent years are summarized in Table 2. They are described from the following aspects: ① the component classifiers wrapped, ② the diversity injection method, ③ the evaluation strategy for components, ④ the aggregation strategy for prediction, ⑤ whether a drift detection mechanism is used, and ⑥ whether the ensemble structure is modified (i.e., new components added and outdated ones discarded).

OzaBag (Oza 2005) is an online version of bagging, which simulates offline bootstrap sampling by presenting each available instance k times to each component, where k ∼ Poisson(1). This practice is based on the observation that the number of times an instance is chosen for a replicate in bootstrap sampling tends to follow a Poisson(1) distribution. An anytime decision is made either by simple majority voting, when each component outputs a unique class label, or by finding the maximum class support, when each component outputs an estimate of the class posterior probabilities. For lack of structural modification, it is difficult for OzaBag to adapt to fast drifts.
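The online bootstrap at the heart of OzaBag fits in a few lines; this sketch (ours, assuming hypothetical component objects that expose an update method) shows the Poisson(1) resampling step:

import numpy as np

# OzaBag's online bagging step: each component learns the incoming
# instance k times, with k drawn independently from Poisson(1).
def ozabag_update(components, x, y, rng):
    for comp in components:
        for _ in range(rng.poisson(1.0)):
            comp.update(x, y)

# rng = np.random.default_rng(); LevBag, discussed below, would instead
# draw from rng.poisson(6.0) to increase the resampling weight.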
DWM (Kolter and Maloof 2007) maintains a dynamically weighted pool of components. It adjusts the weights of components, creates new components, and removes
poorly-performing ones after processing every p instances. Each component has an initial weight of 1. Whenever a component makes a wrong prediction, its weight is discounted by β ∈ [0, 1). When a component's weight falls below a threshold θ, it is removed from the ensemble. Since each component's weight is exponentially decreasing (ED), DWM is susceptible to noise and may incorrectly discard useful components.
ASHTBag (Bifet et al. 2009) is an ensemble tailored to Hoeffding trees, which adds diversity among components by bagging Hoeffding trees of different sizes. It uses an exponentially weighted moving average (EWMA) to estimate the error of each tree and sets the weight of each tree inversely proportional to the square of its error. Like OzaBag, ASHTBag also suffers from fast drifts.

ADWINBag (Bifet et al. 2009) incorporates the ADWIN drift detector (Bifet and Gavalda 2007) into OzaBag for detecting drifts and evaluating performance for structural modification. LevBag (Bifet et al. 2010b) further enhances ADWINBag by increasing the resampling weight using a Poisson(λ) distribution; a typical value for λ is 6. However, this strategy makes LevBag vulnerable to noise, since noisy instances are also learnt many times via resampling.

OAUE (Brzezinski and Stefanowski 2014a) simulates the block-based ensembles by creating new members and discarding obsolete ones periodically. Performance evaluation is done by incrementally calculating the mean squared error (MSE) of each component classifier and of a random prediction classifier on the last w examples. Like block-based ensembles, OAUE may encounter the stability-and-plasticity dilemma: on one hand, larger blocks produce stable classifiers that react slowly to concept drifts; on the other hand, smaller blocks are beneficial for reacting to drifts quickly but bring unstable performance.

DDD (Minku and Yao 2012) maintains four ensembles with different diversity levels, each resampling using a different Poisson(λ) distribution. The idea is motivated by the analysis of the impact of diversity on online ensemble learning in the presence of concept drifts in Minku et al. (2010). DDD uses a drift detection technique for monitoring the state of the stream so as to select different ensembles for prediction at different periods. The performance of each ensemble is evaluated using the prequential accuracy (PA) from the last moment a drift was detected to the current moment. Different from the above-mentioned ensemble methods, DDD is an ensemble of ensembles, and thus its time and space costs are larger.

To the best of our knowledge, no work has been reported on classifying high-dimensional data streams with concept drifting.
3 BPegasos

BPegasos (Wang et al. 2012) is an improvement over kernelized Pegasos (Shalev-Shwartz et al. 2011). In this section, we give a brief introduction to Pegasos and BPegasos, both in the multi-class setting.

Given a data stream {(x_t, y_t), t = 1, 2, . . .}, where x_t ∈ R^d and y_t ∈ Y = {1, 2, . . . , C}, the multi-class decision function at time t is
$$f_t(x) = \arg\max_{i \in \mathcal{Y}} {w_t^{(i)}}^{\mathsf{T}} x, \qquad (1)$$

where $w_t^{(i)}$ is the i-th class-specific weight vector at time t.

Pegasos uses stochastic gradient descent (SGD) to solve the following primal formulation of the SVM problem:

$$\min_{w^{(1)}, w^{(2)}, \ldots, w^{(C)}} \; \frac{\lambda}{2} \sum_{i \in \mathcal{Y}} \|w^{(i)}\|^2 + \frac{1}{T} \sum_{t=1}^{T} \max\left\{0,\; 1 + {w^{(r_t)}}^{\mathsf{T}} x_t - {w^{(y_t)}}^{\mathsf{T}} x_t\right\},$$

where $r_t = \arg\max_{i \in \mathcal{Y}, i \neq y_t} {w^{(i)}}^{\mathsf{T}} x_t$ and λ is the parameter of the regularization term controlling the complexity of the model. At the t-th round, when $x_t$ is available, Pegasos uses the previously learned $w_t^{(i)}$ to predict its label by Eq. (1). After receiving the true label $y_t$ of $x_t$, $w_t^{(i)}$ is updated as

$$w_{t+1}^{(i)} = w_t^{(i)} - \eta_t \nabla_t^{(i)}, \quad i = 1, 2, \ldots, C, \qquad (2)$$

where $\eta_t = 1/(\lambda t)$ is the learning step and $\nabla_t^{(i)}$ belongs to the sub-gradient of the instantaneous objective function $P_t$ with respect to $w_t^{(i)}$. $P_t$ is defined as

$$\frac{\lambda}{2} \sum_{i \in \mathcal{Y}} \|w_t^{(i)}\|^2 + \max\left\{0,\; 1 + {w_t^{(r_t)}}^{\mathsf{T}} x_t - {w_t^{(y_t)}}^{\mathsf{T}} x_t\right\},$$

where $r_t = \arg\max_{i \in \mathcal{Y}, i \neq y_t} {w_t^{(i)}}^{\mathsf{T}} x_t$. Plugging the expression of $\nabla_t^{(i)}$ into Eq. (2), the update rule becomes

$$w_{t+1}^{(i)} = (1 - \lambda \eta_t) w_t^{(i)} + \beta_t^{(i)} x_t, \quad i = 1, 2, \ldots, C,$$

where

$$\beta_t^{(i)} = \begin{cases} \eta_t & \text{if } i = y_t, \\ -\eta_t & \text{if } i = r_t, \\ 0 & \text{otherwise.} \end{cases}$$

Through derivation, $w_t^{(i)}$ can eventually be represented as

$$w_t^{(i)} = \sum_{j=1}^{t-1} \alpha_j^{(i)} x_j,$$

where

$$\alpha_j^{(i)} = \beta_j^{(i)} \prod_{k=j+1}^{t-1} (1 - \lambda \eta_k).$$
$\alpha_j^{(i)}$ is the contribution factor of the j-th instance to the i-th class. If $\alpha_j^{(i)}$ is non-zero, $x_j$ is called a support vector (SV).
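To ground the update rule above, here is a minimal sketch (our own illustration, not the authors' code) of the linear multi-class Pegasos step; the budgeted kernelized variant described next replaces the explicit weight vectors with a maintained SV set.

import numpy as np

# Minimal linear multi-class Pegasos (sketch). W holds one weight vector
# per class; lam is the regularization parameter lambda.
class LinearPegasos:
    def __init__(self, n_classes, n_features, lam=1e-4):
        self.W = np.zeros((n_classes, n_features))
        self.lam, self.t = lam, 0

    def predict(self, x):  # Eq. (1)
        return int(np.argmax(self.W @ x))

    def update(self, x, y):
        self.t += 1
        eta = 1.0 / (self.lam * self.t)      # eta_t = 1 / (lambda * t)
        scores = self.W @ x
        true_score = scores[y]
        scores[y] = -np.inf
        r = int(np.argmax(scores))           # r_t: most competitive wrong class
        hinge_active = 1.0 + scores[r] - true_score > 0.0
        self.W *= (1.0 - self.lam * eta)     # shrink all class vectors
        if hinge_active:                     # sub-gradient of the hinge term
            self.W[y] += eta * x             # beta_t^(y_t) = +eta_t
            self.W[r] -= eta * x             # beta_t^(r_t) = -eta_t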
When used for non-linear classification, a non-linear mapping φ can be introduced to map the original examples into a new, higher-dimensional feature space. $w_t^{(i)}$ can then be rewritten as

$$w_t^{(i)} = \sum_{j \in I_t^{(i)}} \alpha_j^{(i)} \phi(x_j), \qquad (3)$$

where $I_t^{(i)}$ is the set of indices of all SVs in $w_t^{(i)}$. Since φ may be infinite-dimensional, $w_t^{(i)}$ cannot be represented explicitly by a vector as in the linear situation. Instead, an SV set $S_t = \{(\alpha_j^{(i)}, x_j),\; j \in I_t^{(i)}\}$ needs to be maintained. By inserting Eq. (3) into Eq. (1), the decision function can be rewritten as

$$f_t(x) = \arg\max_{i \in \mathcal{Y}} \sum_{j \in I_t^{(i)}} \alpha_j^{(i)} \phi(x_j)^{\mathsf{T}} \phi(x). \qquad (4)$$

In Eq. (4), the inner product can be efficiently computed as $\phi(x_j)^{\mathsf{T}} \phi(x) = \mathcal{K}(x_j, x)$, where $\mathcal{K}(\cdot, \cdot)$ is a Mercer kernel derived from φ. As shown in Eq. (4), the time cost of prediction and the space cost of saving the model are both proportional to the number of SVs. The main problem of kernelized Pegasos is that the number of SVs may grow unboundedly with the amount of data processed, which restricts its usage in resource-limited streaming environments. BPegasos was proposed to solve this problem.
BPegasos keeps $w_t^{(i)}$ spanned by at most B SVs, where B is a user-defined parameter. Denote $I_t = \bigcup_{i=1}^{C} I_t^{(i)}$; when $|I_t| > B$, a budget maintenance step is triggered and achieved as

$$w_t^{(i)} \leftarrow w_t^{(i)} - \Delta_t^{(i)}, \quad i = 1, 2, \ldots, C, \qquad (5)$$

where $\Delta_t^{(i)}$ is the weight degradation of $w_t^{(i)}$ between before and after the budget maintenance. Wang et al. (2012) proved that a good budget maintenance strategy should minimize the total degradation $\sum_{i=1}^{C} \|\Delta_t^{(i)}\|^2$. Specifically, they proposed three strategies: (i) deleting one SV, (ii) projecting one SV onto the remaining ones, and (iii) merging two SVs into a newly created one. Among them, the merging strategy is the most appealing, since it has time and space complexity of O(Bd) per iteration and produces a smaller total weight degradation. The basic idea of merging is to choose two SVs and then merge them into a new one; the details are in Wang et al. (2012).
Algorithm 1 presents the pseudocode of BPegasos. In the remaining sections, BPegasos refers to BPegasos with the merging strategy. By default, the popular RBF kernel with width δ is used (Hsu et al. 2003):

$$\mathcal{K}(x_j, x) = \exp\left(\frac{-\|x_j - x\|^2}{2\delta^2}\right).$$
Algorithm 1: BPegasos

Input: Data stream D = {(x_t, y_t), t = 1, 2, . . .}; budget B; regularization parameter λ; kernel function K(·, ·)
Output: f_t(x)
1 for t = 1, 2, . . . do
2   Receive x_t;
3   Predict its label using Eq. (4);
4   Receive the true label y_t;
5   Update w_t^{(i)} using SGD;
6   if |I_t| > B then
7     Do the budget maintenance using Eq. (5);
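For concreteness, the sketch below mirrors Algorithm 1 with an explicit SV set. It is our own simplified rendering: for brevity it enforces the budget with the deletion strategy (i), evicting the SV with the smallest coefficient magnitude, whereas the paper's BPegasos uses the merging strategy of Wang et al. (2012).

import numpy as np

# Simplified budgeted kernel Pegasos (multi-class, RBF kernel). SVs are
# kept explicitly; exceeding the budget triggers a maintenance step.
class BPegasosSketch:
    def __init__(self, n_classes, budget, lam=1e-4, delta2=10.0):
        self.sv_x, self.sv_alpha = [], []   # support vectors and coefficients
        self.C, self.B = n_classes, budget
        self.lam, self.d2, self.t = lam, delta2, 0

    def _kernel(self, a, b):                # RBF kernel K(x_j, x)
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * self.d2))

    def _scores(self, x):                   # Eq. (4)
        s = np.zeros(self.C)
        for xj, aj in zip(self.sv_x, self.sv_alpha):
            s += aj * self._kernel(xj, x)
        return s

    def predict(self, x):
        return int(np.argmax(self._scores(x)))

    def update(self, x, y):
        self.t += 1
        eta = 1.0 / (self.lam * self.t)
        s = self._scores(x)
        true_score = s[y]
        s[y] = -np.inf
        r = int(np.argmax(s))               # most competitive wrong class
        for aj in self.sv_alpha:            # the (1 - lambda*eta_t) shrink
            aj *= (1.0 - self.lam * eta)
        if 1.0 + s[r] - true_score > 0.0:   # hinge loss positive: add an SV
            alpha = np.zeros(self.C)
            alpha[y], alpha[r] = eta, -eta
            self.sv_x.append(np.asarray(x, dtype=float))
            self.sv_alpha.append(alpha)
        if len(self.sv_x) > self.B:         # budget maintenance, Eq. (5);
            j = int(np.argmin([np.abs(a).sum() for a in self.sv_alpha]))
            del self.sv_x[j], self.sv_alpha[j]   # deletion here, not merging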
4 Ensemble BPegasos (EBPegasos)

In order to improve the adaptability of BPegasos to various types of drifts, a novel online ensemble method, called EBPegasos, is proposed. In the following, its three critical parts are presented in turn:

1. Diversity injection, which produces a component set that is as accurate and diverse as possible;
2. Component management, which monitors the performance status of the components and makes accurate predictions by aggregating their outputs;
3. Structural modification, which discards components at the right time, with the objective of striking a good balance between exploiting and forgetting old knowledge.
4.1 Diversity injection

It is known that diversity plays an important role in learning data streams with concept drifts (Minku et al. 2010). For unstable learners such as decision trees, small changes in the data can bring about large variations in the classification. Accordingly, resampling is effective for increasing diversity among unstable learners. Unfortunately, as an SVM-based method, BPegasos is relatively stable, and thus resampling cannot help much.

Instead of using resampling, EBPegasos creates diversity by allowing different budget sizes for the BPegasos components. This is due to the fact that varying the budget size greatly affects the performance of BPegasos (Wang et al. 2012). In general, a larger budget means more stable performance, since the margin information is better kept by more SVs. This is preferable when the data stream is stable, as BPegasos may thus reduce the number of false alarms of concept drifting. However, the stability becomes an impediment to BPegasos when the underlying concept in the data stream is actually drifting, as it usually takes more time for BPegasos to forget that many obsolete SVs. In contrast, a smaller budget means less stable performance but more sensitivity to changes, which may cause more false alarms when the data stream is stable but contributes to the early detection of concept drifts when they indeed happen. Considering this, we design EBPegasos to include multiple BPegasos
classifiers with different budget sizes, so that it can achieve good performance in both the stable and the unstable periods of the data stream.
4.2 Component management

EBPegasos predicts by weighted majority voting. In particular, when making a prediction ŷ_t for the data sample x_t, EBPegasos calculates a weighted sum of the predictions of the components,

$$\hat{y}_t \leftarrow \arg\max_{y \in \mathcal{Y}} \sum_{i=1}^{k} a_t^i \cdot \mathbb{I}\left(f_t^i(x_t) = y\right), \qquad (6)$$

where $a_t^i$ is the weight assigned to the i-th component and $\mathbb{I}(\cdot)$ is an indicator function that outputs 1 if the Boolean expression in the brackets is true, and 0 otherwise.

To obtain $a_t^i$, EBPegasos equips each of its components with an ADWIN (Bifet and Gavalda 2007) to monitor and evaluate its performance. ADWIN was originally designed as a drift detector; Fig. 1 presents its working process. Every time a component makes a prediction ŷ_t, a single bit b is generated (by comparing ŷ_t to y_t, the true class label of x_t) and fed into ADWIN, indicating whether the prediction ŷ_t is correct (b = 0) or wrong (b = 1). As shown in Fig. 1a, ADWIN tracks a window W of such bits, which consists of two large enough sub-windows W_0 and W_1. The basic idea of ADWIN is that, when there is no concept drift in the data stream, the average of the elements (i.e., the error rate) in W_0 should be statistically consistent with the average in W_1. Put another way, whenever the difference between the error rates in W_0 and W_1 is statistically significant, ADWIN alarms a concept drift, in which case ADWIN drops the sub-window W_0 and makes W_1 the new W (Fig. 1b). Note that the overall error rate in W (i.e., the number of 1's in W divided by the length of W), $Err_t^i$, can be used as a performance measure of the corresponding component. Hence, the weight $a_t^i$ can be computed as

$$a_t^i = \frac{1 - Err_t^i}{\sum_{i=1}^{k} (1 - Err_t^i)}, \quad i = 1, 2, \ldots, k,$$

where k is the number of ensemble members.
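In code, the vote of Eq. (6) with these error-based weights is straightforward; this sketch (ours) assumes each component exposes a predict method and that err[i] holds the current window error rate Err_t^i of component i:

import numpy as np

# Weighted majority vote of Eq. (6), with weights a_t^i derived from the
# components' current ADWIN-window error rates.
def weighted_vote(components, err, x, n_classes):
    w = 1.0 - np.asarray(err, dtype=float)
    w /= w.sum()                            # normalize: a_t^i
    support = np.zeros(n_classes)
    for comp, a in zip(components, w):
        support[comp.predict(x)] += a       # each vote counts its weight
    return int(np.argmax(support))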
Fig. 1 The working process of ADWIN over the bit window W, split into sub-windows W_0 and W_1: a W before a drift, b W after a drift (W_0 is dropped and W_1 becomes the new W)
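For intuition, the following is a simplified, quadratic-time toy version of such a detector (our own sketch; the actual ADWIN of Bifet and Gavalda (2007) maintains exponential histograms to reach O(log |W|) time and memory per bit, and uses a more careful cut threshold):

import math
from collections import deque

# Toy ADWIN-style detector over 0/1 error bits: cut the window whenever
# two adjacent sub-windows have significantly different error rates.
class SimpleADWIN:
    def __init__(self, delta=0.002):
        self.w, self.delta = deque(), delta
        self.last_cut = (0.0, 0.0, 1)       # (f0, f1, |W1|) at the last cut

    def add(self, bit):
        self.w.append(bit)
        n, total = len(self.w), sum(self.w)
        head_sum = 0
        for n0, b in enumerate(list(self.w)[:-1], start=1):
            head_sum += b
            n1 = n - n0
            mu0, mu1 = head_sum / n0, (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)                 # harmonic mean
            eps = math.sqrt(math.log(4.0 / self.delta) / (2.0 * m))
            if abs(mu0 - mu1) > eps:        # statistically significant gap
                self.last_cut = (mu0, mu1, n1)   # kept for Eq. (7) in Sect. 4.3
                for _ in range(n0):
                    self.w.popleft()        # drop W0, keep W1 as the new W
                return True                 # drift alarm
        return False

    def error_rate(self):                   # Err_t^i over the current window
        return sum(self.w) / max(1, len(self.w))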
As mentioned in Sect. 4.1, EBPegasos contains component BPegasos classifiers with various budgets. Hence, another benefit of this practice is that the ADWIN detectors associated with the differently budgeted classifiers can issue alarms at different times, which helps to reduce the false negative rate (missed alarms) of the drift detection.
4.3 Conditional structural modification

Structural modification concerns when to create new components to substitute for outdated ones. To handle various types of drifts well, especially fast incremental drifts, structural modification is essential in ensemble methods.

Existing ensemble methods such as ADWINBag and LevBag modify the structure once a drift is detected. However, since these methods are designed for component classifiers without forgetting mechanisms, their timing of structural modification is not suitable for BPegasos. BPegasos is very different from the component classifiers used in other ensemble solutions in that it naturally has a slow forgetting capability due to its budget maintenance. Specifically, when drifts occur, BPegasos has two ways to establish the new decision margin: (i) discarding the previously learnt model and retraining a new one on the most recent data; or (ii) keeping updating the previous model incrementally. Which way is better depends on the intrinsic characteristics of the data. Intuitively, the first way is better for severe drifts that bring large performance degradation of classifiers, while for slight drifts that bring small degradation, and that may merely be caused by noise, the second way is better.
In order to identify the severity of drifting, we define r, the model degradation rate (MDR), as follows:

$$r = \frac{f_1 - f_0}{|W_1|}, \qquad (7)$$

where $f_0$ and $f_1$ are the error rates in the sub-windows $W_0$ and $W_1$ of Fig. 1b, respectively. $f_1 - f_0$ reflects how much a model degrades when the concept drifts, and $|W_1|$ indicates how long it takes for the accumulated degradation to be detected.
For severe drifts, one may expect a large r, since performance degrades greatly in a relatively short time. In contrast, for slight drifts, r is small, since obtaining a significant degradation $f_1 - f_0$ requires a larger window $W_1$. Intuitively, with a large r it is better to forget old knowledge rapidly by retraining a new classifier from scratch; on the contrary, with a small r, fully exploiting the old knowledge by incrementally updating the old classifiers is often preferable. Therefore, we define a threshold θ in EBPegasos for determining when to use which way. Specifically, when a drift alarm is issued, if r is larger than θ, a new classifier is retrained; otherwise, the old classifier is retained and keeps updating itself. In this way, EBPegasos strikes a good balance between reusing and quickly forgetting previously learned knowledge.
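Combined with the detector sketched in Sect. 4.2, the severity test is then a one-liner over the detector's state at the moment of the cut (hypothetical names, continuing our sketches):

# Conditional structural modification test: retrain only when the model
# degradation rate r of Eq. (7) exceeds the severity threshold theta.
def should_retrain(f0, f1, len_w1, theta):
    r = (f1 - f0) / len_w1                  # large r: severe, fast drift
    return r > theta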
4.4 Algorithm pseudocode

Algorithm 2 presents the pseudocode of EBPegasos.

Algorithm 2: EBPegasos

Input: Data stream D = {(x_i, y_i), i = 1, 2, . . .}; budget parameter B; number of component classifiers k; regularization parameter λ; RBF kernel parameter δ²; severity threshold θ
Output: ŷ_t
1  Initialize k components and their drift detectors DT_i, i = 1, 2, . . . , k; the i-th component has a budget of i · B;
2  newStart ← −1;
3  for t = 1, 2, . . . do
4    Receive x_t;
5    Predict ŷ_t using Eq. (6);
6    Receive y_t;
7    Update f_t^i using (x_t, y_t), i = 1, 2, . . . , k;
8    Modify ← false;
9    for i = 1, 2, . . . , k do
10     Input the prediction bit of f_t^i(x_t) into DT_i;
11     if DT_i issues a drift alarm then
12       Compute r using Eq. (7);
13       if r > θ then
14         Modify ← true;
15   if Modify is true then
16     j ← arg max_{i=1...k, i≠newStart} Err_t^i;
17     Reset the j-th member and its detector DT_j;
18     newStart ← j;
During initialization, k components are created, each equipped with a drift detector for monitoring its performance, and the i-th component is allocated a budget of i · B (Line 1). Then, EBPegasos strictly follows the procedure of online learning: whenever a new example x_t arrives, EBPegasos uses each component's current decision function f_t^i(·) to predict its label and calls a weighted majority vote (Lines 4, 5). After receiving the true label y_t, each component is incrementally updated using (x_t, y_t) (Lines 6, 7). When the performance of some component degrades severely, a drift alarm is triggered (Lines 9–14), and EBPegasos then resets the worst-performing component as well as its detector (Lines 15–18). Note that a newly-created classifier typically requires a certain amount of training data to grow stable. Hence, to prevent a new classifier from being dropped due to a performance perturbation in its very early stage, it is offered an exemption such that it survives the first structural modification after its creation.
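Wiring the previous sketches together gives a compact, runnable approximation of Algorithm 2 (again our own simplification, assuming k >= 2; BPegasosSketch, SimpleADWIN, weighted_vote and should_retrain are the hypothetical helpers defined above):

# Sketch of the EBPegasos main loop (Algorithm 2).
def run_ebpegasos(stream, k, B, n_classes, theta):
    comps = [BPegasosSketch(n_classes, budget=(i + 1) * B) for i in range(k)]
    dets = [SimpleADWIN() for _ in range(k)]
    new_start = -1                          # index of the most recent reset
    for x, y in stream:
        err = [d.error_rate() for d in dets]
        y_hat = weighted_vote(comps, err, x, n_classes)     # Eq. (6)
        modify = False
        for comp, det in zip(comps, dets):
            bit = int(comp.predict(x) != y)                 # b = 1 iff wrong
            comp.update(x, y)
            if det.add(bit) and should_retrain(*det.last_cut, theta):
                modify = True                               # severe drift seen
        if modify:                          # reset the worst member, but never
            worst = max((i for i in range(k) if i != new_start),
                        key=lambda i: dets[i].error_rate()) # the newest one
            comps[worst] = BPegasosSketch(n_classes, budget=(worst + 1) * B)
            dets[worst] = SimpleADWIN()
            new_start = worst
        yield y_hat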
4.5 Complexity analysis

For the i-th component, the main time cost lies in prediction and budget maintenance, so the time complexity of the i-th component is O(i · Bd). In addition, the i-th
component stores i · B SVs and thus requires O(i · Bd) space. Each ADWIN drift detector needs O(log |W|) time and O(log |W|) memory for processing an input bit, where W is the maximum stable window (Bifet and Gavalda 2007). Hence, checking whether a structural modification is necessary takes O(k log |W|) time, which dominates the O(k) time cost of actually performing a structural modification. Summing over the components ($\sum_{i=1}^{k} i \cdot B = k(k+1)B/2$), EBPegasos therefore works in O(k²Bd + k log |W|) space and costs O(k²Bd + k log |W|) time per iteration.

It is easy to see that (i) the space cost of EBPegasos remains constant after its initial configuration, and (ii) the time cost of processing one data instance is independent of the length of the data stream. Hence, EBPegasos is very suitable for resource-limited situations.
5 Experiments

In this section, four sets of experiments are presented.

1. Accuracy and efficiency analysis on high-dimensional data. This set of experiments demonstrates the advantages of EBPegasos over the state-of-the-art tree ensembles on high-dimensional data streams.
2. Adaptability analysis. This set of experiments compares the ability of EBPegasos to cope with various types of concept drifts with that of the state-of-the-art ensemble frameworks, showing that the superiority of EBPegasos comes not only from the choice of the components but also from its novel component ensemble strategy.
3. Comparative analysis on low-dimensional data. This set of experiments observes the performance of EBPegasos compared to the tree ensembles on low-dimensional datasets.
4. Sensitivity analysis. This set of experiments analyzes the parameter sensitivity of EBPegasos.

All the experiments were conducted in the Massive Online Analysis (MOA) framework (Bifet et al. 2010a), a dedicated software platform for running experiments on evolving data stream mining that contains most of the competitor algorithms considered in our experiments. All experiments were performed on a machine with a 2.60 GHz Intel(R) Xeon(R) CPU E5-2650 v2 and 64 GB of 1866 MHz DDR3 memory.
Prequential accuracy (Gama et al. 2014, 2013) is used to evaluate the methods; it is computed at every moment t, as soon as the prediction ŷ_t is made, as

$$Acc(t) = \frac{1}{t} \sum_{i=1}^{t} L(\hat{y}_i, y_i),$$

where $L(\hat{y}_i, y_i) = 1$ if $\hat{y}_i = y_i$ and 0 otherwise. For comparing two methods, the average prequential accuracy over all moments is reported. Since we handle high-dimensional data in resource-limited scenarios, space and time costs are also of essential importance in evaluating an algorithm. To measure them, we use the tools
provided in MOA. In addition, to verify whether a performance difference is statistically significant, the Wilcoxon signed-rank test (Demšar 2006) is used, which is regarded as the best statistical measure for evaluating two classifiers on multiple datasets.
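A minimal sketch of this test-then-train protocol (ours; any model exposing predict and update methods fits):

# Prequential evaluation: test on each instance first, then train on it,
# and report the running accuracy Acc(t).
def prequential_accuracy(model, stream):
    correct = 0
    for t, (x, y) in enumerate(stream, start=1):
        correct += int(model.predict(x) == y)   # test first ...
        model.update(x, y)                      # ... then train
        yield correct / t                       # Acc(t)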
5.1 Accuracy and efficiency analysis on high-dimensional data

This section aims to demonstrate the accuracy and efficiency of EBPegasos in the face of high-dimensional data, compared with the state-of-the-art tree ensembles. Four real-world very high-dimensional datasets are used; their characteristics are shown in Table 3. Spam_corpus is from the SpamAssassin data collection (Katakis et al. 2009), containing 9324 emails, of which around 20% are spam. Each email is represented as a 39,916-dimensional vector, where each binary component indicates whether or not a corresponding term appears in the email. Spam_data is the dataset obtained after applying feature selection with the χ² measure to Spam_corpus (Katakis et al. 2010). The task of the Gisette1 data is to distinguish between the highly confusable handwritten digits "4" and "9"; each record in the dataset describes such a digit. NewsGroup4, created from 20newsGroup,2 simulates the evolution of a particular user's interest in four news groups over time. The parameter settings of EBPegasos on the above datasets are chosen using the first 20% of the data and are displayed in Table 3; the chosen settings are then used on the whole stream.

Table 3 The characteristics of the datasets (#inst, #Attr, #Classes) and the parameter settings of EBPegasos (k, λ, δ², B, θ)

Name         #inst    #Attr    #Classes    k    λ       δ²    B    θ
Spam_data    9324     500      2           5    1e−4    10    4    1e−5
Gisette      6000     5000     2           5    1e−4    5     80   1e−5
NewsGroup4   12,000   26,214   2           5    1e−6    60    4    1e−5
Spam_corpus  9324     39,916   2           5    1e−4    70    4    1e−5
The compared methods include OzaBag (Oza) (Oza 2005), ASHTBag (ASHT) (Bifet et al. 2009), ADWINBag (ADWIN) (Bifet et al. 2009), LevBag (Lev) (Bifet et al. 2010b), DWM (Kolter and Maloof 2007), OAUE (Brzezinski and Stefanowski 2014a) and multinomial naive Bayes (MNB) (McCallum et al. 1998). The tree ensemble methods use 10 component classifiers, each of which is a Hoeffding Naive Bayes tree (HNBT) (Holmes et al. 2005), which adds naive Bayes prediction at each leaf so as to adaptively choose, at each leaf, between majority-class prediction and naive Bayes prediction. In addition, we also consider one possible random forest (RF) implementation, which uses as its component classifier a random HNBT that selects √d attributes at random (d is the number of attributes) and uses only these attributes when building the tree and making predictions, as Bifet et al. did in their experiments (Bifet
1 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#gisette.
2 http://qwone.com/~jason/20Newsgroups/.
et al. 2010b). RF combines 25 such trees using the LevBag ensemble framework. Given that the use of the Hoeffding bound hinders trees from growing, which may degrade their performance, we consider two groups of parameter settings for the trees. In the first setting, the split confidence is 0.01 and the tie threshold is 0.05; this setting does not allow an attribute split until high quasi-certainty about which attribute test should be used is obtained. The second setting, with the tie threshold set to 1.0, turns a tree into a regular tree that introduces an attribute split after seeing a relatively small and fixed number of instances. The grace period parameter in both settings is chosen from the interval [20, 30, . . . , 200]. For DWM, the factor for discounting weights and the threshold for deleting component classifiers are 0.5 and 0.01 respectively, as suggested in Kolter and Maloof (2007). However, the period parameter in DWM, which controls component removal, creation and weight updates, and the window size in OAUE, which governs structural modification and component evaluation, are both data-specific and are chosen from the interval [100, 200, . . . , 1000] so that they achieve the best performance. The results of the tree ensembles for the two parameter settings, including the average prequential accuracy, the total processing time consisting of training and testing time, and the maximum space cost, are reported in Table 4, in which "–" indicates that the results could not be obtained because the corresponding algorithm ran out of the available memory and aborted. As a specialized document classifier, MNB is not suitable for the Gisette data; thus its result on that dataset is also shown as "–" in Table 4. The results of EBPegasos are presented in the last column of Table 4.
As we can see, for the tree ensembles, the second parameter setting generally leads to better average prequential accuracy than the first, but costs more time and space resources. Among the tree ensembles, LevBag tends to use much more time and space, which limits its performance on very high-dimensional datasets. RF is not always more resource-efficient than the other tree ensembles, but achieves significantly better accuracy in most cases. As a simple document classifier and also a single model, MNB is very resource-efficient; however, lacking a mechanism to handle concept drifts, its performance suffers greatly on NewsGroup4. Compared with all the methods on the four datasets, EBPegasos always achieves the best average prequential accuracy, while its memory and time costs, although larger than MNB's, are smaller than the tree ensembles' in most cases.
5.2 Adaptability analysis

In this section, we compare the adaptability of EBPegasos to various types of concept drifts with that of the state-of-the-art ensemble methods mentioned above. All the ensemble algorithms use BPegasos as the component learner; that is, the base learners in the tree-based ensembles are substituted with BPegasos.3
3 ASHTBag is excluded here since its ensemble strategy is dedicated to Hoeffding trees.
Table 4 The average prequential accuracy (%), processing time (s) and space cost (MB) on the very high-dimensional datasets, for Oza, ASHT, ADWIN, Lev, DWM, OAUE, RF, MNB and EBP on Spam_data, Gisette, NewsGroup4 and Spam_corpus (the numeric entries of this table are garbled beyond reliable recovery in the extracted text and are therefore omitted)
Table 5 The characteristics of the datasets
Datasets #inst #Attr #Classes Noise #Drift Drift type
Hy_f 1 M 50 2 10% – Incremental
Hy_s 1 M 50 2 10% – Incremental
RBF_f 100 k 100 10 – – Incremental
RBF_s 100 k 100 10 – – Incremental
LED_s 400 k 24 10 10–20–15–10% 3 Sudden
LED_g 400 k 24 10 10–20–15–10% 3 Gradual
SEA_s 400 k 3 2 10–20–15–10% 3 Sudden
SEA_g 400 k 3 2 10–20–15–10% 3 Gradual
Tree_s 400 k 30 10 – 3 Sudden
Tree_g 400 k 30 10 – 3 Gradual
Waveform_s 400 k 40 3 – 3 Sudden
Waveform_g 400 k 40 3 – 3 Gradual
Covertype 581 k 54 7 – – Unknown
Kddcup99 494 k 121 23 – – Unknown
Noise is the label noise in each concept
We first use synthetic datasets to analyze the performance of the algorithms. To construct datasets with sudden, gradual and incremental drifts,4 we use the data stream generators available in MOA, including the Rotating Hyperplane, Random RBF, LED, SEA, Random Tree and Waveform generators, which have been used frequently in the context of concept drifts (Bifet et al. 2010b; Brzezinski and Stefanowski 2011, 2014a, b). Table 5 presents the characteristics of these datasets.

Besides the synthetic datasets, two real datasets are also considered. The first is Covertype,5 which contains the forest cover types obtained from US Forest Service Region 2 Resource Information System data. The second is Kddcup99,6 which includes a wide range of intrusions simulated in a military network; the task is to distinguish between bad connections and normal connections. We transformed the original nominal attributes in Kddcup99 into binary attributes that are then treated as numeric.

To perform fair comparisons, all methods use 5 BPegasos components and take the same total budget space B̂ (in terms of the number of SVs) on each dataset. Within this resource limit, EBPegasos combines BPegasos components with budgets B, 2B, 3B, 4B, 5B to create diversity among the components (Sect. 4.1), which guarantees 15B = B̂. The other methods use BPegasos components with the same budget B′, ensuring 5B′ = B̂. To observe how much improvement EBPegasos can make
4 The scripts for generating these datasets are available at http://cs.nju.edu.cn/rl/people/zhaitt/datasetsGenerateScript.txt.
5 It can be downloaded from http://moa.cms.waikato.ac.nz/datasets/.
6 It can be downloaded from http://www.cse.fau.edu/~xqzhu/stream.html.
Table 6 The average prequential accuracy using BPegasos as the base learner (%)
Datasets Oza ADWIN Lev DWM OAUE BP EBP
Hy_f 84.66 84.66 84.26 76.66 84.87 84.71 85.13
Hy_s 85.06 85.06 84.80 76.90 85.15 85.04 85.43
RBF_f 21.97 51.52 57.33 52.61 46.49 35.15 55.27
RBF_s 48.01 82.51 75.32 75.11 52.38 57.18 79.86
LED_s 65.18 64.85 62.55 53.94 60.66 64.81 65.66
LED_g 63.58 62.82 58.88 50.70 60.46 63.44 63.91
SEA_s 85.03 84.65 83.41 81.72 84.52 84.78 85.10
SEA_g 84.38 84.04 82.57 81.29 83.90 84.19 84.53
Tree_s 67.24 70.02 70.69 52.48 65.39 67.83 72.73
Tree_g 62.96 62.97 60.74 47.20 59.34 64.11 65.28
Waveform_s 84.89 84.83 83.56 80.26 84.42 83.65 85.54
Waveform_g 84.16 83.81 82.44 79.75 83.99 83.16 84.22
Covertype 89.91 90.43 91.90 92.32 91.30 91.41 92.32
Kddcup99 23.58 98.38 98.35 96.30 80.45 26.87 98.38
Wilcoxon R+ 105 91.5 95 105 105 105
R− 0 13.5 10 0 0 0
over its base learners, we compare EBPegasos with its largest component, "BP", which has budget 5B. We use the first 20% of each dataset to find the best parameter setting for each algorithm and then apply that setting to the whole stream. The average prequential accuracy and processing time of these methods are reported in Tables 6 and 7, respectively. The last row of each table also presents the results of the Wilcoxon signed-rank test between EBPegasos and each of the other algorithms. In order to demonstrate the importance of the conditional structural modification in EBPegasos (Sect. 4.3), Table 8 presents the results of varying the threshold θ from 0 to 1e−1 while keeping the other parameters' values constant.
As we can see from Table 6, OzaBag and BP perform poorly on RBF_f, RBF_s and Kddcup99, which may be due to the fact that the concepts in these datasets change faster. Indeed, this is expected according to Table 8, as preventing structural modification (by setting a larger θ) in EBPegasos degrades the performance greatly. On the same datasets, DWM performs better than OAUE, since the component evaluation strategy of DWM encourages removing and creating components more frequently, while the periodic component evaluation and substitution of OAUE reacts too slowly to fast changes. Comparing ADWINBag and LevBag on the noisy datasets, i.e., Hy_f, Hy_s, LED_s, LED_g, SEA_s, and SEA_g, we find that LevBag is always worse than ADWINBag, which results from the fact that increasing the resampling weight makes LevBag more susceptible to noise.

On the datasets other than RBF_f, RBF_s and Kddcup99, the base learner BP achieves acceptable performance whether drifts happen suddenly or gradually and no matter how many drifts occur. This is owing to its slow forgetting mechanism.
Table 7 The total processing time in seconds on each dataset, including both training and testing time, when using BPegasos as the base learner
Datasets Oza ADWIN Lev DWM OAUE BP EBP
Hy_f 1174.78 1327.49 1903.08 1739.7 2067.46 632.763 1976.16
Hy_s 1199.27 1307.59 1930.51 1685.65 2091.53 623.617 1958.15
RBF_f 2965.60 1537.0 1874.31 2127.24 2932.46 1239.37 2724.67
RBF_s 2832.94 1388.12 1093.24 1702.79 4169.06 1032.45 2361.08
LED_s 1336.73 1223.16 1657.17 2045.43 2384.13 844.78 2255.99
LED_g 1404.38 1290.68 1790.18 2140.01 2529.25 889.15 2580.63
SEA_s 353.71 353.00 659.03 1669.82 616.28 174.21 575.65
SEA_g 359.10 359.16 650.92 1666.93 633.83 178.58 593.64
Tree_s 1660.54 1425.31 1699.37 2206.34 2798.93 671.76 1740.63
Tree_g 1794.4 1609.78 2172.84 2355.32 2917.6 980.18 2324.46
Waveform_s 440.74 390.36 496.32 403.96 916.92 193.68 1083.61
Waveform_g 394.84 364.31 497.76 976.05 705.42 208.12 1175.66
Covertype 443.42 511.42 607.57 539.27 545.22 190.41 565.292
Kddcup99 307.21 1693.14 2465.44 1283.15 1240.11 155.66 2502.26
Wilcoxon R+ 11 0 16 35 69 0
R− 94 105 89 70 36 105
Table 8 The average prequential accuracy of EBPegasos when the threshold θ varies from 0 to 1e−1
Datasets 0 1e−6 1e−5 1e−4 1e−3 1e−2 1e−1
Hy_f 76.14 76.14 76.67 85.13 85.13 85.13 85.13
Hy_s 76.29 76.29 81.38 85.43 85.43 85.43 85.43
RBF_f 55.27 55.27 55.24 54.92 52.31 17.14 17.14
RBF_s 79.22 79.23 79.86 52.38 52.38 52.38 52.38
LED_s 61.35 61.35 64.55 65.29 65.66 65.27 65.27
LED_g 59.18 59.18 63.49 63.91 63.91 63.91 63.91
SEA_s 81.67 81.67 83.55 85.10 85.08 85.08 85.08
SEA_g 81.10 81.10 84.53 84.53 84.53 84.53 84.53
Tree_s 72.38 72.54 72.73 72.73 72.70 69.92 69.92
Tree_g 62.37 62.63 64.82 65.28 65.28 65.28 65.28
Waveform_s 84.29 84.29 84.24 85.54 69.99 69.99 69.99
Waveform_g 84.00 84.22 82.44 69.55 69.60 69.60 69.60
Covertype 92.32 92.27 92.22 92.15 91.63 90.75 90.75
Kddcup99 98.38 98.38 98.37 98.39 98.38 23.25 23.25
Ignoring these characteristics of BPegasos, ADWINBag, LevBag, DWM and OAUE cannot outperform BPegasos in many cases, e.g., on LED_g, SEA_s, SEA_g and Tree_g. In contrast, EBPegasos, which takes these characteristics into account, always obtains better performance than BPegasos, on all datasets.
For the Wilcoxon signed-rank test, the null hypothesis is that the two algorithms being compared achieve equal performance. When the number of datasets is 14, if min{R+, R−} ≤ 21, the null hypothesis can be rejected at a confidence level of 0.05. From the last row of Table 6, we can conclude that EBPegasos is statistically better than the other methods in terms of average prequential accuracy. In terms of total processing time, BPegasos, as a single learner, costs the least time on all datasets. Among the ensemble methods, OzaBag, ADWINBag and LevBag cost significantly less time than EBPegasos.
Table 8 clearly shows the effect of conditional structural modification in EBPegasos. On Hy_f, Hy_s, LED_s, LED_g, SEA_s, SEA_g and Tree_g, fully exploiting old knowledge by setting a larger θ is more beneficial, while on RBF_f, RBF_s, Tree_s, Waveform_s, Waveform_g, Covertype and Kddcup99, promoting structural modification by setting a smaller θ (thus learning new knowledge quickly) is better. From this analysis, we conclude that the conditional structural modification strategy (Sect. 4.3) makes EBPegasos strike a good balance between fully exploiting old knowledge and learning new knowledge. The appropriate value for θ depends on the specific data characteristics.
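As a rough illustration of how θ gates structural modification, consider the sketch below; the Component class, its accuracy fields and the rebuild rule are illustrative stand-ins, not the exact procedure of Sect. 4.3.

```python
# Hedged sketch: rebuild a component only when its detected degradation
# exceeds theta; mild drifts leave the old knowledge in place.
class Component:
    def __init__(self, budget, window_acc=0.0, recent_acc=0.0):
        self.budget = budget          # budget size of this BPegasos component
        self.window_acc = window_acc  # accuracy before the detected change point
        self.recent_acc = recent_acc  # accuracy after the detected change point

def maybe_modify(ensemble, theta):
    for i, c in enumerate(ensemble):
        if c.window_acc - c.recent_acc > theta:  # severe degradation detected
            ensemble[i] = Component(c.budget)    # fresh classifier, same budget

ensemble = [Component(b, window_acc=0.90, recent_acc=0.60) for b in (100, 200, 400)]
maybe_modify(ensemble, theta=1e-4)  # degradation 0.30 > theta, so all are rebuilt
```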
5.3 Comparative analysis on low-dimensional data

It is also interesting to observe how EBPegasos behaves on low-dimensional data streams compared with the tree ensembles. Thus, we perform comparative experiments on the datasets displayed in Table 5. EBPegasos uses the same parameter setting as in Sect. 5.2. All the tree ensembles use the first parameter settings described in Sect. 5.1. The average prequential accuracy and the corresponding Wilcoxon signed-rank test results between EBPegasos and each of the other algorithms are displayed in Table 9. In addition, EBPegasos costs more time but less memory than the tree ensembles on these datasets; these results are not displayed due to the space constraint. From Table 9, we conclude that EBPegasos does not significantly outperform the tree ensembles, except RF, on these low-dimensional datasets.
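For reference, the prequential figures reported throughout are produced by a test-then-train protocol; the sketch below is a minimal illustration, with the MajorityClass learner and the predict/update interface as assumed stand-ins for any online learner.

```python
# Minimal sketch of prequential (test-then-train) evaluation.
class MajorityClass:
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def prequential_accuracy(stream, learner):
    correct, n = 0, 0
    for x, y in stream:
        correct += int(learner.predict(x) == y)  # test on the example first...
        learner.update(x, y)                     # ...then train on it
        n += 1
    return correct / n

stream = [([0.1], 1), ([0.2], 1), ([0.3], 0), ([0.4], 1)]
print(prequential_accuracy(stream, MajorityClass()))  # prints 0.5
```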
5.4 Sensitivity analysis
In this section, we analyze the effect of the parameters of EBPegasos other than the threshold θ. The regularization parameter λ and the RBF kernel width δ are inherent to SVM with the RBF kernel. It is well known that they greatly affect the performance of SVM, so we do not discuss them. We assume that these two parameters and the threshold θ have been suitably selected, and discuss the effects of the number of components k and the initial budget B.
Due to the space constraint, we do not display the performance change when these parameters are set to different values, but give concluding remarks. The parameters k and B are related to how sufficiently the decision margin can be represented. For datasets with linear decision margins, such as Hy_f, Hy_s, Spam_data and Spam_corpus, small k and B are enough to obtain competitive results, and further increasing k and B does not improve the performance significantly.
Table 9 The average prequential accuracy of EBPegasos and the tree ensembles on low-dimensional datasets (%)
Datasets Oza ASHT ADWIN Lev DWM OAUE RF EBP
Hy_f 75.33 77.65 77.40 72.12 77.00 79.26 69.85 85.13
Hy_s 76.01 78.17 77.94 72.57 77.01 79.47 70.78 85.43
RBF_f 23.03 22.81 34.13 36.70 32.68 19.89 18.15 55.27
RBF_s 56.08 54.73 55.25 65.38 53.52 47.79 57.09 79.86
LED_s 62.52 65.60 66.51 66.02 66.36 65.76 63.42 65.66
LED_g 62.23 64.20 64.52 63.76 64.58 63.94 61.34 63.91
SEA_s 84.39 85.12 85.28 85.74 85.28 84.66 84.97 85.10
SEA_g 84.18 84.64 84.81 85.35 84.94 83.81 84.53 84.53
Tree_s 77.78 82.79 87.40 92.41 89.61 89.68 68.86 72.73
Tree_g 73.97 77.68 84.02 87.20 82.05 81.79 67.59 65.28
Waveform_s 84.32 83.10 84.52 84.41 82.88 83.48 82.91 85.54
Waveform_g 84.12 82.49 84.16 84.09 81.91 83.23 82.50 84.22
Covertype 89.01 86.42 88.58 93.25 88.49 89.17 89.72 92.32
Kddcup99 99.33 99.33 99.33 99.58 99.06 98.03 99.64 98.38
Wilcoxon R+ 82 69 62 52 67 79 95
R− 23 36 43 53 38 26 9
However, on datasets with nonlinear decision margins, e.g., RBF_f, RBF_s and Kddcup99, increasing k and B first improves the performance considerably; the improvement then diminishes and finally levels off. Clearly, the more complex a margin is, the more components and the larger the budgets needed to represent it. Once the margin has been sufficiently represented, further increasing k and B is unnecessary.
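To see why B caps the model size, the following is a rough sketch of a budgeted kernel Pegasos-style update with removal-based budget maintenance; all names are illustrative assumptions, and the actual BPegasos of Wang et al. (2012) may maintain its budget differently.

```python
# Rough sketch: kernel Pegasos with removal-based budget maintenance.
import math

def rbf(x, z, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

class BudgetedKernelPegasos:
    def __init__(self, lam=1e-4, gamma=1.0, budget=100):
        self.lam, self.gamma, self.B = lam, gamma, budget
        self.sv = []  # support set: (x, y, alpha) triples, never more than B
        self.t = 0

    def decision(self, x):
        return sum(a * sy * rbf(x, sx, self.gamma) for sx, sy, a in self.sv)

    def update(self, x, y):
        self.t += 1
        eta = 1.0 / (self.lam * self.t)
        violated = y * self.decision(x) < 1  # hinge loss active at current model
        # regularization step: shrink all coefficients by (1 - eta*lam) = 1 - 1/t
        self.sv = [(sx, sy, a * (1 - 1.0 / self.t)) for sx, sy, a in self.sv]
        if violated:
            self.sv.append((x, y, eta))
            if len(self.sv) > self.B:  # budget exceeded: drop the oldest SV
                self.sv.pop(0)
```

With such a learner, memory stays bounded by B regardless of the stream length, which is why a larger B buys a richer representation of a nonlinear margin at a higher resource cost.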
6 Conclusions
High-dimensional evolving streaming data is very common, yet very challenging to handle well. In this paper, we have proposed a novel online ensemble method, called EBPegasos, to cope with the challenges simultaneously posed by high dimensionality and concept drift in streaming data. In particular, EBPegasos maintains a group of BPegasos classifiers with different budget sizes, monitors and evaluates their performance using ADWIN drift detectors, and modifies the structure of the ensemble only when drifts that may produce severe effects are detected. The novelty of EBPegasos originates from three aspects: (1) it uses an online SVM classifier as the base learner, which endows it with a good capability to work on high-dimensional data; (2) it injects diversity among components by exploiting the advantages of different budget sizes; (3) it uses conditional structural modification, which makes it strike a good balance between exploiting and forgetting old knowledge. Experiments on four real-world datasets show that EBPegasos has an advantage on high-dimensional
data streams, in terms of not only the average prequential accuracy but also the resource usage, compared with the state-of-the-art tree ensembles. We also show that when all comparison methods use BPegasos as the base learner, EBPegasos still copes best with various types of concept drift. We are working on deriving a theoretical bound to explain the good adaptability of EBPegasos; such a bound may characterize the maximum amount of drift between time-adjacent concepts that EBPegasos can tolerate.
Acknowledgements This work is supported by the National NSF of China (Nos. 61432008, 61503178), the NSF and Primary R&D Plan of Jiangsu Province, China (Nos. BE2015213, BK20150587), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.
References
Abdulsalam H, Skillicorn DB, Martin P (2007) Streaming random forests. In: 11th international database engineering and applications symposium, pp 225–232
Abdulsalam H, Skillicorn DB, Martin P (2011) Classification using streaming random forests. IEEE Trans Knowl Data Eng 23(1):22–36
Abe S (2005) Support vector machines for pattern classification. Springer, London
Aggarwal CC, Yu PS (2008) Locust: an online analytical processing framework for high dimensional classification of data streams. In: Proceedings of the 24th IEEE international conference on data engineering, pp 426–435
Bifet A, Frank E (2010) Sentiment knowledge discovery in twitter streaming data. In: International conference on discovery science, pp 1–15
Bifet A, Gavalda R (2007) Learning from time-changing data with adaptive windowing. In: Proceedings of the 7th SIAM international conference on data mining, pp 443–448
Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavaldà R (2009) New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 139–148
Bifet A, Holmes G, Kirkby R, Pfahringer B (2010a) Moa: massive online analysis. J Mach Learn Res 11:1601–1604
Bifet A, Holmes G, Pfahringer B (2010b) Leveraging bagging for evolving data streams. In: Joint European conference on machine learning and knowledge discovery in databases, pp 135–150
Bifet A, Holmes G, Pfahringer B, Frank E (2010c) Fast perceptron decision tree learning from evolving data streams. In: Pacific-Asia conference on knowledge discovery and data mining, pp 299–310
Bifet A, Pfahringer B, Read J, Holmes G (2013) Efficient data stream classification via probabilistic adaptive windows. In: Proceedings of the 28th annual ACM symposium on applied computing, pp 801–806
Brzezinski D, Stefanowski J (2011) Accuracy updated ensemble for data streams with concept drift. In: International conference on hybrid artificial intelligence systems, pp 155–163
Brzezinski D, Stefanowski J (2014a) Combining block-based and online methods in learning ensembles from concept drifting data streams. Inf Sci 265:50–67
Brzezinski D, Stefanowski J (2014b) Reacting to different types of concept drift: the accuracy updated ensemble algorithm. IEEE Trans Neural Netw Learn Syst 25(1):81–94
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Denil M, Matheson D, De Freitas N (2013) Consistency of online random forests. In: Proceedings of the 30th international conference on machine learning, pp 1256–1264
Do TN, Lenca P, Lallich S, Pham NK (2010) Classifying very-high-dimensional data with random forests of oblique decision trees. In: Advances in knowledge discovery and management. Springer, pp 39–55
Domingos P, Hulten G (2000) Mining high-speed data streams. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, pp 71–80
Elwell R, Polikar R (2011) Incremental learning of concept drift in nonstationary environments. IEEE Trans Neural Netw 22(10):1517–1531
Gama J, Fernandes R, Rocha R (2006) Decision trees for mining data streams. Intell Data Anal 10(1):23–45
Gama J, Sebastiao R, Rodrigues PP (2013) On evaluating stream learning algorithms. Mach Learn 90(3):317–346
Gama J, Zliobaite I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):44
Holmes G, Kirkby R, Pfahringer B (2005) Stress-testing hoeffding trees. In: European conference on principles of data mining and knowledge discovery, pp 495–502
Hosseini MJ, Gholipour A, Beigy H (2015) An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary data streams. Knowl Inf Syst 46:1–31
Hsu CW, Chang CC, Lin CJ, et al (2003) A practical guide to support vector classification. https://www.cs.sfu.ca/people/Faculty/teaching/726/spring11/svmguide.pdf
Katakis I, Tsoumakas G, Banos E, Bassiliades N, Vlahavas I (2009) An adaptive personalized news dissemination system. J Intell Inf Syst 32(2):191–212
Katakis I, Tsoumakas G, Vlahavas I (2010) Tracking recurring contexts using ensemble classifiers: an application to email filtering. Knowl Inf Syst 22(3):371–391
Kolter JZ, Maloof MA (2007) Dynamic weighted majority: an ensemble method for drifting concepts. J Mach Learn Res 8:2755–2790
Krempl G, Žliobaite I, Brzezinski D, Hüllermeier E, Last M, Lemaire V, Noack T, Shaker A, Sievi S, Spiliopoulou M, Stefanowski J (2014) Open challenges for data stream mining research. SIGKDD Explor 16(1):1–10
Lakshminarayanan B, Roy DM, Teh YW (2014) Mondrian forests: efficient online random forests. In: Advances in neural information processing systems, pp 3140–3148
Liu Y, Zhou Y (2014) Online detection of concept drift in visual tracking. In: International conference on neural information processing, pp 159–166
McCallum A, Nigam K et al (1998) A comparison of event models for naive bayes text classification. In: AAAI-98 workshop on learning for text categorization, vol 752, pp 41–48
Minku LL, Yao X (2012) DDD: a new ensemble approach for dealing with concept drift. IEEE Trans Knowl Data Eng 24(4):619–633
Minku LL, White AP, Yao X (2010) The impact of diversity on online ensemble learning in the presence of concept drift. IEEE Trans Knowl Data Eng 22(5):730–742
Oza NC (2005) Online bagging and boosting. In: 2005 IEEE international conference on systems, man and cybernetics, vol 3, pp 2340–2345
Pappu V, Pardalos PM (2014) High-dimensional data classification. In: Clusters, orders, and trees: methods and applications. Springer, pp 119–150
Rutkowski L, Pietruczuk L, Duda P, Jaworski M (2013) Decision trees for mining data streams based on the McDiarmid's bound. IEEE Trans Knowl Data Eng 25(6):1272–1279
Saffari A, Leistner C, Santner J, Godec M, Bischof H (2009) On-line random forests. In: 2009 IEEE 12th international conference on computer vision workshops, pp 1393–1400
Shalev-Shwartz S, Singer Y, Srebro N, Cotter A (2011) Pegasos: primal estimated sub-gradient solver for SVM. Math Program 127(1):3–30
Tomasev N, Radovanovic M, Mladenic D, Ivanovic M (2014) The role of hubness in clustering high-dimensional data. IEEE Trans Knowl Data Eng 26(3):739–751
Wang Z, Crammer K, Vucetic S (2012) Breaking the curse of kernelization: budgeted stochastic gradient descent for large-scale SVM training. J Mach Learn Res 13(1):3103–3131
Wang D, Wu P, Zhao P, Wu Y, Miao C, Hoi SC (2014) High-dimensional data stream classification via sparse online learning. In: 2014 IEEE international conference on data mining, pp 1007–1012
Ye Y, Wu Q, Huang JZ, Ng MK, Li X (2013) Stratified sampling for feature subspace selection in random forests for high dimensional data. Pattern Recognit 46(3):769–787
Zhang X, Furtlehner C, Germain-Renaud C, Sebag M (2014) Data stream clustering with affinity propagation. IEEE Trans Knowl Data Eng 26(7):1644–1656
Zliobaite I, Gabrys B (2014) Adaptive preprocessing for streaming data. IEEE Trans Knowl Data Eng 26(2):309–321
Zliobaite I, Bifet A, Read J, Pfahringer B, Holmes G (2015) Evaluation methods and decision theory for classification of streaming data with temporal dependence. Mach Learn 98(3):455–482