Data Stream Classification and Novel Class Detection


Transcript of Data Stream Classification and Novel Class Detection

Page 1: Data Stream Classification and Novel Class Detection

Data Stream Classification and Novel Class Detection

Mehedy Masud, Latifur Khan, Qing Chen, and Bhavani Thuraisingham
Department of Computer Science, University of Texas at Dallas

Jing Gao, Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign

Charu Aggarwal
IBM T. J. Watson

This work was funded in part by

Aug 10, 2011

Page 2: Data Stream Classification and Novel Class Detection

Outline of the Presentation

Background
Data Stream Classification
Novel Class Detection

Page 3: Data Stream Classification and Novel Class Detection

Introduction

Characteristics of data streams:
◦ Continuous flow of data
◦ Examples:
  Network traffic
  Sensor data
  Call center records

Page 4: Data Stream Classification and Novel Class Detection

Data Stream Classification

Uses past labeled data to build a classification model
Predicts the labels of future instances using the model
Helps decision making

[Figure: network traffic flows into a classification model; attack traffic is sent to the firewall to be blocked and quarantined, benign traffic proceeds to the server; expert analysis and labeling feed model updates back into the model.]

Page 5: Data Stream Classification and Novel Class Detection

Data Stream Classification (cont.)

What are the applications?
◦ Security monitoring
◦ Network monitoring and traffic engineering
◦ Business: credit card transaction flows
◦ Telecommunication calling records
◦ Web logs and web page click streams

Page 6: Data Stream Classification and Novel Class Detection

Challenges

Infinite length
Concept-drift
Concept-evolution
Feature-evolution

Page 7: Data Stream Classification and Novel Class Detection

Infinite Length

Impractical to store and use all historical data
◦ Requires infinite storage
◦ and infinite running time

Page 8: Data Stream Classification and Novel Class Detection

Concept-Drift

[Figure: a data chunk of positive and negative instances; the decision hyperplane shifts from its previous position to the current one, and instances falling between the two hyperplanes are victims of concept-drift.]

Page 9: Data Stream Classification and Novel Class Detection

Concept-Evolution

[Figure: a feature space partitioned by thresholds x1, y1, y2 into regions A, B, C, D, populated by existing classes + and −; a novel class (denoted by x) then arrives inside this space.]

Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = −

Existing classification models misclassify novel class instances.

Page 10: Data Stream Classification and Novel Class Detection

Dynamic Features

Why do new features evolve?
◦ Infinite data stream
  Normally, the global feature set is unknown
  New features may appear
◦ Concept drift
  As concepts drift, new features may appear
◦ Concept evolution
  A new class normally brings a new set of features

Different chunks may have different feature sets.

Page 11: Data Stream Classification and Novel Class Detection

Dynamic Features

Existing classification models require a complete, fixed feature set applied to all chunks. Global features are difficult to predict. One solution is to use all English words to generate the feature vector, but the dimension of that vector would be far too high.

[Figure: pipeline from feature extraction & selection over the ith and (i+1)st chunks, through feature space conversion, to training a new model and classification & novel class detection; the ith chunk, the (i+1)st chunk, and the models have different feature sets, e.g. {runway, climb}, {runway, clear, ramp}, {runway, ground, ramp}.]

Page 12: Data Stream Classification and Novel Class Detection

Outline of the Presentation

Introduction
Data Stream Classification
Novel Class Detection

Page 13: Data Stream Classification and Novel Class Detection

Data Stream Classification (cont.)

Single-model incremental classification
Ensemble (model-based) classification
◦ Supervised
◦ Semi-supervised
◦ Active learning

Page 14: Data Stream Classification and Novel Class Detection

Overview

Single-model incremental classification
Ensemble (model-based) classification
◦ Data selection
◦ Semi-supervised
◦ Skewed data

Page 15: Data Stream Classification and Novel Class Detection

Ensemble of Classifiers

[Figure: an unlabeled input (x, ?) is classified by individual classifiers C1, C2, C3, whose outputs (+, +, −) are combined by voting into the ensemble output +.]

Page 16: Data Stream Classification and Novel Class Detection

Ensemble Classification of Data Streams

Divide the data stream into equal-sized chunks
◦ Train a classifier from each data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3

[Figure: labeled chunks D1, D2, D3 train classifiers C1, C2, C3, which form the ensemble used to predict the unlabeled chunk D4; once D4 is labeled it trains C4, which can replace a weaker ensemble member, and the process continues with D5, D6, ...]

Addresses infinite length and concept-drift.
Note: Di may contain data points from different classes. (A code sketch follows below.)
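To make the chunked-ensemble loop concrete, here is a minimal Java sketch, assuming a generic Classifier interface (hypothetical names; the deck says the authors implemented in Java, but this is not their code): train a model per labeled chunk, keep the best L by accuracy on the latest chunk, and classify by majority vote.

```java
import java.util.*;

/** Minimal sketch of chunk-based ensemble classification. */
interface Classifier {
    void train(List<double[]> chunk, List<Integer> labels);
    int predict(double[] x);
    double accuracyOn(List<double[]> chunk, List<Integer> labels);
}

class ChunkEnsemble {
    private final int L;                              // ensemble size
    private final List<Classifier> ensemble = new ArrayList<>();

    ChunkEnsemble(int L) { this.L = L; }

    /** Train a new model on the latest labeled chunk and keep the best L
     *  models, evaluated on that same chunk (a common heuristic). */
    void update(Classifier fresh, List<double[]> chunk, List<Integer> labels) {
        fresh.train(chunk, labels);
        ensemble.add(fresh);
        ensemble.sort(Comparator.comparingDouble(
                c -> -c.accuracyOn(chunk, labels)));  // best first
        while (ensemble.size() > L) ensemble.remove(ensemble.size() - 1);
    }

    /** Classify by majority vote of the ensemble members. */
    int classify(double[] x) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (Classifier c : ensemble)
            votes.merge(c.predict(x), 1, Integer::sum);
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
    }
}
```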

Page 17: Data Stream Classification and Novel Class Detection

Concept-Evolution Problem

A completely new class of data arrives in the stream.

[Figure: (a) a decision tree that first tests x < x1, then y < y1 (true branch) or y < y2 (false branch), labeling the leaves + and −; (b) the corresponding feature space partitioning into regions A, B, C, D; (c) a novel class (denoted by x) arrives in the stream.]

ECSMiner

Page 18: Data Stream Classification and Novel Class Detection

ECSMiner: Overview

[Figure: overview of the ECSMiner algorithm. Older (labeled) instances from the last labeled chunk are used for training, producing a new model that updates the ensemble of L models M1, M2, ..., ML. Each newly arrived (unlabeled) instance xnow goes through outlier detection: if it is not an outlier it is classified immediately; otherwise it is buffered for novel class detection.]

Based on: Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. "Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams". In Proc. 2009 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD '09), Bled, Slovenia, Sept. 7-11, 2009, pp. 79-94 (extended version in IEEE Transactions on Knowledge and Data Engineering (TKDE)).

Page 19: Data Stream Classification and Novel Class Detection

Algorithm

◦ Training
◦ Novel class detection and classification

(Pseudocode shown on slide.)

ECSMiner

Page 20: Data Stream Classification and Novel Class Detection

Novel Class Detection

Non-parametric: does not assume any underlying model of the existing classes.

Steps:
1. Create and save a decision boundary during training
2. Detect and filter outliers
3. Measure cohesion and separation among test and training instances

ECSMiner

Page 21: Data Stream Classification and Novel Class Detection

Training: Creating Decision Boundary

[Figure: raw training data (+ and − instances in regions A, B, C, D delimited by x1, y1, y2) are clustered; each cluster is summarized as a pseudopoint, and the pseudopoints form the decision boundary.]

Addresses the infinite length problem. (A code sketch follows below.)

ECSMiner
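A minimal sketch of how a training chunk can be summarized as pseudopoints, assuming plain k-means and hypothetical class names (not the authors' exact implementation): each cluster is reduced to a centroid, radius, and count, so the raw data can be discarded.

```java
import java.util.*;

/** Summarize a training chunk as K pseudopoints via plain k-means. */
class Pseudopoint {
    final double[] centroid; final double radius; final int count;
    Pseudopoint(double[] c, double r, int n) { centroid = c; radius = r; count = n; }
}

class DecisionBoundary {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static List<Pseudopoint> build(List<double[]> chunk, int K, int iters) {
        int dim = chunk.get(0).length, n = chunk.size();   // assumes non-empty chunk
        double[][] centers = new double[K][];
        Random rnd = new Random(42);
        for (int k = 0; k < K; k++) centers[k] = chunk.get(rnd.nextInt(n)).clone();
        int[] assign = new int[n];
        for (int it = 0; it < iters; it++) {
            for (int i = 0; i < n; i++) {                  // assignment step
                double best = Double.MAX_VALUE;
                for (int k = 0; k < K; k++) {
                    double d = dist(chunk.get(i), centers[k]);
                    if (d < best) { best = d; assign[i] = k; }
                }
            }
            double[][] sum = new double[K][dim]; int[] cnt = new int[K];
            for (int i = 0; i < n; i++) {                  // update step
                cnt[assign[i]]++;
                for (int j = 0; j < dim; j++) sum[assign[i]][j] += chunk.get(i)[j];
            }
            for (int k = 0; k < K; k++)
                if (cnt[k] > 0)
                    for (int j = 0; j < dim; j++) centers[k][j] = sum[k][j] / cnt[k];
        }
        List<Pseudopoint> boundary = new ArrayList<>();
        for (int k = 0; k < K; k++) {                      // cluster summaries
            double r = 0; int c = 0;
            for (int i = 0; i < n; i++)
                if (assign[i] == k) { r = Math.max(r, dist(chunk.get(i), centers[k])); c++; }
            if (c > 0) boundary.add(new Pseudopoint(centers[k], r, c));
        }
        return boundary;  // raw data can now be discarded: addresses infinite length
    }
}
```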

Page 22: Data Stream Classification and Novel Class Detection

Outlier Detection and Filtering

[Figure: a test instance inside the decision boundary (regions A, B, C, D) is not an outlier; a test instance outside the decision boundary is a raw outlier, or Routlier.]

A test instance x is checked against the ensemble of L models M1, M2, ..., ML. If x is an Routlier for ALL models (logical AND), it is a filtered outlier (Foutlier), a potential novel class instance; otherwise it is an existing class instance.

Routliers may appear as a result of a novel class, concept-drift, or noise. Therefore they are filtered to reduce noise as much as possible. (A code sketch follows below.)

ECSMiner
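Continuing the sketch above (reusing the Pseudopoint class and DecisionBoundary.dist), the outlier filter might look like this: an instance is a raw outlier (Routlier) for one model if it falls outside every pseudopoint's radius, and a filtered outlier (Foutlier) only if all L models agree.

```java
import java.util.List;

/** Routlier / Foutlier checks over an ensemble of pseudopoint models. */
class OutlierFilter {
    static boolean isRoutlier(List<Pseudopoint> model, double[] x) {
        for (Pseudopoint p : model)
            if (DecisionBoundary.dist(x, p.centroid) <= p.radius)
                return false;                      // inside the decision boundary
        return true;                               // outside every pseudopoint
    }

    static boolean isFoutlier(List<List<Pseudopoint>> ensemble, double[] x) {
        for (List<Pseudopoint> model : ensemble)
            if (!isRoutlier(model, x)) return false;   // some model claims x
        return true;   // potential novel class instance: buffer it
    }
}
```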

Page 23: Data Stream Classification and Novel Class Detection

Novel Class Detection

[Flowchart: (Step 1) a test instance x is checked against the ensemble of L models M1, M2, ..., ML; if it is an Routlier for all models (AND), it is a filtered outlier (Foutlier), otherwise an existing class instance. (Step 2) Foutliers are buffered. (Step 3) q-NSC is computed with all models and the other Foutliers. (Step 4) If q-NSC > 0 for more than q Foutliers (q' > q) with all models, a novel class is found; otherwise the instances are treated as existing class.]

ECSMiner

Page 24: Data Stream Classification and Novel Class Detection

Computing Cohesion & Separation

Let λo,q(x) be the q nearest Foutlier neighbors of an Foutlier x, and λc,q(x) the q nearest existing class instances of class c (e.g., λ+,5(x) and λ−,5(x) in the figure).

a(x) = mean distance from an Foutlier x to the instances in λo,q(x)
bc(x) = mean distance from x to the instances in λc,q(x)
bmin(x) = minimum among all bc(x) (e.g., b+(x) in the figure)

q-Neighborhood Silhouette Coefficient (q-NSC):

q-NSC(x) = (bmin(x) − a(x)) / max(bmin(x), a(x))

If q-NSC(x) is positive, x is closer to the Foutliers than to any existing class. (A code sketch follows below.)

[Figure: an Foutlier x with its neighborhoods λo,5(x), λ+,5(x), λ−,5(x) and the corresponding mean distances a(x), b+(x), b−(x).]

ECSMiner
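A minimal sketch of the q-NSC computation as defined above (hypothetical helper names, reusing DecisionBoundary.dist from the earlier sketch; it assumes x itself is not contained in the Foutlier list):

```java
import java.util.*;

/** q-NSC for one Foutlier: cohesion a(x) to the q nearest fellow Foutliers
 *  vs. separation b_min(x), the smallest mean distance to the q nearest
 *  instances of any existing class. */
class Cohesion {
    static double qNSC(double[] x, List<double[]> foutliers,
                       Map<String, List<double[]>> classInstances, int q) {
        double a = meanNearestDist(x, foutliers, q);
        double bMin = Double.MAX_VALUE;
        for (List<double[]> cls : classInstances.values())
            bMin = Math.min(bMin, meanNearestDist(x, cls, q));
        return (bMin - a) / Math.max(bMin, a);  // positive: closer to Foutliers
    }

    static double meanNearestDist(double[] x, List<double[]> pts, int q) {
        double[] d = pts.stream()
                .mapToDouble(p -> DecisionBoundary.dist(x, p)).sorted().toArray();
        double s = 0; int m = Math.min(q, d.length);
        for (int i = 0; i < m; i++) s += d[i];   // average over q nearest
        return s / m;
    }
}
```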

Page 25: Data Stream Classification and Novel Class Detection

Speeding Up

Computing q-NSC for every Foutlier instance takes time quadratic in the number of Foutliers. To make the computation faster, we:
◦ create Ko pseudopoints (Fpseudopoints) from the Foutliers using K-means clustering, where Ko = (No / S) * K; here S is the chunk size, K the number of pseudopoints per chunk, and No the number of Foutliers
◦ perform the computations on the Fpseudopoints instead

Thus the time complexity
◦ to compute the q-NSC of all the Fpseudopoints is O(Ko(Ko + K)),
◦ which is constant, since both Ko and K are independent of the input size.
◦ However, by gaining speed we lose some precision, although the loss is negligible (analyzed shortly).

Page 26: Data Stream Classification and Novel Class Detection

Algorithm To Detect Novel Class

(Pseudocode shown on slide.)

ECSMiner

Page 27: Data Stream Classification and Novel Class Detection

"Speedup" Penalty

As discussed earlier, by speeding up the computation in step 3 we lose some precision, since the result deviates from the exact result. This analysis shows that the deviation is negligible.

[Figure: illustrating the computation of deviation using the squared distances (i−j)², (x−j)², and (x−i)². Here i is an Fpseudopoint, i.e., a cluster of Foutliers, and j is an existing class pseudopoint, i.e., a cluster of existing class instances. In this particular example, all instances in i belong to a novel class.]

Page 28: Data Stream Classification and Novel Class Detection

"Speedup" Penalty

Approximate: (equation on slide)
Exact: (equation on slide)
Deviation: (equation on slide)

Page 29: Data Stream Classification and Novel Class Detection

Experiments - Datasets

We evaluated our approach on two synthetic and two real datasets:
• SynC - synthetic data with only concept-drift, generated using a hyperplane equation; 2 classes, 10 attributes, 250K instances
• SynCN - synthetic data with concept-drift and novel classes, generated using Gaussian distributions; 20 classes, 40 attributes, 400K instances
• KDD Cup 1999 intrusion detection (10% version) - real dataset; 23 classes, 34 attributes, 490K instances
• Forest Cover - real dataset; 7 classes, 54 attributes, 581K instances

Page 30: Data Stream Classification and Novel Class Detection

Experiments - Setup

Development:
◦ Language: Java

H/W:
◦ Intel P-IV with
◦ 2GB memory and
◦ 3GHz dual-processor CPU

Parameter settings:
◦ K (number of pseudopoints per chunk) = 50
◦ N (minimum number of instances required to declare a novel class) = 50
◦ M (ensemble size) = 6
◦ S (chunk size) = 2,000

Page 31: Data Stream Classification and Novel Class Detection

Experiments - Baseline

Competing approaches:
◦ i) MineClass (MC): our approach
◦ ii) WCE-OLINDDA_Parallel (W-OP)
◦ iii) WCE-OLINDDA_Single (W-OS)

WCE-OLINDDA is a combination of the Weighted Classifier Ensemble (WCE) and the novel class detector OLINDDA, with default parameter settings for both. We use this combination because, to the best of our knowledge, no other approach can classify and detect novel classes simultaneously.

OLINDDA assumes there is only one normal class and that all other classes are novel. Therefore we apply two variations:
◦ W-OP keeps parallel OLINDDA models, one for each class
◦ W-OS keeps a single model that absorbs a novel class when encountered

Page 32: Data Stream Classification and Novel Class Detection

Experiments - Results

Evaluation metrics:
◦ Mnew = % of novel class instances Misclassified as existing class = Fn * 100 / Nc
◦ Fnew = % of existing class instances Falsely identified as novel class = Fp * 100 / (N − Nc)
◦ ERR = total misclassification error (%), including Mnew and Fnew = (Fp + Fn + Fe) * 100 / N

where
◦ Fn = total novel class instances misclassified as existing class,
◦ Fp = total existing class instances misclassified as novel class,
◦ Fe = total existing class instances misclassified (other than Fp),
◦ Nc = total novel class instances in the stream,
◦ N = total instances in the stream.

Page 33: Data Stream Classification and Novel Class Detection

Experiments - Results

[Charts: results on Forest Cover, KDD Cup, and SynCN.]

Page 34: Data Stream Classification and Novel Class Detection

Experiments - Results

[Charts shown on slide.]

Page 35: Data Stream Classification and Novel Class Detection

Experiments - Parameter Sensitivity

[Parameter sensitivity charts shown on slide.]

Page 36: Data Stream Classification and Novel Class Detection

Experiments - Runtime

[Runtime comparison charts shown on slide.]

Page 37: Data Stream Classification and Novel Class Detection

Dynamic Features

Solution:
◦ Global features
◦ Local features
◦ Union

Mohammad Masud, Qing Chen, Latifur Khan, Jing Gao, Jiawei Han, and Bhavani Thuraisingham, "Classification and Novel Class Detection of Data Streams in a Dynamic Feature Space," in Proc. of Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, Barcelona, Spain, Sept. 2010, Springer, pp. 337-352.

Page 38: Data Stream Classification and Novel Class Detection

Feature Mapping Across Models and Test Data Points

The feature set varies across chunks. In particular, when a new class appears, new features should be selected and added to the feature set.

Strategy 1 - Lossy fixed (Lossy-F) conversion / Global
◦ Use the same fixed feature set for the entire stream.
◦ We call this a lossy conversion because future models and instances may lose important features due to this mapping.

Strategy 2 - Lossy local (Lossy-L) conversion / Local
◦ We call this a lossy conversion because it may lose feature values during mapping.

Strategy 3 - Dimension preserving (D-Preserving) mapping / Union

Page 39: Data Stream Classification and Novel Class Detection

Feature Space Conversion - Lossy-L Mapping (Local)

Assume each data chunk has a different feature vector.
When a classification model is trained, we save the feature vector with the model.
When an instance is tested, its feature vector is mapped (i.e., projected) onto the model's feature vector.

Page 40: Data Stream Classification and Novel Class Detection

Feature Space Conversion - Lossy-L Mapping

For example:
◦ Suppose the model has two features (x, y)
◦ The instance has two features (y, z)
◦ When testing, the instance is treated as having the model's two features (x, y),
◦ where x = 0 and the y value is kept as it is. (A code sketch follows below.)
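A minimal sketch of the Lossy-L projection, assuming instances arrive as feature-name to value maps (a hypothetical representation):

```java
import java.util.*;

/** Lossy-L: project a test instance onto the model's saved feature list.
 *  Features the model does not know are dropped; features the instance
 *  lacks become 0. */
class LossyL {
    static double[] map(Map<String, Double> instance, List<String> modelFeatures) {
        double[] mapped = new double[modelFeatures.size()];
        for (int i = 0; i < modelFeatures.size(); i++)
            mapped[i] = instance.getOrDefault(modelFeatures.get(i), 0.0);
        return mapped;
    }
}

// Example: model features (x, y); instance has (y, z).
// map({y=2.0, z=5.0}, [x, y]) -> [0.0, 2.0]; the instance's z is lost.
```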

Page 41: Data Stream Classification and Novel Class Detection

Conversion Strategy II - Lossy-L Mapping

Graphically: [figure showing the projection of the instance's feature vector onto the model's feature vector.]

Page 42: Data Stream Classification and Novel Class Detection

Conversion Strategy III - D-Preserving Mapping

When an instance is tested, both the model's feature vector and the instance's feature vector are mapped (i.e., projected) onto the union of their feature vectors.
◦ The feature dimension is increased.
◦ In the mapping, the features of both the test instance and the model are preserved; the extra features are filled with 0s.

Page 43: Data Stream Classification and Novel Class Detection

Conversion Strategy III - D-Preserving Mapping

For example:
◦ Suppose the model has three features (a, b, c)
◦ The instance has four features (b, c, d, e)
◦ When testing, we project both the model's feature vector and the instance's feature vector onto (a, b, c, d, e)
◦ Therefore, in the model, d and e are treated as 0s, and in the instance, a is treated as 0. (A code sketch follows below.)
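A minimal sketch of the D-Preserving (union) mapping under the same map-based representation assumed earlier:

```java
import java.util.*;

/** D-Preserving: project both the model's and the instance's feature
 *  vectors onto the union of their feature sets; missing dimensions on
 *  either side become 0. */
class DPreserving {
    static List<String> unionFeatures(List<String> model, Set<String> instance) {
        LinkedHashSet<String> union = new LinkedHashSet<>(model);
        union.addAll(instance);              // preserves both feature sets
        return new ArrayList<>(union);
    }

    static double[] project(Map<String, Double> values, List<String> union) {
        double[] v = new double[union.size()];
        for (int i = 0; i < union.size(); i++)
            v[i] = values.getOrDefault(union.get(i), 0.0);  // absent -> 0
        return v;
    }
}

// Example: model (a, b, c), instance (b, c, d, e) -> union (a, b, c, d, e);
// the model's d and e become 0s, and the instance's a becomes 0.
```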

Page 44: Data Stream Classification and Novel Class Detection

Conversion Strategy III - D-Preserving Mapping

[Figure: the previous example shown graphically.]

Page 45: Data Stream Classification and Novel Class Detection

Discussion

Local does not favor the novel class; it favors existing classes.
◦ Local features are enough to model existing classes.

Union favors the novel class.
◦ New features may be discriminating for the novel class, hence Union works.

Page 46: Data Stream Classification and Novel Class Detection

Comparison

Which strategy is better?

Assumption: the lossless conversion (union) preserves the properties of a novel class. In other words, if an instance belongs to a novel class, it remains outside the decision boundary of every model Mi of the ensemble M in the converted feature space.

Lemma: If a test point x belongs to a novel class, it will be misclassified by the ensemble M as an existing class instance under certain conditions when the Lossy-L conversion is used.

Page 47: Data Stream Classification and Novel Class Detection

Comparison

Proof:
Let X1, ..., XL, XL+1, ..., XM be the dimensions of the model, and
let X1, ..., XL, XM+1, ..., XN be the dimensions of the test point.
Suppose the radius of the closest cluster (in the higher dimension) is R.
Also, let the test point be a novel class instance.
Combined feature space = X1, ..., XL, XL+1, ..., XM, XM+1, ..., XN

Page 48: Data Stream Classification and Novel Class Detection

Comparison

Proof (continued):
Combined feature space = X1, ..., XL, XL+1, ..., XM, XM+1, ..., XN
Centroid of the cluster (original space): X1 = x1, ..., XL = xL, XL+1 = xL+1, ..., XM = xM, i.e., (x1, ..., xL, xL+1, ..., xM)
Centroid of the cluster (combined space): (x1, ..., xL, xL+1, ..., xM, 0, ..., 0)
Test point (original space): X1 = x'1, ..., XL = x'L, XM+1 = x'M+1, ..., XN = x'N, i.e., (x'1, ..., x'L, x'M+1, ..., x'N)
Test point (combined space): (x'1, ..., x'L, 0, ..., 0, x'M+1, ..., x'N)

Page 49: Data Stream Classification and Novel Class Detection

Comparison

Proof (continued):
Centroid (combined space): (x1, ..., xL, xL+1, ..., xM, 0, ..., 0)
Test point (combined space): (x'1, ..., x'L, 0, ..., 0, x'M+1, ..., x'N)

Since the test point is a novel class instance, it lies outside the closest cluster in the combined space:

R² < ((x1 − x'1)² + ... + (xL − x'L)² + x²L+1 + ... + x²M) + (x'²M+1 + ... + x'²N)

Call the first parenthesized term a² (the squared distance that remains under the Lossy-L conversion) and the second term b². Then:

R² < a² + b²
R² = a² + b² − e²  (for some e² > 0)
a² = R² + (e² − b²)
a² < R²  (provided that e² < b²)

Therefore, under the Lossy-L conversion the test point will not be an outlier, and hence is misclassified as an existing class instance.

Page 50: Data Stream Classification and Novel Class Detection

Baseline Approaches

WCE is the Weighted Classifier Ensemble [1], a multi-class ensemble classifier.
OLINDDA is a novel class detector [2] that works only for binary classes.
FAE is an ensemble classifier that addresses feature evolution [3] and concept drift.
ECSMiner is a multi-class ensemble classifier that addresses concept drift and concept evolution [4].

Page 51: Data Stream Classification and Novel Class Detection

Approaches Comparison

Proposed techniques vs. challenges:

Technique   Infinite length   Concept-drift   Concept-evolution   Dynamic features
OLINDDA     -                 -               ✓ (binary only)     -
WCE         ✓                 ✓               -                   -
FAE         ✓                 ✓               -                   ✓
ECSMiner    ✓                 ✓               ✓                   -
DXMiner     ✓                 ✓               ✓                   ✓

Page 52: Data Stream Classification and Novel Class Detection

Experiments: Datasets

We evaluated our approach on different datasets:

Data Set      Concept Drift   Concept Evolution   Dynamic Feature   # of Instances   # of Classes
KDD           ✓               ✓                   -                 492K             7
Forest Cover  ✓               ✓                   -                 387K             7
NASA          ✓               ✓                   ✓                 140K             21
Twitter       ✓               ✓                   ✓                 335K             21

Page 53: Data Stream Classification and Novel Class Detection

Experiments: Results

Evaluation metrics: let
◦ Fn = total novel class instances misclassified as existing class,
◦ Fp = total existing class instances misclassified as novel class,
◦ Fe = total existing class instances misclassified (other than Fp),
◦ Nc = total novel class instances in the stream,
◦ N = total instances in the stream

Page 54: Data Stream Classification and Novel Class Detection

Experiments: Results

We use the following performance metrics to evaluate our technique:
◦ Mnew = % of novel class instances Misclassified as existing class, i.e., Fn * 100 / Nc
◦ Fnew = % of existing class instances Falsely identified as novel class, i.e., Fp * 100 / (N − Nc)
◦ ERR = total misclassification error (%), including Mnew and Fnew, i.e., (Fp + Fn + Fe) * 100 / N

Page 55: Data Stream Classification and Novel Class Detection

Experiments: Setup

Development:
◦ Language: Java

H/W:
◦ Intel P-IV with
◦ 3GB memory and
◦ 3GHz dual-processor CPU

Parameter settings:
◦ K (number of pseudopoints per chunk) = 50
◦ q (minimum number of instances required to declare a novel class) = 50
◦ L (ensemble size) = 6
◦ S (chunk size) = 1,000

Page 56: Data Stream Classification and Novel Class Detection

Experiments: Baseline

Competing approaches:
◦ i) DXMiner (DXM): our approach, with the following variations:
  Lossy-F conversion
  Lossy-L conversion
  D-Preserving conversion
◦ ii) FAE-WCE-OLINDDA_Parallel (W-OP)
  Assumes there is only one normal class and that all other classes are novel. W-OP keeps parallel OLINDDA models, one for each class.
  We use this combination because, to the best of our knowledge, no other approach can classify and detect novel classes simultaneously with feature evolution.
◦ iii) FAE-ECSMiner

Page 57: Data Stream Classification and Novel Class Detection

Twitter Results

[Results chart shown on slide.]

Page 58: Data Stream Classification and Novel Class Detection

Twitter Results

Method   D-Preserving   Lossy-Local   Lossy-Global   O-F
AUC      0.88           0.83          0.76           0.56

Page 59: Data Stream Classification and Novel Class Detection

NASA Dataset

Method   Deviation   Info Gain   O-F
AUC      0.996       0.967       0.876

Page 60: Data Stream Classification and Novel Class Detection

Forest Cover Results

[Results chart shown on slide.]

Page 61: Data Stream Classification and Novel Class Detection

Forest Cover Results

Method   D-Preserving   O-F
AUC      0.97           0.74

Page 62: Data Stream Classification and Novel Class Detection

KDD Results

[Results chart shown on slide.]

Page 63: Data Stream Classification and Novel Class Detection

KDD Results

Method   D-Preserving   FAE-Olindda
AUC      0.98           0.96

Page 64: Data Stream Classification and Novel Class Detection

Summary Results

[Summary table shown on slide.]

Page 65: Data Stream Classification and Novel Class Detection

Improved Outlier Detection and Multiple Novel Class Detection

Challenges
◦ High false positive (FP) rates (existing classes detected as novel) and false negative (FN) rates (missed novel classes)
◦ Two or more novel classes arriving at a time

Solutions [1]
◦ Dynamic decision boundary, based on previous mistakes: inflate the decision boundary if FP is high, deflate it if FN is high
◦ Build a statistical model to filter noisy data and concept drift out of the outliers
◦ Multiple novel classes are detected by:
  constructing a graph in which each outlier cluster is a vertex,
  merging the vertices based on the silhouette coefficient, and
  counting the number of connected components in the resultant (i.e., merged) graph

[1] Mohammad M. Masud, Qing Chen, Jing Gao, Latifur Khan, Charu Aggarwal, Jiawei Han, and Bhavani Thuraisingham, "Addressing Concept-Evolution in Concept-Drifting Data Streams," in Proc. ICDM '10, Sydney, Australia, Dec. 14-17, 2010.

Proposed Methods

Page 66: Data Stream Classification and Novel Class Detection

Outlier Threshold (OUTTH)

To declare a test instance an outlier, using the cluster radius r alone is not enough, because of noise in the data.

[Figure: an instance x just beyond the radius of a model cluster, with its neighborhoods λo,5(x) and λ+,5(x) and distances a(x), b+(x).]

◦ So, beyond the radius r, a threshold (OUTTH) is set up so that most noisy data around a model cluster is classified immediately.

Proposed Methods

Page 67: Data Stream Classification and Novel Class Detection

Outlier Threshold (OUTTH)

Every instance outside the cluster range has a weight,

wt(x) = exp(r − b(x)),

where b(x) is the distance from x to the center of the nearest model cluster and r is that cluster's radius.
◦ If wt(x) >= OUTTH, the instance is considered an existing class instance.
◦ If wt(x) < OUTTH, the instance is an outlier.

Pros:
◦ Noisy data is classified immediately

Cons:
◦ OUTTH is hard to determine
  Noisy data and novel class instances may occur simultaneously
  Different datasets may need different OUTTH values

(A code sketch follows below.)

Proposed Methods
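A small sketch of the OUTTH test, assuming the exponential weight form wt(x) = exp(r − b(x)) reconstructed above (the exact form of the weight in the authors' paper may differ):

```java
/** OUTTH test sketch: wt equals 1 on the cluster boundary and decays as
 *  the instance moves farther outside the radius r; b is the distance
 *  from the instance to the cluster center. */
class OutthTest {
    static boolean isExistingClass(double b, double r, double outth) {
        double wt = Math.exp(r - b);   // assumed weight form, see above
        return wt >= outth;            // below OUTTH: treat as outlier
    }
}
```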

Page 68: Data Stream Classification and Novel Class Detection

Outlier Threshold (OUTTH)

If the threshold is too high, noisy data may become outliers
◦ the FP rate will go up
If the threshold is too low, novel class instances will be labeled as existing class
◦ the FN rate will go up

[Figure: an instance x near the cluster boundary; where should OUTTH be set?]

We need to balance these two.

Proposed Methods

Page 69: Data Stream Classification and Novel Class Detection

Introduction
Data Stream Classification
Clustering
Novel Class Detection
• Finer Grain Novel Class Detection
• Dynamic Novel Class Detection
• Multiple Novel Class Detection

Page 70: Data Stream Classification and Novel Class Detection

Dynamic Threshold Setting

Proposed Methods

◦ Defer approach
  After a testing chunk has been labeled, update the OUTTH based on the marginal FP and FN rates of this testing chunk, and then apply the new OUTTH to the next testing chunk.
◦ Eager approach
  Once a marginal FP or marginal FN instance is detected, update OUTTH with a step function and apply the updated OUTTH to the next testing instance.

[Figure: instances near the OUTTH boundary illustrating marginal FP and marginal FN.]

Page 71: Data Stream Classification and Novel Class Detection

Dynamic Threshold Setting

[Figure shown on slide.]

Proposed Methods

Page 72: Data Stream Classification and Novel Class Detection

Defer Approach and Eager Approach Comparison

In the Defer approach, OUTTH updates only after a data chunk is labeled:
◦ Too late - within the testing chunk, many marginal FPs or FNs may occur due to an improper OUTTH threshold
◦ Overreact - if there are many marginal FP or FN instances in the labeled testing chunk, the OUTTH update may overreact for the next testing chunk

In the Eager approach, OUTTH updates aggressively whenever a marginal FP or FN happens:
◦ The model is more tolerant to noisy data and concept drift
◦ The model is more sensitive to novel class instances

Proposed Methods

Page 73: Data Stream Classification and Novel Class Detection

Outlier Statistics

For each outlier instance, we calculate the novelty probability Pnov:
◦ If Pnov is large (close to 1), the outlier has a high probability of being a novel class instance.

Pnov has two parts:
◦ The first part measures how far the outlier is from the model cluster
◦ The second part, Psc, is the silhouette coefficient, which measures the cohesion and separation of the q-neighbors of the outlier with respect to the model cluster

Pnov(x) = (1 − wt(x)) / (1 − min(wt(x))) · Psc

(A code sketch follows below.)

Proposed Methods
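A one-line sketch of the reconstructed Pnov formula (the weight terms are assumed to come from the OUTTH sketch above, and the formula itself is reconstructed from the garbled slide):

```java
/** Novelty probability of one outlier: the distance part, normalized by
 *  the smallest weight seen among the outliers, times the silhouette
 *  coefficient Psc of the outlier's q-neighbors. */
class NoveltyProbability {
    static double pnov(double wtX, double wtMin, double psc) {
        return (1.0 - wtX) / (1.0 - wtMin) * psc;  // close to 1: likely novel
    }
}
```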

Page 74: Data Stream Classification and Novel Class Detection

Outlier Statistics

Three scenarios may occur simultaneously:
• Noise data
• Concept drift
• Novel class

Proposed Methods

Page 75: Data Stream Classification and Novel Class Detection

Outlier Statistics: Gini Analysis

The Gini coefficient is a measure of statistical inequality. The discrete Gini coefficient is:

G(s) = (1/n) * ( n + 1 − 2 * Σ_{i=1..n} (n + 1 − i) y_i / Σ_{i=1..n} y_i )

If we divide [0, 1] into n equal-sized bins and put every outlier's Pnov into its corresponding bin, we obtain the cumulative distribution y_i:
◦ If all Pnov are very low, in the extreme the cdf is y_i = 1 for all i, and
  G(s) = (1/n) * ( n + 1 − 2 * (n(n+1)/2) / n ) = 0
◦ If all Pnov are very high, in the extreme y_i = 0 for all i except y_n = 1, and
  G(s) = (1/n) * ( n + 1 − 2 ) = (n − 1)/n

Proposed Methods

Page 76: Data Stream Classification and Novel Class Detection

Outlier Statistics: Gini Analysis

◦ If the outliers' Pnov values are distributed evenly, y_i = i/n, and

G(s) = (1/n) * ( n + 1 − 2 * Σ_{i=1..n} (n + 1 − i)(i/n) / Σ_{i=1..n} (i/n) )
     = (1/n) * ( n + 1 − 2(n + 2)/3 )
     = (n − 1) / (3n)

After obtaining the outliers' Pnov distribution, calculate G(s):
◦ If G(s) > (n − 1)/(3n), declare a novel class
◦ If G(s) <= (n − 1)/(3n), classify the outliers as existing class instances

When n → ∞, (n − 1)/(3n) → 1/3 ≈ 0.33.

(A code sketch of the test follows below.)

Proposed Methods
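A minimal sketch of the Gini test on the outliers' Pnov values, binning [0, 1] into n equal bins and applying the discrete formula above (a hypothetical helper; assumes a non-empty Pnov array):

```java
/** Discrete Gini coefficient of the outliers' Pnov values. */
class GiniTest {
    static double giniOfPnov(double[] pnov, int n) {
        int[] hist = new int[n];
        for (double p : pnov)
            hist[Math.min((int) (p * n), n - 1)]++;   // bin index for Pnov
        double[] y = new double[n];
        double running = 0;
        for (int i = 0; i < n; i++) {                 // cdf over the bins
            running += hist[i];
            y[i] = running / pnov.length;
        }
        double num = 0, den = 0;
        for (int i = 1; i <= n; i++) {
            num += (n + 1 - i) * y[i - 1];
            den += y[i - 1];
        }
        return (n + 1 - 2 * num / den) / n;
    }

    /** G(s) above the threshold (n-1)/(3n) suggests a novel class. */
    static boolean novelClassDetected(double[] pnov, int n) {
        return giniOfPnov(pnov, n) > (n - 1.0) / (3.0 * n);
    }
}
```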

Page 77: Data Stream Classification and Novel Class Detection

Outlier Statistics: Gini Analysis Limitation

◦ In the extreme, it is impossible to differentiate concept drift from concept evolution using the Gini coefficient, when the concept drift simply "looks like" concept evolution.

Proposed Methods

Page 78: Data Stream Classification and Novel Class Detection

Introduction
Data Stream Classification
Clustering
Novel Class Detection
• Finer Grain Novel Class Detection
• Dynamic Novel Class Detection
• Multiple Novel Class Detection

Page 79: Data Stream Classification and Novel Class Detection

Multi Novel Class Detection

Proposed Methods

[Figure: a data stream whose feature space contains positive and negative instances plus two groups of novel instances, novel class A and novel class B.]

If we always assume the novel instances belong to one novel class, one of the two types of novel instances, either A or B, will be misclassified.

Page 80: Data Stream Classification and Novel Class Detection

Multi Novel Class Detection

Proposed Methods

The main idea in detecting multiple novel classes is to construct a graph and identify the connected components in the graph. The number of connected components determines the number of novel classes.

Page 81: Data Stream Classification and Novel Class Detection

Multi Novel Class Detection

Proposed Methods

Two phases:
◦ Building the connected graph
  Build a directed nearest-neighbor graph: from each vertex (outlier cluster), add an edge to its nearest neighbor.
  If the silhouette coefficient from a vertex to its nearest neighbor is larger than some threshold, the edge is removed.
  Problem: linkage circles
◦ Component merging phase
  Gaussian-distribution-centric decision

(A code sketch of the graph construction and component counting follows below.)
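A minimal sketch of the graph-building and component-counting phase, assuming outlier-cluster centroids and a precomputed pairwise silhouette matrix as inputs (hypothetical representation); union-find counts the connected components, i.e., the number of detected novel classes.

```java
import java.util.*;

/** Count connected components among outlier clusters (vertices). Each
 *  cluster gets an edge to its nearest neighbor unless the pair's
 *  silhouette coefficient says they are well separated. */
class NovelClassCounter {
    static int countNovelClasses(double[][] centroids, double[][] silhouette,
                                 double threshold) {
        int n = centroids.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        for (int i = 0; i < n; i++) {
            int nn = -1; double best = Double.MAX_VALUE;
            for (int j = 0; j < n; j++) {             // find nearest neighbor
                if (j == i) continue;
                double d = 0;                         // squared Euclidean distance
                for (int k = 0; k < centroids[i].length; k++)
                    d += (centroids[i][k] - centroids[j][k])
                       * (centroids[i][k] - centroids[j][k]);
                if (d < best) { best = d; nn = j; }
            }
            if (nn >= 0 && silhouette[i][nn] <= threshold)
                union(parent, i, nn);                 // not well separated: link
        }
        Set<Integer> roots = new HashSet<>();
        for (int i = 0; i < n; i++) roots.add(find(parent, i));
        return roots.size();   // connected components = novel classes
    }

    static int find(int[] p, int i) { return p[i] == i ? i : (p[i] = find(p, p[i])); }
    static void union(int[] p, int a, int b) { p[find(p, a)] = find(p, b); }
}
```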

Page 82: Data Stream Classification and Novel Class Detection

Multi Novel Class Detection

Proposed Methods

◦ Component merging phase
  In probability theory, "the normal (or Gaussian) distribution is a continuous probability distribution that is often used as a first approximation to describe real-valued random variables that tend to cluster around a single mean value" [1].
  If two Gaussian components (g1, g2) can be separated, the following condition holds:

  d = centroid_dist(g1, g2) >= μ1 + μ2

  where μ is the mean absolute deviation of each component. For a zero-mean Gaussian, E|x| = σ * sqrt(2/π), so μ is proportional to σ, and the condition becomes

  d = centroid_dist(g1, g2) >= c(σ1 + σ2)

  If this condition holds, the two components remain separated; otherwise the two components are merged.

[1] Shun'ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. Oxford University Press. ISBN 0-8218-0531-2, 2000.

Page 83: Data Stream Classification and Novel Class Detection

Experiments: Datasets

Experiment Results

We evaluated our approach on different datasets:

Data Set      Concept Drift   Concept Evolution   Dynamic Feature   # of Instances   # of Classes
KDD           ✓               ✓                   -                 492K             7
Forest Cover  ✓               ✓                   -                 387K             7
NASA          ✓               ✓                   ✓                 140K             21
Twitter       ✓               ✓                   ✓                 335K             21
SynED         ✓               ✓                   ✓                 400K             20

Page 84: Data Stream Classification and Novel Class Detection

Experiments: Setup

Development:
◦ Language: Java

H/W:
◦ Intel P-IV with
◦ 3GB memory and
◦ 3GHz dual-processor CPU

Parameter settings:
◦ K (number of pseudopoints per chunk) = 50
◦ q (minimum number of instances required to declare a novel class) = 50
◦ L (ensemble size) = 6
◦ S (chunk size) = 1,000

Experiment Results

Page 85: Data Stream Classification and Novel Class Detection

Experiments: Baseline

Competing approaches:
◦ i) DEMminer, our approach, with 5 variations:
  Lossy-F conversion
  Lossy-L conversion
  Lossless conversion - DEMminer
  Dynamic OUTTH + lossless conversion - DEMminer-Ex (without Gini)
  Dynamic OUTTH + Gini + lossless conversion - DEMminer-Ex
◦ ii) WCE-OLINDDA (O-W)
◦ iii) FAE-WCE-OLINDDA_Parallel (O-F)

We use this combination because, to the best of our knowledge, no other approach can classify and detect novel classes simultaneously with feature evolution.

Page 86: Data Stream Classification and Novel Class Detection

Experiments: Results

Evaluation metrics:
◦ Fn = total novel class instances misclassified as existing class,
◦ Fp = total existing class instances misclassified as novel class,
◦ Fe = total existing class instances misclassified (other than Fp),
◦ Nc = total novel class instances in the stream,
◦ N = total instances in the stream

Experiment Results

Page 87: Data Stream Classification and Novel Class Detection

Twitter Results

[Results chart shown on slide.]

Experiment Results

Page 88: Data Stream Classification and Novel Class Detection

Twitter Results

Method   DEMminer   Lossy-L   Lossy-F   O-F
AUC      0.88       0.83      0.76      0.56

Experiment Results

Page 89: Data Stream Classification and Novel Class Detection

Twitter Results

[Results chart shown on slide.]

Experiment Results

Page 90: Data Stream Classification and Novel Class Detection

Twitter Results

Method   DEMminer-Ex   DEMminer   OW
AUC      0.94          0.88       0.56

Experiment Results

Page 91: Data Stream Classification and Novel Class Detection

Forest Cover Results

[Results chart shown on slide.]

Experiment Results

Page 92: Data Stream Classification and Novel Class Detection

Forest Cover Results

Method   DEMminer   DEMminer-Ex (without Gini)   DEMminer-Ex   OW
AUC      0.97       0.99                         0.97          0.74

Experiment Results

Page 93: Data Stream Classification and Novel Class Detection

NASA Dataset

[Results chart shown on slide.]

Experiment Results

Page 94: Data Stream Classification and Novel Class Detection

NASA Dataset

Method   Deviation   Info Gain   FAE
AUC      0.996       0.967       0.876

Experiment Results

Page 95: Data Stream Classification and Novel Class Detection

KDD Results

[Results chart shown on slide.]

Experiment Results

Page 96: Data Stream Classification and Novel Class Detection

KDD Results

Method   DEMminer   O-F
AUC      0.98       0.96

Experiment Results

Page 97: Data Stream Classification and Novel Class Detection

Result Summary

Dataset       Method                 ERR    Mnew   Fnew   AUC     FP     FN
Twitter       DEMminer               4.2    30.5   0.8    0.877   -      -
              Lossy-F                32.5   0.0    32.6   0.834   -      -
              Lossy-L                1.6    82.0   0.0    0.764   -      -
              O-F                    3.4    96.7   1.6    0.557   -      -
ASRS          DEMminer               0.02   -      -      0.996   0.00   0.1
              DEMminer (info-gain)   1.4    -      -      0.967   0.04   10.3
              O-F                    3.4    -      -      0.876   0.00   24.7
Forest Cover  DEMminer               3.6    8.4    1.3    0.973   -      -
              O-F                    5.9    20.6   1.1    0.743   -      -
KDD           DEMminer               1.2    5.9    0.9    0.986   -      -
              O-F                    4.7    9.6    4.4    0.967   -      -

Experiment Results

Page 98: Data Stream Classification and Novel Class Detection

Result Summary

Dataset       Method        ERR   Mnew   Fnew   AUC
Twitter       DEMminer      4.2   30.5   0.8    0.877
              DEMminer-Ex   1.8   0.7    0.6    0.944
              OW            3.4   96.7   1.6    0.557
Forest Cover  DEMminer      3.6   8.4    1.3    0.974
              DEMminer-Ex   3.1   4.0    0.68   0.990
              OW            5.9   20.6   1.1    0.743

Experiment Results

Page 99: Data Stream Classification and Novel Class Detection

Running Time Comparison

              Time (sec) / 1K points        Points/sec                    Speed gain
Dataset       DEMminer  Lossy-F  O-F        DEMminer  Lossy-F  O-F        DEMminer over O-F
Twitter       23        3.5      66.7       43        289      15         2.9
ASRS          21        4.3      38.5       47        233      26         1.8
Forest Cover  1.0       1.0      4.7        967       1003     212        4.7
KDD           1.2       1.2      3.3        858       812      334        2.5

Experiment Results

Page 100: Data Stream Classification and Novel Class Detection

Multi Novel Detection Results

[Results charts shown on slide.]

Experiment Results

Page 101: Data Stream Classification and Novel Class Detection

Multi Novel Detection Results

[Results charts shown on slide.]

Experiment Results

Page 102: Data Stream Classification and Novel Class Detection

Conclusion

• Our data stream classification technique addresses
  • Infinite length
  • Concept-drift
  • Concept-evolution
  • Feature-evolution
• Existing approaches address only the first two issues
• Applicable to many domains, such as
  • Intrusion/malware detection
  • Text categorization
  • Fault detection, etc.

Page 103: Data Stream Classification and Novel Class Detection

References

J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh. BOAT-Optimistic Decision Tree Construction. In Proc. SIGMOD, 1999.

P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. SIGKDD, pages 71-80, 2000.

B. Wenerstrom and C. Giraud-Carrier. Temporal data mining in dynamic feature spaces. In: Perner, P. (ed.) ICDM 2006, LNCS (LNAI), vol. 4065, pp. 1141-1145. Springer, Heidelberg, 2006.

E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In Proc. 2008 ACM Symposium on Applied Computing, pages 976-980, 2008.

M. Scholz and R. Klinkenberg. An ensemble classifier for drifting concepts. In Proc. ICML/PKDD Workshop on Knowledge Discovery in Data Streams, 2005.

Page 104: Data Stream Classification and Novel Class Detection

References (contd.)

J. Brutlag. Aberrant behavior detection in time series for network monitoring. In Proc. Usenix Fourteenth System Admin. Conf. (LISA XIV), New Orleans, LA, Dec. 2000.

E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. Applications of Data Mining in Computer Security, Kluwer, 2002.

W. Fan. Systematic data selection to mine concept-drifting data streams. In Proc. KDD '04.

J. Gao, W. Fan, and J. Han. On Appropriate Assumptions to Mine Data Streams. 2007.

J. Gao, W. Fan, J. Han, and P. S. Yu. A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions. SDM 2007.

J. Goebel and T. Holz. Rishi: Identify bot contaminated hosts by IRC nickname evaluation. In Usenix/Hotbots '07 Workshop, 2007.

J. B. Grizzard, V. Sharma, C. Nunnery, B. B. Kang, and D. Dagon. Peer-to-peer botnets: Overview and case study. In Usenix/Hotbots '07 Workshop, 2007.

Page 105: Data Stream Classification and Novel Class Detection

References (contd.)

E. J. Keogh and M. J. Pazzani. Scaling up dynamic time warping for data mining applications. In ACM SIGKDD, 2000.

R. Lemos. Bot software looks to improve peerage. SecurityFocus. http://www.securityfocus.com/news/11390, 2006.

C. Livadas, B. Walsh, D. Lapsley, and T. Strayer. Using machine learning techniques to identify botnet traffic. In 2nd IEEE LCN Workshop on Network Security (WoNS 2006), November 2006.

LURHQ Threat Intelligence Group. Sinit p2p trojan analysis. http://www.lurhq.com/sinit.html, 2004.

M. A. Rajab, J. Zarfoss, F. Monrose, and A. Terzis. A multifaceted approach to understanding the botnet phenomenon. In Proc. 6th ACM SIGCOMM Internet Measurement Conference (IMC), 2006.

Kagan Tumer and Joydeep Ghosh. Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3-4):385-403, 1996.

Page 106: Data Stream Classification and Novel Class Detection

References (contd.)

Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. A Multi-Partition Multi-Chunk Ensemble Technique to Classify Concept-Drifting Data Streams. In Proc. 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-09), pages 363-375, Bangkok, Thailand, April 2009.

Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data. In Proc. 2008 IEEE International Conference on Data Mining (ICDM 2008), Pisa, Italy, pages 929-934, December 2008.

Clay Woolam, Mohammed Masud, and Latifur Khan. Lacking Labels in the Stream: Classifying Evolving Stream Data with Few Labels. In Proc. 18th International Symposium on Methodologies for Intelligent Systems (ISMIS), pages 552-562, September 2009, Prague, Czech Republic.

Page 107: Data Stream Classification and Novel Class Detection

References (contd.)

Mohammad Masud, Qing Chen, Latifur Khan, Charu Aggarwal, Jing Gao, Jiawei Han, and Bhavani Thuraisingham. Addressing Concept-Evolution in Concept-Drifting Data Streams. In Proc. 10th IEEE International Conference on Data Mining (ICDM 2010), Sydney, Australia, December 2010.

Mohammad M. Masud, Qing Chen, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. Classification and Novel Class Detection of Data Streams in a Dynamic Feature Space. In Proc. European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2010), Barcelona, Spain, September 20-24, 2010, Springer, ISBN 978-3-642-15882-7, pages 337-352.

Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. Classification and Novel Class Detection in Data Streams with Active Mining. In Proc. 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining, June 21-24, 2010, pages 311-324, Hyderabad, India.

Page 108: Data Stream Classification and Novel Class Detection

References (contd.)

Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints. IEEE Transactions on Knowledge and Data Engineering (TKDE), IEEE Computer Society, June 2011, Vol. 23, No. 6, pages 859-874.

Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. A Framework for Clustering Evolving Data Streams. In Proc. VLDB '03, 29th International Conference on Very Large Data Bases, Volume 29.

H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226-235, Washington, DC, USA, August 2003. ACM.

Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams. In Proc. 2009 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD '09), Bled, Slovenia, September 7-11, 2009.

Page 109: Data Stream Classification and Novel Class Detection

Questions