Imbalanced Data Set Learning with Synthetic Examples


Page 1: Imbalanced Data Set Learning with Synthetic Examples

Benjamin X. Wang and Nathalie Japkowicz

Page 2: Imbalanced Data Set Learning with Synthetic Examples

The Class Imbalance Problem I

Data sets are said to be balanced if there are, approximately, as many positive examples of the concept as there are negative ones.

There exist many domains that do not have a balanced data set. Examples:
– Helicopter gearbox fault monitoring
– Discrimination between earthquakes and nuclear explosions
– Document filtering
– Detection of oil spills
– Detection of fraudulent telephone calls

Page 3: Imbalanced Data Set Learning with Synthetic Examples

The Class Imbalance Problem II

The problem with class imbalances is that standard learners are often biased towards the majority class.

That is because these classifiers attempt to reduce global quantities such as the error rate, without taking the data distribution into consideration.

As a result, examples from the overwhelming (majority) class are well classified, whereas examples from the minority class tend to be misclassified.

Page 4: Imbalanced Data Set Learning with Synthetic Examples

Some Generalities

Evaluating the performance of a learning system on a class imbalance problem is not done appropriately with the standard accuracy/error-rate measures; ROC analysis is typically used instead.

There is a parallel between research on class imbalances and cost-sensitive learning.

There are four main ways to deal with class imbalances: re-sampling, re-weighing, adjusting the probabilistic estimate, and one-class learning.
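The ROC analysis mentioned above is usually summarized by the area under the ROC curve, which equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. A minimal sketch of that computation (not from the talk, just an illustration):

```python
import numpy as np

def roc_auc(scores, labels):
    # Area under the ROC curve, computed as the probability that a
    # randomly chosen positive example is scored above a randomly
    # chosen negative one (ties count as one half).
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

Unlike the raw error rate, this measure is unaffected by the class proportions, which is why it is preferred for imbalanced problems.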

Page 5: Imbalanced Data Set Learning with Synthetic Examples

Advantage of Resampling

Re-sampling provides a simple way of biasing the generalization process. It can do so by:
– Generating synthetic samples that are accordingly biased
– Controlling the amount and placement of the new samples

Note: this type of control can also be achieved by smoothing the classifier’s probabilistic estimate (e.g., Zadrozny & Elkan, 2001), but that type of control cannot be as localized as the one achieved with re-sampling techniques.

Page 6: Imbalanced Data Set Learning with Synthetic Examples

SMOTE: A State-of-the-Art Resampling Approach

SMOTE stands for Synthetic Minority Oversampling Technique, designed by Chawla, Bowyer, Hall & Kegelmeyer in 2002.

It combines informed oversampling of the minority class with random undersampling of the majority class.

SMOTE currently yields the best results among re-sampling and probabilistic-estimate-modification techniques (Chawla, 2003).

Page 7: Imbalanced Data Set Learning with Synthetic Examples

SMOTE’s Informed Oversampling Procedure II

For each minority sample:
– Find its k nearest minority neighbours
– Randomly select j of these neighbours
– Randomly generate synthetic samples along the lines joining the minority sample and its j selected neighbours
(j depends on the amount of oversampling desired)
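The generation step above can be sketched as follows (a minimal illustration, not the reference implementation; `smote_sample` and its parameters are hypothetical names):

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, seed=None):
    # Generate synthetic minority samples along the lines joining each
    # minority sample to one of its k nearest minority neighbours.
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]      # each sample's k nearest neighbours
    out = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(n)                # pick a minority sample
        j = nn[i, rng.integers(k)]         # pick one of its k neighbours
        gap = rng.random()                 # random position along the line
        out[t] = X_min[i] + gap * (X_min[j] - X_min[i])
    return out

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
synthetic = smote_sample(X_min, k=3, n_new=10, seed=0)
```

Because each new point is a convex combination of two minority samples, the synthetic data stay inside the region spanned by the minority class.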

Page 8: Imbalanced Data Set Learning with Synthetic Examples

SMOTE’s Informed vs. Random Oversampling

Random oversampling (with replacement) of the minority class has the effect of making the decision region for the minority class very specific. In a decision tree, it would cause a new split and often lead to overfitting.

SMOTE’s informed oversampling generalizes the decision region for the minority class. As a result, larger and less specific regions are learned, thus paying attention to minority-class samples without causing overfitting.

Page 9: Imbalanced Data Set Learning with Synthetic Examples

SMOTE’s Informed Oversampling Procedure I

[Figure: legend — minority sample, synthetic sample, majority sample.]

… But what if there is a majority sample nearby?

Page 10: Imbalanced Data Set Learning with Synthetic Examples

SMOTE’s Shortcomings

Overgeneralization
– SMOTE’s procedure is inherently dangerous, since it blindly generalizes the minority area without regard to the majority class.
– This strategy is particularly problematic in the case of highly skewed class distributions since, in such cases, the minority class is very sparse with respect to the majority class, resulting in a greater chance of class mixture.

Lack of Flexibility
– The number of synthetic samples generated by SMOTE is fixed in advance, not allowing for any flexibility in the re-balancing rate.

Page 11: Imbalanced Data Set Learning with Synthetic Examples

SMOTE’s Tendency for Overgeneralization

[Figure: legend — minority sample, majority sample, synthetic sample. A synthetic point lands among majority samples.]

Overgeneralization!

Page 12: Imbalanced Data Set Learning with Synthetic Examples

Our Proposed Solution

In order to avoid overgeneralization, we propose to use three techniques:
– Testing for data sparsity
– Clustering the minority class
– 2-class (rather than 1-class) sample generation

In order to avoid SMOTE’s lack of flexibility, we propose one technique:
– Multiple trials/feedback

We call our approach the Adaptive Synthetic Minority Oversampling Method (ASMO).

Page 13: Imbalanced Data Set Learning with Synthetic Examples

ASMO’s Strategy I

Overgeneralization Avoidance I: Testing for data sparsity
– For each minority sample m, if all of m’s g nearest neighbours are majority samples, then the data set is sparse and ASMO should be used; otherwise, SMOTE can be used. (As a default, we used g = 20.)

Overgeneralization Avoidance II: Clustering
– We will use k-means or another such clustering system on the minority class (for now, this step is done, but in a non-standard way).
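One reading of the sparsity test above can be sketched as follows (the function name, the "all g neighbours" interpretation, and the toy data are our assumptions, not the authors' code):

```python
import numpy as np

def is_sparse(X_min, X_maj, g=20):
    # Declare the data set sparse (use ASMO) if some minority sample's
    # g nearest neighbours in the combined data are all majority samples.
    X = np.vstack([X_min, X_maj])
    is_majority = np.array([False] * len(X_min) + [True] * len(X_maj))
    for m in X_min:
        d = np.linalg.norm(X - m, axis=1)
        nn = np.argsort(d)[1:g + 1]      # skip m itself (distance 0)
        if is_majority[nn].all():
            return True                   # sparse -> use ASMO
    return False                          # dense enough -> SMOTE is fine

# Hypothetical toy data: one isolated minority point amid majority points
X_min = np.array([[0.0, 0.0], [10.0, 10.0]])
X_maj = np.array([[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]])
sparse = is_sparse(X_min, X_maj, g=3)
```

With these toy data the minority point at the origin is surrounded by majority samples, so the test reports sparsity.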

Page 14: Imbalanced Data Set Learning with Synthetic Examples

ASMO’s Strategy II

Overgeneralization Avoidance III: Synthetic sample generation using two classes
– Rather than using the k nearest neighbours within the minority class to generate new samples, we use the k nearest neighbours of the opposite (majority) class.
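A minimal sketch of this 2-class generation step (the function name and the choice to cap the step at half the distance are our assumptions; the slides do not say how far along the line the samples are placed):

```python
import numpy as np

def asmo_generate(X_min, X_maj, k=5, n_new=50, seed=0):
    # Place new samples along lines joining a minority sample to one of
    # its k nearest *majority* neighbours. Capping the step at half the
    # distance keeps the point on the minority side of the joining line.
    rng = np.random.default_rng(seed)
    out = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X_min))              # a minority sample
        d = np.linalg.norm(X_maj - X_min[i], axis=1)
        nn = np.argsort(d)[:k]                    # nearest majority samples
        j = nn[rng.integers(len(nn))]
        gap = 0.5 * rng.random()                  # stay less than halfway
        out[t] = X_min[i] + gap * (X_maj[j] - X_min[i])
    return out

X_min = np.array([[0.0, 0.0]])
X_maj = np.array([[1.0, 1.0], [2.0, 2.0]])
synthetic = asmo_generate(X_min, X_maj, k=2, n_new=20, seed=0)
```

Generating towards the majority class makes the placement sensitive to where the class boundary actually is, rather than blindly filling the minority region.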

Page 15: Imbalanced Data Set Learning with Synthetic Examples

ASMO’s Overgeneralization Avoidance: Overview

[Figure: legend — minority sample, synthetic sample, majority sample.]

– Clustering
– 2-class sample generation

Page 16: Imbalanced Data Set Learning with Synthetic Examples

ASMO’s Strategy III

Flexibility Enhancement through Multiple Trials and Feedback:
– For each cluster Ci, iterate through different rates of majority undersampling and synthetic minority generation. Keep the best combination subset Si.
– Merge the Si’s into a single training set S.
– Apply the classifier to S.
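The per-cluster search loop above can be sketched as follows. All names, the rate grids, and the jitter-based stand-in for the synthetic generation step are hypothetical; the real system would plug in the 2-class generation and score each candidate with classifier performance:

```python
import itertools
import numpy as np

def adaptive_rebalance(clusters, X_maj, score,
                       under_rates=(0.5, 1.0), over_rates=(1.0, 2.0), seed=0):
    # For each minority cluster Ci, try every pair of majority
    # undersampling rate and minority oversampling rate, keep the
    # best-scoring combination Si, and return the merged list S.
    rng = np.random.default_rng(seed)
    S = []
    for Ci in clusters:
        best_s, best_Si = -np.inf, None
        for u, o in itertools.product(under_rates, over_rates):
            n_keep = max(1, int(u * len(X_maj)))      # undersample majority
            maj = X_maj[rng.choice(len(X_maj), n_keep, replace=False)]
            n_new = int((o - 1.0) * len(Ci))          # extra minority samples
            idx = rng.integers(len(Ci), size=n_new)
            # stand-in generation: jittered copies of cluster points
            synth = Ci[idx] + rng.normal(0.0, 0.01, (n_new, Ci.shape[1]))
            Si = (np.vstack([Ci, synth]), maj)        # candidate subset
            s = score(Si)     # in the talk: classifier performance on Si
            if s > best_s:
                best_s, best_Si = s, Si
        S.append(best_Si)
    return S

# Toy run: the score simply prefers balanced minority/majority sizes.
clusters = [np.zeros((4, 2)), np.ones((3, 2))]
X_maj = np.full((10, 2), 5.0)
S = adaptive_rebalance(clusters, X_maj,
                       score=lambda Si: -abs(len(Si[0]) - len(Si[1])))
```

Each cluster ends up with its own re-balancing rates, which is exactly the flexibility that fixed-rate SMOTE lacks.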

Page 17: Imbalanced Data Set Learning with Synthetic Examples

Discussion of our Technique I

Assumption we made / Justification:
– The problem is decomposable, i.e., optimizing each subset will yield an optimal merged set.
– As long as the base classifier we use does some kind of local learning (not just global optimization), this assumption should hold.

Question / Answer:
– Why did we use different oversampling and undersampling rates?
– It was previously shown that optimal sampling rates are problem dependent and thus are best set adaptively (Weiss & Provost, 2003; Estabrook & Japkowicz, 2001).

Page 18: Imbalanced Data Set Learning with Synthetic Examples

Experiment Setup I

We tested our system on three different data sets:
– Lupus (thanks to James Malley of NIH)
  • Minority class: 2.8%
  • Data set size: 3,839
– Abalone-5 (UCI)
  • Minority class: 2.75%
  • Data set size: 4,177
– Connect-4 (UCI)
  • Minority class: 9.5%
  • Data set size: 11,258

Page 19: Imbalanced Data Set Learning with Synthetic Examples

Experiment Setup II

ASMO was compared to two other techniques:
– SMOTE
– O-D [the combination of Random Over- and Down (Under)-sampling; O-D was shown to outperform both random oversampling and random undersampling in preliminary experiments].

The base classifier in all experiments is an SVM; k-NN was used in the synthetic generation process to identify the samples’ nearest neighbours (within the minority class or between the minority and majority classes).

The results are reported in the form of ROC curves on 10-fold cross-validation experiments.

Page 20: Imbalanced Data Set Learning with Synthetic Examples


Results on Lupus

Page 21: Imbalanced Data Set Learning with Synthetic Examples


Results on Abalone-5

Page 22: Imbalanced Data Set Learning with Synthetic Examples


Results on Connect-4

Page 23: Imbalanced Data Set Learning with Synthetic Examples

Discussion of the Results

On every domain, ASMO slightly outperforms both O-D and SMOTE. In the ROC areas where ASMO does not outperform the other two systems, its performance equals theirs.

ASMO’s effect seems to be one of smoothing SMOTE’s ROC curve.

ASMO’s performance is comparatively better in the two domains where the class imbalance is greater (Lupus, Abalone-5). We expect its relative performance to increase as the imbalance grows even more.

Page 24: Imbalanced Data Set Learning with Synthetic Examples

Summary

We presented a few modifications to the state-of-the-art re-sampling system SMOTE.

These modifications had two goals:
– To correct SMOTE’s tendency to overgeneralize
– To make SMOTE more flexible

We observed slightly improved performance on three domains. However, that improvement came at the expense of greater time consumption.

Page 25: Imbalanced Data Set Learning with Synthetic Examples

Future Work [This was a very preliminary study!]

To clean up the system (e.g., to use a standard clustering method).

To test the system more rigorously (to test for significance; to use TANGO [used in the medical domain]).

To test our system on highly imbalanced data sets, to see if, indeed, our design helps address this particular issue.

To modify the data generation process so as to test biases other than the one proposed by SMOTE.