Page 1: IMPRECISE CLASSIFICATION WITH CREDAL DECISION TREES

A cost-sensitive measure to quantify the performance of an imprecise classifier

Joaquín Abellán, Andrés R. Masegosa

Department of Computer Science and Artificial Intelligence
University of Granada, Granada, Spain

Oviedo (Asturias), December 2012

Page 2: Outline

1 Introduction

2 Previous Knowledge

3 Imprecise classification with credal decision trees

4 Cost-sensitive imprecise classification

5 Experimental Evaluation

6 Conclusions & Future Work

Page 3: Part I. Introduction

Page 4: Supervised Classification

A probabilistic classifier [5] models the conditional probability of the class variable C given a set of predictive variables X = {X_1, ..., X_N}:

p(C \mid X_1, ..., X_N)

Under supervised classification settings, this conditional probability is estimated from a set of n labelled samples:

D = \{(c_1, x_1), ..., (c_n, x_n)\}

How this conditional probability is learnt depends on the particular classification model: decision trees, Bayesian network classifiers, AODE, ensembles of decision trees...

The prediction c^* for a new data case x_{new} is the class with the highest posterior probability:

c^* = \arg\max_{c \in Val(C)} p(c \mid x_{new})
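As a minimal sketch of this decision rule (not from the slides; the `posterior` vector is an assumed stand-in for any model's estimate of p(C | x_new)):

```python
import numpy as np

def predict(posterior: np.ndarray, classes: list) -> object:
    """Bayes (MAP) prediction: return the class with the highest posterior.

    posterior: array of p(c | x_new) for each class, summing to 1.
    """
    return classes[int(np.argmax(posterior))]

# Hypothetical posterior over three classes c1, c2, c3:
print(predict(np.array([0.2, 0.5, 0.3]), ["c1", "c2", "c3"]))  # -> "c2"
```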

Page 8: Imprecise Classification

An imprecise probabilistic classifier [8] learns a set of conditional probability distributions P, not a single conditional probability, from the same kind of labelled data samples:

P_D(C \mid X_1, ..., X_N)

How this set of conditional probabilities is learnt depends on the particular imprecise classification model: credal decision trees [1], naive credal classifier [4]...

These models are based on the so-called imprecise probability theory [6].

The prediction for a new data case x_{new} is not always a single class; it can be a set of different classes: the set of non-dominated classes, U.

For each new case we thus have a set of conditional probability distributions:

\{p(C \mid x_{new}) : p \in P\}

Page 11: Measuring the performance of an imprecise classifier

Imprecise classifiers produce set-valued predictions, U.

Descriptive measures [4]:

- Determinacy: the percentage of instances for which the imprecise classifier returns a unique state.
- Single Accuracy: the accuracy achieved by the imprecise classifier on the instances classified determinately.
- Set Accuracy: the accuracy achieved by the imprecise classifier on the instances classified indeterminately.
- Indeterminacy Size: the average size of the set of non-dominated classes.

Discounted Accuracy measure [4]:

DACC = \frac{1}{n_{Test}} \sum_i \frac{(accurate)_i}{|U_i|}

where (accurate)_i is 1 if the true class of instance i belongs to U_i, and 0 otherwise.
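A sketch of how these measures could be computed, assuming class labels and set-valued predictions as plain Python values and taking "accurate" to mean the true class lies in U_i:

```python
from statistics import mean

def evaluate_imprecise(y_true: list, y_sets: list[set]) -> dict:
    """Descriptive measures and discounted accuracy (DACC) for set-valued
    predictions. y_sets[i] is the set of non-dominated classes U_i for
    instance i."""
    n = len(y_true)
    det = [i for i in range(n) if len(y_sets[i]) == 1]
    indet = [i for i in range(n) if len(y_sets[i]) > 1]
    return {
        "determinacy": len(det) / n,
        "single_accuracy": mean(y_true[i] in y_sets[i] for i in det) if det else None,
        "set_accuracy": mean(y_true[i] in y_sets[i] for i in indet) if indet else None,
        "indeterminacy_size": mean(len(y_sets[i]) for i in indet) if indet else None,
        # DACC: a correct set-valued prediction scores 1/|U_i|.
        "DACC": sum((y_true[i] in y_sets[i]) / len(y_sets[i]) for i in range(n)) / n,
    }
```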

Page 13: A cost-sensitive measure to quantify the performance of an imprecise classifier

We present a tentative method for quantifying the performance of an imprecise classifier under misclassification costs.

Misclassification costs encode the weight that an expert would give to each type of error. E.g.:

- whether an email is spam or not;
- whether a mushroom is edible or poisonous.

The set of non-dominated classes is chosen differently in the presence of misclassification costs.

A new cost-sensitive measure for imprecise classifiers is presented.

Page 14: Part II. Previous Knowledge

Page 15: Imprecise Dirichlet Model

The imprecise Dirichlet model (IDM) [7] was introduced by Walley for inference about the probability distribution of a categorical variable.

Let p(c_j) = \theta_j, \theta = \{\theta_1, ..., \theta_K\}, and let D be a data set of n i.i.d. samples.

Assuming a Dirichlet prior, \pi(\theta) \propto \prod_j \theta_j^{s t_j - 1}, the posterior is:

\pi(\theta \mid D) \propto \prod_{j=1}^{K} \theta_j^{n_j + s t_j - 1}

- s is the equivalent sample size, or number of "hidden" samples.
- t_j is the proportion of hidden samples in c_j (e.g. uniform: t_j = 1/K).

Expected value:

E(\theta_j \mid D) = \frac{n_j + s t_j}{n + s}

Page 19: Imprecise Dirichlet Model (cont.)

The IDM assumes prior ignorance and considers a set of Dirichlet priors by varying the parameters t_j.

This is a vacuous model: a priori, p(c_j) \in (0, 1) for all j.

Inferences are made by computing lower and upper probabilities:

\underline{p}(c_j \mid D) = \inf_{0 < t_j < 1} \frac{n_j + s t_j}{n + s} = \frac{n_j}{n + s}

\overline{p}(c_j \mid D) = \sup_{0 < t_j < 1} \frac{n_j + s t_j}{n + s} = \frac{n_j + s}{n + s}

We obtain a credal set (a convex and closed set) of posterior probabilities for C.

The parameter s determines how quickly the lower and upper probabilities converge as more data become available.
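A minimal sketch of these IDM intervals, assuming only the raw class counts n_j as input:

```python
def idm_intervals(counts: list[int], s: float = 1.0) -> list[tuple[float, float]]:
    """Lower/upper IDM probabilities [n_j/(n+s), (n_j+s)/(n+s)] per class."""
    n = sum(counts)
    return [(nj / (n + s), (nj + s) / (n + s)) for nj in counts]

# With counts (6, 3, 1) and s = 1, each interval brackets the precise
# estimates 0.6, 0.3, 0.1, and the width s/(n+s) shrinks as n grows:
print(idm_intervals([6, 3, 1]))
# [(0.545..., 0.636...), (0.272..., 0.363...), (0.090..., 0.181...)]
```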

Page 22: Stochastic and credal dominance on probability intervals

How do we make predictions with imprecise classifiers?

- Select the class values that are not defeated (non-dominated) in probability terms.
- Use a dominance criterion.

Stochastic dominance: c_i and c_j have the associated intervals [l_i, u_i] and [l_j, u_j], respectively. There is stochastic dominance of c_j over c_i iff l_j ≥ u_i.

Credal dominance: the probability of each state of C is expressed by a non-empty credal set P. There is credal dominance of c_i over c_j iff p(c_i) ≥ p(c_j) for all p ∈ P.

For the IDM, both definitions are equivalent [3].
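A sketch of the resulting prediction rule: stochastic (interval) dominance applied to per-class probability intervals such as those produced by the IDM:

```python
def non_dominated(intervals: list[tuple[float, float]]) -> set[int]:
    """Indices of classes not stochastically dominated: class i is dropped
    iff some other class j has lower bound l_j >= i's upper bound u_i."""
    U = set()
    for i, (_, ui) in enumerate(intervals):
        dominated = any(lj >= ui for j, (lj, _) in enumerate(intervals) if j != i)
        if not dominated:
            U.add(i)
    return U

# Overlapping intervals keep both classes; a clearly separated one is dropped:
print(non_dominated([(0.40, 0.55), (0.35, 0.50), (0.05, 0.20)]))  # {0, 1}
```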

Page 26: Part III. Imprecise classification with credal decision trees

Page 27: Decision Trees

A decision tree, also called a classification tree, is a simple structure that can be used as a classifier. In a decision tree:

- each node represents an attribute variable;
- each branch represents one of the states of this variable;
- each leaf specifies an expected value of the class variable.

Example of a decision tree for three attribute variables X_i (i = 1, 2, 3), each with two possible values (0, 1), and a class variable C with states c1, c2, c3 (figure not reproduced in the transcript).

Page 28: Building Decision Trees from Data

Imprecise Credal Decision Trees (ICDT):

- Associate with each node the most informative variable about the class C, using the Imprecise Information Gain measure, based on the maximum entropy of a credal set [2] (see the sketch after this slide).
- Probability intervals at the leaves are estimated using the IDM:

p_\sigma(c) \in \left[ \frac{n_\sigma(c)}{n_\sigma + s}, \frac{n_\sigma(c) + s}{n_\sigma + s} \right]

where \sigma denotes a leaf, n_\sigma the number of training instances reaching it, and n_\sigma(c) those of class c.
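The maximum entropy over the IDM credal set can be computed by a "water-filling" argument: spread the extra mass s over the classes so that the counts become as level as possible, then take the Shannon entropy of the normalised result. The sketch below follows that idea; it is my reconstruction, not code from the authors, and the fixed bisection depth is arbitrary:

```python
import math

def max_entropy_idm(counts: list[int], s: float = 1.0) -> float:
    """Maximum entropy over the IDM credal set: distribute mass s over the
    classes (m_j >= 0, sum m_j = s) so the counts n_j + m_j are as level as
    possible, then return the entropy of the normalised distribution."""
    n = sum(counts)
    lo, hi = min(counts), max(counts) + s
    for _ in range(100):  # bisect the water level L with sum max(0, L - n_j) = s
        L = (lo + hi) / 2
        need = sum(max(0.0, L - c) for c in counts)
        lo, hi = (L, hi) if need < s else (lo, L)
    levelled = [max(c, lo) for c in counts]
    probs = [c / (n + s) for c in levelled]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

The Imprecise Information Gain then plays the role of the classical information gain, with this maximum entropy in place of the precise entropy.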

Page 30: Part IV. Cost-sensitive imprecise classification

Page 31: Cost-sensitive supervised classification

We assume the existence of the following misclassification cost matrix:

M = \begin{pmatrix} m_{11} & m_{12} & \cdots & m_{1K} \\ m_{21} & m_{22} & \cdots & m_{2K} \\ \vdots & & \ddots & \vdots \\ m_{K1} & \cdots & m_{K,K-1} & m_{KK} \end{pmatrix}

m_{ij} is the cost of predicting c_i when the true class is c_j.

Using the Bayes decision rule [5], we predict the class with minimum expected posterior risk:

c^t = \arg\min_{c_i \in Val(C)} R(c_i \mid x_{new}), \qquad R(c_i \mid x_{new}) = \sum_{j=1}^{K} m_{ij} \, p(c_j \mid x_{new})
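A sketch of this cost-sensitive Bayes rule, with the cost matrix indexed as on this slide (cost[i][j] = m_ij, the cost of predicting c_i when the truth is c_j):

```python
import numpy as np

def bayes_risk_prediction(posterior: np.ndarray, cost: np.ndarray) -> int:
    """Return the class index minimising the expected posterior risk
    R(c_i | x) = sum_j cost[i, j] * p(c_j | x)."""
    risks = cost @ posterior  # vector of R(c_i | x) for all i
    return int(np.argmin(risks))

# Hypothetical 2-class example: predicting class 0 when the truth is class 1
# is 10x worse than the converse, so a 0.6 posterior is not enough to pick 0.
cost = np.array([[0.0, 10.0], [1.0, 0.0]])
print(bayes_risk_prediction(np.array([0.6, 0.4]), cost))  # -> 1
```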

Page 33: Cost-sensitive imprecise classification

We define the lower and upper posterior risks as follows (the right-hand sides are the expressions valid for credal decision trees, with IDM intervals at the leaves):

\underline{R}(c_i \mid x_{new}) = \sum_{j=1}^{K} m_{ij} \, \underline{p}(c_j \mid x_{new}) = \sum_{j=1}^{K} m_{ij} \frac{n_\sigma(c_j)}{n_\sigma + s}

\overline{R}(c_i \mid x_{new}) = \sum_{j=1}^{K} m_{ij} \, \overline{p}(c_j \mid x_{new}) = \sum_{j=1}^{K} m_{ij} \frac{n_\sigma(c_j) + s}{n_\sigma + s}

Cost-based dominance criterion (stochastic dominance on risk intervals): c_i dominates c_j iff \overline{R}(c_i \mid x_{new}) \leq \underline{R}(c_j \mid x_{new}).

This cost-sensitive method also returns a set of non-dominated classes, U_M.
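Putting the pieces together for a credal-tree leaf, a sketch of how U_M could be obtained (the dominance direction follows the interval-dominance reading above):

```python
def cost_non_dominated(counts: list[int], cost: list[list[float]],
                       s: float = 1.0) -> set[int]:
    """Non-dominated classes U_M under cost-based (interval) dominance.
    counts[j] = n_sigma(c_j) at the leaf; cost[i][j] = m_ij."""
    n = sum(counts)
    K = len(counts)
    lower = [sum(cost[i][j] * counts[j] / (n + s) for j in range(K))
             for i in range(K)]
    upper = [sum(cost[i][j] * (counts[j] + s) / (n + s) for j in range(K))
             for i in range(K)]
    # c_i is dominated iff some other class's upper risk <= c_i's lower risk.
    return {i for i in range(K)
            if not any(upper[j] <= lower[i] for j in range(K) if j != i)}
```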

Page 35: Cost-sensitive imprecise classification

A cost-sensitive measure for imprecise classification (MIC):

MIC = \frac{1}{N} \left( - \sum_{i:\,Success} \ln\frac{|U_i|}{K} \; - \; \frac{1}{K-1} \sum_{i:\,Error} \alpha_{t,i} \ln K \right)

Success is quantified through the number of non-dominated states produced: a correct set prediction scores -\ln(|U_i|/K), which decreases as U_i grows.

Classification errors take the misclassification costs into account; we consider the worst-case error:

\alpha_{t,i} = \max_{c_j \in U_i} m_{jt}

where t indexes the true class of instance i.

- A completely imprecise (vacuous) classifier, which always returns all K classes: MIC = 0.
- A classifier with perfect accuracy, which always returns the single correct class: MIC = ln K.
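Under the reconstruction of MIC above, a sketch (assumed encoding: classes as integer indices, y_sets[i] = U_i, cost[j][t] = m_jt):

```python
import math

def mic(y_true: list[int], y_sets: list[set[int]],
        cost: list[list[float]], K: int) -> float:
    """Cost-sensitive measure MIC for set-valued predictions."""
    total = 0.0
    for t, U in zip(y_true, y_sets):
        if t in U:   # success: reward shrinks as |U| grows; 0 if U is vacuous
            total += -math.log(len(U) / K)
        else:        # error: penalise the worst-case cost over U
            alpha = max(cost[j][t] for j in U)
            total -= alpha * math.log(K) / (K - 1)
    return total / len(y_true)
```

As a check against the slide's boundary cases: always returning all K classes gives -ln(K/K) = 0 per instance (MIC = 0), while always returning the single correct class gives -ln(1/K) = ln K per instance (MIC = ln K).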

Page 36: Part V. Experimental Evaluation

Page 37: Experimental Set-up

Aim: to study how the presence of misclassification costs affects the performance of an imprecise classifier.

We compare two imprecise classifiers: Imprecise Credal Decision Trees (ICDT) and the Naive Credal Classifier (NCC) [4].

We employ two evaluation measures: the DACC measure, and the MIC measure with different cost matrices.

Experiments are run on 40 UCI data sets using a 10-fold cross-validation scheme.

Statistical tests for the comparison:

- Corrected paired t-test at the 5% significance level.
- Wilcoxon test at the 5% significance level.

Page 39: Cost-Matrix (I)

        c1  c2  c3  c4
    c1   0   1   1   1
    c2   2   0   2   2
    c3   3   3   0   3
    c4   4   4   4   0

Rows: real classes. Columns: predicted classes. Classes are ordered from high to low frequency.

This cost matrix assumes that misclassification costs depend only on the real class:

- e.g. a diagnosis problem where the patient can be healthy or can suffer several diseases, from mild to more severe;
- the cost of an error in the diagnosis mainly depends on the actual disease.

Page 40: Cost-Matrix (II)

        c1  c2  c3  c4
    c1   0   2   3   4
    c2   1   0   3   4
    c3   1   2   0   4
    c4   1   2   3   0

This cost matrix assumes that misclassification costs depend only on the classifier's prediction:

- e.g. in the textile industry, different kinds of jeans have to be automatically classified into different quality levels;
- the cost depends only on the action triggered by the classifier's prediction.

Page 41: Cost-Matrix (III) and Cost-Matrix (IV)

    (III)  c4  c3  c2  c1        (IV)  c4  c3  c2  c1
      c4    0   1   1   1          c4    0   2   3   4
      c3    2   0   2   2          c3    1   0   3   4
      c2    3   3   0   3          c2    1   2   0   4
      c1    4   4   4   0          c1    1   2   3   0

Same principles as Cost-Matrix (I) and Cost-Matrix (II), respectively, but with the class order reversed: the highest costs are now associated with the most frequent classes.

Page 42: Experimental Evaluation I

                           ICDT    NCC
    Determinacy (avg)      94.7%   58.0%
    Single Acc. (avg)      79.3%   84.4%
    Set Acc. (avg)         89.0%   93.3%
    Indeterm. Size (avg)    5.1     5.5

ICDT and NCC perform differently as imprecise classifiers:

- ICDT has a much higher determinacy than NCC, but lower accuracy on the determinately classified cases.
- ICDT returns slightly fewer non-dominated classes, but with slightly lower set accuracy.

Page 43: Experimental Evaluation II

                      ICDT     NCC
    DACC (avg)        0.768    0.6035
    Wilcoxon test       *
    Paired t-test      22       4

(* marks a statistically significant difference at the 5% level; the t-test row counts the data sets with a significant win for each method.)

ICDT is a better imprecise classifier than NCC under DACC, with strong differences in favour of ICDT.

Page 44: Experimental Evaluation II (cont.)

                                    ICDT    NCC
    MIC, Cost Matrix (I) (avg)      0.796   0.790
      Wilcoxon test                   =       =
      Paired t-test                  13      10
    MIC, Cost Matrix (II) (avg)     1.066   0.808
      Wilcoxon test                   *
      Paired t-test                  17       9
    MIC, Cost Matrix (III) (avg)    0.936   1.492
      Wilcoxon test                   =       =
      Paired t-test                  15      11
    MIC, Cost Matrix (IV) (avg)     0.771   0.773
      Wilcoxon test                   =       =
      Paired t-test                  12      13

NCC can be competitive w.r.t. ICDT for some cost matrices.

Page 45: Part VI. Conclusions and Future Work

Page 46: Conclusions and Future Work

- A dominance criterion which considers misclassification costs when selecting the non-dominated classes.
- A cost-sensitive measure (MIC) to evaluate imprecise classifiers, taking misclassification costs into account.
- The performance of an imprecise classifier is affected by misclassification costs: ICDT outperforms NCC in terms of DACC (0/1 error costs), while NCC performs competitively for some cost matrices.

Future work: apply this approach to a real example.

Page 48: Bibliography

[1] J. Abellán and S. Moral, "Building classification trees using the total uncertainty criterion", Int. J. of Intelligent Systems, vol. 18, no. 12, pp. 1215–1225, 2003.

[2] J. Abellán and S. Moral, "Upper entropy of credal sets. Applications to credal classification", Int. J. of Approximate Reasoning, vol. 39, no. 2–3, pp. 235–255, 2005.

[3] J. Abellán, "Equivalence relations among dominance concepts on probability intervals and general credal sets", Int. J. of General Systems, vol. 41, no. 2, pp. 109–122, 2012.

[4] G. Corani and M. Zaffalon, "Learning reliable classifiers from small or incomplete data sets: the naive credal classifier 2", J. of Machine Learning Research, vol. 9, pp. 581–621, 2008.

[5] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley and Sons, New York, 1973.

[6] P. Walley, Statistical Reasoning with Imprecise Probabilities, Chapman and Hall, New York, 1991.

[7] P. Walley, "Inferences from multinomial data: learning about a bag of marbles", J. Roy. Statist. Soc. B, vol. 58, pp. 3–57, 1996.

[8] M. Zaffalon, "The naive credal classifier", J. of Statistical Planning and Inference, vol. 105, pp. 5–21, 2002.

Page 49: Thanks

Thanks for your attention!

Any questions?