

1

Information Geometry on Classification

  Logistic, AdaBoost, Area under ROC curve

Shinto Eguchi

– –

ISM seminar on 17/1/2001

This talk is based on joint work with Dr J. Copas

2

Outline

Problem setting for classification

overview of classification methods

Dw classification

Dw divergence of discriminant functions

definition from NP Lemma, expected and observed expressions

examples of Dw: logistic regression, AdaBoost, area under ROC curve, hit rate, credit scoring, medical screening

structure of Dw risk functions

optimal Dw under near-logistic; implemented by cross-validation

Risk scores of skin cancer

area under ROC curve, comparison; discussion of other methods

[ http://juban.ism.ac.jp/ ]

3

Standard methods

  Fisher linear discriminant analysis [4]

  Logistic regression [ Cornfield, 1962]

  Multilayer perceptron

[http://juban.ism.ac.jp/file_ppt/公開講座(ニューラル).ppt]

New approaches

  Boosting – combining weak learners –

  AdaBoost

[http://juban.ism.ac.jp/file_ppt/公開講座(Boost).ppt]

  Support vector machine – VC dimension –

[http://juban.ism.ac.jp/file_ppt/open-svm12-21.ppt]

  Kernel method – Mercer theorem –

[http://juban.ism.ac.jp/file_ppt/主成分発表原稿.ppt]

4

Problem setting

input vector

output variable

Definition: a map C from the input space to the label set is a classifier if C is onto.

    (direct sum)

the k-th decision space
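The formulas on this slide were lost in transcription. A standard reconstruction of the setup — the symbols (X for the input space, C for the classifier, R_k for the decision regions) are my notation, not necessarily the slide's:

```latex
% Classification setup (notation reconstructed, not verbatim from the slide)
x \in \mathcal{X} \subseteq \mathbb{R}^p \quad \text{(input vector)}, \qquad
y \in \{1, \dots, K\} \quad \text{(output variable)}.

% A map C is a classifier if it is onto:
C : \mathcal{X} \to \{1, \dots, K\}.

% The decision regions partition the input space (direct sum):
\mathcal{X} = R_1 \oplus \cdots \oplus R_K, \qquad
R_k = C^{-1}(k) \quad \text{(the $k$-th decision space)}.
```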

5

Joint distribution of (x, y):

where prior distribution

conditional distribution of x given y

Probabilistic model

Misclassification

error rate

hit rate
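The distributional formulas on this slide were stripped out; the standard expressions they state, with π_y and p(x | y) as assumed notation:

```latex
% Joint distribution factored into prior and conditional
P(x, y) = \pi_y \, p(x \mid y), \qquad
\pi_y = P(y) \ \text{(prior)}, \quad
p(x \mid y) \ \text{(conditional distribution of } x \text{ given } y\text{)}.

% Misclassification
\mathrm{err}(C) = P\{\, C(x) \neq y \,\} \quad \text{(error rate)}, \qquad
\mathrm{hit}(C) = P\{\, C(x) = y \,\} = 1 - \mathrm{err}(C) \quad \text{(hit rate)}.
```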

6

discriminant function

classifier

Bayes rule: given P(x, y),

Training data (examples)

i-th input, i-th output
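The statement of the Bayes rule was lost with the slide's images; the standard form, in reconstructed notation:

```latex
% Bayes rule: assign x to the class of maximal posterior probability
C_B(x) = \operatorname*{arg\,max}_{k \in \{1, \dots, K\}} \pi_k \, p(x \mid k).

% It minimizes the error rate over all classifiers:
\mathrm{err}(C_B) \le \mathrm{err}(C) \quad \text{for every classifier } C.

% Training data (examples): n input--output pairs
(x_1, y_1), \dots, (x_n, y_n).
```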

7

output variable

Reduction of our problem to binary classification

log-likelihood ratio

discriminant function

classifier

error rate
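For the binary case, the stripped formulas are the standard ones; a reconstruction (λ and s are assumed notation):

```latex
% Binary labels
y \in \{0, 1\}.

% Log-likelihood ratio
\lambda(x) = \log \frac{p(x \mid y = 1)}{p(x \mid y = 0)}.

% Discriminant function and induced classifier
s(x) = \lambda(x) + \log \frac{\pi_1}{\pi_0}, \qquad
C(x) = \begin{cases} 1 & \text{if } s(x) > 0, \\ 0 & \text{otherwise.} \end{cases}

% Error rate as a mixture of the two conditional errors
\mathrm{err}(C) = \pi_0 \, P\{\, C(x) = 1 \mid y = 0 \,\}
               + \pi_1 \, P\{\, C(x) = 0 \mid y = 1 \,\}.
```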

8

Other loss functions for classification

Credit scoring [5]

A cost model : a profit if y = 1; loss if y = 0.

General setting

Let c(y, ŷ) be the cost of classifying y as ŷ.

The expected cost is
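Spelling out the general cost setting — the cost function c(j, k) is my notation for the symbol lost in transcription:

```latex
% c(j, k): cost of classifying an example with true label y = j as k
\mathrm{E\,cost}(C)
  = \sum_{j \in \{0,1\}} \sum_{k \in \{0,1\}}
    c(j, k)\, \pi_j \, P\{\, C(x) = k \mid y = j \,\}.

% The error rate is the special case of unit misclassification costs:
c(j, k) = \mathbf{1}\{ j \neq k \}
  \;\Longrightarrow\; \mathrm{E\,cost}(C) = \mathrm{err}(C).
```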

9

hit

correct rejection false negative

false positive

ROC (Receiver Operating Characteristic) curve
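As a concrete illustration of the ROC curve and the area under it (the code and function names are mine, not from the talk), a minimal sketch that sweeps a binary score through all thresholds, accumulating hit rate against false positive rate:

```python
# Minimal ROC/AUC sketch (illustrative, not the talk's implementation).
# Assumes distinct scores; tied scores would need to be grouped.

def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs,
    one per threshold, sweeping from the highest score down."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for s, y in pairs:
        if y == 1:
            tp += 1   # hit
        else:
            fp += 1   # false positive
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0]
pts = roc_points(scores, labels)
print(auc(pts))  # 13 of the 16 positive-negative pairs are concordant
```

With no tied scores, the trapezoidal area equals the Mann–Whitney statistic: the probability that a randomly chosen y = 1 example scores higher than a randomly chosen y = 0 example.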

10

Main story

linear discriminant function

Given training data

objective function

proposed estimator

What should (U, V) be?

Logistic is OK.
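The objective function on this slide did not survive transcription. As an illustration only — this exact form is my reconstruction of the logistic-type family of [3], not a quote from the slide — an objective built from a pair of functions (U, V) applied to a linear score:

```latex
% Linear discriminant function
s_\beta(x) = \beta^{\top} x + \beta_0.

% Illustrative (U, V)-objective over the training data
L_{U,V}(\beta) = \frac{1}{n} \sum_{i :\, y_i = 1} U\!\big( -s_\beta(x_i) \big)
              + \frac{1}{n} \sum_{i :\, y_i = 0} V\!\big( s_\beta(x_i) \big).

% "Logistic is OK": U(t) = V(t) = \log(1 + e^{t}) recovers the negative
% logistic log-likelihood, since \log(1 + e^{-s}) = -\log \sigma(s) and
% \log(1 + e^{s}) = -\log(1 - \sigma(s)) for \sigma(s) = 1/(1 + e^{-s}).
```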

11

log-likelihood ratio

discriminant function

A reinterpretation of Neyman-Pearson Lemma

Proposition

Remark
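The Proposition's statement was lost with the slide images; the classical fact being reinterpreted is presumably this form of the Neyman-Pearson Lemma:

```latex
% Among all classifiers whose false positive rate does not exceed that of
% a likelihood-ratio rule, the likelihood-ratio rule has the largest hit rate.
C_c(x) = \mathbf{1}\{ \lambda(x) > c \}, \qquad
\lambda(x) = \log \frac{p(x \mid y = 1)}{p(x \mid y = 0)}.

P\{\, C(x) = 1 \mid y = 0 \,\} \le P\{\, C_c(x) = 1 \mid y = 0 \,\}
\;\Longrightarrow\;
P\{\, C(x) = 1 \mid y = 1 \,\} \le P\{\, C_c(x) = 1 \mid y = 1 \,\}.
```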

12

Proof of Proposition

13

Divergence Dw of discriminant function

Def.

Expectation expression

14

Proof

15

Sample expression given a set of training data

Minimum Dw method

for a statistical model F

16

Examples of Dw divergence

(1) logistic regression

(2) Hit rate, Credit scoring, medical screening

17

(3) Area under ROC curve

(4) AdaBoost

This Dw is the loss function of AdaBoost; cf. [7], [8].
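Written as margin losses with the labels recoded to y ∈ {-1, +1} and F the discriminant function, the two best-known losses here take their standard forms:

```latex
% Logistic regression: log-loss on the margin y F(x)
L_{\text{logistic}}(F) = \sum_{i=1}^{n} \log\!\big( 1 + e^{-y_i F(x_i)} \big).

% AdaBoost: exponential loss on the same margin, cf. [7], [8]
L_{\text{exp}}(F) = \sum_{i=1}^{n} e^{-y_i F(x_i)}.
```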

18

Structure of Dw risk functions

optimal Dw under near-logistic; implemented by cross-validation

Logistic(linear)-parametric model

model distribution of (x, y):

19

Estimating equation of minimum Dw methods

Remark

20

Cauchy-Schwarz inequality

Parametric assumption

21

Near-parametric assumption

22

Our risk function for an estimator is

But our situation is

Let

Cross-validated risk estimate

the bias term is

where

the variance term is

where the estimate is obtained from the training data by leaving the i-th example out.
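As a small runnable sketch of the leave-one-out idea (the dataset and the nearest-centroid classifier are illustrative stand-ins, not the estimator from the talk):

```python
# Leave-one-out cross-validation sketch (illustrative classifier and data):
# estimate the error rate of a nearest-centroid rule on 1-D inputs by
# refitting with each example held out in turn.

def fit_centroids(xs, ys):
    """Return (mean of class-0 inputs, mean of class-1 inputs)."""
    c0 = [x for x, y in zip(xs, ys) if y == 0]
    c1 = [x for x, y in zip(xs, ys) if y == 1]
    return sum(c0) / len(c0), sum(c1) / len(c1)

def classify(x, centroids):
    """Assign x to the class with the nearer centroid (ties go to 0)."""
    m0, m1 = centroids
    return 1 if abs(x - m1) < abs(x - m0) else 0

def loo_error(xs, ys):
    """Average 0-1 loss, each prediction made with the i-th example left out."""
    errors = 0
    for i in range(len(xs)):
        xs_i = xs[:i] + xs[i + 1:]
        ys_i = ys[:i] + ys[i + 1:]
        centroids = fit_centroids(xs_i, ys_i)
        errors += classify(xs[i], centroids) != ys[i]
    return errors / len(xs)

xs = [0.1, 0.3, 0.2, 0.9, 1.1, 1.0]
ys = [0,   0,   0,   1,   1,   1]
print(loo_error(xs, ys))
```

Each example is predicted by a model fitted without it, so the averaged 0-1 loss is a nearly unbiased estimate of the error rate, at the price of n refits.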

23

24

Outlier

For

25

Note :

where


References

[1] Begg, C. B., Satagopan, J. M. and Berwick, M. (1998). A new strategy for evaluating the impact of epidemiologic risk factors for cancer with applications to melanoma. J. Amer. Statist. Assoc. 93, 415-426.

[2] Berwick, M., Begg, C. B., Fine, J. A., Roush, G. C. and Barnhill, R. L. (1996). Screening for cutaneous melanoma by self skin examination. J. National Cancer Inst. 88, 17-23.

[3] Eguchi, S and Copas, J. (2000). A Class of Logistic-type Discriminant Functions. Technical Report of Department of Statistics, University of Warwick.

[4] Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.

[5] Hand, D. J. and Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: a review. J. Roy. Statist. Soc., A, 160, 523-541.

[6] McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley: New York.

[7] Schapire, R., Freund, Y., Bartlett, P. and Lee, W. S. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26, 1651-1686.

[8] Vapnik, V. N. (1999). The Nature of Statistical Learning Theory. Springer: New York.