
PKDD Discovery Challenge (not only) on Financial Data

Petr Berka
Laboratory for Intelligent Systems
University of Economics, Prague
berka@vse.cz


DMLL Workshop, ICML 2002 Petr Berka, LISp, 2002


Cups, Challenges, Competitions

KDD Cups (since 1997)
KDD Sisyphus at ECML 1998
PKDD Discovery Challenges (since 1999)
COIL Competition 2000
PAKDD Challenge 2000
PT Challenge 2000, 2001
JSAI KDD Challenge 2001
EUNITE Competition 2001, 2002
. . .


PKDD Discovery Challenge Idea

Realistic data mining conditions
  collaborative rather than competitive nature
  rather vague specification of the problem

Differences from real KDD projects
  short time for analysis (2-3 months)
  only indirect access to domain and data experts during the KDD process


Challenge Settings

Data and their full description available on the web for all participants
Submissions evaluated by domain experts (but no ordering, no winners and losers)
Workshop at PKDD to present the results and discuss them with domain experts
Results and comments of experts available on the web (after the workshop)


PKDD Challenges
http://lisp.vse.cz/challenge

1999, Prague: financial data, thrombosis data
2000, Lyon: financial data, modified thrombosis data
2001, Freiburg: modified thrombosis data
2002, Helsinki: atherosclerosis data, hepatitis data


Financial Challenge Background

Czech bank offering private accounts

Available data for pilot study (29000 clients)
  personal characteristics
  basic info about accounts
  transactions for three months

Proposed tasks
  segmentation (defining different types of clients w.r.t. debt)
  early detection of debts


Financial Challenge Data

Relation          Key attributes
Disposition       disp_id, client_id, account_id
Credit Card       disp_id
Account           account_id, district_id
Permanent order   account_id
Loan              account_id
Person            client_id, district_id
Transactions      account_id
Demograph.        district_id
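The relations on this slide are linked through their shared key attributes, so a loan can be traced to its account, to the clients holding a disposition on that account, and to each client's district. The following is a minimal sketch using Python's built-in sqlite3 module. The table names are lower-cased guesses based on the slide, the key values are made up, and only the key columns named on the slide are modelled.

```python
import sqlite3

# In-memory database holding only the key columns shown on the slide.
# Table names and all inserted values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE demograph   (district_id INTEGER PRIMARY KEY);
    CREATE TABLE account     (account_id INTEGER PRIMARY KEY, district_id INTEGER);
    CREATE TABLE person      (client_id INTEGER PRIMARY KEY, district_id INTEGER);
    CREATE TABLE disposition (disp_id INTEGER PRIMARY KEY,
                              client_id INTEGER, account_id INTEGER);
    CREATE TABLE loan        (account_id INTEGER);

    INSERT INTO demograph   VALUES (1), (2);
    INSERT INTO account     VALUES (10, 1), (11, 2);
    INSERT INTO person      VALUES (100, 1), (101, 2), (102, 2);
    INSERT INTO disposition VALUES (1000, 100, 10), (1001, 101, 10), (1002, 102, 11);
    INSERT INTO loan        VALUES (10);
""")

# Follow loan -> account -> disposition -> person to list the clients
# (and their districts) behind every account that has a loan.
rows = conn.execute("""
    SELECT loan.account_id, disposition.client_id, person.district_id
    FROM loan
    JOIN account     ON account.account_id = loan.account_id
    JOIN disposition ON disposition.account_id = account.account_id
    JOIN person      ON person.client_id = disposition.client_id
    ORDER BY disposition.client_id
""").fetchall()
print(rows)
```

Flattening the relations this way yields one row per (loan, client) pair, which is the usual starting point for attribute-value learners that cannot work on multiple tables directly.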


Contributions

Method oriented
  show a method/system working on the data

Problem oriented (prototype solutions)
  loan and/or credit cards description
  loan and/or credit cards classification
  initial exploration
  relation between branches
  clients segmentation


Description of loans

Relations between loan category and account characteristics
[Coufal et al, 1999 - GUHA] [Mikšovský et al, 1999 - EXCEL]

#  LHS                         loan.status  Fisher     support  confidence
1  avg_sanction_interest(no)   good         6.12e-024  603      0.9234
2  avg_sanction_interest(yes)  bad          6.12e-024  26       0.8966
3  perm_ord_household(yes)     good         5.03e-013  421      0.9546
4  perm_ord_household(no)      bad          5.03e-013  56       0.2324
5  credit_card(yes)            good         1.38e-005  165      0.9706
6  monthly_payment(<2000)      good         3.33e-004  125      0.9690
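The support, confidence and Fisher columns can all be derived from a 2x2 contingency table of the rule's left-hand side against the loan status. The sketch below illustrates this for rows 1 and 2. The two off-diagonal counts (50 and 3) are not stated on the slide; they are inferred here from support divided by confidence (603 / 0.9234 ≈ 653 and 26 / 0.8966 ≈ 29). The use of a one-sided Fisher exact test is likewise an assumption about which variant the authors applied.

```python
from math import comb

def rule_stats(a, b, c, d):
    """Statistics for the rule LHS -> class from a 2x2 table.

    a = LHS and class,      b = LHS and not class,
    c = not LHS and class,  d = neither.
    Returns (support, confidence, one-sided Fisher p-value).
    """
    support = a
    confidence = a / (a + b)
    n, row, col = a + b + c + d, a + b, a + c
    # Hypergeometric tail: probability of seeing at least `a` class members
    # among the `row` LHS cases if LHS and class were independent.
    tail = sum(comb(col, k) * comb(n - col, row - k)
               for k in range(a, min(row, col) + 1))
    return support, confidence, tail / comb(n, row)

# avg_sanction_interest(no) -> good; counts 50 and 3 are inferred, see above.
support, confidence, p_value = rule_stats(603, 50, 3, 26)
print(support, round(confidence, 4), p_value)
```

Swapping rows and columns gives the statistics for the complementary rule, which is why rows 1 and 2 (and rows 3 and 4) of the table share the same Fisher value.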


Classification of loans

Detecting risky clients before they are granted a loan

[Mikšovský et al, 1999 - C5.0]

decision tree to find the relevance of attributes

decision tree for classification (using misclassification costs)
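Misclassification costs matter here because granting a loan to a risky client is usually more expensive than refusing a good one. The sketch below shows a generic minimum-expected-cost decision at a single tree leaf; it is not a description of how C5.0 handles costs internally, and the class probabilities and cost values are made up for illustration.

```python
def min_cost_class(class_probs, cost):
    """Pick the prediction with the lowest expected misclassification cost.

    class_probs: {actual_class: probability at this leaf}
    cost:        cost[actual_class][predicted_class]
    """
    def expected_cost(predicted):
        return sum(p * cost[actual][predicted]
                   for actual, p in class_probs.items())
    return min(class_probs, key=expected_cost)

# Hypothetical leaf: 80 % of training loans that reached it were good.
leaf = {"good": 0.8, "bad": 0.2}

# Hypothetical costs: missing a bad loan is five times worse than
# rejecting a good one.
asymmetric = {"good": {"good": 0, "bad": 1}, "bad": {"good": 5, "bad": 0}}
symmetric = {"good": {"good": 0, "bad": 1}, "bad": {"good": 1, "bad": 0}}

print(min_cost_class(leaf, symmetric))   # majority class wins
print(min_cost_class(leaf, asymmetric))  # cost shifts the decision
```

With symmetric costs the leaf predicts the majority class; with the asymmetric costs the expected cost of predicting "good" (0.2 × 5 = 1.0) exceeds that of predicting "bad" (0.8 × 1 = 0.8), so the decision flips.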


Credit Cards Promotion

Description - find characteristics of a card holder
  deviation detection

Classification - predict score for „card value“
  k-nearest neighbour

[Putten, 1999]
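A k-nearest-neighbour score can be sketched as the average target value of the k most similar known clients. The example below is only an illustration of the general technique: the slide does not say how the „card value“ score or the client features were defined, so the feature vectors, scores and the Euclidean distance are all assumptions.

```python
from math import dist

def knn_score(query, examples, k=3):
    """Average the scores of the k examples closest to `query`.

    examples: list of (feature_vector, score) pairs.
    """
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    return sum(score for _, score in nearest) / len(nearest)

# Hypothetical, already normalised client features with a 0/1 score.
examples = [
    ((0.0, 0.0), 0.0),
    ((1.0, 0.0), 1.0),
    ((0.0, 1.0), 1.0),
    ((5.0, 5.0), 0.0),
]
score = knn_score((0.1, 0.1), examples, k=3)
print(score)
```

Clients can then be ranked by this score so that a promotion targets those who most resemble existing card holders.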


Clients Segmentation

Description - segmentation of clients according to transactions [Hotho, Maedche, 2000]
  Kohonen map + decision trees

Rule #1 for Cluster 3
  If ATTR5 > 9945 and ATTR13 > 0 Then -> Cluster 3 (115, 0.983)


Challenge Organizing Lessons

Getting and preparing real data is difficult
The time for analysis should be as long as possible
The response rate was rather low (~10%)
No synergy effect observed


DM Lessons (1/4)

Cooperate with experts
  domain experts
  data experts
  . . .

… and with users


DM Lessons (2/4)

Use knowledge intensive preprocessing methods …
  compute age and sex from birth_number
  set flags for different types of operations
  compute monthly characteristics of transactions (sum, avg, min, max)
  $\text{balance} = \frac{1}{30} \sum_i \text{balance}(i) \cdot \text{days}(i)$
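Two of these preprocessing steps can be sketched in a few lines. The birth_number decoding below assumes the encoding given in the published description of the financial data set (YYMMDD for men, with 50 added to the month for women) and assumes all birth years fall in the 1900s. The reference date and the example values are hypothetical. The balance function implements the time-weighted average from the slide, where each balance level is weighted by the number of days it was held.

```python
from datetime import date

def decode_birth_number(birth_number, reference):
    """Return (sex, age in whole years at `reference`).

    Assumed encoding: YYMMDD for men, YY(MM+50)DD for women, years in 19YY.
    """
    yy = int(birth_number[0:2])
    mm = int(birth_number[2:4])
    dd = int(birth_number[4:6])
    sex = "F" if mm > 50 else "M"
    if mm > 50:
        mm -= 50
    born = date(1900 + yy, mm, dd)
    # Subtract one year if the birthday has not yet occurred in the reference year.
    before_birthday = (reference.month, reference.day) < (born.month, born.day)
    return sex, reference.year - born.year - before_birthday

def monthly_average_balance(segments, days_in_month=30):
    """Time-weighted balance: (1/30) * sum of balance(i) * days(i).

    segments: list of (balance, days the balance was held) pairs.
    """
    return sum(balance * days for balance, days in segments) / days_in_month

print(decode_birth_number("706213", date(1999, 1, 1)))
print(monthly_average_balance([(1000, 10), (4000, 20)]))
```

Weighting by days keeps a short-lived spike in the balance from dominating the monthly characteristic the way a plain average over transactions would.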


DM Lessons (3/4)

Make the results understandable

[Werner, Fogarty 2001](- ACLIGM (* (* (+ (+ LAC (* TAT (- (- (+ (* (* IGM (/ LDH LAC)) (* UA (+(/ KCT IGG) ALB))) (* UA (/ LDH LAC))) (/ (+ (+ PT C4) C4) UN)) ACLIGM)))ANA) (+ (+ C3 (- LDH (+ UA IGG))) LAC)) (+ (+ (* (* (* IGM (/ LDH LAC)) (/(/ (/ UA (* (+ (/ KCT ACLIGG) (* RF IGA)) (+ (* (/ PLT PIC) (+ LDH TCHO))(+ (- (* UA APTT) (* IGA TAT2)) (/ ACLIGG HGB))))) IGG) (* WBC UN))) HCT)(/ (* (* TAT (- (/ ALP UA) IGG)) (- (- (* (/ LDH LAC) (- TP C3)) (/ (+ (+PT C4) C4) UN)) ACLIGM)) (* (/ IGA (- GOT RBC)) (/ (* TAT2 HCT) (/ (/ (/UPRO SM) (+ (+ UA (+ (+ TCHO (- CENTROMEA LAC)) ACLIGG)) (- (* (- (* UAAPTT) (* IGA TAT2)) (+ (* TAT (+ PT (+ RBC (+ UA IGG)))) TP)) (+ UAIGG)))) (+ ACLIGM (+ (* (+ (+ (+ (* (* IGM (/ LDH (+ (/ (+ RBC (/ LDH LAC))RF) (* UA APTT)))) TP) CENTROMEA) (* (+ PT C4) (- (+ (/ (- LDH (+ (/ KCTACLIGG) (* RF IGA))) (/ ACLIGA SSB)) C3) dt))) (* (+ (* (+ DNAII IGA) HCT)(/ HCT LAC)) (+ RBC (/ (+ RBC (- (* (/ (- TG WBC) GOT) (- (+ (/ 0.08ACLIGA) (+ HGB PT)) dt)) (/ (+ (* (* IGM (/ LDH LAC)) TP) CENTROMEA) C3)))RF)))) HCT) (* IGG GPT)))))))) (+ (* TAT (- (+ (+ C3 (- LDH (+ UA IGG)))(- (* C4 TAT2) LDH)) (+ UA (/ (+ (+ (/ (/ (+ SM GOT) (* WBC UN)) (+ (/ (*(* GLU 0.03) (/ ALP UA)) RF) (* UA APTT))) (+ (+ (- (+ RBC (+ TG (/ (+ (*(* RF IGA) HCT) C4) (+ ACLIGM (- (+ (- TP C3) (/ C3 HGB)) (/ (- TG WBC)GOT)))))) ACLIGM) (+ TAT (+ (/ (* CENTROMEA (/ (* C4 TAT2) (/ (+ RBC (* dtACLIGA)) (* SM SC170)))) (* (/ HGB (- (/ ALP UA) RBC)) (/ ALP UA))) TP)))(/ (/ UPRO SM) (/ (+ RBC (* ACLIGM HGB)) GOT)))) (- (* (- (* UA APTT) (*IGA TAT2)) (+ (* TAT (+ PT (+ RBC (+ (+ PT (+ (+ (* (* IGM (/ LDH LAC))RNP) CENTROMEA) (* (/ (- TG WBC) GOT) (- (+ (/ (/ (+ DNAII IGA) (/ GPTACLIGM)) RF) C3) dt)))) (+ (/ (+ (- (* (+ (* C4 TAT2) (- (* C4 TAT2) PT))(+ RBC (+ (+ PT C4) (- CENTROMEA LAC)))) UA) C4) (+ ACLIGM (- (+ (-CENTROMEA LAC) (/ LDH LAC)) (- CENTROMEA LAC)))) (* (* ACLIGM HGB) (/ HCTLAC))))))) TP)) (+ UA IGG))) RF)))) (- (/ UPRO SM) LAC)))))


DM Lessons (4/4)

Show some (even preliminary) results soon
  experts are interested in solutions, not in applying sophisticated methods


Discovery Challenge Benefits

Experts
  deeper insight into the data

Participants
  experience with analyzing large real data
  motivations for further research

ML/KDD Community
  prototype tasks/solutions (like the MiningMart project?)

Organizers
  … invitation to DMLL Workshop :-)


Thank You


Contributions

1st author  KDD task                               KDD steps                                   DM method
Coufal      loan                                   preprocessing, description                  association rules
Levin       loans + credit cards                   description                                 association rules, ranking objects
Mikšovský   relations among branches               preprocessing, description, visualization   ILP
            loans                                  preprocessing, classification, visualization classification rules
Pijls       initial insight                        summarization
Putten      credit cards                           preprocessing, description                  deviation detection
                                                   preprocessing, prediction                   k-NN
Spenke      loans                                  visualization                               display correlations
Weber       loans + credit cards                   preprocessing, description                  association rules
Coufal      loans                                  description, classification                 association rules + tree
Hotho       client profiles based on transactions  preprocessing, clustering, classification   SOM, tree
Suzuki      loans                                  preprocessing, description                  exception rules