Finding Fuzzy Approximate Dependencies within STULONG Data

Discovery Challenge, ECML/PKDD 2003
September 22-27, 2003

Berzal F., Cubero J.C., Sanchez D., Serrano J.M., Vila M.A.
University of Granada (Spain)



2. Introduction

KDD allows us to obtain relations within data that are:
- Non-trivial.
- Previously unknown.
- Potentially useful.

Fuzzy data: extensions of KDD tools and techniques.


3. Problem representation

Fuzzy relational database.
- aij values: numeric, scalar (nominal), linguistic labels.
- Membership degrees.
- Fuzzy similarity relations, SA1, ..., SAm.

t#    A1             A2             ...   Am
t1    a11, t1(A1)    a12, t1(A2)    ...   a1m, t1(Am)
t2    a21, t2(A1)    a22, t2(A2)    ...   a2m, t2(Am)
t3    a31, t3(A1)    a32, t3(A2)    ...   a3m, t3(Am)
...   ...            ...            ...   ...
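One possible in-memory representation of such a table is sketched below. The class name `FuzzyCell`, the alias `FuzzyTuple`, and the sample values are hypothetical and chosen only for illustration; the slides do not describe an implementation.

```python
from dataclasses import dataclass
from typing import Dict, Union

# A cell value may be numeric, scalar (nominal), or a linguistic label.
Value = Union[float, str]


@dataclass(frozen=True)
class FuzzyCell:
    """One table cell: a value plus its membership degree (hypothetical type)."""
    value: Value
    degree: float  # membership degree, expected in [0, 1]

    def __post_init__(self):
        if not 0.0 <= self.degree <= 1.0:
            raise ValueError("membership degree must lie in [0, 1]")


# A tuple of the relation maps attribute names to cells.
FuzzyTuple = Dict[str, FuzzyCell]

# Illustrative tuple; the degrees are made up, not taken from STULONG.
t1: FuzzyTuple = {
    "BMI": FuzzyCell("overweight", 0.7),
    "DOPRAVA": FuzzyCell("by bike", 1.0),
}
```

Keeping the degree next to the value lets later steps weigh each tuple by how strongly it carries a label, instead of treating labels as crisp categories.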


4. Fuzzy Approximate Dependencies

We define Fuzzy Approximate Dependencies by relaxing some properties of Functional Dependencies:

$V \to W:\quad \forall t, s\ \ \bigl(t[V] = s[V] \Rightarrow t[W] = s[W]\bigr)$

- Equality relaxation: considering linguistic labels and membership degrees.
- Universal quantifier relaxation: allowing exceptions.
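The difference between an exact functional dependency and one that tolerates exceptions can be sketched as follows. This is a crisp, pairwise illustration only: the function names, the `min_ratio` threshold, and the sample rows are assumptions, and the pair ratio used here is a stand-in, not the support and certainty factor measures the authors use.

```python
from itertools import combinations


def fd_pair_stats(rows, V, W):
    """Count tuple pairs that agree on V, and those that also agree on W."""
    agree_v = agree_both = 0
    for t, s in combinations(rows, 2):
        if all(t[a] == s[a] for a in V):
            agree_v += 1
            if all(t[a] == s[a] for a in W):
                agree_both += 1
    return agree_v, agree_both


def holds_exactly(rows, V, W):
    """Classical functional dependency: no exception allowed."""
    agree_v, agree_both = fd_pair_stats(rows, V, W)
    return agree_v == agree_both


def holds_approximately(rows, V, W, min_ratio):
    """Relaxed universal quantifier: tolerate a share of exceptions."""
    agree_v, agree_both = fd_pair_stats(rows, V, W)
    return agree_v == 0 or agree_both / agree_v >= min_ratio


# Made-up rows: the last one is an exception to city -> country.
rows = [
    {"city": "Granada", "country": "Spain"},
    {"city": "Granada", "country": "Spain"},
    {"city": "Granada", "country": "Spain"},
    {"city": "Granada", "country": "Portugal"},
]
```

With these rows, all six pairs agree on `city` but only three agree on `country`, so the exact check fails while a threshold of 0.5 is met. Relaxing equality would amount to replacing `==` with a comparison based on a similarity degree.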


5. FAD Measures

Relevance degree:
- Support, supp($V \to W$).

Fulfilment degrees:
- Confidence, conf($V \to W$).
- Certainty factor, CF($V \to W$) [Shortliffe and Buchanan, 1975]. Measures variations in the degree of belief.
  - CF($V \to W$) = 1: maximum increment (perfect positive).
  - CF($V \to W$) = -1: maximum decrement.
  - CF($V \to W$) = 0: statistical independence.
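The slide does not spell out the formula. The sketch below uses the form of the certainty factor commonly found in the association rule literature, where the confidence of the dependency is compared with the support of the consequent alone; that the talk uses exactly this variant is an assumption, as are the function and parameter names.

```python
def certainty_factor(conf: float, supp_w: float) -> float:
    """Certainty factor of V -> W from conf(V -> W) and supp(W).

    Assumed definition (not stated on the slide):
      conf > supp(W):  (conf - supp(W)) / (1 - supp(W))
      conf < supp(W):  (conf - supp(W)) / supp(W)
      otherwise:       0
    """
    if conf > supp_w:
        # conf <= 1 implies supp_w < 1 here, so the division is safe.
        return (conf - supp_w) / (1.0 - supp_w)
    if conf < supp_w:
        # conf >= 0 implies supp_w > 0 here, so the division is safe.
        return (conf - supp_w) / supp_w
    return 0.0
```

The three reference points on the slide follow directly: a confidence of 1 gives CF = 1, a confidence of 0 gives CF = -1, and a confidence equal to the support of the consequent gives CF = 0.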


6. Applications

- Fuzzy Databases.
- Approximate Dependencies Discovery.
- Functional Dependencies Discovery.
- Other applications:
  - Low granularity data.
  - Overlapping semantics.


7. STULONG Database

Entry table:
- Normal Group (attribute KONSKUP having values 1 or 2).
- Risk Group (attribute KONSKUP having values 3 or 4).
- Pathologic Group (value 5 for attribute KONSKUP).


8. Data Preprocessing (I)

Problem: semantic overlapping in symbolic or scalar attributes.
Fuzzy similarity relations (subjective).
E.g.: DOPRAVA (means of transport for getting to work):

               by bike   public means   car   not stated
on foot        0.4       0.3            0.3   0.0
by bike                  0.3            0.3   0.0
public means                            0.4   0.0
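A similarity relation like the one for DOPRAVA can be stored as a lookup over unordered pairs. In the sketch below the listed degrees are the ones on the slide, read as the upper triangle of a matrix; reflexivity (degree 1.0 for identical values), symmetry, and the 0.0 default for the one pair the slide does not list (car, not stated) are assumptions, as are the names `PAIRS` and `similarity`.

```python
# Degrees from the slide, read as an upper-triangular matrix (assumption).
PAIRS = {
    ("on foot", "by bike"): 0.4,
    ("on foot", "public means"): 0.3,
    ("on foot", "car"): 0.3,
    ("on foot", "not stated"): 0.0,
    ("by bike", "public means"): 0.3,
    ("by bike", "car"): 0.3,
    ("by bike", "not stated"): 0.0,
    ("public means", "car"): 0.4,
    ("public means", "not stated"): 0.0,
}


def similarity(a: str, b: str) -> float:
    """Similarity degree between two DOPRAVA values."""
    if a == b:
        return 1.0  # assumed reflexive
    # Assumed symmetric; pairs missing from the slide default to 0.0.
    return PAIRS.get((a, b), PAIRS.get((b, a), 0.0))
```

For example, `similarity("by bike", "on foot")` returns 0.4 regardless of argument order, which is what allows two tuples with different but related transport values to count as partially equal.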


9. Data Preprocessing (II)

Problem: high granularity in numeric attributes.
Definition of sets of linguistic labels starting from intervals.
Numeric value $\to$ <Label, degree>
E.g.: BMI (Body mass index):

[Figure: BMI axis with marks at 24.73, 25.0, and 25.12, degree 1 on the vertical axis, and the labels "thin" and "overweight".]
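The mapping from a numeric value to <label, degree> pairs can be sketched with piecewise-linear membership functions. The shapes of the functions are not recoverable from the transcript, so the code below assumes that "thin" falls linearly from 1 to 0 between 24.73 and 25.12 while "overweight" rises over the same range; these breakpoints, the linear shape, and all names are assumptions made for illustration.

```python
def falling(x: float, a: float, b: float) -> float:
    """Degree 1 up to a, decreasing linearly to 0 at b."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)


def rising(x: float, a: float, b: float) -> float:
    """Complement of falling: 0 up to a, increasing linearly to 1 at b."""
    return 1.0 - falling(x, a, b)


# Assumed overlap zone between the two labels.
LOW, HIGH = 24.73, 25.12


def bmi_labels(bmi: float) -> dict:
    """Map a numeric BMI to membership degrees of the two labels."""
    return {
        "thin": falling(bmi, LOW, HIGH),
        "overweight": rising(bmi, LOW, HIGH),
    }
```

Under these assumptions a BMI inside the overlap zone, such as 25.0, belongs to both labels with degrees that sum to 1, while values outside the zone belong fully to one label.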


10. Analytical Questions (I)

Dependencies between social factors and physical activity.

Columns: ROKVSTUP, STAV, VZDELANI, ZODPOV

TELAKTZA: 0.67/0.14, 0.24/0.37, 0.25/0.28
AKTPOZAM: 0.14/0.47, 0.58/0.28, 0.14/0.49, 0.18/0.47
DOPRAVA: 0.20/0.32, 0.64/0.14, 0.19/0.32, 0.26/0.32
DOPRATRV: 0.17/0.47, 0.57/0.22, 0.16/0.46, 0.21/0.44


11. Analytical Questions (II)

Dependencies between social factors and smoking.

Columns: ROKVSTUP, STAV, VZDELANI, ZODPOV

KOURENI: 0.68/0.07
DOBAKOUR: 0.64/0.11, 0.26/0.25
BYVKURAK: 0.10/0.64, 0.42/0.39, 0.09/0.65, 0.13/0.64


12. Analytical Questions (III)

Dependencies between social factors and alcohol consumption.

          ROKVSTUP    STAV        VZDELANI    ZODPOV
ALKOHOL   0.21/0.35   0.63/0.15   0.19/0.34   0.24/0.31
PIVO10    0.16/0.43   0.58/0.21   0.16/0.43   0.21/0.41
PIVO12    0.10/0.62   0.47/0.39   0.10/0.62   0.13/0.61
VINO      0.16/0.43   0.58/0.21   0.16/0.44   0.21/0.41
LIHOV     0.16/0.43   0.58/0.21   0.16/0.43   0.20/0.41
PIVOMN    0.21/0.33   0.65/0.14   0.20/0.32   0.24/0.29
VINOMN    0.20/0.33   0.64/0.15   0.19/0.33   0.24/0.31
LIHMN     0.20/0.31   0.64/0.14   0.19/0.30   0.25/0.29


13. Analytical Questions (IV)

Dependencies between social factors and physical features.

Columns: ROKVSTUP, STAV, VZDELANI, ZODPOV

BMI: 0.16/0.44, 0.58/0.23, 0.15/0.45, 0.20/0.42
SYST1: 0.65/0.12, 0.25/0.26
DIAST1: 0.19/0.32, 0.63/0.14, 0.19/0.32, 0.24/0.30
SYST2: 0.65/0.12, 0.25/0.25
DIAST2: 0.19/0.33, 0.63/0.15, 0.18/0.33, 0.23/0.30


14. Analytical Questions (V)

Dependencies between physical activity and smoking.

Columns: TELAKTZA, AKTPOZAM, DOPRAVA, DOPRATRV

KOURENI: 0.50/0.11, 0.45/0.13
DOBAKOUR: 0.27/0.24, 0.47/0.18, 0.30/0.24, 0.42/0.19
BYVKURAK: 0.13/0.62, 0.26/0.51, 0.15/0.51, 0.23/0.55


15. Analytical Questions (VI)

Dependencies between physical activity and alcohol consumption.

          TELAKTZA    AKTPOZAM    DOPRAVA     DOPRATRV
ALKOHOL   0.27/0.31   0.46/0.23   0.29/0.30   0.41/0.25
PIVO10    0.22/0.39   0.40/0.30   0.24/0.39   0.35/0.33
PIVO12    0.14/0.59   0.29/0.50   0.16/0.59   0.23/0.50
VINO      0.22/0.40   0.40/0.31   0.24/0.39   0.35/0.33
LIHOV     0.22/0.39   0.39/0.30   0.24/0.38   0.35/0.33
PIVOMN    0.27/0.29   0.46/0.21   0.30/0.29   0.42/0.24
VINOMN    0.27/0.31   0.46/0.23   0.28/0.30   0.41/0.24
LIHMN     0.27/0.28   0.46/0.21   0.29/0.27   0.41/0.23


16. Analytical Questions (VII)

Dependencies between physical activity and physical features.

          TELAKTZA    AKTPOZAM    DOPRAVA     DOPRATRV
BMI       0.21/0.41   0.39/0.32   0.23/0.40   0.34/0.34
SYST1     0.27/0.26   0.46/0.19   0.29/0.25   0.42/0.21
DIAST1    0.25/0.29   0.44/0.22   0.28/0.29   0.39/0.23
SYST2     0.27/0.25   0.47/0.18   0.29/0.24   0.42/0.20
DIAST2    0.25/0.29   0.45/0.22   0.27/0.29   0.39/0.24


17. Analytical Questions (VIII)

Dependencies between physical activity and cholesterol degrees.

Columns: TELAKTZA, AKTPOZAM, DOPRAVA, DOPRATRV

CHLST: 0.28/0.24, 0.47/0.17, 0.30/0.23, 0.42/0.19
TRIGL: 0.49/0.13, 0.45/0.14


18. Analytical Questions (IX)

Dependencies between alcohol consumption and physical features.

          BMI         SYST1       DIAST1      SYST2       DIAST2
ALKOHOL   0.40/0.24   0.25/0.30   0.28/0.29   0.24/0.31   0.28/0.29
PIVO10    0.35/0.33   0.21/0.39   0.38/0.24   0.20/0.40   0.24/0.38
PIVO12    0.25/0.52   0.14/0.60   0.16/0.59   0.13/0.60   0.17/0.58
VINO      0.35/0.32   0.21/0.40   0.24/0.38   0.20/0.40   0.24/0.38
LIHOV     0.35/0.33   0.21/0.40   0.24/0.38   0.20/0.40   0.24/0.38
PIVOMN    0.41/0.23   0.25/0.28   0.29/0.27   0.25/0.29   0.29/0.27
VINOMN    0.40/0.24   0.25/0.30   0.28/0.28   0.24/0.30   0.28/0.28
LIHMN     0.41/0.22   0.25/0.28   0.29/0.27   0.24/0.28   0.29/0.27


19. Analytical Questions (X)

Dependencies between alcohol consumption and smoking.

Columns: KOURENI, DOBAKOUR, BYVKURAK

ALKOHOL: 0.23/0.30, 0.61/0.15
PIVO10: 0.13/0.44, 0.20/0.40, 0.56/0.22
PIVO12: 0.08/0.65, 0.13/0.60, 0.44/0.40
VINO: 0.13/0.44, 0.20/0.40, 0.56/0.22
LIHOV: 0.13/0.44, 0.20/0.40, 0.56/0.22
PIVOMN: 0.23/0.28, 0.61/0.14
VINOMN: 0.23/0.30, 0.61/0.15
LIHMN: 0.24/0.28, 0.62/0.14


20. Analytical Questions (XI)

Dependencies between skin folds and BMI:
- [TRIC] $\to$ [BMI], supp 15.85%, CF 0.54
- [SUBSC] $\to$ [BMI], supp 17.28%, CF 0.58


21. Concluding Remarks

FADs allow us to discover relations within imprecise or uncertain data.
Expert aid is desirable for:
- Data preprocessing.
- Results interpretation.