Contributions to MiningMart
Petr Berka
Laboratory for Intelligent Systems
University of Economics, Prague
[email protected]@vse.cz
MiningMart presentation (c) Petr Berka, LISp, 2001
University of Economics, Prague
LISp - Laboratory for Intelligent Systems
SALOME - Laboratory for Multidisciplinary Approaches to Decision-making Support in Economics and Management
LISp research
probabilistic methods - decomposable probability models and Bayesian networks
symbolic ML methods - 4FT association rules and decision rules
logical calculi for knowledge discovery in databases
LISp activities
Organized conferences: ECML'97, PKDD'99
Organized workshops: Discovery Challenge (PKDD'99, PKDD2000, PKDD2001), WUPES'97, WUPES2000
International projects: MLNet, Sol-Eu-Net, EUNITE, MUM, MGT, KDNet
SALOME research
Quantitative and AI (pattern recognition, fuzzy, neural nets) approaches to the support of decision making in economics and management
SALOME activities
Organized workshops: STIPR'97, MME'99
International projects: Univ. Salzburg, Univ. Hokkaido, Univ. Cambridge
LISp software
LISp-Miner (data mining system): DataSource (for data manipulation), 4FT Miner (4FT association rules) and KEX (decision rules)
experimental software for building graphical models
preprocessing procedures: related to KEX, based on information theoretic approach
LISp-Miner procedures
DataSource
creating new (virtual) attributes using SQL
equidistant and equifrequent discretization
grouping attribute values
computing attribute-value frequencies
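The two discretization modes named above can be illustrated with a minimal Python sketch. It is not DataSource code: the function names and the `ages` sample are assumptions made for illustration, and the equifrequent cut points are taken at simple rank positions.

```python
from bisect import bisect_right


def equidistant_cuts(values, bins):
    """Cut points splitting [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def equifrequent_cuts(values, bins):
    """Cut points chosen so each interval holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def discretize(value, cuts):
    """0-based index of the interval that `value` falls into."""
    return bisect_right(cuts, value)


# Hypothetical sample: skewed ages, so the two modes give different intervals.
ages = [18, 19, 20, 21, 22, 23, 40, 60, 78]
width_cuts = equidistant_cuts(ages, 3)
freq_cuts = equifrequent_cuts(ages, 3)
freq_bins = [discretize(age, freq_cuts) for age in ages]
print(width_cuts, freq_cuts, freq_bins)
```

On skewed data such as this sample, equal-width intervals leave most values in the first interval, while the equifrequent cuts spread them evenly.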
LISp-Miner procedures
4FT-Miner (GUHA procedure)
4FT association rules in the form
Ant ~ Suc / Cond
KEX
weighted decision rules in the form
Ant ⇒ C (weight)
4FT-Miner basic idea
Generate a (potential) rule, e.g.
COLOUR(red) ∧ SIZE(small) ⇒_{0.9, 20} TEMP(high)
AGE(21-30) ∧ SALARY(low) ⇒_{0.85, 15} PAYMENTS(High) / LOAN(bad)
Verify a rule using four-fold table
        Suc   ¬Suc
Ant      a     b
¬Ant     c     d
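$\Rightarrow_{p,B}(a,b) = \mathrm{TRUE}$ iff $\frac{a}{a+b} \ge p \,\wedge\, a \ge B$
$\Leftrightarrow_{p,B}(a,b,c) = \mathrm{TRUE}$ iff $\frac{a}{a+b+c} \ge p \,\wedge\, a \ge B$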
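A minimal Python sketch of the verification step follows. It assumes the two truth conditions on this slide read a/(a+b) ≥ p and a ≥ B, and a/(a+b+c) ≥ p and a ≥ B. The class and function names are illustrative labels, not the 4FT-Miner API, and the counts are made up.

```python
from dataclasses import dataclass


@dataclass
class FourFoldTable:
    a: int  # Ant and Suc
    b: int  # Ant and not Suc
    c: int  # not Ant and Suc
    d: int  # neither Ant nor Suc


def four_fold_table(rows, ant, suc):
    """Count the four frequencies of predicates `ant` and `suc` over `rows`."""
    t = FourFoldTable(0, 0, 0, 0)
    for row in rows:
        if ant(row) and suc(row):
            t.a += 1
        elif ant(row):
            t.b += 1
        elif suc(row):
            t.c += 1
        else:
            t.d += 1
    return t


def implication_holds(t, p, base):
    """a/(a+b) >= p and a >= base, written without division."""
    return t.a >= base and t.a >= p * (t.a + t.b)


def double_implication_holds(t, p, base):
    """a/(a+b+c) >= p and a >= base, written without division."""
    return t.a >= base and t.a >= p * (t.a + t.b + t.c)


# Hypothetical counts: 19 of the 20 objects satisfying Ant also satisfy Suc.
table = FourFoldTable(a=19, b=1, c=5, d=75)
print(implication_holds(table, p=0.9, base=15))         # 19/20 >= 0.9 and 19 >= 15
print(double_implication_holds(table, p=0.9, base=15))  # 19/25 < 0.9
```

A conditional rule Ant ~ Suc / Cond can be checked the same way by first keeping only the rows that satisfy Cond.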
KEX basic idea
Generate a (potential) rule, e.g.
YEARS-IN-COMPANY(0-3) ∧ AGE(0-25) ⇒ LOAN(GOOD)
If the rule refines the current set of rules (validity a/(a+b) differs from the weight inferred during consultation), add it into the rule base with proper weight
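The refinement test can be sketched in Python under two stated assumptions that the slide does not confirm: weights from (0, 1) are combined with a PROSPECTOR-like function, and "differs" is decided by a plain tolerance instead of a statistical test. All names here are hypothetical.

```python
def combine(x, y):
    """Assumed PROSPECTOR-like combination of two weights from (0, 1)."""
    return x * y / (x * y + (1 - x) * (1 - y))


def weight_to_reach(inferred, validity):
    """Weight w for which combine(inferred, w) equals validity."""
    numerator = validity * (1 - inferred)
    return numerator / (numerator + (1 - validity) * inferred)


def maybe_add_rule(rule_base, antecedent, a, b, inferred, tolerance=0.05):
    """Add the rule only if its validity a/(a+b) differs from the inferred weight."""
    validity = a / (a + b)
    if abs(validity - inferred) <= tolerance:
        return False  # the current rules already explain this validity
    rule_base.append((antecedent, weight_to_reach(inferred, validity)))
    return True


rule_base = []
# Hypothetical case: the current rules infer 0.6, the data say 18/20 = 0.9.
added = maybe_add_rule(rule_base, "YEARS-IN-COMPANY(0-3) & AGE(0-25)", 18, 2, 0.6)
# Hypothetical case: validity 6/10 matches the inferred weight, nothing is added.
skipped = maybe_add_rule(rule_base, "AGE(0-25)", 6, 4, 0.6)
print(added, skipped, rule_base)
```

With this choice of weight, consulting the extended rule base on the same antecedent reproduces the observed validity.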
LISp-Miner architecture
[Diagram; labels: Data (ODBC ACCESS), MetaData (ODBC ACCESS), Results, LM, Windows]
Preprocessing (LISp)
KEX-oriented: (fuzzy) discretization + grouping of values, computing the amount of noise in data, random sampling + balancing of data, handling missing values
Information theory: attribute selection, attribute grouping
… fuzzy discretization
$\frac{N_{Class}(Int)}{N(Int)} \;<>\; \frac{N_{Class}}{N}$
… amount of noise
Amount of noise: 20%
max. possible accuracy = 80%
head  body  smile  holding  jacket  tie  class
o     r     y      s        r       y    +
o     r     y      s        r       y    -
o     r     y      f        y       n    -
o     r     y      b        y       n    -
o     r     n      s        r       y    +
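The 20% figure can be reproduced with a short Python sketch. It assumes that noise means the share of examples that contradict the majority class among examples with identical attribute values. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def amount_of_noise(examples):
    """Share of examples no classifier can get right, because another example
    with the same attribute values carries a different (majority) class."""
    classes_by_attrs = defaultdict(Counter)
    for *attrs, label in examples:
        classes_by_attrs[tuple(attrs)][label] += 1
    unavoidable = sum(
        sum(counts.values()) - max(counts.values())
        for counts in classes_by_attrs.values()
    )
    return unavoidable / len(examples)


# The five examples from the slide: head, body, smile, holding, jacket, tie, class.
examples = [
    ("o", "r", "y", "s", "r", "y", "+"),
    ("o", "r", "y", "s", "r", "y", "-"),
    ("o", "r", "y", "f", "y", "n", "-"),
    ("o", "r", "y", "b", "y", "n", "-"),
    ("o", "r", "n", "s", "r", "y", "+"),
]
noise = amount_of_noise(examples)
print(f"amount of noise: {noise:.0%}")
print(f"max. possible accuracy: {1 - noise:.0%}")
```

The first two examples share all attribute values but differ in class, so one of the five examples is misclassified by any classifier.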
… data sampling
random split into training and testing set
select random stratified sample
balance unbalanced classes
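The three operations listed above can be sketched in Python with the standard library alone. The function names and the toy data set are assumptions. Balancing is done here by undersampling to the smallest class, which is only one possible choice.

```python
import random
from collections import defaultdict


def train_test_split(examples, test_fraction, rng):
    """Random split into a training and a testing set."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def by_class(examples, label_of):
    """Group examples by their class label."""
    groups = defaultdict(list)
    for example in examples:
        groups[label_of(example)].append(example)
    return groups


def stratified_sample(examples, fraction, label_of, rng):
    """Random sample that keeps the class proportions of the full data."""
    sample = []
    for group in by_class(examples, label_of).values():
        sample.extend(rng.sample(group, round(len(group) * fraction)))
    return sample


def balance(examples, label_of, rng):
    """Undersample every class to the size of the smallest class."""
    groups = by_class(examples, label_of)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


def label_of(example):
    return example[1]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [(i, "+") for i in range(80)] + [(i, "-") for i in range(80, 100)]
train, test = train_test_split(data, 0.3, rng)
sample = stratified_sample(data, 0.1, label_of, rng)
balanced = balance(data, label_of, rng)
print(len(train), len(test), len(sample), len(balanced))
```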
… handling missing values
remove example
substitute missing with new value
substitute missing with majority value
proportional substitution
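The four options above might look as follows in Python. This is a sketch under assumptions: `None` marks a missing value, rows are tuples, and proportional substitution is read as drawing the replacement at random in proportion to the observed value frequencies. The slide does not spell out the exact procedure.

```python
import random
from collections import Counter

MISSING = None  # assumed marker for a missing value


def known_values(rows, i):
    return [row[i] for row in rows if row[i] is not MISSING]


def fill(row, i, value):
    """Copy of `row` with position `i` set to `value`."""
    return row[:i] + (value,) + row[i + 1:]


def remove_examples(rows, i):
    return [row for row in rows if row[i] is not MISSING]


def substitute_new_value(rows, i, new_value="unknown"):
    return [fill(r, i, new_value) if r[i] is MISSING else r for r in rows]


def substitute_majority(rows, i):
    majority = Counter(known_values(rows, i)).most_common(1)[0][0]
    return [fill(r, i, majority) if r[i] is MISSING else r for r in rows]


def substitute_proportional(rows, i, rng):
    counts = Counter(known_values(rows, i))
    values, weights = list(counts.keys()), list(counts.values())
    return [
        fill(r, i, rng.choices(values, weights)[0]) if r[i] is MISSING else r
        for r in rows
    ]


# Hypothetical rows: (jacket colour, class); the fourth colour is missing.
rows = [("red", "+"), ("red", "-"), ("blue", "+"), (MISSING, "+"), ("red", "+")]
print(remove_examples(rows, 0))
print(substitute_new_value(rows, 0))
print(substitute_majority(rows, 0))
print(substitute_proportional(rows, 0, random.Random(0)))
```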
… information theory
Attribute selection - based on mutual information
Attribute grouping - based on information content
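Attribute selection by mutual information can be illustrated with the following Python sketch. It ranks attributes by I(A; Class) estimated from counts. The function names and the four-example data set are assumptions, and the LISp procedures may use a different estimator or stopping rule.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """I(X; Y) in bits, estimated from paired observations."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (n_xy / n) * log2(n_xy * n / (count_x[x] * count_y[y]))
        for (x, y), n_xy in count_xy.items()
    )


def rank_attributes(columns, classes):
    """Attributes sorted by mutual information with the class, best first."""
    scores = {name: mutual_information(col, classes) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: `tie` determines the class, `head` is independent of it.
classes = ["+", "+", "-", "-"]
columns = {
    "head": ["o", "s", "o", "s"],
    "tie": ["y", "y", "n", "n"],
}
ranking = rank_attributes(columns, classes)
print(ranking)
```

An attribute that fully determines a balanced binary class scores 1 bit, and an independent attribute scores 0.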
Preprocessing architecture
[Diagram; labels: Data (ASCII), procedure, Results; Input data (ASCII), procedure, Output data (ASCII)]
SALOME software
Feature Selection Toolbox (Multi-Purpose Tool for Pattern Recognition): feature selection, approximation-based modeling, classification
a consulting system helping to choose the most suitable method is being developed
Search strategies for FS
Search for a subset maximizing a criterion function (distance, divergence):
with a priori information: exhaustive search, branch and bound based algorithms, floating search algorithms
without a priori information: approximation method, divergence method
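Exhaustive search, the simplest of the strategies listed, can be sketched in Python as below. The criterion is a made-up stand-in for a distance or divergence measure, and none of the names come from the Feature Selection Toolbox.

```python
from itertools import combinations


def exhaustive_search(features, size, criterion):
    """Evaluate every subset of the given size and return the best one."""
    return max(combinations(features, size), key=criterion)


# Hypothetical criterion: individual merit minus a penalty for redundant pairs.
MERIT = {"f1": 0.9, "f2": 0.8, "f3": 0.5, "f4": 0.4}
REDUNDANCY = {frozenset(("f1", "f2")): 0.7}


def criterion(subset):
    score = sum(MERIT[f] for f in subset)
    for pair in combinations(subset, 2):
        score -= REDUNDANCY.get(frozenset(pair), 0.0)
    return score


best = exhaustive_search(list(MERIT), 2, criterion)
print(best, criterion(best))
```

In this toy setting the two individually best features are redundant, so the best pair is not simply the top two by merit. The number of subsets grows combinatorially with the number of features, which is the usual motivation for branch and bound (applicable when the criterion is monotonic) and for floating search.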
FST architecture
[Diagram; labels: Data (ASCII), Results, FST, Windows]
References
LISp-Miner:
Berka, P. - Ivanek, J.: Automated Knowledge Acquisition for PROSPECTOR-like Expert Systems. In: (Bergadano, deRaedt eds.) Proc. ECML'94, Springer 1994, 339-342.
Berka, P. - Rauch, J.: Data Mining using GUHA and KEX. In: (Callaos, Yang, Aguilar eds.) 4th Int. Conf. on Information Systems, Analysis and Synthesis ISAS'98, 1998, Vol 2, 238-244.
Rauch, J.: Classes of Four Fold Table Quantifiers. In: (Zytkow, Quafafou eds.) Principles of Data Mining and Knowledge Discovery. Springer 1998, 203-211.
References
Preprocessing:
Bruha, I. - Berka, P.: Discretization and Fuzzification of Numerical Attributes in Attribute-Based Learning. In: Szepaniak, Lisboa, Kacprzyk (eds.): Fuzzy Systems in Medicine, Physica Verlag, 2000, 112-138.
Pudil, P. - Novovičová, J.: Novel Methods for Subset Selection with Respect to Problem Knowledge, IEEE Transactions on Intelligent Systems - Special Issue on Feature Transformation and Subset Selection 1998, 66-74.
Zvarova, J. - Studeny, M.: Information theoretical approach to constitution and reduction of medical data. International Journal of Medical Informatics 45 (1997), n. 1-2, pp. 65-74.