Knowledge Discovery in Databases

Javier Béjar

LSI-FIB-UPC

2010/2011

Javier Bejar cbea (LSI-FIB-UPC) Knowledge Discovery in Databases 2010/2011 1 / 30

Outline

1 Knowledge Discovery in Databases
  Introduction
  Definitions of KDD

2 The KDD process
  Steps of KDD
  Discovery goals
  Mining Methodologies

3 Applications

4 Tools

5 Challenges

Knowledge Discovery in Databases

Knowledge Discovery in Databases (KDD) is a practical application of the methodologies of machine learning

It is of great interest to analyze the immense amounts of data stored in databases in order to extract whatever value they may hold

Doing this manually is impossible

Some methodologies are needed to automate the process of discovery

“We are drowning in information and starving for knowledge”

Knowledge Discovery in Databases

The boom of KDD started in the early 2000s

Many companies have shown interest in obtaining the (possibly) valuable information stored in their databases

The goal is to obtain information that can lead to better commercial strategies and practices (better knowledge of consumers' preferences and behaviour)

Many companies are putting a lot of effort into this kind of technology (Microsoft, IBM, Daimler-Benz, VISA, consulting companies...)

Several buzzwords have appeared: Business Intelligence, Business Analytics, Predictive Analytics, Data Science, Big Data, ...

Knowledge Discovery in Databases

Business applications are not the only drivers of this area

The need to analyze scientific data has motivated an important part of the methodologies developed

  Space probes
  Satellites
  Astronomical observations
  Genome Project ⇒ Bioinformatics

Frequently data grows faster than the ability to analyze it

KDD: Origins and influences

KDD is an area of research that is the intersection of different areas:

Statistical data analysis: Classical data analysis and modelling methodologies

Machine learning and pattern recognition: Methods for machine knowledge discovery and knowledge characterization

Databases: Data access efficiency

Data visualization: Tool to help in the discovery process and interpretation

KDD definitions

“It is the search for valuable information in great volumes of data”

“It is the exploration and analysis, by automatic or semiautomatic tools, of great volumes of data in order to discover patterns and rules”

“It is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”

Elements of KDD

Pattern: Any representation formalism capable of describing the common characteristics of a group of instances

Valid: A pattern is valid if it is able to predict the behaviour of new information with a degree of certainty

Novel: Knowledge is novel if it is not already known with respect to the domain knowledge and any previously discovered knowledge

Useful: New knowledge is useful if it allows performing actions that yield some benefit according to an established criterion

Understandable: The knowledge discovered must be analyzed by an expert in the domain; consequently, the interpretability of the result is important

The KDD process

KDD as a process

The actual discovery of patterns is only one part of a more complex process

Raw data is not always ready for processing (80/20 project effort: most of the effort usually goes into preparing the data)

Some general methodologies have been defined for the whole process (CRISP-DM or SEMMA)

These methodologies address KDD as an engineering process; despite being business oriented, they are general enough to be applied to any data discovery application

The KDD process (I)

Steps of the Knowledge Discovery in DB process

1 Domain study
2 Creating the dataset
3 Data preprocessing
4 Dimensionality reduction
5 Selection of the discovery goal
6 Selection of the adequate methodologies
7 Data Mining
8 Result assessment and interpretation
9 Using the knowledge

The KDD process (II)

1. Study of the domain
Gather information about the domain: its characteristics and the goal of the discovery process (attributes, representative examples, types of patterns, sources of data)

2. Creating the dataset
From the information gathered in the previous step, decide which data sources will be used, which attributes will describe the data, and which examples are needed for the goals of the discovery process

The KDD process (III)

3. Data preprocessing and cleaning
The circumstances that affect the quality of the data have to be studied:

  Outliers
  Noise (does it exist? does it present any pattern? can it be reduced?)
  Missing values
  Discretization of continuous values
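
As an illustration of two of the quality issues listed above, the following minimal sketch fills missing values with the median and flags outliers with the interquartile-range rule. Both choices are assumptions made for the example (the slides do not prescribe a technique), and the names `impute_missing`, `iqr_outliers` and the `ages` data are hypothetical.

```python
from statistics import median, quantiles

def impute_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical attribute with two missing entries and one suspicious value.
ages = [23, 25, None, 31, 29, 27, 120, None, 26]
clean = impute_missing(ages)
print(clean)
print(iqr_outliers(clean))
```

Whether a flagged value is an error or a legitimate rare case is a domain decision; the rule only points at candidates.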

The KDD process (IV) - Discretization

Discretization allows the use of methods that only handle qualitative values

It can improve the interpretability of the results

Automatic methods for discretization:

  Direct: equal-size bins, equal-frequency bins
  Statistical distribution approximation: histograms, function fitting
  Entropy based
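
The two direct methods can be sketched in a few lines. This is only an illustrative sketch: the function names and the `incomes` data are hypothetical, and "equal size" is read here as equal interval width.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum falls on the last edge, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        # Bins are assigned by rank, so tied values may end up in different bins.
        labels[i] = rank * n_bins // len(values)
    return labels

# Hypothetical skewed attribute: one very large value.
incomes = [12, 15, 18, 22, 30, 45, 60, 200]
print(equal_width_bins(incomes, 4))
print(equal_frequency_bins(incomes, 4))
```

On skewed data such as this, equal-width binning puts almost everything in the first bin, while equal-frequency binning spreads the values evenly; this is one reason to look at the distribution before choosing a method.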

The KDD process (V)

4. Data reduction and projection
We have to study which attributes are relevant to our goal (depending on the task, some techniques can be used to measure the relevance of the attributes) and how many examples are needed. Not all data mining algorithms are scalable

  Instance selection (do we need all the examples? sampling techniques)
  Attribute selection (what is really relevant?)
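
One possible sampling technique for instance selection is stratified sampling, which keeps the class proportions of the original data. The sketch below is an assumption chosen for illustration, not a method named in the slides; `stratified_sample` and the `rows` data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample about `fraction` of the rows from each class, keeping class proportions."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # keep at least one row per class
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical dataset: 10 "churn" rows and 90 "stay" rows.
rows = [(i, "churn" if i % 10 == 0 else "stay") for i in range(100)]
subset = stratified_sample(rows, label_of=lambda r: r[1], fraction=0.2)
print(len(subset))
```

Plain random sampling could by chance leave very few examples of the rare class in the subset; stratifying avoids that.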

The KDD process (VI) - Attribute selection

It is very important to use methods for attribute selection: they reduce the dimensionality, eliminate irrelevant and redundant information, and usually improve the result of the process (curse of dimensionality)

Attribute selection techniques:

  Mathematical/Statistical techniques: Principal component analysis (PCA), projection pursuit, multidimensional scaling
  Heuristic functions for attribute relevance (ranking of attributes, search in the space of subsets of attributes)
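
A ranking heuristic of the second kind can be sketched as follows. The relevance function used here, the absolute Pearson correlation with a numeric target, is an assumption picked for simplicity; it only detects linear, one-attribute-at-a-time relations. The names `pearson`, `rank_attributes` and the data are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return 0.0
    return cov / (dev_x * dev_y)

def rank_attributes(columns, target):
    """Sort attributes by absolute correlation with the target, most relevant first."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical housing data: three attributes and a numeric target.
columns = {
    "size_m2": [50, 60, 80, 100, 120],
    "floor": [3, 1, 4, 1, 5],
    "constant": [7, 7, 7, 7, 7],
}
price = [150, 175, 240, 310, 355]
ranking = rank_attributes(columns, price)
for name, score in ranking:
    print(name, round(score, 3))
```

A ranking like this ignores redundancy between attributes; searching the space of attribute subsets, the other heuristic mentioned above, addresses that at a higher computational cost.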

The KDD process (VII)

5. Selecting the discovery goal
The characteristics of the data, the domain and the aim of the project determine what kinds of analysis are feasible (group partitioning, summarization, classification, discovery of attribute relations, ...)

6. Selecting the adequate methodologies
The goal and the characteristics of the data determine the most adequate methodologies

The KDD process (VIII)

7. Applying the methodologies
The different parameters of the chosen methodologies have to be adjusted by experimentation and analysis in order to obtain the best possible results

8. Interpreting the results
Using the knowledge of the domain (expert), the relevance and importance of the result are assessed. This interpretation step may provide feedback for the previous steps: some adjustments may be needed or some earlier decisions may have to be changed

9. Incorporating the new knowledge
The new knowledge is used to perform the task that was the goal of the discovery process
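
The parameter adjustment of step 7 can be illustrated with a small experiment: try several values of a parameter and keep the one that performs best on data held out from training. The sketch below assumes a one-dimensional k-NN classifier and a single validation split; the functions `knn_predict`, `accuracy`, `tune_k` and the data are hypothetical.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the majority label among the k training points closest to x."""
    neighbours = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def accuracy(train, validation, k):
    """Fraction of validation examples that k-NN labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)

def tune_k(train, validation, candidates):
    """Return the candidate k with the best validation accuracy (ties keep the first)."""
    return max(candidates, key=lambda k: accuracy(train, validation, k))

# Hypothetical data; (2.2, "high") plays the role of a noisy example.
train = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (2.2, "high"),
         (4.0, "high"), (4.5, "high"), (5.0, "high")]
validation = [(1.8, "low"), (2.4, "low"), (4.2, "high"), (3.8, "high")]

for k in (1, 3, 7):
    print(k, accuracy(train, validation, k))
print(tune_k(train, validation, [1, 3, 7]))
```

With k = 1 the noisy example misleads the prediction, and with k = 7 every prediction collapses to the majority class; the intermediate value wins on this data. In practice cross-validation is usually preferred over a single split.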

Goals of the KDD process

There are different goals that can be pursued as the result of the discovery process, among them:

Classification: We need models that allow discriminating among instances that belong to a previously known set of groups (the model may or may not be interpretable)

Clustering/Partitioning/Segmentation: We need to discover models that cluster the data into groups with common characteristics (a characterization of the groups is desirable)

Regression: We look for models that predict the behaviour of continuous variables as a function of others

Goals of the KDD process

Summarization: We look for a compact description that summarizes the characteristics of the data

Causal dependence: We need models that reveal the causal dependence among the variables and assess the strength of this dependence

Structure dependence: We need models that reveal patterns among the relations that describe the structure of the data

Change: We need models that discover patterns in data that have temporal or spatial dependence
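
To make one of these goals concrete, the regression goal (predicting a continuous variable as a function of others) can be sketched with ordinary least squares on a single input. This is an assumed, minimal instance of regression; `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data generated exactly from y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```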

Methodologies for KDD

There are many methodologies that can be applied in the discovery process; the most usual are:

Decision trees, decision rules:

  They are usually interpretable models
  Can be used for: classification, regression, and summarization
  Trees: C4.5, CART, QUEST; rules: RIPPER, CN2, ...

Classifiers, Regression:

  Low interpretability but good accuracy
  Can be used for: classification and regression
  Statistical regression, function approximation, neural networks, Support Vector Machines, k-NN, Locally Weighted Regression, ...
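
The interpretability of decision trees comes from the way they are built: each node tests the attribute that best separates the classes. The sketch below shows only that selection step, using plain information gain on categorical attributes. It is a simplification made for illustration and not the C4.5 or CART algorithm itself; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on a categorical attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

def best_split(rows, labels):
    """Attribute with the highest information gain (the root test of a tree)."""
    return max(rows[0].keys(), key=lambda a: information_gain(rows, labels, a))

# Hypothetical data: "outlook" determines the class, "windy" is irrelevant.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
print(best_split(rows, labels))
```

A full tree learner would apply the same selection recursively to each partition and add stopping and pruning criteria.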

Methodologies for KDD

Clustering:

  Its goal is to partition datasets or discover groups
  Can be used for: clustering, summarization
  Statistical clustering, unsupervised machine learning, unsupervised neural networks (Self-Organizing Maps)

Dependency models (attribute dependence, temporal dependence, graph substructures):

  Their goal is to obtain models (some of them interpretable) of the dependence relations (structural, causal, temporal) among attributes/instances
  Can be used for: causal dependence discovery, temporal change, substructure discovery
  Bayesian networks, association rules, Markov models, graph algorithms, ...
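
As an example of the clustering family, here is a minimal sketch of k-means, a common partitioning algorithm. The slides do not name it; it is used here as an assumption, and the function `kmeans` and the `points` data are hypothetical.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Partition points into k groups by alternating assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two well separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, 2)
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

The centroids give the compact characterization of each group that the clustering goal asks for. On less clearly separated data the result depends on the initial centroids, so several restarts are commonly used.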

Applications

Business:

  Customer segmentation, customer profiling, customer transaction data, customer churn
  Fraud detection
  Control/analysis of industrial processes
  E-commerce, on-line recommendation
  Financial data (stock market analysis)

Web mining:

  Text mining, document search/organization
  Social network analysis
  User behavior

Applications

Scientific applications:

  Medicine (patient data, MRI scans, ECG, EEG, ...)
  Pharmacology (drug discovery, screening, in silico testing)
  Astronomy (identification of astronomical bodies)
  Genetics (gene identification, DNA microarrays, bioinformatics)
  Satellite/Probe data (meteorology, astronomy, geology, ...)
  Large scientific experiments (CERN LHC, ITER)

Surveillance (spying on people)

Tools

Tools for KDD

There are many tools available for KDD

Some tools were developed at universities (C5.0, CART/MARS) and have become commercial products; others remain open source (Weka, R, RapidMiner, scikit-learn)

Big fish eats little fish (C5.0 → Clementine → SPSS Clementine → IBM DBMiner)

Data analysis companies incorporate KDD techniques into classical data analysis tools (SPSS, SAS)

Companies selling databases add KDD tools as an added value (IBM DB2 (Intelligent Miner), SQL Server, Oracle)

Challenges

Open problems

Scalability (More data, more attributes)

Overfitting (patterns with low interest)

Statistical significance of the results

Methods for temporal data/relational data/structured data

Methods for data cleaning (Missing data and noise)

Pattern comprehensibility

Use of domain knowledge

Integration with other techniques (OLAP, Data Warehousing, Business Intelligence, Intelligent Decision Support Systems)

Privacy
