slides session 1

DATA MINING - 10 FEBRUARY 2004 Data Mining Luc Dehaspe K.U.L. Computer Science Department - Marc Van Hulle K.U.L. Neurophysiology Department http://toledo.kuleuven.ac.be/


Transcript of slides session 1

Page 1: slides session 1

DATA MINING - 10 FEBRUARY 2004

Data Mining

Luc Dehaspe

K.U.L. Computer Science Department

-

Marc Van Hulle

K.U.L. Neurophysiology Department

http://toledo.kuleuven.ac.be/

Page 2: slides session 1

DATA MINING - 10 FEBRUARY 2004 © LUC DEHASPE - 2004

Course overview

Data Mining

Session 1: Introduction

Session 2-3: Data warehousing/preparation

Session 4-6: Symbolic Data Mining techniques

Session 7: Application + Evaluation of Data Mining results

Session 8-14: Numeric Data Mining methods
• statistical techniques
• self-organizing techniques

(Hands-on) Exercise sessions

Page 3: slides session 1

Exercise session

Part 1 (L. Dehaspe): 2 × 2.5 h “paper-and-pencil” sessions

application of algorithms

Part 2 (M. Van Hulle) hands-on exercises

Page 4: slides session 1

Exam

Written exam, closed book

Part 1 (Sessions 1-7): 50%

Coverage
• Questions RESTRICTED TO CONTENT OF SLIDES
• Occasional pointers to additional material: I do not expect you to study this material

Questions
• One main question: apply + understand algorithm (30%)
• Two smaller questions: explain concept, compute model quality, … (2 × 10%)

Part 2 (Sessions 8-14): 50% (explained later by Marc Van Hulle)

Page 5: slides session 1

Working definition data mining

tools to search data for patterns and relationships that lead to better business decisions

“business”: commercial/scientific

Page 6: slides session 1

Overview

myths and facts

the Data Mining process

methods: visual, non-visual

Page 7: slides session 1

Myths and facts

New technology cycle
• phase 1: hype (unrealistic expectations, “naive” users)
• phase 2: frustration
• phase 3: rejection

Alternative: realistic view on vital technology

Page 8: slides session 1

Myth 1: tabula rasa (virgin territory)
Data mining methods are fundamentally different from previous methods

Fact
Underlying ideas often decades old
• neural networks: 1940
• k-nearest neighbour: 1950
• CART (regression trees): 1960

Novel
• integrated applications to general “business” problems
• more data, more computing power
• non-academic users

Page 9: slides session 1

Take home lesson 1

Not: one optimal method

But: portfolio of tools, mixture of old and new

Page 10: slides session 1

Myth 2: manna from heaven
Data mining produces surprising results that will turn your “business” upside-down
• without any input of domain expert knowledge
• without any tuning of the technology

Fact
• incremental rather than revolutionary changes: long-term competitive advantage, occasional breakthroughs (e.g. the link between aspirin and Reye's syndrome)
• technology is an assistant to the domain expert
• careful selection required of: goal, technology

Page 11: slides session 1

Take home lesson 2

Crucial combination of
• “business” (application domain) expertise
• data mining technology expertise

Page 12: slides session 1

Data Mining process model

Definition

Link with the scientific method

Page 13: slides session 1

The data mining process

process : iterative; learn to ask better questions

valid : patterns can be generalized to new data

novel and useful : offer a competitive advantage

understandable : contribute to insight in the domain

The non-trivial process of finding valid, novel, potentially useful, and ultimately understandable patterns in data

Page 14: slides session 1

Interrogating the database: Look-up queries

What is the average toxicity of cadmium chloride?

Biological data

Clinical data

Chemical data

How many earthquakes have occurred last year?

Which customers have a car insurance?

How did HIV patient p123 react to AZT?

Page 15: slides session 1

Interrogating the database: Finding patterns

What is the relation between geological features and the occurrence of earthquakes?

Data Mining

Biological data

Clinical data

Chemical data

What is the relation between in vitro activity and chemical structure?

What is the relation between the HIV patient’s therapy history and response to AZT?

What is the profile of returning customers?

Page 16: slides session 1

[Figure: eight numbered items; items 1-4 are labelled NON-ACTIVE and items 5-8 are labelled ACTIVE. Label: Science]

Page 17: slides session 1

Science

collect data

build hypothesis

verify hypothesis

formulate theory

Tycho Brahe (1546-1601)

observational genius

collected data on Mars

Johannes Kepler (1571-1630)

mined Brahe’s data

discovered laws of planetary motion

The formation of hypotheses is the most mysterious of all the categories of scientific method. Where they come from, no one knows. A person is sitting somewhere, minding his own business, and suddenly - flash! - he understands something he didn’t understand before.

Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance

The actual discovery of such an explanatory hypothesis is a process of creation, in which imagination as well as knowledge is involved.

Irving Copi, Introduction to Logic, 1986

Page 18: slides session 1

Evolution of data generation

Data source

Data analyst

Data

< 1950 > 2000

Data Rich, Knowledge Poor

Everyone, even the most patient and thorough investigator, must pick and choose, deciding which facts to study and which to pass over.

Irving Copi, Introduction to Logic, 1986

Page 19: slides session 1

The scientific method

collect data

build hypothesis

verify hypothesis

formulate theory

Data Mining

Statistics - OLAP

care inspiration

Knowledge discovery in Databases

Data warehousing

Page 20: slides session 1

Data Mining Definition:

Extracting or “mining” knowledge from large amounts of data

CRISP-DM process model

Page 21: slides session 1

Data mining in industry

An in silico research assistant allowing researchers to
• explore an integrated database
• for a variety of research purposes (“business goals”)
• using an optimal selection of data mining technologies

pattern

knowledge

Page 22: slides session 1

Data Mining process model: CRISP-DM

Page 23: slides session 1

Business understanding

Which are the business goals?

Translation to data mining problem definition

Design of a plan to meet objectives

Page 24: slides session 1

Data understanding

First collection of data

Becoming familiar with the data

Judge data quality

Discovery of first insights and interesting subsets

Page 25: slides session 1

Data preparation

Extract final data set from original set

Selection of tables, records, attributes

transformation

data cleaning

Page 26: slides session 1

Modelling

Selection of modelling techniques

calibrating parameters

regular backtracking to adapt data to technology

(some techniques discussed further on)

Page 27: slides session 1

Evaluation

Decide whether to use Data Mining results

Verification of all steps

Check whether business goals have been met

Page 28: slides session 1

Deployment

Organisation & presentation of new insights

variable complexity
• deliver report
• implement software that allows process to be repeated

Page 29: slides session 1

Visual Data Mining methods

Pro
• an image has a broader information bandwidth than text (cf. an image tells more than a thousand words)

Con
• problems with representation of > 3 dimensions
• not effective in case of color blindness
• interpretation gives more information on the subject than on the object (stars, clouds, the Rorschach test)

Page 30: slides session 1

Visual Data Mining methods: Error detection

Page 31: slides session 1

Visual Data Mining methods: Linkage analysis

Page 32: slides session 1

Visual Data Mining methods: Conditional probabilities

Page 33: slides session 1

Visual Data Mining methods: Landscapes

Page 34: slides session 1

Visual Data Mining methods: Scatter plots

Page 35: slides session 1

Non-visual data mining methods

Statistics - OLAP
• descriptive: average, median, standard deviation, distribution
• hypothesis testing: (observed differences)/(random variation)
• discriminant analysis
• predictive regression analysis: linear, non-linear
• clustering

Neural networks

Decision trees and rules

Conceptual clustering

Association rules
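
The descriptive statistics and the hypothesis-testing ratio named above can be sketched in a few lines. This is only an illustration: the sample values and the split into two groups are hypothetical, and the ratio is computed as Welch's t statistic, which is one common way (not necessarily the one intended in the course) to express (observed differences)/(random variation).

```python
import math
import statistics

# Hypothetical sample, e.g. sales volumes; not taken from a real data set.
sales = [10, 50, 35, 60, 20, 15, 70, 25]

print("average:", statistics.mean(sales))
print("median:", statistics.median(sales))
print("standard deviation:", statistics.stdev(sales))


def difference_ratio(a, b):
    """(observed difference) / (random variation) for two samples (Welch's t)."""
    observed = statistics.mean(a) - statistics.mean(b)
    variation = math.sqrt(statistics.variance(a) / len(a)
                          + statistics.variance(b) / len(b))
    return observed / variation


# Hypothetical split of the sample into two groups to compare.
ratio = difference_ratio(sales[:4], sales[4:])
print("difference ratio:", ratio)
```

A ratio close to zero means the observed difference is small compared with the random variation within the groups.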

Page 36: slides session 1

(Non-)visual Data Mining methods: OLAP - Data cubes

[Figure: data cube with three dimensions: City (Leuven, NY, Tokyo, Casablanca, Rio), Date (1-7) and Product (Juice, Cola, Milk, Cream, Toothpaste, Soap, Pizza, Cheese). Visible cell values: 10, 50, 35, 60, 20, 15, 70, 25]

Fact data: sales volume in $100

Online analytical processing

Classical statistical methods

+database technology

real-time calculations

powerful visualisation methods
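
A data cube can be pictured as a table of facts indexed by its dimensions, with OLAP operations summing over the dimensions that are not of interest. The sketch below is a minimal illustration in plain Python, not how an OLAP engine is implemented; the cell placement in `facts` and the names `DIMENSIONS` and `roll_up` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical fact table: (city, date, product) -> sales volume in $100.
facts = {
    ("Leuven", 1, "Juice"): 10,
    ("Leuven", 1, "Cola"): 50,
    ("Leuven", 2, "Juice"): 35,
    ("Tokyo", 1, "Juice"): 60,
    ("Tokyo", 2, "Cola"): 20,
}

DIMENSIONS = ("city", "date", "product")


def roll_up(facts, keep):
    """Sum the cube over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, volume in facts.items():
        totals[tuple(key[i] for i in idx)] += volume
    return dict(totals)


print(roll_up(facts, ["city"]))             # sales per city
print(roll_up(facts, ["city", "product"]))  # sales per city and product
```

Keeping fewer dimensions gives a coarser view of the same facts; keeping all three returns the original cells.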

Page 37: slides session 1

Non-Visual Data Mining methods: Regression

Page 38: slides session 1

Non-visual Data Mining methods: Discriminant analysis

R.A. Fisher, 1936

discovers planes that separate classes
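
For two classes in two dimensions, the separating plane reduces to a line. The sketch below computes Fisher's linear discriminant direction w = Sw^-1 (mb - ma) in plain Python. The data points are hypothetical, and placing the threshold halfway between the projected class means is an assumption of this example, not something stated on the slide.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def scatter(points, m):
    """2x2 within-class scatter matrix of 2-D points around their mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_line(class_a, class_b):
    """Return (w, threshold); x is assigned to class_b when w.x > threshold."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [mb[0] - ma[0], mb[1] - ma[1]]
    # w = Sw^-1 (mb - ma), using the closed-form inverse of a 2x2 matrix.
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    mid = [(ma[0] + mb[0]) / 2, (ma[1] + mb[1]) / 2]
    return w, w[0] * mid[0] + w[1] * mid[1]


def project(w, x):
    return w[0] * x[0] + w[1] * x[1]


# Hypothetical, well-separated classes.
class_a = [(1, 2), (2, 1), (2, 3), (3, 2)]
class_b = [(6, 5), (7, 8), (8, 6), (7, 5)]
w, threshold = fisher_line(class_a, class_b)
```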

Page 39: slides session 1

Non-Visual Data Mining methods: Neural Networks

Represent functions with output a discrete value, a real value, or a vector

Neurobiological motivation

Network parameters are tuned on the basis of input-output examples (backpropagation)

e.g. input from sensors: camera (face recognition), microphone (speech recognition)
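
To make "tuned on the basis of input-output examples" concrete, the sketch below trains a single sigmoid neuron by gradient descent. This is only the simplest case of the error-driven tuning that backpropagation extends to multi-layer networks, and it is not a full backpropagation implementation. The training examples (logical OR), the learning rate and the number of epochs are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Output of the neuron: a real value between 0 and 1."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(examples, epochs=2000, rate=0.5):
    """Tune the weights and bias of one sigmoid neuron from (inputs, target) pairs."""
    n_inputs = len(examples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            # Gradient of the cross-entropy error with respect to the net input.
            error = predict(weights, bias, inputs) - target
            weights = [w - rate * error * x for w, x in zip(weights, inputs)]
            bias -= rate * error
    return weights, bias


# Hypothetical input-output examples: the logical OR of two binary inputs.
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(examples)

for inputs, target in examples:
    print(inputs, target, round(predict(weights, bias, inputs), 3))
```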

Page 40: slides session 1

Non-Visual Data Mining methods: Decision trees

Page 41: slides session 1

Non-Visual Data Mining methods: Decision trees

Attribute selection: information gain
“how well does an attribute distribute the data according to their target class?”
maximal reduction of entropy, where

$\mathrm{Entropy} = -p_M \log_2 p_M - p_F \log_2 p_F$
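
The entropy formula and the resulting information gain can be computed directly. The sketch below assumes that pM and pF are the proportions of the two target classes (M and F) in a set of examples. The toy records and the attribute names `frame`, `color` and `sex` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy obtained by splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy data: four buyers with target class M/F.
buyers = [
    {"frame": "2-Door", "color": "Red", "sex": "M"},
    {"frame": "2-Door", "color": "Blue", "sex": "M"},
    {"frame": "4-Door", "color": "Red", "sex": "F"},
    {"frame": "4-Door", "color": "Blue", "sex": "F"},
]

for attribute in ("frame", "color"):
    print(attribute, information_gain(buyers, attribute, "sex"))
```

In this toy set `frame` separates the classes perfectly while `color` tells nothing about them, so a tree learner that maximises information gain would split on `frame` first.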

Page 42: slides session 1

Non-Visual Data Mining methods: Decision rules

IF Frame = 2-Door AND Engine = V6 AND Age < 50 AND Cost > 30K AND Color = Red

THEN buyer is highly likely to be male
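
A decision rule of this form is a conjunction of attribute tests, which maps directly onto a boolean function. In the sketch below the record layout and field names are hypothetical, and "30K" is assumed to mean 30,000 in whatever currency the data uses.

```python
def likely_male_buyer(record):
    """Return True when the record satisfies every condition of the rule."""
    return (record["frame"] == "2-Door"
            and record["engine"] == "V6"
            and record["age"] < 50
            and record["cost"] > 30_000
            and record["color"] == "Red")


# Hypothetical records, for illustration only.
sale_1 = {"frame": "2-Door", "engine": "V6", "age": 42, "cost": 35_000, "color": "Red"}
sale_2 = {"frame": "4-Door", "engine": "V6", "age": 42, "cost": 35_000, "color": "Red"}

print(likely_male_buyer(sale_1), likely_male_buyer(sale_2))
```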

Page 43: slides session 1

Non-Visual Data Mining methods

Clustering

Eisen et al, PNAS 1998

Cholesterol biosynthesis

Cell cycle

Early response

Signaling and angiogenesis

Wound healing

Page 44: slides session 1

Non-Visual Data Mining methods: Conceptual clustering

Groups examples and provides description of each group

: all examples

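A : Age < 20

B : Age = 20-40

b1 : Age = 20-40 and Frame = 2-Door

b2 : Age = 20-40 and Frame = 4-Door

C : Age = 40-60

D : Age > 60

d1 : Age > 60 and Frame = 2-Door

d2 : Age > 60 and Frame = 4-Door

[Figure: cluster hierarchy with top-level groups A, B, C and D; B is split into b1 and b2, D into d1 and d2]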

Page 45: slides session 1

Non-Visual Data Mining methods: Association rules

item sets
• Wine and Pizza: 60 %
• Wine, Pizza, Floppy, and Cheese: 40 %

association rule
IF Wine and Pizza THEN Floppy and Cheese

frequency: 40 %

accuracy: 40 % / 60 % ≈ 67 %

IF-THEN rules show relationships

e.g. which products are bought together?
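
The frequency and accuracy of an association rule can be checked by counting over a list of shopping baskets. The five baskets below are hypothetical and were chosen so that the item-set frequencies are 60 % and 40 %. The function names follow the slide's terms; elsewhere these measures are often called support and confidence.

```python
def frequency(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def accuracy(baskets, body, head):
    """frequency(body and head) / frequency(body)."""
    return frequency(baskets, set(body) | set(head)) / frequency(baskets, body)


# Hypothetical baskets, for illustration only.
baskets = [
    {"Wine", "Pizza", "Floppy", "Cheese"},
    {"Wine", "Pizza", "Floppy", "Cheese"},
    {"Wine", "Pizza"},
    {"Milk", "Soap"},
    {"Cola"},
]

body, head = {"Wine", "Pizza"}, {"Floppy", "Cheese"}
print("frequency of rule:", frequency(baskets, body | head))
print("accuracy of rule:", accuracy(baskets, body, head))
```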

Page 46: slides session 1

Evaluation: pitfalls. Post hoc ergo propter hoc

Everyone who drank Stella in the year 1743 is now dead.

Therefore, Stella is fatal.

Page 47: slides session 1

Evaluation: pitfalls. Correlation does not imply causality

Palm size correlates with your life expectancy

The larger your palm, the shorter you will live, on average.

Women have smaller palms

and live 6 years longer on average

Why?

!actions inspired by data mining results!

Page 48: slides session 1

Evaluation: pitfalls. Hypothesis validation

descriptive statistics: 1 hypothesis

data mining: 1 hypothesis-SPACE
• much higher probability of random relationships
• validation on separate data set required
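
The effect of searching a whole hypothesis space can be simulated. In the sketch below every "hypothesis" is a purely random binary attribute with no relation to the class label, so any apparent accuracy is a chance relationship. All sizes (200 hypotheses, 20 training examples, 1000 separate examples) and the seed are arbitrary assumptions for this illustration.

```python
import random


def accuracy(column, labels):
    """Fraction of examples where a binary attribute equals the class label."""
    return sum(a == y for a, y in zip(column, labels)) / len(labels)


def best_random_hypothesis(n_hypotheses=200, n_train=20, n_test=1000, seed=0):
    """Pick the best of many random hypotheses, then re-check it on separate data."""
    rng = random.Random(seed)

    def coin_flips(n):
        return [rng.randint(0, 1) for _ in range(n)]

    train_y, test_y = coin_flips(n_train), coin_flips(n_test)
    # Each hypothesis is random noise on both the training and the separate set.
    hypotheses = [(coin_flips(n_train), coin_flips(n_test))
                  for _ in range(n_hypotheses)]
    train_x, test_x = max(hypotheses, key=lambda h: accuracy(h[0], train_y))
    return accuracy(train_x, train_y), accuracy(test_x, test_y)


train_acc, test_acc = best_random_hypothesis()
print(f"best hypothesis on the mining data: {train_acc:.2f}")
print(f"same hypothesis on separate data:   {test_acc:.2f}")
```

The selected hypothesis looks good on the data it was mined from only because it was the best of many tries; on the separate data set it falls back to roughly chance level.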