Descriptive Granularity - Building Foundations of Data Mining


Associate Professor Anita Wasilewska gave a lecture on "Descriptive Granularity" in the Distinguished Lecturer Series - Leon The Mathematician. More Information available at: http://dls.csd.auth.gr

Transcript of Descriptive Granularity - Building Foundations of Data Mining

Page 1: Descriptive Granularity - Building Foundations of Data Mining

DESCRIPTIVE GRANULARITY - Building Foundations of Data Mining

In Memory of my Professors: Zdzislaw Pawlak, Helena Rasiowa and Roman Sikorski

Anita Wasilewska

Computer Science Department

Stony Brook University

Stony Brook, NY

1

Page 2: Descriptive Granularity - Building Foundations of Data Mining

Part 1: INTRODUCTION

2

Page 3: Descriptive Granularity - Building Foundations of Data Mining

We all have scientific history;

All problems we work on have history;

It is important to trace the history of problems we work on;

We all build scientific history;

The future belongs to us,

and so does the past.

3

Page 4: Descriptive Granularity - Building Foundations of Data Mining

We all have scientific history;

Here is my LATEST history (of building Foundations of Data Mining):

1995-1998 I supervised the PhD Thesis of Ernestina Menasalvas, now Professor and a Vice-Rector of Madrid Polytechnic.

We (with some others) went from building models for concrete implementations (1996-2002), to developing a general language for Foundations of Data Mining (2002-2004), to building a general foundational model for Data Mining (2005- ).

4

Page 5: Descriptive Granularity - Building Foundations of Data Mining

It has been a slow process, but finally a community and specialized conferences developed, and books started to appear:

Foundations and Novel Approaches in Data Mining, T.Y. Lin, S. Ohsuga, C. J. Liau, and X. Hu, editors, Springer 2006,

Data Mining: Foundations and Practice, Tsau Young Lin, Ying Xie, Anita Wasilewska, Churn-Jung Liau, editors, Studies in Computational Intelligence (SCI) 118, Springer-Verlag 2008,

and the field of Foundations of Data Mining was created.

We all build the scientific history and it takes TIME and patience to do so.

5

Page 6: Descriptive Granularity - Building Foundations of Data Mining

Our work in Data Mining Foundations matured and finally we were invited by T.Y. Lin to write a 20-page entry about our research in the Encyclopedia of Complexity and System Science, published by Springer in 2008.

The Encyclopedia is Springer's latest and prestigious initiative, with its Board of Editors including, among others, Ahmed Zewail, Nobel Prize in Chemistry, Thomas Schelling, Nobel Prize in Economics, Richard E. Stearns, 1993 Turing Award, Pierre-Louis Lions, 1994 Fields Medal, and Lotfi Zadeh, IEEE Medal of Honor.

All entries were by invitation only, and the inclusion of our work shows the recognition of the need for foundational studies in newly developing domains.

6

Page 7: Descriptive Granularity - Building Foundations of Data Mining

All problems we work on have history

Short History of Foundational Studies

The origins of Foundational Studies can be traced back to David Hilbert, a German mathematician recognized as one of the most influential and universal mathematicians of the 19th and early 20th centuries.

7

Page 8: Descriptive Granularity - Building Foundations of Data Mining

Hilbert Problems: In 1900 he proposed, at the Paris conference of the International Congress of Mathematicians, 23 problems for the future century.

Several of them turned out to be very influential for 20th century mathematics and later Computer Science.

Of the cleanly-formulated Hilbert problems, TEN problems: 3, 7, 10, 11, 13, 14, 17, 19, 20, and 21 have solutions that are accepted by consensus.

8

Page 9: Descriptive Granularity - Building Foundations of Data Mining

TWO Problems: 1 and 2 are FOUNDATIONAL Problems; 1, concerning the Continuum Hypothesis, was solved by Cohen in 1963, and 2, concerning the Consistency of Arithmetic, was solved by Godel and Gentzen in 1936.

FIVE Problems: 5, 9, 15, 18, and 22 have partial solutions.

FOUR problems: 4, 6, 16, and 23 are too loosely formulated to be ever described as possible to be solved.

TWO Problems: 8 (the Riemann Hypothesis, along with the Goldbach conjecture, is a part of it) and 12 are still OPEN, both being in number theory.

9

Page 10: Descriptive Granularity - Building Foundations of Data Mining

The Riemann hypothesis was proposed by Bernhard Riemann (1859).

It is a conjecture about the distribution of the zeros of the Riemann zeta function which states that all non-trivial zeros have real part 1/2.

The Riemann hypothesis implies results about the distribution of prime numbers that are in some ways as good as possible.

Along with suitable generalizations, it is considered by some mathematicians to be the most important unresolved problem in pure mathematics.

10

Page 11: Descriptive Granularity - Building Foundations of Data Mining

Pierre Deligne proved in 1973 an analogue of the Riemann Hypothesis for zeta functions of varieties defined over finite fields.

The full version of the hypothesis remains unsolved, although computer calculations have shown that the first 10 trillion zeros lie on the critical line.

11

Page 12: Descriptive Granularity - Building Foundations of Data Mining

Goldbach's conjecture (1742) is one of the oldest unsolved problems in number theory and in all of mathematics. It states:

Every even integer greater than 2 can be expressed as the sum of two primes.

For example:

4 = 2 + 2, 6 = 3 + 3, 8 = 3 + 5,
10 = 7 + 3 or 5 + 5, 12 = 5 + 7, 14 = ...

T. Oliveira e Silva is running a distributed computer search that has verified the conjecture for n ≤ 1.609 × 10^18 and some higher small ranges up to 4 × 10^18.
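The statement is easy to check by brute force for small even numbers. Below is a minimal Python sketch of such a check (the helper names are illustrative, not taken from any standard library); for each even n it prints every decomposition into two primes, reproducing the examples above.

def is_prime(n: int) -> bool:
    """Trial-division primality test; adequate for the tiny range used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def goldbach_pairs(n: int):
    """All ways to write an even n > 2 as p + q with p, q prime and p <= q."""
    return [(p, n - p) for p in range(2, n // 2 + 1)
            if is_prime(p) and is_prime(n - p)]

for n in range(4, 16, 2):
    print(n, "=", " or ".join(f"{p}+{q}" for p, q in goldbach_pairs(n)))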

12

Page 13: Descriptive Granularity - Building Foundations of Data Mining

Hilbert Program

Hilbert proposed, in 1920, a research project that became known as Hilbert's Program.

1. He wanted mathematics to be formulated on a solid and complete logical foundation.

2. He believed that in principle this could be done, by showing that all of mathematics follows from a correctly-chosen finite system of axioms and that some such axiom system is provably consistent.

3. He also believed that one can have such a system in which proofs of theorems can be deduced automatically from the way the theorems are built.

13

Page 14: Descriptive Granularity - Building Foundations of Data Mining

In 1931 Kurt Godel showed that Hilbert's grand plan 1. and 2. was impossible as stated.

Godel proved, in what is now called Godel's Incompleteness Theorem, that any non-contradictory formal system which is comprehensive enough to include at least arithmetic cannot demonstrate its completeness by way of its own axioms.

In 1933-34 Gerhard Gentzen gave a positive answer to 3. in the case of classical propositional logic, and a partially positive answer in the case of (semi-undecidable) predicate logic.

Nevertheless, Hilbert's and Godel's work led to the development of recursion theory, and then of mathematical logic and foundations of mathematics as autonomous disciplines.

14

Page 15: Descriptive Granularity - Building Foundations of Data Mining

Gentzen's work led to the development of Proof Theory and Automated Theorem Proving as separate Mathematics and Computer Science domains.

Godel inspired the works of Alonzo Church and Alan Turing that became the basis for theoretical computer science, and also led to the further development of a unique phenomenon called the Polish School of Mathematics and later to the creation of Foundational Studies in Computer Science.

15

Page 16: Descriptive Granularity - Building Foundations of Data Mining

Personal History: my Master Thesis in Computer Science (under Pawlak and Rasiowa) consisted of a solution of Gentzen's conjecture for Modal S4 and S5 Logics, and consequently I also developed the world's first theorem prover for S4 Modal Logic in 1967.

As a result I spent the first 15 years of my scientific life (before coming to the USA) working in Proof Theory for non-classical logics, formulated (as a pure mathematician) a General Theory of Gentzen Type Formalizations, and established various results about connections and relationships between certain Classes of Logics, Formal Languages and the Theory of Programs (as a computer scientist).

16

Page 17: Descriptive Granularity - Building Foundations of Data Mining

Polish School of Mathematics

The term Polish School of Mathematics refers to groups of mathematicians of the 1920's and 1930's working on common subjects.

The two main groups were situated in Warsaw and Lvov (now Lviv, the biggest city in Western Ukraine).

We hence talk more specifically about the Warsaw and Lvov Schools of Mathematics, and additionally of the Warsaw-Lvov School of Logic working in Warsaw.

17

Page 18: Descriptive Granularity - Building Foundations of Data Mining

Any list of important twentieth century mathematicians contains Polish names in a frequency out of proportion to the size of the country.

Poland was partitioned by Russia, Germany, and Austria and was under foreign domination for 123 years, from 1795 until the end of World War I.

What was to become known as the Polish School of Mathematics was possible because it was carefully planned, agreed upon, and executed.

18

Page 19: Descriptive Granularity - Building Foundations of Data Mining

Independent Poland was created in 1918 and the University of Warsaw re-opened with Janiszewski, Mazurkiewicz, and Sierpinski as professors of mathematics.

They chose logic, set theory, point-set topology, and real functions as the areas of concentration.

The journal Fundamenta Mathematicae was founded in 1920 and is still in print.

It was the first specialized mathematical journal in the world.

19

Page 20: Descriptive Granularity - Building Foundations of Data Mining

The choice of title was deliberate, to reflect that all areas published there were to be connected with foundational studies.

It should be remembered that at the time these areas had not yet received full acceptance by the mathematical community.

The choice reflected both insight and courage.

20

Page 21: Descriptive Granularity - Building Foundations of Data Mining

The notable mathematicians of the Warsaw and Lvov Schools of Mathematics were, among others, Stefan Banach, Stanislaw Ulam and, after the war, Roman Sikorski.

Stefan Banach was a self-taught mathematics prodigy and the founder of modern functional analysis.

Mathematical concepts named after Banach include the Banach-Tarski paradox, the Hahn-Banach theorem, the Banach-Steinhaus theorem, the Banach-Mazur game and Banach spaces.

21

Page 22: Descriptive Granularity - Building Foundations of Data Mining

Stanislaw Ulam emigrated to America just before the war and became an American mathematician of Polish-Jewish origins.

He participated in the Manhattan Project and originated the Teller-Ulam design of thermonuclear weapons.

He also invented nuclear pulse propulsion and developed a number of mathematical tools in number theory, set theory, ergodic theory and algebraic topology.

22

Page 23: Descriptive Granularity - Building Foundations of Data Mining

Roman Sikorski's reputation was established by his outstanding results in Boolean algebras, functional analysis, theories of distribution, measure theory, general topology, descriptive set theory, and in Algebraic Mathematical Logic (in collaboration with Rasiowa).

In axiomatic set theory, the Rasiowa-Sikorski Lemma is one of the most fundamental facts used in the technique of forcing.

23

Page 24: Descriptive Granularity - Building Foundations of Data Mining

The notable logicians of the Lvov-Warsaw School of Logic were:

Alfred Tarski - since 1942 in Berkeley and founder of the American School of Foundations of Mathematics,

Jan Lukasiewicz, Andrzej Mostowski, and, after the Second World War, Helena Rasiowa.

24

Page 25: Descriptive Granularity - Building Foundations of Data Mining

Helena Rasiowa became, in 1977, the founder of Fundamenta Informaticae, the first journal in the world specialized in the foundations of computer science.

The choice of the title Fundamenta Informaticae was again deliberate.

It reflected not only the subject, but also stressed that the new research area being developed in Warsaw was a direct continuation of the tradition of the Foundational Studies of the Polish School of Mathematics.

25

Page 26: Descriptive Granularity - Building Foundations of Data Mining

Part 2:

DESCRIPTIVE GRANULARITY

A Model for Data Mining

26

Page 27: Descriptive Granularity - Building Foundations of Data Mining

We present here a formal syntax and semantics for a notion of descriptive granularity.

We do so in terms of three abstract models: Descriptive, Semantic, and Granular.

The Descriptive model formalizes the syntactical concepts and properties of the data mining, or learning, process.

The Semantic model formalizes its semantical properties.

The Granular model establishes a relationship between the Descriptive and Semantic models in terms of a formal satisfaction relation.

27

Page 28: Descriptive Granularity - Building Foundations of Data Mining

Data Mining - Informal Definition

One of the main goals of Data Mining is to provide comprehensible descriptions of information extracted from databases.

We are hence interested in building models for descriptive data mining, i.e. data mining whose main goal is to produce a set of descriptions in a language easily comprehensible to the user.

28

Page 29: Descriptive Granularity - Building Foundations of Data Mining

The descriptions come in different forms.

In the case of classification problems it might be a set of characteristic or discriminant rules; it might be a decision tree or a neural network with a fixed set of weights.

In the case of association analysis it is a set of associations (frequent item sets), or association rules with accuracy parameters.

In the case of cluster analysis it is a set of clusters, each of which has its own description and a cluster name.

29

Page 30: Descriptive Granularity - Building Foundations of Data Mining

In the case of approximate classification by Rough Set analysis it is usually a set of discriminant or characteristic rules (with or without accuracy parameters) or a set of decision tables.

Data Mining results are usually presented to the user in their descriptive, i.e. syntactic, form as it is the most natural form of communication.

But the Data Mining process is deeply semantical in its nature.

We hence build our Granular Model on two levels: syntactic and semantic.

30

Page 31: Descriptive Granularity - Building Foundations of Data Mining

SYNTAX

We understand by syntax, or syntactical concepts, simple relations among symbols and expressions of formal symbolic languages.

A symbolic language is a pair

L = (A, E),

where A is an alphabet and E is the set of expressions of L.

The expressions of formal languages, even if created with a specific meaning in mind, do not carry any meaning themselves; they are just finite sequences of certain symbols.

The meaning is assigned to them by establishing a proper semantics.

31

Page 32: Descriptive Granularity - Building Foundations of Data Mining

SEMANTICS

Semantics for a given symbolic language L assigns a specific interpretation in some domain to all symbols and expressions of the language.

It also involves related ideas such as truth and model. They are called semantical concepts to distinguish them from the syntactical ones.

32

Page 33: Descriptive Granularity - Building Foundations of Data Mining

MODEL

The word model is used in many situations and has many meanings, but they all reflect some parts, if not all, of its following formal meaning.

A structure M, also called an interpretation, is a model for a set E0 ⊆ E of expressions of a formal language L if and only if every expression E ∈ E0 is true in M.

33

Page 34: Descriptive Granularity - Building Foundations of Data Mining

All our Models are abstract structures that allow us to formalize some general properties of the Data Mining process and to address the semantics-syntax duality inherent in any Data Mining process.

Moreover, they allow us to provide a formal definition of a generalization and of Data Mining as the process of information generalization.

34

Page 35: Descriptive Granularity - Building Foundations of Data Mining

The notion of generalization is defined in terms of the granularity of the steps of the process.

Data is represented in the model in the form of Knowledge Systems.

Each Knowledge System has a granularity associated with it, and the process changes, or not, its granularity.

Granularity is crucial for defining some notions and components of the model, hence the name Granular Model.

35

Page 36: Descriptive Granularity - Building Foundations of Data Mining

Granular Model

A Granular Model is a system

GM = ( SM, DM, |= ), where:

• SM is a Semantic Model;

• DM is a Descriptive Model;

• |= ⊆ P(U) × E is called a satisfaction relation, where U is the universe of SM and E is the set of descriptions defined by the DM.

Satisfaction |= establishes a truth relationship between the semantic (data mining) model and the descriptive model.

36

Page 37: Descriptive Granularity - Building Foundations of Data Mining

Semantic Model definition motivation.

The first step in any data mining procedure is to drop the key attribute.

This step allows us to introduce similarities in the database, as records no longer have their unique identification.

The input into the data mining process is hence always a data table obtained from the target data by removal of the key attribute.

We call it a target data table.

37

Page 38: Descriptive Granularity - Building Foundations of Data Mining

As the next step we represent, following the Rough Set model, our target data table as Pawlak's Information System with the universe U, by adding a new, non-attribute column for the record names, i.e. the objects of U. We take this set U as the universe of our model SM.

Why an Information System?

We want to model Data Mining as a process of generalization.

In order to model this process we first have to define what it means, from the semantical point of view, that one stage of the process is more general than the other.

38

Page 39: Descriptive Granularity - Building Foundations of Data Mining

The idea behind it is very simple. It is the same as saying that (a + b)² = a² + 2ab + b² is a more general formula than the formula (2 + 3)² = 2² + 2·2·3 + 3².

This means that one description (formula) is more general than the other if it describes more objects.

From the semantical point of view it means that the data mining process consists of putting objects (records) into sets of objects.

From the syntactical point of view the data mining process consists of building descriptions (in terms of attribute, value of attribute pairs) of these sets of objects, with some extra parameters, if needed.

39

Page 40: Descriptive Granularity - Building Foundations of Data Mining

To model a situation that allows us to talk about descriptions of sets of records (objects), we extend the notion of Pawlak's model of an information system to our notion of a Knowledge System.

The universe of a knowledge system contains some subsets of U, i.e. elements of P(U).

For example, a target data table (after preprocessing), the corresponding representation by Pawlak's information system, and a knowledge system with universe U of granularity one are as follows.

40

Page 41: Descriptive Granularity - Building Foundations of Data Mining

Target Data Table T0

a1       a2       a3
small    small    medium
medium   small    medium
small    small    medium
big      small    small
medium   medium   big
small    small    medium
big      small    small
medium   medium   big
small    small    medium
big      small    medium
medium   medium   small
small    small    medium
big      small    big
medium   medium   small

Target Information System I0

U     a1       a2       a3
x1    small    small    medium
x2    medium   small    medium
x3    small    small    medium
x4    big      small    small
x5    medium   medium   big
x6    small    small    medium
x7    big      small    small
x8    medium   medium   big
x9    small    small    medium
x10   big      small    medium
x11   medium   medium   small
x12   small    small    medium
x13   big      small    big
x14   medium   medium   small

41

Page 42: Descriptive Granularity - Building Foundations of Data Mining

The Knowledge System of granularity one (all objects are one-element sets) corresponding to the target table T0 is as follows.

Target Knowledge System K0

P1(U)    a1       a2       a3
{x1}     small    small    medium
{x2}     medium   small    medium
{x3}     small    small    medium
{x4}     big      small    small
{x5}     medium   medium   big
{x6}     small    small    medium
{x7}     big      small    small
{x8}     medium   medium   big
{x9}     small    small    medium
{x10}    big      small    medium
{x11}    medium   medium   small
{x12}    small    small    medium
{x13}    big      small    big
{x14}    medium   medium   small
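A minimal Python sketch of how T0, I0 and K0 could be represented in code (the dictionary layout and the names rows, f and g are illustrative choices, not prescribed by the model):

# Target information system I0: the information function f as a dictionary
# (object, attribute) -> value, built from the rows of the target data table T0.
rows = [
    ("small", "small", "medium"), ("medium", "small", "medium"),
    ("small", "small", "medium"), ("big", "small", "small"),
    ("medium", "medium", "big"), ("small", "small", "medium"),
    ("big", "small", "small"), ("medium", "medium", "big"),
    ("small", "small", "medium"), ("big", "small", "medium"),
    ("medium", "medium", "small"), ("small", "small", "medium"),
    ("big", "small", "big"), ("medium", "medium", "small"),
]
attributes = ("a1", "a2", "a3")
U = [f"x{i}" for i in range(1, len(rows) + 1)]          # x1, ..., x14
f = {(x, a): v for x, row in zip(U, rows) for a, v in zip(attributes, row)}

# Target knowledge system K0 of granularity one: the k-function g is defined
# only on the one-element granules {x}, where it agrees with f.
g = {(frozenset({x}), a): f[(x, a)] for x in U for a in attributes}

print(max(len(S) for S, _ in g))    # granularity of K0, i.e. 1

Granules are kept as frozensets so they can serve as dictionary keys; this is only one convenient encoding of P(U).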

42

Page 43: Descriptive Granularity - Building Foundations of Data Mining

Assume now that we have applied some algorithm ALG1 and it has returned the following set

D = {D1, D2, ..., D7}

of descriptions:

D1 : (a1 = s) ∩ (a2 = s) ∩ (a3 = m),

D2 : (a1 = m) ∩ (a2 = s) ∩ (a3 = m),

D3 : (a1 = m) ∩ (a2 = m) ∩ (a3 = b),

D4 : (a1 = m) ∩ (a2 = m) ∩ (a3 = s),

D5 : (a1 = b) ∩ (a2 = s) ∩ (a3 = s),

D6 : (a1 = b) ∩ (a2 = s) ∩ (a3 = m),

D7 : (a1 = b) ∩ (a2 = s) ∩ (a3 = b).

43

Page 44: Descriptive Granularity - Building Foundations of Data Mining

Questions

Q1 How well does this set of descriptions describe our original data, i.e. how accurate is the algorithm ALG1 we have used to find them?

Q2 How accurate is the knowledge we have thus obtained from our data?

The answer is formulated in terms of the target information system with the universe U, and the sets S(D) defined (after Pawlak) for any description D ∈ D as follows:

S(D) = {x ∈ U : D}.

We call S(D) the truth set for D.

44

Page 45: Descriptive Granularity - Building Foundations of Data Mining

Intuitively, the sets

S(D) = {x ∈ U : D}

contain all records (i.e. their identifiers) with the same description, given in terms of attribute, value of attribute pairs.

The descriptions do not need to utilize all attributes of the target data, as is often the case, and one of the ultimate goals of data mining is to find descriptions with as few attributes as possible.

45

Page 46: Descriptive Granularity - Building Foundations of Data Mining

In association analysis the descriptions can represent the frequent item sets.

For example, for a frequent three-itemset D = i1i2i3, the truth set S(D) represents all transactions that contain the items i1, i2, i3.

In general, descriptions come in different forms, depending on the data mining goal and application.

We define formally a general form of descriptions as a part of the Descriptive Model.

46

Page 47: Descriptive Granularity - Building Foundations of Data Mining

For the target data and descriptions Di ∈ D presented in the above examples the sets S(Di) are as follows.

S1 = S(D1) = {x ∈ U : D1} = {x1, x3, x6, x9, x12},

S2 = S(D2) = {x ∈ U : D2} = {x2},

S3 = S(D3) = {x ∈ U : D3} = {x5, x8},

S4 = S(D4) = {x ∈ U : D4} = {x11, x14},

S5 = S(D5) = {x ∈ U : D5} = {x4, x7},

S6 = S(D6) = {x ∈ U : D6} = {x10},

S7 = S(D7) = {x ∈ U : D7} = {x13}.
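These truth sets are easy to compute mechanically. A small Python sketch (using an abbreviated copy of the target table; all names here are illustrative):

# Abbreviated target information system: object -> (a1, a2, a3) values,
# with s/m/b standing for small/medium/big as on the slides.
I0 = {
    "x1": ("s", "s", "m"), "x2": ("m", "s", "m"), "x3": ("s", "s", "m"),
    "x4": ("b", "s", "s"), "x5": ("m", "m", "b"), "x6": ("s", "s", "m"),
    "x7": ("b", "s", "s"), "x8": ("m", "m", "b"), "x9": ("s", "s", "m"),
    "x10": ("b", "s", "m"), "x11": ("m", "m", "s"), "x12": ("s", "s", "m"),
    "x13": ("b", "s", "b"), "x14": ("m", "m", "s"),
}

# The descriptions D1..D7 returned by ALG1, as (a1, a2, a3) value patterns;
# None would mark an attribute not used by the description.
D = {
    "D1": ("s", "s", "m"), "D2": ("m", "s", "m"), "D3": ("m", "m", "b"),
    "D4": ("m", "m", "s"), "D5": ("b", "s", "s"), "D6": ("b", "s", "m"),
    "D7": ("b", "s", "b"),
}

def truth_set(pattern, table):
    """S(D): all objects whose attribute values match the pattern componentwise."""
    return {x for x, row in table.items()
            if all(p is None or p == v for p, v in zip(pattern, row))}

S = {name: truth_set(p, I0) for name, p in D.items()}
print(sorted(S["D1"]))    # ['x1', 'x12', 'x3', 'x6', 'x9']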

47

Page 48: Descriptive Granularity - Building Foundations of Data Mining

We represent our results in the form of a Knowledge System as follows.

Resulting Knowledge System K1

P(U)                          a1   a2   a3
S1 = {x1, x3, x6, x9, x12}    s    s    m
S2 = {x2}                     m    s    m
S3 = {x5, x8}                 m    m    b
S4 = {x11, x14}               m    m    s
S5 = {x4, x7}                 b    s    s
S6 = {x10}                    b    s    m
S7 = {x13}                    b    s    b

48

Page 49: Descriptive Granularity - Building Foundations of Data Mining

The representation of data mining results in the form of a knowledge system allows us to define how good the knowledge obtained by a given algorithm is.

In our case the knowledge obtained describes 100% of our target data, as

S1 ∪ S2 ∪ ... ∪ S7 = {x1, x2, ..., x14} = U.

Observe that the sets S1, ..., S7 are also disjoint and non-empty, i.e. they form a partition of the universe U.

We define such knowledge as exact.

49

Page 50: Descriptive Granularity - Building Foundations of Data Mining

Moreover, we can see that the resulting system K1 is more general than the input data K0 because its granularity is higher than the granularity of K0.

Definition: The granularity of a knowledge system is the maximum of the cardinalities of its granules, i.e. the elements of its universe.

The granularity of all Target Knowledge Systems is one.

The granularity of K1 is

max{|S1|, ..., |S7|} = max{5, 1, 2, 2, 2, 1, 1} = 5.
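Both the exactness check (the granules form a partition of U) and this granularity computation are easy to spell out in code; a minimal Python sketch over the sets S1, ..., S7 listed above (the variable names are mine):

from itertools import combinations

U = {f"x{i}" for i in range(1, 15)}
granules = [                       # the truth sets S1..S7 obtained from ALG1
    {"x1", "x3", "x6", "x9", "x12"}, {"x2"}, {"x5", "x8"}, {"x11", "x14"},
    {"x4", "x7"}, {"x10"}, {"x13"},
]

covers_U  = set().union(*granules) == U
disjoint  = all(a.isdisjoint(b) for a, b in combinations(granules, 2))
non_empty = all(granules)

print("K1 exact:", covers_U and disjoint and non_empty)      # True
print("granularity of K1:", max(len(S) for S in granules))   # 5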

50

Page 51: Descriptive Granularity - Building Foundations of Data Mining

Now assume that we have applied to our target data T0 (represented by K0) another algorithm ALG2 and it has returned two descriptions D1, D2, under the condition that we need only descriptions of length 2 and with frequency ≥ 30%. The descriptions are:

D1 : (a1 = s) ∩ (a2 = s),

D2 : (a2 = s) ∩ (a3 = m).

Now we evaluate:

S1 = S(D1) = {x1, x3, x6, x9, x12},

S2 = S(D2) = {x1, x2, x3, x6, x9, x10, x12}.

51

Page 52: Descriptive Granularity - Building Foundations of Data Mining

Incorporating the algorithm parameters imposed by ALG2 into our Knowledge System we obtain the following table.

Resulting Knowledge System K2

P(U)   a1   a2   a3   # of attr   frequency
S1     s    s    -    2           36%
S2     -    s    m    2           50%

The sets S1, S2 do not form a partition of the universe U, as S1 ∩ S2 ≠ ∅ and moreover S1 ∪ S2 ≠ U.

The knowledge obtained by the algorithm ALG2 is hence not exact.

It describes only 50% of the target data (S1 ∪ S2 covers 7 of the 14 records), and what is described is described following certain (frequency) conditions.

Of course K2 is more general than K0.
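A short Python sketch that recomputes the frequencies of D1 and D2 and the coverage of S1 ∪ S2 from the example table (the abbreviated table and the helper below are illustrative, as before):

I0 = {   # object -> (a1, a2, a3), with s/m/b for small/medium/big
    "x1": ("s", "s", "m"), "x2": ("m", "s", "m"), "x3": ("s", "s", "m"),
    "x4": ("b", "s", "s"), "x5": ("m", "m", "b"), "x6": ("s", "s", "m"),
    "x7": ("b", "s", "s"), "x8": ("m", "m", "b"), "x9": ("s", "s", "m"),
    "x10": ("b", "s", "m"), "x11": ("m", "m", "s"), "x12": ("s", "s", "m"),
    "x13": ("b", "s", "b"), "x14": ("m", "m", "s"),
}

def truth_set(pattern):
    """Objects matching the pattern componentwise; None marks an unused attribute."""
    return {x for x, row in I0.items()
            if all(p is None or p == v for p, v in zip(pattern, row))}

S1 = truth_set(("s", "s", None))     # D1: (a1 = s) and (a2 = s)
S2 = truth_set((None, "s", "m"))     # D2: (a2 = s) and (a3 = m)

for name, S in (("D1", S1), ("D2", S2)):
    print(name, f"frequency = {len(S) / len(I0):.0%}")                # 36% and 50%
print("partition of U:", S1.isdisjoint(S2) and S1 | S2 == set(I0))    # False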

52

Page 53: Descriptive Granularity - Building Foundations of Data Mining

The algorithm ALG2 generalized the target data, even if in an incomplete way.

The formal definitions of Information System, Knowledge and Target Knowledge Systems, and of their granularity and exactness, are as follows.

53

Page 54: Descriptive Granularity - Building Foundations of Data Mining

A Knowledge System is an extension of the following notion of Pawlak's information system.

An Information System is a system

I = (U, A, VA, f),

where U ≠ ∅ is called the set of objects,

A ≠ ∅, VA ≠ ∅ are called the set of attributes and the set of values of attributes, respectively,

f is called an information function and

f : U × A −→ VA.

54

Page 55: Descriptive Granularity - Building Foundations of Data Mining

A knowledge system based on the information system

I = (U, A, VA, f)

is a system

KI = (P(U), A, E, VA, VE, g)

where

E is a finite set of knowledge attributes (k-attributes) such that A ∩ E = ∅,

VE is a finite set of values of k-attributes.

55

Page 56: Descriptive Granularity - Building Foundations of Data Mining

g is a partial function called the knowledge information function (k-function)

g : P(U) × (A ∪ E) −→ (VA ∪ VE)

such that

(i) g | (⋃x∈U {x} × A) = f,

(ii) ∀S∈P(U) ∀a∈A ((S, a) ∈ dom(g) ⇒ g(S, a) ∈ VA),

(iii) ∀S∈P(U) ∀e∈E ((S, e) ∈ dom(g) ⇒ g(S, e) ∈ VE).

56

Page 57: Descriptive Granularity - Building Foundations of Data Mining

We use the above notion of knowledge system to define the granules of the universe and the granularity of the system, and hence, later, the granularity of the data mining process.

Granule: Any set S ∈ P(U), i.e. S ⊆ U, is called a granule of U.

Granularity of S: The cardinality |S| of S is called the granularity of S.

Granule Universe: The set

GrK = {S ∈ P(U) : ∃b ∈ (E ∪ A) ((S, b) ∈ dom(g))}

is called the granule universe of KI.

Granularity of K: The number grK = max{|S| : S ∈ GrK} is called the granularity of K.
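A compact Python sketch of these definitions, with the partial k-function g represented as a dictionary on (granule, attribute) pairs; the class and method names are my own choices, not part of the formal model:

from dataclasses import dataclass

@dataclass
class InformationSystem:          # I = (U, A, VA, f)
    U: set
    A: set
    VA: set
    f: dict                       # total information function, keyed by (object, attribute)

@dataclass
class KnowledgeSystem:            # K_I = (P(U), A, E, VA, VE, g)
    I: InformationSystem
    E: set                        # knowledge attributes, disjoint from A
    VE: set                       # values of the knowledge attributes
    g: dict                       # partial k-function, keyed by (frozenset granule, attribute)

    def well_formed(self) -> bool:
        """Conditions (i)-(iii): g extends f on one-element granules and respects VA, VE."""
        i = all(self.g.get((frozenset({x}), a)) == self.I.f[(x, a)]
                for x in self.I.U for a in self.I.A)
        ii = all(v in self.I.VA for (S, a), v in self.g.items() if a in self.I.A)
        iii = all(v in self.VE for (S, e), v in self.g.items() if e in self.E)
        return self.I.A.isdisjoint(self.E) and i and ii and iii

    def granule_universe(self) -> set:
        """Gr_K: every granule on which g is defined for some attribute."""
        return {S for (S, _b) in self.g}

    def granularity(self) -> int:
        """gr_K = max |S| over S in Gr_K."""
        return max(len(S) for S in self.granule_universe())

With this layout, the target knowledge system K0 from the example has granularity 1 and the resulting system K1 has granularity 5.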

57

Page 58: Descriptive Granularity - Building Foundations of Data Mining

A knowledge system K = (P(U), A, E, VA, VE, g) is called exact if and only if its granules, i.e. the elements of GrK, form a partition of the universe U.

Operators: In our Model we represent data mining algorithms as certain operators.

For example, our ALG1 is represented in the semantic model by an operator p1 acting on some subset of a set K of knowledge systems, such that

p1(K0) = K1.

ALG2 is represented in the model by an operator p2, also acting on some (possibly different) subset of the set K of knowledge systems, such that

p2(K0) = K2.

58

Page 59: Descriptive Granularity - Building Foundations of Data Mining

We put all the above observations into the formal notion of a semantic model.

A Semantic Model is a system

SM = (P(U), K, G), where:

• U ≠ ∅ is the universe;

• K ≠ ∅ is a set of knowledge systems, also called data mining process states;

• G ≠ ∅ is the set of operators;

• each operator p ∈ G is a partial function on the set of all data mining process states, i.e. p : K −→ K.

59

Page 60: Descriptive Granularity - Building Foundations of Data Mining

The semantic model is always built for a given application.

The target data is represented first in the form of the target information system with the universe U, and then in the form of the target knowledge system K0, as we showed in our examples.

60

Page 61: Descriptive Granularity - Building Foundations of Data Mining

The semantic model based on our examples is as follows.

SM = (P(U), K, G), where:

• U = {x1, x2, ..., x14};

• K = {K0, K1, K2};

• G = {p1, p2};

• each pi ∈ G (for i = 1, 2) is a partial function pi : K −→ K, such that

p1(K0) = K1, p2(K0) = K2.
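In code, this example semantic model can be sketched with states as names and operators as partial functions represented by dictionaries (a schematic illustration; the variable names are mine):

# States are named K0, K1, K2; operators are partial functions on states,
# represented as dictionaries (defined only where the algorithm applies).
universe  = {f"x{i}" for i in range(1, 15)}      # U = {x1, ..., x14}
states    = {"K0", "K1", "K2"}                   # K
p1 = {"K0": "K1"}                                # ALG1: p1(K0) = K1
p2 = {"K0": "K2"}                                # ALG2: p2(K0) = K2
operators = [p1, p2]                             # G

SM = (universe, states, operators)               # SM = (P(U), K, G), schematically
print(p1["K0"], p2["K0"])                        # K1 K2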

61

Page 62: Descriptive Granularity - Building Foundations of Data Mining

Data Mining as Generalization

We model data mining as a process of generalization, in terms of a generalization relation based on the notion of granularity, and of generalization operators.

Definition: A relation ⪯ ⊆ K × K is called a generalization relation if the following condition holds for any K, K′ ∈ K:

K ⪯ K′ if and only if grK ≤ grK′,

where grK denotes the granularity of K.

62

Page 63: Descriptive Granularity - Building Foundations of Data Mining

Observe that for K0, K1, K2 from our examples grK0 = 1 ≤ 5 = grK1 ≤ 7 = grK2, and the system K2 is the most general.

But at the same time K1 is exact and K2 is not exact, so we have a trade-off between exactness and generality.

Definition: An operator g ∈ G is called a generalization operator if for any K, K′ ∈ K such that g(K) = K′, we have that K ⪯ K′.

Observe that both operators p1, p2 in our example are generalization operators.
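A tiny Python sketch of checking this, using only the granularities of the example states (the names are illustrative):

gr = {"K0": 1, "K1": 5, "K2": 7}      # granularities of the example states

def precedes(K, K_prime):
    """K ⪯ K' iff gr_K <= gr_K'."""
    return gr[K] <= gr[K_prime]

def is_generalization_operator(p):
    """p is a generalization operator iff K ⪯ p(K) for every K in its domain."""
    return all(precedes(K, K_out) for K, K_out in p.items())

p1, p2 = {"K0": "K1"}, {"K0": "K2"}
print(is_generalization_operator(p1), is_generalization_operator(p2))   # True True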

63

Page 64: Descriptive Granularity - Building Foundations of Data Mining

Data Mining Operators G

In the data mining process, preprocessing and data mining proper are disjoint, inclusive/exclusive categories.

Preprocessing is an integral and very important stage of the data mining process and needs as careful an analysis as the data mining proper.

Our framework allows us to distinguish two disjoint classes of operators: the preprocessing operators Gprep and the data mining proper operators Gdm, and we put

G = Gprep ∪ Gdm.

64

Page 65: Descriptive Granularity - Building Foundations of Data Mining

We also provide detailed formal definitions of these two classes, their motivation, and a discussion.

Data Mining and preprocessing operators define different kinds of generalizations.

The model presented in our examples didn't include the preprocessing stage; it used the data mining proper operators only.

65

Page 66: Descriptive Granularity - Building Foundations of Data Mining

The main idea behind the concept of the operator is to capture not only the fact that data mining techniques generalize the data, but also to categorize existing methods.

We define within our model three classes of data mining operators: classification Gclass, clustering Gclust, and association Gassoc.

We don't include in our analysis purely statistical methods like regression, etc.

66

Page 67: Descriptive Granularity - Building Foundations of Data Mining

We prove the following theorem.

Theorem: Let Gclass, Gclust and Gassoc be the sets of all classification, clustering, and association operators, respectively. The following conditions hold.

(1) Gclass ≠ Gclust ≠ Gassoc,

(2) Gassoc ∩ Gclass = ∅,

(3) Gassoc ∩ Gclust = ∅.

67

Page 68: Descriptive Granularity - Building Foundations of Data Mining

Data Mining Process

Definition: Any sequence

K1, K2, ..., Kn (n ≥ 1)

of data mining states is called a data preprocessing process if there is a preprocessing operator G ∈ Gprep such that

G(Ki) = Ki+1, i = 1, 2, ..., n − 1.

Definition: Any sequence

K1, K2, ..., Kn (n ≥ 1)

of data mining states is called a data mining proper process if there is a data mining proper operator G ∈ Gdm such that

G(Ki) = Ki+1, i = 1, 2, ..., n − 1.

68

Page 69: Descriptive Granularity - Building Foundations of Data Mining

The data mining process consists of a preprocessing process (which might be empty) and a data mining proper process.

We know that the sets Gprep and Gdm are disjoint. This justifies the following definition.

Definition: A data mining process is any sequence

K1, K2, ..., Kn (n ≥ 1)

of data mining states such that

K1, ..., Ki (0 ≤ i ≤ n)

is a preprocessing process and

Ki+1, ..., Kn

is a data mining proper process.
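A small Python sketch of these definitions, with operators again taken to be partial functions on states represented as dictionaries (an illustrative reading, since the slides leave the step linking the two sub-processes implicit):

def is_process(states, operators):
    """A sequence of at most one state is trivially a process; otherwise one single
    operator from `operators` must link every consecutive pair of states."""
    if len(states) <= 1:
        return True
    return any(all(op.get(a) == b for a, b in zip(states, states[1:]))
               for op in operators)

def is_data_mining_process(states, G_prep, G_dm):
    """Some (possibly empty) prefix is a preprocessing process and the rest is a
    data mining proper process."""
    return any(is_process(states[:i], G_prep) and is_process(states[i:], G_dm)
               for i in range(len(states) + 1))

# Running example: empty preprocessing, one data mining proper step p1(K0) = K1.
print(is_data_mining_process(["K0", "K1"], G_prep=[], G_dm=[{"K0": "K1"}]))   # True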

69

Page 70: Descriptive Granularity - Building Foundations of Data Mining

Granular Model
Syntax-Semantic Duality of Data Mining

A Granular Model is a system

GM = ( SM, DM, |= ), where:

• SM is a Semantic Model;

• DM is a Descriptive Model;

• |= ⊆ P(U) × E is called a satisfaction relation, where U is the universe of SM and E is the set of descriptions defined by the DM.

Satisfaction |= establishes the relationship between the semantic model and the descriptive model.

70

Page 71: Descriptive Granularity - Building Foundations of Data Mining

Descriptive Model

For any Semantic Model SM = (P(U), K, G) we associate with it its descriptive counterpart, defined below.

A Descriptive Model is a system

DM = ( L, E, DK ),

where:

L = ( A, E ) is called a descriptive language;

A is a countably infinite set called the alphabet;

E ≠ ∅ and E ⊆ A∗ is the set of descriptive expressions of L;

71

Page 72: Descriptive Granularity - Building Foundations of Data Mining

DK ≠ ∅ and DK ⊆ P(E) is a set of descriptions of knowledge states.

As in the case of the semantic model, we build the descriptive model for a given application.

We define here only a general form of the model.

We assume, however, that whatever the application is, the descriptions are always built in terms of attributes and values of the attributes, some logical connectives, some predicates and some extra parameters, if needed.

The commonly used descriptions have the form (a = v), denoting that the attribute a has the value v, but one might also use, as is often done, a predicate form a(v) or a(x, v) instead.

72

Page 73: Descriptive Granularity - Building Foundations of Data Mining

For example, a neural network with its nodes and weights can be seen as a formal description (in an appropriate descriptive language), and the knowledge states would represent changes in parameters during the neural network training process.

The model we build here is a model for what we call descriptive data mining, i.e. data mining for which the goal of the data mining process is to produce a set of descriptions in a language easily comprehensible to the user.

For that purpose, in the model we identify the decision tree constructed by the classification by Decision Tree algorithm with the set of discriminant rules obtained from the tree.

73

Page 74: Descriptive Granularity - Building Foundations of Data Mining

A Granular Model is a system

GM = ( SM, DM, |= ), where:

• SM is a Semantic Model;

• DM is a Descriptive Model;

• |= ⊆ P(U) × E is called a satisfaction relation, where U is the universe of SM and E is the set of descriptions defined by the DM.

Satisfaction |= establishes the relationship between the semantic model and the descriptive model.

We define the Satisfaction |= component of the Granular Model GM in the following stages.

Stage 1: For each K ∈ K, we define its own descriptive language LK = ( AK, EK ).

74

Page 75: Descriptive Granularity - Building Foundations of Data Mining

Stage 2: For each K ∈ K and each descriptive expression D ∈ EK, we define what it means that D is satisfied in K; i.e. we define a satisfaction relation |=K.

Stage 3: For each K ∈ K and each descriptive expression D ∈ EK, we define what it means that D is true in K, i.e. |=K D.

Page 76: Descriptive Granularity - Building Foundations of Data Mining

Stage 4: We use the satisfaction relation |=K to define, for each K ∈ K, the set DK ⊆ P(EK) of descriptions of its own knowledge.

Stage 5: We use the languages LK to define the descriptive language L.

Stage 6: We use the descriptive expressions EK of LK to define the set E of descriptive expressions of L.

Stage 7: We use the satisfaction relations |=K to define the satisfaction relation |= of the Granular Model GM.

75

Page 77: Descriptive Granularity - Building Foundations of Data Mining

Part 3: TRACING THE HISTORY

Mathematics Genealogy Project
genealogy.math.ndsu.nodak.edu

76

Page 78: Descriptive Granularity - Building Foundations of Data Mining

We all have a history

We are all mathematicians

The Mission Statement of the Mathematics Genealogy Project defines a mathematician as follows:

"... Throughout this project when we use the word "mathematics" or "mathematician" we mean that word in a very inclusive sense. Thus, all relevant data from statistics, computer science, or operations research is welcome..."

The Computer Science classification within the project is: Mathematics Subject Classification: 68 Computer Science.

77

Page 79: Descriptive Granularity - Building Foundations of Data Mining

The Genealogy Project solicits information from all schools who participate in the development of research-level mathematics and from all individuals who may know the desired information. That means Computer Science as well.

For them, and for history, we are all mathematicians.

78

Page 80: Descriptive Granularity - Building Foundations of Data Mining

Below are some links (sequences of connected people) for a computer scientist.

Any two people in the sequence are listed in the order: PhD student, Adviser.

If a person has more than one adviser, each adviser is preceded by a number; i.e.

adviser 1 is listed as 1. adviser Name,

adviser 2 is listed as 2. adviser Name, etc.

79

Page 81: Descriptive Granularity - Building Foundations of Data Mining

A mathematician would say:

For any element A of the sequence, if A has more than one adviser, say n advisers, then for any 1 ≤ k ≤ n, adviser k is listed as k. Name of adviser k; the number in front of the name is omitted otherwise.

80

Page 82: Descriptive Granularity - Building Foundations of Data Mining

Link to Nicolaus Copernicus

(Mikolaj Kopernik)

He has 1598 descendants

Anita Wasilewska, Ph.D. Warsaw University, 1975, Poland, Helena Rasiowa, Ph.D. Warsaw University, 1950, Andrzej Mostowski, Ph.D. Warsaw University, 1938, 2. Alfred Tarski, Ph.D. Warsaw University, 1924, Stanislaw Lesniewski, Ph.D. University of Lvov, 1912, Kazimierz Twardowski, Ph.D. Universitat Wien, 1891, Franz Clemens Brentano, Ph.D. Eberhard Karls Universitat, Tubingen, 1862, 2. Friedrich Adolf Trendelenburg, Dr. phil. Universitat Leipzig, 1826, 1. Georg Ludwig Konig, Artium Liberalium Magister, Georg August Universitat, Gottingen, 1790, Christian Heyne, Magister Juris, Universitat Leipzig, 1752,

81

Page 83: Descriptive Granularity - Building Foundations of Data Mining

1. Johann August Bach, Magister philosophiae, Universitat Leipzig, 1744, 1. Christian Kustner, Magister philosophiae, Universitat Leipzig, 1742, Johann Ernesti, Magister philosophiae, Universitat Leipzig, 1730, Johann Gesner, Magister artium, Friedrich Schiller Universitat Jena, 1715, Johann Buddeus, Magister artium, Martin Luther Universitat, Halle Wittenberg, 1687, Michael Walther, Jr., Magister artium, Theol. Dr., Martin Luther Universitat, Halle Wittenberg, 1661, 1687, 2. Johann Quenstedt, Magister artium, Theol. Dr., Universitat Helmstedt, Martin Luther Universitat, Halle Wittenberg, 1643, 1644, Christoph Notnagel, Magister artium, Martin Luther Universitat, Halle Wittenberg, 1630, Ambrosius Rhodius, Magister artium, Medicinae Dr., Martin Luther Universitat, Halle Wittenberg, 1600, 1610,

82

Page 84: Descriptive Granularity - Building Foundations of Data Mining

1. Melchior Jostel, Magister artium, Medicinae Dr., Martin Luther Universitat, Halle Wittenberg, 1583, 1600, 1. Valentin Otto, Magister artium, Martin Luther Universitat, Halle Wittenberg, 1570, Georg Joachim Rheticus, Magister artium, Martin Luther Universitat, Halle Wittenberg, 1535,

2. Nicolaus Copernicus, Juris utriusque Doctor, Uniwersytet Jagiellonski (Cracow Jagellonian University), Universita di Bologna, Universita degli Studi di Ferrara, Universita di Padova, 1499, Poland-Italy,

2. Domenico Novara da Ferrara, Universita di Firenze, 1483, 1. Johannes Regiomontanus, Magister artium, Universitat Leipzig, Universitat Wien, 1457,

83

Page 85: Descriptive Granularity - Building Foundations of Data Mining

Georg von Peuerbach, Magister artium, Universitat Wien, 1440, Johannes von Gmunden, Magister artium, Universitat Wien, 1406, Heinrich von Langenstein, Magister artium, Theol. Dr., Universite de Paris, 1363, 1375, unknown.

Georg von Peuerbach, 1375, is my "oldest" ancestor.

THERE ARE 3 more lines of ancestry; also interesting, if not so illustrious. Here they are.

84

Page 86: Descriptive Granularity - Building Foundations of Data Mining

Link to Gottfried Leibniz

(54209 descendants),

Immanuel Kant

( 2176 descendants), and

Desiderius Erasmus of Rotterdam

(57416 descendants)

Anita Wasilewska, Ph.D. Warsaw University, 1975, Poland, Helena Rasiowa, Ph.D. Warsaw University, 1950, Andrzej Mostowski, Ph.D. Warsaw University, 1938, 2. Alfred Tarski, Ph.D. Warsaw University, 1924, Stanislaw Lesniewski, Ph.D. University of Lvov, 1912, Kazimierz Twardowski, Ph.D. Universitat Wien, 1891, Franz Clemens Brentano, Ph.D. Eberhard Karls Universitat, Tubingen, 1862, 2. Friedrich Adolf Trendelenburg, Dr. Phil. Universitat Leipzig, 1826, 2. Karl Reinhold, PhD.,

85

Page 87: Descriptive Granularity - Building Foundations of Data Mining

Immanuel Kant, Ph.D. Universitat Konigsberg, 1770, Martin Knutzen, Dr. Phil. Universitat Konigsberg, 1732, Christian von Wolff, Dr. phil., Universitat Leipzig, 1700,

2. Gottfried Leibniz, Dr. jur. Universitat Altdorf, 1666,

2. Christiaan Huygens, Artium Liberalium Magister, Juris utriusque Doctor, Universiteit Leiden, Universite d'Angers, 1647, 1655, Frans van Schooten, Jr., Artium Liberalium Magister, Universiteit Leiden, 1635, Jacobus Golius, Artium Liberalium Magister, Philosophiae Doctor, Universiteit Leiden, 1612, 1621, 1. Willebrord (Snel van Royen) Snellius, Artium Liberalium Magister, Universiteit Leiden, 1607, 2. Rudolph

86

Page 88: Descriptive Granularity - Building Foundations of Data Mining

(Snel van Royen) Snellius, Artium Liberalium Magister, Universitat zu Koln, Ruprecht Karls Universitat Heidelberg, 1572, 1. Valentine Naibod, Magister Artium, Martin Luther Universitat, Halle Wittenberg, Universitat Erfurt, Erasmus Reinhold, Magister Artium, Martin Luther Universitat, Halle Wittenberg, 1535, Jakob Milich, Liberalium Artium Magister, Med. Dr., Albert Ludwigs Universitat Freiburg, Breisgau, Universitat Wien, 1520, 1524,

Desiderius Erasmus Roterodamus (sometimes known as Desiderius Erasmus of Rotterdam), University of Paris, Theologiae Baccalaureus, College de Montaigu, 1497,

Jan Standonck, Magister Artium, Theol. Dr., College Sainte-Barbe, College de Montaigu, 1474, 1490, unknown

Page 89: Descriptive Granularity - Building Foundations of Data Mining

Link to Pierre-Simon Laplace

(50295 descendants) and Jean Le Rond d'Alembert

Anita Wasilewska, Ph.D. Warsaw University, 1975, Poland, Helena Rasiowa, Ph.D. Warsaw University, 1950, Andrzej Mostowski, Ph.D. Warsaw University, 1938, 1. Kazimierz Kuratowski, Ph.D. Warsaw University, 1921, 1. Stefan Mazurkiewicz, Ph.D. University of Lvov, 1913, Waclaw Sierpinski, Ph.D. Uniwersytet Jagiellonski, 1906, 1. Stanislaw Zaremba, Ph.D. Universite Paris IV-Sorbonne, 1889, Gaston Darboux, Ph.D. Ecole Normale Superieure, Paris, 1866, Michel Chasles, Ph.D. Ecole Polytechnique, 1814, Simeon Poisson, Ph.D. Ecole Polytechnique, 1800, 2. Pierre-Simon Laplace, Ph.D., Jean Le Rond d'Alembert, unknown

87

Page 90: Descriptive Granularity - Building Foundations of Data Mining

Link to Emile Borel

(2506 descendants),

Leonhard Euler

(52555 descendants)

Anita Wasilewska, Ph.D. Warsaw University, 1975, Poland, Helena Rasiowa, Ph.D. Warsaw University, 1950, Andrzej Mostowski, Ph.D. Warsaw University, 1938, 2. Zygmunt Janiszewski, Ph.D. Ecole Normale Superieure Paris, 1911, Henri Lebesgue, Ph.D. Universite Henri Poincare Nancy 1, 1902, Emile Borel, Ph.D. Ecole Normale Superieure, Paris, 1893, Gaston Darboux, Ph.D. Ecole Normale Superieure, Paris, 1866, Michel Chasles, Ph.D., Ecole Polytechnique, 1814, Simeon Poisson, Ph.D. Ecole Polytechnique, 1800,

88

Page 91: Descriptive Granularity - Building Foundations of Data Mining

1. Joseph Lagrange, no degree, student of Leonhard Euler, Ph.D. Universitat Basel, 1726, Dr. med. Universitat Basel, 1694, Dr. hab. Sci. Universitat Basel, 1684, Gottfried Leibniz, Dr. jur. Universitat Altdorf, 1666, 1. Johann Bernoulli, Dr. med. Universitat Basel, 1694, Jacob Bernoulli, Dr. hab. Sci. Universitat Basel, 1684, Gottfried Wilhelm Leibniz, Dr. jur. Universitat Altdorf, 1666, 1. Erhard Weigel, Ph.D. Universitat Leipzig, 1650, unknown.

89

Page 92: Descriptive Granularity - Building Foundations of Data Mining

Link to Andrei Markov

(4824 descendants), and

Pafnuty Chebyshev (5964 descendants)

Anita Wasilewska, Ph.D. Warsaw University, 1975, Poland, Helena Rasiowa, Ph.D. Warsaw University, 1950, Andrzej Mostowski, Ph.D. Warsaw University, 1938, 1. Kazimierz Kuratowski, Ph.D. Warsaw University, 1921, 1. Stefan Mazurkiewicz, Ph.D. University of Lvov, 1913, Waclaw Sierpinski, Ph.D. Uniwersytet Jagiellonski, 1906, 2. Georgy Fedoseevich Voronoy, Ph.D. University of St. Petersburg, 1896, Andrei Markov, Ph.D. University of St. Petersburg, 1884, Pafnuty Chebyshev, Ph.D. University of St. Petersburg, 1849, Nikolai Dmitrievich Brashman, Ph.D. Moscow State University, 1834, Joseph Johann von Littrow, Ph.D., unknown

90

Page 93: Descriptive Granularity - Building Foundations of Data Mining

MY PhD COUSINS include

Kurt Godel

Alan Turing

Alonzo Church

Roman Sikorski

Zdzislaw Pawlak

and many others.... I am sure some of them are in this room!

91

Page 94: Descriptive Granularity - Building Foundations of Data Mining

In the Stony Brook CS Department I traced 10 of them.

WE ALL ARE A BIG SCIENTIFIC FAMILY!

92