Semantic data mining: an ontology based approach

Semantic Data Mining: an Ontology Based Approach. Agnieszka Lawrynowicz, Institute of Computing Science, Poznan University of Technology. April 12, 2016, Seminar of the Institute of Computing Science, Poznan University of Technology.

Transcript of Semantic data mining: an ontology based approach

Page 1: Semantic data mining: an ontology based approach

Semantic Data Mining: an Ontology Based Approach

Agnieszka Lawrynowicz

Institute of Computing Science, Poznan University of Technology

April 12, 2016

Seminar of the Institute of Computing Science, Poznan University of Technology

Agnieszka Lawrynowicz Semantic Data Mining: an Ontology Based Approach 1

Page 2: Semantic data mining: an ontology based approach

Outline

Introduction to semantic data mining

Ontology in computer science

Semantic meta-mining
▸ Use Case: e-LICO Intelligent Discovery Assistant
▸ Background knowledge: Data Mining OPtimization Ontology
▸ DM method: Pattern discovery with Fr-ONT-Qu
▸ Sharing: Standardization of data mining and machine learning schemas

Summary



Page 4: Semantic data mining: an ontology based approach

Introduction: data mining

Input: a data table, text documents, ...
Output: a model, a pattern set

[Figure: data → DATA MINING → model, patterns]


Page 5: Semantic data mining: an ontology based approach

Introduction: using background knowledge in data mining

Using background knowledge in data mining has been extensively researched

hierarchy/taxonomy of attributes (Michalski et al., 1986; Srikant and Agrawal, 1995)

Inductive Logic Programming (Muggleton, 1991; Lavrac and Dzeroski, 1994)

relational learning (Quinlan, 1993; de Raedt, 2008)

semantic data mining tutorial @ ECML/PKDD'2011 (Lavrac, Vavpetic, Lawrynowicz, Potoniec, Hilario, Kalousis)


Page 6: Semantic data mining: an ontology based approach

Introduction: relational data mining

Input: a relational database, a graph, a set of logical facts, ...
Output: a model, a pattern set

[Figure: RELATIONAL DATA MINING → model, patterns]


Page 7: Semantic data mining: an ontology based approach

Semantic data mining

Input:

a data table, text documents, Web pages, a relational database, a graph, a set of logical facts, ...

one or more ontologies

Output: a model, a pattern set

[Figure: data and ontologies → SEMANTIC DATA MINING → model, patterns; labels between data and ontologies: annotations, mappings, vocabulary re-use]



Page 9: Semantic data mining: an ontology based approach

Ontology in computer science

“engineering artefact [...]” (Guarino 98)

“An ontology is a
formal specification ⇒ machine interpretation
of a shared ⇒ group of people, consensus
conceptualization ⇒ abstract model of phenomena, concepts
of a domain of interest” ⇒ domain knowledge
(Gruber 93, Studer 98)

Ontology = formal specification of terminological knowledge (most often from a particular domain)


Page 10: Semantic data mining: an ontology based approach

Semantic Web layer cake

[Figure: the Semantic Web language stack; labels: ontology modelling languages, data]


Page 11: Semantic data mining: an ontology based approach

Ontologies + data = knowledge graph

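[Figure: a knowledge graph drawn in three layers.
RDF: the fact reviewer1 metaReviews paper10.
RDFS: the classes PeerReviewedPaper, MetaReviewer and Reviewer and the properties metaReviews and reviews, connected by rdf:type, rdfs:domain, rdfs:range, rdfs:subPropertyOf and rdfs:subClassOf edges.
OWL: an owl:Restriction with owl:onProperty reviewedBy and owl:someValuesFrom, linked by rdfs:subClassOf, plus an owl:inverseOf edge.]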
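The RDF and RDFS layers of such a graph can be mimicked with plain triples. The following is a minimal Python sketch, not a real RDF store: the `rdfs_closure` helper is hypothetical, and the rdfs:domain axiom is an illustrative assumption rather than something read off the slide.

```python
# Knowledge graph as a set of (subject, predicate, object) triples.
# The schema triples are illustrative assumptions.
triples = {
    ("reviewer1", "metaReviews", "paper10"),            # RDF fact
    ("metaReviews", "rdfs:subPropertyOf", "reviews"),   # RDFS schema
    ("metaReviews", "rdfs:domain", "MetaReviewer"),
    ("MetaReviewer", "rdfs:subClassOf", "Reviewer"),
}

def rdfs_closure(graph):
    """Apply three RDFS entailment rules until nothing new is derived."""
    graph = set(graph)
    while True:
        new = set()
        for s, p, o in graph:
            for s2, p2, o2 in graph:
                if p2 == "rdfs:subPropertyOf" and s2 == p:
                    new.add((s, o2, o))            # inherit the super-property
                if p2 == "rdfs:domain" and s2 == p:
                    new.add((s, "rdf:type", o2))   # type the subject
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdf:type", o2))   # inherit the super-class
        if new <= graph:
            return graph
        graph |= new

closed = rdfs_closure(triples)
```

Under these assumptions the closure contains the derived facts that reviewer1 reviews paper10 and that reviewer1 is a Reviewer.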


Page 12: Semantic data mining: an ontology based approach

Logical meaning of OWL

Description Logics (DLs) = a family of first-order-logic-based formalisms suitable for representing knowledge, especially terminologies and ontologies, underpinning the Web Ontology Language (OWL).

Basic building blocks: concepts, roles, constructors, individuals

Example

TBox

Atomic concepts: Reviewer, Paper
Roles: reviews, metaReviews, reviewedBy
Constructors: ⊓, ∃
Axiom (concept definition): PeerReviewedPaper ≡ Paper ⊓ ∃reviewedBy.Reviewer
Axiom (concept description, "each meta-reviewer is a reviewer"): MetaReviewer ⊑ Reviewer

ABox

Fact assertion: metaReviews(reviewer1, paper10)
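To make the constructors concrete, the sketch below evaluates the definition of PeerReviewedPaper in one fixed, finite interpretation. It is only an illustration: the individuals and extensions are made up, and a real OWL reasoner works under the open-world assumption rather than over a single closed model.

```python
# A finite interpretation: concept extensions and role extensions (made up).
concepts = {
    "Paper": {"paper10", "paper11"},
    "Reviewer": {"reviewer1"},
}
roles = {"reviewedBy": {("paper10", "reviewer1")}}

def exists(role, filler):
    """Extension of ∃role.filler: individuals with a role successor in filler."""
    return {x for (x, y) in roles[role] if y in filler}

def conj(c, d):
    """Extension of c ⊓ d."""
    return c & d

# PeerReviewedPaper ≡ Paper ⊓ ∃reviewedBy.Reviewer
peer_reviewed = conj(concepts["Paper"], exists("reviewedBy", concepts["Reviewer"]))
```

In this interpretation only paper10 falls under PeerReviewedPaper, because paper11 has no reviewedBy successor.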



Page 14: Semantic data mining: an ontology based approach

Overview of meta-learning

Meta-learning: learning to learn

application of machine learning techniques to meta-data about past machine learning experiments;

the goal: to modify some aspect of the learning process to improve the performance of the resulting model;

meta-mining: meta-learning applied to the full data mining process
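As a toy illustration of learning from meta-data about past experiments, the sketch below recommends an algorithm for a new dataset with a 1-nearest-neighbour rule over dataset characteristics. All numbers and algorithm names are made up, and this is not the e-LICO meta-miner.

```python
import math

# Hypothetical meta-data: (n_instances, n_features) of past datasets and the
# algorithm that performed best on each. All values are made up.
past_experiments = [
    ((150, 4), "decision_tree"),
    ((100, 5000), "linear_svm"),
    ((100000, 20), "naive_bayes"),
]

def recommend(characteristics):
    """1-nearest-neighbour meta-learner on log-scaled dataset characteristics."""
    def dist(a, b):
        return math.dist([math.log10(v) for v in a], [math.log10(v) for v in b])
    return min(past_experiments, key=lambda e: dist(e[0], characteristics))[1]

suggestion = recommend((80, 3000))
```

Meta-mining would replace the single algorithm label with a description of a whole workflow.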


Page 15: Semantic data mining: an ontology based approach

Overview of the e-LICO system (EU FP7 2009-2012)

[Slide shows a page excerpt from the e-LICO Final Report (FP7-ICT 231519); the section headings are only partly legible: "Overview of the e-LICO ..."]

The scenario outlined in Section 2 yields a set of requirements on the underlying data mining platform. This section presents the different components of the e-LICO architecture (Figure 1) and shows how they interact to achieve the user's knowledge discovery goal.

The e-LICO infrastructure (depicted in the figure under the dashed line) is the means by which the data-mining platform is delivered to scientists. The innovative core of the e-LICO platform is the Intelligent Discovery Assistant (IDA, above the dashed line) with its planner and meta-learner. However, to deliver the data-mining platform to its scientist users, there are several other services and components. Figure 1 shows an overview of e-LICO's components and how they interact with each other.

Figure 1. Overview of the e-LICO system.

There are two user-facing components for the e-LICO platform; these allow scientists to access data-mining operators and/or other data processing services, to compose them into workflows and execute them, collecting the results for interpretation or further analysis. These two central infrastructure components are:

1. RapidMiner: An application that gives access to a wide variety of data-mining operators, together with the means to compose them into workflows.

2. Taverna: A workflow creation and enactment workbench that gives access to arbitrary Web services and many other kinds of services. It is widely used in bioinformatics, but also in many other disciplines.


Page 16: Semantic data mining: an ontology based approach

Background knowledge: DM OPtimization Ontology


Page 17: Semantic data mining: an ontology based approach

Data Mining OPtimization Ontology (DMOP)

the primary goal of DMOP is to support all decision-making steps that determine the outcome of the data mining process;

development started in EU FP7 project e-LICO (2009-2012);

DMOP v5.5: 723 classes, 111 properties, 4291 axioms;

highly axiomatized;

represented in Web Ontology Language (OWL 2);


Page 18: Semantic data mining: an ontology based approach

Competency questions

“Given a data mining task/data set, which of the valid or applicable workflows/algorithms will yield optimal results (or at least better results than the others)?”

“Given a set of candidate workflows/algorithms for a given task/data set, which data set/workflow/algorithm characteristics should be taken into account in order to select the most appropriate one?”

and others more fine-grained, e.g.:

“Which induction algorithms should I use (or avoid) when my dataset has many more variables than instances?”


Page 19: Semantic data mining: an ontology based approach

Architecture of DMOP knowledge base and its satellite triple stores

[Figure: three layers of the DMOP knowledge base.
TBox (OWL 2): DMOP. Formal conceptual framework of the data mining domain.
ABox (OWL 2): Operator DB. Accepted knowledge of DM tasks, algorithms, operators.
Triple store (RDF): DMEX-DB1, DMEX-DB2, …, DMEX-DBk. Specific DM applications: datasets, workflows, results.
Side labels: meta-miner's prior DM knowledge; meta-miner's training data.]


Page 20: Semantic data mining: an ontology based approach

The core concepts of DMOP (simplified)

Fig. 1. The core concepts of DMOP.

[...] more than specify their input/output types; only processes called DM-Operations have actual inputs and outputs. A process that executes a DM-Operator also realizes the DM-Algorithm implemented by the operator and achieves the DM-Task addressed by the algorithm. Finally, a DM-Workflow is a complex structure composed of DM operators, a DM-Experiment is a complex process composed of operations (or operator executions). An experiment is described by all the objects that participate in the process: a workflow, data sets used and produced by the different data processing phases, the resulting models, and meta-data quantifying their performance. In the following, the basic elements of DMOP are detailed.

DM Tasks: The top-level DM tasks are defined by their inputs and outputs. A DataProcessingTask receives and outputs data. Its three subclasses produce new data by cleansing (DataCleaningTask), reducing (DataReductionTask), or otherwise transforming the input data (DataTransformationTask). These classes are further articulated in subclasses representing more fine-grained tasks for each category. An InductionTask consumes data and produces hypotheses. It can be either a ModelingTask or a PatternDiscoveryTask, based on whether it generates hypotheses in the form of global models or local pattern sets. Modeling tasks can be predictive (e.g., classification) or descriptive (e.g., clustering), while pattern discovery tasks are further subdivided into classes based on the nature of the extracted patterns: associations, dissociations, deviations, or subgroups. A HypothesisProcessingTask consumes hypotheses and transforms (e.g., rewrites or prunes) them to produce enhanced (less complex or more readable) versions of the input hypotheses.

Data: As the primary resource that feeds the knowledge discovery process, data have been a natural research focus for data miners. Over the past decades meta-learning researchers have actively investigated data characteristics that might explain generalization success or failure. Fig. 2 shows the characteristics associated with the different Data subclasses (shaded boxes). Most of these are statistical measures, such as the number of [...]


Page 21: Semantic data mining: an ontology based approach

DMOP: algorithm representation


Page 22: Semantic data mining: an ontology based approach

Alignment of DMOP with DOLCE 1/3

Two main reasons to align DMOP with a foundational ontology:

considerations about attributes and data properties; extant non-foundational ontology solutions were partial re-inventions of how they are treated in a foundational ontology;

reuse of the ontology’s object properties;


Page 23: Semantic data mining: an ontology based approach

Alignment of DMOP with DOLCE 2/3


Page 24: Semantic data mining: an ontology based approach

Alignment of DMOP with DOLCE 3/3

Perdurant: DM-Experiment and DM-Operation are subclasses of dolce:process;

Endurant: most DM classes, such as algorithm, software, strategy, task, and optimization problem, are subclasses of dolce:non-physical-endurant;

Quality: characteristics and parameters of DM entities are made subclasses of dolce:abstract-quality;

Abstract: for identifying discrete values, classes are added as subclasses of dolce:abstract-region;

object properties: DMOP reuses mainly DOLCE's parthood, quality, and quale relations;

each of the four main DOLCE branches has been used.


Page 25: Semantic data mining: an ontology based approach

Qualities and attributes 1/3

How to handle 'attributes' in OWL ontologies and, in a broader context, measurements?

easy way: an attribute is a binary functional relation between a class and a datatype

Elephant ⊑ =1 hasWeight.integer
Elephant ⊑ =1 hasWeightPrecise.real
Elephant ⊑ =1 hasWeightImperial.integer (in lbs)

drawback: this builds application decisions about how to store the data (and in which unit) into one's ontology


Page 26: Semantic data mining: an ontology based approach

Qualities and attributes 2/3

How to handle 'attributes' in OWL ontologies and, in a broader context, measurements?

more elaborate way: unfold the notion of an object's property (e.g. weight) from one attribute/OWL data property into at least two properties:

▸ one OWL object property from the object to the 'reified attribute' ("quality property" represented as an OWL class)

▸ and another property to the value(s)

favoured in foundational ontologies;

solves the problem of non-reusability of the 'attribute' and prevents duplication of data properties;

in DMOP, measurements are more like values for parameters;


Page 27: Semantic data mining: an ontology based approach

Qualities and attributes 3/3

ModelingAlgorithm ⊑ =1 dolce:has-quality.LearningPolicy

LearningPolicy ⊑ =1 dolce:has-quale.Eager-Lazy

Eager-Lazy ⊑ ≤ 1 hasDataValue.anyType

LearningPolicy is a subclass of dolce:quality

Eager-Lazy is a subclass of dolce:abstract-region

In this way, the ontology can be linked to many different applications, which may even use different data types, yet still agree on the meaning of the characteristics and parameters ('attributes') of the algorithms, tasks, and other DM endurants.
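The chain algorithm → quality → quale → value can be pictured with three small record types. This is a sketch of the modelling pattern only: the Python class names mirror the axioms above, while the concrete values ("lazy", True) and the choice of kNN and C4.5 as examples are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quale:                 # region in a quality space, e.g. Eager-Lazy
    space: str
    value: object            # at most one data value, of any datatype

@dataclass(frozen=True)
class Quality:               # reified attribute, e.g. LearningPolicy
    kind: str
    quale: Quale             # exactly one dolce:has-quale link

@dataclass(frozen=True)
class ModelingAlgorithm:
    name: str
    learning_policy: Quality  # exactly one dolce:has-quality link

# Two hypothetical applications encode the value differently (string vs. bool),
# yet both refer to the same quality kind and the same quality space.
knn = ModelingAlgorithm("kNN", Quality("LearningPolicy", Quale("Eager-Lazy", "lazy")))
c45 = ModelingAlgorithm("C4.5", Quality("LearningPolicy", Quale("Eager-Lazy", True)))
```

The data value sits two hops away from the algorithm, so changing its datatype does not touch the shared meaning of LearningPolicy.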


Page 28: Semantic data mining: an ontology based approach

Meta-modeling in DMOP 1/4

only processes (executions of workflows) and operations (executions of operators) consume inputs and produce outputs

DM algorithms (as well as operators and workflows) can only specify the type of input or output

inputs and outputs (DM-Dataset and DM-Hypothesis class hierarchy, respectively) are modeled as subclasses of IO-Object class


Page 29: Semantic data mining: an ontology based approach

Meta-modeling in DMOP 2/4

DM algorithms: classes or individuals? Individuals.

Problem: expressing the types of inputs/outputs associated with an algorithm

"C4.5 specifiesInputClass CategoricalLabeledDataSet" ✗

C4.5: Individual (instance of DM-Algorithm)
CategoricalLabeledDataSet: Class (subclass of DM-Hypothesis)


Page 30: Semantic data mining: an ontology based approach

Meta-modeling in DMOP 3/4

Initial solution: one artificial class for each algorithm, with a single instance corresponding to this particular algorithm

Problem: hasInput, hasOutput, specifiesInputClass and specifiesOutputClass were all assigned a common range, IO-Object

"C4.5 specifiesInputClass Iris" ?

C4.5: Individual (instance of DM-Algorithm)
Iris: Individual (instance of DM-Hypothesis)

Iris is a concrete dataset. Clearly, no DM algorithm is designed to handle only one particular dataset.


Page 31: Semantic data mining: an ontology based approach

Meta-modeling in DMOP 4/4

Final solution: weak form of punning available in OWL 2

IO-Class: meta-class, the class of all classes of input and output objects

"C4.5 specifiesInputClass CategoricalLabeledDataSet" ✓

C4.5: Individual (instance of DM-Algorithm)
CategoricalLabeledDataSet: Individual (instance of IO-Class)

"DM-Process hasInput some CategoricalLabeledDataSet" ✓

DM-Process: Class (subclass of dolce:process)
CategoricalLabeledDataSet: Class (subclass of IO-Object)
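The effect of punning, one name used both as an individual and as a class, can be imitated with two small assertion sets. This is a sketch under assumptions: `accepts` is a hypothetical helper, and Iris being an instance of CategoricalLabeledDataSet is assumed for illustration. In OWL 2 itself the class view and the individual view of a punned name are semantically independent, so the bridging done in `accepts` is application logic, not reasoning.

```python
# (instance, class) pairs. CategoricalLabeledDataSet appears once as a CLASS
# (second position) and once as an INDIVIDUAL (first position).
class_assertions = {
    ("Iris", "CategoricalLabeledDataSet"),       # assumed for illustration
    ("C4.5", "DM-Algorithm"),
    ("CategoricalLabeledDataSet", "IO-Class"),
}
# (subject, property, object) triples between individuals.
property_assertions = {
    ("C4.5", "specifiesInputClass", "CategoricalLabeledDataSet"),
}

def accepts(algorithm, dataset):
    """True if dataset is an instance of a class named as the algorithm's input class."""
    input_classes = {o for (s, p, o) in property_assertions
                     if s == algorithm and p == "specifiesInputClass"}
    return any((dataset, c) in class_assertions for c in input_classes)
```

The algorithm is tied to a kind of dataset rather than to Iris itself, which is what the final solution is after.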


Page 32: Semantic data mining: an ontology based approach

DM method: Fr-ONT-Qu semantic pattern miner


Page 33: Semantic data mining: an ontology based approach

Data mining as search

learning in description logics (DLs) and from other relational data can be seen as search in a space of concepts / RDF triples / clauses / (conjunctive / SPARQL) queries, ...

it is possible to impose an ordering on this search space, e.g., using subsumption as a natural quasi-order and generality relation between DL concepts

▸ if D ⊑ C then C covers all instances that are covered by D

refinement operators may be applied to traverse the space by computing a set of specializations (resp. generalizations) of a concept / RDF triples / clauses / (conjunctive / SPARQL) queries, ...
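A minimal sketch of these two ingredients over a toy taxonomy: a subsumption check that follows parent links, and a downward refinement step that returns the direct specializations of a concept. The taxonomy and the function names are assumptions; real DL subsumption needs a reasoner.

```python
# Hypothetical toy taxonomy: child -> parent (child ⊑ parent).
parent = {
    "Paper": "Thing",
    "PeerReviewedPaper": "Paper",
    "JournalPaper": "Paper",
}

def subsumes(c, d):
    """True if d ⊑ c, i.e. c is reachable from d by following parent links."""
    while d != c and d in parent:
        d = parent[d]
    return d == c

def refine_down(c):
    """Downward refinement: the direct specializations of c."""
    return sorted(child for child, p in parent.items() if p == c)
```

Calling `refine_down("Paper")` yields the two direct subclasses, and every result of a downward step is subsumed by the concept it was derived from.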


Page 34: Semantic data mining: an ontology based approach

Properties of refinement operators

Consider a downward refinement operator ρ and let C ⇝ρ D denote a refinement chain from a DL concept C to D

complete: each point in the lattice is reachable (for D ⊑ C there exists E such that E ≡ D and a refinement chain C ⇝ρ ... ⇝ρ E)

weakly complete: for any concept C with C ⊑ ⊤, a concept E with E ≡ C can be reached from ⊤

finite: ρ(C) is finite for any concept C

redundant: there exist two different refinement chains from C to D

proper: C ⇝ρ D implies C ≢ D

ideal = complete + proper + finite
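Redundancy is easy to observe on a toy operator. In the sketch below a pattern is a set of class restrictions and ρ adds one missing restriction; both the operator and the restriction names are hypothetical, and properness is checked only syntactically (set inequality), not up to logical equivalence.

```python
# Hypothetical toy operator: a pattern is a frozenset of class restrictions
# and rho adds one restriction that is not yet present.
RESTRICTIONS = ("PeerReviewedPaper", "JournalPaper")

def rho(pattern):
    return [pattern | {r} for r in RESTRICTIONS if r not in pattern]

def count_chains(start, goal):
    """Number of distinct refinement chains from start to goal under rho."""
    if start == goal:
        return 1
    return sum(count_chains(p, goal) for p in rho(start))

top = frozenset()
bottom = frozenset(RESTRICTIONS)
is_redundant = count_chains(top, bottom) > 1          # two orders of adding
is_syntactically_proper = all(q != top for q in rho(top))
```

The same pattern is reached by adding the two restrictions in either order, so this operator is redundant; a duplicate-free enumeration needs extra bookkeeping.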


Page 35: Semantic data mining: an ontology based approach

Learning in DLs and in clausal languages is hard

Lehmann & Hitzler (ILP 2007, MLJ 2010) proved for many DLs, and Nienhuys-Cheng & de Wolf (1997) for clausal languages, that no ideal refinement operator exists.


Page 36: Semantic data mining: an ontology based approach

Fr-ONT-Qu

algorithm for mining patterns in RDF(S) data

patterns expressed as SPARQL queries

generality relation: taxonomical subsumption

consists of: a refinement operator ρ and a strategy to select the best patterns for further refinement

Example SPARQL query

head:  SELECT ?x WHERE {
body:    ?x rdf:type :Paper .
         ?x rdf:type :PeerReviewedPaper .
         ?x :reviewedBy ?y
       }


Page 37: Semantic data mining: an ontology based approach

New generality relation: taxonomical subsumption

Taxonomically closed pattern

A pattern Q is taxonomically closed, or t-closed, w.r.t. the background knowledge G if for each triple of the form (?x rdf:type c) in Q, Q also contains the transitive closure of (?x rdf:type c) w.r.t. G, and for each triple of the form (?x p ?y) that appears in the pattern Q, Q also contains the transitive closure of (?x p ?y) w.r.t. G.

Taxonomical subsumption

Given two patterns Q1 and Q2 over a ρdf dataset G, and their t-closures Q1^t and Q2^t respectively, Q1 taxonomically subsumes (t-subsumes) Q2 iff there exists a mapping σ such that the set of triple patterns and FILTER expressions of σ(body(Q1^t)) is a subset of the set of triple patterns and FILTER expressions of body(Q2^t).
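The two definitions can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the Fr-ONT-Qu implementation: the background knowledge `SUPER` is made up, FILTER expressions are ignored, variables are strings starting with "?", and the mapping σ is found by brute force.

```python
from itertools import product

# Hypothetical background knowledge G: direct super-classes / super-properties.
SUPER = {"PeerReviewedPaper": ["Paper"], "metaReviews": ["reviews"]}

def ancestors(name):
    """All transitive super-classes or super-properties of name."""
    out, todo = set(), list(SUPER.get(name, []))
    while todo:
        n = todo.pop()
        if n not in out:
            out.add(n)
            todo.extend(SUPER.get(n, []))
    return out

def t_closure(pattern):
    """Add the super-class / super-property variants of every triple."""
    closed = set(pattern)
    for s, p, o in pattern:
        if p == "rdf:type":
            closed |= {(s, p, c) for c in ancestors(o)}
        else:
            closed |= {(s, q, o) for q in ancestors(p)}
    return closed

def t_subsumes(q1, q2):
    """True if some variable mapping sigma sends t_closure(q1) into t_closure(q2)."""
    c1, c2 = t_closure(q1), t_closure(q2)
    vars1 = sorted({t for s, _, o in c1 for t in (s, o) if t.startswith("?")})
    terms2 = sorted({t for s, _, o in c2 for t in (s, o)})
    for image in product(terms2, repeat=len(vars1)):
        sigma = dict(zip(vars1, image))
        if all((sigma.get(s, s), p, sigma.get(o, o)) in c2 for s, p, o in c1):
            return True
    return False

general = {("?x", "rdf:type", "Paper")}
specific = {("?a", "rdf:type", "PeerReviewedPaper"), ("?a", "reviewedBy", "?b")}
```

Here `general` t-subsumes `specific` because the t-closure of `specific` contains (?a rdf:type Paper), which plain syntactic subset checking would miss.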


Page 38: Semantic data mining: an ontology based approach

Input of the algorithm

a declarative bias (B) to limit the search space (i.e. the classes and properties to use) and a maximal number of iterations

2 thresholds: one for keeping good enough patterns and one for refining the best patterns

a choice among several quality measures to use with the thresholds (e.g. support on the knowledge base)

beam search size


Page 39: Semantic data mining: an ontology based approach

Example

B: classes: PeerReviewedPaper, JournalPaper, property: reviewedBy

1 Refine every pattern from the previous iteration by adding a single restriction for a variable already existing in the pattern. E.g. for the pattern {?x rdf:type :Paper .}, its refinements are:

▸ {?x rdf:type :Paper . ?x rdf:type :PeerReviewedPaper .}
▸ {?x rdf:type :Paper . ?x rdf:type :JournalPaper .}
▸ {?x rdf:type :Paper . ?x :reviewedBy ?y}

2 Evaluate the patterns (with some quality measure, such as support on a dataset) and select only the best ones

3 Repeat steps 1-2 as long as there are patterns to refine and the maximal number of iterations is not exceeded
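The three steps can be sketched as a small beam search. This is a heavily simplified illustration, not Fr-ONT-Qu itself: patterns are reduced to sets of restrictions on a single variable ?x, the toy data are made up, support is a plain count, and one support threshold stands in for the two thresholds of the real algorithm.

```python
# Toy data (hypothetical): classes and outgoing properties per individual.
types = {
    "p1": {"Paper", "PeerReviewedPaper"},
    "p2": {"Paper", "PeerReviewedPaper", "JournalPaper"},
    "p3": {"Paper"},
}
props = {"p1": {"reviewedBy"}, "p2": {"reviewedBy"}, "p3": set()}

# Declarative bias B: the restrictions the search may add.
BIAS = [("type", "PeerReviewedPaper"), ("type", "JournalPaper"), ("prop", "reviewedBy")]

def support(pattern):
    """Number of individuals matching every restriction in the pattern."""
    def ok(x, restriction):
        kind, name = restriction
        return name in (types[x] if kind == "type" else props[x])
    return sum(all(ok(x, r) for r in pattern) for x in types)

def mine(min_support=2, beam=2, max_iter=3):
    frontier = [frozenset({("type", "Paper")})]
    found = []
    for _ in range(max_iter):
        # Step 1: refine by adding one restriction from the bias.
        candidates = {p | {r} for p in frontier for r in BIAS if r not in p}
        # Step 2: evaluate, keep good patterns, refine only the best ones.
        good = sorted((c for c in candidates if support(c) >= min_support),
                      key=lambda c: (-support(c), sorted(c)))
        found.extend(g for g in good if g not in found)
        frontier = good[:beam]
        if not frontier:      # Step 3: stop when nothing is left to refine.
            break
    return found

patterns = mine()
```

On this data the JournalPaper refinements are dropped for low support, and the search stops once no candidate passes the threshold.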


Page 40: Semantic data mining: an ontology based approach

Refinement operator ρ: uses trie data structure

ρ: (locally) finite and complete


Page 41: Semantic data mining: an ontology based approach

Pattern based classification 1/2


Page 42: Semantic data mining: an ontology based approach

Pattern based classification 2/2

We learn features that are optimized with regard to the (classification) task


Page 43: Semantic data mining: an ontology based approach

Propositionalisation 1/2


Page 44: Semantic data mining: an ontology based approach

Propositionalisation 2/2

In this way, learned features may be consumed by any off-the-shelf 'attribute-value' classification algorithm
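Propositionalisation turns each mined pattern into a binary column. A minimal sketch, assuming made-up examples and patterns that are reduced to required sets of classes (real patterns would be SPARQL queries evaluated against the triple store):

```python
# Hypothetical examples: each individual with the set of classes it belongs to.
examples = {
    "p1": {"Paper", "PeerReviewedPaper"},
    "p2": {"Paper", "JournalPaper"},
    "p3": {"Paper"},
}
# Hypothetical mined patterns, each reduced to a required set of classes.
patterns = [{"Paper"}, {"Paper", "PeerReviewedPaper"}, {"Paper", "JournalPaper"}]

def propositionalise(examples, patterns):
    """One row per example, one 0/1 column per pattern (1 = pattern covers example)."""
    return {name: [int(p <= classes) for p in patterns]
            for name, classes in examples.items()}

table = propositionalise(examples, patterns)
```

The resulting rows are ordinary fixed-length feature vectors, which is why any attribute-value learner can take them as input.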


Page 45: Semantic data mining: an ontology based approach

Comparative experiments on classification of semantic data 1/2

we considered published work with available results and datasets (including ESWC 2008 best paper, ESWC 2012 best paper)

various types of methods: kernel methods, statistical relational classifier, concept learning algorithms

we strictly followed the tasks, protocols and experimental setups of the methods


Page 46: Semantic data mining: an ontology based approach

Comparative experiments on classification of semantic data 2/2

For the classification task, Fr-ONT-Qu outperformed state-of-the-art approaches to classification of Semantic Web data
(see: "Pattern based feature construction in semantic data mining" by A. Lawrynowicz, J. Potoniec, IJSWIS 10(1), 2014):

kernel methods: Bloehdorn et al. (2007), Loesch et al. (ESWC 2012 best paper),

statistical relational classifier: SPARQL-ML by Kiefer et al. (ESWC 2008 best paper),

concept learning algorithms: DL-FOIL by Fanizzi et al. (2008), DL-Learner cutting-edge CELOE variant by Lehmann (2009)


Page 47: Semantic data mining: an ontology based approach

What is RapidMiner? 1/2


Page 48: Semantic data mining: an ontology based approach

What is RapidMiner? 2/2


Page 49: Semantic data mining: an ontology based approach

RapidMiner XML based workflow representation


Page 50: Semantic data mining: an ontology based approach

Creating (meta-)dataset for meta-mining

[Figure: building the meta-mining dataset. Recoverable labels:]

DMOP-based repository of DM processes (DMEX-DB)

Dataset for training the meta-miner

> 85 mln RDF triples

Baseline DM experiment set: 1581 executed RapidMiner workflows

Baseline datasets: 11 UCI datasets

Data Characteristics Tool (DCT)

DMOP ontology

Transformation to RDF


Page 51: Semantic data mining: an ontology based approach

Propositionalisation

[Figure labels: DMOP-based RDF repository of DM processes; workflow patterns; dataset characteristics …; features; dataset. The slide also shows the following excerpt.]

Results of experiments. Below we present the results of experimental evaluation of Fr-ONT-Qu in the meta-mining scenario. In the experiments, we used OWLIM SE (v5.3.5849) as an underlying reasoning engine and a semantic store with the owl2-rl-reduced-optimized ruleset. The choice of such a ruleset was motivated by the expressivity of our background knowledge base, e.g. existence of object property chains. During each cycle of cross-validation, Fr-ONT-Qu discovered around 2000 patterns, and redundant patterns were subsequently pruned. We discuss some of the discovered patterns below (for compactness denoting by Bd the body of the base pattern used in the experiments). The first example pattern:

Q1 = select distinct ?x where { Bd ∪
    ?opex2 dmop:executes ?front0 .
    ?opex2 dmop:executes rm:RM-Decision_Tree .
    ?opex2 dmop:hasParameterSetting ?front1 .
    ?front0 dmop:executes rm:DM-Operator .
    ?front0 dmop:implements ?front2 .
    ?front2 a dmop:DM-Algorithm .
    ?front2 a dmop:InductionAlgorithm .
    ?front2 a dmop:ModelingAlgorithm .
    ?front2 a dmop:ClassificationModelingAlgorithm .
    ?front2 a dmop:ClassificationTreeInductionAlgorithm .
}

was mined when Fr-ONT-Qu traversed down the algorithm class hierarchy, specializing variable ?front2. In this way, it is possible to abstract from the level of operators (algorithm implementations) to the level of algorithms and their taxonomy. For instance, both the rm:RM-Decision_Tree and weka:Weka-J48 operators implement a classification tree induction algorithm, and one may generalize over it. The patterns containing class hierarchies provide similar expressivity to that of patterns mined in so-called generalized association rule mining.

The following pattern covers only those workflows that contain the 'Decision Tree' operator, for which the parameter minimal size for split has a value between 2 and 5.5:

Q2 = select distinct ?x where { Bd ∪
    ?opex2 dmop:executes ?front0 .
    ?opex2 dmop:executes rm:RM-Decision_Tree .
    ?opex2 dmop:hasParameterSetting ?front1 .
    ?front0 dmop:executes rm:DM-Operator .
    ?front1 dmop:setsValueOf ?front2 .
    ?front1 dmop:hasValue ?front3 .
    filter(2.000000 <= xsd:double(?front3) && xsd:double(?front3) <= 16.000000) .
    ?front2 dmop:hasParameterKey 'minimal_size_for_split' .
    ?front1 dmop:hasValue ?front3 .
    filter(2.000000 <= xsd:double(?front3) && xsd:double(?front3) <= 9.000000) .
    ?front1 dmop:hasValue ?front3 .
    filter(2.000000 <= xsd:double(?front3) && xsd:double(?front3) <= 5.500000) .
}


Page 52: Semantic data mining: an ontology based approach

Semantic meta-mining results

McNemar's test for pairs of classifiers was performed with the null hypothesis that a classifier built using dataset characteristics and a mined pattern set has the same error rate as the baseline that used dataset characteristics and only the names of the machine learning DM operators

The test confirmed that classifiers trained using workflow patterns performed significantly better (in terms of accuracy) than the baseline
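For reference, McNemar's test compares two classifiers on the same test cases using only the cases where exactly one of them is wrong. The sketch below uses the chi-square form with continuity correction; whether the experiments used this form or the exact binomial form is not stated in the slides, and the data in the example are made up.

```python
import math

def mcnemar(y_true, pred_a, pred_b):
    """McNemar's chi-square test (continuity-corrected) for two classifiers."""
    # b: A correct and B wrong; c: A wrong and B correct.
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    if b + c == 0:
        return 0.0, 1.0                      # the classifiers never disagree
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival, 1 d.o.f.
    return stat, p_value

# Made-up example: classifier A is right on all 20 cases, B is wrong on 12.
y_true = [1] * 20
pred_a = [1] * 20
pred_b = [0] * 12 + [1] * 8
stat, p_value = mcnemar(y_true, pred_a, pred_b)
```

A small p-value rejects the null hypothesis that both classifiers have the same error rate.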


Page 53: Semantic data mining: an ontology based approach

Sharing: Standardization of DM/ML schemas


Page 54: Semantic data mining: an ontology based approach

Evolution of the field of DM/ML ontologies

[Figure: timeline from 2008 to 2016 with two tracks, ontologies/vocabularies and events. Recoverable labels: OntoDM, DMOP, Exposé, DMWF, MEX, ML Schema Core, Experiment Databases platform, OpenML platform, Data Mining Ontology Jamboree (Slovenia), OpenML 2016 (Netherlands), W3C Machine Learning Schema Community Group.]

others: KDDONTO, KD, ...

Page 55: Semantic data mining: an ontology based approach

OntoDM

Pance Panov, Larisa N. Soldatova, Saso Dzeroski: Ontology of core data mining entities. Data Min. Knowl. Discov. 28(5-6): 1222-1265 (2014)

built in compliance with the upper-level ontologies BFO, OBI, IAO; modularized

incorporates structured data mining

Use case: generic, middle-level ontology for ML; representing QSAR entities for drug design, used by the Eve Robot Scientist


Page 56: Semantic data mining: an ontology based approach

DMOP: Data Mining Optimization Ontology

C. Maria Keet, Agnieszka Lawrynowicz, Claudia d'Amato, Alexandros Kalousis, Phong Nguyen, Raul Palma, Robert Stevens, Melanie Hilario: The Data Mining OPtimization Ontology. J. Web Sem. 32: 43-53 (2015)

development started in e-LICO EU FP7 project (2009-2012)

detailed algorithm internal characteristics ('qualities')

Use case: meta-learning ('whitebox'), meta-mining, used to produce the Intelligent Discovery Assistant for RapidMiner


Page 57: Semantic data mining: an ontology based approach

Exposé

Joaquin Vanschoren, Hendrik Blockeel, Bernhard Pfahringer, Geoffrey Holmes: Experiment databases - A new way to share, organize and learn from experiments. Machine Learning 87(2): 127-158 (2012)

re-uses OntoDM (at top-level) and DMOP (at bottom level)

superseded by OpenML DB schema

Use case: experiment databases, ExpML markup


Page 58: Semantic data mining: an ontology based approach

Early work towards aligning DM/ML ontologies (2010)

DMO Ontology Jamboree, Jožef Stefan Institute, Slovenia


Page 59: Semantic data mining: an ontology based approach

MEX vocabulary

Diego Esteves, Diego Moussallem, Ciro Baron Neto, Tommaso Soru, Ricardo Usbeck, Markus Ackermann, Jens Lehmann: MEX vocabulary: a lightweight interchange format for machine learning experiments. SEMANTICS 2015: 169-176

lightweight interchange format

maps to PROV

Use case: annotating ML experiments and interchanging ML metadata


Page 60: Semantic data mining: an ontology based approach

How to make existing DM/ML ontologies compatible?


Page 61: Semantic data mining: an ontology based approach

W3C Machine Learning Schema Community Group (2015)

https://www.w3.org/community/ml-schema/


Page 62: Semantic data mining: an ontology based approach

OpenML, Lorentz Center, Netherlands (2016)

First draft of ML Schema Core https://github.com/ML-Schema/core


Page 63: Semantic data mining: an ontology based approach

Sharing beyond DM/ML domain

Mapping DMOP to workflow ontologies (Research Objects, OPMW)
(ROHub hosted by Poznan Supercomputing and Networking Center)


Page 64: Semantic data mining: an ontology based approach

Semantic data mining: more information

Semantic data mining tutorial @ ECML/PKDD'2011
http://videolectures.net/ecmlpkdd2011_lavrac_vavpetic_mining/

peculiarities of the learning setting: Open World Assumption, what is a "truly semantic" similarity measure?, ...

methods, applications, tools


Page 65: Semantic data mining: an ontology based approach

Summary

semantic data mining: data mining with ontologies as background/prior knowledge, most often from structured data

ontologies work best if engineered with use cases in mind

learning in description logics and clausal languages is hard; heuristics, dealing with peculiarities

Fr-ONT-Qu semantic pattern mining algorithm: theoretical properties, practical evaluation

use case: semantic meta-mining for constructing an Intelligent Data Mining Assistant

importance of interoperability (for scientific reproducibility, for inter-domain applications)


Page 66: Semantic data mining: an ontology based approach

Acknowledgements

Polish National Science Center under the SONATA program "ARISTOTELES: Methodology and algorithms for automatic revision of ontologies in task based scenarios" (2014/13/D/ST6/02076) (2015-2018)

Foundation for Polish Science under the POMOST programme, co-financed by the European Union, Regional Development Fund (POMOST/2013-7/8) (2013-2015)

EU FP7 ICT-2007.4.4 (231519) "e-LICO: An e-Laboratory for Interdisciplinary Collaborative Research in Data Mining and Data-Intensive Science" (2009-2012)

Fr-ONT-Qu, meta-mining experiments done jointly with Jedrzej Potoniec

Contributors to the development of DMOP and/or other e-LICO infrastructure used in the research described in this presentation: Melanie Hilario, C. Maria Keet, Claudia d'Amato, Huyen Do, Simon Fischer, Dragan Gamberger, Lina Al-Jadir, Simon Jupp, Alexandros Kalousis, Joerg-Uwe Kietz, Petra Kralj Novak, Babak Mougouie, Phong Nguyen, Raul Palma, Floarea Serban, Robert Stevens, Anze Vavpetic, Jun Wang, Derry Wijaya, Adam Woznica
