Mining Usage Patterns from a Repository of Scientific Workflows

Claudia Diamantini, Domenico Potena, Emanuele Storti
DII, Università Politecnica delle Marche, Ancona, Italy
SAC 2012, Riva del Garda, March 26-30

description

Full paper: http://boole.diiga.univpm.it/paper/sac2012.pdf

In many experimental domains, especially e-Science, workflow management systems are gaining increasing attention as a means to design and execute in-silico experiments involving data analysis tools. As a by-product, a repository of workflows is generated that formally describes experimental protocols and the way different tools are combined inside experiments. In this paper we describe the use of the SUBDUE graph clustering algorithm to discover sub-workflows from a repository. Since sub-workflows represent significant usage patterns of tools, the discovered knowledge can be exploited by scientists to learn by example about design practices, or to retrieve and reuse workflows. Such knowledge ultimately leverages the potential of scientific workflow repositories to become a knowledge asset. A set of experiments is conducted on the myExperiment repository to assess the effectiveness of the approach.

Transcript of Mining Usage Patterns from a Repository of Scientific Workflow

Page 1: Mining Usage Patterns from a Repository of Scientific Workflow

Mining Usage Patterns from a Repository of Scientific Workflows

Claudia Diamantini, Domenico Potena, Emanuele Storti

DII, Università Politecnica delle Marche, Ancona, Italy

SAC 2012, Riva del Garda, March 26-30

Emanuele Storti (UNIVPM) Mining Usage Patterns SAC 2012 1 / 21

Page 2: Mining Usage Patterns from a Repository of Scientific Workflow

Introduction

Process mining techniques extract knowledge from event logs, i.e. traces of running processes:

- audit trails of a workflow management system
- transaction logs of an enterprise resource planning system

Main applications:
- process discovery: what is really happening?
- conformance check: are we doing what was agreed upon?
- process extension: how can we redesign the process?

Page 5: Mining Usage Patterns from a Repository of Scientific Workflow

Motivation

Little work exists on how to extract knowledge from a set of models (i.e. process schemas). Possible uses:

- understanding of common patterns of usage (best/worst practices)
- support in process design
- process integration (from multiple sources)
- process optimization/re-engineering
- indexing of processes and their retrieval

Page 6: Mining Usage Patterns from a Repository of Scientific Workflow

Methodology

Process schemas have an inherent graph structure, and many graph-mining tools are available.

Approach:
1. representation of the original processes as graphs
2. use of graph-mining techniques to extract substructures (SUBs)

Open questions:
- Which representation form?
- Which specific technique? Graph clustering techniques (sub-processes are clusters) with a hierarchical approach (various levels of abstraction, more/less specificity)

Page 10: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE

SUBDUE [Jonyer, 2000]:
- graph-based hierarchical clustering algorithm
- suited for discrete-valued and structured data
- searches for substructures (i.e., subgraphs) that best compress the input graph, according to the Minimum Description Length (MDL) principle

Comments:
- to compress G with a SUB: replace every occurrence of SUB in G with a single new node
- SUB best compresses G: the number of bits needed to represent G after the compression is minimum
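
The compression step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the occurrences of the SUB are taken as given and assumed not to overlap (finding them is the hard part and is not shown), and the "size" used here is a plain node-plus-edge count standing in for the bit-level description length that MDL actually measures. The names `compress` and `size` and the tool labels are hypothetical.

```python
def compress(nodes, edges, occurrences, sub_label):
    """Replace each occurrence (a set of node ids) with one new node.

    nodes: dict node_id -> label; edges: set of (src, dst) pairs.
    Assumption: occurrences are disjoint and already known.
    """
    owner = {}                      # original node id -> id of the new SUB node
    new_nodes = dict(nodes)
    next_id = max(nodes) + 1
    for occurrence in occurrences:
        for node_id in occurrence:
            owner[node_id] = next_id
            del new_nodes[node_id]
        new_nodes[next_id] = sub_label
        next_id += 1
    new_edges = set()
    for src, dst in edges:
        s, d = owner.get(src, src), owner.get(dst, dst)
        if s != d:                  # edges internal to an occurrence disappear
            new_edges.add((s, d))
    return new_nodes, new_edges


def size(nodes, edges):
    """Crude stand-in for description length: count of nodes and edges."""
    return len(nodes) + len(edges)


# Two toy workflows that both start with fetch -> align.
nodes = {0: "fetch", 1: "align", 2: "plot", 3: "fetch", 4: "align", 5: "report"}
edges = {(0, 1), (1, 2), (3, 4), (4, 5)}
occurrences = [{0, 1}, {3, 4}]      # instances of the SUB fetch -> align

compressed_nodes, compressed_edges = compress(nodes, edges, occurrences, "SUB_1")
print(size(nodes, edges), "->", size(compressed_nodes, compressed_edges))
```

A full MDL comparison would also charge for storing the definition of the SUB itself (here two nodes and one edge), so that a pattern occurring only once does not pay off.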

Page 13: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

1. searches for the best SUB (i.e., the one that minimizes the Description Length of the graph)
2. compresses the graph by using the best SUB

Subsequent SUBs may be defined in terms of previously defined SUBs. Iterating this basic step results in a lattice.
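
To make the iteration concrete, here is a toy analogue of the search-then-compress loop. It is not SUBDUE: workflows are simplified to linear sequences of tool names, a candidate SUB is just a pair of adjacent labels, and "best" means most frequent rather than MDL-optimal. What it does show is how later SUBs end up defined in terms of earlier ones. All names and labels are hypothetical.

```python
from collections import Counter


def best_pair(workflows):
    """Most frequent adjacent label pair, or None if nothing repeats."""
    counts = Counter()
    for wf in workflows:
        counts.update(zip(wf, wf[1:]))
    if not counts:
        return None
    # sorted() makes tie-breaking deterministic (alphabetical order)
    pair, freq = max(sorted(counts.items()), key=lambda kv: kv[1])
    return pair if freq >= 2 else None


def replace_pair(wf, pair, name):
    """Rewrite one workflow, substituting `name` for each occurrence of `pair`."""
    out, i = [], 0
    while i < len(wf):
        if i + 1 < len(wf) and (wf[i], wf[i + 1]) == pair:
            out.append(name)
            i += 2
        else:
            out.append(wf[i])
            i += 1
    return out


def build_hierarchy(workflows, max_iters=10):
    """Repeat 'find best SUB, compress' and record each SUB definition."""
    definitions = {}
    current = [list(wf) for wf in workflows]
    for k in range(1, max_iters + 1):
        pair = best_pair(current)
        if pair is None:
            break
        name = f"SUB_{k}"
        definitions[name] = pair
        current = [replace_pair(wf, pair, name) for wf in current]
    return definitions, current


workflows = [
    ["fetch", "align", "tree", "plot"],
    ["fetch", "align", "tree", "report"],
    ["fetch", "align", "export"],
]
definitions, compressed = build_hierarchy(workflows)
print(definitions)
```

In this run `SUB_2` is defined as `SUB_1` followed by `tree`, which is the kind of parent/child link that makes up the lattice.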

Page 20: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE (cont’d)

Traditional cluster evaluation is not applicable to hierarchical domains:

quality = interClusterDistance / intraClusterDistance

Desirable features of a good clustering:
- few and big clusters (large coverage, good generality)
- minimal/no overlap (better defined concepts)

New indexes:
- completeness: % of original nodes/edges still present in the final lattice
- representativeness (of a SUB): % of input graphs holding the SUB at least once
- significance: qualitative measure of the meaningfulness of a cluster w.r.t. the domain and application
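
The two quantitative indexes are simple percentages. The sketch below shows one way to compute them, assuming that the set of graphs containing a SUB and the set of original elements surviving in the lattice have already been determined; the function names and toy inputs are illustrative and not taken from the paper.

```python
def representativeness(graphs_with_sub, n_graphs):
    """% of input graphs that hold the SUB at least once."""
    return 100.0 * len(graphs_with_sub) / n_graphs


def completeness(original_elements, lattice_elements):
    """% of original nodes/edges still present in the final lattice."""
    kept = original_elements & lattice_elements
    return 100.0 * len(kept) / len(original_elements)


# Toy data: 4 input graphs, a SUB found in graphs 0 and 2;
# 5 original elements, 4 of which appear somewhere in the lattice.
rep = representativeness({0, 2}, n_graphs=4)
comp = completeness({"a", "b", "c", "d", "e"}, {"a", "b", "c", "d"})
print(rep, comp)
```

Significance has no formula here because the slides define it as a qualitative judgement.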

Page 24: Mining Usage Patterns from a Repository of Scientific Workflow

Representation

Approach:
1. representation of the original processes as graphs
2. use of graph-mining techniques to extract SUBs

Common operators:
- sequence
- split-AND (parallel split), split-XOR (exclusive choice)
- join-AND (synchronization), join-XOR (simple merge)

How to map the process to the graph? Several choices are possible.

Page 25: Mining Usage Patterns from a Repository of Scientific Workflow

Representation models: A rule

- lowest level of compactness
- every operator is represented by a node called operator, linked to another node specifying its type (AND/OR/SEQ) and to its operands
- join/split are distinguishable by the number of in/out-going arcs

Page 26: Mining Usage Patterns from a Repository of Scientific Workflow

Representation models: B rule

- medium level of compactness
- the closest to the original representation
- join/split are replaced by different nodes, one for each kind of operator
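
The difference in compactness between the A and B rules can be illustrated on a small parallel split and join. The sketch below is a reading of the slide text and not the authors' implementation: the input format, the node identifiers and the edge direction (operands feed into the operator, which feeds its outputs) are assumptions made for the example.

```python
def encode(tasks, operators, rule):
    """Encode a process as (nodes, edges) under the A or B rule.

    tasks: list of task names.
    operators: list of (role, kind, inputs, outputs), e.g.
               ("split", "AND", ["fetch"], ["align", "annotate"]).
    """
    nodes = {t: t for t in tasks}          # node id -> label
    edges = []
    for i, (role, kind, inputs, outputs) in enumerate(operators):
        op = f"op{i}"
        if rule == "A":
            # generic "operator" node plus a separate node carrying its type;
            # split vs. join is implied by the number of in/out arcs
            nodes[op] = "operator"
            nodes[f"{op}_type"] = kind
            edges.append((op, f"{op}_type"))
        elif rule == "B":
            # one node whose label names the specific operator
            nodes[op] = f"{role}-{kind}"
        else:
            raise ValueError("rule must be 'A' or 'B'")
        edges += [(src, op) for src in inputs]
        edges += [(op, dst) for dst in outputs]
    return nodes, edges


tasks = ["fetch", "align", "annotate", "merge"]
operators = [
    ("split", "AND", ["fetch"], ["align", "annotate"]),
    ("join", "AND", ["align", "annotate"], ["merge"]),
]
nodes_a, edges_a = encode(tasks, operators, "A")
nodes_b, edges_b = encode(tasks, operators, "B")
print(len(nodes_a), len(edges_a), "vs", len(nodes_b), len(edges_b))
```

Under these assumptions the A rule spends two nodes per operator where the B rule spends one, which is consistent with the A rule being described as the least compact.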

Page 27: Mining Usage Patterns from a Repository of Scientific Workflow

Representation models: C rule

- highest level of compactness
- implicit representation of operators
- multiple alternative graphs in case of SPLIT-XOR or JOIN-XOR
- arcs are labeled to preserve information about domain/range nodes

Page 28: Mining Usage Patterns from a Repository of Scientific Workflow

Evaluation

Comments:
- the three representations hold the same information, with no loss
- easy extension with more attributes (actor name/role, info about time/resources, ...):
  - attribute name → an edge
  - value → a node

Experiments are needed to assess which one performs best w.r.t. the quality of clustering.
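
The attribute extension amounts to adding, for each attribute of a task, one labeled edge and one value node. A minimal sketch follows, assuming edges are stored as (source, target, label) triples and that each attribute gets its own value node; whether equal values should share a node is not stated in the slides. The helper name `add_attribute` and the sample attributes are hypothetical.

```python
def add_attribute(nodes, edges, task_id, name, value):
    """Attach an attribute to a task: name becomes the edge label, value a node."""
    value_id = f"{task_id}.{name}"
    nodes[value_id] = str(value)               # value -> a node
    edges.append((task_id, value_id, name))    # attribute name -> an edge
    return value_id


nodes = {"t1": "align"}
edges = []
add_attribute(nodes, edges, "t1", "actor", "analyst")
add_attribute(nodes, edges, "t1", "duration_min", 12)
print(nodes)
print(edges)
```

Since SUBDUE is described above as suited for discrete-valued data, continuous attributes such as durations would presumably need to be binned before being turned into node labels.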

Page 31: Mining Usage Patterns from a Repository of Scientific Workflow

myExperiment

Dataset:
- 258 processes (XScufl language for the Taverna framework)
- e-Science workflows (WF) for distributed computation over scientific data
- applications: local tools, Web Services, scripts

Setting:
- WF extraction/parsing
- preprocessing
- translation into graphs (A, B, C rules)
- SUBDUE execution

Page 32: Mining Usage Patterns from a Repository of Scientific Workflow

Experimental results

[Figure: fragment of the output lattice. Red boxes = top-level SUBs.]

Page 33: Mining Usage Patterns from a Repository of Scientific Workflow

Experimental results

| | A rule | B rule | C rule |
|---|---|---|---|
| Size (nodes) | 5618 | 2871 | 2414 |
| SUBs (num) | 671 | 611 | 580 |
| top SUBs | 2.24% | 52.37% | 52.93% |
| Completeness | 95.41% | 90.90% | 90.27% |
| Representativeness (first 10 top SUBs) | 99.22% | 31.78% | 23.26% |
| Significance of top-level clusters | – | + | ++ |
| Execution time (hh:mm) | 03:38 | 00:06 | 00:03 |

Considerations:
- A model: low significance, high execution time
- No definitively best choice between B and C:
  - B model: higher representativeness → indexing
  - C model: higher significance → usage pattern discovery

Page 41: Mining Usage Patterns from a Repository of Scientific Workflow

Use Case

Scenario: a scientist wants to use web service X for DNA alignment.

Baseline: browse myExperiment to find every workflow with X:
- too many results
- given a workflow: common practice or exception?

New approach: search for X in the lattice's SUBs:
1. find all usage patterns for X (i.e., top SUBs containing X)
2. analyse their representativeness
3. choose one of them or browse the lattice in more depth
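
The lookup in the new approach can be sketched as a search over SUB definitions followed by a ranking on representativeness. The data layout below (each SUB defined as a tuple of tool names or earlier SUB names, plus a precomputed representativeness per SUB) is an assumption for illustration, as are the tool names and percentages; for simplicity the sketch searches all SUBs and does not restrict itself to top-level ones.

```python
def expand(name, definitions):
    """Set of tool labels inside a SUB, expanding nested SUBs recursively."""
    tools = set()
    for part in definitions[name]:
        if part in definitions:
            tools |= expand(part, definitions)
        else:
            tools.add(part)
    return tools


def patterns_for(tool, definitions, representativeness):
    """SUBs containing `tool`, most representative first."""
    hits = [s for s in definitions if tool in expand(s, definitions)]
    return sorted(hits, key=lambda s: representativeness[s], reverse=True)


# Hypothetical lattice: SUB_2 builds on SUB_1.
definitions = {
    "SUB_1": ("blast", "parse"),
    "SUB_2": ("SUB_1", "plot"),
    "SUB_3": ("fetch", "clustalw"),
}
representativeness = {"SUB_1": 40.0, "SUB_2": 12.5, "SUB_3": 25.0}

print(patterns_for("blast", definitions, representativeness))
```

The ranked list gives the scientist the common practices first, with more specific and rarer patterns further down.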

Page 47: Mining Usage Patterns from a Repository of Scientific Workflow

Use case (cont’d)

Applications:
- reference during process design (learn-by-example)
- process retrieval: find every myExperiment WF containing such a pattern

Page 49: Mining Usage Patterns from a Repository of Scientific Workflow

Discussion

A graph-based clustering technique (SUBDUE) is used to recognize the most frequent/common sub-processes among a set of workflows/process schemas.

Comments:
- several representation alternatives for the application of SUBDUE
- evaluation on a specialized e-Science domain
- useful to find typical patterns and schemas of usage for tools/tasks

Page 50: Mining Usage Patterns from a Repository of Scientific Workflow

Conclusion

Applications:
- recognition of reference processes (common/best/worst practices)
- organization of a process repository (by indexing processes through substructures)
- enterprise integration: find similarities (differences, overlaps, complementarities) in business processes (BP) of different companies

Future work:
- extending the experiments, especially with (real) business processes
- management of heterogeneity (syntactic/semantic, level of granularity)

Page 61: Mining Usage Patterns from a Repository of Scientific Workflow

SUBDUE

Subdue uses a variant of beam search (heuristic). The best SUB minimizes the Description Length of the graph.

1. Search for the best SUB
   - Init: set of SUBs consisting of all uniquely labeled vertices
   - extend each SUB in all possible ways by a single edge and a vertex
   - order SUBs according to the MDL principle
   - keep only the top n SUBs
   - End: exhaustion of the search space or user constraints
2. Compress the graph with the best SUB

Subsequent SUBs may be defined in terms of previously defined SUBs. Iterating this basic step results in a lattice.
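
The search step can be sketched as a small beam search. This is a simplified illustration, not Subdue's implementation: candidate SUBs are restricted to directed paths and are only extended forward from their last vertex, overlapping instances are not filtered, and the MDL ordering is replaced by a node-and-edge saving estimate. The names `score` and `beam_search`, the parameters `beam_width` and `max_size`, and the toy graph are assumptions.

```python
from collections import defaultdict


def score(pattern, instances):
    """Estimated saving: elements removed by compression minus pattern cost."""
    k = len(pattern)
    saved = len(instances) * 2 * (k - 1)   # (k-1) nodes + (k-1) edges per instance
    definition = 2 * k - 1                 # k nodes + (k-1) edges to store the SUB
    return saved - definition


def beam_search(labels, succ, beam_width=3, max_size=4):
    # Init: one candidate SUB per unique vertex label
    frontier = defaultdict(list)
    for node, label in labels.items():
        frontier[(label,)].append((node,))
    best = None
    while frontier:                        # End: search space exhausted
        # Order candidates and keep only the top `beam_width`
        ranked = sorted(
            frontier.items(),
            key=lambda kv: (-score(*kv), -len(kv[1]), kv[0]),
        )[:beam_width]
        top_pattern, top_instances = ranked[0]
        top_score = score(top_pattern, top_instances)
        if best is None or top_score > best[0]:
            best = (top_score, top_pattern)
        # Extend each surviving SUB by a single edge and a vertex
        frontier = defaultdict(list)
        for pattern, instances in ranked:
            if len(pattern) >= max_size:   # user constraint on SUB size
                continue
            for path in instances:
                for nxt in succ.get(path[-1], []):
                    if nxt not in path:
                        frontier[pattern + (labels[nxt],)].append(path + (nxt,))
    return best


# Three toy workflows stored in one graph.
labels = {0: "fetch", 1: "align", 2: "tree", 3: "plot",
          4: "fetch", 5: "align", 6: "tree", 7: "report",
          8: "fetch", 9: "align", 10: "export"}
succ = {0: [1], 1: [2], 2: [3], 4: [5], 5: [6], 6: [7], 8: [9], 9: [10]}

best = beam_search(labels, succ)
print(best)
```

On this toy graph the pair `fetch` then `align` wins; the longer `fetch`, `align`, `tree` path reaches the same estimated saving but does not exceed it, so the earlier and smaller SUB is kept.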