[IEEE Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007) - Haikou,...



Page 1: [IEEE Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007) - Haikou, China (2007.08.24-2007.08.27)] Fourth International Conference on Fuzzy Systems

Workflow Similarity Measure for Process Clustering in Grid

Yi Wang, Minglu Li, Jian Cao, Xinhua Lin, Feilong Tang
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
{wangsuper, li-ml, cao-jian, lin-xh, tang-fl}@cs.sjtu.edu.cn

Abstract

In a grid environment, a workflow process can be seen not only as a cooperative approach to combining grid services and resources, but also as reusable and sharable knowledge for solving a specific problem. Research on grid workflow process clustering can promote knowledge discovery and reuse in the grid. In this paper, we put forward a grid workflow process design method using Event-Condition-Action (ECA) rules and propose a new process similarity measure approach. We then use a case to show the feasibility of the approach and briefly show how to revise an existing clustering algorithm with the similarity measure.

1. Introduction

Grid workflow plays an increasingly important role in emerging grid technology. With the trend of merging grid and service-oriented technology, the grid has become a distributed Problem Solving Environment (PSE) shared among different users, and a grid workflow process can be seen not only as a cooperative approach to combining grid services and resources, but also as sharable knowledge for solving a specific problem. It is therefore necessary to cluster grid workflow processes, reducing the large number of raw processes by categorizing them into smaller sets of similar items. We give a novel approach for calculating the process similarity of ECA rule-based grid workflows, and introduce a similarity-based algorithm for grid workflow clustering.

The remainder of this paper is organized as follows. Section 2 overviews related work. Section 3 briefly introduces ECA rules and analyzes how they support typical workflow patterns. Section 4 proposes a new process similarity measure based on the comparison of ECA rules. Section 5 uses a similarity measure case to show the feasibility of our approach. Section 6 concludes the paper and briefly points out future work.

2. Related Works

The approach proposed in [1] clusters execution traces (logs) of processes using k-means clustering. Ref. [2] puts forward a process similarity measure based on both domain classification and pattern analysis. Ref. [3] converts each workflow dependency graph into a binary branch vector and takes the distance between two binary branch vectors as the distance between the two processes. An inexact process matching approach is introduced in [4]; it uses ontology paths to calculate the distance between two activities and gives some rules for similarity comparison. A weighted graph is introduced in [5] for comparing processes; the graph similarity is the weighted sum of the similarity between sets of services and the similarity between sets of service links.

3. Process Design Based on ECA Rule

The Event-Condition-Action (ECA) rule was put forward in the research field of active databases [6]. Such rules make data repositories react to internal or external events and trigger a chain of activities that includes notifying users and applications or performing database updates. This behavior is similar to that of a business process, so we developed a grid workflow management system based on ECA rules [7].

Figure 1. ECA Rule

As shown in Figure 1, an ECA rule essentially consists of two parts: an event and a list of condition-action pairs. When the event occurs, the conditions in the list are evaluated; for each condition that is satisfied, the corresponding action is executed. A formal definition of ECA rule-based workflow can be found in [7].
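The structure of an ECA rule can be illustrated with a minimal Python sketch. It is not the rule engine of [7]; the class name `EcaRule`, the method `fire`, and the dictionary-based context are assumptions introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Condition = Callable[[Context], bool]
Action = Callable[[Context], None]


@dataclass
class EcaRule:
    """Hypothetical ECA rule: one event plus a list of condition-action pairs."""
    event: str  # e.g. "EndOf(a)"
    pairs: List[Tuple[Condition, Action]] = field(default_factory=list)

    def fire(self, occurred: str, ctx: Context) -> int:
        """Run the action of every satisfied condition; return how many ran."""
        if occurred != self.event:
            return 0
        executed = 0
        for condition, action in self.pairs:
            if condition(ctx):
                action(ctx)
                executed += 1
        return executed


# Usage: after activity "a" ends, start "b" for large inputs, otherwise "c".
log: List[str] = []
rule = EcaRule(
    event="EndOf(a)",
    pairs=[
        (lambda ctx: ctx["size"] > 100, lambda ctx: log.append("b")),
        (lambda ctx: ctx["size"] <= 100, lambda ctx: log.append("c")),
    ],
)
rule.fire("EndOf(a)", {"size": 250})
```

In this sketch only the first condition holds, so only activity "b" is recorded in `log`.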

Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007)0-7695-2874-0/07 $25.00 © 2007


Table 1 shows the ECA rules for the typical workflow patterns. It summarizes only the ECA rules triggered by the EndOf(a) event. We also define other events, such as “BeginOf(a)” and “ErrorOf(a)”.

Table 1. ECA Rules for Basic Workflow Patterns

4. Process Similarity Measure

4.1. Activity Similarity Measure

The activity distance measure estimates the functional dissimilarity between two activities. Here we borrow the idea from [4] with some modifications. We use $ADis(a, a')$ to represent the distance between activities $a$ and $a'$.

Table 2. Activity Distance Measure

Cate(a)     | Cate(a')    | LinkNumber(a, a') | ADis(a, a')
Start       | Start       | no ontology       | 0
End         | End         | no ontology       | 0
Delay       | Delay       | no ontology       | 0
Assign      | Assign      | no ontology       | 0
Service     | Service     | n                 | n
Other cases |             |                   | +∞

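Table 2 shows the value of $ADis(a, a')$ in the different cases. If the categories of $a$ and $a'$ are the same and are not Service, $ADis(a, a')$ is 0. If $a$ and $a'$ are both service activities, we count the minimal link number $n$ between the ontologies of $a$ and $a'$ in the ontology tree and set $ADis(a, a') = n$; in particular, if $a$ is the same as $a'$, $ADis(a, a') = 0$. In all other cases, $ADis(a, a') = +\infty$. After obtaining the distance between activities $a$ and $a'$, we calculate the activity similarity as:

$$ASim(a, a') = \frac{1}{ADis(a, a') + 1} \tag{1}$$

Identical activities therefore have similarity 1, and activities in the "other cases" row have similarity 0.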
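Table 2 and Eq. (1) can be sketched in Python as follows. The ontology tree below is hypothetical (it is not the tree of Figure 4), and the names `ONTOLOGY`, `link_number`, `a_dis`, and `a_sim` are assumptions made for this illustration; the tree is chosen so that two of its concepts are 3 links apart.

```python
import math
from typing import Dict, List, Optional

# Hypothetical ontology tree stored as child -> parent (the root maps to None).
ONTOLOGY: Dict[str, Optional[str]] = {
    "ImageProcessing": None,
    "Reverse": "ImageProcessing",
    "Border": "ImageProcessing",
    "BorderWithLaplace": "Border",
}


def path_to_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def link_number(x: str, y: str, parent: Dict[str, Optional[str]]) -> int:
    """Minimal number of links between x and y in the tree."""
    depth_from_x = {n: d for d, n in enumerate(path_to_root(x, parent))}
    for depth_from_y, n in enumerate(path_to_root(y, parent)):
        if n in depth_from_x:  # lowest common ancestor reached
            return depth_from_x[n] + depth_from_y
    raise ValueError("nodes are not in the same tree")


def a_dis(cate_a: str, cate_b: str,
          onto_a: Optional[str] = None, onto_b: Optional[str] = None) -> float:
    """Activity distance following Table 2."""
    if cate_a != cate_b:
        return math.inf                      # "other cases"
    if cate_a != "Service":
        return 0.0                           # same non-service category
    return float(link_number(onto_a, onto_b, ONTOLOGY))


def a_sim(cate_a: str, cate_b: str,
          onto_a: Optional[str] = None, onto_b: Optional[str] = None) -> float:
    """Activity similarity, Eq. (1)."""
    return 1.0 / (a_dis(cate_a, cate_b, onto_a, onto_b) + 1.0)


example = a_sim("Service", "Service", "Reverse", "BorderWithLaplace")
```

With this assumed tree, the two service ontologies are 3 links apart, so `example` evaluates to 1/(3+1).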

4.2. Event Similarity Measure

4.2.1 Atomic Event Similarity Measure

Events are divided into atomic events and composite events. Table 3 shows how to evaluate the similarity between two atomic events $ae$ and $ae'$, where $a$ and $a'$ are the activities related to $ae$ and $ae'$ respectively and $AESim(ae, ae')$ represents the similarity of $ae$ and $ae'$. If the category of $ae$ (such as Begin, End, or Error) is the same as the category of $ae'$, then $AESim(ae, ae') = ASim(a, a')$; otherwise $AESim(ae, ae') = 0$.

Table 3. Atomic Event Similarity Measure

Cate(ae) is same as Cate(ae') | AESim(ae, ae')
Yes                           | ASim(a, a')
No                            | 0

4.2.2 Complex Event Similarity Measure

A complex event can be expressed as a sequence of atomic events connected by the logic nodes “∧” and “∨”. A complex event has only one principal disjunctive normal form (PDNF), so before calculating the similarity between two complex events we always transform them into atomic event sequences in PDNF. Assume $ce$ and $ce'$ are two complex events, and let $CESim(ce, ce')$ represent the similarity of $ce$ and $ce'$. It is calculated with the following steps:

a. Transform $ce$ and $ce'$ into PDNF over atomic events. Assume that $ce$ and $ce'$ have $n_{cc}$ and $n'_{cc}$ clauses respectively, with $n_{cc} \ge n'_{cc}$:

$$ce = \bigvee_{i=1}^{n_{cc}} \bigwedge_{j=1}^{k_i} ce.cc_i.ae_j \tag{2}$$

$$ce' = \bigvee_{i=1}^{n'_{cc}} \bigwedge_{j=1}^{k'_i} ce'.cc'_i.ae'_j \tag{3}$$

In (2), $ce.cc_i$ is the $i$th clause of $ce$, $ce.cc_i.ae_j$ is the $j$th atomic event in $ce.cc_i$, and $k_i$ is the number of atomic events in $ce.cc_i$. The symbols in (3) have analogous meanings.

b. Take $n'_{cc}$ clauses from $ce$ in an arbitrary order. This yields $P_{n_{cc}}^{n'_{cc}}$ disjunctive forms

$$ce^*[p] = \bigvee_{i=1}^{n'_{cc}} \bigwedge_{j=1}^{k^*_i} ce^*[p].cc_i.ae_j \tag{4}$$

where $P_{n_{cc}}^{n'_{cc}}$ is the number of permutations of $n'_{cc}$ clauses taken from $n_{cc}$ clauses, $p$ is an integer with $1 \le p \le P_{n_{cc}}^{n'_{cc}}$, and $ce^*[p]$ is the $p$th permutation of $ce$.

c. For each $ce^*[p]$, calculate the similarity $TCESim(ce^*[p], ce')$ of $ce^*[p]$ and $ce'$ as the average similarity of the corresponding clauses in $ce^*[p]$ and $ce'$:

$$TCESim(ce^*[p], ce') = \frac{1}{n'_{cc}} \sum_{i=1}^{n'_{cc}} CCSim(ce^*[p].cc_i, ce'.cc'_i) \tag{5}$$

For any two clauses $cc_i$ and $cc'_i$, let $k^*_i$ and $k'_i$ be the numbers of atomic events in $cc_i$ and $cc'_i$ respectively. If $k^*_i \ne k'_i$, then $CCSim(cc_i, cc'_i) = 0$. Otherwise, writing $k' = k'_i$, we calculate $CCSim(cc_i, cc'_i)$ with the following steps:

c.1. Permute the atomic events of $cc_i$ to get $P_{k'}^{k'} = k'!$ atomic event sequences connected with “∧”:

$$cc_i^{\#}[p^{\#}] = \bigwedge_{j=1}^{k'} cc_i^{\#}.ae_j$$

where $p^{\#}$ is an integer with $1 \le p^{\#} \le k'!$, and $cc_i^{\#}[p^{\#}]$ is the $p^{\#}$th permutation of $cc_i$.

c.2. For each $cc_i^{\#}[p^{\#}]$, calculate the similarity of $cc_i^{\#}[p^{\#}]$ and $cc'_i$ as

$$TCCSim(cc_i^{\#}[p^{\#}], cc'_i) = \frac{1}{k'} \sum_{j=1}^{k'} AESim(cc_i^{\#}.ae_j, cc'_i.ae'_j) \tag{6}$$

c.3. Take the maximum over all permutations:

$$CCSim(cc_i, cc'_i) = Max\big(TCCSim(cc_i^{\#}[p^{\#}], cc'_i)\big) \tag{7}$$

d. The similarity of the two complex events is

$$CESim(ce, ce') = \frac{n'_{cc}}{n_{cc}} \, Max\big(TCESim(ce^*[p], ce')\big) \tag{8}$$
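Steps a to d can be sketched in Python as a brute-force search over permutations. This is only an illustration under stated assumptions: events are already in PDNF and represented as lists of non-empty clauses, the atomic event similarity is passed in as a symmetric function `ae_sim`, and the toy `ae_sim` below (with a fixed value of 0.25 for different activities) is hypothetical.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple

AtomicEvent = Tuple[str, str]      # (category, activity name)
Clause = Sequence[AtomicEvent]     # atomic events joined by AND
Dnf = Sequence[Clause]             # clauses joined by OR
AeSim = Callable[[AtomicEvent, AtomicEvent], float]


def cc_sim(cc: Clause, cc2: Clause, ae_sim: AeSim) -> float:
    """Clause similarity, Eqs. (6)-(7); clauses are assumed non-empty."""
    if len(cc) != len(cc2):
        return 0.0
    return max(
        sum(ae_sim(x, y) for x, y in zip(perm, cc2)) / len(cc2)
        for perm in permutations(cc)
    )


def ce_sim(ce: Dnf, ce2: Dnf, ae_sim: AeSim) -> float:
    """Complex event similarity, Eqs. (5) and (8)."""
    if len(ce) < len(ce2):          # make sure n_cc >= n'_cc
        ce, ce2 = ce2, ce
    n, n2 = len(ce), len(ce2)
    best = max(
        sum(cc_sim(c, c2, ae_sim) for c, c2 in zip(chosen, ce2)) / n2
        for chosen in permutations(ce, n2)
    )
    return (n2 / n) * best


def toy_ae_sim(e1: AtomicEvent, e2: AtomicEvent) -> float:
    """Hypothetical AESim: 0 across categories, 1 for equal activities."""
    if e1[0] != e2[0]:
        return 0.0
    return 1.0 if e1[1] == e2[1] else 0.25


# (EndOf(a) AND EndOf(b)) OR EndOf(c)   versus   EndOf(b) AND EndOf(a)
ce_a = [[("End", "a"), ("End", "b")], [("End", "c")]]
ce_b = [[("End", "b"), ("End", "a")]]
similarity = ce_sim(ce_a, ce_b, toy_ae_sim)
```

The search is factorial in the number of clauses and atomic events, which is acceptable only because rule events are usually small.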

4.3. Condition-Action Similarity Measure

A condition expresses a constraint on the relationship between objects, or between objects and constants. Since a condition and an action always appear in pairs, we consider the similarity of condition and action together, i.e., the condition-action similarity. Let $Pro(ac)$ be the probability that action $ac$ is executed. We assume that:

If the condition is “null”, $Pro(ac) = 1$, so “null” is also regarded as a special condition;

Otherwise, if an ECA rule has $k$ condition-action pairs whose conditions are not null, then $Pro(ac) = 1/k$ for each of these condition-action pairs.

Assume con-ac and con'-ac' are two condition-action pairs, and the numbers of activities in $ac$ and $ac'$ are $n_a$ and $n'_a$ respectively, with $n_a \ge n'_a$, so that $ac$ and $ac'$ can be represented as:

$$ac = \bigcup_{i=1}^{n_a} ac.a_i \tag{9}$$

$$ac' = \bigcup_{i=1}^{n'_a} ac'.a'_i \tag{10}$$

We use $CASim(con\text{-}ac, con'\text{-}ac')$ to represent the similarity of con-ac and con'-ac'. It is calculated with the following steps:

a. Calculate the probabilities of the execution of $ac$ and $ac'$ to get $Pro(ac)$ and $Pro(ac')$ respectively.

b. Take $n'_a$ activities from $ac$ in an arbitrary order. This yields $P_{n_a}^{n'_a}$ actions

$$ac^*[p] = \bigcup_{i=1}^{n'_a} ac^*[p].a_i \tag{11}$$

where $P_{n_a}^{n'_a}$ is the number of permutations of $n'_a$ activities taken from action $ac$, and $p$ is an integer with $1 \le p \le P_{n_a}^{n'_a}$.

c. For each $ac^*[p]$, calculate the similarity $TCASim(ac^*[p], ac')$ of $ac^*[p]$ and $ac'$ as

$$TCASim(ac^*[p], ac') = \frac{1}{n'_a} \sum_{i=1}^{n'_a} ASim(ac^*[p].a_i, ac'.a'_i) \tag{12}$$

d. Let $minp = Min(Pro(ac), Pro(ac'))$ and $maxp = Max(Pro(ac), Pro(ac'))$. Then

$$CASim(con\text{-}ac, con'\text{-}ac') = \frac{n'_a}{n_a} \cdot \frac{minp}{maxp} \cdot Max\big(TCASim(ac^*[p], ac')\big) \tag{13}$$
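Eqs. (12) and (13) can be illustrated with a short Python sketch. The function name `ca_sim`, the representation of an action as a list of activity names, and the toy activity similarity (1 for equal names, 0.25 otherwise) are assumptions made for this example; the numbers are not taken from the case study.

```python
from itertools import permutations
from typing import Callable, Sequence


def ca_sim(ac: Sequence[str], ac2: Sequence[str],
           pro: float, pro2: float,
           a_sim: Callable[[str, str], float]) -> float:
    """Condition-action similarity, Eqs. (12)-(13).

    ac, ac2   : activities of the two actions (assumed non-empty)
    pro, pro2 : Pro(ac) and Pro(ac'), each in (0, 1]
    a_sim     : activity similarity, assumed symmetric
    """
    if len(ac) < len(ac2):                  # make sure n_a >= n'_a
        ac, ac2, pro, pro2 = ac2, ac, pro2, pro
    n, n2 = len(ac), len(ac2)
    best = max(
        sum(a_sim(x, y) for x, y in zip(chosen, ac2)) / n2
        for chosen in permutations(ac, n2)
    )
    return (n2 / n) * (min(pro, pro2) / max(pro, pro2)) * best


def toy_a_sim(a: str, b: str) -> float:
    """Hypothetical ASim used only in this example."""
    return 1.0 if a == b else 0.25


# An unconditional action with two activities (Pro = 1) compared with a
# one-activity action that is one of two conditional branches (Pro = 1/2).
value = ca_sim(["WSReverse", "WSBorLap"], ["WSReverse"], 1.0, 0.5, toy_a_sim)
```

Here the best permutation matches "WSReverse" with itself, so the result is (1/2) · (0.5/1) · 1.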

4.4. Rule Similarity Measure

Assume we have two ECA rules, $r = (E, L(con_i, ac_i))$ and $r' = (E', L'(con'_j, ac'_j))$. We measure the rule similarity of $r$ and $r'$ with the following steps:

a. Measure $CESim(E, E')$. Note that if $E$ or $E'$ is an atomic event, it is also treated as a composite event.

b. Assume the lists $L$ and $L'$ have $n_{ca}$ and $n'_{ca}$ condition-action pairs respectively, with $n_{ca} \ge n'_{ca}$. Take $n'_{ca}$ condition-action pairs from the condition-action list $L$ in an arbitrary order. This yields $P_{n_{ca}}^{n'_{ca}}$ condition-action lists

$$L^*[p] = (L^*[p].con_1\text{-}ac_1, \ldots, L^*[p].con_i\text{-}ac_i, \ldots, L^*[p].con_{n'_{ca}}\text{-}ac_{n'_{ca}}) \tag{14}$$

where $P_{n_{ca}}^{n'_{ca}}$ is the number of permutations of $n'_{ca}$ condition-action pairs taken from list $L$, $p$ is an integer with $1 \le p \le P_{n_{ca}}^{n'_{ca}}$, and $L^*[p]$ is the $p$th permutation of $L$.

c. For each $L^*[p]$, calculate the similarity $TLSim(L^*[p], L')$ of $L^*[p]$ and $L'$ as the average similarity of the corresponding condition-action pairs in $L^*[p]$ and $L'$:

$$TLSim(L^*[p], L') = \frac{1}{n'_{ca}} \sum_{i=1}^{n'_{ca}} CASim(L^*[p].con_i\text{-}ac_i, L'.con'_i\text{-}ac'_i) \tag{15}$$

d. The similarity of the condition-action lists of the two ECA rules is

$$LSim(L, L') = Max\big(TLSim(L^*[p], L')\big) \tag{16}$$

e. The similarity of ECA rules $r$ and $r'$ is

$$RSim(r, r') = \frac{1}{2}\big(ESim(E, E') + LSim(L, L')\big) \tag{17}$$

where $ESim(E, E')$ denotes the event similarity $CESim(E, E')$ measured in step a.
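Eqs. (15) to (17) can be sketched in Python as below. The function names `l_sim` and `r_sim`, the string labels for condition-action pairs, and the lookup table of pairwise similarities are assumptions made for illustration; the event similarity and the pairwise condition-action similarities are taken as already computed.

```python
from itertools import permutations
from typing import Callable, Hashable, Sequence


def l_sim(pairs: Sequence[Hashable], pairs2: Sequence[Hashable],
          ca_sim: Callable[[Hashable, Hashable], float]) -> float:
    """List similarity, Eqs. (15)-(16); both lists are assumed non-empty
    and ca_sim is assumed symmetric."""
    if len(pairs) < len(pairs2):            # make sure n_ca >= n'_ca
        pairs, pairs2 = pairs2, pairs
    n2 = len(pairs2)
    return max(
        sum(ca_sim(x, y) for x, y in zip(chosen, pairs2)) / n2
        for chosen in permutations(pairs, n2)
    )


def r_sim(event_similarity: float, list_similarity: float) -> float:
    """Rule similarity, Eq. (17): mean of event and list similarity."""
    return 0.5 * (event_similarity + list_similarity)


# Hypothetical pairwise condition-action similarities for labelled pairs.
TABLE = {("p1", "q1"): 0.5, ("p2", "q1"): 0.2}
list_similarity = l_sim(["p1", "p2"], ["q1"], lambda x, y: TABLE[(x, y)])

# With an event similarity of 1 and a list similarity of 0.5:
rule_similarity = r_sim(1.0, list_similarity)
```

An event similarity of 1 combined with a list similarity of 0.5 gives a rule similarity of 0.75.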

4.5. Process Similarity Measure

In our system, a process is built from a set of ECA rules. The similarity of processes $Pr$ and $Pr'$ can therefore be measured as follows:

a. Assume $Pr$ and $Pr'$ have $n_r$ and $n'_r$ ECA rules respectively, with $n_r \ge n'_r$. Take $n'_r$ rules from $Pr$ in an arbitrary order. This yields $P_{n_r}^{n'_r}$ rule permutations

$$Pr^*[p] = (Pr^*[p].r_1, \ldots, Pr^*[p].r_i, \ldots, Pr^*[p].r_{n'_r}) \tag{18}$$

where $P_{n_r}^{n'_r}$ is the number of permutations of $n'_r$ rules taken from process $Pr$, $p$ is an integer with $1 \le p \le P_{n_r}^{n'_r}$, and $Pr^*[p]$ is the $p$th permutation of $Pr$.

b. For each $Pr^*[p]$, calculate the similarity $TPrSim(Pr^*[p], Pr')$ of $Pr^*[p]$ and $Pr'$ as the average similarity of the corresponding rules in $Pr^*[p]$ and $Pr'$:

$$TPrSim(Pr^*[p], Pr') = \frac{1}{n'_r} \sum_{i=1}^{n'_r} RSim(Pr^*[p].r_i, Pr'.r'_i) \tag{19}$$

c. The similarity of the two processes is

$$PrSim(Pr, Pr') = \frac{n'_r}{n_r} \, Max\big(TPrSim(Pr^*[p], Pr')\big) \tag{20}$$

5. Case Study

5.1. Similarity of Image Processing Workflows

Figure 2 and Figure 3 show two image processing workflow processes designed with our grid workflow system [7]. The service icon denotes an activity that invokes a grid service; the ontology of the activity is shown above the icon and its name below it. Conditions are shown on the control flows.

Figure 2. Image Processing Workflow A

Figure 3. Image Processing Workflow B

Figure 4. Image Process Ontology

Figure 4 is an example of an ontology tree for sharing the concepts of image processing. The ontology of service activity “WSReverse” is “Reverse”, and the ontology of service activity “WSBorLap” is “Border with Laplace”. The link number between “Reverse” and “BorLap” is 3, so by Eq. (1) the similarity of the two activities is 1/(3+1) = 0.25. The rules created by workflows A and B are shown in Figure 5 and Figure 6 respectively.

We can then measure the similarity of the two workflow processes A and B. Following the steps proposed in Section 4, we get

$$RSim(P_A.Rule1, P_B.Rule1) = \frac{1}{2}(ESim + LSim) = \frac{1}{2}\left(1 + \frac{0.5}{1}\,Max\left(1, \frac{1}{4}\right)\right) = \frac{3}{4}$$


In the same way, $RSim(P_A.Rule1, P_B.Rule2) = \frac{1}{18}$, $RSim(P_A.Rule2, P_B.Rule1) = \frac{7}{48}$, and $RSim(P_A.Rule2, P_B.Rule2) = \frac{7}{12}$.

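So, the similarity of processes A and B is

$$\begin{aligned}
PrSim(P_A, P_B) &= \frac{2}{2}\,Max\Big(\tfrac{1}{2}\big(RSim(P_A.Rule1, P_B.Rule1) + RSim(P_A.Rule2, P_B.Rule2)\big), \\
&\qquad\qquad\;\; \tfrac{1}{2}\big(RSim(P_A.Rule1, P_B.Rule2) + RSim(P_A.Rule2, P_B.Rule1)\big)\Big) \\
&= \frac{2}{2}\,Max\left(\frac{1}{2}\left(\frac{3}{4} + \frac{7}{12}\right), \frac{1}{2}\left(\frac{1}{18} + \frac{7}{48}\right)\right) = \frac{2}{3} \approx 0.67
\end{aligned}$$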
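The arithmetic of this case can be checked with exact fractions. The sketch below applies Eqs. (19) and (20) to the four rule similarities reported above; the function name `pr_sim` and the matrix layout are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import permutations
from typing import List

# rsim[i][j] = RSim(P_A.Rule(i+1), P_B.Rule(j+1)), values from the case study.
rsim: List[List[Fraction]] = [
    [Fraction(3, 4), Fraction(1, 18)],
    [Fraction(7, 48), Fraction(7, 12)],
]


def pr_sim(rsim: List[List[Fraction]]) -> Fraction:
    """Process similarity, Eqs. (19)-(20).

    Rows index the rules of the larger process, columns those of the
    smaller one (so the number of rows is at least the number of columns).
    """
    n, n2 = len(rsim), len(rsim[0])
    best = max(
        sum(rsim[row][col] for col, row in enumerate(chosen)) / n2
        for chosen in permutations(range(n), n2)
    )
    return Fraction(n2, n) * best


result = pr_sim(rsim)
```

The pairing (Rule1, Rule1), (Rule2, Rule2) averages to 2/3 and the crossed pairing to 29/288, so `result` is 2/3, in agreement with the value 0.67 above.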

Figure 5. Rules Created by Workflow A

Figure 6. Rules Created by Workflow B

5.2. Process Clustering

Based on the similarity measure discussed above, clustering processes becomes straightforward. As shown in Figure 7, we can use DBSCAN [8] as our clustering algorithm, using DisSim(Pr, Pr') = 1 - PrSim(Pr, Pr') in place of the distance between two objects in the algorithm.

Figure 7. ProcessClustering(Pr[n], ε, MinPts)
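Since the algorithm of Figure 7 is not reproduced here, the following is only a sketch of how DBSCAN [8] could be run on a precomputed process similarity matrix with DisSim = 1 - PrSim. The function name `process_clustering`, the label convention (-1 for noise), and the example matrix are assumptions; the matrix is assumed symmetric with ones on the diagonal.

```python
from typing import List, Optional, Sequence

NOISE = -1


def process_clustering(sim: Sequence[Sequence[float]],
                       eps: float, min_pts: int) -> List[int]:
    """DBSCAN over processes, using DisSim = 1 - PrSim as the distance.

    sim[i][j] is PrSim of process i and process j. Returns one cluster
    label per process; NOISE marks processes that belong to no cluster.
    """
    n = len(sim)
    dist = [[1.0 - sim[i][j] for j in range(n)] for i in range(n)]
    labels: List[Optional[int]] = [None] * n

    def neighbors(i: int) -> List[int]:
        # Eps-neighborhood of process i (includes i itself).
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE               # may become a border point later
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster         # border point: join, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:   # j is a core point: expand
                queue.extend(reachable)
        cluster += 1
    return [NOISE if label is None else label for label in labels]


# Hypothetical similarity matrix: processes 0-2 are alike, process 3 is not.
example_sim = [
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.75, 0.2],
    [0.7, 0.75, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
labels = process_clustering(example_sim, eps=0.35, min_pts=2)
```

With these assumed values the first three processes form one cluster and the fourth is reported as noise.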

6. Conclusion and Future Works

This paper proposes a novel approach for calculating process similarity based on ECA rules, in order to collect and cluster grid workflow processes. The approach can deal with different types of events and activities, which were not considered in the earlier literature.

So far we have only shown the feasibility of our approach. In future work, we will implement it as a grid workflow recommendation module in our system.

Acknowledgement

This paper is supported by National Scientific Fund of China (No. 60503041), National High Technology Research and Development Program of China (No. 2006AA04Z152, No. 2006AA01A124, No. 2006AA01Z247, No. 2006AA01Z172), Shanghai-Grid grand project of Science and Technology Commission of Shanghai Municipality (05DZ15005), and Natural Science Foundation of Shanghai (05ZR14081).

References

[1] Gianluigi Greco, Antonella Guzzo, Luigi Pontieri, et al, “Mining Expressive Process Models by Clustering Workflow Traces”, In Proc. of PAKDD2004, 2004, pp. 52-62

[2] Jae-Yoon Jung, Joonsoo Bae, “Workflow Clustering Method Based on Process Similarity”, In Proc. of ICCSA2006, 2006, pp. 379-389

[3] J. Bae, J. Caverlee, L. Liu, et al, “Process Mining by Measuring Process Block Similarity”, In Proc. of Intl. Workshop on Business Process Intelligence, 2006

[4] Hai Zhuge, “A process matching approach for flexible workflow process reuse”, Information and Software Technology, vol. 44, 2002, pp. 445-450

[5] Kui Huang, Zhaotao Zhou, Yanbo Han, et al, “An Algorithm for Calculating Process Similarity to Cluster Open-Source Process Designs”, In Proc. of the Third International Conference on Grid and Cooperative Computing, 2004, pp. 107-114

[6] Dayal, U., Buchmann, A. P., McCarthy, D. R., “Rules are Objects Too: A Knowledge Model For An Active, Object-Oriented Database System”, In Proc. of the 2nd Intl. Workshop on Advances in Object-Oriented Database System, 1988, pp. 129-143

[7] Lin Chen, Minglu Li, Jian Cao, “ECA Rule-Based Workflow Modeling and Implementation for Service Composition”, IEICE Transactions on Information and Systems, vol. E89-D, no.2, 2006, pp. 624-630

[8] M. Ester, H.-P. Kriegel, J. Sander, et al, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases”, In Proc. of the Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226-231
