Fuzzy Mining – Adaptive Process Simpliﬁcation Based …wvdaalst/publications/p400.pdf · Fuzzy...

Fuzzy Mining – Adaptive Process SimplificationBased on Multi-Perspective Metrics

Christian W. Gunther and Wil M.P. van der Aalst

Eindhoven University of TechnologyP.O. Box 513, NL-5600 MB, Eindhoven, The Netherlands{c.w.gunther, w.m.p.v.d.aalst}@tue.nl

Abstract. Process Mining is a technique for extracting process models from ex-ecution logs. This is particularly useful in situations where people have an ide-alized view of reality. Real-life processes turn out to be less structured than peo-ple tend to believe. Unfortunately, traditional process mining approaches haveproblems dealing with unstructured processes. The discovered models are often“spaghetti-like”, showing all details without distinguishing what is important andwhat is not. This paper proposes a new process mining approach to overcome thisproblem. The approach is configurable and allows for different faithfully simpli-fied views of a particular process. To do this, the concept of a roadmap is used asa metaphor. Just like different roadmaps provide suitable abstractions of reality,process models should provide meaningful abstractions of operational processesencountered in domains ranging from healthcare and logistics to web servicesand public administration.

1 Introduction

Business processes, whether defined and prescribed or implicit and ad-hoc, drive andsupport most of the functions and services in enterprises and administrative bodies oftoday’s world. For describing such processes, modeling them as graphs has proven tobe a useful and intuitive tool. While modeling is well-established in process design, itis complicated to do for monitoring and documentation purposes. However, especiallyfor monitoring, process models are valuable artifacts, because they allow us to commu-nicate complex knowledge in intuitive, compact, and high-level form.

Process mining is a line of research which attempts to extract such abstract, compactrepresentations of processes from their logs, i.e. execution histories [1–3, 5–7, 10, 14].The α-algorithm, for example, can create a Petri net process model from an executionlog [2]. In the last years, a number of process mining approaches have been developed,which address the various perspectives of a process (e.g., control flow, social network),and use various techniques to generalize from the log (e.g., genetic algorithms, theoryof regions [12, 4]). Applied to explicitly designed, well-structured, and rigidly enforcedprocesses, these techniques are able to deliver an impressive set of information, yet theirpurpose is somewhat limited to verifying the compliant execution. However, most pro-cesses in real life have not been purposefully designed and optimized, but have evolvedover time or are not even explicitly defined. In such situations, the application of pro-cess mining is far more interesting, as it is not limited to re-discovering what we alreadyknow, but it can be used to unveil previously hidden knowledge.

2

Over the last couple of years we obtained much experience in applying the tried-and-tested set of mining algorithms to real-life processes. Existing algorithms tend toperform well on structured processes, but often fail to provide insightful models for lessstructured processes. The phrase “spaghetti models” is often used to refer to the resultsof such efforts. The problem is not that existing techniques produce incorrect results.In fact, some of the more robust process mining techniques guarantee that the resultingmodel is “correct” in the sense that reality fits into the model. The problem is that theresulting model shows all details without providing a suitable abstraction. This is com-parable to looking at the map of a country where all cities and towns are represented byidentical nodes and all roads are depicted in the same manner. The resulting map is cor-rect, but not very suitable. Therefore, the concept of a roadmap is used as a metaphor tovisualize the resulting models. Based on an analysis of the log, the importance of activ-ities and relations among activities are taken into account. Activities and their relationscan be clustered or removed depending on their role in the process. Moreover, certainaspects can be emphasized graphically just like a roadmap emphasizes highways andlarge cities over dirt roads and small towns. As will be demonstrated in this paper, theroadmap metaphor allows for meaningful process models.

In this paper we analyze the problems traditional mining algorithms have with less-structured processes (Section 2), and use the metaphor of maps to derive a novel, moreappropriate approach from these lessons (Section 3). We abandon the idea of perform-ing process mining confined to one perspective only, and propose a multi-perspectiveset of log-based process metrics (Section 4). Based on these, we have developed a flexi-ble approach for Fuzzy Mining, i.e. adaptively simplifying mined process models (Sec-tion 5).

2 Less-Structured Processes – The Infamous Spaghetti Affair

The fundamental idea of process mining is both simple and persuasive: There is a pro-cess which is unknown to us, but we can follow the traces of its behavior, i.e. we haveaccess to enactment logs. Feeding those into a process mining technique will yield anaggregate description of that observed behavior, e.g. in form of a process model.

In the beginning of process mining research, mostly artificially generated logs wereused to develop and verify mining algorithms. Then, also logs from real-life work-flow management systems, e.g. Staffware, could be successfully mined with these tech-niques. Early mining algorithms had high requirements towards the qualities of log files,e.g. they were supposed to be complete and limited to events of interest. Yet, most ofthe resulting problems could be easily remedied with more data, filtering the log andtuning the algorithm to better cope with problematic data.

While these successes were certainly convincing, most real-life processes are notexecuted within rigid, inflexible workflow management systems and the like, which en-force correct, predictive behavior. It is the inherent inflexibility of these systems whichdrove the majority of process owners (i.e., organizations having the need to supportprocesses) to choose more flexible or ad-hoc solutions. Concepts like Adaptive Work-flow or Case Handling either allow users to change the process at runtime, or defineprocesses in a somewhat more “loose” manner which does not strictly define a specific

3

path of execution. Yet the most popular solutions for supporting processes do not en-force any defined behavior at all, but merely offer functionality like sharing data andpassing messages between users and resources. Examples for these systems are ERP(Enterprise Resource Planning) and CSCW (Computer-Supported Cooperative Work)systems, custom-built solutions, or plain E-Mail.

It is obvious that executing a process within such less restrictive environments willlead to more diverse and less-structured behavior. This abundance of observed behav-ior, however, unveiled a fundamental weakness in most of the early process miningalgorithms. When these are used to mine logs from less-structured processes, the resultis usually just as unstructured and hard to understand. These “spaghetti” process mod-els do not provide any meaningful abstraction from the event logs themselves, and aretherefore useless to process analysts. It is important to note that these “spaghetti” mod-els are not incorrect. The problem is that the processes themselves are really “spaghetti-like”, i.e., the model is an accurate reflection of reality.

PYSY(complete)

108

0.971

47

PYWZ(complete)

316

0.917

39

DSYE(complete)

69

0.667

3

LEWP(complete)

2 0

1

0.917

27

0.994

197

OSYW(complete)

181

0.833

41

OSCW(complete)

94

0.667

31

YWNB(complete)

84

0.833

19

DSWZ(complete)

31

0.667

11

ZEZR(complete)

1 0.5

1

IWLH(complete)

252

0.99

162

IWNP(complete)

303

0.527

75

0.986

159

UNYK(complete)

35

0.667

18

CEOI(complete)

450

0.88

61

ONIX(complete)

12

0.8

12

IWOL(complete)

230

0.968

99

IWDB(complete)

64

0.588

43

0.75

60

VPMO(complete)

4

0.5

2

IWOW(complete)

144

0.857

24

0.964

61

OSIX(complete)

20

0.5

4

ONXB(complete)

169

0.75

19

KZWZ(complete)

224

0.75

15

LEYW(complete)

115

0.75

14

PYSK(complete)

83

0.971

38

OSWZ(complete)

78

0.667

14

YWNA(complete)1651

0.75

26

0.968

46

ONWZ(complete)

34 0.75

7

YWZP(complete)

23 0.667

10

0.987

125

OSZY(complete)

28

0.8

10

AHCW(complete)

1

0.5

1

0.955

52

OSVO(complete)

210

0.769

29

OSIO(complete)

5

0.667

2

0.991

149

ONYW(complete)

55

0.8

25

OSPD(complete)

17

0.5

9

0.923

17

0.5

5

UNHL(complete)

91

0.979

74

VPPQ(complete)

184

0.667

11

UNEL(complete)

155

0.977

86

UNLE(complete)

41

0.545

18

0.881

36

0.5

16

UNEN(complete)

44

NEOI(complete)

15 0.75

9

DNEZ(complete)

6

0.833

6

0.978

50

YWHY(complete)

44

0.75

27YWWY

(complete)94

0.889

32

0.8

29

YWLY(complete)

138 0.886

53

HQOQ(complete)6163

0.667

53

ONOW(complete)

5

0.5

2

ONZY(complete)

394

0.992

244

ONVL(complete)

155

0.756

70

ONZO(complete)

173

0.956

78

0.955

28

DSYN(complete)

57 0.921

37

DSYV(complete)

52

0.939

49

DSPY(complete)

35

0.8

18

DSLQ(complete)

30

0.909

16

0.667

10

0.917

13

OSYN(complete)

106

0.5

7

0.957

52

OSDL(complete)

80

0.821

28

OSAT(complete)

8

0.75

5

0.929

19

DSSV(complete)

234

0.966

50

OSAI(complete)

8

0.8

5

OSHY(complete)

224

0.989

153

AHZI(complete)

68

0.878

50

0.983

88

DSSN(complete)

304

0.854

129

OSVZ(complete)

18

0.5

4

APPP(complete)9110

0.794

60

0.8

43

0.9

28

1

7387

AHHW(complete)

197

0.917

76

ONVY(complete)

160

0.941

53

ZEDS(complete)

195

0.828

118

XIEO(complete)1038

0.929

366

DNLP(complete)

264

0.929

117

AIVY(complete)

552

0.941

178

HQQL(complete)1153

0.909

379

KZPA(complete)

224

0.941

100

TUTC(complete)

23

0.667

23

OSYR(complete)

19 0.5

11

OSXZ(complete)

10

0.5

2

WSII(complete)

13 0.5

6

YVYW(complete)

1

0.5

1

AISB(complete)

4

0.5

3

XIWW(complete)1422

0.996

770YWSM

(complete)648

0.923

123

AICK(complete)

56 0.936

49

0.984

165

OSHB(complete)

231

0.944

65

OSZO(complete)

61 0.8

28

OSOS(complete)

35

OSSP(complete)

358

0.667

17

DSSA(complete)

238

0.98

147

0.969

66

VPPN(complete)

2

0.5

1

DSVM(complete)

29

0.833

12 0.756

42

0.984

128

SVEI(complete)

302

0.825

38

POZI(complete)

263

0.8

20

0.932

51

0.991

173

SCEI(complete)

449

0.944

126

OSOI(complete)

291

0.981

110

POLA(complete)1255

0.923

169

OSLP(complete)

75

0.8

5

ONCZ(complete)

11

0.5

2

0.5

5

0.923

89

0.998

871

OSWL(complete)

103

0.877

54

AISW(complete)

430 0.667

34

ONPI(complete)

264

0.819

104

LELY(complete)

126

0.667

12

KZOL(complete)

166

0.8

36

SPWV(complete)

114

0.929

28

ONYZ(complete)

7

0.5

6

OSPL(complete)

4

0.5

1

0.667

20

0.995

281

YHLH(complete)

749

0.918

127

0.994

437

AHHJ(complete)

264

0.972

56

0.942

84

SPWB(complete)

214

0.938

54

AHBL(complete)

271

0.981

45

AISY(complete)

845

0.917

43

DSYA(complete)

3

0.5

2

0.99

154

AHCA(complete)

206

0.978

51

0.888

51

0.986

114

0.889

58

DSNA(complete)1159

0.833

36

0.857

8

0.997

1054 DSCL(complete)

3 0.5

3

DSIC(complete)

20

0.75

11

LEOJ(complete)

7

0.667

5

DSRY(complete)

2

0.5

2

0.5

7

0.667

55

0.986

104

OONO(complete)

28

0.5

6

0.8

5

0.974

71

ONAZ(complete)

171

0.909

41

ONHL(complete)

264

0.974

108

VPHT(complete)

118

0.975

42

ONYN(complete)

142

0.942

88

ONIZ(complete)

23 0.75

8

LEVN(complete)

5

0.5

2

VPNY(complete)

10

0.5

1

0.982

59

VPNH(complete)

160

0.854

53

0.983

75

ONLW(complete)

100

0.978

46

UNEA(complete)

239

0.941

36

VPNE(complete)

15

0.667

2

0.957

52

0.75

24

0.909

14

0.985

94

NEOW(complete)

60

0.923

34

NEPL(complete)

47

0.885

26

0.909

19

0.667

24

NEWN(complete)

96

0.667

23

0.958

49

0.985

161

KZOO(complete)

26

0.8

18

OVWZ(complete)

21 0.5

6

LEWO(complete)

2

0.5

2

0.941

37

XISH(complete)

866

0.833

59

OSEL(complete)

1

0.5

1

0.9

65

0.995

285

0.833

29

0.667

29

0.8

46

0.997

729

AIVM(complete)

130

0.667

44

YWOK(complete)

856

0.998

604

YWVP(complete)

469 0.769

134

AHMK(complete)

1

0.5

1

0.991

112

0.923

46

0.997

407

YWXB(complete)

257

0.896

92

0.917

64

0.984

69

SVPZ(complete)

203

0.98

64

OSAX(complete)

6

0.833

5

AHNP(complete)

301

0.993

253

AHMI(complete)

10

0.875

8

0.997

370

0.8

40

HDEI(complete)

40

0.5

3

TUII(complete)

90

0.667

9

0.995

245

0.667

27

0.999

1471

0.812

101

AISI(complete)

842

0.75

27

IOII(complete)

243 0.872

43

BKZV(complete)

4

0.5

1

0.993

158

0.881

73

0.9

62

0.976

47YWPZ

(complete)92

0.944

54

CESO(complete)

123

0.82

45

0.875

85

0.977

55

0.921

40

0.989

123

SCPZ(complete)

154

0.889

28

DNWZ(complete)

22

0.667

8

0.994

168

HNRN(complete)

49

0.944

27

AHZE(complete)

1

0.5

1

VPPK(complete)

350

0.904

80

0.986

154

0.9

38

0.909

72

0.964

25

0.984

101

VPHA(complete)

102 0.9

43

0.981

173

0.978

44

0.909

37

ONOI(complete)

703

0.667

54

0.941

104

0.966

56

ONHI(complete)

78

0.947

60

ONHB(complete)

178

0.952

79

ONHY(complete)

74

0.983

64

ONNY(complete)

191

0.938

64

0.988

108

0.75

77

0.75

60

0.998

522

VPHN(complete)

279

0.867

140

0.985

136

0.75

97

0.985

151

0.75

18

0.985

83

LEJO(complete)

7 0.5

5

LEJV(complete)

2

0.5

1

AHLZ(complete)

5

0.5

1

PONA(complete)

65

0.667

11

0.938

38

0.98

91

ONIY(complete)

117

0.921

70

0.981

70

0.97

43

HQWL(complete)

221

0.99

165

HQXZ(complete)

83

0.759

35

0.75

17

1

5207

INCY(complete)

50

0.812

45

HQXB(complete)

775

0.909

277

KZZL(complete)1427

0.917

270

HQOH(complete)

59

0.848

59

AIYW(complete)

114

0.941

67

BLZI(complete)

3

0.5

3

NWYA(complete)

1

0.5

1

0.968

47

HQOZ(complete)

601

0.75

13

HQHI(complete)4504

1

4245

INKY(complete)

93

0.963

44

INLO(complete)

119

0.885

35

KZOY(complete)

122 0.875

35

0.982

84

LEJE(complete)

50

0.923

41

LEOO(complete)

85

0.914

37

LELZ(complete)

10

0.75

3

0.917

35

LEVL(complete)

176

0.786

43

0.967

63

ZEPZ(complete)

141

0.955

121

ZELC(complete)

304

0.988

128

0.9

117

0.975

109

KZOI(complete)

107

0.909

42

OVNA(complete)

155

0.909

21

0.929

253

0.998

629

AIHT(complete)

87

0.978

53

OVLE(complete)

107

0.875

16

0.932

53

TURC(complete)

53

0.95

26

BKXT(complete)

1

0.5

1

0.929

152

0.909

37 0.977

111

0.923

15

OONV(complete)

26

0.667

9

0.667

14

0.909

11

OPNO(complete)

2

0.5

1

0.762

67

0.75

62

0.99

107

AIVJ(complete)

7 0.5

1

HQJW(complete)

2

0.5

1

AIGP(complete)

877

0.983

85

0.996

445

XISV(complete)

418 0.917

149

0.929

32

CNNN(complete)

60

0.743

46

HDEA(complete)

243

0.727

18

HDWE(complete)

43

0.833

6

TUGP(complete)

73

0.688

19

DSCW(complete)

20

0.667

4

0.944

95

0.957

47

0.917

15

0.994

230

AINO(complete)

270

0.872

47

ZENC(complete)

806

0.933

101

0.909

53

0.75

29

0.976

112

HQPI(complete)

48

0.667

7

0.8

31

0.909

70

INOL(complete)

53 0.909

34

0.955

35

0.917

12

INOM(complete)

44

0.861

38

0.96

27

HQEL(complete)

256

0.8

16

0.909

385

0.998

390

0.857

50

0.992

205

0.75

14

0.893

42

0.97

30

0.991

187

0.75

32

0.75

37

INVW(complete)

42

0.727

17 0.929

22

0.667

5

0.976

41

0.963

47

OVNZ(complete)

163

0.828

56

0.983

97

HQLE(complete)6010

0.833

66

0.941

250

0.934

160

0.996

285

OSOL(complete)

28 0.667

6

TUXW(complete)

1

0.5

1

HQQY(complete)

185

0.962

83

HQLY(complete)

80

0.902

62

HQLK(complete)

1

0.5

1

0.896

79

0.909

425

0.998

682

0.981

56

0.857

30

0.667

10

0.845

96 1

5791

TUWI(complete)

43

0.667

27

KZEY(complete)

43 0.667

18

HQXY(complete)

12

0.667

4

LEOR(complete)

4

0.5

3

0.998

651

ZENA(complete)1094

0.795

151

0.939

113 0.999

931

0.941

104

0.904

78

0.99

115

0.711

52

0.998

779

KZWN(complete)

87

0.964

55

0.516

30

LGZI(complete)

6

0.5

1

0.917

262

0.75

160

0.999

1124

TUJI(complete)

296

0.889

13

HLPA(complete)

122

0.75

21

AIHN(complete)

23

0.667

3

0.909

98

DSPO(complete)

3

0.5

1

0.833

30

HDOE(complete)

10

0.5

1

CWNW(complete)

782

0.667

15

BKZI(complete)

8 0.5

1

KZZO(complete)

111

0.955

46

0.929

74

0.998

689

AINP(complete)

7

0.5

2

OSIZ(complete)

161

0.917

28

AIYB(complete)

2

0.667

2

0.875

80

0.995

312

0.987

171

POZW(complete)

103

0.933

75

0.8

43

0.933

15

0.933

27

0.97

72

OVBK(complete)

84

0.932

53

0.941

29

0.964

22

0.955

29

YWWP(complete)

226

0.994

173

ZEIN(complete)

887

0.923

15

CWSW(complete)

177 0.971

34

0.889

59

0.998

743

ZEBZ(complete)

179

0.917

56

ZEWZ(complete)

9

0.5

4

HQLI(complete)

28

0.75

5

0.909

18

0.993

158

IOIK(complete)

76 0.617

67

0.938

31

0.952

26

DSLI(complete)

7 0.5

3

0.881

51

0.989

121

0.667

18

TUMI(complete)

409

0.995

310

0.923

10

HDPT(complete)

145

0.833

10

BKMI(complete)

195

0.923

60

TUMK(complete)

40

0.571

16

0.848

39

0.986

97

SCNI(complete)

4

0.5

1

0.917

20

0.944

73

CNAT(complete)

65

0.95

24

CNNS(complete)

68

0.889

40

0.794

29

CNIK(complete)

43

0.971

42

0.941

16

0.923

18

0.938

17

TURI(complete)

53

0.839

34

0.909

15

TURK(complete)

42

0.825

38

0.9

20

TUSC(complete)

166

0.833

17

0.967

36

SPWA(complete)

129

0.824

72

0.962

45

SPWN(complete)

120

0.9

74

0.957

45

SPWI(complete)

99

0.963

72

0.914

46

0.909

27

TUZV(complete)

308

0.986

119

TUZC(complete)

390

0.929

148

TUJV(complete)

92

0.889

28

0.929

14

0.987

101

TUZI(complete)

395

0.944

252

TUWV(complete)

5

0.5

1

TUTP(complete)

24

0.667

4

0.944

2

0.98

119

TUOK(complete)

44 0.975

42

TUPK(complete)

66

0.954

65

TUZK(complete)

207

0.969

156

0.875

43

0.972

34

TUPC(complete)

18

0.75

8

BKYA(complete)

153

0.97

55

BKYI(complete)

160

0.892

93

KZAL(complete)

2

0.5

2

0.952

37

BKYV(complete)

144

0.884

109

TUSI(complete)

306

0.667

10

LEBR(complete)

1 0.5

1

0.933

21

BLYI(complete)

128

0.797

120

0.968

56

0.978

50

0.96

80

0.957

84

0.938

49

BCCC(complete)

82

0.667

18

TUEW(complete)

154

0.688

35

TUIV(complete)

1

0.5

1

0.97

82

0.8

16

HDWP(complete)

11

0.5

1

TUSK(complete)

88

0.667

13

0.667

32

0.989

209

0.923

10

0.941

25

0.941

66

0.981

46

0.5

7

0.667

23 0.667

8

0.941

43

0.962

51

TUJC(complete)

121

0.857

22

0.9

37

0.955

32

TUIC(complete)

24

0.667

7

TUJP(complete)

203

0.964

99

0.923

86

TUGV(complete)

39

0.8

13 0.923

56

0.986

100

TUJK(complete)

131

0.852

124

TUMC(complete)

192

0.8

49

LEBS(complete)

197

0.833

38

TUAB(complete)

12

0.5

2

TUYK(complete)

1

0.5

1

TUKW(complete)

73

0.833

12

0.955

45

0.5

1

0.667

7

0.955

28

0.75

12

0.981

100

HDPR(complete)

35

0.594

31

0.8

8

BCCV(complete)

55

0.75

22

TUGC(complete)

79

0.667

20

0.989

134

0.857

19 0.875

21

0.964

36

0.8

18

0.955

41

0.75

27

0.967

49

TUGI(complete)

152

0.986

119

TUGK(complete)

40

0.857

21

HLPO(complete)

4

0.5

3

BCCI(complete)

177

0.989

139

BCCK(complete)

63

0.923

30

0.993

231

0.917

54

TUTI(complete)

26

0.667

7

0.5

8

0.5

4

0.5

1

0.75

21

0.998

744

0.917

56

0.992

122

DSLN(complete)

30

0.917

13

DSLL(complete)

36

0.759

28

DSZN(complete)

25

0.958

25

DSLX(complete)

48

0.87

20

0.889

26

0.947

20

DSZV(complete)

22

0.957

22

DSLV(complete)

26

0.952

20

0.95

15

HDOP(complete)

7

0.5

6

0.667

3

0.8

30

TUMV(complete)

96

0.75

41

0.938

38

AIWZ(complete)

116

0.75

14

BKMA(complete)

130

0.8

29

0.944

69

0.875

53

0.99

129

0.923

44

0.96

80

BKMV(complete)

143

0.912

66

0.75

17

0.923

14

0.917

45

0.7

14 0.955

17

0.923

17

0.929

18

BCCW(complete)

4

0.4

4

0.75

19

0.667

28

0.8

28

0.5

3

DSIY(complete)

14

0.667

4

DSIT(complete)

13 0.727

13

0.875

8

0.533

10

0.5

2

OSOW(complete)

6

0.5

2

0.833

10

0.941

17

0.5

2

0.667

3

LECL(complete)

211

0.75

23

0.991

159

0.75

37

0.99

135

TUIP(complete)

39

0.75

28

0.667

20

0.909

86

SYPI(complete)

6

0.667

5

0.976

85

HLPN(complete)

99

0.8

33

0.8

2

0.98

66

TUIK(complete)

24

0.842

21

0.889

43

0.917

50

BLOM(complete)

22

0.522

20

BLAM(complete)

63

HLPW(complete)

32

0.833

11

TUQK(complete)

1

0.5

1

0.889

18

0.857

26

0.833

26 0.977

76

BLBO(complete)

10 0.5

2

0.75

15

0.929

19

TUWP(complete)

39

0.944

19

0.5

7

0.923

20

0.545

17

TUWK(complete)

23

0.938

22

0.667

2

0.667

12

0.958

64

0.857

15

0.75

27

HLPI(complete)

6 0.667

4

0.667

7

BKZA(complete)

1

0.5

1

TUTK(complete)

13

0.857

13

0.667

7

0.667

12

0.5

2

0.917

31

0.5

2

0.981

111

0.8

11

OSHW(complete)

4

0.5

1

OSXY(complete)

10

0.667

2

0.667

40

0.933

30

AHWZ(complete)

3

0.5

1

0.5

1

0.864

24

0.964

21

0.75

23

0.667

3

0.5

1

0.5

3

0.938

18

0.909

94

0.989

470

HQWP(complete)

1

0.5

1

0.5

1

LEJC(complete)

6

0.667

4

0.75

5

NEBI(complete)

33

0.917

13

0.947

16

NEOO(complete)

16 0.562

14

NEEK(complete)

18

0.9

11

NELT(complete)

11

0.909

11

NELO(complete)

13

0.909

10

0.667

6

0.5

3

OSLL(complete)

61

0.75

16

0.941

35

OSHD(complete)

17

0.667

6

0.5

10

0.812

30

0.667

8

0.667

8

0.933

13

0.5

2

HLPJ(complete)

1

0.5

1

ONWL(complete)

21

0.875

7

0.8

10

0.625

6

0.667

4

0.667

5

0.5

2

BKXI(complete)

1

0.5

1

BKXA(complete)

1

0.5

1

0.5

1

0.875

17

0.5

1

0.5

2

0.875

21

0.5

1

0.4

4

0.5

1

0.75

3

0.75

4

0.929

14

0.5

3

0.5

2

0.667

2

0.75

2

0.5

1

0

1

0.667

7

TUAV(complete)

5

0.5

7

0.25

4

TUAI(complete)

1

0.5

1

TUAK(complete)

2

0.5

1

0.5

2

0.5

1

0.8

6

0.96

22

0.5

2

0.5

2

0.667

2

0.5

4

0.5

1

AHMY(complete)

10

0.889

8

HNWZ(complete)

15

0.25

2

AHMH(complete)

9

0.727

9

AHMW(complete)

8

0.889

8

AHMO(complete)

8

0.889

8 AHMZ(complete)

8

0.875

8

AHMN(complete)

8

0.875

8

0.875

6

0.5

1 0.917

12

0.5

1

0.5

2

SCNA(complete)

3

0.5

2

SCNK(complete)

1 0.5

1

0.5

1

0.5

1

0.5

1

0.667

4

0.75

2

0.5

2

0.5

1

0.5

1

0.5

1

0.5

2

0.5

1

0.5

7

0.5

1

0.5

1

0.5

2

TUXK(complete)

2

0.5

2

0.5

2 0.5

1

0.5

4

0.833

8

0.833

5

0.75

5 0.667

2

0.929

17

0.667

4

0.5

1

0.5

1

0.25

2

PYYW(complete)

1

0

1

0.5

3

0.5

1

0.5

1

0.5

1

Fig. 1. Excerpt of a typical “Spaghetti” process model (ca. 20% of complete model).

An example of such a “spaghetti” model is given in Figure 1. It is noteworthy thatthis figure shows only a small excerpt (ca. 20%) of a highly unstructured process model.It has been mined from machine test logs using the Heuristics Miner, one of the tradi-tional process mining techniques which is most resilient towards noise in logs [14].Although this result is rather useful, certainly in comparison with other early processmining techniques, it is plain to see that deriving helpful information from it is not easy.

Event classes found in the log are interpreted as activity nodes in the process model.Their sheer amount makes it difficult to focus on the interesting parts of the process. Theabundance of arcs in the model, which constitute the actual “spaghetti”, introduce aneven greater challenge for interpretation. Separating cause from effect, or the generaldirection in which the process is executed, is not possible because virtually every nodeis transitively connected to any other node in both directions. This mirrors the crux offlexibility in process execution – when people are free to execute anything in any givenorder they will usually make use of such feature, which renders monitoring businessactivities an essentially infeasible task.

4

We argue that the fault for these problems lies neither with less-structured pro-cesses, nor with process mining itself. Rather, it is the result of a number of, mostlyimplicit, assumptions which process mining has historically made, both with respectto the event logs under consideration, and regarding the processes which have gener-ated them. While being perfectly sound in structured, controlled environments, theseassumptions do not hold in less-structured, real-life environments, and thus ultimatelymake traditional process mining fail there.

Assumption 1: All logs are reliable and trustworthy. Any event type found in thelog is assumed to have a corresponding logical activity in the process. However,activities in real-life processes may raise a random number of seemingly unrelatedevents. Activities may also go unrecorded, while other events do not correspond toany activity at all.The assumption that logs are well-formed and homogeneous is also often not true.For example, a process found in the log is assumed to correspond to one logicalentity. In less-structured environments, however, there are often a number of “tacit”process types which are executed, and thus logged, under the same name.Also, the idea that all events are raised on the same level of abstraction, and arethus equally important, is not true in real-life settings. Events on different levels are“flattened” into the same event log, while there is also a high amount of informa-tional events (e.g., debug messages from the system) which need to be disregarded.

Assumption 2: There exists an exact process which is reflected in the logs. This as-sumption implies that there is the one perfect solution out there, which needs tobe found. Consequently, the mining result should model the process completely,accurately, and precisely. However, as stated before, spaghetti models are not nec-essarily incorrect – the models look like spaghetti, because they precisely describeevery detail of the less-structured behavior found in the log. A more high-level so-lution, which is able to abstract from details, would thus be preferable.Traditional mining algorithms have also been confined to a single perspective (e.g.,control flow, data), as such isolated view is supposed to yield higher precision.However, perspectives are interacting in less-structured processes, e.g. the data flowmay complement the control flow, and thus also needs to be taken into account.In general, the assumption of a perfect solution is not well-suited for real-life appli-cation. Reality often differs significantly from theory, in ways that had not been an-ticipated. Consequently, useful tools for practical application must be explorative,i.e. support the analyst to tweak results and thus capitalize on their knowledge.

We have conducted process mining case studies in organizations like Philips Med-ical Systems, UWV, Rijkswaterstaat, the Catharina Hospital Eindhoven and the AMChospital Amsterdam, and the Dutch municipalities of Alkmaar and Heusden. Our ex-periences in these case studies have shown the above assumptions to be violated in allways imaginable. Therefore, to make process mining a useful tool in practical, less-structured settings, these assumptions need to be discarded. The next section introducesthe main concept of our mining approach, which takes these lessons into account.

5

3 An Adaptive Approach for Process Simplification

Process mining techniques which are suitable for less-structured environments need tobe able to provide a high-level view on the process, abstracting from undesired details.The field of cartography has always been faced with a quite similar challenge, namelyto simplify highly complex and unstructured topologies. Activities in a process can berelated to locations in a topology (e.g. towns or road crossings) and precedence relationsto traffic connections between them (e.g., railways or motorways).

insignificant roadsare not shown.

parts of the city are merged.

Focuses on theintended use andlevel of detail.

Highways are highlighted by size, contrast and color.

Abstraction

Aggregation

Customization

Emphasis

Fig. 2. Example of a road map.

When one takes a closer look at maps (such as the example in Figure 2), the solutioncartography has come up with to simplify and present complex topologies, one canderive a number of valuable concepts from them.

Aggregation: To limit the number of information items displayed, maps often showcoherent clusters of low-level detail information in an aggregated manner. One ex-ample are cities in road maps, where particular houses and streets are combinedwithin the city’s transitive closure (e.g., the city of Eindhoven in Figure 2).

Abstraction: Lower-level information which is insignificant in the chosen context issimply omitted from the visualization. Examples are bicycle paths, which are of nointerest in a motorist’s map.

Emphasis: More significant information is highlighted by visual means such as color,contrast, saturation, and size. For example, maps emphasize more important roadsby displaying them as thicker, more colorful and contrasting lines (e.g., motorway“E25” in Figure 2).

Customization: There is no one single map for the world. Maps are specialized on adefined local context, have a specific level of detail (city maps vs highway maps),and a dedicated purpose (interregional travel vs alpine hiking).

These concepts are universal, well-understood, and established. In this paper we ex-plore how they can be used to simplify and properly visualize complex, less-structuredprocesses. To do that, we need to develop appropriate decision criteria on which to basethe simplification and visualization of process models. We have identified two funda-mental metrics which can support such decisions: (1) significance and (2) correlation.

6

Significance, which can be determined both for event classes (i.e., activities) andbinary precedence relations over them (i.e., edges), measures the relative importance ofbehavior. As such, it specifies the level of interest we have in events, or their occurringafter one another. One example for measuring significance is by frequency, i.e. events orprecedence relations which are observed more frequently are deemed more significant.

Correlation on the other hand is only relevant for precedence relations over events.It measures how closely related two events following one another are. Examples formeasuring correlation include determining the overlap of data attributes associated totwo events following one another, or comparing the similarity of their event names.More closely correlated events are assumed to share a large amount of their data, or havetheir similarity expressed in their recorded names (e.g. “check customer application”and “approve customer application”).

Based on these two metrics, which have been defined specially for this purpose, wecan sketch our approach for process simplification as follows.

– Highly significant behavior is preserved, i.e. contained in the simplified model.– Less significant but highly correlated behavior is aggregated, i.e. hidden in clusters

within the simplified model.– Less significant and lowly correlated behavior is abstracted from, i.e. removed from

the simplified model.

This approach can greatly reduce and focus the displayed behavior, by employingthe concepts of aggregation and abstraction. Based on such simplified model, we canemploy the concept of emphasis, by highlighting more significant behavior.

Fig. 3. Excerpt of a simplified and decorated process model.

Figure 3 shows an excerpt from a simplified process model, which has been cre-ated using our approach. Bright square nodes represent significant activities, the darkeroctagonal node is an aggregated cluster of three less-significant activities. All nodesare labeled with their respective significance, with clusters displaying the mean sig-nificance of their elements. The brightness of edges between nodes emphasizes theirsignificance, i.e. more significant relations are darker. Edges are also labeled with theirrespective significance and correlation values. By either removing or hiding less signif-icant information, this visualization enables the user to focus on the most interestingbehavior in the process.

7

Yet, the question of what constitutes “interesting” behavior can have a number ofanswers, based on the process, the purpose of analysis, or the desired level of abstrac-tion. In order to yield the most appropriate result, significance and correlation measuresneed to be configurable. We have thus developed a set of metrics, which can each mea-sure significance or correlation based on different perspectives (e.g., control flow ordata) of the process. By influencing the “mix” of these metrics and the simplificationprocedure itself, the user can customize the produced results to a large degree.

The following section introduces some of the metrics we have developed for signif-icance and correlation in more detail.

4 Log-Based Process Metrics

Significance and correlation, as introduced in the previous section, are suitable conceptsfor describing the importance of behavior in a process in a compact manner. However,because they represent very generalized, condensed metrics, it is important to mea-sure them in an appropriate manner. Taking into account the wide variety of processes,analysis questions and objectives, and levels of abstraction, it is necessary to make thismeasurement adaptable to such parameters.

Our approach is based on a configurable and extensible framework for measuringsignificance and correlation. The design of this framework is introduced in the nextsubsection, followed by detailed introductions to the three primary types of metrics:unary significance, binary significance, and binary correlation.

4.1 Metrics Framework

An important property of our measurement framework is that for each of the three pri-mary types of metrics (unary significance, binary significance, and binary correlation)different implementations may be used. A metric may either be measured directly fromthe log (log-based metric), or it may be based on measurements of other, log-basedmetrics (derivative metric).

When the log contains a large number of undesired events, which occur in betweendesired ones, actual causal dependencies between the desired event classes may go un-recorded. To counter this, our approach also measures long-term relations, i.e. when thesequence A,B,C is found in the log, we will not only record the relations A→ B andB → C, but also the length-2-relationship A → C. We allow measuring relationshipsof arbitrary length, while the measured value will be attenuated, i.e. decreased, withincreasing length of relationship.

4.2 Unary Significance Metrics

Unary significance describes the relative importance of an event class, which will berepresented as a node in the process model. As our approach is based on removing lesssignificant behavior, and as removing a node implies removing all of its connected arcs,unary significance is the primary driver of simplification.

8

One metric for unary significance is frequency significance, i.e. the more often acertain event class was observed in the log, the more significant it is. Frequency is a log-based metric, and is in fact the most intuitive of all metrics. Traditional process miningtechniques are built solely on the principle of measuring frequency, and it remains animportant foundation of our approach. However, real-life logs often contain a largenumber of events which are in fact not very significant, e.g. an event which describessaving the process state after every five activities. In such situations, frequency plays adiminished role and can rather distort results.

Another, derivate metric for unary significance is routing significance. The ideabehind routing significance is that points, at which the process either forks (i.e., splitnodes) or synchronizes (i.e., join nodes), are interesting in that they substantially definethe structure of a process. The higher the number and significance of predecessors for anode (i.e., its incoming arcs) differs from the number and significance of its successors(i.e., outgoing arcs), the more important that node is for routing in the process. Routingsignificance is important as amplifier metric, i.e. it helps separating important routingnodes (whose significance it increases) from those less important.

4.3 Binary Significance Metrics

Binary significance describes the relative importance of a precedence relation betweentwo event classes, i.e. an edge in the process model. Its purpose is to amplify and toisolate the observed behavior which is supposed to be of the greatest interest. In oursimplification approach, it primarily influences the selection of edges which will beincluded in the simplified process model.

Like for unary significance, the log-based frequency significance metric is alsothe most important implementation for binary significance. The more often two eventclasses are observed after one another, the more significant their precedence relation.

The distance significance metric is a derivative implementation of binary signifi-cance. The more the significance of a relation differs from its source and target nodes’significances, the less its distance significance value. The rationale behind this metric is,that globally important relations are also always the most important relations for theirendpoints. Distance significance locally amplifies crucial key relations between eventclasses, and weakens already insignificant relations. Thereby, it can clarify ambiguoussituations in edge abstraction, where many relations “compete” over being included inthe simplified process model. Especially in very unstructured execution logs, this metricis an indispensible tool for isolating behavior of interest.

4.4 Binary Correlation Metrics

Binary correlation measures the distance of events in a precedence relation, i.e. howclosely related two events following one another are. Distance, in the process domain,can be equated to the magnitude of context change between two activity executions.Subsequently occurring activities which have a more similar context (e.g., which areexecuted by the same person or in a short timeframe) are thus evaluated to be highercorrelated. Binary correlation is the main driver of the decision between aggregation orabstraction of less-significant behavior.

9

Proximity correlation evaluates event classes, which occur shortly after one another,as highly correlated. This is important for identifying clusters of events which corre-spond to one logical activity, as these are commonly executed within a short timeframe.

Another feature of such clusters of events occurring within the realm of one higher-level activity is, that they are executed by the same person. Originator correlation be-tween event classes is determined from the names of the persons, which have triggeredtwo subsequent events. The more similar these names, the higher correlated the respec-tive event classes. In real applications, user names often include job titles or functionidentifiers (e.g.“sales John” and “sales Paul”). Therefore, this metric implementation isa valuable tool also for unveiling implicit correlation between events.

Endpoint correlation is quite similar, however, instead of resources it compares theactivity names of subsequent events. More similar names will be interpreted as highercorrelation. This is important for low-level logs including a large amount of less sig-nificant events which are closely related. Most of the time, events which reflect similartasks also are given similar names (e.g., “open valve13” and “close valve13”), and thismetric can unveil these implicit dependencies.

In most logs, events also include additional attributes, containing snapshots from thedata perspective of the process (e.g., the value of an insurance claim). In such cases,the selection of attributes logged for each event can be interpreted as its context. Thus,the data type correlation metric evaluates event classes, where subsequent events sharea large amount of data types (i.e., attribute keys), as highly correlated. Data value cor-relation is more specific, in that it also takes the values of these common attributes intoaccount. In that, it uses relative similarity, i.e. small changes of an attribute value willcompromise correlation less than a completely different value.

Currently, all implementations for binary correlation in our approach are log-based.The next section introduces our approach for adaptive simplification and visualizationof complex process models, which is based on the aggregated measurements of allmetric implementations which have been introduced in this section.

5 Adaptive Graph Simplification

Most process mining techniques follow an interpretative approach, i.e. they attempt tomap behavior found in the log to typical process design patterns (e.g., whether a splitnode has AND- or XOR-semantics). Our approach, in contrast, focuses on high-levelmapping of behavior found in the log, while not attempting to discover such patterns.Thus, creating the initial (non-simplified) process model is straightforward: All eventclasses found in the log are translated to activity nodes, whose importance is expressedby unary significance. For every observed precedence relation between event classes, acorresponding directed edge is added to the process model. This edge is described bythe binary significance and correlation of the ordering relation it represents.

Subsequently, we apply three transformation methods to the process model, whichwill successively simplify specific aspects of it. The first two phases, conflict reso-lution and edge filtering, remove edges (i.e., precedence relations) between activitynodes, while the final aggregation and abstraction phase removes and/or clusters less-significant nodes. Removing edges from the model first is important – due to the less-

10

structured nature of real-life processes and our measurement of long-term relationships,the initial model contains deceptive ordering relations, which do not correspond to validbehavior and need to be discarded. The following sections provide details about thethree phases of our simplification approach, given in the order in which they are appliedto the initial model.

5.1 Conflict Resolution in Binary Relations

Whenever two nodes in the initial process model are connected by edges in both direc-tions, they are defined to be in conflict. Depending on their specific properties, conflictsmay represent one of three possible situations in the process:

– Length-2-loop: Two activities A and B constitute a loop in the process model,i.e. after executing A and B in sequence, one may return to A and start over. Inthis case, the conflicting ordering relations between these activities are explicitlyallowed in the original process, and thus need to be preserved.

– Exception: The process orders A → B in sequence, however, during real-life exe-cution the exceptional case of B → A also occurs. Most of the time, the “normal”behavior is clearly more significant. In such cases, the “weaker” relation needs tobe discarded to focus on the main behavior.

– Concurrency: A and B can be executed in any order (i.e., they are on two distinct,parallel paths), the log will most likely record both possible cases, i.e. A → Band B → A, which will create a conflict. In this case, both conflicting orderingrelations need to be removed from the process model.

Conflict resolution attempts to classify each conflict as one of these three cases, andthen resolves it accordingly. For that, it first determines the relative significance of bothconflicting relations.

A BA → B

B → A

Ain

Aout

Bin

Bout

Fig. 4. Evaluating the relative significance of conflicting relations.

Figure 4 shows an example of two activities A and B in conflict. The relative sig-nificance for an edge A→ B can be determined as follows.

Definition 1 (Relative significance). LetN be the set of nodes in a process model, andlet sig : N × N → R+

0 be a relation that assigns to each pair of nodes A,B ∈ Nthe significance of a precedence relation over them. rel : N × N → R+

0 is a relationwhich assigns to each pair of nodes A,B ∈ N the relative importance of their orderingrelation: rel(A,B) = 1

2 ·sig(A,B)P

X∈N sig(A,X) + 12 ·

sig(A,B)PX∈N sig(X,B)

11

Every ordering relationA→ B has a set of competing relationsCompAB = Aout∪Bin. This set of competing relations is composed ofAout, i.e. all edges starting fromA,and ofBin, i.e. all edges pointing toB (cf. Figure 4). Note that this set also contains thereference relation itself, i.e. more specifically:Bin∩Aout = {A→ B}. By dividing thesignificance of an ordering relation A→ B with the sum of all its competing relations’significances, we get the importance of this relation in its local context.

If the relative significance of both conflicting relations, rel(A,B) and rel(B,A)exceeds a specified preserve threshold value, this signifies that A and B are apparentlyforming a length-2-loop, which is their most significant behavior in the process. Thus,in this case, both A→ B and B → A will be preserved.

In case at least one conflicting relation’s relative significance is below this threshold,the offset between both relations’ relative significances is determined, i.e. ofs(A,B) =|rel(A,B)−rel(B,A)|. The larger this offset value, the more the relative significancesof both conflicting relations differ, i.e. one of them is clearly more important. Thus, ifthe offset value exceeds a specified ratio threshold, we assume that the relatively lesssignificant relation is in fact an exception and remove it from the process model.

Otherwise, i.e. if at least one of the relations has a relative significance below thepreserve threshold and their offset is smaller than the ratio threshold, this signifies thatbothA→ B andB → A are relations which are of no greater importance for both theirsource and target activities. This low, yet balanced relative significance of conflictingrelations hints at A and B being executed concurrently, i.e. in two separate threads ofthe process. Consequently, both edges are removed from the process model, as they donot correspond to factual ordering relations.

5.2 Edge Filtering

Although conflict resolution removes a number of edges from the process model, themodel still contains a large amount of precedence relations. To infer further structurefrom this model, it is necessary to remove most of these remaining edges by edge fil-tering, which isolates the most important behavior. The obvious solution is to removethe globally least significant edges, leaving only highly significant behavior. However,this approach yields sub-optimal results, as it is prone to create small, disparate clustersof highly frequent behavior. Also, in the subsequent aggregation step, highly corre-lated relations play an important part in connecting clusters, even if they are not verysignificant.

Therefore, our edge filtering approach evaluates each edge A → B by its utilityutil(A,B), a weighed sum of its significance and correlation. A configurable utilityratio ur ∈ [0, 1] determines the weight, such that util(A,B) = ur · sig(A,B) + (1−ur) · cor(A,B). A larger value for ur will preserve more significant edges, while asmaller value will favor highly correlated edges.

Figure 5 shows an example for processing the incoming arcs of a node A. Usinga utility ratio of 0.5, i.e. taking significance and correlation equally into account, theutility value is calculated, which ranges from 0.4 to 1.0 in this example.

Filtering edges is performed on a local basis, i.e. for each node in the process model,the algorithm preserves its incoming and outgoing edges with the highest utility value.The decision of which edges get preserved is configured by the edge cutoff parameter

12

A

1.0P

Q

R

S

0.9

0.7

0.5

1.0

0.2

0.8

0.3

Sign. Corr.

1.0

0.55

0.75

0.4

1.0

0.25

0.58

0.0

Util. Normal. Util.

ur = 0.5 co = 0.41.0

0.58

Preserve

AQ

S

P

R

Fig. 5. Filtering the set of incoming edges for a node A.

co ∈ [0, 1]. For every node N , the utility values for each incoming edge X → N arenormalized to [0, 1], so that the weakest edge is assigned 0 and the strongest one 1. Alledges whose normalized utility value exceeds co are added to the preserved set. In theexample in Figure 5, only two of the original edges, are preserved, using a edge cutoffvalue of 0.4: P → A (with normalized utility of 1.0) and R → A (norm. utility of0.56). The outgoing edges are processed in the same manner for each node.

The edge cutoff parameter determines the aggressiveness of the algorithm, i.e. thehigher its value, the more likely the algorithm is to remove edges. In very unstructuredprocesses, where precedence relations are likely to have a balanced significance, it isoften useful to use a lower utility ratio, such that correlation will be taken more intoaccount and resolve such ambiguous situations. On top of that, a high edge cutoff willact as an amplifier, helping to distinguish the most important edges.

Note that our edge filtering approach starts from an empty set of precedence rela-tions, i.e. all edges are removed by default. Only if an edge is selected locally for at leastone node, it will be preserved. This approach keeps the process model connected, whileclarifying ambiguous situations. Whether an edge is preserved depends on its utility fordescribing the behavior of the activities it connects – and not on global comparisonswith other parts of the model, which it does not even interact with.

topcomplete0.517

midtopcomplete0.406

0.7950.815

sponsorscomplete0.473

0.0490.697

toolbarcomplete0.408

0.0820.697

layout_files/_rootcomplete0.896

0.0150.672

bottomcomplete0.417

0.0410.638

witcomplete0.657

0.0510.660

zwartcomplete0.610

0.3640.651

stylescomplete0.785

0.0240.676

introductioncomplete0.430

0.1570.653

subtopcomplete0.437

0.3430.768

processmining/_rootcomplete1.000

0.0400.697

0.0380.846

0.0960.671

0.2210.618

0.0230.551

0.1250.703

0.0460.734

0.0790.667

0.0380.380

0.4220.698

0.8960.820

0.0010.395

0.0260.640

0.0290.675

0.0081.000

0.0320.719

0.2270.494

0.0590.749

0.2060.598

0.4790.679

0.4720.637

0.0680.669

0.0370.651

0.0660.484

0.0310.766

0.0200.660

0.4290.685

0.0620.453

0.9810.674

0.0610.545

0.1230.640

0.1320.528

0.0710.708

0.0230.669

0.0310.589

0.0700.471

0.0180.509

0.0200.615

0.0140.444

0.2410.716

0.0070.412

0.0750.413

0.0480.455

0.2950.531

0.0590.408

0.0490.507

0.1990.476

0.0230.765

0.0280.781

0.9820.726

0.1010.707

0.1210.517

0.0020.853

0.1030.620

0.2220.592

0.2200.556

0.0400.730

0.0290.764

0.0480.586

0.5920.612

0.1990.707

0.0340.515

0.0560.589

0.1790.473

0.0310.669

0.0390.936

0.6700.743

0.4410.493

0.0600.592

0.1910.679

0.0320.595

0.5730.639

0.5410.623

0.0370.636

0.0960.627

0.1160.515

0.0720.627

0.4830.782

0.0400.904

0.2430.507

0.1220.650

0.2850.622

0.0380.651

0.0510.414

0.0300.603

0.0160.677

0.0160.433

0.9840.574

0.0380.530

0.0730.372

0.0600.457

0.4920.707

0.0980.392

0.0120.572

0.1850.377

0.0370.687

0.0180.744

0.1790.615

0.9570.699

0.0490.406

0.4370.685

0.0600.568

0.0760.662

0.1670.520

0.0070.682

0.0260.736

0.0220.382

0.0710.843

0.1220.834

0.0860.674

0.3920.610

0.0180.421

0.1840.697

0.0230.651

0.0930.660

0.0460.588

0.9140.714

0.0200.672

0.2660.549

0.1040.630

0.0230.702

0.0250.500

0.0070.794

0.0230.669

1.0000.614

0.3060.558

0.0060.718

0.0320.613

0.0620.565

0.0080.682

topcomplete0.517

midtopcomplete0.406

0.7950.815

subtopcomplete0.437

0.8960.820

sponsorscomplete0.473

0.0081.000

zwartcomplete0.610

0.4790.679

stylescomplete0.785

0.4720.637

toolbarcomplete0.408

bottomcomplete0.417

0.9810.674

layout_files/_rootcomplete0.896

0.2410.716

processmining/_rootcomplete1.000

0.1990.476

0.9820.726

0.0020.853

witcomplete0.657

0.5920.612

0.0390.936

0.4410.493

0.5730.639

0.5410.623

0.0400.904

0.9840.574

0.4920.707

introductioncomplete0.430

0.9570.699

0.0070.682

0.9140.714

1.0000.614

0.0080.682

Fig. 6. Example of a process model before (left) and after (right) edge filtering.

Figure 6 shows the effect of edge filtering applied to a small, but very unstructuredprocess. The number of nodes remains the same, while removing an appropriate subsetof edges clearly brings structure to the previously chaotic process model.

5.3 Node Aggregation and Abstraction

While removing edges brings structure to the process model, the most effective tool forsimplification is removing nodes. It enables the analyst to focus on an interesting subset

13

of activities. Our approach preserves highly correlated groups of less-significant nodesas aggregated clusters, while removing isolated, less-significant nodes. Removing nodesis based on the node cutoff parameter. Every node whose unary significance is belowthis threshold becomes a victim, i.e. it will either be aggregated or abstracted from. Thefirst phase of our algorithm builds initial clusters of less-significant behavior as follows.

– For each victim, find the most highly correlated neighbor (i.e., connected node)– If this neighbor is a cluster node, add the victim to this cluster.– Otherwise, create a new cluster node, and add the victim as its first element.

Whenever a node is added to a cluster, the cluster will “inherit” the ordering rela-tions of that node, i.e. its incoming and outgoing arcs, while the actual node will behidden. The second phase is merging the clusters, which is necessary as most clusterswill, at this stage, only consist of one single victim. The following routine is performedto aggregate larger clusters and decrease their number.

– For each cluster, check whether all predecessors or all successors are also clusters.– If all predecessor nodes are clusters as well, merge with the most highly correlated

one and move on to the next cluster.– If all successors are clusters as well, merge with the most highly correlated one.– Otherwise, i.e. if both the cluster’s pre- and postset contain regular nodes, the clus-

ter is left untouched.

Xstart

0.507Cluster B

2 primitives~ 0.127

0.4470.902

Cluster C1 primitives

~ 0.156

0.4470.902 Y

start0.927

0.0000.451Cluster A

2 primitives~ 0.169

0.0000.564

0.0010.676

Fig. 7. Excerpt of a clustered model after the first aggregation phase.

It is important that clusters will only be merged, if the “victim” has only clusters inhis pre- or postset. Figure 7 shows an example of a process model after the first phaseof clustering. Cluster A cannot merge with cluster B, as they are also both connectedto node X . Otherwise, X would be connected to the merged cluster in both directions,making the model less informative. However, clusters B and C can merge, as B’s post-set consists only of C. This simplification of the model does not lessen the amount ofinformation, and is thus valid.

The last phase, which constitutes the abstraction, removes isolated and singularclusters. Isolated clusters are detached parts of the process, which are less significantand highly correlated, and which have thus been folded into one single, isolated clusternode. It is obvious that such detached nodes do not contribute to the process model,which is why they are simply removed. Singular clusters consist only of one, singleactivity node. Thus, they represent less-significant behavior which is not highly corre-lated to adjacent behavior. Singular clusters are undesired, because they do not simplifythe model. Therefore, they are removed from the model, while their most significantprecedence relations are transitively preserved (i.e., their predecessors are artificiallyconnected to their successors, if such edge does not already exist in the model).

14

6 Implementation and Application

We have implemented our approach as the Fuzzy Miner plugin for the ProM framework[8]. All metrics introduced in Section 4 have been implemented, and can be configuredby the user. Figure 8 shows the result view of the Fuzzy Miner, with the simplified graphview on the left, and a configuration pane for simplification parameters on the right.Alternative views allow the user to inspect and verify measurements for all metrics,which helps to tune these metrics to the log.

Fig. 8. Screenshot of the Fuzzy Miner, applied to the very large and unstructured log also usedfor mining the model in Figure 1.

Note that this approach is the result of valuable lessons learnt from a great number ofcase studies with real-life logs. As such, both the applied metrics and the simplificationalgorithm have been optimized using large amounts actual, less-structured data. Whileit is difficult to validate the approach formally, the Fuzzy Miner has already becomeone of the most useful tools in case study applications.

For example, Figure 8 shows the result of applying the Fuzzy Miner to a large testlog of manufacturing machines (155.296 events in 16 cases, 826 event classes). Theapproach has been shown to scale well and is of linear complexity. Deriving all metricsfrom the mentioned log was performed in less than ten seconds, while simplifying theresulting model took less than two seconds on a 1.8 GHz dual-core machine. While thismodel has been created from the raw logs, Figure 1 has been mined with the HeuristicsMiner, after applying log filters [14]. It is obvious that the Fuzzy Miner is able to cleanup a large amount of confusing behavior, and to infer and extract structure from whatis chaotic. We have successfully used the Fuzzy Miner on various machinery test andusage logs, development process logs, hospital patient treatment logs, logs from case

15

handling systems and web servers, among others. These are notoriously flexible andunstructured environments, and we hold our approach to be one of the most useful toolsfor analyzing them so far.

7 Related Work

While parting with some characteristics of traditional process mining techniques, suchas absolute precision and describing the complete behavior, it is obvious that the ap-proach described in this paper is based on previous work in this domain, most specif-ically control-flow mining algorithms [2, 14, 7]. The mining algorithm most related toFuzzy Mining is the Heuristics Miner, which also employs heuristics to limit the set ofprecedence relations included in the model [14]. Our approach also incorporates con-cepts from log filtering, i.e. removing less significant events from the logs [8]. However,the foundation on multi-perspective metrics, i.e. looking at all aspects of the process atonce, its interactive and explorative nature, and the integrated simplification algorithmclearly distinguishes Fuzzy Mining from all previous process mining techniques.

The adaptive simplification approach presented in Section 5 uses concepts from thedomains of data clustering and graph clustering. Data clustering attempts to find relatedsubsets of attributes, based on a binary distance metric inferred upon them [11]. Graphclustering algorithms, on the other hand, are based on analyzing structural properties ofgraphs, from which they derive partitioning strategies [13, 9]. Our approach, however,is based on a unique combination of analyzing the significance and correlation of graphelements, which are based on a wide set of process perspectives. It integrates abstractionand aggregation, and is also more specialized towards the process domain.

8 Discussion and Future Work

We have described the problems traditional process mining techniques face when ap-plied to large, less-structured processes, as often found in practice. Subsequently, wehave analyzed the causes for these problems, which lie in a mismatch between fun-damental assumptions of traditional process mining, and the characteristics of real-lifeprocesses. Based on this analysis, we have developed an adaptive simplification and vi-sualization technique for process models, which is based on two novel metrics, signifi-cance and correlation. We have described a framework for deriving these metrics froman enactment log, which can be adjusted to particular situations and analysis questions.

While the fine-grained configurability of the algorithm and its metrics makes ourapproach universally applicable, it is also one of its weaknesses, as finding the “right”parameter settings can sometimes be time-consuming. Thus, our next steps will concen-trate on deriving higher-level parameters and sensible default settings, while preservingthe full range of parameters for advanced users. Further work will concentrate on ex-tending the set of metric implementations and improving the simplification algorithm.

It is our belief that process mining, in order to become more meaningful, and tobecome applicable in a wider array of practical settings, needs to address the problemsit has with unstructured processes. We have shown that the traditional desire to modelthe complete behavior of a process in a precise manner conflicts with the original goal,

16

i.e. to provide the user with understandable, high-level information. The success ofprocess mining will depend on whether it is able to balance these conflicting goalssensibly. Fuzzy Mining is a first step in that direction.

Acknowledgements This research is supported by the Technology Foundation STW,applied science division of NWO and the technology programme of the Dutch Ministryof Economic Affairs.

References

1. W.M.P. van der Aalst, B.F. van Dongen, J. Herbst, L. Maruster, G. Schimm, and A.J.M.M.Weijters. Workflow Mining: A Survey of Issues and Approaches. Data and KnowledgeEngineering, 47(2):237–267, 2003.

2. W.M.P. van der Aalst, A.J.M.M. Weijters, and L. Maruster. Workflow Mining: DiscoveringProcess Models from Event Logs. IEEE Transactions on Knowledge and Data Engineering,16(9):1128–1142, 2004.

3. R. Agrawal, D. Gunopulos, and F. Leymann. Mining Process Models from Workflow Logs.In Sixth International Conference on Extending Database Technology, pages 469–483, 1998.

4. E. Badouel, L. Bernardinello, and P. Darondeau. The Synthesis Problem for Elementary NetSystems is NP-complete. Theoretical Computer Science, 186(1-2):107–134, 1997.

5. J.E. Cook and A.L. Wolf. Discovering Models of Software Processes from Event-BasedData. ACM Transactions on Software Engineering and Methodology, 7(3):215–249, 1998.

6. A. Datta. Automating the Discovery of As-Is Business Process Models: Probabilistic andAlgorithmic Approaches. Information Systems Research, 9(3):275–301, 1998.

7. B.F. van Dongen and W.M.P. van der Aalst. Multi-Phase Process Mining: Building InstanceGraphs. In P. Atzeni, W. Chu, H. Lu, S. Zhou, and T.W. Ling, editors, International Con-ference on Conceptual Modeling (ER 2004), volume 3288 of Lecture Notes in ComputerScience, pages 362–376. Springer-Verlag, Berlin, 2004.

8. B.F. van Dongen, A.K. Alves de Medeiros, H.M.W. Verbeek, A.J.M.M. Weijters, and W.M.P.van der Aalst. The ProM framework: A New Era in Process Mining Tool Support. InG. Ciardo and P. Darondeau, editors, Application and Theory of Petri Nets 2005, volume3536 of Lecture Notes in Computer Science, pages 444–454. Springer-Verlag, Berlin, 2005.

9. S. van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht,2000.

10. J. Herbst. A Machine Learning Approach to Workflow Management. In Proceedings 11thEuropean Conference on Machine Learning, volume 1810 of Lecture Notes in ComputerScience, pages 183–194. Springer-Verlag, Berlin, 2000.

11. A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: A review. ACM Computing Surveys,31(3):264–323, 1999.

12. A.K.A. de Medeiros, A.J.M.M. Weijters, and W.M.P. van der Aalst. Genetic Process Mining:A Basic Approach and its Challenges. In C. Bussler et al., editor, BPM 2005 Workshops(Workshop on Business Process Intelligence), volume 3812 of Lecture Notes in ComputerScience, pages 203–215. Springer-Verlag, Berlin, 2006.

13. A. Pothen, H. D. Simon, and K. Liou. Partitioning sparse matrics with eigenvectors of graphs.SIAM J. Matrix Anal. Appl., 11(3):430–452, July 1990.

14. A.J.M.M. Weijters and W.M.P. van der Aalst. Rediscovering Workflow Models from Event-Based Data using Little Thumb. Integrated Computer-Aided Engineering, 10(2):151–162,2003.

Fuzzy Mining – Adaptive Process Simpliﬁcation Based …wvdaalst/publications/p400.pdf · Fuzzy...

Documents

Transcript of Fuzzy Mining – Adaptive Process Simpliﬁcation Based …wvdaalst/publications/p400.pdf · Fuzzy...