Distributed Process Discovery and Conformance Checking
-
Upload
wil-van-der-aalst -
Category
Technology
-
view
263 -
download
1
description
Transcript of Distributed Process Discovery and Conformance Checking
Distributed Process Discovery and Conformance Checking
prof.dr.ir. Wil van der Aalstwww.processmining.org
PAGE 1
On the different roles of (process) models …
Play-Out
PAGE 2
Play-Out (Classical use of models)
PAGE 3
A B C D
A C B DA B C D
A E D
A C B DA C B D
A E D
A E D
Play-In
PAGE 4
Play-In
PAGE 5
A C B DA B C D
A E D
A C B DA C B D
A E D
A E DA B C D
Example Process Discovery(Vestia, Dutch housing agency, 208 cases, 5987 events)
PAGE 6
Example Process Discovery(ASML, test process lithography systems, 154966 events)
PAGE 7
Example Process Discovery(AMC, 627 gynecological oncology patients, 24331 events)
PAGE 8
Replay
PAGE 9
Replay
PAGE 10
A B C D
Replay
PAGE 11
A E D
Replay can detect problems
PAGE 12
AC D
Problem!missing token
Problem!token left behind
Conformance Checking (WOZ objections Dutch municipality, 745 objections, 9583 event, f= 0.988)
PAGE 13
Replay can extract timing information
PAGE 14
A5B8 C9 D13
5
8
9
13
3
4
5
43
265
8
764
7
74
3
PAGE 15
Performance Analysis Using Replay(WOZ objections Dutch municipality, 745 objections, 9583 event, f= 0.988)
PAGE 16
Big Data
Big Data
PAGE 17
Source: “Big Data: The Next Frontier for Innovation, Competition, and Productivity” McKinsey Global Institute, 2011.
“Enterprises globally stored more than 7 exabytesof new data on disk drives in 2010, while consumers stored morethan 6 exabytes of new data on devices such as PCs and notebooks.”
“All of the world's music can be stored
on a $600 disk drive.”
“Indeed, we are generating so much data today that it is
physically impossible to store it all. Health
care providers, for instance, discard 90
percent of the data that they generate.”
PAGE 18
Hilbert and Lopez. The World's Technological Capacity to Store, Communicate, and Compute Information. Science, 332(6025):60-65, 2011.
PAGE 19
www.olifantenpaadjes.nl
PAGE 20
PAGE 21
Evidence-Based Computer Science
PAGE 22
PAGE 23
How good is my model?
Four Competing Quality Criteria
PAGE 24
“able to replay event log” “Occam’s razor”
“not overfitting the log” “not underfitting the log”
Example: one log four models
PAGE 25
astart register
request
bexamine thoroughly
cexamine casually
d checkticket
decide
pay compensation
reject request
reinitiate requeste
g
hfend
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N3 : fitness = +, precision = -, generalization = +, simplicity = +
N2 : fitness = -, precision = +, generalization = -, simplicity = +
astart register
request
bexamine
thoroughly
cexamine casually
dcheck ticket
decide
pay compensation
reject request
reinitiate request
e
g
h
f
end
N1 : fitness = +, precision = +, generalization = +, simplicity = +
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N4 : fitness = +, precision = +, generalization = -, simplicity = -
aregister request
dexamine casually
ccheckticket
decide reject request
e h
a cexamine casually
dcheckticket
decide
e g
a dexamine casually
ccheckticket
decide
e g
register request
register request
pay compensation
pay compensation
aregister request
b dcheckticket
decide reject request
e h
aregister request
d bcheckticket
decide reject request
e h
a b dcheckticket
decide
e gregister request
pay compensation
examine thoroughly
examine thoroughly
examine thoroughly
(all 21 variants seen in the log)
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
process discovery
fitness
precisiongeneralization
simplicity
“able to replay event log” “Occam’s razor”
“not overfitting the log” “not underfitting the log”
Model N1
PAGE 26
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
Model N2
PAGE 27
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
Model N3
PAGE 28
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
Model N4
PAGE 29
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N4 : fitness = +, precision = +, generalization = -, simplicity = -
aregister request
dexamine casually
ccheckticket
decide reject request
e h
a cexamine casually
dcheckticket
decide
e g
a dexamine casually
ccheckticket
decide
e g
register request
register request
pay compensation
pay compensation
aregister request
b dcheckticket
decide reject request
e h
aregister request
d bcheckticket
decide reject request
e h
a b dcheckticket
decide
e gregister request
pay compensation
examine thoroughly
examine thoroughly
examine thoroughly
(all 21 variants seen in the log)
PAGE 30
Process Discovery
Process Discovery (small selection)
PAGE 31
α algorithm
α++ algorithm
α# algorithm
language-based regions
state-based regionsgenetic mining
heuristic mining
hidden Markov models
neural networks
automata-based learning
stochastic task graphs
conformal process graph
mining block structures
multi-phase miningpartial-order based mining
fuzzy mining
LTL mining
ILP mining
distributed genetic mining
Petri net view: Just discover the places …
PAGE 32
Adding a place limits behavior:• overfitting ≈ adding too many places • underfitting ≈ adding too few places
Example: Process Discovery UsingState-Based Regions
PAGE 33
[ ]
a
d
b dc
e
cb
[a,b,c,d][a,b,c][a,c]
[ a,b][a,e] [a,d,e]
[a]
a
b
c
de
p2
end
p4
p3p1
start
Example of Region
PAGE 34
[ ]
a
d
b dc
e
cb
[a,b,c,d][a,b,c][a,c]
[ a,b][a,e] [a,d,e]
[a]
a
b
c
de
p2
end
p4
p3p1
start
enter: b,eleave: ddo-not-cross: a,c
Example: Process Discovery UsingLanguage-Based Regions
PAGE 35
R
A place is feasible if it can be added without disabling any of thetraces in the event log.
PAGE 36
Conformance Checking
Replaying trace “abeg”
PAGE 37
m=1r=1
a b e g
6
11
6= 0.83333
Can be lifted to log level
PAGE 38
astart register
request
bexamine
thoroughly
cexamine casually
dcheck ticket
decide
pay compensation
reject request
reinitiate request
e
g
h
f
end
p3p1
p2 p4
N1
p5
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
From “playing the token game” to optimal alignments …
a b » e ga b d e g
PAGE 39
observed trace: “abeg”
move in model only
Another alignment
a b c d e ga b » d e g
PAGE 40
observed trace: “abcdeg”
move in log only
Moves in an alignment
PAGE 41
a b » d e ga » c d e g
trace in event log
possible run of model
move in log
move in model move in both
Optimal alignment describes modeled behavior closest to observed behavior
Moves have costs
• Standard cost function:
−c(x,») = 1−c(»,y) = 1−c(x,y) = 0, if x=y−c(x,y) = ∞, if x≠y
PAGE 42
… a …… » …
… » …… a …
… a …… a …
… a …… b …
Non-fitting trace: abefdeg
PAGE 43
abefdeg
a b » e f d » e ga b d e f d b e g 2
2a b e f d e ga b » » d e g
Any cost structure is possible
PAGE 44
… send-letter(John,2 weeks, $400)
…
… send-email(Sue,3weeks,$500)
…
• Similar activities (more similarity implies lower costs).• Resource conformance (done by someone that does
not have the specified role).• Data conformance (path is not possible for this
customer). • Time conformance (missed the legal deadline)
Fitness
PAGE 45
astart register
request
bexamine thoroughly
cexamine casually
d checkticket
decide
pay compensation
reject request
reinitiate requeste
g
hfend
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N3 : fitness = +, precision = -, generalization = +, simplicity = +
N2 : fitness = -, precision = +, generalization = -, simplicity = +
astart register
request
bexamine
thoroughly
cexamine casually
dcheck ticket
decide
pay compensation
reject request
reinitiate request
e
g
h
f
end
N1 : fitness = +, precision = +, generalization = +, simplicity = +
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N4 : fitness = +, precision = +, generalization = -, simplicity = -
aregister request
dexamine casually
ccheckticket
decide reject request
e h
a cexamine casually
dcheckticket
decide
e g
a dexamine casually
ccheckticket
decide
e g
register request
register request
pay compensation
pay compensation
aregister request
b dcheckticket
decide reject request
e h
aregister request
d bcheckticket
decide reject request
e h
a b dcheckticket
decide
e gregister request
pay compensation
examine thoroughly
examine thoroughly
examine thoroughly
(all 21 variants seen in the log)
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
1.0
0.8
1.0
1.0
Our A* algorithm exploits the Petri net marking equation and uses other “tricks” to prune the search space.
Aligned event log is starting point for other types of analysis.
PAGE 46
Distributing/Decomposing “Big Data” Process Mining Problems
PAGE 47
What if?
PAGE 48
book car
c
add extra insurance
d change booking
e
confirm initiate check-in
j
check driver’s license
k
charge credit card
i
select car
g
supply car
in
a
b
skip extrainsurance
f
h
add extra insurance
skip extra insurance
l
out
c1 c2
c3
c4
c5
c6
c7
c8
c9
c10
c11
acefgijklacddefhkjilabdefjkgil
acdddefkhijlacefgijklabefgjikl
...
processdiscovery
conformance checking
there are more than 1000 different
activities?
there are more than 1.000.000
cases?
there are more than 100.000.000
events?
Distributed computing
• multicore CPU• manycore GPU• cluster computing• grid computing• cloud computing• …
PAGE 49
How to distribute process discovery?
PAGE 50
How to distribute conformance checking?
PAGE 51
abcdegadcefbcfdeg
abdcegabcdefbcdegabdfcefdcegacdefbdceg
abcdegabdceg
abdcefbdcefbdcegabcdeg
abcdefbcdefbdcegabcdefbdceg
acdefgadcfeg
abdcefcdfegabcdeg
a b
c
d
e
c1in
c2
c3
c4
c5
g
outc6
f
abcdegabdceg
abcdefbcdegabcdegabdceg
abdcefbdcefbdcegabcdeg
abcdefbcdefbdcegabcdefbdceg
abcdeg
a b
c
d
e
c1in
c2
c3
c4
c5
g
outc6
f
adcefbcfdegabdfcefdcegacdefbdceg
acdefgadcfeg
abdcefcdfeg
b is often skipped
f occurs too often
Classification based on partitioning of event log: vertical and horizontal
PAGE 52
sets of cases
sets of activities
Replication: Same event log on all computing nodes
PAGE 53
Only makes sense if random elements, e.g., genetic process mining.
Vertical distribution I: Split cases arbitrarily
PAGE 54
abcdegabdcefbcdeg
abdcegabcdefbcdegabdcefbdcegabcdefbdceg
abcdegabdceg
abdcefbdcefbdcegabcdeg
abcdefbcdefbdcegabcdefbdceg
abcdegabdceg
abdcefbcdegabcdeg
abcdegabdcefbcdeg
abdcegabcdefbcdegabdcefbdcegabcdefbdceg
abcdegabdceg
abdcefbdcefbdcegabcdeg
abcdefbcdefbdcegabcdefbdceg
abcdegabdceg
abdcefbcdegabcdeg
sets of cases
Vertical distribution II: Split cases based on a specific feature
PAGE 55
abcdegabdcefbcdeg
abdcegabcdefbcdegabdcefbdcegabcdefbdceg
abcdegabdceg
abdcefbdcefbdcegabcdeg
abcdefbcdefbdcegabcdefbdceg
abcdegabdceg
abdcefbcdegabcdeg
abcdegabdcegabcdegabdcegabcdegabcdegabdcegabcdeg
abdcefbcdegabcdefbcdegabdcefbdcegabcdefbdcegabcdefbdcegabdcefbcdeg
abdcefbdcefbdcegabcdefbcdefbdceg
Horizontal distribution
PAGE 56
sets of activities
Horizontal distribution: The key idea
PAGE 57
projected on a,b,e,f,g
projected on b,c,d,e
PAGE 58
Passages for Horizontal Distribution
Passages
PAGE 59
Passage P=(X,Y)
PAGE 60
causal dependency: may trigger or enable
Minimal passages
PAGE 61
a passage is minimal if it does not contain smaller passages
Passages define an equivalence relation on the edges in the graph
PAGE 62
Minimal passage 1: ({a},{b,c})
PAGE 63
Minimal passage 2: ({b,c,d},{d,e,f})
PAGE 64
Minimal passage 3: ({e},{g})
PAGE 65
Minimal passage 4: ({f},{h})
PAGE 66
Minimal passage 5: ({g,h},{i})
PAGE 67
So What?
• Any process model can be partitioned in minimal passages.
• Discovery and conformance checking can be done per passage!
PAGE 68
b
a
c
d
e
j
i
hf nk
l
m
o
pgi o
clouds may contain arbitrary subprocesses not explicitly recorded in the
event log (invisible activities or small networks used for routing, e.g. XOR/AND/OR-
split/joins)
Example result for Petri nets
PAGE 69
“The event log fits all passages if and only if the event log fits the whole model.”
b
a
c
d
e
j
i
hf nk
l
m
o
pgi o
Key insight: interface transitions controlled by event log
causal structure obtained using heuristics & domain knowledge
Discovery example
PAGE 70
a
in
g
out
b
f
a
b
c
d
c
d
e
f
e
g
a b
c
d
e
c1in
c2
c3
c4
c5
g
outc6
f
Conformance checking
PAGE 71
book car
c
add extra insurance
d change booking
e
confirm initiate check-in
j
check driver’s license
k
charge credit card
i
select car
g
supply car
in
a
b
skip extrainsurance
f
h
add extra insurance
skip extra insurance
l
out
c1 c2
c3
c4
c5
c6
c7
c8
c9
c10
c11
aceflacddeflabdefl
acdddeflaceflabefl
...
Create Skeleton
PAGE 72
Net fragments per passage
PAGE 73
Initial implementation in ProM
PAGE 74
Super linear speedups possible (even when using a single computer decomposition helps)
PAGE 75
PAGE 76
Conclusion
Conclusion
PAGE 77“Big Data”
b
a
c
d
e
j
i
hf nk
l
m
o
pgi o
PAGE 78
www.processmining.org
www.win.tue.nl/ieeetfpm/