Enabling Large-Scale Process Discovery

Sergio Hernández de Mesa { [email protected], [email protected] }

Eindhoven, The Netherlands

9th July, 2015


Outline

• Motivation
• MapReduce-based distributed process discovery
• Integration within ProM
• Evaluation
• Summary and Future Work

Motivation

Data explosion phenomenon

Motivation

Big Data era

Motivation

Big Data and Process Mining


MapReduce-based distributed process discovery

MapReduce

• Programming model for data-oriented applications
• Proposed by Google
• Inspired by functional programming
• Scalable and easy to use
• Map: (k1, v1) → list(k2, v2)
• Reduce: (k2, list(v2)) → list(v3)
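The Map and Reduce signatures on this slide can be illustrated with a minimal in-memory sketch. This is plain Python rather than Hadoop code; `run_mapreduce`, `count_activities_map` and `sum_reduce` are hypothetical names, and the activity-counting task is only an example chosen to show the shape of the two functions.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every (k1, v1), group by k2, then reduce each group."""
    groups = defaultdict(list)
    # Map phase: (k1, v1) -> list of (k2, v2)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    # Reduce phase: (k2, list(v2)) -> list(v3)
    return {k2: list(reducer(k2, values)) for k2, values in groups.items()}


def count_activities_map(trace_id, activities):
    # Emit one (activity, 1) pair per event in the trace.
    for activity in activities:
        yield activity, 1


def sum_reduce(activity, counts):
    # Collapse all counts for one activity into a single total.
    yield sum(counts)


records = [("t1", ["a", "b", "c"]), ("t2", ["a", "c"])]
result = run_mapreduce(records, count_activities_map, sum_reduce)
```

In a real cluster the grouping step is the shuffle performed by the framework; here it is a dictionary so the data flow stays visible.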

MapReduce-based distributed process discovery

Hadoop

• Framework for reliable and scalable distributed computing
• Developed by Apache
• Core components:
  - Hadoop Distributed File System (HDFS)
  - Hadoop YARN (Yet Another Resource Negotiator)

[Figures: HDFS overview; YARN overview]

MapReduce-based distributed process discovery

Performance improvement opportunities

• Distribute/parallelize process discovery techniques
  - Step 1: consider relations at trace level
  - Step 2: aggregate information
  - Step 3: apply some magic (discovery algorithm)
• HPC infrastructures
• Parallel programming models and technologies
  - MapReduce
  - Hadoop

MapReduce-based distributed process discovery

Highlights

• XES as input format
• Split the log into smaller sublogs
  - Horizontal partitioning
  - Automatically managed by HDFS
• MapReduce-based approach
  - Map: analyse event relation data (directly-follows graph, long-distance dependencies, splits/joins, etc.)
  - Reduce: aggregate data and apply simple transformations
• Process model inside ProM
  - Reuse algorithms and representations
  - Visualize results

MapReduce-based distributed process discovery

Overview of process discovery techniques

• Alpha Miner
• Inductive Miner
• Flexible Heuristics Miner

MapReduce-based distributed process discovery

Computing DFG: Hadoop/MapReduce approach

[Figure: XES logs are stored in HDFS (Hadoop Distributed File System) as Block 1, Block 2, ..., Block N. In the split phase, the <trace>...</trace> elements of each block are handed to one map task (MAP 1, MAP 2, ..., MAP N). Each map task produces a partial graph (DFG 1, DFG 2, ..., DFG N), and a single REDUCE step merges them into the FINAL DFG.]
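The data flow in this diagram (one partial DFG per block, merged by a single reduce) can be sketched as follows. This is a local Python illustration, not the Hadoop job from the talk. It assumes each block holds only complete `<trace>` elements and that events carry an un-namespaced `concept:name` string attribute; `map_block`, `reduce_dfgs` and `make_trace` are hypothetical helpers.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def map_block(xes_fragment):
    """Map: count directly-follows pairs inside every trace of one block."""
    root = ET.fromstring(f"<log>{xes_fragment}</log>")
    dfg = Counter()
    for trace in root.iter("trace"):
        names = [
            attr.get("value")
            for event in trace.iter("event")
            for attr in event.iter("string")
            if attr.get("key") == "concept:name"
        ]
        # Consecutive events (a, b) within a trace form one directly-follows pair.
        dfg.update(zip(names, names[1:]))
    return dfg


def reduce_dfgs(partials):
    """Reduce: sum the pair counts of all partial DFGs."""
    final = Counter()
    for partial in partials:
        final.update(partial)
    return final


def make_trace(activities):
    # Build a tiny XES-like trace for the example below.
    events = "".join(
        f'<event><string key="concept:name" value="{a}"/></event>'
        for a in activities
    )
    return f"<trace>{events}</trace>"


block_1 = make_trace("abc") + make_trace("ac")
block_2 = make_trace("abc")
final_dfg = reduce_dfgs(map_block(block) for block in (block_1, block_2))
```

Because directly-follows counts are computed per trace and simply added, the result does not depend on how traces are distributed over blocks, which is what makes horizontal partitioning safe for this step.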


Integration within ProM

Core Concepts

• Hadoop Cluster Parameters
  - Connection to a Hadoop cluster
  - Verifies that the user has access to the cluster and that HDFS is accessible
• Hadoop XLog
  - Extends the XLog interface
  - Only a reference to the file is kept when the log is imported
  - The log is loaded into memory only if a plugin requests information from it
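The "reference on import, load on demand" behaviour described for the Hadoop XLog can be pictured with a small lazy-loading wrapper. This sketch is not the ProM implementation; `LazyLog`, its `loader` argument and the example path are all hypothetical and exist only to show the pattern.

```python
class LazyLog:
    """Holds only a path until some caller actually asks for the traces."""

    def __init__(self, path, loader):
        self.path = path          # reference to the file, e.g. a location in HDFS
        self._loader = loader     # callable that reads the file when needed
        self._traces = None       # nothing is in memory yet

    @property
    def is_loaded(self):
        return self._traces is not None

    def traces(self):
        # Load on first access only; later calls reuse the cached data.
        if self._traces is None:
            self._traces = self._loader(self.path)
        return self._traces


calls = []


def fake_loader(path):
    # Stand-in for an expensive read; records how often it is invoked.
    calls.append(path)
    return [["a", "b"], ["a", "c"]]


log = LazyLog("hdfs:///logs/example.xes", fake_loader)
loaded_before = log.is_loaded
trace_count = len(log.traces())
log.traces()  # second access does not trigger another load
```

The point of the design is that importing a very large log stays cheap, since plugins that run entirely on the cluster never need the events locally.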

Integration within ProM

Basic Operation

1. Connect to the Hadoop cluster
2. Virtually import the log
3. Send the executable JAR file
4. Execute the MapReduce job
5. Retrieve the result
6. Get the final process model

Integration within ProM

Screenshot


Evaluation

Experimental setup: Hardware configuration

• AIS Hadoop cluster
• 1 Master Node
  - 8 Intel Xeon CPU E5430 at 2.66 GHz
  - 32 GB of RAM
  - 5 × 300 GB hard disks
• 4 Worker Nodes
  - 8 Intel Xeon CPU E5430 at 2.66 GHz
  - 64 GB of RAM
  - 8 × 1 TB hard disks

Evaluation

Experimental setup: Hadoop configuration

• Apache Hadoop 2.6.0
• Up to 16 tasks (virtual cores) per worker node
• Up to 56 GB of RAM per worker node
• HDFS block size: 256 MB
• 2 replicas per block
• Master node: NameNode and ResourceManager services
• Worker nodes: DataNode and NodeManager services
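A quick back-of-the-envelope calculation shows what these settings imply for a given log size. The constants come from the slides; the helper names are hypothetical, and the "one map task per block" relationship is an assumption about the input splitting, not something the slides state.

```python
import math

WORKERS = 4              # worker nodes in the cluster
TASKS_PER_WORKER = 16    # virtual cores per worker
RAM_PER_WORKER_GB = 56   # RAM made available to Hadoop per worker
BLOCK_SIZE_MB = 256      # HDFS block size
REPLICAS = 2             # copies kept of every block


def cluster_capacity():
    """Total concurrent tasks and RAM across all workers."""
    return {
        "tasks": WORKERS * TASKS_PER_WORKER,
        "ram_gb": WORKERS * RAM_PER_WORKER_GB,
    }


def blocks_for_log(log_size_gb):
    """Number of HDFS blocks needed for a log of the given size."""
    return math.ceil(log_size_gb * 1024 / BLOCK_SIZE_MB)


def stored_size_gb(log_size_gb):
    """Raw disk space used once every block is replicated."""
    return log_size_gb * REPLICAS


capacity = cluster_capacity()
blocks = blocks_for_log(256)
# Assumption: one map task per block, so tasks run in this many waves.
waves = math.ceil(blocks / capacity["tasks"])
```

Under these assumptions a 256 GB log occupies 1024 blocks, uses 512 GB of raw disk, and needs 16 waves of map tasks on 64 task slots.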

Evaluation

Experimental setup: Process mining techniques

• Alpha Miner
  - No configuration parameters
• Inductive Miner
  - Inductive Miner infrequent
  - Noise threshold: 0.2
• Flexible Heuristics Miner
  - Heuristics: all tasks connected and long-distance dependency
  - Dependency threshold: 90.0
  - Relative-to-best threshold: 5.0
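To make the dependency threshold concrete, the sketch below computes the dependency measure commonly used by heuristics-style miners from directly-follows counts and keeps only the edges above a threshold. Reading the slide's 90.0 as a percentage (0.9) is an assumption, the counts are made up for illustration, and the relative-to-best and "all tasks connected" heuristics are not modelled.

```python
def dependency(dfg, a, b):
    """Dependency measure between activities a and b from directly-follows counts."""
    ab = dfg.get((a, b), 0)
    ba = dfg.get((b, a), 0)
    if a == b:
        # Length-one loop: only the self-follows count matters.
        return ab / (ab + 1)
    return (ab - ba) / (ab + ba + 1)


def filter_edges(dfg, threshold):
    """Keep the pairs whose dependency value reaches the threshold."""
    return sorted(pair for pair in dfg if dependency(dfg, *pair) >= threshold)


# Hypothetical directly-follows counts, for illustration only.
dfg = {("a", "b"): 20, ("a", "c"): 5, ("c", "a"): 4}
kept = filter_edges(dfg, threshold=0.9)
```

Here `a` is followed by `b` 20 times and never the reverse, giving 20/21 (about 0.95), so the edge survives; `a` and `c` follow each other almost equally often, giving 0.1, so that edge is dropped.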

Evaluation

Experimental setup: Datasets

• Synthetic datasets
  - Process tree of 40 activities
  - Random generation
• Log 1
  - Average: 35 events per trace
• Log 2
  - Two iterations of the synthetic dataset
  - 40 activities
  - Average: 70 events per trace
• Log 3
  - Activities renamed in the 2nd iteration
  - 80 activities
  - Average: 70 events per trace

Evaluation

Experimentation: Put logs in HDFS

[Figure: Put logs in HDFS. Time (minutes, 0 to 70) versus log size (GB, 0 to 416) for Log 1, Log 2 and Log 3.]

Evaluation

Experimentation: Scalability – log size

• Log 1

[Figure: Time (minutes, 0 to 25) versus log size (GB, 0 to 256) for the Alpha Miner and the Inductive Miner.]

Evaluation

Experimentation: Scalability – log size

• Log 1

[Figure: Flexible Heuristics Miner. Time (minutes, 0 to 200) versus log size (GB, 0 to 256), with series "XES 2 DGraph" and "DGraph 2 AnnotatedDGraph".]

Evaluation

Experimentation: Scalability – worker nodes

• Log 1

[Figure: Computing directly-follows graph. Speed-up (0 to 4) versus log size (GB, 0 to 256) for 1, 2, 3 and 4 workers.]

Evaluation

Experimentation: Scalability – worker nodes

• Log 1

[Figure: Flexible Heuristics Miner. Speed-up (0 to 4) versus log size (GB, 0 to 256) for 1, 2, 3 and 4 workers.]
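The speed-up plotted on these two slides is presumably the usual ratio between the run time on one worker and the run time on n workers. A minimal sketch of that calculation follows; the baseline choice is an assumption and the timings are hypothetical placeholders, not measurements from the talk.

```python
def speed_up(time_one_worker, time_n_workers):
    """How many times faster the run is compared with a single worker."""
    return time_one_worker / time_n_workers


def efficiency(speed, workers):
    """Fraction of the ideal (linear) speed-up actually achieved."""
    return speed / workers


# Hypothetical run times in minutes, keyed by number of workers.
times = {1: 120.0, 2: 64.0, 3: 45.0, 4: 36.0}
speeds = {n: speed_up(times[1], t) for n, t in times.items()}
effs = {n: efficiency(s, n) for n, s in speeds.items()}
```

A speed-up close to the number of workers (efficiency near 1) indicates that adding nodes pays off almost linearly; values well below that point to overheads such as job start-up or the single reduce step.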

Evaluation

Experimentation: 3 datasets comparison

[Figure: Inductive Miner. Time (minutes, 0 to 50) versus log size (GB, 0 to 416) for Log 1, Log 2 and Log 3.]

Evaluation

Experimentation: 3 datasets comparison

[Figure: Flexible Heuristics Miner. Time (minutes, 0 to 800) versus log size (GB, 0 to 416) for Log 1, Log 2 and Log 3.]


Summary and Future Work

Conclusions

• MapReduce-based approach for process discovery
  - Split the input event log into smaller sublogs
  - Map phase: compute intermediate data from the sublogs
  - Reduce phase: aggregate all data
  - Process model in ProM
• Results show the approach is scalable with respect to
  - Log size (number of events and traces)
  - Computing resources
• Integration of Apache Hadoop within ProM
  - Using implemented algorithms
  - Developing new Hadoop-based techniques

Summary and Future Work

Future work

• Developing new process discovery algorithms
  - ILP miner, social network miner, etc.
• Extend to other process mining dimensions
  - Computing alignments
• Explore other input formats
  - CSV
  - Apache Avro
• Explore other distributed computing approaches
  - Apache Spark
  - Cloud computing
