ANALYSIS OF GLOBAL VIRTUAL TIME ALGORITHMS FOR PARALLEL
DISCRETE EVENT SIMULATION ON MANY-CORE SYSTEMS
BY
ALI ARDA EKER
BS, Binghamton University, 2017
BS, Istanbul Technical University, 2017
THESIS
Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
in the Graduate School of Binghamton University
State University of New York
2019
Accepted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
in the Graduate School of Binghamton University
State University of New York
2019
July 2, 2019
Dmitry Ponomarev, Faculty Advisor
Department of Computer Science, Binghamton University
Kenneth Chiu, Member
Department of Computer Science, Binghamton University
Abstract
Global Virtual Time (GVT) algorithms compute a snapshot of a distributed simulation system in order to determine a consistent global state across all simulation processes. These algorithms aim to disturb the underlying computation as little as possible while computing a global state consisting of the local states of all processes and the states of the messages in transit between them. In other words, GVT algorithms implement monotonic functions that give a lower bound on the simulation time to which a distributed simulation system has advanced. In Parallel Discrete Event Simulation (PDES), the GVT is used to determine the correct time for non-reversible operations such as garbage collection, I/O operations and terminating the simulation.
In this project, we implemented two asynchronous GVT algorithms, Wait-Free and Mattern's GVT, and compared them with a barrier-based synchronous GVT algorithm [11]. We evaluated the GVT algorithms based on PDES performance on two different multi-core architectures: a classical 12-core Xeon machine and a high-performance-computing Xeon Phi processor (Knights Landing). Using the ROSS simulator, we demonstrated that an efficient GVT algorithm can lead to significant improvements in scalability, depending on the simulation model. We observed that the synchronous Barrier GVT algorithm with imbalanced models and the asynchronous Wait-Free algorithm with balanced models both allow the simulation to scale in performance all the way to 250 threads on a single machine. We also performed detailed simulation profiling to understand the underlying reasons for the different performance trends observed under each GVT algorithm, simulation model and parameter choice.
ACKNOWLEDGEMENTS
I am grateful to Dr. Dmitry Ponomarev and Dr. Kenneth Chiu for always directing me to do my best.
I also thank Barry Williams and Dr. Nael Abu-Ghazaleh for their invaluable advice on my work.
TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 2 BACKGROUND: PDES, ROSS SIMULATOR AND MULTI-
CORE ARCHITECTURES . . . . . . . . . . . . . . . . . 5
2.1 Parallel Discrete Event Simulation . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Synchronization Issues . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Computing the Global Virtual Time . . . . . . . . . . . . . . 11
2.1.3 Transient Message Problem . . . . . . . . . . . . . . . . . . . 13
2.1.4 Simultaneous Reporting Problem . . . . . . . . . . . . . . . . 15
2.2 ROSS Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 PHOLD Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Intel Xeon & Xeon Phi Architectures . . . . . . . . . . . . . . . . . . 21
2.5 Experimental Setup & Parameters . . . . . . . . . . . . . . . . . . . . 22
Chapter 3 GLOBAL VIRTUAL TIME ALGORITHMS . . . . . . 26
3.1 Synchronous GVT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.1 Barrier GVT Algorithm . . . . . . . . . . . . . . . . . . . . . 27
3.2 Asynchronous GVT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Samadi’s GVT Algorithm . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Mattern’s GVT Algorithm . . . . . . . . . . . . . . . . . . . . 30
3.2.3 Wait-Free GVT Algorithm . . . . . . . . . . . . . . . . . . . . 43
Chapter 4 EXPERIMENTAL RESULTS . . . . . . . . . . . . . . . . 48
4.1 GVT Performance on 12-core Classical Xeon Machine . . . . . . . . . 48
4.1.1 Model 1: Balanced Loading & Fast Event Processing . . . . . 49
4.1.2 Model 2: Balanced Loading & Slower Event Processing . . . . 50
4.2 GVT Performance on 64-core Knights Landing Architecture . . . . . 51
4.2.1 Model 1: Balanced Loading & Fast Event Processing . . . . . 51
4.2.2 Model 2: Balanced Loading & Slower Event Processing . . . . 54
4.2.3 Model 3: Imbalanced Communication . . . . . . . . . . . . . . 54
4.2.4 Model 4: Imbalanced Event Processing . . . . . . . . . . . . . 56
4.3 Profiling and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 5 LITERATURE REVIEW . . . . . . . . . . . . . . . . . . 62
Chapter 6 CONCLUSIONS AND FUTURE WORK . . . . . . . . 64
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
LIST OF TABLES
2.1 Details of Experimental Platforms . . . . . . . . . . . . . . . . . . . . 23
2.2 Simulation Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Performance statistics for Xeon: 24 Threads 0% Remote, Balanced, 0
EPG Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Performance statistics for KNL: 128 Threads 0% Remote, Balanced, 0
EPG Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Performance statistics for KNL: 250 Threads 10% Remote, Imbalanced,
0 EPG Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
LIST OF FIGURES
2.1 PDES Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Rollback Example (Step 1): before rollback . . . . . . . . . . . . . . . 9
2.3 Rollback Example (Step 2): after straggler message is received, rollback
is initiated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Rollback Example (Step 3 a): after rollback completed in a determin-
istic simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Rollback Example (Step 3 b): after rollback completed in a non-
deterministic simulation . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 GVT is the time stamp of message sent from LP 2 to LP 3. . . . . . . 12
2.7 Transient message problem. . . . . . . . . . . . . . . . . . . . . . . . 14
2.8 Prevention of transient message problem using message acknowledge-
ments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 Simultaneous message problem. . . . . . . . . . . . . . . . . . . . . . 16
2.10 Hierarchy of simulation structures in ROSS. . . . . . . . . . . . . . . 18
2.11 Communication architecture in single node ROSS . . . . . . . . . . . 19
2.12 Intel Knights Landing Architecture . . . . . . . . . . . . . . . . . . . 22
2.13 Imbalanced Communication . . . . . . . . . . . . . . . . . . . . . . . 24
3.1 Snapshot of a Barrier GVT computation . . . . . . . . . . . . . . . . 28
3.2 Snapshot of a Samadi’s GVT Computation . . . . . . . . . . . . . . . 31
3.3 Cut divides simulation into two: past and future. . . . . . . . . . . . 31
3.4 Events before the cut line colored as white, after the cut line colored
as red (dotted arrow). . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Second cut should stretch towards future so that there should be no
message sent from the white phase and received in the consecutive
white phase (dotted arrow). . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 Snapshot of a Mattern’s GVT computation . . . . . . . . . . . . . . . 39
3.7 Snapshot of a Wait-Free GVT computation . . . . . . . . . . . . . . . 46
4.1 Committed Event Rate on Xeon with Balanced Loading and 0 EPG:
10% Remote Events (left) & 100% Remote Events (right) . . . . . . . 49
4.2 Committed Event Rate on Xeon with Balanced Loading and 50% Re-
mote Events: 100 EPG (left) & 500 EPG (right) . . . . . . . . . . . . 50
4.3 Committed Event Rate on KNL with Balanced Loading and EPG of
0: 0% Remote Events (left) & 10% Remote Events (right) . . . . . . 52
4.4 Committed Event Rate on KNL with Balanced Loading and EPG of
0: 50% Remote Events (left) & 100% Remote Events (right) . . . . . 53
4.5 Committed Event Rate on KNL with Balanced Loading and the EPG
of 100: 10% Remote Events (left) & 100% Remote Events (right) . . 55
4.6 Imbalanced in terms of Communication on KNL: Committed Event
Rate (left) & Efficiency (right) . . . . . . . . . . . . . . . . . . . . . . 56
4.7 Imbalanced in terms of Event Processing on KNL: Committed Event
Rate (left) & Efficiency (right) . . . . . . . . . . . . . . . . . . . . . . 57
Chapter 1
INTRODUCTION
Interest in the scalable performance of Parallel Discrete Event Simulation (PDES) has increased with the emergence of many-core architectures [7,8,27]. These machines offer tight integration among a large number of cores on the same chip, in contrast to traditional clusters and small-scale multi-core systems. With these new systems, researchers aim to efficiently exploit the shared memory that feeds a large number of threads. For example, 64 4-way hyper-threaded cores share 96 GB of memory on Intel's second-generation Xeon Phi, the Knights Landing (KNL) processor. Thus, the low cost of on-chip communication on these processors offers a promise of substantially improved scalability in PDES.
The emergence of many-core architectures promised to alleviate the communication bottlenecks that hindered many previous attempts to design a scalable PDES [10,16,20,28,43,44]. However, recent studies reported generally underwhelming performance results [7]. The performance challenges and the lack of scalability partially stem from an inefficient Global Virtual Time (GVT) algorithm used in these studies. A detailed study [45] examined and characterized PDES performance and scalability on KNL. This work demonstrated the lack of scalability for most models and execution scenarios when the thread count exceeded 128, or 2 threads for each core on the KNL chip. One of the reasons cited in [45] is the higher overhead of the GVT computation with larger numbers of threads. Thus, we investigated PDES performance on multi-core processors under more optimized GVT implementations.
As a result, our study offers more encouraging conclusions about PDES scalability
properties on multi-core systems.
We pursue our investigations using the ROSS parallel discrete event simulation kernel [3] on a single-node Intel Xeon Phi processor. For comparison purposes, we also evaluate the GVT algorithms on a traditional 12-core Xeon processor. The default GVT implementation in ROSS is a synchronous GVT using native POSIX barriers. In addition, we implemented two asynchronous GVT algorithms to further boost performance. The first is inspired by Mattern's GVT algorithm, appropriately adjusted for shared memory systems within our framework [33–35]. The second is the more recent wait-free GVT algorithm proposed in [37]. We compare the performance of the three implementations under different models, settings, and conditions. Our experiments are driven by the classical PHOLD benchmark, as well as variants that provide uneven loading of threads, vary the percentage of remote communications, and change the event processing granularity.
In the Background chapter, we explain the concept of Parallel Discrete Event Simulation and why an efficient Global Virtual Time computation is needed. We also study the main challenges to be considered when calculating the GVT; the Transient Message Problem and the Simultaneous Reporting Problem are demonstrated as examples. We also describe the internals of ROSS and discuss why it needs to implement a GVT algorithm. Then, we study the PHOLD benchmark, the variants used in our experiments, and the effects of the simulation model on the choice of GVT algorithm. Finally, we examine the Intel Xeon and Xeon Phi architectures, together with the experimental setup and parameters, to understand their effects on scalability.
In the GVT chapter, we first explain and demonstrate the default synchronous GVT implementation. Then, we analyze what makes it synchronous, as well as the possible advantages and disadvantages of asynchronous implementations. To illustrate the concept of an asynchronous GVT algorithm, we study Mattern's GVT and the Wait-Free GVT in detail. We also examine the modifications made to those algorithms to exploit the shared memory more efficiently.
The simulation models investigated on the Xeon and the KNL processors are presented in the Experimental Results chapter. We first evaluate and compare GVT performance on the Xeon processor using a balanced model with varying event processing granularity at a smaller scale. A Knights Landing processor is used to experiment with balanced and imbalanced core loading models. We also explore communication-dominated and event-processing-dominated scenarios to further understand the benefits and drawbacks of the three GVT algorithms. Finally, we demonstrate the scalability of some scenarios up to 250 threads and explain the results with detailed profiling.
The main contributions of this thesis are:
• We extend previous studies of PDES performance on the KNL processor with more efficient GVT algorithms: a synchronous barrier-based GVT and two asynchronous implementations, one inspired by Mattern's GVT algorithm [34] and one based on the wait-free algorithm of [37]. As a result, we removed a significant bottleneck identified in the earlier study and demonstrated that under a more efficient GVT, the simulation can often scale all the way to 250 threads.
• This is the first study that comparatively evaluates synchronous and asyn-
chronous GVT algorithms on a many-core platform such as the KNL. We also
compare the results on traditional Xeon machines.
• We show that while the most efficient asynchronous algorithm (the wait-free GVT) significantly outperforms the other alternatives for balanced models on both Xeon and KNL systems, the barrier-based synchronous implementation results in better performance on KNL with imbalanced models, especially at larger thread counts.
• We analyze the reasons for this behavior on KNL systems using a number of profiling tools and offer explanations for the observed results based on this analysis.
Chapter 2
BACKGROUND: PDES, ROSS SIMULATOR
AND MULTI-CORE ARCHITECTURES
In this chapter we give an overview of Parallel Discrete Event Simulation and show the challenges that arise when computing the Global Virtual Time. We also review the ROSS design and the PHOLD benchmark with its variants, and describe the architectures and experimental setup used in our experiments.
2.1 Parallel Discrete Event Simulation
In order to understand PDES, one should first understand Discrete Event Simu-
lation. The main objective of DES is to model a physical system which is composed of
some number of physical processes that interact with each other [12,19]. In DES, each
physical process is modeled as a logical process (LP) and interactions between physi-
cal processes are simulated by exchanging time-stamped event messages between the
associated logical processes. The computation performed by each LP is a sequence
of event processing which can modify the state of the LP and schedule new events
for itself or other LPs. For example, in an airport traffic simulation, each airport
is represented by an LP, while airplane arrivals and departures are event messages
which provide the communication between airports.
PDES is a parallel implementation of DES [13], extending the performance advantages of parallel processing to simulation kernels. The primary concept of PDES is to divide the simulation entities into multiple Logical Processes (LPs) and to execute them on different cores or nodes in parallel.
The LPs communicate with each other by exchanging time-stamped event messages [15, 18, 29]. Time-stamps carry virtual time and are not associated with real time (wall clock time). The sender LP generates an event message according to its task and computes the message's time-stamp by adding a look-ahead value to its local virtual time (LVT). The look-ahead value can be configured as a constant or can vary according to the LP's task or virtual position in the simulation.

The LPs have local event queues and process the events from these queues in time-stamp order. Processing an event is guaranteed to generate a new event with a time-stamp larger than that of the last processed event. This new event can be sent to any other LP, including the sender itself. Depending on the simulation model, some LPs can be chosen as message destinations more often than others, but every LP should receive events once in a while in order to advance in simulation time. One can compare this to the quiet airport of Binghamton versus JFK, which handles hundreds of departures daily.
Figure 2.1 presents an overview of a PDES kernel where three LPs communicate with time-stamped event messages. Incoming messages are stored in the Event Queue, and the State Queue is updated based on the last processed event. Message destinations and look-ahead values are chosen randomly.
As mentioned previously, some events are generated locally within the LP, and
some events are generated remotely and sent over the network. The timing of the
event arrival to a destination LP depends on the physical delays that the message
encounters while traversing the network. The on-chip interconnect for core-to-core
communication within a chip or network links for cluster-level communication can
determine the delay of the arrival.
Fig. 2.1. PDES Overview
2.1.1 Synchronization Issues
Sequential event processing by LPs in DES should be computed in parallel in
a PDES environment. This generates a synchronization problem since one cannot
simply map the different logical processes to different cores or nodes and allow each
LP to proceed forward by executing events in the incoming order of arrival. Event
messages should be causally consistent with each other based on their time stamps.
LPs should process events, both those generated locally and those generated by other
LPs in the time-stamp order. Failure to accomplish this could cause the processing
of an event E without processing the events which caused the generation of E. Errors
resulting out of order time-stamp event processing are referred as causality errors.
When an LP processes an event, there is no priori guarantee that an event with
smaller time-stamp will not arrive from some other LP due to physical delays in
the system. These events are called straggler events and violate the causality order
between events. Therefore, a PDES simulation engine needs to use a synchronization
mechanism to ensure that events are executed at different LPs in the correct time-
stamped order.
For example, consider a passenger traveling from JFK, New York to Istanbul, Turkey through Heathrow, London, switching from plane A to plane B in London. If B arrives at London before the passenger does, this is harmless in reality, but in a PDES environment it causes message B to be processed first, and the passenger misses their flight to Istanbul. In this case, A is the straggler message: its time stamp is smaller than B's, but it arrives at the destination (Heathrow) after B has already been processed.
There are two proposed synchronization approaches to solve this issue. The first is a conservative approach which uses synchronization and message exchanges to guarantee that no straggler event will ever be generated and that the causality order is never violated. In contrast, the second approach is an optimistic solution which allows LPs to process events forward without global synchronization [4, 14, 17, 30]. Causality violations are handled by rolling back to a point in virtual time earlier than the straggler message's time stamp [21].
An analogy to optimistic processing is speculative execution in microprocessors. Speculative execution provides a mechanism for fetching, decoding and executing new instructions based on a branch prediction. Many modern microprocessors predict the result of a branch instruction and optimistically begin executing instructions according to this prediction. This is problematic when the prediction is incorrect: the CPU must have some way to back out of the wrong sequence of instructions it began to execute and start executing the correct sequence.
Rollbacks in an optimistic simulation require reverting the LP to a previous state. Such reversions can be realized either by checkpointing or by reverse computation; both methods require maintaining a list of event histories. In both methods, LPs also revert all the messages they sent when encountering a rollback (messages with a time stamp larger than the straggler message's time stamp). This is realized by sending an anti-message for each normal (positive) message sent to the same target LP, so that the simulation backs up with a cascading effect.
Figure 2.2 through Figure 2.5 demonstrate an example of a rollback. In step
1, the LP is about to receive a straggler message with time stamp 7 while it has
two processed events with time stamps 5 and 10, respectively. The straggler message
causes a rollback because the causal consistency between events 7 and 10 is broken.
In Figure 2.3, the LP cancels the events with time stamp larger than 7, sends
anti messages for each positive message it sent (event with time stamp 11, targeting
LP A) and processes the straggler message. The next step is dependent on whether
the simulation is deterministic or non-deterministic.
Figure 2.4 presents a scenario after a rollback is completed in a deterministic
simulation where a reverted event with time stamp 10 is created again. The cancelled
positive message with time stamp 11 is also generated again with the same target LP
A. The unprocessed event is generated in the same way. However, Figure 2.5 shows a
situation where the simulation is non-deterministic so that a random event with time
stamp 9 is generated and it causes the sending of another random message with time
stamp 10 to a different LP (B).
Fig. 2.2. Rollback Example (Step 1): before rollback
Fig. 2.3. Rollback Example (Step 2): after straggler message is received, rollback is initiated
Fig. 2.4. Rollback Example (Step 3 a): after rollback completed in a deterministic simulation
Checkpointing requires LPs to save their states periodically at fixed intervals. When a straggler message triggers a rollback, LPs discard all the work they have done between the last checkpoint prior to the straggler and their current LVT. They set their LVT to the rolled-back checkpoint and restart the simulation from that point. Checkpointing can be implemented in a deterministic or a non-deterministic way: the latter does not guarantee the regeneration of the rolled-back events, while the deterministic approach preserves the order of events.
Reverse computation is another approach to implementing optimistic simulation. It is based on reversible operations such as addition and subtraction.
Fig. 2.5. Rollback Example (Step 3 b): after rollback completed in a non-deterministic simulation
Upon a straggler message, each LP reverses the event messages starting from the most recent event until the event just prior to the straggler message. More precisely, reverse computation code is carried with every event so that the event's effect can be undone to restore the state during a rollback.
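As a minimal illustration (the state variable and handler names below are hypothetical, not taken from ROSS), a reverse-computation event handler carries an undo routine that exactly inverts its forward effect, so no checkpoint of the previous state is needed:

#include <stdio.h>

/* Hypothetical LP state: a single counter updated by each event. */
typedef struct { long arrivals; } lp_state;

/* Forward handler: apply the event's effect. */
static void process_arrival(lp_state *s) { s->arrivals += 1; }

/* Reverse handler: undo the effect during a rollback.  Because the
 * forward operation is a reversible addition, the previous state can
 * be recovered without saving a copy of it.                          */
static void reverse_arrival(lp_state *s) { s->arrivals -= 1; }

int main(void) {
    lp_state s = { 0 };
    process_arrival(&s);      /* optimistically process an event  */
    reverse_arrival(&s);      /* straggler arrives: roll it back  */
    printf("arrivals after rollback: %ld\n", s.arrivals);   /* prints 0 */
    return 0;
}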
2.1.2 Computing the Global Virtual Time
Both checkpointing and reverse computation require maintaining a previous
event history. These histories accumulate over time and create a memory overhead
which leads to lower cache utilization. Therefore, a mechanism is needed to reclaim
the resources that are no longer needed. This mechanism can also be used to perform
operations which cannot be reverted, such as I/O.
The memory problems related to the optimistic simulation can be solved if one
can guarantee that certain events are no longer prone to rollback. Specifically, state
histories prior to time T can be freed if no rollbacks prior to time T will ever be
needed. Similarly, I/O operations issued by any LP with a smaller LVT than T can
be executed. Therefore, if we can determine a lower bound on the time stamp of any
future rollback, we can use it to free the memory. This lower bound is referred to
as Global Virtual Time (GVT). If we could capture the snapshot of the simulation
system, the minimum time stamp among all anti-messages, positive messages and
unprocessed events in the system would represent the lower bound on the time stamp
of any future rollback. Thus, the memory for the state histories and event messages
that have a lower time stamp than the GVT can be reclaimed, and I/O operations
issued before the GVT can be executed.
Figure 2.6 shows a possible GVT value among simulation processes and in-transit messages. White circles depict the virtual positions of the LPs in the simulation. The LVT of an LP is computed by taking the minimum time stamp of the unprocessed events in the LP's event queue. The GVT is the minimum of the smallest time stamp of the in-transit messages (min(4, 12)) and the smallest of the LVTs (min(5, 7, 10)). Therefore, the GVT is 4 (min(4, 5)).
Fig. 2.6. GVT is the time stamp of message sent from LP 2 to LP 3.
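To make this computation concrete, the minimal C sketch below reproduces the reduction for the values in Figure 2.6; the two arrays are hypothetical stand-ins for a frozen snapshot of the system:

#include <stdio.h>

/* Hypothetical snapshot of the system in Figure 2.6. */
static double lvts[]       = { 5.0, 7.0, 10.0 };   /* LVTs of LP 1, 2, 3    */
static double in_transit[] = { 4.0, 12.0 };        /* in-flight time stamps */

static double min_of(const double *v, int n) {
    double m = v[0];
    for (int i = 1; i < n; i++)
        if (v[i] < m) m = v[i];
    return m;
}

int main(void) {
    /* GVT = min( min LVT, min time stamp of any in-transit message ). */
    double min_lvt = min_of(lvts, 3);
    double min_msg = min_of(in_transit, 2);
    double gvt     = min_lvt < min_msg ? min_lvt : min_msg;
    printf("GVT = %.1f\n", gvt);   /* prints 4.0 */
    return 0;
}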
If one could freeze the simulation as a whole and capture the information of all in-transit messages and unprocessed events, then computing the GVT would be trivial: one would only need to compute the minimum of 1) the smallest time stamp of all in-transit messages and 2) the smallest LVT among all LPs. In practice this is not possible, because once a message is sent there is no way to stop it during its transmission. Therefore, in order to determine the GVT, we need a mechanism which computes a snapshot of the simulation as a whole. There are two challenging problems associated with taking such a snapshot: the transient message problem and the simultaneous reporting problem.
2.1.3 Transient Message Problem
As mentioned earlier, the Local Virtual Time (LVT) of an LP determines its position in the simulation with respect to the other LPs. The LVT is determined by the last processed event's time stamp. Thus, it is guaranteed that the LVT is a lower bound on the time stamps of the unprocessed events in an LP's event queue (assuming the events in the queue are sorted by time stamp).
Therefore, one could try to compute the GVT by instantaneously signaling all the
LPs to report their LVTs and computing the minimum LVT. However, this minimum
LVT might not be a correct GVT value. This is because there may be some messages
in the network which are sent by one LP, but not yet received by its destination LP.
These messages may be straggler messages that possibly cause a rollback. Thus, these
transient messages must be included in the GVT computation.
Figure 2.7 shows an example of the transient message problem. The leader LP sends a message to LP 1 and LP 2 asking them to report their LVTs, which are 10 and 20 at the point the signal is received, respectively (shown as dashed arrows). The transient message with time stamp 5 has not yet been received; thus, it is not incorporated into the LVT calculation of LP 2. The computed GVT value is 10 (min(10, 20)), but it should be 5.
There are essentially two approaches to solve the transient message problem:
1) The sender LP is responsible for taking into account the time stamp of transient
messages it sends, or 2) it is the receiver LP’s responsibility to take into account the
time stamp of transient messages when they arrive.
Both solutions require message acknowledgments.
Fig. 2.7. Transient message problem.
In the first approach, the receiver sends an acknowledgment message to the sender upon the arrival of every message. The sender of each message is responsible for accounting for that message in its LVT until it receives the acknowledgment. It is acceptable for more than one LP to account for the same message, because that does not affect the GVT computation. This handshake between the sender and the receiver ensures that no transient messages "fall between the cracks" during the GVT computation [19].
Figure 2.8 shows how the transient message problem is eliminated using this solution. Message acknowledgements are sent when the receiver receives the message (shown as dotted arrows). Since LP 1 has not received the acknowledgment yet, it still remembers the time stamp of the message it sent (5) when the Report LVT signal is received.
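A sender-side sketch of this bookkeeping is shown below in C (the field and function names are hypothetical, not taken from any particular algorithm): every unacknowledged time stamp is folded into the local minimum that the LP reports.

#include <stdio.h>

#define MAX_OUTSTANDING 1024

/* Hypothetical per-LP bookkeeping for messages awaiting acknowledgment. */
typedef struct {
    double lvt;                       /* local virtual time                  */
    double pending[MAX_OUTSTANDING];  /* time stamps of unacknowledged sends */
    int    npending;
} lp_ack_state;

/* Remember a sent message until its acknowledgment arrives. */
static void on_send(lp_ack_state *lp, double ts) {
    lp->pending[lp->npending++] = ts;
}

/* Drop the entry once the matching acknowledgment is received. */
static void on_ack(lp_ack_state *lp, double ts) {
    for (int i = 0; i < lp->npending; i++) {
        if (lp->pending[i] == ts) {
            lp->pending[i] = lp->pending[--lp->npending];
            return;
        }
    }
}

/* Value reported on a Report-LVT signal: the LVT together with every
 * still-unacknowledged time stamp, so no transient message can fall
 * between the cracks.                                                 */
static double report_local_min(const lp_ack_state *lp) {
    double m = lp->lvt;
    for (int i = 0; i < lp->npending; i++)
        if (lp->pending[i] < m) m = lp->pending[i];
    return m;
}

int main(void) {
    lp_ack_state lp1 = { 10.0, { 0 }, 0 };
    on_send(&lp1, 5.0);               /* message with time stamp 5 in flight */
    printf("reported min = %.1f\n", report_local_min(&lp1));   /* 5.0  */
    on_ack(&lp1, 5.0);                /* acknowledgment arrives              */
    printf("reported min = %.1f\n", report_local_min(&lp1));   /* 10.0 */
    return 0;
}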
The main problem with message acknowledgments is that they require too many message transmissions. They double the message count and overload the network if the simulation is dominated by communication overhead. They can also increase the message transmission time, depending on the underlying network organization. In a shared memory architecture, they can result in lower cache utilization because of the added memory pressure.
Samadi’s GVT algorithm utilizes message acknowledgments while Mattern’s GVT
14
Fig. 2.8. Prevention of transient message problem using message acknowledgements.
uses a set of message counters and a control message to guarantee that every message is accounted for in the GVT computation. These algorithms are described in detail in the Global Virtual Time Algorithms chapter.
2.1.4 Simultaneous Reporting Problem
Consider a simple GVT algorithm which requires LPs to report their LVTs without stopping and then computes the minimum among those values. This global minimum would still not be a correct GVT value. Even when the transient message problem is handled, the simultaneous reporting problem arises because LPs do not report their LVTs at precisely the same instant in wallclock time. This can result in some messages not being accounted for by either the sender or the receiver LP, creating a scenario where some messages "slip between the cracks". Accounting for unprocessed messages in the system becomes more complicated if LPs are allowed to process events while the GVT computation is in progress.
Figure 2.9 shows an example of the simultaneous reporting problem. After LP 2 reports its LVT to the leader (shown as a dashed arrow), it receives a straggler message with a time stamp of 15. This message is not considered during the GVT computation, so the GVT is found to be 20 (min(20, 30)), although it should have been 15.
Fig. 2.9. Simultaneous message problem.
2.2 ROSS Simulator
Rensselaer’s Optimistic Simulation System (ROSS) [3] is used as the base simu-
lator for our studies. ROSS is a state-of-the art PDES simulation environment that
supports both conservative and optimistic synchronization. In the optimistic mode,
ROSS uses reverse computation [4] in place of state saving to rollback to a safe state
upon a straggler message. The default GVT algorithm in ROSS is a barrier based
synchronous implementation which is described in the next chapter in detail.
The original ROSS implementation utilized processes which communicated with
message passing using the MPI library. In our studies, we use a multi-threaded
version of ROSS [43] in order to effectively exploit the shared memory available
on the Knights Landing processor. In this version, processes are implemented as
threads which would require no expensive MPI-based communication; thus, directly
exploiting the shared memory.
Simulation tasks are executed repeatedly in the core simulation loop. Threads execute four main simulation tasks in this tight loop in parallel. At each iteration, threads first read the messages they have received, process them and generate new messages, send those messages to the appropriate threads, and participate in the GVT computation, as shown in Algorithm 1.
ALGORITHM 1: Simulation Core Loop
1 while PE → GVT < simulation end time do
2 event e read = read message(PE → event queue)
3 event e new = process message(e read)
4 send message(e new)
5 participate gvt(PE)
6 end
In each iteration, threads can read, process and send messages up to a predefined constant. This constant is called the batch size. Furthermore, participation in the GVT computation is controlled by the GVT interval constant: threads start the computation every GVT interval iterations, unless memory is exhausted. In that case, freeing memory becomes crucial for the simulation to proceed, and the GVT computation can be triggered explicitly. Threads execute the core simulation loop until the GVT reaches the predefined simulation completion time. The data structures shared between threads are protected by fine-grained mutex locks, condition variables, pthread barriers or atomic operations.
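The control flow described above can be sketched as follows (a simplified stand-in rather than the actual ROSS code; the helper routines are trivial stubs and the constants are illustrative):

#include <stdio.h>

/* Illustrative constants; our experiments use a batch size of 8 and
 * GVT intervals of 128, 200 or 400.                                  */
#define BATCH_SIZE    8
#define GVT_INTERVAL  128

/* Trivial stand-ins for the real simulation routines. */
static double gvt = 0.0;
static void   process_batch(int n)  { (void)n;    /* read/process/send events */ }
static void   participate_gvt(void) { gvt += 1.0; /* pretend the GVT advances */ }

static void simulation_core_loop(double end_time) {
    int since_gvt = 0;
    while (gvt < end_time) {
        /* Read, process and send at most BATCH_SIZE events per iteration. */
        process_batch(BATCH_SIZE);

        /* Join the GVT computation every GVT_INTERVAL iterations (or
         * immediately if the free event pool were exhausted).          */
        if (++since_gvt >= GVT_INTERVAL) {
            participate_gvt();
            since_gvt = 0;
        }
    }
}

int main(void) {
    simulation_core_loop(10.0);
    printf("final GVT = %.1f\n", gvt);
    return 0;
}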
In ROSS, simulation structures are organized into three categories. At the highest level, Physical Entities (PEs) execute the core simulation loop and call subroutines for the main simulation tasks. Each PE is serviced by a POSIX thread. PEs manage local queues for incoming and outgoing messages and perform garbage collection. Threads execute in parallel on different cores. Each POSIX thread is pinned to a single core with CPU affinity, with threads assigned to cores in a round-robin fashion after they are spawned.
Each PE has local variables indicating its Local Virtual Time and Global Virtual Time. PEs also have a pointer to the currently processed event and bookkeeping information for participating in GVT computation and other simulation tasks. Four main data structures are managed locally by each PE: the event queue, the priority queue, the cancelled queue and the free queue. The event queue is a linked list of events sent to this PE. The priority queue is used to sort the received events according to their time stamps. The cancelled queue is a linked list of cancelled events, and the free queue holds the list of free events.
PEs also have a linked list of the Logical Processes (LPs) and Kernel Processes (KPs) that they service. Each LP holds the state of a simulated entity; for example, the state of an airport (capacity, congestion, or weather situation) would be encapsulated in an LP. KPs are responsible for garbage collection: each KP holds a list of processed events for the collection of LPs it services. In the ROSS framework, garbage collection is referred to as fossil collection, in keeping with PDES terminology. Fossil collection is performed by each KP by clearing its processed-events list at each GVT computation. The numbers of LPs and KPs are configured at compile time, and they are executed sequentially by the PE (thread) that services them. The organization of PEs, LPs and KPs is presented in Figure 2.10.
Fig. 2.10. Hierarchy of simulation structures in ROSS.
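A header-style C sketch of this hierarchy (the field names are illustrative, not the actual ROSS definitions) summarizes the roles of PEs, KPs and LPs:

/* Simplified view of the ROSS simulation hierarchy; the real ROSS
 * structures carry many more fields.                               */

typedef struct event event;          /* time-stamped message               */

typedef struct lp {                  /* Logical Process: one model entity  */
    void *state;                     /* e.g. an airport's capacity or load */
} lp;

typedef struct kp {                  /* Kernel Process: fossil collection  */
    lp    *lps;                      /* the LPs this KP services           */
    int    nlps;
    event *processed_events;         /* history cleared at each GVT        */
} kp;

typedef struct pe {                  /* Physical Entity: one POSIX thread  */
    double lvt, gvt;
    event *event_queue;              /* events sent to this PE             */
    event *priority_queue;           /* received events sorted by time     */
    event *cancelled_queue;
    event *free_queue;
    kp    *kps;                      /* 32 KPs per PE in our experiments   */
    int    nkps;
    lp    *lps;                      /* 128 LPs per PE in our experiments  */
    int    nlps;
} pe;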
Communication between PEs is implemented through a shared data structure called mt out q. During each iteration of the core simulation loop, processing an event causes a new event to be generated. This new event is sent to its destination by being written into mt out q, where the id of the destination PE is the index. The total number of PEs determines the size of mt out q. Each cell of mt out q holds a pointer to a thread-local event queue named inq, and a mutex lock. This lock protects only the accesses issued to that single cell of mt out q.
When an event e is written into mt out q[destination PE id], the destination PE receives it instantly because the pointer in the updated cell points to the inq of the destination PE, as shown in Figure 2.11. When a receiver PE wants to read its messages, it pops events from its inq one by one and pushes them into its event queue. Popping from the inq is protected by the lock stored at mt out q[receiver PE id]. This is an efficient implementation: instead of locking an entire data structure for each write and read, only a single index is locked. Furthermore, message transmission is abstracted from the event buffers, so PEs do not need to lock them during the core simulation tasks.
Fig. 2.11. Communication architecture in single node ROSS
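The per-cell locking can be sketched as follows (a simplified model of the scheme described above, with hypothetical names; events are pushed and popped in LIFO order purely for brevity, and each lock must be initialized with pthread_mutex_init at startup):

#include <pthread.h>
#include <stddef.h>

typedef struct event { struct event *next; double ts; } event;

/* One cell per PE: a pointer to that PE's thread-local inq plus the
 * lock that protects accesses to this single cell only.             */
typedef struct {
    event          **inq;
    pthread_mutex_t  lock;
} out_cell;

static out_cell mt_out_q[256];       /* sized by the total number of PEs */

/* Sender side: push event e onto the destination PE's inq. */
static void send_event(int dest_pe, event *e) {
    out_cell *c = &mt_out_q[dest_pe];
    pthread_mutex_lock(&c->lock);
    e->next = *c->inq;
    *c->inq = e;                     /* the destination sees it instantly */
    pthread_mutex_unlock(&c->lock);
}

/* Receiver side: pop one event from my own inq, if any. */
static event *receive_event(int my_pe) {
    out_cell *c = &mt_out_q[my_pe];
    pthread_mutex_lock(&c->lock);
    event *e = *c->inq;
    if (e != NULL)
        *c->inq = e->next;
    pthread_mutex_unlock(&c->lock);
    return e;                        /* caller pushes it into its event queue */
}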
2.3 PHOLD Benchmark
PDES simulation have to be driven by the benchmarks. The most popular and
versatile benchmark for evaluating PDES, is the classical Phold model. Phold is a
synthetic benchmark that allows characterization of the performance of applications
under different scenarios. For example, it allows control of the percentage of events
generated locally to the same core and the percentage of events generated for the
other cores, thus requiring inter-core communication and delays.
Phold can also be used to alter the event processing granularity (EPG) to control
how much CPU processing is required for each event. As a result, this allows us to
evaluate systems with different computation/communication balance (by varying the
EPG) and with different execution locality patterns (by varying the percentage of
remote events).
The Phold benchmark starts by initializing each LP with a number of events. Processing an event amounts to picking a destination LP according to some algorithm (for example, randomly), sending a message to that destination LP, and incurring the EPG delay. Upon receiving a message, the destination LP picks another destination, and so on. Therefore, the number of events in the system remains constant. One can load some cores more than others by choosing LPs residing on those cores as message destinations more often than other LPs.
In our experiments, we created imbalanced scenarios in terms of communication and event processing by assigning more message load or different EPG delays to some of the LPs.
For our experiments, we assigned 128 LPs and 32 KPs to a single PE, which is handled by a hardware thread. Each LP is initialized with one starting event. PEs process a maximum of 8 events at once before handling GVT and receiving new events. GVT computation is initiated after every GVT interval iterations of the core simulation loop. PEs are assigned a lookahead value of 1, which determines the time stamp of a newly generated event based on the last processed event. For example, after a PE processes an event with time-stamp 15, it generates a new event and sets its time-stamp to 16.
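A compact sketch of the Phold event handler in C is shown below (illustrative only; the destination choice and the EPG delay loop are simplified, and the parameter names are hypothetical):

#include <stdlib.h>

#define LOOKAHEAD      1.0
#define REMOTE_PERCENT 10     /* chance that the new event leaves this PE */

/* On processing an event at virtual time lvt, burn roughly epg units of
 * work, pick a destination LP, and schedule the next event at
 * lvt + LOOKAHEAD.                                                       */
static void phold_process(int my_pe, int num_pes, int lps_per_pe,
                          double lvt, int epg,
                          double *out_ts, int *out_dest_lp) {
    /* EPG delay: roughly epg floating-point operations of busy work. */
    volatile double sink = 0.0;
    for (int i = 0; i < epg; i++)
        sink += 1.0;

    /* Pick the destination: with probability REMOTE_PERCENT the event
     * targets an LP on an arbitrary PE, otherwise it stays local.     */
    int dest_pe = (rand() % 100 < REMOTE_PERCENT) ? rand() % num_pes : my_pe;
    *out_dest_lp = dest_pe * lps_per_pe + rand() % lps_per_pe;

    /* New event's time stamp: last processed time stamp + lookahead,
     * e.g. an event processed at time 15 yields a new event at 16.   */
    *out_ts = lvt + LOOKAHEAD;
}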
2.4 Intel Xeon & Xeon Phi Architectures
The Intel Knights Landing [8,42] is the second generation of the Many Integrated Core (MIC) architecture, designed to be used both as a standalone processor and as a co-processor for High Performance Computing (HPC) applications. The Knights Landing and Knights Corner (first-generation MIC) architectures together make up Intel's Xeon Phi family.

Knights Landing processors feature up to 72 cores, each capable of executing 4 simultaneous threads. The cores run at a maximum frequency of 1.3 GHz and can achieve better than 6000 Gflop/s single precision and 3000 Gflop/s double precision when the vector processing units are fully utilized.
A major upgrade over the Knights Corner architecture, Knights Landing adds branch prediction and out-of-order execution logic to each core. The number of Vector Processing Units (VPUs) has been increased to two per core. Also, a 1 MB L2 cache is now shared between every core pair, forming a tile. Significantly, KNL systems now include 16 GB of on-package high-bandwidth memory (HBM), which can be used as an L3 cache for the off-package DDR4 memory or as the sole memory. Our Knights Landing processor has 64 cores and 96 GB of DDR4 memory. A high-level diagram of the Knights Landing architecture used for this study (not including the DDR4 memory) is shown in Figure 2.12.
The KNL processor is commonly socketed and utilized as a standalone CPU, as is the case in our experimental system, whereas the Knights Corner processor had to be used as an accelerator.
Fig. 2.12. Intel Knights Landing Architecture
KNL runs standard Linux distributions as a full host computer, thus eliminating the idiosyncrasies of accelerator interfacing.
2.5 Experimental Setup & Parameters
Table 2.1 summarizes the configurations and hardware details of the host Xeon processor and the Xeon Phi Knights Landing (KNL) processor. Note that the Xeon system is only used for comparison purposes in the results presented in Section 4.1.
For all presented experiments, we execute the multi-threaded version of the ROSS simulator driven by the Phold benchmark.
Platform        Xeon                      KNL
Model           E5-2620                   7230
Frequency       2.40 GHz                  1.3 GHz
# of Cores      12                        64
Memory Type     DDR4 2133                 DDR4 2400
Memory Size     60 GB                     96 GB + 16 GB
OS              CentOS 6.6                CentOS 7.2
Compiler        GNU Compiler Toolchain    GNU Compiler Toolchain
Table 2.1. Details of Experimental Platforms
In the Phold benchmark, we vary the thread count, the percentage of remotely generated events, the event processing granularity (EPG) and the loading of threads in terms of communication or computation. The EPG represents the amount of work required to process a single event, and is specified in units approximately equal to 1 FLOP per unit. This artificial event processing delay enables us to create scenarios where computation dominates over communication.
Our goal is to understand the behavior and scaling trends of the ROSS simulator while executing on a single Knights Landing node with different GVT algorithms. We also compare the performance of the KNL processor against a 12-core Xeon processor to understand the trade-off between a large number of smaller cores and a smaller number of more powerful cores.
We consider both balanced and imbalanced Phold models in our evaluations. In a balanced model, a destination LP is chosen randomly, and every LP sends and receives about the same number of messages during the course of the simulation. Moreover, all LPs are delayed by the same EPG overhead. For the model that is imbalanced in terms of communication, LPs are divided into four groups. An LP in the first group sends messages to any LP, an LP in the second group sends to the first half of the LPs, an LP in the third group sends to the first 30% of the LPs, and an LP in the last group sends messages to the first quarter of the LPs. Figure 2.13 depicts the communication pattern between the four groups.
Fig. 2.13. Imbalanced Communication
As a result, while some LPs can be destinations regardless of the source (and therefore receive a larger number of messages), other LPs receive very few messages. While this model is simplistic, it allows us to gauge the performance of systems where the threads are not equally loaded. The model that is imbalanced in terms of event processing is created by assigning varying EPG delays to the LPs. Therefore, each core executes a different number of instructions due to a heavier processing time for a single event, although the number of events processed is about the same.
We report the performance results in terms of committed events per second. As we increase the number of processing nodes (threads), we maintain the number of starting events per node, thus proportionately increasing the total number of events generated by the simulator. If the underlying system is capable of efficiently keeping up with this load without incurring additional delays, we can expect the committed event rate to show improvements commensurate with the increase in the number of nodes. This is known as weak scaling [2]. A thorough summary of the simulation parameters is shown in Table 2.2.
Variable            Value                          Description
Architecture        Xeon, KNL                      Processor model
GVT                 Barrier, Mattern, Wait-Free    Global Virtual Time algorithm
GVT Interval        128, 200, 400                  GVT computation frequency
Remote %            0, 10, 50, 100                 Proportion of events sent outside of the core
EPG                 0, 100, 500                    Event Processing Granularity
Simulation Model    Balanced, Imbalanced           Overloading cores in terms of communication or computation
CPU Affinity        Round Robin                    Scheduling of threads to cores
# PE                # CPUs                         Physical Entities: POSIX threads
# LP                128 * # PE                     Logical Processes: simulation objects
# KP                32 * # PE                      Kernel Processes
Initial Events      1 * # LP                       Number of events to start the simulation
Look-ahead          LVT + 1                        Time stamp of the new event
Table 2.2. Simulation Parameters
Chapter 3
GLOBAL VIRTUAL TIME ALGORITHMS
In order to support recovery to a safe state upon a rollback in an optimistic simulator, event histories must be saved during the simulation. These histories can grow large over time and need to be garbage collected periodically to prevent memory exhaustion. To achieve this, the Global Virtual Time (GVT) is periodically computed to determine the earliest conceivable time to which a rollback may ever be required. It is essentially the greatest lower bound on the local virtual times of all PEs and the time stamps of all in-flight messages. GVT algorithms come in two principal flavors: synchronous and asynchronous.
3.1 Synchronous GVT
A synchronous implementation essentially follows the “stop-synchronize-and-go”
model where threads periodically stop processing, wait until all transient messages
arrive, collectively compute the new GVT value and then proceed again. In a thread-
based implementation of PDES, synchronous GVT computation utilizes a pthread
barrier. Synchronous implementations may be inefficient when threads arrive at the
barrier at different times. The threads that arrive early must wait until the slower
threads catch up. This cycle repeats at each GVT round.
The periodic synchronization of the PEs limits the optimism of ROSS. Faster PEs have to be stalled, resulting in idle CPUs which do no useful simulation work until the slower PEs catch up. Therefore, synchronous GVT algorithms impose indirect conservatism on optimistic simulation kernels.
3.1.1 Barrier GVT Algorithm
The Barrier GVT algorithm is fundamentally synchronous since it utilizes pthread barriers. As shown in Algorithm 2, each PE executes a barrier call in a tight loop until the number of transient messages is found to be zero. The barrier call collects the local message counters from each PE as input and reduces the local counters to a single sum. This sum represents the number of in-transit messages in the system and is returned to each PE. When the reduced sum is zero, each PE breaks out of the loop and synchronizes one last time to compute the minimum LVT across all PEs. At this point there are no in-flight messages, so a new GVT value can be computed by reducing the LVTs to a single minimum value. Once all the PEs receive the new GVT, they fossil collect and leave the GVT routine.
ALGORITHM 2: Barrier GVT Algorithm
// PEs loop until there are no more in transit messages
1 while 1 do
2 int msg counter = PE → msg sent - PE → msg received
3 int msg intransit = sum barrier(msg counter)
4 if msg intransit == 0 then
5 Break
6 end
7 end
// No more in transit messages at this point
8 int new GVT = min barrier(PE → LVT)
9 PE → GVT = new GVT
10 fossil collect(PE)
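A minimal pthread sketch of the sum and min reductions used in Algorithm 2 is given below (a simplified stand-in for the sum barrier and min barrier helpers, assuming the barrier has been initialized once for NUM_PE participating threads):

#include <pthread.h>

#define NUM_PE 4

static pthread_barrier_t bar;       /* initialized once with count NUM_PE */
static int    counters[NUM_PE];     /* per-PE (messages sent - received)  */
static double lvts[NUM_PE];         /* per-PE local virtual time          */

/* Every PE deposits its value, meets at the barrier, then reduces the
 * shared array locally; a second barrier keeps the array stable until
 * all PEs have read it.                                                 */
static int sum_barrier(int my_id, int my_count) {
    counters[my_id] = my_count;
    pthread_barrier_wait(&bar);
    int sum = 0;
    for (int i = 0; i < NUM_PE; i++)
        sum += counters[i];
    pthread_barrier_wait(&bar);
    return sum;                      /* number of in-transit messages */
}

static double min_barrier(int my_id, double my_lvt) {
    lvts[my_id] = my_lvt;
    pthread_barrier_wait(&bar);
    double m = lvts[0];
    for (int i = 1; i < NUM_PE; i++)
        if (lvts[i] < m) m = lvts[i];
    pthread_barrier_wait(&bar);
    return m;                        /* the new GVT value */
}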
A diagram of the barrier GVT computation is shown in Figure 3.1. Once the GVT computation begins, all PEs are blocked until all the messages have been received. Due to the nature of synchronous GVT computation, the transient message problem and the simultaneous reporting problem are eliminated, because the idle time during the barrier synchronization lasts until all of the transient messages (which could possibly cause a rollback) have been received.
Fig. 3.1. Snapshot of a Barrier GVT computation
3.2 Asynchronous GVT
In contrast, in asynchronous GVT algorithms, the GVT computation proceeds "in-line" with event processing, obviating the need to halt threads. In this approach, the GVT is computed in the background asynchronously, without interfering with the other simulation tasks. Asynchronous GVT algorithms do not block the PEs, thus yielding higher CPU utilization since PEs always do useful work.

However, a higher computational overhead may be involved, because some form of thread synchronization and management is required to coordinate the participation of threads in the GVT process. CPUs execute more instructions related to GVT computation than with synchronous GVT algorithms. This is examined in detail in the Profiling and Analysis subsection.
We implemented two different asynchronous GVT algorithms. One originates from Mattern's GVT algorithm [34], which relies on control messages and a locking mechanism. The second is a Wait-Free GVT algorithm [37], which maintains a set of phases to compute the GVT using atomic operations. We also explain Samadi's GVT computation [41], since it is a fundamental asynchronous algorithm; however, we did not evaluate it because it is intuitively less efficient than Mattern's GVT due to its message acknowledgments.
We implemented three versions of Mattern's GVT algorithm for evaluation purposes. First, we utilized mutex locks with a tree-based lock structure to reduce contention. Second, we took advantage of try-locks to measure the lock contention. Third, we used atomic operations instead of locks wherever they were suitable. We also implemented two versions of the Wait-Free GVT algorithm: one is the five-phase computation proposed in the original paper, and the other is our three-phase implementation. For our experiments, we chose the mutex-lock implementation of Mattern's GVT and the five-phase Wait-Free GVT because of their reliability and higher performance.
3.2.1 Samadi’s GVT Algorithm
Samadi’s GVT algorithm requires message acknowledgments to be sent on every
message sent between PEs [41]. The sender PE is responsible for accounting for each
message it has sent until it receives the acknowledgment, thereby solving the transient
message problem.
The simultaneous reporting problem is solved by having PEs tag any acknowledgment message that they send between reporting their LVT and receiving the new GVT value. This identifies messages that might "slip between the cracks" and notifies the sender PE to account for the message before reporting its LVT.
Specifically, Samadi’s asynchronous GVT computation progresses in five main
steps as shown below:
1. One of the PEs is chosen as the leader and at each GVT cycle, it broadcasts
a Report-LVT message to all other PEs in order to initiate the GVT computation.
2. Upon receiving the Report-LVT message, the PEs send their LVTs to the leader. Specifically, each PE sends a message indicating the minimum time stamp among 1) all unprocessed events in its event queue, 2) all unacknowledged messages and anti-messages it has sent, and 3) all marked acknowledgment messages it has received since the last new GVT. The PE then sets a flag indicating that it is in the find phase.
3. For each message or anti-message received by a PE while it is in the find phase, the PE sends a marked acknowledgment message indicating the time stamp of the message it is acknowledging. An unmarked acknowledgment message is sent for all messages received while not in the find phase.
4. When the leader receives a local minimum value from every PE in the system,
it computes the minimum of all these values as the new GVT and broadcasts it to all
the PEs in the system.
5. Upon receiving the new GVT value, each PE switches from the find phase back to the normal phase and continues the main simulation tasks until the next GVT round.
Figure 3.2 shows an example of Samadi's GVT computation. The acknowledgement for the message from PE 1 to PE 2 arrives after PE 1 reports its LVT, so PE 1 has to account for the time stamp 15 when it reports to the leader. If the acknowledgement had arrived before PE 1 reported, then accounting for the time stamp 15 would have been PE 2's responsibility.
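The marking rule can be sketched in a few lines of C (an illustrative fragment based on the description above, not Samadi's original pseudocode; all names are hypothetical):

#include <float.h>

typedef struct {
    int    in_find_phase;    /* set after reporting the LVT, cleared on new GVT */
    double marked_ack_min;   /* min time stamp carried by marked acks received  */
} samadi_pe;

/* Receiver side: every message is acknowledged; the acknowledgment is
 * marked if the receiver is currently in the find phase, flagging a
 * message that might otherwise slip between the cracks.               */
static int ack_is_marked(const samadi_pe *pe) {
    return pe->in_find_phase;
}

/* Sender side: a marked acknowledgment tells the sender to fold the
 * acknowledged time stamp into the local minimum it reports.          */
static void on_ack_received(samadi_pe *pe, double ts, int marked) {
    if (marked && ts < pe->marked_ack_min)
        pe->marked_ack_min = ts;
}

/* Reset when a new GVT value arrives and the PE leaves the find phase. */
static void on_new_gvt(samadi_pe *pe) {
    pe->in_find_phase  = 0;
    pe->marked_ack_min = DBL_MAX;
}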
3.2.2 Mattern’s GVT Algorithm
One drawback of Samadi's algorithm is that it requires an acknowledgment message to be sent for each message and anti-message.
Fig. 3.2. Snapshot of Samadi's GVT Computation
The underlying communication software may automatically send acknowledgments for reliable message delivery; however, such acknowledgments are typically not visible to the simulation kernel. Therefore, a PDES framework has to implement acknowledgment messages itself.
Matter’s GVT algorithm is also asynchronous like Samadi’s algorithm but it does
not require message acknowledgments. The fundamental idea behind Mattern’s GVT
is dividing the simulation into two parts with a cut : the past and the future. As
shown in Figure 3.3, a PE considers all event processing, messages sent and message
received before the cut point (in wall clock time) as having happened in its past. On
the contrary, a PE refers all the actions happened after the cut point as being in its
future.
Fig. 3.3. Cut divides simulation into two: past and future.
The set of cut points across all the PEs in the system defines the cut of the distributed simulation. At each GVT round, Mattern's algorithm creates two cuts across the PEs and computes the GVT based on the snapshot taken at the second cut. The purpose of the first cut is to notify each PE to start recording the smallest time stamp of any message it sends; these messages could cause a transient message problem if they cross the second cut, and therefore must be included in the GVT computation. The second cut is defined to guarantee that each message sent from the past of the first cut is received before the construction of the second cut. As a result, every transient message in the system (i.e., every message crossing the second cut) was sent after the first cut, and is therefore accounted for in the GVT computation because the PEs record the time stamps of such messages from the first cut onward.
PEs are colored based on where they are (virtually) with respect to the cut line, as shown in Figure 3.4. PEs are initialized as white and switch to red after their first cut point is reached. After the second cut, PEs return to white. White PEs mark the messages they generate as white, and red PEs mark them as red.
Fig. 3.4. Events before the cut line colored as white, after the cut line colored as red (dotted arrow).
By design, all white messages must be received prior to the second cut; thus, all transient messages crossing the second cut must be red. In Figure 3.5, the message depicted as a dotted arrow violates this rule. The set of messages crossing the second cut is a subset of all red messages. Therefore, the minimum time stamp among all the red messages is a lower bound on the minimum time stamp of all transient messages crossing the second cut.
Fig. 3.5. Second cut should stretch towards future so that there should be no message sent from the white phase and received in the consecutive white phase (dotted arrow).
The GVT is computed as the minimum of 1) the minimum time stamp among all red messages, and 2) the minimum time stamp of any unprocessed message in the snapshot defined by the second cut. These two quantities are stored locally at each PE, so computing their minimum is trivial. The challenge, however, is creating a second cut that no white message will ever cross. This requires guaranteeing that any message generated prior to the first cut is received prior to the second cut.
The first cut can be constructed by circulating a control message in a logical ring of PEs. The GVT round is initiated by a leader PE, which starts the circulation for the first cut. Upon receiving the control message, each PE changes its color from white to red and passes the control message to the next PE in the ring.
When the leader PE gets back the control message it sent at the beginning of the round, the first cut is guaranteed to have been constructed. During this process, each PE has to access the control message only once. After the leader PE receives the control message from the last PE in the ring, no new white messages will be generated.
The construction of the second cut is different. Again, the leader PE initiates it by sending the control message to the next PE in the ring. However, a PE will not forward the control message to the next PE until it can guarantee that it has received all the white messages destined for it (including the leader).
To implement this, each PE keeps an array of counters indicating the number of white messages it has sent to each PE. These arrays are accumulated into the control message as it circulates among the PEs during the construction of the first cut. After the first cut is constructed, the control message therefore carries the number of white messages sent to every PE.
During the construction of the second cut, a PE accesses the accumulated array
counters in the control message to compare how many white messages have been
sent to it in total to how many white message it actually received. When these two
numbers are equal, a PE will check that it received all of the white messages that
have been sent to it. Then, it can forward the control message to the next PE in the
ring.
Each PE maintains the following local variables:
• T min : Holds the smallest time stamp of any unprocessed message in the PE’s
event queue (same as LVT).
• T red : Holds the smallest time stamp of any red message sent by the PE.
• array counters : The array of counters indicating how many white messages
the PE has sent to each of the other PEs. The destination PE's id is the index into
the sender PE's array counters. PEs also count the number of white messages
they receive; this count is held at array counters[PE id].
• color : Current color of the PE, white or red.
The control message contains three fields:
• CM T min : Records the minimum of T min values among PEs that the con-
trol message has circulated thus far.
• CM T red : Records the minimum of T red values among PEs that the control
message has circulated thus far.
• CM array counters : The cumulative array of counters among PEs that the
control message has visited thus far. CM array counters[i] indicates the number
of total white messages sent to PE i.
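To make this bookkeeping concrete, the following C sketch shows one possible layout of the per-PE state and the control message; the type and field names are our own illustration and do not correspond to the actual ROSS data structures.

#define NUM_PE 256                      /* assumed maximum number of PEs */

typedef enum { WHITE, RED } pe_color_t;

/* Per-PE bookkeeping for Mattern's GVT (illustrative layout) */
typedef struct {
    int        id;
    double     T_min;                    /* smallest unprocessed time stamp (LVT) */
    double     T_red;                    /* smallest time stamp of any red message sent */
    long       array_counters[NUM_PE];   /* white messages sent per destination PE;
                                            array_counters[id] counts white messages received */
    pe_color_t color;                    /* current color of the PE: WHITE or RED */
} pe_state_t;

/* Control message circulated around the logical ring (illustrative layout) */
typedef struct {
    double CM_T_min;                     /* minimum of T_min over the PEs visited so far */
    double CM_T_red;                     /* minimum of T_red over the PEs visited so far */
    long   CM_array_counters[NUM_PE];    /* CM_array_counters[i]: total white messages
                                            sent to PE i, accumulated along the ring */
} control_message_t;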
Now we can describe Mattern's GVT algorithm. On each message send, if the
event's color is white, the PE increments its message counter for the destination PE
using array counters. If the event's color is red, the PE updates its T red. When a white
message is received, the PE increments its received message counter. If it is red, the
PE does nothing. These procedures are presented in Algorithms 3 and 4.
Algorithms 5 and 6 present the procedures executed when the first and second cut
points are reached, respectively. When a PE reaches its first cut point, it changes its
color to red, resets its T red and accumulates its array counters into the CM. When the
second cut point is reached, a PE waits until it has received all the white messages
destined for it. After that is checked, the PE updates the control message with its
T min and T red, forwards it to the next PE in the ring, and finally resets its counters
and continues the simulation.
Mattern’s GVT algorithm is designed specifically for the distributed memory
system. For this study, we adapted Mattern’s distributed GVT to make it more
suitable for the shared memory architecture in order to exploit the large number of
CPUs available in the KNL processor. Messaging between PEs are realized by writing
ALGORITHM 3: Message Send
// PE I sending event E with time stamp T to PE J
1 if E → color == white then
2 PE I → array counters[PE J → id] += 1
3 else
4 PE I → T red = min(PE I → T red, T)
5 end
6 send message(E)
ALGORITHM 4: Message Receive
// PE I receives event E with time stamp T
1 event E = receive message(PE I)
2 if E → color == white then
3 PE I → array counters[PE I → id] += 1
4 else
// Ignore
5 end
6 PE I → event queue.push(E)
ALGORITHM 5: First Cut
// PE reaches first cut point
1 if PE → color == white then
2 PE → T red = ∞
3 PE → color = red
// PE accumulates its message counters into CM’s message counters
4 for i = 0; i < #PE; i + + do
5 if PE → id != i then
6 CM → array counters[i] += PE → array counters[i]
7 end
8 end
// Forward the control message to the next PE in the ring
9 forward(CM, PE → id + 1)
10 else
// Assert
11 end
the event to the destination PE's event queue, as mentioned previously. Thus, what
would be transient messages in a distributed system correspond, in our shared memory
architecture, to events which have not yet been written to the target event queue.
ALGORITHM 6: Second Cut
// PE reaches second cut point
1 if PE → color == red then
// PE loops until it receives all messages destined to it
2 int key = PE → id
3 while 1 do
4 if PE → array counters[key] == CM → array counters[key] then
// All messages received
5 Break
6 end
7 receive message(PE)
8 end
// Update control message
9 CM → T min = min(CM → T min, PE → T min)
10 CM → T red = min(CM → T red, PE → T red)
// Forward the control message to the next PE in the ring
11 forward(CM, key + 1)
// Reset the array counters
12 CM → array counters[key] = 0
13 PE → array counters[key] = 0
14 else
// Assert
15 end
Instead of circulating the control message through a ring, we utilized a global
shared control structure. Each PE accesses this shared structure asynchronously.
During the construction of the first cut, each PE checks this structure only once.
For the second cut, however, a PE keeps checking it until it has received all the white
messages destined for it. Thus, instead of waiting in the GVT subroutine, it continues to execute
core simulation tasks. Both the control message (CM) and the control structure (CS)
have T red and T min fields for the same purposes.
The last PE that successfully checks the control structure at the end of the
second cut computes the GVT by taking the minimum of CS T red and CS T min
and writes it to a global variable. After the new GVT value has been computed, each
PE must read it. PEs do not read it from the control structure; instead, it is held in
a global variable to be read at the end of each GVT round. Once a PE has read the
new GVT, it fossil collects and changes its color back to white. After the predefined
GVT interval, each PE becomes red again (during the construction of the first cut) and
the process repeats.
We also optimized Mattern’s GVT algorithm in terms of memory space. Instead
of using an array of counters, a PE in our implementation holds a single variable
to count how many white messages it has sent and received without considering the
destination or source PE. Also, the control structure holds a single counter instead
of the array of counters to accumulate the counters among the PEs.
A PE decrements its counter when it receives a white message and increments
it when it sends one. During the second cut, this counter is accumulated into the
control structure, and each PE checks whether the control structure's counter is 0. If
the check succeeds, the PE knows that all white messages have been received, and it
updates the control structure's T red and T min with its own T red and T min. If the
check fails, the PE leaves the GVT routine and checks again in the next iteration of the
core simulation loop.
A timing diagram of the asynchronous algorithm is shown in Figure 3.6. For
clarity, assume that all PEs change their phases and check the control structure in
order (this assumption is not necessary in practice, but it simplifies the explanation).
Messages are shown as arrows. The sending (+1) and receiving (-1) white events
are counted locally by each PE as shown. After the transition to the red phase, the
counts are accumulated at the control structure.
The first PE which checks the control structure sees it as 1. This is shown in
the form of a white circle on the first line. Then, the second PE checks the control
structure and it also sees it as 1 since it has no event counts to accumulate. Then,
the third PE with the message count of -2 checks the control structure and updates
it from +1 to -1. This is depicted as another white circle, implying that some events
which are not yet written to the destination event buffer may still exist. Finally, the
fourth PE arrives and accumulates its +1 event count with the control structure and
checks it successfully (sees the counter as 0). This is shown as a black circle at the
bottom line, implying that this PE accumulated the time stamp of its minimum red
message and its LVT.
Fig. 3.6. Snapshot of Mattern's GVT computation
Once a PE reads the control structure as 0, it is guaranteed that all sent events
have been written to the destination PEs' event queues. Thus, the GVT computation
can be performed at this point. The PEs accumulate their minimum red message
timestamps and LVTs into the control structure as they pass the black circles. The
last PE that reaches the black circle computes the GVT by taking the minimum of
the control structure's T red and T min. At this point, the control structure holds the
LVT of the fourth PE since it has the smallest timestamp. However, the red event
from the first PE has an even smaller timestamp. Therefore, the GVT is set to the
timestamp of that event. Finally, all PEs read this new GVT value, turn their color
into white, and start counting events again.
The pseudo-code of our modified Mattern’s GVT algorithm for shared mem-
ory systems is presented in Algorithms 7, 8 and 9. The updated Message Send and
Message Receive procedures are shown in Algorithms 7 and 8, respectively. The
previously separate First Cut and Second Cut functions are incorporated into
the GVT function as shown in Algorithm 9. Since this modified implementation is
through shared memory, the control structure is not forwarded anymore. Instead, it
is implemented as a global shared structure.
ALGORITHM 7: Modified Message Send
// PE I sending event E with time stamp T to PE J
1 if E → color == white then
2 PE I → msg counter += 1
3 else
4 PE I → T red = min(PE I → T red, T)
5 end
6 send message(E)
As seen in Algorithm 9, lines 1 through 6, the message counters are no longer
accumulated during the first cut. Instead, during the second cut, each PE updates
the control structure with its message counter and checks if the updated value is 0
(lines 18 and 19). If the check succeeds, then the PE updates the control structure
one more time to write its T min (LVT) and T red into the control structure. Also,
the lastPE notation used in lines 4, 12 and 22 relies on a shared counter that counts how many
ALGORITHM 8: Modified Message Receive
// PE I receives event E with time stamp T
1 event E = receive message(PE I)
2 if E → color == white then
3 PE I → msg counter -= 1
4 else
// Ignore
5 end
6 PE I → event queue.push(E)
of the PEs have finished the associated part of the algorithm. Concurrent accesses to this
counter are protected using three different approaches, as discussed next.
We investigated different ways to cope with the lock contention. The first one
uses a tree of mutex locks to reduce the contention, the second one uses try locks to
measure the contention, and the third one utilizes atomic operations. The first one is
used for our experiments because it yields better overall performance and has been
tested more extensively. These versions are explained as follows:
1. Lock Partitioning: Concurrent updates to the shared counters are serialized
using a tree based lock structure. Lock partitioning is implemented to prevent all the
PEs from competing for a single lock. Instead, groups of PEs compete for their
associated group lock and once a group is done, their group flag is turned on. When
all the groups are complete, an accumulated update is written to the shared counter.
2. Try Lock: The critical section is protected by a mutex try lock. A PE first
tries to acquire the lock and if it fails, it acquires the lock using a regular mutex lock.
If the try lock succeeds, the PE executes the critical section and releases the lock. We
keep track of how many times the try lock fails. This approach helps us to evaluate
the lock contention.
3. Atomic Operations: Atomic operations are hardware instructions that en-
ALGORITHM 9: Matter’s GVT Algorithm for Shared Memory Architectures
1 if PE → color == white then
// First cut point reached
2 PE → T red = ∞
3 PE → color = red
4 if lastPE then
5 GVT ready = False
6 end
7 else
8 if GVT ready == TRUE then
9 PE → GVT = new GVT
10 PE → color = white
11 fossil collect(PE)
12 if lastPE then
// Reset the control structure
13 CS → T min = ∞
14 CS → T red = ∞
15 CS → msg counter = 0
16 end
17 else
// Second cut point reached
18 CS → msg counter += PE → msg counter
19 if CS → msg counter == 0 then
// Update the control structure
20 CS → T min = min(CS → T min, PE→ T min)
21 CS → T red = min(CS → T red, PE→ T red)
22 if lastPE then
// GVT is ready
23 new GVT = min(CS → T min, CS → T red)
24 GVT ready = TRUE
25 end
26 end
27 end
28 end
able concurrent accesses to the shared variables without locking them. Specifically,
the GCC intrinsics __sync_add_and_fetch and __sync_bool_compare_and_swap are
utilized to remove the mutex locks. However, not all of the locking was suitable for
replacement by atomic operations; coarse critical sections remained unchanged. As a
result, we measured the best performance with Lock Partitioning, even though the
Atomic Operations approach is expected to provide higher concurrency than a locking
approach.
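To illustrate the last two approaches, the sketch below shows how a shared counter update could be protected with a pthread try lock (falling back to a blocking lock while recording the contention) or, alternatively, performed with the GCC atomic built-ins. The variable and function names are ours; this is a simplified stand-in rather than the actual implementation, and the two functions represent alternative strategies, not code meant to be mixed on the same counter.

#include <pthread.h>

static pthread_mutex_t cs_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;      /* e.g., the control structure's message counter */
static long trylock_failures = 0;    /* how often the try lock found the lock already held */

/* Approach 2 (Try Lock): attempt the lock first; on failure, record the
   contention and fall back to a regular blocking lock. */
void add_with_trylock(long delta) {
    if (pthread_mutex_trylock(&cs_lock) != 0) {
        __sync_add_and_fetch(&trylock_failures, 1);
        pthread_mutex_lock(&cs_lock);
    }
    shared_counter += delta;         /* critical section */
    pthread_mutex_unlock(&cs_lock);
}

/* Approach 3 (Atomic Operations): lock-free update using a GCC built-in. */
long add_with_atomics(long delta) {
    return __sync_add_and_fetch(&shared_counter, delta);
}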
The GVT interval is a predefined parameter which sets the gap between two
consecutive GVT calculation rounds. Ideally, the interval can be shorter at high
remote percentages to reduce rollbacks, and longer at low remote percentages to
reduce the computational overhead since rollbacks are less likely. A GVT computation
round is signalled when a global interval counter reaches zero. This causes each PE
to transition from its normal white phase to the red phase. A GVT interval of
128 is chosen, but it is possible to initiate the GVT round before the interval counter
reaches zero: during event processing, if PEs run out of free event buffers, we can
force a GVT update to fossil collect immediately.
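A minimal sketch of this triggering logic is shown below; the names and the reset policy are hypothetical and only illustrate the decrement-and-force behavior described above.

#define GVT_INTERVAL 128

static volatile long gvt_interval_counter = GVT_INTERVAL;

/* Hypothetical hook: in the real simulator this would switch the PEs to the
   red phase and start a new GVT round. */
static void begin_gvt_round(void) { /* ... */ }

/* Called from the core simulation loop (simplified: the PE that triggers the
   round also resets the counter). */
void maybe_start_gvt_round(int out_of_free_event_buffers) {
    if (out_of_free_event_buffers ||
        __sync_sub_and_fetch(&gvt_interval_counter, 1) <= 0) {
        gvt_interval_counter = GVT_INTERVAL;   /* reset for the next round */
        begin_gvt_round();
    }
}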
3.2.3 Wait-Free GVT Algorithm
Mattern’s algorithm does not take advantage of the shared memory, as it was
originally developed with messaging passing as its focus. In a shared memory system,
PEs can read the messages/anti-messages that have been sent to it instantly. Each
PE is in charge of managing its message queue which is populated by the events just
after they are sent. Thus, there is no in-flight messages across PEs. When PEs need
to process events, they read them from their message queue and insert into event
queue. One can think message queues as buffers and event queues as processing lines.
Message queues are named as inq in ROSS, as was explained in detail in the previous
43
chapter.
A GVT computation requires each PE to take the minimum time stamped
event in its event queue and calculate its local minimum by comparing it with its
local virtual time. The GVT is then computed by taking a global minimum
across the local minimums of all PEs. It becomes problematic if a message has been
written into the PE's message queue, but not yet inserted into its event queue. When
that PE seeks to compute the GVT, it will miss the event, which possibly has a lower
time stamp than its LVT or the minimum time stamped event in its event queue.
This problem is resolved in [37] by using five phases to ensure that every message
is accounted for: phase A, phase Send, phase B, phase Aware and phase End. Each
PE starts the GVT computation from phase A and computes its local minimum,
called min A. This is the minimum of 1) the PE's LVT and 2) the minimum time
stamped event in its event queue. When all PEs finish their phase A, they proceed to
phase Send. Here they incorporate messages into their event queue, execute one more
event and send output messages/anti-messages if there are any. This ensures that there
will be no message left in their message queue with the potential to become the new GVT.
Once each PE completes its phase Send, it proceeds to phase B and computes
a second local minimum, called min B. At this point, each PE's local minimum
is set to the minimum of min A and min B. In phase Aware, a global minimum is
computed across all local minimums and taken by each PE as the new GVT. Finally,
PEs move to phase End and become ready for the next GVT round.
Once all PEs complete the same phase, they can move on to the next phase. This
is controlled by atomic operations to prevent locking overhead and ensure correctness
of the GVT value. In contrast to Mattern's algorithm, threads are never blocked waiting
to acquire a lock, and they calculate the GVT in a wait-free fashion. Algorithm 10
presents the pseudo-code of the Wait-Free GVT computation.
ALGORITHM 10: Wait-Free GVT Algorithm
1 if PE → phase == A & GVT round initiated then
2 int min A = min(PE → LVT, min event(PE → event queue))
3 atomic add(phase counter A, 1)
4 PE → phase = Send
5 else if PE → phase == Send & phase counter A == # PE then
6 event e = read messages(PE)
7 execute messages(PE, e)
8 send messages(PE)
9 atomic add(phase counter send, 1)
10 PE → phase = B
11 else if PE → phase == B & phase counter send == # PE then
12 int min B = min(PE → LVT, min event(PE → event queue))
13 int min final = min(min A, min B)
14 min array[PE → id] = min final
15 atomic add(phase counter B, 1)
16 PE → phase = Aware
17 else if PE → phase == Aware & phase counter B == # PE then
18 new GVT = min(min array)
19 PE → GVT = new GVT
20 atomic add(phase counter aware, 1)
21 PE → phase = End
22 else if PE → phase == End & phase counter aware == # PE then
23 fossil collect(PE)
24 PE → phase = A
25 end
A timing diagram of the Wait-Free GVT is shown in Figure 3.7. In phase A,
each PE calculates its min A while sending and receiving messages. When the last
PE completes calculating its min A, all PEs proceed to phase Send and account for
messages that could possibly become the new GVT. In phase B, they compute min B
and find their absolute local minimum. When each PE completes this operation, the
new GVT is computed by one of the PEs and written to a global variable so that the
other PEs can take it and finish the GVT computation.
Fig. 3.7. Snapshot of a Wait-Free GVT computation
We also tried to optimize the Wait-Free GVT algorithm by implementing it using
three phases instead of five. In this version, computation starts with phase
Compute, where a PE incorporates messages from its message queue into its event
queue. Then, it updates its LVT and moves into phase Send when all other PEs
finish with phase Compute.
In phase Send, a PE writes its LVT into a global shared array indexed by its
id. The last PE that enters phase Send is responsible for computing the
minimum of the LVTs written into the shared array. That minimum becomes the
new GVT value and is written into a global shared variable. Finally, in phase Aware,
PEs read the new GVT value and switch back to phase Compute.
In phase Send, writes into the shared global array by threads that reside on
different cores trigger cache coherence traffic and cause false sharing. This impacts
performance significantly since a write from one core invalidates the entire cache
line in the other cores' first level caches, and the line has to be transferred again to
keep those caches consistent.
This problem can be fixed using padding. The global shared array then holds a
structure that fills an entire cache line (64 bytes), instead of holding a double
variable storing the LVT (8 bytes). This solution prevents the excessive invalidation of
cache lines. On the other hand, this approach wastes cache capacity since an entire
cache line is sacrificed for a single double variable.
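A sketch of this padding fix is given below, assuming 64-byte cache lines; the struct and array names are illustrative rather than taken from our code.

#include <stdalign.h>                 /* C11 alignas */

#define NUM_PE          256           /* assumed maximum number of PEs */
#define CACHE_LINE_SIZE 64

/* Each entry is aligned to, and fully occupies, its own cache line, so writes
   by threads on different cores never touch the same line. */
typedef struct {
    alignas(CACHE_LINE_SIZE) double lvt;          /* the 8 bytes of actual payload */
    char pad[CACHE_LINE_SIZE - sizeof(double)];   /* padding to fill the line */
} padded_lvt_t;

static padded_lvt_t lvt_array[NUM_PE];   /* shared array written during phase Send */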
For our experiments, we used the original implementation of the Wait-Free GVT
algorithm since it has been tested more extensively. We leave the study of this
modified Wait-Free GVT for future work.
Chapter 4
EXPERIMENTAL RESULTS
In this chapter, we present the results of our experiments with the three GVT
algorithms on both KNL and Xeon processors. We set a fixed GVT interval for each
algorithm for all the experiments. We selected the intervals that performed best across
our simulations. Specifically, we observed that a GVT interval of 128 was the best
overall choice for the Barrier GVT algorithm. For the asynchronous algorithms, a GVT
interval of 200 was used on Xeon and a GVT interval of 400 was used on KNL. While
the performance of the Barrier algorithm is dependent on the GVT interval, the
performance of the asynchronous algorithms is affected by it to a much lesser extent.
Another variable parameter is the percentage of messages that are sent to a
different thread. We call such messages remote. The opposite of remote would be
local, and such messages do not cross threads.
4.1 GVT Performance on 12-core Classical Xeon Machine
First, we present the performance trends for the three GVT algorithms (the Barrier
algorithm based on pthread barrier calls, Mattern's algorithm and the Wait-Free
algorithm) on a traditional 12-core Xeon processor with hyper-threading. We scale
these experiments up to 24 threads. Previous work [45] showed that overloading the
cores is counter-productive for performance, so we do not consider those scenarios here.
Fig. 4.1. Committed Event Rate on Xeon with Balanced Loading and 0 EPG: 10% Remote Events (left) & 100% Remote Events (right)
4.1.1 Model 1: Balanced Loading & Fast Event Processing
Figure 4.1 shows the committed event rate of ROSS for a simple PHOLD model
with 10% and 100% remote events respectively. The event processing granularity is
set to zero for these experiments, resulting in communication-dominated scenarios
with little event processing. As seen from these results, the asynchronous algorithms
significantly outperform the Barrier implementation, and the difference increases as
the simulation scales to 24 threads. We also observe that the Wait-Free GVT algorithm
is faster than Mattern's GVT by 30% and significantly faster than Barrier synchro-
nization. For example, the performance advantage of the Wait-Free algorithm over
the Barrier implementation for 24 threads is almost 50% and 48% for the cases with
10% and 100% remote events respectively. These trends are not surprising and allude
to the advantages of asynchronous GVT computations that allow the event processing
to continue without blocking the threads.
4.1.2 Model 2: Balanced Loading & Slower Event Processing
Fig. 4.2. Committed Event Rate on Xeon with Balanced Loading and 50% Remote Events: 100 EPG (left) & 500 EPG (right)
Figure 4.2 shows the performance of three GVT algorithms for a scenario with
balanced load, 50% remote events and high event processing granularity. For this, we
consider the EPG values of 100 and 500. As expected, GVT becomes less of a bottle-
neck with high event processing granularity (due to a more dominant contribution of
the event processing itself). Specifically, with EPG of 100, the Wait-Free algorithm
outperforms Barrier GVT by 19% for 24-threaded simulation. With EPG of 500, the
percentage difference drops to only 12%.
In summary, the behavior of the two classes of GVT algorithms on a 12-core
Xeon processor reflects conventional wisdom and shows substantial improvements of
asynchronous GVT computation, particularly in scenarios with fast event processing,
which are typical for PDES applications. The asynchronous algorithms, and especially
the Wait-Free implementation, outperform the Barrier GVT algorithm significantly.
These trends generally hold regardless of the percentage of events generated
remotely and regardless of the balance in the workload of each thread. The two classes
of algorithms perform closer to each other only at high EPG values, which make event
processing a major part of the simulation time, thus de-emphasizing the importance
of GVT efficiency.
In the next subsection, we analyze and compare the performance of these algo-
rithms on the KNL system and demonstrate quite different trends. We also explain
the reasons for this behavior.
4.2 GVT Performance on 64-core Knights Landing Architecture
First, we evaluate scenarios for a classical PHOLD model, where all threads are
loaded evenly and the EPG is set to zero. Our second model is also balanced but
experiences heavier computational overhead. The last two models are imbalanced in
terms of communication and event processing respectively. We present the results
for the remote percentages of 0%, 10%, 50% and 100%, and we show the committed
event rates.
In each graph, we present the results for five simulation scales: 1) 24-threaded
simulation to match the maximum number of threads that we used to collect the
results on the Xeon machine as described in the previous section; 2) 64-threaded sim-
ulation to put one thread on each KNL core; 3) 128, 192 and 250-threaded simulations
to put 2, 3 and 4 threads respectively on each KNL core. Six threads are reserved for
Slurm, the job management software deployed on our cluster. Since the KNL
cores are 4-way SMT, a 256-way simulation loads the chip to capacity.
4.2.1 Model 1: Balanced Loading & Fast Event Processing
The results presented in Figures 4.3 and 4.4 show a different trend compared
to what we observed on Xeon. While for most scenarios, Mattern’s GVT algorithm
Fig. 4.3. Committed Event Rate on KNL with Balanced Loading and EPG of 0: 0% Remote Events (left) & 10% Remote Events (right)
still outperforms the Barrier implementation, the difference in many cases is signifi-
cantly smaller than what we observed on the Xeon system. However, the Wait-Free
GVT algorithm continues to perform better than the synchronous algorithm. For
example, the Wait-Free algorithm is 30% faster than the Barrier implementation at
250-scale when we average over all remote percentages, while Mattern's GVT is 21%
faster in this case.
The key observation from these results is that even when Mattern's asynchronous
algorithm is faster than Barrier, the performance differences are significantly smaller
compared to those observed on a conventional Xeon machine, while the Wait-Free
implementation outperforms the other algorithms significantly. Consequently,
locking overhead becomes more critical when simulation is performed on a KNL pro-
cessor compared to when it is performed on a Xeon processor. Note that even if we
compare 24-way simulations on Xeon and KNL, the performance difference between
Barrier and asynchronous algorithms is much smaller on KNL.
In order to explain this performance disparity, it is instructive to compare de-
Fig. 4.4. Committed Event Rate on KNL with Balanced Loading and EPG of 0: 50% Remote Events (left) & 100% Remote Events (right)
lays involved in both algorithms and project the impact of scaling on these delays. In
the asynchronous algorithms, each PE updates its message counters whenever it sends
or receives a message to keep track of transient messages. This causes a computa-
tional overhead, especially at high scales with high remote percentages. In addition,
Mattern's algorithm involves thread serialization to determine that all conditions
for establishing the new GVT value are met by all threads. This requires locking
of shared variables and has a non-trivial performance impact, which worsens with
scaling. Especially at high remote percentages, failure to acquire locks is a major
overhead. At 250-scale and 100% remote events, the total number of locking failures
experienced by Mattern's GVT implementation is 5,347,302. This number drops to
4,536,272, 1,808,442 and 1,021,915 at 50%, 10% and 0% remote events respectively.
We analyze the impact of this using detailed profiling of the simulation in subsequent
sections.
None of these overheads apply to the Wait-Free GVT algorithm, which does not
need to count messages and is therefore computationally much more lightweight. Also,
thread serialization is realized through atomic operations, which makes the algorithm
wait-free. Threads are not blocked to acquire a lock and the whole simulation proceeds
faster. We observe that Wait-Free outperforms Mattern's algorithm by 7%, 30%, 40%
and 42% at 250-scale with 0%, 10%, 50% and 100% remote events respectively, when
event processing is fast.
For the Barrier implementation, the major overhead is its synchronous nature.
The barrier-based approach freezes all the PEs at the GVT barrier and waits until
all transient messages arrive, at which point the simulation continues. This peri-
odic stopping of the simulation detrimentally impacts performance since no message
transmission or processing is accomplished by any thread during the GVT computation
interval.
4.2.2 Model 2: Balanced Loading & Slower Event Processing
Figure 4.5 shows KNL performance for the models with higher event processing
granularity. We can observe that slower event processing is not as major an overhead
on KNL systems as it was on Xeon processors. Although the performance gap shrinks
compared to the fast event processing case, the Wait-Free implementation still out-
performs the other algorithms. At 250-scale and 100% remote events, the Wait-Free
GVT algorithm is 29% faster than Mattern's GVT and 31% faster than the Barrier
GVT computation.
4.2.3 Model 3: Imbalanced Communication
Figure 4.6 presents the results for the scenario with imbalanced loading, where
some threads are chosen as message destinations more often than others, as
explained in previous sections. The left side of the figure presents the committed
event rate for 10% remote events. We can observe that the Barrier GVT algorithm
outperforms the asynchronous algorithms at all scales. Wait-Free GVT exhibits an
advantage at small scale, but the advantage disappears at scales above 128 threads.
Fig. 4.5. Committed Event Rate on KNL with Balanced Loading and the EPG of 100: 10% Remote Events (left) & 100% Remote Events (right)
In fact, the Barrier algorithm outperforms the Wait-Free algorithm by 25% and
Mattern's GVT by 30%.
The right side of Figure 4.6 shows the ratio between the total number of events
and the number of committed events for the three algorithms. Here, the number of
committed events is kept strictly linear with respect to the number of threads, and is
the same for all algorithms. While the number of committed events is the same for all
algorithms, the total number of events for the Barrier implementation is almost 30%
less than for the asynchronous algorithms. This shows that the efficiency of the Barrier
algorithm (56%) is higher than that of the asynchronous GVT algorithms (40%). This
can be credited to the fact that the Barrier implementation performs a significantly
smaller number of rollbacks compared to the asynchronous implementations. Specifi-
cally, for the case of 250 threads, the Barrier implementation performs 11.7 million
rollbacks whereas the asynchronous implementations perform about 20 million.
The synchronous nature of the Barrier algorithm reduces the disparity between
LPs in imbalanced models, so that at high scales it outperforms the asynchronous
algorithms significantly. The reason is that, when some of the LPs receive more
messages, asynchronous GVT computation allows them to stay behind the LPs with
less message load, while the Barrier algorithm syncs the LPs periodically at every
GVT computation. In addition, the asynchronous implementations require larger
optimistic memory for imbalanced models, thus leading to more cache misses
and worse memory performance.
Fig. 4.6. Imbalanced in terms of Communication on KNL: Committed Event Rate (left) & Efficiency (right)
4.2.4 Model 4: Imbalanced Event Processing
The final scenario that we consider on KNL, for completeness of the presentation,
is the imbalanced model with changing EPG values per LP. This model is imbal-
anced by generating different event processing delays while keeping the communica-
tion structure balanced. We generate a uniformly random weight per LP whenever
it sends a message. This weight is then multiplied by an EPG constant to set varying
EPG values for each LP throughout the simulation.
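A small sketch of how such a per-message delay could be generated is given below; the EPG constant and the uniform range are placeholders, not the exact values used in our experiments.

#include <stdlib.h>

#define EPG_CONSTANT 100.0            /* placeholder base event processing granularity */

/* Draw a uniform random weight in [0, 1] and scale it by the EPG constant,
   so that each sent message carries a different processing delay for its LP. */
double next_epg_delay(void) {
    double weight = (double)rand() / (double)RAND_MAX;
    return weight * EPG_CONSTANT;
}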
Committed event rates for 100% remote events are presented in Figure 4.7 (left).
The trends here are very different from those observed in Figure 4.6, with the barrier-
based GVT demonstrating significantly worse performance than the asynchronous
algorithms. For example, at 250-scale the Wait-Free GVT is 47% faster than Barrier.
We observe that the asynchronous algorithms only suffer from imbalanced models in
terms of network communication, but not from imbalanced event processing delays.
This can also be seen in the efficiency graphs in Figure 4.7 (right). While the committed
event numbers are the same for the three algorithms, the number of total events is
higher for the Barrier implementation by almost 30% than for the asynchronous
algorithms. This shows that the barrier synchronization is a bottleneck for the
synchronous algorithm when the model experiences varying processing delays.
Fig. 4.7. Imbalanced in terms of Event Processing on KNL: Committed Event Rate (left) & Efficiency (right)
4.3 Profiling and Analysis
To further explain the behavior observed in the previous section, we isolated,
as much as possible, the GVT computation and analyzed the execution behavior of
both asynchronous and barrier-based GVT algorithms. This was achieved by running
the simulation with 0% remote messages and 0 EPG loading. Though executing
the simulation with no remote messages is a contrived example in the context of
PDES, it serves to eliminate node-to-node event communication leaving only GVT
communication and a consistent local event processing load for measuring simulation
performance.
As shown in Figure 4.3 (left), the asynchronous algorithms outpace Barrier in
performance as thread count increases for 0% remote messages with 0 EPG loading,
with Wait-Free GVT showing the highest performance. However, when examining
imbalanced loads as shown in Figure 4.6, we noted that Barrier is actually superior.
Using htop, a utility similar to top that includes per-thread visualization, we
noted that the asynchronous algorithms allow CPU saturation while the Barrier al-
gorithm does not. This is attributable to the mt_all_reduce functions, which require
all threads to block synchronously in the Barrier implementation. However, this
observation does not provide any quantitative insight into the superior performance
of Barrier GVT under imbalanced loads.
Therefore, we next analyzed GVT algorithm performance using the perf tool [22, 39].
Perf is a utility that collects performance counter information for examining program
performance. As with a profiler, the program to be analyzed is invoked within the perf
tool. However, no special compilation is required and the performance penalty is
much smaller than with a profiler.
To analyze the data in Figure 4.1, Figure 4.3, and Figure 4.6, three sets of perf
results are presented for comparison: Table 4.1 summarizes the perf results for
the Xeon processor with 24 threads using a balanced load. Table 4.2 summarizes
the perf results for the KNL processor with 128 threads using a balanced load.
Finally, Table 4.3 summarizes the perf results for the KNL processor with 250
threads using an imbalanced load.
Wait-Free Mattern Barrier Statistic
52574552.46 51008083.75 22749859.1 event rate (e/s)
25356.077865 15191.239883 27635.523529 task-clock (ms)
12,671 19,692 70,826 ctxt-switches
37 54 53 cpu-migrations
34,977 33,712 33,107 page-faults
63,538,414,687 37,066,920,255 2,701,184,297 cycles
45,098,656,437 21,933,431,830 21,713,987,634 instructions
0.71 0.59 0.66 insns per cycle
8,304,708,417 3,601,133,777 3,493,369,292 branches
169,630,569 169,375,821 172,940,784 branch-miss
12,934,781,849 5,600,674,689 5,579,363,865 L1-data-lds
474,672,636 480,858,921 484,695,938 L1-data-ld-miss
99,075,495 162,707,266 173,550,910 LLC-loads
3.114383734 2.650194459 3.412732663 seconds elapsed
Table 4.1. Performance statistics for Xeon: 24 Threads, 0% Remote, Balanced, 0 EPG Model
Table 4.1 shows that the asynchronous algorithms have significantly lower context
switch counts than the Barrier algorithm. This is likely the result of the pthread
blocking operations. Additionally, we observe that Wait-Free has approximately 60%
of the context switches that Mattern's GVT has, and only 20% of those of Barrier.
Additionally, though the Mattern and Barrier GVT instruction counts are similar,
Wait-Free GVT has more than double the instructions of the other two. It is likely
that this disparity is what limits the performance improvement of Wait-Free over
Barrier to roughly 2x.
Table 4.2 compares the asynchronous and Barrier algorithms on KNL at 128
threads. As with Xeon, the context switch counts on KNL are much higher for the
Barrier algorithm. However, we observe that Wait-Free has a substantially larger
drop in context switches relative to Mattern (3x) while maintaining a 5x advantage over
Wait-Free Mattern Barrier Statistic
68979714.96 53183960.11 49212670.5 event rate (e/s)
311887.145498 391356.018605 291648.869244 task-clock (ms)
86,057 253,594 492,515 ctxt-switches
135 150 136 cpu-migrations
117,911 118,154 120,667 page-faults
422.0 x 109 523.1 x 109 388.5 x 109 cycles
111.3 x 109 125.1 x 109 111.0 x 109 instructions
0.26 0.24 0.19 insns per cycle
18,033,898,740 21,479,750,643 17,788,136,078 branches
1,361,158,046 1,431,375,741 1,382,901,242 branch-miss
1,432,789,638 1,416,332,551 1,552,712,769 L1-data-ld-miss
11,564,734,810 11,365,247,641 10,242,960,482 LLC-loads
4.885092538 5.538352237 5.841830405 seconds elapsed
Table 4.2. Performance statistics for KNL: 128 Threads, 0% Remote, Balanced, 0 EPG Model
Barrier. However, the performance gain when measured in events per second is more
modest than that of the Xeon (1.3x).
Wait-Free Mattern Barrier Statistic
9737869.32 5842742.38 12409135.08 event rate (e/s)
8152856.669 13434589.436 1499214.161 task-clock (ms)
792,496 3,443,157 3,890,861 ctxt-switches
276 291 290 cpu-migrations
206,068 186,304 203,267 page-faults
10.78 x 1012 17.77 x 1012 1.977 x 1012 cycles
543.1 x 109 1204.6 x 109 385.5 x 109 instructions
0.05 0.07 0.19 insns per cycle
110,818,302,241 264,842,900,303 7,933,394,974 branches
9,081,487,324 18,515,600,126 7,493,233,316 branch-miss
10,763,809,635 16,757,683,847 3,692,152,524 L1-data-ld-miss
121,184,843,882 196,197,041,611 43,102,647,627 LLC-loads
36.255671304 57.883402944 29.163448318 seconds elapsed
Table 4.3. Performance statistics for KNL: 250 Threads, 10% Remote, Imbalanced, 0 EPG Model
Table 4.3 compares the asynchronous and Barrier algorithms on KNL at 250
threads with an imbalanced load. As in the previous examples, the context switch
counts on KNL are 5x higher for the Barrier algorithm when compared with Wait-Free.
We also observe that Mattern's GVT now has nearly as many context switches as
Barrier, and that the cache pressure is an order of magnitude lower for Barrier. This
may be the result of the substantial increase in branches and branch misses in the
asynchronous algorithms. Finally, we note that the achieved instructions per cycle is
4x higher with Barrier than with the asynchronous algorithms.
These findings are consistent with the lower efficiency reported in the ROSS
statistics. This confirms that though the asynchronous algorithms are faster, this
speed allows PEs to run farther ahead than under Barrier with optimistic operation
and thus results in more wasted work. The number of rolled back events confirms this
behavior. At 250-scale, Barrier experiences 236,477,644 rollbacks while Wait-Free and
Mattern's GVT algorithm experience 392,338,925 and 393,799,600 rollbacks respectively.
Chapter 5
LITERATURE REVIEW
GVT computation has been studied extensively in the literature, though primar-
ily in a distributed setting. Samadi [41] developed one of the first GVT algorithms
and introduced the transient message and simultaneous reporting problems. However,
that algorithm requires acknowledgement messages to be sent, causing extra commu-
nication overhead. Chandy and Lamport [5, 6] described one of the first distributed
snapshot algorithms. Mattern [34] built on top of that to develop an asynchronous
algorithm that does not require acknowledgement messages.
There has also been work to improve the performance of GVT on multiple cores.
The work by Ianni [26] developed a non-blocking algorithm for concurrent computa-
tion of GVT. In [31], the researchers developed an asynchronous algorithm for com-
putation of GVT. In [9], the authors developed a multicore GVT based on Samadi’s
algorithm for a simulator written in the Go language.
There has also been significant work investigating PDES on manycore architec-
tures. The works of [27, 43] investigated the effects of several optimizations to a
multithreaded PDES simulator on smaller-scale platforms such as Intel’s Core-i7 and
AMD’s Magny-cours.
PDES performance on the Tilera processor, whose architecture shares similar-
ities with KNL, was investigated in [28]. Those results show excellent scalability and
demonstrate that the interconnect network can sustain high throughput. However,
that work did not investigate alternate GVT implementations.
Another area of research involves removing boundaries on resource allocation,
in a “share-everything” system [25]. Such a system may allow a synchronous sys-
tem to compete with optimistic methods in unbalanced situations by shifting hard-
ware resources to more highly-loaded LPs. In addition, lock-free or wait-free event
queues [23] may improve performance in situations where remote percentages are
high.
The work of [1] is the follow-up to [2], reporting impressive event processing rates
on the Sequoia BlueGene/Q supercomputer. The recent effort of [7] evaluated PDES
performance on the Knights Corner processor. The main conclusion of [7] is that
Knights Corner does not outperform the host Xeon processor in terms of event rate
unless vector units are fully utilized, and increasing the number of threads does not
alter that trend. The reasons behind such sub-par performance are slower in-order
cores and limited amount of physical memory on the accelerator card.
Several other studies investigated the performance of various parallel applications
on Xeon Phi (Knights Corner) platforms [24, 32, 36, 38, 40, 46]. However, all of these
applications are very different from PDES and in general offer more parallelization
opportunities. Evaluating PDES on KNL provides insight into how similar fine-grain,
communication-dominated applications can be expected to perform on these
platforms.
Chapter 6
CONCLUSIONS AND FUTURE WORK
The GVT computation algorithm is an important component of a parallel discrete
event simulation system, and the choice of GVT algorithm often significantly impacts
the performance of PDES. In this thesis, we performed a systematic comparative
analysis of various GVT algorithms on systems such as a 12-core Xeon processor and
Intel's Knights Landing many-core processor. For balanced models, our results
corroborate the conventional wisdom that asynchronous GVT algorithms offer supe-
rior performance to blocking synchronous GVT. The opposite can be the case for
imbalanced models, where the synchronous nature of GVT limits the disparity of
forward progress among the logical processes. We also performed detailed simulation
profiling to understand the causes of these results.
Our future work will be the extension of this study to clusters of KNL proces-
sors. We aim to scale up to 8 (number of nodes) * 256 (CPUs per node) threads.
We will exploit recent advances in network technologies such as RDMA and In-
finiBand. We also consider developing a hybrid GVT algorithm that can exploit the
advantages of both synchronous and asynchronous approaches. In theory, a GVT
algorithm can adapt itself based on the simulation model and yield the best of the
two worlds. We are considering modifying Mattern's GVT algorithm by imposing
artificial synchronization when the average efficiency is below a certain threshold so
that we can throttle the disparity in imbalanced models.
REFERENCES
[1] Barnes Jr, P. D., Carothers, C. D., Jefferson, D. R., and LaPre, J. M. Warp speed: executing time warp on 1,966,080 cores. In Proceedings of the 2013 ACM SIGSIM conference on Principles of advanced discrete simulation (2013), ACM, pp. 327–336.
[2] Bauer, D., Carothers, C., and Holder, A. Scalable time warp on Blue Gene supercomputer. In Proc. of the ACM/IEEE/SCS Workshop on Principles of Advanced and Distributed Simulation (PADS) (2009).
[3] Carothers, C., Bauer, D., and Pearce, S. ROSS: A high-performance, low memory, modular time warp system. In Proc. of the 11th Workshop on Parallel and Distributed Simulation (PADS) (2000).
[4] Carothers, C. D., Fujimoto, R. M., and England, P. Effect of communication overheads on Time Warp performance: An experimental study. In Proc. of the 8th Workshop on Parallel and Distributed Simulation (PADS 94) (July 1994), Society for Computer Simulation, pp. 118–125.
[5] Chandy, K. M., and Lamport, L. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems 3, 1 (Feb. 1985), 63–75.
[6] Chandy, K. M., and Misra, J. Asynchronous distributed simulation via a sequence of parallel computations. Communications of the ACM 24, 11 (Apr. 1981), 198–206.
[7] Chen, H., Yao, Y., and Tang, W. Can mic find its place in the world of pdes? In Proceedings of the International Symposium on Distributed Simulation and Real Time Systems (DS-RT) (2015).
[8] Chrysos, G. Intel xeon phi x100 family coprocessor - the architecture. In Intel white paper (2012).
[9] D'Angelo, G., Ferretti, S., and Marzolla, M. Time warp on the go. In Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques (ICST, Brussels, Belgium, 2012), SIMUTOOLS '12, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), pp. 242–248.
[10] Das, S., Fujimoto, R., Panesar, K., Allison, D., and Hybinette, M. GTW: a Time Warp system for shared memory multiprocessors. In Proceedings of the 1994 Winter Simulation Conference (Dec. 1994), J. D. Tew, S. Manivannan, D. A. Sadowski, and A. F. Seila, Eds., pp. 1332–1339.
[11] Eker, A., Williams, B., Mishra, N., Thakur, D., Chiu, K., Ponomarev, D., and Abu-Ghazaleh, N. Performance implications of global virtual time algorithms on a knights landing processor. In 2018 IEEE/ACM 22nd International Symposium on Distributed Simulation and Real Time Applications (DS-RT) (2018), IEEE, pp. 1–10.
[12] Fujimoto, R. Performance measurements of distributed simulation strategies. Tech. Rep. UU–CS–TR–87–026a, University of Utah, Salt Lake City, November 1987.
[13] Fujimoto, R. Parallel discrete event simulation. Communications of the ACM 33, 10 (Oct. 1990), 30–53.
[14] Fujimoto, R. Performance of time warp under synthetic workloads. Proceedings of the SCS Multiconference on Distributed Simulation 22, 1 (Jan. 1990), 23–28.
[15] Fujimoto, R. Parallel and distributed discrete event simulation: Algorithms and applications. In Proc. of the 1993 Winter Simulation Conference (1993), pp. 106–114.
[16] Fujimoto, R., and Panesar, K. Buffer management in shared-memory Time Warp system. In Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS 95) (June 1995), pp. 149–156.
[17] Fujimoto, R. M. Time Warp on a shared memory multiprocessor. Transactions of Society for Computer Simulation (July 1989), 211–239.
[18] Fujimoto, R. M. Parallel discrete event simulation: Will the field survive? ORSA Journal on Computing 5, 3 (June 1993).
[19] Fujimoto, R. M. Parallel and Distributed Simulation Systems. Wiley Interscience, Jan. 2000.
[20] Fujimoto, R. M., and Hybinette, M. Computing global virtual time in shared-memory multiprocessors. ACM Transactions on Modeling and Computer Simulation 7, 4 (1997), 425–446.
[21] Fujimoto, R. M., Tsai, J., and Gopalakrishnan, G. C. Design and evaluation of the rollback chip: Special purpose hardware for Time Warp. IEEE Transactions on Computers 41, 1 (Jan. 1992), 68–82.
[22] Gperftools. Google performance tools.
[23] Gupta, S., and Wilsey, P. A. Lock-free pending event set management in time warp. In ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS) (May 2014).
[24] Heinecke, A., Vaidanathan, K., Smelianskiy, M., Kobutov, A., Dubtsov, R., Henri, G., Shet, A., Chrysos, G., and Dubey, P. Design and implementation of the linpack benchmark for single and multi-node systems based on intel xeon phi coprocessor. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS) (2013).
[25] Ianni, M., Marotta, R., Cingolani, D., Pellegrini, A., and Quaglia, F. The ultimate share-everything pdes system. In 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (May 2018), pp. 73–84.
[26] Ianni, M., Marotta, R., Pellegrini, A., and Quaglia, F. A non-blocking global virtual time algorithm with logarithmic number of memory operations. In 2017 IEEE/ACM 21st International Symposium on Distributed Simulation and Real Time Applications (DS-RT) (Oct 2017), pp. 1–8.
[27] Jagtap, D., Bahulkar, K., Ponomarev, D., and Abu-Ghazaleh, N. Characterizing and understanding pdes behaviour on tilera architecture. In Workshop on Principles of Advanced Discrete Simulation (PADS) (2012).
[28] Jagtap, D., Abu-Ghazaleh, N., and Ponomarev, D. Optimization of parallel discrete event simulator for multi-core systems. In International Parallel and Distributed Processing Symposium (May 2012).
[29] Jefferson, D. Virtual time. ACM Transactions on Programming Languages and Systems 7, 3 (July 1985), 405–425.
[30] Jefferson, D., Beckman, B., Wieland, F., Blume, L., Di Loreto, M., Hontalas, P., Laroche, P., Sturdevant, K., Tupman, J., Warren, V., Wedel, J., Younger, H., and Bellenot, S. Distributed simulation and the Time Warp operating system. In Proceedings of the 12th SIGOPS Symposium on Operating Systems Principles (1987), pp. 77–93.
[31] Lin, Z., and Yao, Y. An asynchronous gvt computing algorithm in neuron time warp-multi thread. In 2015 Winter Simulation Conference (WSC) (Dec 2015), pp. 1115–1126.
[32] Lu, M., Zhang, L., Hyunh, H., Ong, Z., Liang, Y., He, B., Goh, R., and Huynh, R. Optimizing the mapreduce framework on intel xeon phi coprocessor. In Proceedings of International Conference on Big Data (2013).
[33] Mattern, F. Virtual time and global states in distributed systems. In Proceedings of the Workshop on Parallel and Distributed Algorithms (Oct. 1989), pp. 215–226.
[34] Mattern, F. Efficient algorithms for distributed snapshots and global virtual time approximation. Journal of Parallel and Distributed Computing 18, 4 (Aug. 1993), 423–434.
[35] Mattern, F., Mehl, H., Schoone, A. A., and Tel, G. Global virtual time approximation with distributed termination detection algorithms. Tech. Rep. RUU–CS–91–32, Dept. of Computer Science, University of Utrecht, The Netherlands, 1991.
[36] Misra, G., Kurkure, N., Das, A., Valmiki, M., Das, S., and Gupta, A. Evaluation of rodinia codes on intel xeon phi. In Proceedings of the 4th International Conference on Intelligent Systems, Modelling and Simulation (2013).
[37] Pellegrini, A., and Quaglia, F. Wait-free global virtual time computation in shared memory timewarp systems. In Computer Architecture and High Performance Computing (SBAC-PAD), 2014 IEEE 26th International Symposium on (2014), IEEE, pp. 9–16.
[38] Pennycook, S., Hughes, C., Smelianskiy, M., and Jarvis, S. Exploring simd for molecular dynamics using intel xeon processor and intel xeon phi coprocessors. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS) (2013).
[39] Perf. Linux profiling with performance counters.
[40] Ramachandran, A., Vienne, J., Wijmgaart, R., Koesterke, L., and Sharapov, I. Performance evaluation of nas parallel benchmarks on intel xeon phi. In Proceedings of International Conference on Parallel Processing (ICPP) (2013).
[41] Samadi, B. Distributed Simulation, Algorithms and Performance Analysis. PhD thesis, Computer Science Department, University of California, Los Angeles, CA, 1985.
[42] Sodani, A., Gramunt, R., Corbal, J., Kim, H., Vinod, K., Chinthamani, S., Hutsell, S., Agarwal, R., and Liu, Y. Knights landing: Second-generation intel xeon phi product. In IEEE Micro (2016).
[43] Wang, J., Jagtap, D., Abu-Ghazaleh, N., and Ponomarev, D. Parallel discrete event simulation for multi-core systems: Analysis and optimization. IEEE Transactions on Parallel and Distributed Systems 25, 6 (2014), 1574–1584.
[44] Wang, J., Ponomarev, D., and Abu-Ghazaleh, N. Performance analysis of multithreaded pdes simulator on multi-core clusters. In 26th IEEE/ACM/SCS Workshop on Principles of Advanced and Distributed Simulations (PADS) (July 2012).
[45] Williams, B., Ponomarev, D., Abu-Ghazaleh, N., and Wilsey, P. Performance characterization of parallel discrete event simulation on knights landing processor. In Proceedings of the 2017 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS) (2017), ACM, pp. 121–132.
[46] Xie, B., Liu, X., Zhan, J., Jia, Z., Zhu, Y., Wang, L., and Zhang, L. Characterizing data analytics workloads on intel xeon phi. In Workload Characterization (IISWC), 2015 IEEE International Symposium on (2015), IEEE, pp. 114–115.