ANALYSIS OF GLOBAL VIRTUAL TIME ALGORITHMS FOR PARALLEL
DISCRETE EVENT SIMULATION ON MANY-CORE SYSTEMS
BY
ALI ARDA EKER
BS, Binghamton University, 2017
BS, Istanbul Technical University, 2017
THESIS
Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
in the Graduate School of Binghamton University
State University of New York
2019
Accepted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
in the Graduate School of Binghamton University
State University of New York
2019
July 2, 2019
Dmitry Ponomarev, Faculty Advisor
Department of Computer Science, Binghamton University
Kenneth Chiu, Member
Department of Computer Science, Binghamton University
Abstract
Global Virtual Time (GVT) algorithms compute a snapshot of a distributed simulation system in order to determine a consistent global state across all simulation processes. These algorithms aim to disturb the underlying computation as little as possible while computing a global state consisting of the local states of all processes and the states of the messages in transit between them. In other words, GVT algorithms implement monotonic functions that give a lower bound on the simulation time to which a distributed simulation system has advanced. In Parallel Discrete Event Simulation (PDES), the GVT is used to determine the correct time for non-reversible operations such as garbage collection, I/O operations and terminating the simulation.
In this project, we implemented two asynchronous GVT algorithms, Wait-Free and Mattern's GVT, and compared them with a barrier-based synchronous GVT algorithm [11]. We evaluated the GVT algorithms based on PDES performance on two different multi-core architectures: a classical 12-core Xeon machine and a high-performance-computing Xeon Phi processor (Knights Landing). Using the ROSS simulator, we demonstrated that an efficient GVT algorithm can lead to significant improvements in scalability, depending on the simulation model. We observed that the synchronous Barrier GVT algorithm with imbalanced models and the asynchronous Wait-Free algorithm with balanced models both allow the simulation to scale in performance all the way to 250 threads on a single machine. We also performed detailed simulation profiling to understand the underlying reasons for the different performance trends observed under each GVT algorithm, simulation model and parameter choice.
ACKNOWLEDGEMENTS
I am grateful to Dr. Dmitry Ponomarev and Dr. Kenneth Chiu for always directing me to do my best.
I also thank Barry Williams and Dr. Nael Abu-Ghazaleh for their invaluable advice on my work.
TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 2 BACKGROUND: PDES, ROSS SIMULATOR AND MULTI-
CORE ARCHITECTURES . . . . . . . . . . . . . . . . . 5
2.1 Parallel Discrete Event Simulation . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Synchronization Issues . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Computing the Global Virtual Time . . . . . . . . . . . . . . 11
2.1.3 Transient Message Problem . . . . . . . . . . . . . . . . . . . 13
2.1.4 Simultaneous Reporting Problem . . . . . . . . . . . . . . . . 15
2.2 ROSS Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 PHOLD Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Intel Xeon & Xeon Phi Architectures . . . . . . . . . . . . . . . . . . 21
2.5 Experimental Setup & Parameters . . . . . . . . . . . . . . . . . . . . 22
Chapter 3 GLOBAL VIRTUAL TIME ALGORITHMS . . . . . . 26
3.1 Synchronous GVT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.1 Barrier GVT Algorithm . . . . . . . . . . . . . . . . . . . . . 27
3.2 Asynchronous GVT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Samadi’s GVT Algorithm . . . . . . . . . . . . . . . . . . . . 29
3.2.2 Mattern’s GVT Algorithm . . . . . . . . . . . . . . . . . . . . 30
3.2.3 Wait-Free GVT Algorithm . . . . . . . . . . . . . . . . . . . . 43
Chapter 4 EXPERIMENTAL RESULTS . . . . . . . . . . . . . . . . 48
4.1 GVT Performance on 12-core Classical Xeon Machine . . . . . . . . . 48
4.1.1 Model 1: Balanced Loading & Fast Event Processing . . . . . 49
4.1.2 Model 2: Balanced Loading & Slower Event Processing . . . . 50
4.2 GVT Performance on 64-core Knights Landing Architecture . . . . . 51
4.2.1 Model 1: Balanced Loading & Fast Event Processing . . . . . 51
4.2.2 Model 2: Balanced Loading & Slower Event Processing . . . . 54
4.2.3 Model 3: Imbalanced Communication . . . . . . . . . . . . . . 54
4.2.4 Model 4: Imbalanced Event Processing . . . . . . . . . . . . . 56
4.3 Profiling and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 5 LITERATURE REVIEW . . . . . . . . . . . . . . . . . . 62
Chapter 6 CONCLUSIONS AND FUTURE WORK . . . . . . . . 64
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
LIST OF TABLES
2.1 Details of Experimental Platforms . . . . . . . . . . . . . . . . . . . . 23
2.2 Simulation Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Performance statistics for Xeon: 24 Threads 0% Remote, Balanced, 0
EPG Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Performance statistics for KNL: 128 Threads 0% Remote, Balanced, 0
EPG Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Performance statistics for KNL: 250 Threads 10% Remote, Imbalanced,
0 EPG Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
LIST OF FIGURES
2.1 PDES Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Rollback Example (Step 1): before rollback . . . . . . . . . . . . . . . 9
2.3 Rollback Example (Step 2): after straggler message is received, rollback
is initiated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Rollback Example (Step 3 a): after rollback completed in a determin-
istic simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Rollback Example (Step 3 b): after rollback completed in a non-
deterministic simulation . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 GVT is the time stamp of message sent from LP 2 to LP 3. . . . . . . 12
2.7 Transient message problem. . . . . . . . . . . . . . . . . . . . . . . . 14
2.8 Prevention of transient message problem using message acknowledge-
ments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 Simultaneous message problem. . . . . . . . . . . . . . . . . . . . . . 16
2.10 Hierarchy of simulation structures in ROSS. . . . . . . . . . . . . . . 18
2.11 Communication architecture in single node ROSS . . . . . . . . . . . 19
2.12 Intel Knights Landing Architecture . . . . . . . . . . . . . . . . . . . 22
2.13 Imbalanced Communication . . . . . . . . . . . . . . . . . . . . . . . 24
3.1 Snapshot of a Barrier GVT computation . . . . . . . . . . . . . . . . 28
3.2 Snapshot of a Samadi’s GVT Computation . . . . . . . . . . . . . . . 31
3.3 Cut divides simulation into two: past and future. . . . . . . . . . . . 31
3.4 Events before the cut line colored as white, after the cut line colored
as red (dotted arrow). . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Second cut should stretch towards future so that there should be no
message sent from the white phase and received in the consecutive
white phase (dotted arrow). . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 Snapshot of a Mattern’s GVT computation . . . . . . . . . . . . . . . 39
3.7 Snapshot of a Wait-Free GVT computation . . . . . . . . . . . . . . . 46
4.1 Committed Event Rate on Xeon with Balanced Loading and 0 EPG:
10% Remote Events (left) & 100% Remote Events (right) . . . . . . . 49
4.2 Committed Event Rate on Xeon with Balanced Loading and 50% Re-
mote Events: 100 EPG (left) & 500 EPG (right) . . . . . . . . . . . . 50
4.3 Committed Event Rate on KNL with Balanced Loading and EPG of
0: 0% Remote Events (left) & 10% Remote Events (right) . . . . . . 52
4.4 Committed Event Rate on KNL with Balanced Loading and EPG of
0: 50% Remote Events (left) & 100% Remote Events (right) . . . . . 53
4.5 Committed Event Rate on KNL with Balanced Loading and the EPG
of 100: 10% Remote Events (left) & 100% Remote Events (right) . . 55
4.6 Imbalanced in terms of Communication on KNL: Committed Event
Rate (left) & Efficiency (right) . . . . . . . . . . . . . . . . . . . . . . 56
4.7 Imbalanced in terms of Event Processing on KNL: Committed Event
Rate (left) & Efficiency (right) . . . . . . . . . . . . . . . . . . . . . . 57
Chapter 1
INTRODUCTION
Interest in the scalable performance of Parallel Discrete Event Simulation (PDES) has increased with the emergence of many-core architectures [7,8,27]. These machines offer tight integration among a large number of cores on the same chip, in contrast to traditional clusters and small-scale multi-core systems. With these new systems, researchers aim to efficiently exploit the shared memory that feeds a large number of threads. For example, 64 4-way hyper-threaded cores share 96 GB of memory on Intel's second-generation Xeon Phi, the Knights Landing (KNL) processor. Thus, the low cost of on-chip communication on these processors offers a promise of substantially improved scalability in PDES.
The emergence of many-core architectures promised to alleviate the communication bottlenecks that hindered many previous attempts to design a scalable PDES [10,16,20,28,43,44]. However, recent studies reported generally underwhelming performance results [7]. The performance challenges and the lack of scalability partially stem from an inefficient Global Virtual Time (GVT) algorithm used in these studies. A detailed study [45] examined and characterized PDES performance and scalability on KNL. This work demonstrated the lack of scalability for most models and execution scenarios when the thread count exceeded 128, or 2 threads for each core on the KNL chip. One of the reasons cited in [45] is the higher overhead of the GVT computation with larger numbers of threads. Thus, we investigated PDES performance on multi-core processors under more optimized GVT implementations.
As a result, our study offers more encouraging conclusions about PDES scalability
properties on multi-core systems.
We pursue our investigations using the ROSS parallel discrete event simulation kernel [3] on a single-node Intel Xeon Phi processor. For comparison purposes, we also evaluate the GVT algorithms on a traditional 12-core Xeon processor. The default GVT implementation in ROSS is a synchronous GVT using native POSIX barriers. In addition, we implemented two asynchronous GVT algorithms to further boost performance. The first is inspired by Mattern's GVT algorithm, appropriately adjusted for shared memory systems within our framework [33–35]. The second is the more recent wait-free GVT algorithm proposed in [37]. We compare the performance of the three implementations under different models, settings, and conditions. Our experiments are driven by the classical PHOLD benchmark, as well as variants that provide uneven loading of threads, vary the percentage of remote communications, and change the event processing granularity.
In the Background chapter, we explain the concept of Parallel Discrete Event Simulation and why an efficient Global Virtual Time computation is needed. We also study the main challenges to be considered when calculating the GVT; the Transient Message Problem and the Simultaneous Reporting Problem are demonstrated as examples. We also describe the internals of ROSS and discuss why it needs to implement a GVT algorithm. Then, we study the PHOLD benchmark, the variants used in our experiments, and the effects of the simulation model on the choice of GVT algorithm. Finally, we examine the Intel Xeon and Xeon Phi architectures, together with the experimental setup and parameters, to understand their effects on scalability.
In the GVT chapter, we first explain and demonstrate the default synchronous GVT implementation. Then, we analyze what makes it synchronous, as well as the possible advantages and disadvantages of asynchronous implementations. To illustrate the concept of an asynchronous GVT algorithm, we study Mattern's GVT and the Wait-Free GVT in detail. We also examine the modifications made to those algorithms to exploit the shared memory more efficiently.
The simulation models investigated on the Xeon and the KNL processors are presented in the Experimental Results chapter. We first evaluate and compare GVT performance on the Xeon processor using a balanced model with varying event processing granularity at a smaller scale. A Knights Landing processor is used to experiment with balanced and imbalanced core loading models. We also explore communication-dominated and event-processing-dominated scenarios to further understand the benefits and drawbacks of the three GVT algorithms. Finally, we demonstrate the scalability of some scenarios up to 250 threads and explain the results with detailed profiling.
The main contributions of this thesis are:
• We extend previous studies of PDES performance on the KNL processor with more efficient GVT algorithms: a synchronous barrier-based GVT and two asynchronous implementations, one inspired by Mattern's GVT algorithm [34] and one based on the wait-free algorithm of [37]. As a result, we removed a significant bottleneck identified in the earlier study and demonstrated that under a more efficient GVT, the simulation can often scale all the way to 250 threads.
• This is the first study that comparatively evaluates synchronous and asyn-
chronous GVT algorithms on a many-core platform such as the KNL. We also
compare the results on traditional Xeon machines.
• We show that while the most efficient asynchronous algorithm (the wait-free GVT) significantly outperforms the other alternatives for balanced models on both Xeon and KNL systems, the barrier-based synchronous implementation results in better performance on KNL with imbalanced models, especially at larger thread counts.
• We analyze the reasons for this behavior on KNL systems using a number of profiling tools and offer explanations for the observed results based on this analysis.
Chapter 2
BACKGROUND: PDES, ROSS SIMULATOR
AND MULTI-CORE ARCHITECTURES
In this chapter we give an overview of Parallel Discrete Event Simulation and show the challenges that arise when computing the Global Virtual Time. We also review the ROSS design and the PHOLD benchmark with its variants, and describe the architectures and experimental setup used in our experiments.
2.1 Parallel Discrete Event Simulation
In order to understand PDES, one should first understand Discrete Event Simu-
lation. The main objective of DES is to model a physical system which is composed of
some number of physical processes that interact with each other [12,19]. In DES, each
physical process is modeled as a logical process (LP) and interactions between physi-
cal processes are simulated by exchanging time-stamped event messages between the
associated logical processes. The computation performed by each LP is a sequence
of event processing which can modify the state of the LP and schedule new events
for itself or other LPs. For example, in an airport traffic simulation, each airport
is represented by an LP, while airplane arrivals and departures are event messages
which provide the communication between airports.
PDES is a parallel implementation of DES [13], extending the performance advantages of parallel processing to simulation kernels. The primary concept of PDES is to divide the simulation entities into multiple Logical Processes (LPs) and to execute them on different cores or nodes in parallel.
The LPs communicate with each other by exchanging time-stamped event messages [15, 18, 29]. Time-stamps carry virtual time and are not associated with real time (wall clock time). The sender LP generates an event message according to its task and computes the message's time-stamp by adding a look-ahead value to its local virtual time (LVT). The look-ahead value can be configured as a constant or can vary according to the LP's task or virtual position in the simulation.

The LPs have local event queues and process the events from these queues in time-stamp order. Processing an event is guaranteed to generate a new event with a time-stamp larger than that of the last processed event. This new event can be sent to any other LP, including the sender itself. Depending on the simulation model, some LPs can be chosen as message destinations more often than others, but every LP should receive events once in a while in order to advance in simulation time. One can compare this to the quiet airport of Binghamton versus JFK, which handles hundreds of departures daily.
Figure 2.1 presents an overview of a PDES kernel where three LPs communicate with time-stamped event messages. Incoming messages are stored in the Event Queue, and the State Queue is updated based on the last processed event. Message destinations and look-ahead values are chosen randomly.
As mentioned previously, some events are generated locally within the LP, and
some events are generated remotely and sent over the network. The timing of the
event arrival to a destination LP depends on the physical delays that the message
encounters while traversing the network. The on-chip interconnect for core-to-core
communication within a chip or network links for cluster-level communication can
determine the delay of the arrival.
Fig. 2.1. PDES Overview
2.1.1 Synchronization Issues
Sequential event processing by LPs in DES should be computed in parallel in
a PDES environment. This generates a synchronization problem since one cannot
simply map the different logical processes to different cores or nodes and allow each
LP to proceed forward by executing events in the incoming order of arrival. Event
messages should be causally consistent with each other based on their time stamps.
LPs should process events, both those generated locally and those generated by other
LPs in the time-stamp order. Failure to accomplish this could cause the processing
of an event E without processing the events which caused the generation of E. Errors
resulting out of order time-stamp event processing are referred as causality errors.
When an LP processes an event, there is no priori guarantee that an event with
smaller time-stamp will not arrive from some other LP due to physical delays in
the system. These events are called straggler events and violate the causality order
between events. Therefore, a PDES simulation engine needs to use a synchronization
mechanism to ensure that events are executed at different LPs in the correct time-
stamped order.
For example, consider a passenger traveling from JFK, New York to Istanbul, Turkey through Heathrow, London, switching from plane A to plane B in London. If B arrives at London before the passenger does, this is harmless in reality, but in a PDES environment it causes message B to be processed first, and the passenger misses their flight to Istanbul. In this case, A is the straggler message: its time stamp is smaller than B's, but it arrives at the destination (Heathrow) after B has already been processed.
There are two proposed synchronization approaches to solve this issue. The first is a conservative approach which uses synchronization and message exchanges to guarantee that no straggler event will ever be generated and that the causality order is never violated. In contrast, the second approach is an optimistic solution which allows LPs to process events forward without global synchronization [4, 14, 17, 30]. Causality violations are handled by rolling back to a point in virtual time earlier than the straggler message's time stamp [21].
An analogy to optimistic processing is speculative execution in microprocessors. Speculative execution provides a mechanism for fetching, decoding and executing new instructions based on a branch prediction. Many modern microprocessors predict the result of a branch instruction and optimistically begin executing instructions according to this prediction. This is problematic when the prediction is incorrect: the CPU must have some way to back out of the wrong sequence of instructions it began to execute and start executing the correct sequence.
Rollbacks in an optimistic simulation require reverting the LP to a previous state. Such reversions can be realized either by checkpointing or by reverse computation; both methods require maintaining a list of event histories. In both methods, LPs also revert all the messages they sent when encountering a rollback (messages with a time stamp larger than the straggler message's time stamp). This is realized by sending an anti-message for each normal (positive) message sent to the same target LP, so that the simulation backs up with a cascading effect.
Figure 2.2 through Figure 2.5 demonstrate an example of a rollback. In step
1, the LP is about to receive a straggler message with time stamp 7 while it has
two processed events with time stamps 5 and 10, respectively. The straggler message
causes a rollback because the causal consistency between events 7 and 10 is broken.
In Figure 2.3, the LP cancels the events with time stamp larger than 7, sends
anti messages for each positive message it sent (event with time stamp 11, targeting
LP A) and processes the straggler message. The next step is dependent on whether
the simulation is deterministic or non-deterministic.
Figure 2.4 presents a scenario after a rollback is completed in a deterministic
simulation where a reverted event with time stamp 10 is created again. The cancelled
positive message with time stamp 11 is also generated again with the same target LP
A. The unprocessed event is generated in the same way. However, Figure 2.5 shows a
situation where the simulation is non-deterministic so that a random event with time
stamp 9 is generated and it causes the sending of another random message with time
stamp 10 to a different LP (B).
Fig. 2.2. Rollback Example (Step 1): before rollback
Fig. 2.3. Rollback Example (Step 2): after straggler message is received, rollback is initiated
Fig. 2.4. Rollback Example (Step 3 a): after rollback completed in a deterministic simulation
Checkpointing requires LPs to save their states periodically at fixed intervals. When a straggler message triggers a rollback, LPs discard all the work they have done between the last checkpoint prior to the straggler and their current LVT. They set their LVT to the rolled-back checkpoint and restart the simulation from that point. Checkpointing can be implemented in a deterministic or a non-deterministic way: the latter does not guarantee the regeneration of the rolled-back events, while the deterministic approach preserves the order of events.
Reverse computation is another approach to implementing optimistic simulation. It is based on reversible operations such as addition and subtraction.
Fig. 2.5. Rollback Example (Step 3 b): after rollback completed in a non-deterministic simulation
Upon a straggler message, each LP reverses the event messages starting from the most recent event until the event just prior to the straggler message. More precisely, reverse computation code is carried with every event so that the event's effect can be undone to restore the state during a rollback.
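As a minimal illustration (the state variable and handler names below are hypothetical, not taken from ROSS), a reverse-computation event handler carries an undo routine that exactly inverts its forward effect, so no checkpoint of the previous state is needed:

#include <stdio.h>

/* Hypothetical LP state: a single counter updated by each event. */
typedef struct { long arrivals; } lp_state;

/* Forward handler: apply the event's effect. */
static void process_arrival(lp_state *s) { s->arrivals += 1; }

/* Reverse handler: undo the effect during a rollback.  Because the
 * forward operation is a reversible addition, the previous state can
 * be recovered without saving a copy of it.                          */
static void reverse_arrival(lp_state *s) { s->arrivals -= 1; }

int main(void) {
    lp_state s = { 0 };
    process_arrival(&s);      /* optimistically process an event  */
    reverse_arrival(&s);      /* straggler arrives: roll it back  */
    printf("arrivals after rollback: %ld\n", s.arrivals);   /* prints 0 */
    return 0;
}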
2.1.2 Computing the Global Virtual Time
Both checkpointing and reverse computation require maintaining a previous
event history. These histories accumulate over time and create a memory overhead
which leads to lower cache utilization. Therefore, a mechanism is needed to reclaim
the resources that are no longer needed. This mechanism can also be used to perform
operations which cannot be reverted, such as I/O.
The memory problems related to the optimistic simulation can be solved if one
can guarantee that certain events are no longer prone to rollback. Specifically, state
histories prior to time T can be freed if no rollbacks prior to time T will ever be
needed. Similarly, I/O operations issued by any LP with a smaller LVT than T can
be executed. Therefore, if we can determine a lower bound on the time stamp of any
future rollback, we can use it to free the memory. This lower bound is referred to
as Global Virtual Time (GVT). If we could capture the snapshot of the simulation
system, the minimum time stamp among all anti-messages, positive messages and
unprocessed events in the system would represent the lower bound on the time stamp
of any future rollback. Thus, the memory for the state histories and event messages
that have a lower time stamp than the GVT can be reclaimed, and I/O operations
issued before the GVT can be executed.
Figure 2.6 shows a possible GVT value among simulation processes and in-transit messages. White circles depict the virtual positions of the LPs in the simulation. The LVT of an LP is computed by taking the minimum time stamp of the unprocessed events in the LP's event queue. The GVT is the minimum of the smallest time stamp of the in-transit messages (min(4, 12)) and the smallest of the LVTs (min(5, 7, 10)). Therefore, the GVT is 4 (min(4, 5)).
Fig. 2.6. GVT is the time stamp of message sent from LP 2 to LP 3.
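To make this computation concrete, the minimal C sketch below reproduces the reduction for the values in Figure 2.6; the two arrays are hypothetical stand-ins for a frozen snapshot of the system:

#include <stdio.h>

/* Hypothetical snapshot of the system in Figure 2.6. */
static double lvts[]       = { 5.0, 7.0, 10.0 };   /* LVTs of LP 1, 2, 3    */
static double in_transit[] = { 4.0, 12.0 };        /* in-flight time stamps */

static double min_of(const double *v, int n) {
    double m = v[0];
    for (int i = 1; i < n; i++)
        if (v[i] < m) m = v[i];
    return m;
}

int main(void) {
    /* GVT = min( min LVT, min time stamp of any in-transit message ). */
    double min_lvt = min_of(lvts, 3);
    double min_msg = min_of(in_transit, 2);
    double gvt     = min_lvt < min_msg ? min_lvt : min_msg;
    printf("GVT = %.1f\n", gvt);   /* prints 4.0 */
    return 0;
}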
If one could freeze the simulation as a whole and capture the information of all in-transit messages and unprocessed events, then computing the GVT would be trivial: one would only need to compute the minimum of 1) the smallest time stamp of all in-transit messages and 2) the smallest LVT among all LPs. In practice this is not possible, because once a message is sent there is no way to stop it during its transmission. Therefore, in order to determine the GVT, we need a mechanism which computes a snapshot of the simulation as a whole. There are two challenging problems associated with taking such a snapshot: the transient message problem and the simultaneous reporting problem.
2.1.3 Transient Message Problem
As mentioned earlier, the Local Virtual Time (LVT) of an LP determines its position in the simulation with respect to the other LPs. The LVT is determined by the last processed event's time stamp. Thus, it is guaranteed that the LVT is a lower bound on the time stamps of the unprocessed events in an LP's event queue (assuming the events in the queue are sorted by time stamp).
Therefore, one could try to compute the GVT by instantaneously signaling all the
LPs to report their LVTs and computing the minimum LVT. However, this minimum
LVT might not be a correct GVT value. This is because there may be some messages
in the network which are sent by one LP, but not yet received by its destination LP.
These messages may be straggler messages that possibly cause a rollback. Thus, these
transient messages must be included in the GVT computation.
Figure 2.7 shows an example of the transient message problem. The leader LP sends a message to LP 1 and LP 2 asking them to report their LVTs, which are 10 and 20 at the point the signal is received, respectively (shown as dashed arrows). The transient message with time stamp 5 has not yet been received; thus, it is not incorporated into the LVT calculation of LP 2. The computed GVT value is 10 (min(10, 20)), but it should be 5.
There are essentially two approaches to solve the transient message problem:
1) The sender LP is responsible for taking into account the time stamp of transient
messages it sends, or 2) it is the receiver LP’s responsibility to take into account the
time stamp of transient messages when they arrive.
Both solutions require message acknowledgments.
Fig. 2.7. Transient message problem.
In the first approach, the receiver sends an acknowledgment message to the sender upon the arrival of every message. The sender of each message is responsible for accounting for that message in its LVT until it receives the acknowledgment. It is acceptable for more than one LP to account for the same message, because that does not affect the GVT computation. This handshake between the sender and the receiver ensures that no transient messages "fall between the cracks" during the GVT computation [19].
Figure 2.8 shows how the transient message problem is eliminated using this solution. Message acknowledgements are sent when the receiver receives the message (shown as dotted arrows). Since LP 1 has not received the acknowledgment yet, it still remembers the time stamp of the message it sent (5) when the Report LVT signal is received.
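A sender-side sketch of this bookkeeping is shown below in C (the field and function names are hypothetical, not taken from any particular algorithm): every unacknowledged time stamp is folded into the local minimum that the LP reports.

#include <stdio.h>

#define MAX_OUTSTANDING 1024

/* Hypothetical per-LP bookkeeping for messages awaiting acknowledgment. */
typedef struct {
    double lvt;                       /* local virtual time                  */
    double pending[MAX_OUTSTANDING];  /* time stamps of unacknowledged sends */
    int    npending;
} lp_ack_state;

/* Remember a sent message until its acknowledgment arrives. */
static void on_send(lp_ack_state *lp, double ts) {
    lp->pending[lp->npending++] = ts;
}

/* Drop the entry once the matching acknowledgment is received. */
static void on_ack(lp_ack_state *lp, double ts) {
    for (int i = 0; i < lp->npending; i++) {
        if (lp->pending[i] == ts) {
            lp->pending[i] = lp->pending[--lp->npending];
            return;
        }
    }
}

/* Value reported on a Report-LVT signal: the LVT together with every
 * still-unacknowledged time stamp, so no transient message can fall
 * between the cracks.                                                 */
static double report_local_min(const lp_ack_state *lp) {
    double m = lp->lvt;
    for (int i = 0; i < lp->npending; i++)
        if (lp->pending[i] < m) m = lp->pending[i];
    return m;
}

int main(void) {
    lp_ack_state lp1 = { 10.0, { 0 }, 0 };
    on_send(&lp1, 5.0);               /* message with time stamp 5 in flight */
    printf("reported min = %.1f\n", report_local_min(&lp1));   /* 5.0  */
    on_ack(&lp1, 5.0);                /* acknowledgment arrives              */
    printf("reported min = %.1f\n", report_local_min(&lp1));   /* 10.0 */
    return 0;
}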
The main problem with message acknowledgments is that they require too many message transmissions. They double the message count and overload the network if the simulation is dominated by communication overhead. They can also increase the message transmission time, depending on the underlying network organization. In a shared memory architecture, they can result in lower cache utilization because of the added memory pressure.
Samadi’s GVT algorithm utilizes message acknowledgments while Mattern’s GVT
14
Fig. 2.8. Prevention of transient message problem using message acknowledgements.
uses a set of message counters and a control message to guarantee that every message is accounted for in the GVT computation. These algorithms are described in detail in the Global Virtual Time Algorithms chapter.
2.1.4 Simultaneous Reporting Problem
Consider a simple GVT algorithm which requires LPs to report their LVTs without stopping and then computes the minimum among those values. This global minimum would still not be a correct GVT value. Even when the transient message problem is handled, the simultaneous reporting problem arises because LPs do not report their LVTs at precisely the same instant in wallclock time. This can result in some messages not being accounted for by either the sender or the receiver LP, creating a scenario where some messages "slip between the cracks". Accounting for unprocessed messages in the system becomes more complicated if LPs are allowed to process events while the GVT computation is in progress.
Figure 2.9 shows an example of the simultaneous reporting problem. After LP 2 reports its LVT to the leader (shown as a dashed arrow), it receives a straggler message with a time stamp of 15. This message is not considered during the GVT computation, so the GVT is found to be 20 (min(20, 30)), although it should have been 15.
Fig. 2.9. Simultaneous message problem.
2.2 ROSS Simulator
Rensselaer’s Optimistic Simulation System (ROSS) [3] is used as the base simu-
lator for our studies. ROSS is a state-of-the art PDES simulation environment that
supports both conservative and optimistic synchronization. In the optimistic mode,
ROSS uses reverse computation [4] in place of state saving to rollback to a safe state
upon a straggler message. The default GVT algorithm in ROSS is a barrier based
synchronous implementation which is described in the next chapter in detail.
The original ROSS implementation utilized processes which communicated with
message passing using the MPI library. In our studies, we use a multi-threaded
version of ROSS [43] in order to effectively exploit the shared memory available
on the Knights Landing processor. In this version, processes are implemented as
threads which would require no expensive MPI-based communication; thus, directly
exploiting the shared memory.
Simulation tasks are executed repeatedly in the core simulation loop. Threads execute four main simulation tasks in this tight loop in parallel. At each iteration, threads first read the messages they have received, process them and generate new messages, send those messages to the appropriate threads, and participate in the GVT computation, as shown in Algorithm 1.
ALGORITHM 1: Simulation Core Loop
1 while PE → GVT < simulation end time do
2 event e read = read message(PE → event queue)
3 event e new = process message(e read)
4 send message(e new)
5 participate gvt(PE)
6 end
In each iteration, threads can read, process and send messages up to a predefined constant. This constant is called the batch size. Furthermore, participation in the GVT computation is controlled by the GVT interval constant: threads start the computation every GVT interval iterations, unless memory is exhausted. In that case, freeing memory becomes crucial for the simulation to proceed, and the GVT computation can be triggered explicitly. Threads execute the core simulation loop until the GVT reaches the predefined simulation completion time. The data structures shared between threads are protected by fine-grained mutex locks, condition variables, pthread barriers or atomic operations.
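The control flow described above can be sketched as follows (a simplified stand-in rather than the actual ROSS code; the helper routines are trivial stubs and the constants are illustrative):

#include <stdio.h>

/* Illustrative constants; our experiments use a batch size of 8 and
 * GVT intervals of 128, 200 or 400.                                  */
#define BATCH_SIZE    8
#define GVT_INTERVAL  128

/* Trivial stand-ins for the real simulation routines. */
static double gvt = 0.0;
static void   process_batch(int n)  { (void)n;    /* read/process/send events */ }
static void   participate_gvt(void) { gvt += 1.0; /* pretend the GVT advances */ }

static void simulation_core_loop(double end_time) {
    int since_gvt = 0;
    while (gvt < end_time) {
        /* Read, process and send at most BATCH_SIZE events per iteration. */
        process_batch(BATCH_SIZE);

        /* Join the GVT computation every GVT_INTERVAL iterations (or
         * immediately if the free event pool were exhausted).          */
        if (++since_gvt >= GVT_INTERVAL) {
            participate_gvt();
            since_gvt = 0;
        }
    }
}

int main(void) {
    simulation_core_loop(10.0);
    printf("final GVT = %.1f\n", gvt);
    return 0;
}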
In ROSS, simulation structures are organized into three categories. At the highest level, Physical Entities (PEs) execute the core simulation loop and call subroutines for the main simulation tasks. Each PE is serviced by a POSIX thread. PEs manage local queues for incoming and outgoing messages and perform garbage collection. Threads execute in parallel on different cores. Each POSIX thread is pinned to a single core with CPU affinity, with threads assigned to cores in a round-robin fashion after they are spawned.
Each PE has local variables indicating its Local Virtual Time and Global Virtual Time. PEs also have a pointer to the currently processed event and bookkeeping information for participating in GVT computation and other simulation tasks. Four main data structures are managed locally by each PE: the event queue, the priority queue, the cancelled queue and the free queue. The event queue is a linked list of events sent to this PE. The priority queue is used to sort the received events according to their time stamps. The cancelled queue is a linked list of cancelled events, and the free queue holds the list of free events.
PEs also have a linked list of the Logical Processes (LPs) and Kernel Processes (KPs) that they service. Each LP holds the state of a simulated entity; for example, the state of an airport (capacity, congestion, or weather situation) would be encapsulated in an LP. KPs are responsible for garbage collection: each KP holds a list of processed events for the collection of LPs it services. In the ROSS framework, garbage collection is referred to as fossil collection, in keeping with PDES terminology. Fossil collection is performed by each KP by clearing its processed-events list at each GVT computation. The numbers of LPs and KPs are configured at compile time, and they are executed sequentially by the PE (thread) that services them. The organization of PEs, LPs and KPs is presented in Figure 2.10.
Fig. 2.10. Hierarchy of simulation structures in ROSS.
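A header-style C sketch of this hierarchy (the field names are illustrative, not the actual ROSS definitions) summarizes the roles of PEs, KPs and LPs:

/* Simplified view of the ROSS simulation hierarchy; the real ROSS
 * structures carry many more fields.                               */

typedef struct event event;          /* time-stamped message               */

typedef struct lp {                  /* Logical Process: one model entity  */
    void *state;                     /* e.g. an airport's capacity or load */
} lp;

typedef struct kp {                  /* Kernel Process: fossil collection  */
    lp    *lps;                      /* the LPs this KP services           */
    int    nlps;
    event *processed_events;         /* history cleared at each GVT        */
} kp;

typedef struct pe {                  /* Physical Entity: one POSIX thread  */
    double lvt, gvt;
    event *event_queue;              /* events sent to this PE             */
    event *priority_queue;           /* received events sorted by time     */
    event *cancelled_queue;
    event *free_queue;
    kp    *kps;                      /* 32 KPs per PE in our experiments   */
    int    nkps;
    lp    *lps;                      /* 128 LPs per PE in our experiments  */
    int    nlps;
} pe;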
Communication between PEs is implemented through a shared data structure called mt out q. During each iteration of the core simulation loop, processing an event causes a new event to be generated. This new event is sent to its destination by being written into mt out q, where the id of the destination PE is the index. The total number of PEs determines the size of mt out q. Each cell of mt out q holds a pointer to a thread-local event queue named inq, and a mutex lock. This lock protects only the accesses issued to that single cell of mt out q.
When an event e is written into mt out q[destination PE id], the destination PE receives it instantly because the pointer in the updated cell points to the inq of the destination PE, as shown in Figure 2.11. When a receiver PE wants to read its messages, it pops events from its inq one by one and pushes them into its event queue. Popping from the inq is protected by the lock stored at mt out q[receiver PE id]. This is an efficient implementation: instead of locking an entire data structure for each write and read, only a single index is locked. Furthermore, message transmission is abstracted from the event buffers, so PEs do not need to lock them during the core simulation tasks.
Fig. 2.11. Communication architecture in single node ROSS
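The per-cell locking can be sketched as follows (a simplified model of the scheme described above, with hypothetical names; events are pushed and popped in LIFO order purely for brevity, and each lock must be initialized with pthread_mutex_init at startup):

#include <pthread.h>
#include <stddef.h>

typedef struct event { struct event *next; double ts; } event;

/* One cell per PE: a pointer to that PE's thread-local inq plus the
 * lock that protects accesses to this single cell only.             */
typedef struct {
    event          **inq;
    pthread_mutex_t  lock;
} out_cell;

static out_cell mt_out_q[256];       /* sized by the total number of PEs */

/* Sender side: push event e onto the destination PE's inq. */
static void send_event(int dest_pe, event *e) {
    out_cell *c = &mt_out_q[dest_pe];
    pthread_mutex_lock(&c->lock);
    e->next = *c->inq;
    *c->inq = e;                     /* the destination sees it instantly */
    pthread_mutex_unlock(&c->lock);
}

/* Receiver side: pop one event from my own inq, if any. */
static event *receive_event(int my_pe) {
    out_cell *c = &mt_out_q[my_pe];
    pthread_mutex_lock(&c->lock);
    event *e = *c->inq;
    if (e != NULL)
        *c->inq = e->next;
    pthread_mutex_unlock(&c->lock);
    return e;                        /* caller pushes it into its event queue */
}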
2.3 PHOLD Benchmark
PDES simulation have to be driven by the benchmarks. The most popular and
versatile benchmark for evaluating PDES, is the classical Phold model. Phold is a
synthetic benchmark that allows characterization of the performance of applications
under different scenarios. For example, it allows control of the percentage of events
generated locally to the same core and the percentage of events generated for the
other cores, thus requiring inter-core communication and delays.
Phold can also be used to alter the event processing granularity (EPG) to control
how much CPU processing is required for each event. As a result, this allows us to
evaluate systems with different computation/communication balance (by varying the
EPG) and with different execution locality patterns (by varying the percentage of
remote events).
The Phold benchmark starts by initializing each LP with a number of events. Processing an event amounts to picking a destination LP according to some algorithm (for example, randomly), sending a message to that destination LP, and incurring the EPG delay. Upon receiving a message, the destination LP picks another destination, and so on. Therefore, the number of events in the system remains constant. One can load some cores more than others by choosing LPs residing on those cores as message destinations more often than other LPs.
In our experiments, we created imbalanced scenarios in terms of communication and event processing by assigning more message load or different EPG delays to some of the LPs.
For our experiments, we assigned 128 LPs and 32 KPs to a single PE, which is handled by a hardware thread. Each LP is initialized with one starting event. PEs process a maximum of 8 events at once before handling GVT and receiving new events. GVT computation is initiated after every GVT interval iterations of the core simulation loop. PEs are assigned a lookahead value of 1, which determines the time stamp of a newly generated event based on the last processed event. For example, after a PE processes an event with time-stamp 15, it generates a new event and sets its time-stamp to 16.
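A compact sketch of the Phold event handler in C is shown below (illustrative only; the destination choice and the EPG delay loop are simplified, and the parameter names are hypothetical):

#include <stdlib.h>

#define LOOKAHEAD      1.0
#define REMOTE_PERCENT 10     /* chance that the new event leaves this PE */

/* On processing an event at virtual time lvt, burn roughly epg units of
 * work, pick a destination LP, and schedule the next event at
 * lvt + LOOKAHEAD.                                                       */
static void phold_process(int my_pe, int num_pes, int lps_per_pe,
                          double lvt, int epg,
                          double *out_ts, int *out_dest_lp) {
    /* EPG delay: roughly epg floating-point operations of busy work. */
    volatile double sink = 0.0;
    for (int i = 0; i < epg; i++)
        sink += 1.0;

    /* Pick the destination: with probability REMOTE_PERCENT the event
     * targets an LP on an arbitrary PE, otherwise it stays local.     */
    int dest_pe = (rand() % 100 < REMOTE_PERCENT) ? rand() % num_pes : my_pe;
    *out_dest_lp = dest_pe * lps_per_pe + rand() % lps_per_pe;

    /* New event's time stamp: last processed time stamp + lookahead,
     * e.g. an event processed at time 15 yields a new event at 16.   */
    *out_ts = lvt + LOOKAHEAD;
}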
2.4 Intel Xeon & Xeon Phi Architectures
The Intel Knights Landing [8,42] is the second generation of the Many Integrated Core (MIC) architecture, designed to be used both as a standalone processor and as a co-processor for High Performance Computing (HPC) applications. The Knights Landing and Knights Corner (first-generation MIC) architectures together make up Intel's Xeon Phi family.

Knights Landing processors feature up to 72 cores, each capable of executing 4 simultaneous threads. The cores run at a maximum frequency of 1.3 GHz and can achieve better than 6000 Gflop/s single precision and 3000 Gflop/s double precision when the vector processing units are fully utilized.
A major upgrade over the Knights Corner architecture, Knights Landing adds branch prediction and out-of-order execution logic to each core. The number of Vector Processing Units (VPUs) has been increased to two per core. Also, a 1 MB L2 cache is now shared between every core pair, forming a tile. Significantly, KNL systems now include 16 GB of on-package high-bandwidth memory (HBM), which can be used as an L3 cache for the off-package DDR4 memory or as the sole memory. Our Knights Landing processor has 64 cores and 96 GB of DDR4 memory. A high-level diagram of the Knights Landing architecture used for this study (not including the DDR4 memory) is shown in Figure 2.12.
The KNL processor is commonly socketed and utilized as a standalone CPU, as is the case in our experimental system, whereas the Knights Corner processor had to be used as an accelerator.
Fig. 2.12. Intel Knights Landing Architecture
KNL runs standard Linux distributions as a full host computer, thus eliminating the idiosyncrasies of accelerator interfacing.
2.5 Experimental Setup & Parameters
Table 2.1 summarizes the configurations and hardware details of the host Xeon processor and the Xeon Phi Knights Landing (KNL) processor. Note that the Xeon system is only used for comparison purposes in the results presented in Section 4.1.
For all presented experiments, we execute the multi-threaded version of the ROSS simulator driven by the Phold benchmark.
Platform        Xeon                      KNL
Model           E5-2620                   7230
Frequency       2.40 GHz                  1.3 GHz
# of Cores      12                        64
Memory Type     DDR4 2133                 DDR4 2400
Memory Size     60 GB                     96 GB + 16 GB
OS              CentOS 6.6                CentOS 7.2
Compiler        GNU Compiler Toolchain    GNU Compiler Toolchain
Table 2.1. Details of Experimental Platforms
In the Phold benchmark, we vary the thread count, the percentage of remotely generated events, the event processing granularity (EPG) and the loading of threads in terms of communication or computation. The EPG represents the amount of work required to process a single event, and is specified in units approximately equal to 1 FLOP per unit. This artificial event processing delay enables us to create scenarios where computation dominates over communication.
Our goal is to understand the behavior and scaling trends of the ROSS simulator while executing on a single Knights Landing node with different GVT algorithms. We also compare the performance of the KNL processor against a 12-core Xeon processor to understand the trade-off between a large number of smaller cores and a smaller number of more powerful cores.
We consider both balanced and imbalanced Phold models in our evaluations. In a balanced model, a destination LP is chosen randomly, and every LP sends and receives about the same number of messages during the course of the simulation. Moreover, all LPs are delayed by the same EPG overhead. For the model that is imbalanced in terms of communication, LPs are divided into four groups. An LP in the first group sends messages to any LP, an LP in the second group sends to the first half of the LPs, an LP in the third group sends to the first 30% of the LPs, and an LP in the last group sends messages to the first quarter of the LPs. Figure 2.13 depicts the communication pattern between the four groups.
Fig. 2.13. Imbalanced Communication
As a result, while some LPs can be destinations regardless of the source (and therefore receive a larger number of messages), other LPs receive very few messages. While this model is simplistic, it allows us to gauge the performance of systems where the threads are not equally loaded. The model that is imbalanced in terms of event processing is created by assigning varying EPG delays to the LPs. Therefore, each core executes a different number of instructions due to a heavier processing time for a single event, although the number of events processed is about the same.
We report the performance results in terms of committed events per second. As we increase the number of processing nodes (threads), we maintain the number of starting events per node, thus proportionately increasing the total number of events generated by the simulator. If the underlying system is capable of efficiently keeping up with this load without incurring additional delays, we can expect the committed event rate to show improvements commensurate with the increase in the number of nodes. This is known as weak scaling [2]. A thorough summary of the simulation parameters is shown in Table 2.2.
Variable            Value                          Description
Architecture        Xeon, KNL                      Processor model
GVT                 Barrier, Mattern, Wait-Free    Global Virtual Time algorithm
GVT Interval        128, 200, 400                  GVT computation frequency
Remote %            0, 10, 50, 100                 Proportion of events sent outside of the core
EPG                 0, 100, 500                    Event Processing Granularity
Simulation Model    Balanced, Imbalanced           Overloading cores in terms of communication or computation
CPU Affinity        Round Robin                    Scheduling of threads to cores
# PE                # CPUs                         Physical Entities: POSIX threads
# LP                128 * # PE                     Logical Processes: simulation objects
# KP                32 * # PE                      Kernel Processes
Initial Events      1 * # LP                       Number of events to start the simulation
Look-ahead          LVT + 1                        Time stamp of the new event
Table 2.2. Simulation Parameters
Chapter 3
GLOBAL VIRTUAL TIME ALGORITHMS
In order to support recovery to a safe state upon a rollback in an optimistic simulator, event histories must be saved during the simulation. These histories can grow large over time and need to be garbage collected periodically to prevent memory exhaustion. To achieve this, the Global Virtual Time (GVT) is periodically computed to determine the earliest conceivable time to which a rollback may ever be required. It is essentially the greatest lower bound on the local virtual times of all PEs and the time stamps of all in-flight messages. GVT algorithms come in two principal flavors: synchronous and asynchronous.
3.1 Synchronous GVT
A synchronous implementation essentially follows the “stop-synchronize-and-go”
model where threads periodically stop processing, wait until all transient messages
arrive, collectively compute the new GVT value and then proceed again. In a thread-
based implementation of PDES, synchronous GVT computation utilizes a pthread
barrier. Synchronous implementations may be inefficient when threads arrive at the
barrier at different times. The threads that arrive early must wait until the slower
threads catch up. This cycle repeats at each GVT round.
The periodic synchronization of the PEs limits the optimism of ROSS. Faster PEs have to be stalled, resulting in idle CPUs which do no useful simulation work until the slower PEs catch up. Therefore, synchronous GVT algorithms impose indirect conservatism on optimistic simulation kernels.
3.1.1 Barrier GVT Algorithm
The Barrier GVT algorithm is fundamentally synchronous since it utilizes pthread barriers. As shown in Algorithm 2, each PE executes a barrier call in a tight loop until the number of transient messages is found to be zero. The barrier call collects the local message counters from each PE as input and reduces the local counters to a single sum. This sum represents the number of in-transit messages in the system and is returned to each PE. When the reduced sum is zero, each PE breaks out of the loop and synchronizes one last time to compute the minimum LVT across all PEs. At this point there are no in-flight messages, so a new GVT value can be computed by reducing the LVTs to a single minimum value. Once all the PEs receive the new GVT, they fossil collect and leave the GVT routine.
ALGORITHM 2: Barrier GVT Algorithm
// PEs loop until there are no more in transit messages
1 while 1 do
2 int msg counter = PE → msg sent - PE → msg received
3 int msg intransit = sum barrier(msg counter)
4 if msg intransit == 0 then
5 Break
6 end
7 end
// No more in transit messages at this point
8 int new GVT = min barrier(PE → LVT)
9 PE → GVT = new GVT
10 fossil collect(PE)
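A minimal pthread sketch of the sum and min reductions used in Algorithm 2 is given below (a simplified stand-in for the sum barrier and min barrier helpers, assuming the barrier has been initialized once for NUM_PE participating threads):

#include <pthread.h>

#define NUM_PE 4

static pthread_barrier_t bar;       /* initialized once with count NUM_PE */
static int    counters[NUM_PE];     /* per-PE (messages sent - received)  */
static double lvts[NUM_PE];         /* per-PE local virtual time          */

/* Every PE deposits its value, meets at the barrier, then reduces the
 * shared array locally; a second barrier keeps the array stable until
 * all PEs have read it.                                                 */
static int sum_barrier(int my_id, int my_count) {
    counters[my_id] = my_count;
    pthread_barrier_wait(&bar);
    int sum = 0;
    for (int i = 0; i < NUM_PE; i++)
        sum += counters[i];
    pthread_barrier_wait(&bar);
    return sum;                      /* number of in-transit messages */
}

static double min_barrier(int my_id, double my_lvt) {
    lvts[my_id] = my_lvt;
    pthread_barrier_wait(&bar);
    double m = lvts[0];
    for (int i = 1; i < NUM_PE; i++)
        if (lvts[i] < m) m = lvts[i];
    pthread_barrier_wait(&bar);
    return m;                        /* the new GVT value */
}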
A diagram of the barrier GVT computation is shown in Figure 3.1. Once the GVT computation begins, all PEs are blocked until all the messages have been received. Due to the nature of synchronous GVT computation, the transient message problem and the simultaneous reporting problem are eliminated, because the idle time during the barrier synchronization lasts until all of the transient messages (which could possibly cause a rollback) have been received.
Fig. 3.1. Snapshot of a Barrier GVT computation
3.2 Asynchronous GVT
In contrast, in asynchronous GVT algorithms, the GVT computation proceeds "in-line" with event processing, obviating the need to halt threads. In this approach, the GVT is computed in the background asynchronously, without interfering with the other simulation tasks. Asynchronous GVT algorithms do not block the PEs, thus yielding higher CPU utilization since PEs always do useful work.

However, a higher computational overhead may be involved, because some form of thread synchronization and management is required to coordinate the participation of threads in the GVT process. CPUs execute more instructions related to GVT computation than with synchronous GVT algorithms. This is examined in detail in the Profiling and Analysis subsection.
We implemented two different asynchronous GVT algorithms. One originates from Mattern's GVT algorithm [34], which relies on control messages and a locking mechanism. The second is a Wait-Free GVT algorithm [37], which maintains a set of phases to compute the GVT using atomic operations. We also explain Samadi's GVT computation [41], since it is a fundamental asynchronous algorithm; however, we did not evaluate it because it is intuitively less efficient than Mattern's GVT due to its message acknowledgments.
We implemented three versions of Mattern's GVT algorithm for evaluation purposes. First, we utilized mutex locks with a tree-based lock structure to reduce contention. Second, we took advantage of try-locks to measure the lock contention. Third, we used atomic operations instead of locks wherever they were suitable. We also implemented two versions of the Wait-Free GVT algorithm: one is the five-phase computation proposed in the original paper, and the other is our three-phase implementation. For our experiments, we chose the mutex-lock implementation of Mattern's GVT and the five-phase Wait-Free GVT because of their reliability and higher performance.
3.2.1 Samadi’s GVT Algorithm
Samadi’s GVT algorithm requires message acknowledgments to be sent on every
message sent between PEs [41]. The sender PE is responsible for accounting for each
message it has sent until it receives the acknowledgment, thereby solving the transient
message problem.
The simultaneous reporting problem is solved by having PEs tag any acknowledgment message that they send between reporting their LVT and receiving the new GVT value. This identifies messages that might "slip between the cracks" and notifies the sender PE to account for the message before reporting its LVT.
Specifically, Samadi’s asynchronous GVT computation progresses in five main
steps as shown below:
1. One of the PEs is chosen as the leader and at each GVT cycle, it broadcasts
a Report-LVT message to all other PEs in order to initiate the GVT computation.
2. Upon receiving the Report-LVT message, the PEs send their LVTs to the leader. Specifically, each PE sends a message indicating the minimum time stamp among 1) all unprocessed events in its event queue, 2) all unacknowledged messages and anti-messages it has sent, and 3) all marked acknowledgment messages it has received since the last new GVT. The PE then sets a flag indicating that it is in the find phase.
3. For each message or anti-message received by a PE while it is in the find phase, the PE sends a marked acknowledgment message indicating the time stamp of the message it is acknowledging. An unmarked acknowledgment message is sent for all messages received while not in the find phase.
4. When the leader receives a local minimum value from every PE in the system,
it computes the minimum of all these values as the new GVT and broadcasts it to all
the PEs in the system.
5. Upon receiving the new GVT value, each PE switches from the find phase back to the normal phase and continues the main simulation tasks until the next GVT round.
Figure 3.2 shows an example of Samadi's GVT computation. The acknowledgement for the message from PE 1 to PE 2 arrives after PE 1 reports its LVT, so PE 1 has to account for the time stamp 15 when it reports to the leader. If the acknowledgement had arrived before PE 1 reported, then accounting for the time stamp 15 would have been PE 2's responsibility.
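The marking rule can be sketched in a few lines of C (an illustrative fragment based on the description above, not Samadi's original pseudocode; all names are hypothetical):

#include <float.h>

typedef struct {
    int    in_find_phase;    /* set after reporting the LVT, cleared on new GVT */
    double marked_ack_min;   /* min time stamp carried by marked acks received  */
} samadi_pe;

/* Receiver side: every message is acknowledged; the acknowledgment is
 * marked if the receiver is currently in the find phase, flagging a
 * message that might otherwise slip between the cracks.               */
static int ack_is_marked(const samadi_pe *pe) {
    return pe->in_find_phase;
}

/* Sender side: a marked acknowledgment tells the sender to fold the
 * acknowledged time stamp into the local minimum it reports.          */
static void on_ack_received(samadi_pe *pe, double ts, int marked) {
    if (marked && ts < pe->marked_ack_min)
        pe->marked_ack_min = ts;
}

/* Reset when a new GVT value arrives and the PE leaves the find phase. */
static void on_new_gvt(samadi_pe *pe) {
    pe->in_find_phase  = 0;
    pe->marked_ack_min = DBL_MAX;
}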
3.2.2 Mattern’s GVT Algorithm
One drawback of Samadi's algorithm is that it requires an acknowledgment message to be sent for each message and anti-message.
Fig. 3.2. Snapshot of Samadi's GVT Computation
The underlying communication software may automatically send acknowledgments for reliable message delivery; however, such acknowledgments are typically not visible to the simulation kernel. Therefore, a PDES framework has to implement acknowledgment messages itself.
Matter’s GVT algorithm is also asynchronous like Samadi’s algorithm but it does
not require message acknowledgments. The fundamental idea behind Mattern’s GVT
is dividing the simulation into two parts with a cut : the past and the future. As
shown in Figure 3.3, a PE considers all event processing, messages sent and message
received before the cut point (in wall clock time) as having happened in its past. On
the contrary, a PE refers all the actions happened after the cut point as being in its
future.
Fig. 3.3. Cut divides simulation into two: past and future.
The set of cut points across all the PEs in the system defines the cut of the distributed simulation. At each GVT round, Mattern's algorithm creates two cuts across the PEs and computes the GVT based on the snapshot taken at the second cut. The purpose of the first cut is to notify each PE to start recording the smallest time stamp of any message it sends; these messages could cause a transient message problem if they cross the second cut, and therefore must be included in the GVT computation. The second cut is defined to guarantee that each message sent from the past of the first cut is received before the construction of the second cut. As a result, every transient message in the system (i.e., every message crossing the second cut) was sent after the first cut, and is therefore accounted for in the GVT computation because the PEs record the time stamps of such messages from the first cut onward.
PEs are colored based on where they are (virtually) with respect to the cut line, as shown in Figure 3.4. PEs are initialized as white and switch to red after their first cut point is reached. After the second cut, PEs return to white. White PEs mark the messages they generate as white, and red PEs mark them as red.
Fig. 3.4. Events before the cut line colored as white, after the cut line colored as red (dotted arrow).
By design, all white messages must be received prior to the second cut; thus, all transient messages crossing the second cut must be red. In Figure 3.5, the message depicted as a dotted arrow violates this rule. The set of messages crossing the second cut is a subset of all red messages. Therefore, the minimum time stamp among all the red messages is a lower bound on the minimum time stamp of all transient messages crossing the second cut.
Fig. 3.5. Second cut should stretch towards future so that there should be no message sent from the white phase and received in the consecutive white phase (dotted arrow).
The GVT is computed as the minimum of 1) the minimum time stamp among all red messages, and 2) the minimum time stamp of any unprocessed message in the snapshot defined by the second cut. These two quantities are stored locally at each PE, so computing their minimum is trivial. The challenge, however, is creating a second cut that no white message will ever cross. This requires guaranteeing that any message generated prior to the first cut is received prior to the second cut.
The first cut can be constructed by circulating a control message in a logical ring of PEs. The GVT round is initiated by a leader PE, which starts the circulation for the first cut. Upon receiving the control message, each PE changes its color from white to red and passes the control message to the next PE in the ring.
When the leader PE gets back the control message it sent at the beginning of the round, the first cut is guaranteed to have been constructed. During this process, each PE has to access the control message only once. After the leader PE receives the control message from the last PE in the ring, no new white messages will be generated.
The construction of the second cut is different. Again, the leader PE initiates it by sending the control message to the next PE in the ring. However, a PE will not forward the control message to the next PE until it can guarantee that it has received all the white messages destined for it (including the leader).
To implement this, each PE keeps an array of counters indicating the number of white messages it has sent to each PE. These arrays are accumulated into the control message as it circulates among the PEs during the construction of the first cut. After the first cut is constructed, the control message therefore carries the number of white messages sent to every PE.
During the construction of the second cut, a PE accesses the accumulated array
counters in the control message to compare how many white messages have been
sent to it in total to how many white message it actually received. When these two
numbers are equal, a PE will check that it received all of the white messages that
have been sent to it. Then, it can forward the control message to the next PE in the
ring.
Each PE maintains the following local variables:
• T min : Holds the smallest time stamp of any unprocessed message in the PE’s
event queue (same as LVT).
• T red : Holds the smallest time stamp of any red message sent by the PE.
• array counters : The array of counters indicating how many white messages
the PE has sent to each of the other PEs. The destination PE's id is the index into
the sender PE's array counters. PEs also count the number of white messages
they receive; this count is held at array counters[PE id].
• color : Current color of the PE, white or red.
The control message contains three fields:
• CM T min : Records the minimum of T min values among PEs that the con-
trol message has circulated thus far.
• CM T red : Records the minimum of T red values among PEs that the control
message has circulated thus far.
• CM array counters : The cumulative array of counters among PEs that the
control message has visited thus far. CM array counters[i] indicates the number
of total white messages sent to PE i.
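To make this bookkeeping concrete, the following C sketch shows one possible layout of the per-PE state and the control message; the type and field names are our own illustration and do not correspond to the actual ROSS data structures.

#define NUM_PE 256                      /* assumed maximum number of PEs */

typedef enum { WHITE, RED } pe_color_t;

/* Per-PE bookkeeping for Mattern's GVT (illustrative layout) */
typedef struct {
    int        id;
    double     T_min;                    /* smallest unprocessed time stamp (LVT) */
    double     T_red;                    /* smallest time stamp of any red message sent */
    long       array_counters[NUM_PE];   /* white messages sent per destination PE;
                                            array_counters[id] counts white messages received */
    pe_color_t color;                    /* current color of the PE: WHITE or RED */
} pe_state_t;

/* Control message circulated around the logical ring (illustrative layout) */
typedef struct {
    double CM_T_min;                     /* minimum of T_min over the PEs visited so far */
    double CM_T_red;                     /* minimum of T_red over the PEs visited so far */
    long   CM_array_counters[NUM_PE];    /* CM_array_counters[i]: total white messages
                                            sent to PE i, accumulated along the ring */
} control_message_t;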
Now we can describe Mattern's GVT algorithm. On each message send, if the
event's color is white, the PE increments its message counter for the destination PE
using array counters. If the event's color is red, the PE updates its T red. When a white
message is received, the PE increments its received message counter. If it is red, the
PE does nothing. These procedures are presented in Algorithms 3 and 4.
Algorithms 5 and 6 present the procedures executed when the first and second cut
points are reached, respectively. When a PE reaches its first cut point, it changes its
color to red, resets its T red and accumulates its array counters into the CM. When the
second cut point is reached, a PE waits until it has received all the white messages
destined for it. After that is checked, the PE updates the control message with its
T min and T red, forwards it to the next PE in the ring, and finally resets its counters
and continues the simulation.
Mattern’s GVT algorithm is designed specifically for the distributed memory
system. For this study, we adapted Mattern’s distributed GVT to make it more
suitable for the shared memory architecture in order to exploit the large number of
CPUs available in the KNL processor. Messaging between PEs are realized by writing
ALGORITHM 3: Message Send
// PE I sending event E with time stamp T to PE J
1 if E → color == white then
2 PE I → array counters[PE J → id] += 1
3 else
4 PE I → T red = min(PE I → T red, T)
5 end
6 send message(E)
ALGORITHM 4: Message Receive
// PE I receives event E with time stamp T
1 event E = receive message(PE I)
2 if E → color == white then
3 PE I → array counters[PE I → id] += 1
4 else
// Ignore
5 end
6 PE I → event queue.push(E)
ALGORITHM 5: First Cut
// PE reaches first cut point
1 if PE → color == white then
2 PE → T red = ∞
3 PE → color = red
// PE accumulates its message counters into CM’s message counters
4 for i = 0; i < #PE; i + + do
5 if PE → id != i then
6 CM → array counters[i] += PE → array counters[i]
7 end
8 end
// Forward the control message to the next PE in the ring
9 forward(CM, PE → id + 1)
10 else
// Assert
11 end
the event to the destination PE's event queue, as mentioned previously. Thus, what
would be transient messages in a distributed system correspond, in our shared memory
architecture, to events which have not yet been written to the target event queue.
ALGORITHM 6: Second Cut
// PE reaches second cut point
1 if PE → color == red then
// PE loops until it receives all messages destined to it
2 int key = PE → id
3 while 1 do
4 if PE → array counters[key] == CM → array counters[key] then
// All messages received
5 Break
6 end
7 receive message(PE)
8 end
// Update control message
9 CM → T min = min(CM → T min, PE → T min)
10 CM → T red = min(CM → T red, PE → T red)
// Forward the control message to the next PE in the ring
11 forward(CM, key + 1)
// Reset the array counters
12 CM → array counters[key] = 0
13 PE → array counters[key] = 0
14 else
// Assert
15 end
Instead of circulating the control message through a ring, we utilized a global
shared control structure. Each PE accesses this shared structure asynchronously.
During the construction of the first cut, each PE checks this structure only once.
For the second cut, however, a PE keeps checking it until it has received all the white
messages destined for it. Thus, instead of waiting in the GVT subroutine, it continues to execute
core simulation tasks. Both the control message (CM) and the control structure (CS)
have T red and T min fields for the same purposes.
The last PE that successfully checks the control structure at the end of the
second cut computes the GVT by taking the minimum of CS T red and CS T min
and writes it to a global variable. After the new GVT value has been computed, each
PE must read it. PEs do not read it from the control structure; instead, it is held in
a global variable to be read at the end of each GVT round. Once a PE has read the
new GVT, it fossil collects and changes its color back to white. After the predefined
GVT interval, each PE becomes red again (during the construction of the first cut) and
the process repeats.
We also optimized Mattern’s GVT algorithm in terms of memory space. Instead
of using an array of counters, a PE in our implementation holds a single variable
to count how many white messages it has sent and received without considering the
destination or source PE. Also, the control structure holds a single counter instead
of the array of counters to accumulate the counters among the PEs.
A PE decrements its counter when it receives a white message and increments
it when it sends one. During the second cut, this counter is accumulated into the
control structure, and each PE checks whether the control structure's counter is 0. If
the check succeeds, the PE knows that all white messages have been received, and it
updates the control structure's T red and T min with its own T red and T min. If the
check fails, the PE leaves the GVT routine and checks again in the next iteration of the
core simulation loop.
A timing diagram of the asynchronous algorithm is shown in Figure 3.6. For
clarity, assume that all PEs change their phases and check the control structure in
order (this assumption is not necessary in practice, but it simplifies the explanation).
Messages are shown as arrows. The sending (+1) and receiving (-1) white events
are counted locally by each PE as shown. After the transition to the red phase, the
counts are accumulated at the control structure.
The first PE which checks the control structure sees it as 1. This is shown in
the form of a white circle on the first line. Then, the second PE checks the control
structure and it also sees it as 1 since it has no event counts to accumulate. Then,
the third PE with the message count of -2 checks the control structure and updates
it from +1 to -1. This is depicted as another white circle, implying that some events
which are not yet written to the destination event buffer may still exist. Finally, the
fourth PE arrives and accumulates its +1 event count with the control structure and
checks it successfully (sees the counter as 0). This is shown as a black circle at the
bottom line, implying that this PE accumulated the time stamp of its minimum red
message and its LVT.
Fig. 3.6. Snapshot of Mattern's GVT computation
Once a PE reads the control structure as 0, it is guaranteed that all sent events
have been written to the destination PEs' event queues. Thus, the GVT computation
can be performed at this point. The PEs accumulate their minimum red message
timestamps and LVTs into the control structure as they pass the black circles. The
last PE that reaches the black circle computes the GVT by taking the minimum of
the control structure's T red and T min. At this point, the control structure holds the
LVT of the fourth PE since it has the smallest timestamp. However, the red event
from the first PE has an even smaller timestamp. Therefore, the GVT is set to the
timestamp of that event. Finally, all PEs read this new GVT value, turn their color
into white, and start counting events again.
The pseudo-code of our modified Mattern’s GVT algorithm for shared mem-
ory systems is presented in Algorithms 7, 8 and 9. The updated Message Send and
Message Receive procedures are shown in Algorithms 7 and 8, respectively. The
previously separate First Cut and Second Cut functions are incorporated into
the GVT function as shown in Algorithm 9. Since this modified implementation is
through shared memory, the control structure is not forwarded anymore. Instead, it
is implemented as a global shared structure.
ALGORITHM 7: Modified Message Send
// PE I sending event E with time stamp T to PE J
1 if E → color == white then
2 PE I → msg counter += 1
3 else
4 PE I → T red = min(PE I → T red, T)
5 end
6 send message(E)
As seen in Algorithm 9, lines 1 through 6, the message counters are no longer
accumulated during the first cut. Instead, during the second cut, each PE updates
the control structure with its message counter and checks if the updated value is 0
(lines 18 and 19). If the check succeeds, then the PE updates the control structure
one more time to write its T min (LVT) and T red into the control structure. Also,
the lastPE notation used in lines 4, 12 and 22 relies on a shared counter that counts how many
ALGORITHM 8: Modified Message Receive
// PE I receives event E with time stamp T
1 event E = receive message(PE I)
2 if E → color == white then
3 PE I → msg counter -= 1
4 else
// Ignore
5 end
6 PE I → event queue.push(E)
of the PEs have finished the associated part of the algorithm. Concurrent accesses to this
counter are protected using three different approaches, as discussed next.
We investigated different ways to cope with the lock contention. The first one
uses a tree of mutex locks to reduce the contention, the second one uses try locks to
measure the contention, and the third one utilizes atomic operations. The first one is
used for our experiments because it yields better overall performance and has been
tested more extensively. These versions are explained as follows:
1. Lock Partitioning: Concurrent updates to the shared counters are serialized
using a tree based lock structure. Lock partitioning is implemented to prevent all the
PEs from competing for a single lock. Instead, groups of PEs compete for their
associated group lock and once a group is done, their group flag is turned on. When
all the groups are complete, an accumulated update is written to the shared counter.
2. Try Lock: The critical section is protected by a mutex try lock. A PE first
tries to acquire the lock and if it fails, it acquires the lock using a regular mutex lock.
If the try lock succeeds, the PE executes the critical section and releases the lock. We
keep track of how many times the try lock fails. This approach helps us to evaluate
the lock contention.
3. Atomic Operations: Atomic operations are hardware instructions that en-
ALGORITHM 9: Matter’s GVT Algorithm for Shared Memory Architectures
1 if PE → color == white then
// First cut point reached
2 PE → T red = ∞
3 PE → color = red
4 if lastPE then
5 GVT ready = False
6 end
7 else
8 if GVT ready == TRUE then
9 PE → GVT = new GVT
10 PE → color = white
11 fossil collect(PE)
12 if lastPE then
// Reset the control structure
13 CS → T min = ∞
14 CS → T red = ∞
15 CS → msg counter = 0
16 end
17 else
// Second cut point reached
18 CS → msg counter += PE → msg counter
19 if CS → msg counter == 0 then
// Update the control structure
20 CS → T min = min(CS → T min, PE→ T min)
21 CS → T red = min(CS → T red, PE→ T red)
22 if lastPE then
// GVT is ready
23 new GVT = min(CS → T min, CS → T red)
24 GVT ready = TRUE
25 end
26 end
27 end
28 end
able concurrent accesses to the shared variables without locking them. Specifically,
the GCC intrinsics __sync_add_and_fetch and __sync_bool_compare_and_swap are
utilized to remove the mutex locks. However, not all of the locking was suitable for
replacement by atomic operations; coarse critical sections remained unchanged. As a
result, we measured the best performance with Lock Partitioning, even though the
Atomic Operations approach is expected to provide higher concurrency than a locking
approach.
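To illustrate the last two approaches, the sketch below shows how a shared counter update could be protected with a pthread try lock (falling back to a blocking lock while recording the contention) or, alternatively, performed with the GCC atomic built-ins. The variable and function names are ours; this is a simplified stand-in rather than the actual implementation, and the two functions represent alternative strategies, not code meant to be mixed on the same counter.

#include <pthread.h>

static pthread_mutex_t cs_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;      /* e.g., the control structure's message counter */
static long trylock_failures = 0;    /* how often the try lock found the lock already held */

/* Approach 2 (Try Lock): attempt the lock first; on failure, record the
   contention and fall back to a regular blocking lock. */
void add_with_trylock(long delta) {
    if (pthread_mutex_trylock(&cs_lock) != 0) {
        __sync_add_and_fetch(&trylock_failures, 1);
        pthread_mutex_lock(&cs_lock);
    }
    shared_counter += delta;         /* critical section */
    pthread_mutex_unlock(&cs_lock);
}

/* Approach 3 (Atomic Operations): lock-free update using a GCC built-in. */
long add_with_atomics(long delta) {
    return __sync_add_and_fetch(&shared_counter, delta);
}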
The GVT interval is a predefined parameter which sets the gap between two
consecutive GVT calculation rounds. Ideally, the interval can be shorter at high
remote percentages to reduce rollbacks, and longer at low remote percentages to
reduce the computational overhead since rollbacks are less likely. A GVT computation
round is signalled when a global interval counter reaches zero. This causes each PE
to transition from its normal white phase to the red phase. A GVT interval of
128 is chosen, but it is possible to initiate the GVT round before the interval counter
reaches zero: during event processing, if PEs run out of free event buffers, we can
force a GVT update to fossil collect immediately.
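A minimal sketch of this triggering logic is shown below; the names and the reset policy are hypothetical and only illustrate the decrement-and-force behavior described above.

#define GVT_INTERVAL 128

static volatile long gvt_interval_counter = GVT_INTERVAL;

/* Hypothetical hook: in the real simulator this would switch the PEs to the
   red phase and start a new GVT round. */
static void begin_gvt_round(void) { /* ... */ }

/* Called from the core simulation loop (simplified: the PE that triggers the
   round also resets the counter). */
void maybe_start_gvt_round(int out_of_free_event_buffers) {
    if (out_of_free_event_buffers ||
        __sync_sub_and_fetch(&gvt_interval_counter, 1) <= 0) {
        gvt_interval_counter = GVT_INTERVAL;   /* reset for the next round */
        begin_gvt_round();
    }
}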
3.2.3 Wait-Free GVT Algorithm
Mattern’s algorithm does not take advantage of the shared memory, as it was
originally developed with messaging passing as its focus. In a shared memory system,
PEs can read the messages/anti-messages that have been sent to it instantly. Each
PE is in charge of managing its message queue which is populated by the events just
after they are sent. Thus, there is no in-flight messages across PEs. When PEs need
to process events, they read them from their message queue and insert into event
queue. One can think message queues as buffers and event queues as processing lines.
Message queues are named as inq in ROSS, as was explained in detail in the previous
43
chapter.
A GVT computation requires each PE to take the minimum time stamped
event in its event queue and calculate its local minimum by comparing it with its
local virtual time. The GVT is then computed by taking a global minimum
across the local minimums of all PEs. It becomes problematic if a message has been
written into the PE's message queue, but not yet inserted into its event queue. When
that PE seeks to compute the GVT, it will miss the event, which possibly has a lower
time stamp than its LVT or the minimum time stamped event in its event queue.
This problem is resolved in [37] by using five phases to ensure that every message
is accounted for: phase A, phase Send, phase B, phase Aware and phase End. Each
PE starts the GVT computation from phase A and computes its local minimum,
called min A. This is the minimum of 1) the PE's LVT and 2) the minimum time
stamped event in its event queue. When all PEs finish their phase A, they proceed to
phase Send. Here they incorporate messages into their event queue, execute one more
event and send output messages/anti-messages if there are any. This ensures that there
will be no message left in their message queue with the potential to become the new GVT.
Once each PE completes its phase Send, it proceeds to phase B and computes
a second local minimum, called min B. At this point, each PE's local minimum
is set to the minimum of min A and min B. In phase Aware, a global minimum is
computed across all local minimums and taken by each PE as the new GVT. Finally,
PEs move to phase End and become ready for the next GVT round.
Once all PEs complete the same phase, they can move on to the next phase. This
is controlled by atomic operations to prevent locking overhead and ensure correctness
of the GVT value. In contrast to Mattern's algorithm, threads are never blocked waiting
to acquire a lock, and they calculate the GVT in a wait-free fashion. Algorithm 10
presents the pseudo-code of the Wait-Free GVT computation.
ALGORITHM 10: Wait-Free GVT Algorithm
1 if PE → phase == A & GVT round initiated then
2 int min A = min(PE → LVT, min event(PE → event queue))
3 atomic add(phase counter A, 1)
4 PE → phase = Send
5 else if PE → phase == Send & phase counter A == # PE then
6 event e = read messages(PE)
7 execute messages(PE, e)
8 send messages(PE)
9 atomic add(phase counter send, 1)
10 PE → phase = B
11 else if PE → phase == B & phase counter send == # PE then
12 int min B = min(PE → LVT, min event(PE → event queue))
13 int min final = min(min A, min B)
14 min array[PE → id] = min final
15 atomic add(phase counter B, 1)
16 PE → phase = Aware
17 else if PE → phase == Aware & phase counter B == # PE then
18 new GVT = min(min array)
19 PE → GVT = new GVT
20 atomic add(phase counter aware, 1)
21 PE → phase = End
22 else if PE → phase == End & phase counter aware == # PE then
23 fossil collect(PE)
24 PE → phase = A
25 end
A timing diagram of the Wait-Free GVT is shown in Figure 3.7. In phase A,
each PE calculates its min A while sending and receiving messages. When the last
PE completes calculating its min A, all PEs proceed to phase Send and account for
messages that could possibly become the new GVT. In phase B, they compute min B
and find their absolute local minimum. When each PE completes this operation, the
new GVT is computed by one of the PEs and written to a global variable so that the
other PEs can take it and finish the GVT computation.
Fig. 3.7. Snapshot of a Wait-Free GVT computation
We also tried to optimize the Wait-Free GVT algorithm by implementing it using
three phases instead of five. In this version, computation starts with phase
Compute, where a PE incorporates messages from its message queue into its event
queue. Then, it updates its LVT and moves into phase Send when all other PEs
finish with phase Compute.
In phase Send, a PE writes its LVT into a global shared array indexed by its
id. The last PE that enters phase Send is responsible for computing the
minimum of the LVTs written into the shared array. That minimum becomes the
new GVT value and is written into a global shared variable. Finally, in phase Aware,
PEs read the new GVT value and switch back to phase Compute.
In phase Send, writes into the shared global array by threads that reside on
different cores trigger cache coherence traffic and cause false sharing. This impacts
performance significantly since a write from one core invalidates the entire cache
line in the other cores' first level caches, and the line has to be transferred again to
keep those caches consistent.
This problem can be fixed using padding. The global shared array then holds a
structure that fills an entire cache line (64 bytes), instead of holding a double
variable storing the LVT (8 bytes). This solution prevents the excessive invalidation of
cache lines. On the other hand, this approach wastes cache capacity since an entire
cache line is sacrificed for a single double variable.
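A sketch of this padding fix is given below, assuming 64-byte cache lines; the struct and array names are illustrative rather than taken from our code.

#include <stdalign.h>                 /* C11 alignas */

#define NUM_PE          256           /* assumed maximum number of PEs */
#define CACHE_LINE_SIZE 64

/* Each entry is aligned to, and fully occupies, its own cache line, so writes
   by threads on different cores never touch the same line. */
typedef struct {
    alignas(CACHE_LINE_SIZE) double lvt;          /* the 8 bytes of actual payload */
    char pad[CACHE_LINE_SIZE - sizeof(double)];   /* padding to fill the line */
} padded_lvt_t;

static padded_lvt_t lvt_array[NUM_PE];   /* shared array written during phase Send */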
For our experiments, we used the original implementation of the Wait-Free GVT
algorithm since it has been tested more extensively. We leave the study of this
modified Wait-Free GVT for future work.
Chapter 4
EXPERIMENTAL RESULTS
In this chapter, we present the results of our experiments with the three GVT
algorithms on both KNL and Xeon processors. We set a fixed GVT interval for each
algorithm for all the experiments. We selected the intervals that performed best across
our simulations. Specifically, we observed that a GVT interval of 128 was the best
overall choice for the Barrier GVT algorithm. For the asynchronous algorithms, a GVT
interval of 200 was used on Xeon and a GVT interval of 400 was used on KNL. While
the performance of the Barrier algorithm is dependent on the GVT interval, the
performance of the asynchronous algorithms is affected by it to a much lesser extent.
Another variable parameter is the percentage of messages that are sent to a
different thread. We call such messages remote. The opposite of remote would be
local, and such messages do not cross threads.
4.1 GVT Performance on 12-core Classical Xeon Machine
First, we present the performance trends for the three GVT algorithms (the Barrier
algorithm based on pthread barrier calls, Mattern's algorithm and the Wait-Free
algorithm) on a traditional 12-core Xeon processor with hyper-threading. We scale
these experiments up to 24 threads. Previous work [45] showed that overloading the
cores is counter-productive for performance, so we do not consider those scenarios here.
Fig. 4.1. Committed Event Rate on Xeon with Balanced Loading and 0 EPG: 10% Remote Events (left) & 100% Remote Events (right)
4.1.1 Model 1: Balanced Loading & Fast Event Processing
Figure 4.1 shows the committed event rate of ROSS for a simple PHOLD model
with 10% and 100% remote events respectively. The event processing granularity is
set to zero for these experiments, resulting in communication-dominated scenarios
with little event processing. As seen from these results, the asynchronous algorithms
significantly outperform the Barrier implementation, and the difference increases as
the simulation scales to 24 threads. We also observe that the Wait-Free GVT algorithm
is faster than Mattern's GVT by 30% and significantly faster than Barrier synchro-
nization. For example, the performance advantage of the Wait-Free algorithm over
the Barrier implementation for 24 threads is almost 50% and 48% for the cases with
10% and 100% remote events respectively. These trends are not surprising and allude
to the advantages of asynchronous GVT computations that allow the event processing
to continue without blocking the threads.
4.1.2 Model 2: Balanced Loading & Slower Event Processing
Fig. 4.2. Committed Event Rate on Xeon with Balanced Loading and 50% Remote Events: 100 EPG (left) & 500 EPG (right)
Figure 4.2 shows the performance of three GVT algorithms for a scenario with
balanced load, 50% remote events and high event processing granularity. For this, we
consider the EPG values of 100 and 500. As expected, GVT becomes less of a bottle-
neck with high event processing granularity (due to a more dominant contribution of
the event processing itself). Specifically, with EPG of 100, the Wait-Free algorithm
outperforms Barrier GVT by 19% for 24-threaded simulation. With EPG of 500, the
percentage difference drops to only 12%.
In summary, the behavior of the two classes of GVT algorithms on a 12-core
Xeon processor reflects conventional wisdom and shows substantial improvements of
asynchronous GVT computation, particularly in scenarios with fast event processing,
which are typical for PDES applications. The asynchronous algorithms, and especially
the Wait-Free implementation, outperform the Barrier GVT algorithm significantly.
These trends generally hold regardless of the percentage of events generated
remotely and regardless of the balance in the workload of each thread. The two classes
of algorithms perform closer to each other only at high EPG values, which make event
processing a major part of the simulation time, thus de-emphasizing the importance
of GVT efficiency.
In the next subsection, we analyze and compare the performance of these algo-
rithms on the KNL system and demonstrate quite different trends. We also explain
the reasons for this behavior.
4.2 GVT Performance on 64-core Knights Landing Architecture
First, we evaluate scenarios for a classical PHOLD model, where all threads are
loaded evenly and the EPG is set to zero. Our second model is also balanced but
experiences heavier computational overhead. The last two models are imbalanced in
terms of communication and event processing respectively. We present the results
for the remote percentages of 0%, 10%, 50% and 100%, and we show the committed
event rates.
In each graph, we present the results for five simulation scales: 1) 24-threaded
simulation to match the maximum number of threads that we used to collect the
results on the Xeon machine as described in the previous section; 2) 64-threaded sim-
ulation to put one thread on each KNL core; 3) 128, 192 and 250-threaded simulations
to put 2, 3 and 4 threads respectively on each KNL core. Six threads are reserved for
Slurm, the job management software deployed on our cluster. Since the KNL
cores are 4-way SMT, a 256-way simulation loads the chip to capacity.
4.2.1 Model 1: Balanced Loading & Fast Event Processing
The results presented in Figures 4.3 and 4.4 show a different trend compared
to what we observed on Xeon. While for most scenarios, Mattern’s GVT algorithm
Fig. 4.3. Committed Event Rate on KNL with Balanced Loading and EPG of 0: 0% Remote Events (left) & 10% Remote Events (right)
still outperforms the Barrier implementation, the difference in many cases is signifi-
cantly smaller than what we observed on the Xeon system. However, the Wait-Free
GVT algorithm continues to perform better than the synchronous algorithm. For
example, the Wait-Free algorithm is 30% faster than the Barrier implementation at
250-scale when we average over all remote percentages, while Mattern's GVT is 21%
faster in this case.
The key observation from these results is that even when Mattern's asynchronous
algorithm is faster than Barrier, the performance differences are significantly smaller
compared to those observed on a conventional Xeon machine, while the Wait-Free
implementation outperforms the other algorithms significantly. Consequently,
locking overhead becomes more critical when simulation is performed on a KNL pro-
cessor compared to when it is performed on a Xeon processor. Note that even if we
compare 24-way simulations on Xeon and KNL, the performance difference between
Barrier and asynchronous algorithms is much smaller on KNL.
In order to explain this performance disparity, it is instructive to compare de-
Fig. 4.4. Committed Event Rate on KNL with Balanced Loading and EPG of 0: 50% Remote Events (left) & 100% Remote Events (right)
lays involved in both algorithms and project the impact of scaling on these delays. In
the asynchronous algorithms, each PE updates its message counters whenever it sends
or receives a message to keep track of transient messages. This causes a computa-
tional overhead, especially at high scales with high remote percentages. In addition,
Mattern's algorithm involves thread serialization to determine that all conditions
for establishing the new GVT value are met by all threads. This requires locking
of shared variables and has a non-trivial performance impact, which worsens with
scaling. Especially at high remote percentages, failure to acquire locks is a major
overhead. At 250-scale and 100% remote events, the total number of locking failures
experienced by Mattern's GVT implementation is 5,347,302. This number drops to
4,536,272, 1,808,442 and 1,021,915 at 50%, 10% and 0% remote events respectively.
We analyze the impact of this using detailed profiling of the simulation in subsequent
sections.
None of these overheads apply to the Wait-Free GVT algorithm, which does not
need to count messages and is therefore computationally much more lightweight. Also,
thread serialization is realized through atomic operations, which makes the algorithm
wait-free. Threads are not blocked to acquire a lock and the whole simulation proceeds
faster. We observe that Wait-Free outperforms Mattern's algorithm by 7%, 30%, 40%
and 42% at 250-scale with 0%, 10%, 50% and 100% remote events respectively, when
event processing is fast.
For the Barrier implementation, the major overhead is its synchronous nature.
The barrier-based approach freezes all the PEs at the GVT barrier and waits until
all transient messages arrive, at which point the simulation continues. This peri-
odic stopping of the simulation detrimentally impacts performance since no message
transmission or processing is accomplished by any thread during the GVT computation
interval.
4.2.2 Model 2: Balanced Loading & Slower Event Processing
Figure 4.5 shows KNL performance for the models with higher event processing
granularity. We can observe that slower event processing is not as major an overhead
on KNL systems as it was on Xeon processors. Although the performance gap shrinks
compared to the fast event processing case, the Wait-Free implementation still out-
performs the other algorithms. At 250-scale and 100% remote events, the Wait-Free
GVT algorithm is 29% faster than Mattern's GVT and 31% faster than the Barrier
GVT computation.
4.2.3 Model 3: Imbalanced Communication
Figure 4.6 presents the results for the scenario with imbalanced loading, where
some threads are chosen as message destinations more often than others, as
explained in previous sections. The left side of the figure presents the committed
event rate for 10% remote events. We can observe that the Barrier GVT algorithm
outperforms the asynchronous algorithms at all scales. Wait-Free GVT exhibits an
advantage at small scale, but the advantage disappears at scales above 128 threads.
Fig. 4.5. Committed Event Rate on KNL with Balanced Loading and the EPG of 100: 10% Remote Events (left) & 100% Remote Events (right)
In fact, the Barrier algorithm outperforms the Wait-Free algorithm by 25% and
Mattern's GVT by 30%.
The right side of Figure 4.6 shows the ratio between the total number of events
and the number of committed events for the three algorithms. Here, the number of
committed events is kept strictly linear with respect to the number of threads, and is
the same for all algorithms. While the number of committed events is the same for all
algorithms, the total number of events for the Barrier implementation is almost 30%
less than for the asynchronous algorithms. This shows that the efficiency of the Barrier
algorithm (56%) is higher than that of the asynchronous GVT algorithms (40%). This
can be credited to the fact that the Barrier implementation performs a significantly
smaller number of rollbacks compared to the asynchronous implementations. Specifi-
cally, for the case of 250 threads, the Barrier implementation performs 11.7 million
rollbacks whereas the asynchronous implementations perform about 20 million.
The synchronous nature of the Barrier algorithm reduces the disparity between
LPs in imbalanced models, so that at high scales it outperforms the asynchronous
algorithms significantly. The reason is that, when some of the LPs receive more
messages, asynchronous GVT computation allows them to stay behind the LPs with
less message load, while the Barrier algorithm syncs the LPs periodically at every
GVT computation. In addition, the asynchronous implementations require larger
optimistic memory for imbalanced models, thus leading to more cache misses
and worse memory performance.
Fig. 4.6. Imbalanced in terms of Communication on KNL: Committed Event Rate (left) & Efficiency (right)
4.2.4 Model 4: Imbalanced Event Processing
The final scenario that we consider on KNL, for completeness of the presentation,
is the imbalanced model with changing EPG values per LP. This model is imbal-
anced by generating different event processing delays while keeping the communica-
tion structure balanced. We generate a uniformly random weight per LP whenever
it sends a message. This weight is then multiplied by an EPG constant to set varying
EPG values for each LP throughout the simulation.
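A small sketch of how such a per-message delay could be generated is given below; the EPG constant and the uniform range are placeholders, not the exact values used in our experiments.

#include <stdlib.h>

#define EPG_CONSTANT 100.0            /* placeholder base event processing granularity */

/* Draw a uniform random weight in [0, 1] and scale it by the EPG constant,
   so that each sent message carries a different processing delay for its LP. */
double next_epg_delay(void) {
    double weight = (double)rand() / (double)RAND_MAX;
    return weight * EPG_CONSTANT;
}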
Committed event rates for 100% remote events are presented in Figure 4.7 (left).
The trends here are very different from those observed in Figure 4.6, with the barrier-
based GVT demonstrating significantly worse performance than the asynchronous
algorithms. For example, at 250-scale the Wait-Free GVT is 47% faster than Barrier.
We observe that the asynchronous algorithms only suffer from imbalanced models in
terms of network communication, but not from imbalanced event processing delays.
This can also be seen in the efficiency graphs in Figure 4.7 (right). While the committed
event numbers are the same for the three algorithms, the number of total events is
higher for the Barrier implementation by almost 30% than for the asynchronous
algorithms. This shows that the barrier synchronization is a bottleneck for the
synchronous algorithm when the model experiences varying processing delays.
Fig. 4.7. Imbalanced in terms of Event Processing on KNL: Committed Event Rate (left) & Efficiency (right)
4.3 Profiling and Analysis
To further explain the behavior observed in the previous section, we isolated,
as much as possible, the GVT computation and analyzed the execution behavior of
both asynchronous and barrier-based GVT algorithms. This was achieved by running
the simulation with 0% remote messages and 0 EPG loading. Though executing
the simulation with no remote messages is a contrived example in the context of
PDES, it serves to eliminate node-to-node event communication leaving only GVT
communication and a consistent local event processing load for measuring simulation
performance.
As shown in Figure 4.3 (left), the asynchronous algorithms outpace Barrier in
performance as thread count increases for 0% remote messages with 0 EPG loading,
with Wait-Free GVT showing the highest performance. However, when examining
imbalanced loads as shown in Figure 4.6, we noted that Barrier is actually superior.
Using htop, a utility similar to top that includes per-thread visualization, we
noted that the asynchronous algorithms allow CPU saturation while the Barrier al-
gorithm does not. This is attributable to the mt_all_reduce functions, which require
all threads to block synchronously in the Barrier implementation. However, this
observation does not provide any quantitative insight into the superior performance
of Barrier GVT under imbalanced loads.
Therefore, we next analyzed GVT algorithm performance using the perf tool [22, 39].
Perf is a utility that collects performance counter information for examining program
performance. As with a profiler, the program to be analyzed is invoked within the perf
tool. However, no special compilation is required and the performance penalty is
much smaller than with a profiler.
To analyze the data in Figure 4.1, Figure 4.3, and Figure 4.6, three sets of perf
results are presented for comparison: Table 4.1 summarizes the perf results for
the Xeon processor with 24 threads using a balanced load. Table 4.2 summarizes
the perf results for the KNL processor with 128 threads using a balanced load.
Finally, Table 4.3 summarizes the perf results for the KNL processor with 250
threads using an imbalanced load.
Wait-Free Mattern Barrier Statistic
52574552.46 51008083.75 22749859.1 event rate (e/s)
25356.077865 15191.239883 27635.523529 task-clock (ms)
12,671 19,692 70,826 ctxt-switches
37 54 53 cpu-migrations
34,977 33,712 33,107 page-faults
63,538,414,687 37,066,920,255 2,701,184,297 cycles
45,098,656,437 21,933,431,830 21,713,987,634 instructions
0.71 0.59 0.66 insns per cycle
8,304,708,417 3,601,133,777 3,493,369,292 branches
169,630,569 169,375,821 172,940,784 branch-miss
12,934,781,849 5,600,674,689 5,579,363,865 L1-data-lds
474,672,636 480,858,921 484,695,938 L1-data-ld-miss
99,075,495 162,707,266 173,550,910 LLC-loads
3.114383734 2.650194459 3.412732663 seconds elapsed
Table 4.1. Performance statistics for Xeon: 24 Threads, 0% Remote, Balanced, 0 EPG Model
Table 4.1 shows that the asynchronous algorithms have significantly lower context
switch counts than the Barrier algorithm. This is likely the result of the pthread
blocking operations. Additionally, we observe that Wait-Free has approximately 60%
of the context switches that Mattern's GVT has, and only 20% of those of Barrier.
Additionally, though the Mattern and Barrier GVT instruction counts are similar,
Wait-Free GVT has more than double the instructions of the other two. It is likely
that this disparity is what limits the performance improvement of Wait-Free over
Barrier to roughly 2x.
Table 4.2 compares the asynchronous and Barrier algorithms on KNL at 128
threads. As with Xeon, the context switch counts on KNL are much higher for the
Barrier algorithm. However, we observe that Wait-Free has a substantially larger
drop in context switches relative to Mattern (3x) while maintaining a 5x advantage over
Wait-Free Mattern Barrier Statistic
68979714.96 53183960.11 49212670.5 event rate (e/s)
311887.145498 391356.018605 291648.869244 task-clock (ms)
86,057 253,594 492,515 ctxt-switches
135 150 136 cpu-migrations
117,911 118,154 120,667 page-faults
422.0 x 109 523.1 x 109 388.5 x 109 cycles
111.3 x 109 125.1 x 109 111.0 x 109 instructions
0.26 0.24 0.19 insns per cycle
18,033,898,740 21,479,750,643 17,788,136,078 branches
1,361,158,046 1,431,375,741 1,382,901,242 branch-miss
1,432,789,638 1,416,332,551 1,552,712,769 L1-data-ld-miss
11,564,734,810 11,365,247,641 10,242,960,482 LLC-loads
4.885092538 5.538352237 5.841830405 seconds elapsed
Table 4.2. Performance statistics for KNL: 128 Threads, 0% Remote, Balanced, 0 EPG Model
Barrier. However, the performance gain when measured in events per second is more
modest than that of the Xeon (1.3x).
Wait-Free Mattern Barrier Statistic
9737869.32 5842742.38 12409135.08 event rate (e/s)
8152856.669 13434589.436 1499214.161 task-clock (ms)
792,496 3,443,157 3,890,861 ctxt-switches
276 291 290 cpu-migrations
206,068 186,304 203,267 page-faults
10.78 x 1012 17.77 x 1012 1.977 x 1012 cycles
543.1 x 109 1204.6 x 109 385.5 x 109 instructions
0.05 0.07 0.19 insns per cycle
110,818,302,241 264,842,900,303 7,933,394,974 branches
9,081,487,324 18,515,600,126 7,493,233,316 branch-miss
10,763,809,635 16,757,683,847 3,692,152,524 L1-data-ld-miss
121,184,843,882 196,197,041,611 43,102,647,627 LLC-loads
36.255671304 57.883402944 29.163448318 seconds elapsed
Table 4.3. Performance statistics for KNL: 250 Threads, 10% Remote, Imbalanced, 0 EPG Model
Table 4.3 compares the asynchronous and Barrier algorithms on KNL at 250
threads with an imbalanced load. As in the previous examples, the context switch
counts on KNL are 5x higher for the Barrier algorithm when compared with Wait-Free.
We also observe that Mattern's GVT now has nearly as many context switches as
Barrier, and that the cache pressure is an order of magnitude lower for Barrier. This
may be the result of the substantial increase in branches and branch misses in the
asynchronous algorithms. Finally, we note that the achieved instructions per cycle is
4x higher with Barrier than with the asynchronous algorithms.
These findings are consistent with the lower efficiency reported in the ROSS
statistics. This confirms that though the asynchronous algorithms are faster, this
speed allows PEs to run farther ahead than under Barrier with optimistic operation
and thus results in more wasted work. The number of rolled back events confirms this
behavior. At 250-scale, Barrier experiences 236,477,644 rollbacks while Wait-Free and
Mattern's GVT algorithm experience 392,338,925 and 393,799,600 rollbacks respectively.
Chapter 5
LITERATURE REVIEW
GVT computation has been studied extensively in the literature, though primar-
ily in a distributed setting. Samadi [41] developed one of the first GVT algorithms
and introduced the transient message and simultaneous reporting problems. However,
that algorithm requires acknowledgement messages to be sent, causing extra commu-
nication overhead. Chandy and Lamport [5, 6] described one of the first distributed
snapshot algorithms. Mattern [34] built on top of that to develop an asynchronous
algorithm that does not require acknowledgement messages.
There has also been work to improve the performance of GVT on multiple cores.
The work by Ianni [26] developed a non-blocking algorithm for concurrent computa-
tion of GVT. In [31], the researchers developed an asynchronous algorithm for com-
putation of GVT. In [9], the authors developed a multicore GVT based on Samadi’s
algorithm for a simulator written in the Go language.
There has also been significant work investigating PDES on manycore architec-
tures. The works of [27, 43] investigated the effects of several optimizations to a
multithreaded PDES simulator on smaller-scale platforms such as Intel’s Core-i7 and
AMD’s Magny-cours.
PDES performance on the Tilera processor, whose architecture shares similar-
ities with KNL, was investigated in [28]. Those results show excellent scalability and
demonstrate that the interconnect network can sustain high throughput. However,
that work did not investigate alternate GVT implementations.
Another area of research involves removing boundaries on resource allocation,
in a “share-everything” system [25]. Such a system may allow a synchronous sys-
tem to compete with optimistic methods in unbalanced situations by shifting hard-
ware resources to more highly-loaded LPs. In addition, lock-free or wait-free event
queues [23] may improve performance in situations where remote percentages are
high.
The work of [1] is the follow-up to [2], reporting impressive event processing rates
on the Sequoia BlueGene/Q supercomputer. The recent effort of [7] evaluated PDES
performance on the Knights Corner processor. The main conclusion of [7] is that
Knights Corner does not outperform the host Xeon processor in terms of event rate
unless vector units are fully utilized, and increasing the number of threads does not
alter that trend. The reasons behind such sub-par performance are slower in-order
cores and limited amount of physical memory on the accelerator card.
Several other studies investigated the performance of various parallel applications
on Xeon Phi (Knights Corner) platforms [24, 32, 36, 38, 40, 46]. However, all of these
applications are very different from PDES and in general offer more parallelization
opportunities. Evaluating PDES on KNL provides insight into how similar fine-grain,
communication-dominated applications can be expected to perform on these
platforms.
Chapter 6
CONCLUSIONS AND FUTURE WORK
The GVT computation algorithm is an important component of a parallel discrete
event simulation system, and the choice of GVT algorithm often significantly impacts
the performance of PDES. In this thesis, we performed a systematic comparative
analysis of various GVT algorithms on systems such as a 12-core Xeon processor and
Intel's Knights Landing many-core processor. For balanced models, our results
corroborate the conventional wisdom that asynchronous GVT algorithms offer supe-
rior performance to blocking synchronous GVT. The opposite can be the case for
imbalanced models, where the synchronous nature of GVT limits the disparity of
forward progress among the logical processes. We also performed detailed simulation
profiling to understand the causes of these results.
Our future work will be the extension of this study to clusters of KNL proces-
sors. We aim to scale up to 8 (number of nodes) * 256 (CPUs per node) threads.
We will exploit recent advances in network technologies such as RDMA and In-
finiBand. We also consider developing a hybrid GVT algorithm that can exploit the
advantages of both synchronous and asynchronous approaches. In theory, a GVT
algorithm can adapt itself based on the simulation model and yield the best of the
two worlds. We are considering modifying Mattern's GVT algorithm by imposing
artificial synchronization when the average efficiency is below a certain threshold so
that we can throttle the disparity in imbalanced models.
REFERENCES
[1] Barnes Jr, P. D., Carothers, C. D., Jefferson, D. R., and LaPre, J. M. Warp speed: executing time warp on 1,966,080 cores. In Proceedings of the 2013 ACM SIGSIM conference on Principles of advanced discrete simulation (2013), ACM, pp. 327–336.
[2] Bauer, D., Carothers, C., and Holder, A. Scalable time warp on Blue Gene supercomputer. In Proc. of the ACM/IEEE/SCS Workshop on Principles of Advanced and Distributed Simulation (PADS) (2009).
[3] Carothers, C., Bauer, D., and Pearce, S. ROSS: A high-performance, low memory, modular time warp system. In Proc. of the 11th Workshop on Parallel and Distributed Simulation (PADS) (2000).
[4] Carothers, C. D., Fujimoto, R. M., and England, P. Effect of communication overheads on Time Warp performance: An experimental study. In Proc. of the 8th Workshop on Parallel and Distributed Simulation (PADS 94) (July 1994), Society for Computer Simulation, pp. 118–125.
[5] Chandy, K. M., and Lamport, L. Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems 3, 1 (Feb. 1985), 63–75.
[6] Chandy, K. M., and Misra, J. Asynchronous distributed simulation via a sequence of parallel computations. Communications of the ACM 24, 11 (Apr. 1981), 198–206.
[7] Chen, H., Yao, Y., and Tang, W. Can mic find its place in the world of pdes? In Proceedings of the International Symposium on Distributed Simulation and Real Time Systems (DS-RT) (2015).
[8] Chrysos, G. Intel xeon phi x100 family coprocessor - the architecture. In Intel white paper (2012).
[9] D'Angelo, G., Ferretti, S., and Marzolla, M. Time warp on the go. In Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques (ICST, Brussels, Belgium, 2012), SIMUTOOLS '12, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), pp. 242–248.
[10] Das, S., Fujimoto, R., Panesar, K., Allison, D., and Hybinette, M. GTW: a Time Warp system for shared memory multiprocessors. In Proceedings of the 1994 Winter Simulation Conference (Dec. 1994), J. D. Tew, S. Manivannan, D. A. Sadowski, and A. F. Seila, Eds., pp. 1332–1339.
[11] Eker, A., Williams, B., Mishra, N., Thakur, D., Chiu, K., Ponomarev, D., and Abu-Ghazaleh, N. Performance implications of global virtual time algorithms on a knights landing processor. In 2018 IEEE/ACM 22nd International Symposium on Distributed Simulation and Real Time Applications (DS-RT) (2018), IEEE, pp. 1–10.
[12] Fujimoto, R. Performance measurements of distributed simulation strategies. Tech. Rep. UU–CS–TR–87–026a, University of Utah, Salt Lake City, November 1987.
[13] Fujimoto, R. Parallel discrete event simulation. Communications of the ACM 33, 10 (Oct. 1990), 30–53.
[14] Fujimoto, R. Performance of time warp under synthetic workloads. Proceedings of the SCS Multiconference on Distributed Simulation 22, 1 (Jan. 1990), 23–28.
[15] Fujimoto, R. Parallel and distributed discrete event simulation: Algorithms and applications. In Proc. of the 1993 Winter Simulation Conference (1993), pp. 106–114.
[16] Fujimoto, R., and Panesar, K. Buffer management in shared-memory Time Warp system. In Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS 95) (June 1995), pp. 149–156.
[17] Fujimoto, R. M. Time Warp on a shared memory multiprocessor. Transactions of Society for Computer Simulation (July 1989), 211–239.
[18] Fujimoto, R. M. Parallel discrete event simulation: Will the field survive? ORSA Journal on Computing 5, 3 (June 1993).
[19] Fujimoto, R. M. Parallel and Distributed Simulation Systems. Wiley Interscience, Jan. 2000.
[20] Fujimoto, R. M., and Hybinette, M. Computing global virtual time in shared-memory multiprocessors. ACM Transactions on Modeling and Computer Simulation 7, 4 (1997), 425–446.
[21] Fujimoto, R. M., Tsai, J., and Gopalakrishnan, G. C. Design and evaluation of the rollback chip: Special purpose hardware for Time Warp. IEEE Transactions on Computers 41, 1 (Jan. 1992), 68–82.
[22] Gperftools. Google performance tools.
[23] Gupta, S., and Wilsey, P. A. Lock-free pending event set management in time warp. In ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS) (May 2014).
[24] Heinecke, A., Vaidanathan, K., Smelianskiy, M., Kobutov, A., Dubtsov, R., Henri, G., Shet, A., Chrysos, G., and Dubey, P. Design and implementation of the linpack benchmark for single and multi-node systems based on intel xeon phi coprocessor. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS) (2013).
[25] Ianni, M., Marotta, R., Cingolani, D., Pellegrini, A., and Quaglia, F. The ultimate share-everything pdes system. In 2018 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (May 2018), pp. 73–84.
[26] Ianni, M., Marotta, R., Pellegrini, A., and Quaglia, F. A non-blocking global virtual time algorithm with logarithmic number of memory operations. In 2017 IEEE/ACM 21st International Symposium on Distributed Simulation and Real Time Applications (DS-RT) (Oct 2017), pp. 1–8.
[27] Jagtap, D., Bahulkar, K., Ponomarev, D., and Abu-Ghazaleh, N. Characterizing and understanding pdes behaviour on tilera architecture. In Workshop on Principles of Advanced Discrete Simulation (PADS) (2012).
[28] Jagtap, D., Abu-Ghazaleh, N., and Ponomarev, D. Optimization of parallel discrete event simulator for multi-core systems. In International Parallel and Distributed Processing Symposium (May 2012).
[29] Jefferson, D. Virtual time. ACM Transactions on Programming Languages and Systems 7, 3 (July 1985), 405–425.
[30] Jefferson, D., Beckman, B., Wieland, F., Blume, L., Di Loreto, M., Hontalas, P., Laroche, P., Sturdevant, K., Tupman, J., Warren, V., Wedel, J., Younger, H., and Bellenot, S. Distributed simulation and the Time Warp operating system. In Proceedings of the 12th SIGOPS Symposium on Operating Systems Principles (1987), pp. 77–93.
[31] Lin, Z., and Yao, Y. An asynchronous gvt computing algorithm in neuron time warp-multi thread. In 2015 Winter Simulation Conference (WSC) (Dec 2015), pp. 1115–1126.
[32] Lu, M., Zhang, L., Hyunh, H., Ong, Z., Liang, Y., He, B., Goh, R., and Huynh, R. Optimizing the mapreduce framework on intel xeon phi coprocessor. In Proceedings of International Conference on Big Data (2013).
[33] Mattern, F. Virtual time and global states in distributed systems. In Proceedings of the Workshop on Parallel and Distributed Algorithms (Oct. 1989), pp. 215–226.
[34] Mattern, F. Efficient algorithms for distributed snapshots and global virtual time approximation. Journal of Parallel and Distributed Computing 18, 4 (Aug. 1993), 423–434.
[35] Mattern, F., Mehl, H., Schoone, A. A., and Tel, G. Global virtual time approximation with distributed termination detection algorithms. Tech. Rep. RUU–CS–91–32, Dept. of Computer Science, University of Utrecht, The Netherlands, 1991.
[36] Misra, G., Kurkure, N., Das, A., Valmiki, M., Das, S., and Gupta, A. Evaluation of rodinia codes on intel xeon phi. In Proceedings of the 4th International Conference on Intelligent Systems, Modelling and Simulation (2013).
[37] Pellegrini, A., and Quaglia, F. Wait-free global virtual time computation in shared memory timewarp systems. In Computer Architecture and High Performance Computing (SBAC-PAD), 2014 IEEE 26th International Symposium on (2014), IEEE, pp. 9–16.
[38] Pennycook, S., Hughes, C., Smelianskiy, M., and Jarvis, S. Exploring simd for molecular dynamics using intel xeon processor and intel xeon phi coprocessors. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS) (2013).
[39] Perf. Linux profiling with performance counters.
[40] Ramachandran, A., Vienne, J., Wijmgaart, R., Koesterke, L., and Sharapov, I. Performance evaluation of nas parallel benchmarks on intel xeon phi. In Proceedings of International Conference on Parallel Processing (ICPP) (2013).
[41] Samadi, B. Distributed Simulation, Algorithms and Performance Analysis. PhD thesis, Computer Science Department, University of California, Los Angeles, CA, 1985.
[42] Sodani, A., Gramunt, R., Corbal, J., Kim, H., Vinod, K., Chinthamani, S., Hutsell, S., Agarwal, R., and Liu, Y. Knights landing: Second-generation intel xeon phi product. In IEEE Micro (2016).
[43] Wang, J., Jagtap, D., Abu-Ghazaleh, N., and Ponomarev, D. Parallel discrete event simulation for multi-core systems: Analysis and optimization. IEEE Transactions on Parallel and Distributed Systems 25, 6 (2014), 1574–1584.
[44] Wang, J., Ponomarev, D., and Abu-Ghazaleh, N. Performance analysis of multithreaded pdes simulator on multi-core clusters. In 26th IEEE/ACM/SCS Workshop on Principles of Advanced and Distributed Simulations (PADS) (July 2012).
[45] Williams, B., Ponomarev, D., Abu-Ghazaleh, N., and Wilsey, P. Performance characterization of parallel discrete event simulation on knights landing processor. In Proceedings of the 2017 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS) (2017), ACM, pp. 121–132.
[46] Xie, B., Liu, X., Zhan, J., Jia, Z., Zhu, Y., Wang, L., and Zhang, L. Characterizing data analytics workloads on intel xeon phi. In Workload Characterization (IISWC), 2015 IEEE International Symposium on (2015), IEEE, pp. 114–115.