Project Deliverable D3.3 Resource Usage Modeling...Abstract The Q-ImPrESS project deals with...

Project Deliverable D3.3Resource Usage Modeling

Project name: Q-ImPrESSContract number: FP7-215013Project deliverable: D3.3: Resource Usage ModelingAuthor(s): Vlastimil Babka, Lubomır Bulej, Martin Decky, Johan Kraft,

Peter Libic, Lukas Marek, Cristina Seceleanu, Petr TumaWork package: WP3Work package leader: PMIPlanned delivery date: February 1, 2009Delivery date: February 2, 2009Last change: February 3, 2009Version number: 2.0

Abstract The Q-ImPrESS project deals with modeling of quality attributes in service ori-ented architectures, which generally consist of interacting components that share resources.This report analyzes the degree to which resource sharing ofvarious omnipresent implicitlyshared resources (e.g. memory content caches, memory buses, etc.) affect various quality at-tributes. The main goal is to identify the resources whose sharing affects the quality attributessignificantly, and next propose methods for modeling of these effects.

Keywords resource sharing, quality attributes, modeling

Project Deliverable D3.3: Resource Usage Modeling

Version: 2.0 Last change: February 3, 2009

Revision history

Version Change Date Author(s) Description

1.0 01-10-2008 CUNI, MDU Initial version

2.0 02-02-2009 CUNI Extended version

c© Q-ImPrESS Consortium Dissemination Level: public /210



Contents

1 Introduction 7

2 Experiment Design 92.1 Experiment Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2 Composition Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.3 Quality Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11

3 Shared Resources 123.1 Considered Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.2 Experimental Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2.1 Dell PowerEdge 1955 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.2.2 Dell PowerEdge SC1435 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.2.3 Dell Precision 620 MT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.2.4 Dell Precision 340 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.3 Resource: Example Resource . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.3.1 Platform Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.3.1.1 Platform Intel Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.3.1.2 Platform AMD Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.3.2 Platform Investigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.3.2.1 Experiment: The name of the experiment . . . . . . . . . . . . . . . . . . . . . 16

3.3.3 Composition Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4 Processor Execution Core 184.1 Resource: Register Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184.2 Resource: Branch Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.2.1 Platform Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194.2.1.1 Platform Intel Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194.2.1.2 Platform AMD Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.2.2 Sharing Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194.2.3 Pipelined Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204.2.4 Artificial Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.2.4.1 Experiment: Indirect Branch Misprediction Overhead . . . . . . . . . . . . . . . 204.2.5 Modeling Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5 System Memory Architecture 285.1 Common Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .285.2 Resource: Address Translation Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.2.1 Platform Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335.2.1.1 Platform Intel Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335.2.1.2 Platform AMD Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.2.2 Platform Investigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345.2.2.1 Miss Penalties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355.2.2.2 Experiment: L1 DTLB miss penalty . . . . . . . . . . . . . . . . . . . . . . . . 355.2.2.3 Experiment: DTLB0 miss penalty, Intel Server . . . . . . . . . . . . . . . . . . 37




5.2.2.4 Experiment: L2 DTLB miss penalty, AMD Server . . . . . . . . . . . . . . . . . 385.2.2.5 Experiment: Extra translation caches . . . . . . . . . . . . . . . . . . . . . . . 405.2.2.6 Experiment: L1 ITLB miss penalty . . . . . . . . . . . . . . . . . . . . . . . . . 445.2.2.7 Experiment: L2 ITLB miss penalty, AMD Server . . . . . . . . . . . . . . . . . . 47

5.2.3 Pipelined Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475.2.4 Artificial Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.2.4.1 Experiment: DTLB sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485.2.4.2 Experiment: ITLB sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.2.5 Parallel Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545.2.6 Artificial Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5.2.6.1 Experiment: Translation Buffer Invalidation Overhead . . . . . . . . . . . . . . . 545.2.7 Modeling Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

5.3 Resource: Memory Content Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585.3.1 Platform Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.3.1.1 Platform Intel Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605.3.1.2 Platform AMD Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.3.2 Sharing Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625.3.3 Platform Investigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

5.3.3.1 Experiment: Cache line sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . 635.3.3.2 Experiment: Streamer prefetcher, Intel Server . . . . . . . . . . . . . . . . . . 665.3.3.3 Experiment: Cache set indexing . . . . . . . . . . . . . . . . . . . . . . . . . . 685.3.3.4 Miss Penalties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705.3.3.5 Experiment: L1 cache miss penalty . . . . . . . . . . . . . . . . . . . . . . . . 705.3.3.6 Experiment: L2 cache miss penalty . . . . . . . . . . . . . . . . . . . . . . . . 765.3.3.7 Experiment: L1 and L2 cache random miss penalty, AMD Server . . . . . . . . . 775.3.3.8 Experiment: L2 cache miss penalty dependency on cache line set, Intel Server . 845.3.3.9 Experiment: L3 cache miss penalty, AMD Server . . . . . . . . . . . . . . . . . 86

5.3.4 Pipelined Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 885.3.5 Artificial Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5.3.5.1 Experiment: L1 data cache sharing . . . . . . . . . . . . . . . . . . . . . . . . 895.3.5.2 Experiment: L2 cache sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . 905.3.5.3 Experiment: L3 cache sharing, AMD Server . . . . . . . . . . . . . . . . . . . . 915.3.5.4 Experiment: L1 instruction cache sharing . . . . . . . . . . . . . . . . . . . . . 93

5.3.6 Real Workload Experiments: Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . 945.3.6.1 Experiment: FFT sharing data caches . . . . . . . . . . . . . . . . . . . . . . . 94

5.3.7 Parallel Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1005.3.8 Artificial Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .100

5.3.8.1 Experiment: Shared variable overhead . . . . . . . . . . . . . . . . . . . . . . 1005.3.8.2 Experiment: Cache bandwidth limit . . . . . . . . . . . . . . . . . . . . . . . .1025.3.8.3 Experiment: Cache bandwidth sharing . . . . . . . . . . . . . . . . . . . . . . 1055.3.8.4 Experiment: Shared cache prefetching . . . . . . . . . . . . . . . . . . . . . . 109

5.3.9 Real Workload Experiments: Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . 1165.3.9.1 Experiment: FFT sharing data caches . . . . . . . . . . . . . . . . . . . . . . . 116

5.3.10 Real Workload Experiments: SPEC CPU2006 . . . . . . . . . . . . . . . . . . . . . . . .1215.3.10.1 Experiment: SPEC CPU2006 sharing data caches . . . . . . . . . . . . . . . . 121

5.4 Resource: Memory Buses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1245.4.1 Platform Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .124

5.4.1.1 Platform Intel Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1245.4.1.2 Platform AMD Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .124

5.4.2 Sharing Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1255.4.3 Parallel Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1255.4.4 Artificial Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .125




5.4.4.1 Experiment: Memory bus bandwidth limit . . . . . . . . . . . . . . . . . . . . . 1255.4.4.2 Experiment: Memory bus bandwidth limit . . . . . . . . . . . . . . . . . . . . . 125

6 Operating System 1286.1 Resource: File Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .128

6.1.1 Platform Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1286.1.1.1 Platform RAID Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .128

6.1.2 Sharing Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1296.1.3 General Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1296.1.4 Artificial Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .129

6.1.4.1 Sequential access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1316.1.4.2 Experiment: Concurrent reading of individually written files . . . . . . . . . . . . 1326.1.4.3 Experiment: Individual reading of concurrently written files . . . . . . . . . . . . 1336.1.4.4 Experiment: Concurrent reading of concurrently written files . . . . . . . . . . . 1336.1.4.5 Random access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1366.1.4.6 Experiment: Concurrent random reading of individually written files . . . . . . . 1366.1.4.7 Experiment: Individual random reading of concurrently written files . . . . . . . . 136

7 Virtual Machine 1397.1 Resource: Collected Heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .139

7.1.1 Platform Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1407.1.1.1 Platform Desktop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1407.1.1.2 Platform Intel Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1407.1.1.3 Platform AMD Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .140

7.1.2 Sharing Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1417.1.3 General Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1417.1.4 Artificial Experiments: Overhead Dependencies . . . . . . . . . . . . . . . . . . . . . . . 141

7.1.4.1 Experiment: Object lifetime . . . . . . . . . . . . . . . . . . . . . . . . . . . .1417.1.4.2 Experiment: Heap depth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1467.1.4.3 Experiment: Heap size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1467.1.4.4 Varying Allocation Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1537.1.4.5 Experiment: Allocation speed with object lifetime . . . . . . . . . . . . . . . . . 1537.1.4.6 Experiment: Allocation speed with heap depth . . . . . . . . . . . . . . . . . . 1577.1.4.7 Experiment: Allocation speed with heap size . . . . . . . . . . . . . . . . . . . 1597.1.4.8 Varying Maximum Heap Size . . . . . . . . . . . . . . . . . . . . . . . . . . .1627.1.4.9 Experiment: Maximum heap size with object lifetime . . . . . . . . . . . . . . . 1627.1.4.10 Experiment: Maximum heap size with heap depth . . . . . . . . . . . . . . . . 1697.1.4.11 Experiment: Maximum heap size with heap size . . . . . . . . . . . . . . . . . 1697.1.4.12 Constant Heap Occupation Ratio . . . . . . . . . . . . . . . . . . . . . . . . .1697.1.4.13 Experiment: Constant heap occupation with object lifetime . . . . . . . . . . . . 1787.1.4.14 Experiment: Constant heap occupation with heap depth . . . . . . . . . . . . . 1787.1.4.15 Experiment: Constant heap occupation with heap size . . . . . . . . . . . . . . 178

7.1.5 Artificial Experiments: Workload Compositions . . . . . . . . . . . . . . . . . . . . . . . .1847.1.5.1 Experiment: Allocation speed with composed workload . . . . . . . . . . . . . . 1847.1.5.2 Experiment: Heap size with composed workload . . . . . . . . . . . . . . . . . 188

8 Predicting the Impact of Processor Sharing on Performance 1938.1 Simulation Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1938.2 Simulation Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1948.3 Generated Statistics Report for Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . .196

9 Conclusion 201




Terminology 206

References 208




Chapter 1

Introduction

The Q-ImPrESS project deals with modeling of quality attributes, such as performance and reliability, in serviceoriented architectures. Since the project understands theservice oriented architectures in terms of interactingcomponents that share resources, modeling of quality attributes necessitates modeling of both the components andthe resources.

To achieve reasonable complexity, common approaches to modeling choose to abstract from certain resources,especially resources associated with service platform internals such as the memory caches of a processor or thegarbage collector of a virtual machine. The influence of suchresources on quality attributes, however, tendsto change, bringing some previously secondary resources toprominence – when advances in memory caches orgarbage collectors are behind major performance gains of processors or virtual machines, abstracting away frommemory caches or garbage collectors when modeling performance is hardly prudent.

The role of task T3.3 is to analyze the degree to which resource sharing affects various quality attributes,focusing on resources that are not yet considered in approaches to modeling planned for the Q-ImPrESS project.The task proceeds by first identifying the resources whose sharing affects the attributes, and next developingmethods for adjustment of the prediction models.

Task T3.3 is complemented by task T3.1, which defines the quality attributes and prediction models consideredin the Q-ImPrESS project. Both the attributes and the modelsare described in deliverable D3.1 [33], whichanalyzes the strengths and weaknesses of the individual prediction models with respect to support of the chosenquality attributes.

Task T3.3 is planned both for an early phase of the Q-ImPrESS project, when the initial experiments and initialanalyses are done, and for a late phase of the Q-ImPrESS project, when the validation and evaluation take place.The early work culminates with deliverable D3.3, a report describing the experiments that quantify the impact ofresource sharing on quality attributes and documenting thechoice of resources to model.

The report is structured as follows:

• In Chapter2, the design of the experiments used to assess the impact of resource sharing on quality attributesis outlined. Two major aspects of the design are the choice ofworkloads, which is made with the goalof separating individual resource demand factors, and the composition of workloads, which is made withthe goal of reflecting service composition. Two major scenarios of service composition are defined – thepipelined composition, where components are invoked sequentially, and the parallel composition, wherecomponents are invoked concurrently.

• In Chapter3, the shared resources are introduced. Since the following chapters, which focus on specific re-sources, use a common template, this template is also introduced, giving the basic properties of the platformsused for the resource sharing experiments as the template content.

• Chapters4, 5, 6 and7 give the descriptions and results of the resource sharing experiments for specificresources. For each shared resource, an overview of its principal features is given first, followed by thedetails of its implementation on the experimental platforms. The resource sharing experiments come next,designed to document how sharing occurs in various modes of service composition, what are the typical andmaximum effects of sharing and what is the workload under which such effects are observed.




• In Chapter8, a related tool for predicting impact of processor sharing on response time is presented, toillustrate one of the approaches to modeling planned for theQ-ImPrESS project.

• In Chapter9, an overall conclusion closes the report.

The resource sharing experiments describe results of complex interactions among multiple resources, withthe interactions only partially observable and the resources only partially documented. Besides the interpretationoffered here, the results are therefore open to multiple additional interpretations, including interpretations whereresults are attributed to errors in the experiment. Although utmost care was taken to provide correct analysis, thisdisclaimer should be remembered together with limits that external observation of complex interactions necessarilyhas.




Chapter 2

Experiment Design

The experiments used to analyze the degree to which resourcesharing affects various quality attributes followa straightforward construction. When two workloads are executed first in isolation and then composed togetherover a resource, the difference in the quality attributes observed during the experiment is necessarily due to sharingof the resource. Obviously critical in this construction isthe choice of workloads and the manner of composition.

2.1 Experiment Workloads

The workload choice is driven by two competing motivations.For the experimental results to be practically rel-evant, the workloads should exercise resources in the same patterns as the practical services do. However, forthe experimental results to be analyzable, the workloads should exercise as few resources as possible in as simplepatterns as possible.

Since the Q-ImPrESS project plan counts on modeling the effects of resource sharing, having analyzable resultsis essential. For this reason, the resource sharing experiments rely on artificial workloads, constructed specificallyto exercise a particular resource in a particular pattern. Where exercising a particular resource alone is not possible,as few additional resources as possible are exercised in as simple patterns as possible. The experimental results forthe artificial workloads form the centerpiece of the report.

To make sure that the experimental results do not lose their practical relevance, practical workloads are usedto check whether the effects of resource sharing under practical workloads resemble a combination of the effectsunder artificial workloads. The experimental results for the practical workloads, however, are not consideredessential for the report, because the Q-ImPrESS project plan allocates a separate additional task for validation onpractical workloads.

2.2 Composition Scenarios

The Q-ImPrESS project plan assumes that a service architecture consists of components interacting through con-nectors, where components and connectors can share resources. The workload composition follows two distinctscenarios in which such sharing occurs:

Pipelined composition The pipelined composition scenario covers a situation where the invocations comingfrom the outside pass through the composed components sequentially. The scenario describes commondesign patterns such as nesting of components, where one component invokes another to serve an outsiderequest.

The pipelined composition scenario is distinct in that multiple components access shared resources sequen-tially, one after another. A single component thus executeswith complete control of the shared resources,other components access the resources only when this component suspends its execution. In such a sce-nario, resource sharing impacts the quality attributes mostly due to state change overhead, incurred whenone component stops and another starts accessing a resource. Examples of such overhead are flushing andpopulating of caches when components switch on a virtual memory resource, or positioning of disk headswhen components switch on a file system resource.




Parallel composition The parallel composition scenario covers a situation wherethe invocations coming fromthe outside pass through the composed components concurrently. The scenario describes common designpatterns such as pooling of components, where multiple components are invoked to serve multiple outsiderequests.

In parallel composition, multiple components access shared resources concurrently, all together. No singlecomponent thus executes with complete control of any sharedresource. In such a scenario, resource sharingimpacts the quality attributes mostly due to capacity limitations, encountered when multiple componentsconsume a resource. Examples of such limitations are conflict and capacity evictions in caches when com-ponents share a virtual memory resource, or allocation of disk blocks when components share a file systemresource.

Note that a combination of the two composition scenarios canalso occur in practice. In that case, the informa-tion on how resource sharing influences quality attributes needs to be adjusted accordingly. In this respect, the twocomposition scenarios identified in this report are not meant to be exhaustive, but rather to represent well definedcases of service composition that lead to resource sharing.An example of a particularly frequent combination isrepresenting a service that handles multiple clients concurrently as a parallel composition of services that handle asingle client, each service being a pipelined composition of individual components.

The workload composition in the resource sharing experiments reflects the way in which performance modelsare populated with quality attributes gathered by isolatedbenchmarks. The Q-ImPrESS project plan assumes thatthe service architecture is captured by a service architecture model, which contains quality annotations describingthe quality attributes of the individual components and connectors [32]. To predict the quality attributes of theservice architecture from the quality attributes of the individual components and connectors, prediction modelsare generated from the service architecture model. When solved, the prediction models take the quality attributesof the components and connectors as their input and produce predictions of the quality attributes of the servicearchitecture as their output.

The outlined application of the prediction models requiresknowledge of the quality attributes of the com-ponents and connectors. These quality attributes can be obtained by various approaches, including estimates,monitoring and benchmarking.

• Estimates can serve well at early development stages, when the service implementation does not exist andlittle information is available. Early estimates are relatively cheap to obtain and imprecise.

• Monitoring can serve well at late development stages, when the service implementation exists and the run-ning service can be monitored. Late monitoring is relatively expensive to obtain and precise.

• Benchmarking is of particular importance as a compromise between estimates and monitoring when theservice implementation partially exists. It is more precise than estimates since the quality attributes of theimplemented components and connectors can be measured precisely. It is less expensive than monitoringsince the implemented components and connectors can be measured in isolation.

The use of benchmarking is hindered by the fact that the quality attributes of the components and connectorsare not constant. Instead, they change with the execution context – the same component will perform better withmore available memory or faster processor speed, the same connector will perform better with more networkbandwidth or smaller message sizes. Benchmarking, however, only captures the quality attributes in a singleexecution context, typically isolated or otherwise artificial.

The context in which the components and connectors execute differs between an isolated benchmark and acomposed service. Notably, resources that were only used bya single component in the benchmark can be usedby multiple components in the service. In this, the changes in quality attributes correspond to the changes in theexperimental results between the isolated execution of each workload and the composed execution of multipleworkloads.

When considering combinations of the two composition scenarios, special attention must be paid to resourcesharing due to context switching in multitasking operatingsystems. As in the pipelined scenario, context switchingimplies interleaved execution, but it does not quite match the scenario since the interleaving happens at arbitrary




points in time. As in the parallel scenario, context switching implies concurrent execution, but it does not quitematch the scenario since the concurrency is only coarse grained.

The two composition scenarios, however, can still represent context switching well. The argument is based onthe assumption that the observed effect of resource sharinghas to be large enough and frequent enough if it is toinfluence the prediction precision significantly. Obviously, the observed effect is unlikely to affect the predictionprecision at the component level if the combination of its size and frequency is below the scale of the componentquality attributes.

There are two major categories of context switches in multitasking operating systems. Their properties can becharacterized as follows:

Due to Scheduler Interrupts These context switches happen at regular intervals and involve resources neces-sarily used by all tasks, such as the resources described in Section4. The details of the resource sharingeffects for these resources suggest that context switches due to scheduler interrupts do not occur frequentlyenough to amplify the size of the resource sharing effects significantly.

For example, for the platforms considered in the report, theeffects of sharing a memory cache can onlyamount to some millions of cycles per context switch, and that in the extremely unlikely case of a workloadthat uses the entire cache with no prefetching possible. Context switches due to scheduler interrupts typicallyhappen only once some tens or hundreds of millions of cycles,meaning that the effects of sharing a mem-ory cache can only represent units of percents of the execution time, and that only for extremely unlikelyworkloads.

Due to Resource Blocking These context switches happen at arbitrary times and additionally involve the block-ing resource. The operation times and the resource sharing effects related to the blocking resource tend tobe larger than the resource sharing effects related to the other resources. The blocking resource thereforedictates behavior, pretty much the same as in the parallel scenario, or even in the interleaved scenario.

For example, for the platforms considered in the report, theoperation times and the effects of sharing a filesystem are in the order of milliseconds per operation. As noted above, the effects of sharing a memorycache can only ammount to some millions of cycles and therefore some milliseconds per context switch forextremely unlikely workloads, meaning that the effects of sharing a file system should prevail.

2.3 Quality Attributes

The Q-ImPrESS project plan considers a wide range of qualityattributes and the associated quantifying metrics,described in [33]. The resource sharing experiments, however, can only collect information on those attributes andthose metrics that are directly measurable. Where performance is concerned, these are:

• Responsivity as an attribute that describes temporal behavior of a service from the point of view of a singleclient, quantified as response time and derivative statistics of response time including mean response timeand response time jitter.

• Capacity as an attribute that describes temporal behavior of a service from the point of view of the overallarchitecture, quantified as throughput and utilization.

• Scalability as an attribute that describes changes in responsivity and capacity depending on the scale of theworkload or the scale of the platform.




Chapter 3

Shared Resources

The usual understanding of shared resources is rather broadin scope, requiring an additional selection of sharedresources of concern to the Q-ImPrESS project. The selection helps narrow down the scope of the shared resourcesconsidered, avoiding the danger of delivering shallow results spread over too many types of shared resources.Besides this, the selection also clearly defines the types ofshared resources to consider, making it possible to showthat the delivered results are complete with respect to the selection criteria.

The selection of shared resources singles out resources that are:

• Shared implicitly, by the fact of deploying components together. Resources that are shared explicitly inthe service architecture model are likely to be also modeledexplicitly in the prediction models created bytransformations from the service architecture model, and therefore do not need to be considered by separateresource models.

For example, this criterion would include memory content caches, because components share the caches byvirtue of sharing the processor, rather than by declaring explicit connection to the caches and performingexplicit operations with caches.

• Intended to serve other primary purpose than scheduling. Resources whose primary purpose is schedulingare better modeled in the prediction models than in separateresource models, since their function is pivotalto the functioning of the prediction models.

For example, this criterion would exclude database connection pools, because components use the poolsprimarily to schedule access to database, rather than to perform operations that would have scheduling as itsside effect.

3.1 Considered Resources

The list of resources that match this classification includes:

Processor Execution Core Resources

The processor execution core is likely to exhibit significant resource sharing effects when thread level parallelismis supported directly by the hardware.

System Memory Architecture Resources

Translation buffers are likely to exhibit significant resource sharing effects in services with large address spacecoverage requirements.

Memory caches are likely to exhibit significant resource sharing effects in services with localized memoryaccess patterns.

Memory buses are likely to exhibit significant resource sharing effects in services with randomized memoryaccess patterns and in services with coherency requirements.




Operating System Resources

File system is likely to exhibit significant resource sharing effects inservices with intensive file system access.

Virtual Machine Resources

Collected heap is likely to exhibit significant resource sharing effects inservices with complex data structures.

3.2 Experimental Platforms

This section contains description of the computing platforms that were used for running the resource sharing exper-iments. The range of different computing platforms in use today is extremely large and even minute configurationdetails can influence the experiment results. It is therefore not practical to attempt a comprehensive coverage ofthe computing platforms in the resource sharing experiments. Instead, we have opted for thoroughly documentingseveral common computing platforms that were used for running the resource sharing experiments, so that theapplicability of the results to other computing platforms can be assessed.

In line with the overall orientation of the Q-ImPrESS project, we have selected typical high-end desktop andlow-end server platforms with both Intel and AMD processors:

• We have considered only the internal processor caches, as opposed to the less common external caches.

• We have considered only SMP multiprocessor systems, as opposed to the less common NUMA multipro-cessor systems.

• We have considered only systems with separate processor cores, as opposed to the less common systemswith processor cores shared by multithreading or hyperthreading.

The description of the hardware platforms is derived mostlyfrom vendor documentation. The detailed infor-mation about the processor, such as the cache sizes and associativity, is obtained by the x86info tool [22], whichgathers the information using the CPUID instruction ([3, page 3-180] for Intel-based and [11] for AMD-basedplatforms), and confirmed by our experiments. Other hardware information, such as memory configuration andcontrollers, is obtained by the lshw tool [23], which uses the DMI structures and the device identification informa-tion from the available buses (PCI, SCSI).

3.2.1 Dell PowerEdge 1955

The Dell PowerEdge 1955 machine represents a common server configuration with an Intel processor, and isreferred to as PlatformIntel Server. The platform is used in most processor and memory related experiments,since its processor and memory architecture is representative of contemporary computing platforms.

Processor Dual Quad-Core Intel Xeon CPU E5345 2.33 GHz (Family 6 Model 15 Stepping 11), 32 KB L1caches, 4 MB L2 caches

Memory 8 GB Hynix FBD DDR2-667, synchronous, two-way interleaving, Intel 5000P memory controller

Hard drive 73 GB Fujitsu SAS 2.5 inch 10000RPM, LSI Logic SAS1068 Fusion-MPT controller

Operating system Fedora Linux 8, kernel 2.6.25.4-10.fc8.x8664, gcc-4.1.2-33.x8664, glibc-2.7-2.x8664

Virtual machine Sun Java SE Runtime Environment build 1.6.0-11-b03, Java HotSpot VM build 11.0-b16




3.2.2 Dell PowerEdge SC1435

The Dell PowerEdge SC1435 machine represents a common server configuration with an AMD processor, and isreferred to as PlatformAMD Server. The platform is used in most processor and memory related experiments,since its processor and memory architecture is representative of contemporary computing platforms.

Processor Dual Quad-Core AMD Opteron 2356 2.3 GHz (Family 16 model 2 stepping 3), 64 KB L1 caches,512 KB L2 caches, 2 MB L3 caches

Memory 16 GB DDR2-667 unbuffered, ECC, synchronous, integrated memory controller

Hard drive 146 GB Fujitsu SAS 3.5 inch 15000 RPM, 2 drives in RAID0, LSI Logic SAS1068 Fusion-MPTcontroller

Operating system Fedora Linux 8, kernel 2.6.25.4-10.fc8.x8664, gcc-4.1.2-33.x8664, glibc-2.7-2.x8664

3.2.3 Dell Precision 620 MT

The Dell Precision 620 MT machine represents a disk array server configuration, and is referred to as Plat-form RAID Server. The platform is used in operating system related experiments.

Processor Dual Intel Pentium 3 Xeon CPU 800 MHz (Family 6 Model 8 Stepping 3), 16 KB L1 instructioncache, 16 KB L1 data cache, 256 KB L2 cache

Memory 2 GB RDRAM 400 MHz, Intel 840 memory controller

Hard drive 18 GB Quantum SCSI 3.5 inch 10000 RPM, 4 drives in RAID 5, Adaptec AIC7899P SCSI U160controller

Operating system Fedora Linux 10, kernel 2.6.27.9-159.fc10.i686, gcc-4.3.2-7.i386, glibc-2.9-3.i686

File system Linux ext3 4 KB blocks, metadata journal, directory index

3.2.4 Dell Precision 340

The Dell Precision 340 machine represents a common desktop configuration, and is referred to as PlatformDesktop.The platform is used in virtual machine related experiments, since its relative simplicity facilitates result interpre-tation.

Processor Intel Pentium 4 CPU 2.2 GHz (Family 15 Model 2 Stepping 4), 12 Ko trace cache, 8 KB L1 datacache, 512 KB L2 cache

Memory 512 MB RDRAM 400 MHz, Intel 850E memory controller

Hard drive 250 GB Hitachi PATA 3.5 inch 7200 RPM, Intel 82801BA IDE U100 controller

Operating system Fedora Linux 9, kernel 2.6.25.11-97.fc9.i686, gcc-4.3.0-8.i386, glibc-2.8-8.i686

Virtual machine Sun Java SE Runtime Environment build 1.6.0-06-b02, Java HotSpot VM build 10.0-b22




3.3 Resource: Example Resource

A separate chapter is dedicated to each logical group of resources. Inside each chapter, a separate section isdedicated to each resource. The resource section follows a fixed template:

• An overview of the resource and a detailed information on howthe resource is implemented on the exper-imental platforms. This overview is not intended to serve asa tutorial for the resource, rather, it illustrateswhat principal features of the resource are considered to form a technological basis for the descriptions ofthe individual experiments.

• Descriptions of the individual experiments. Motivations for the individual experiments are provided infloating sections in between the experiments as necessary, introducing experiments that investigate platformdetails and experiments that mimic the pipelined and parallel composition scenarios from Chapter2.

• Notes on modeling the resource. In the Q-ImPrESS project, resource sharing experiments are conductedto develop resource sharing models, it is therefore necessary that initial sketches towards modeling theresources are done even in the experimental task.

This particular resource section serves as an example to introduce the resource section template. Rather thanfocusing on a particular resource, the section focuses on the framework used to perform the experiments on mostresources, providing information about framework overhead inherent to the experiments.

3.3.1 Platform Details

Although the principal features of a resource are usually well known, it turns out that the resource sharing ex-periment results are difficult to analyze with this knowledge alone. This part of the resource template thereforeprovides a detailed description of how the particular resource is implemented on the individual experimental plat-forms, facilitating the analysis.

Typically, the level of detail available in common sources,including vendor documentation, is not sufficientfor a rigorous analysis of the results, capable of distinguishing fundamental effects from accompanying cross talk,constantly present noise, or even potential experiment errors. The information in common sources, includingvendor documentation, often abstracts from details and sometimes provides conflicting or fragmented statements,significant effort was therefore spent documenting the exact source of each resource description statement andverifying or precising each statement with additional experiments, providing a unified resource description.

In this example section, the mechanism used to collect the results on the individual platforms is described.

3.3.1.1 Platform Intel Server

Precise timing

• To collect the timing information, the RDTSC processor instruction is used. The instruction fetches thevalue of the processor clock counter and, on this particularplatform, is stable even in presence of frequencyscaling. With 2.33 GHz clock, a single clock tick corresponds to 0.429ns.

Since the RDTSC processor instruction does not serialize execution, a sequence of XOR EAX, EAX andCPUID is executed before RDTSC to enforce serialization.

• The total duration of the timing collection sequence is 245 cycles. The overhead of the framework whencollecting the timing information is 266 cycles. The overhead is amortized when performing more operationsin a single measured interval.

Performance counters

• To collect additional information, we use the performance event counters. Performance event counters areinternal processor registers that can be configured to countoccurences of performance related events. The




selection of the performance related events depends on the processor, but typically includes events such asinstruction retirement, cache miss or cache hit. The performance events supported by this particular platformare described in [6, Appendix A.3].

Although the number of available performance events is usually very high, the number of performancecounters in a processor is typically low, often only two or four. When the number of events to be counted ishigher than the number of counters, we repeat the experimentmultiple times with different sets of events tobe counted.

To collect the values of the performance counters, we use thePAPI library [25] running on top of perfctr[26]. In this document, we refer to the events by the event names used by the PAPI library, which mostlymatch the event names in [6, Appendix A.3].

• The access to a performance counter takes between 7800 and 8000 cycles. The overhead is not present inthe timing information, since the additional information is collected separately. It is, however, still presentin the workload when the additional information is collected.

3.3.1.2 Platform AMD Server

Precise timing

• To collect the timing information, the RDTSCP processor instruction is used. The instruction fetches thevalue of the processor clock counter and, on this particularplatform, is stable even in presence of frequencyscaling. With 2.3 GHz clock, a single clock tick correspondsto 0.435 ns.

• The total duration of the RDTSCP processor instruction is 75cycles. The overhead of the framework whencollecting the timing information is 80-81 cycles. The overhead is amortized when performing more opera-tions in a single measured interval.

Performance counters

• To collect additional information, we use the processor performance counters. The performance eventssupported by this particular platform are described in [12, Section 3.14].

We use the PAPI library [25] running on top of perfctr [26] to collect the values of the performance counters.In this document, we refer to the events by the event names used by the PAPI library, which are in most casescapitalized event names from [12, Section 3.14].

• The access to a performance counter takes between 6500 and 7500 cycles. The overhead is not present inthe timing information, since the additional information is collected separately. It is, however, still presentin the workload when the additional information is collected.

The rest of the resource section contains experiments grouped by their intent, which is either investigation ofplatform details or assessment of particular resource sharing scenario – pipelined composition, parallel composi-tion, or general composition.

3.3.2 Platform Investigation

The section dedicated to investigation of platform detailspresents experiments that determine various details ofoperation of the particular shared resource on the particular experiment platform.

3.3.2.1 Experiment: The name of the experiment

When introducing an experiment, the code used in the experiments is described first, with code fragments includedas necessary. Descriptions and results of individual experiments follow, using a common template:

Purpose A brief goal of the experiment.




Measured The measured workload. This is the primary code of the experiment, monitored by the framework,which collects the information for the experiment results.

Parameters Parameters used by the measured code. A parameter may be a range of values, which means thatthe experiment is executed multiple times, iterating over the values.

Interference The interfering workload. This is the secondary code of the experiment, designed to compete withthe measured code over the shared resource. Depending on thecomposition scenario, it is executed either insequence or in parallel with the measured workload.

Expected Results Since most experiments are designed to trigger particular effects of resource sharing, wedescribe the expected results of the resource sharing first.This is necessary so that we can compare themeasured results with the expectations and perhaps explainwhy some of the expectations were not met.

Measured Results After the expected results, we describe the measured results and validate them against theexpectations. When the results meet the expectations, the numeric values of the results provide us with quan-tification of resource sharing effects. When the results do not meet the expectations, additional explanationand potentially also additional experiments are provided.

Open Issues When the measured results exhibit effects that would require additional experiments to investigate,but the effects are not essential to the purpose of the report, the effects are listed as open issues.

To illustrate the results, we often provide plots of values such as the duration of the measured operation or thevalue of a performance counter, often plotted as a dependency on one of the experiment parameters. To capturethe statistical variability of the results, we use boxplotsof individual samples, or, where the duration of individualoperations approaches the measurement overhead, boxplotsof averages. The boxplots are scaled to fit the boxeswith the whiskers, but not necessarily to fit all the outliers, which are usually not related to the experiment. Whereboxplots would lead to poorly readable graphs, we use dots connected by lines to plot the averages.

When averages are used in a plot, the legend of the plot informs about the exact calculation of the averagesusing standardized acronyms. TheAvg acronym is used to denote standard mean of the individual observations– for example,1000 Avgindicates that the plotted values are standard means from 1000 operations performedby the experiment. TheTrim acronym is used to denote trimmed mean of the individual observations where 1 %of minimum observations and 1 % of maximum observations was discarded – for example,1000 Trimindicatesthat the plotted values are trimmed means from 1000 operations performed by the experiment. The acronymscan be combined – for example,1000 walks Avg Trimmeans that observations from 1000 walks performed bythe experiment were the input of a standard mean calculation, whose outputs were the input of a trimmed meancalculation, whose output is plotted. In this context, a walk generally denotes multiple operations performed bythe experiment to iterate over a full range of data structures that the experiment uses, such as all cache lines or allmemory pages.

3.3.3 Composition Scenario

Separate sections group experiments that assess a particular resource sharing scenario, which is either the pipelinedcomposition scenario or the parallel composition scenarioas described in Chapter2. A section on general compo-sition groups experiments where the choice of a particular resource sharing scenario does not matter or does notapply.

Two general types of experiments are distinguished, namelythe artificial experiments and the practical experi-ments. The goal of the artificial experiments is to exhibit the largest possible effect of resource sharing, even if theworkload used to achieve the effect is not common. The goal ofthe practical experiments is to exhibit the effect ofresource sharing in a common workload. The artificial experiments thus allow us to determine the upper bounds ofthe impact due to sharing the particular resource, and to decide whether to continue with the practical experiments.

At this stage of the Q-ImPrESS project, we focus on the artificial experiments, although some practical exper-iments are also already presented.




Chapter 4

Processor Execution Core

When considering the shared resources associated with the processor execution core, we assume a commonprocessor that implements speculative superscalar execution with pipelining and reordering on multiple cores.

The essential functions provided by the processor execution core to the components are maintaining registercontent and executing machine instructions with optimizations based on tracking the execution history. Of theseoptimizations, branch prediction is singled out as a function associated with the processor execution core. Otheroptimizations, such as caching and prefetching, are discussed with other parts of the processor, such as the addresstranslation buffers and the memory content caches.

Since multiple components require multiple processor execution cores to access shared resources concurrently,entire processor execution cores will only be subject to resource sharing in the pipelined composition scenario. Asan exception, the processor execution cores of some processors do share most parts of their architecture, except forpartitioning resources to guarantee fair utilization and replicating resources necessary to guarantee core isolation[1, page 2-41]. Such processors, however, are not considered in the Q-ImPrESS project.

4.1 Resource: Register Content

Since machine instructions frequently modify registers, register content is typically assumed lost whenever controlis passed from one component to another. The overhead associated with register content change is therefore alwayspresent simply because the machine code of the components uses calling conventions that assume register contentis lost.

In an environment where calling conventions are subject to optimization after composition, the overhead as-sociated with register content change can be influenced by composition. Such environments, however, are notconsidered in the Q-ImPrESS project.

Effect Summary The overhead of register content change is unlikely to be influenced by component composi-tion, and therefore also unlikely to be visible as an effect of component composition.

4.2 Resource: Branch Predictor

The goal of branch prediction is to allow efficient pipelining in presence of conditional and indirect branches. Whenencountering a conditional or an indirect branch instruction, the processor execution core can either suspend thepipeline until the next instruction address becomes known,or speculate on the next instruction address, filling thepipeline with instructions that might need to be discarded should the speculation prove wrong. Branch predictionincreases the efficiency of pipelining by improving the chance of successful speculation on the next instructionaddress after conditional and indirect branches.

Multiple branch predictor functions are typically presentin a processor execution core, including a conditionalbranch predictor, a branch target buffer, a return stack buffer, an indirect branch predictor.

A conditional branch predictor is responsible for decidingwhether a conditional branch instruction will jumpor not. Common predictors decide based on the execution history, assuming that the branch will behave as it did




earlier. In absence of the execution history, the predictorcan decide based on the direction of the branch. As aspecial case, conditional branches that form loops with constant iteration counts can also be predicted.

A branch target buffer caches the target addresses of recentbranch instructions. The address of the branchinstruction is used as the cache tag, the target address of the branch instruction makes the cache data. The branchtarget buffer can be searched even before the branch instruction is fully decoded, providing further opportunitiesfor increasing the efficiency of pipelining.

A return stack buffer stores the return addresses of nested call instructions.An indirect branch predictor is responsible for deciding where an indirect branch instruction will jump to.

Combining the functions of the conditional branch predictor and the branch target buffer, the indirect branchpredictor uses the execution history to select among the cached target addresses of an indirect branch.



A detailed description of the branch predictor functions onPlatformIntel Servershould not be relied upon in theexperiments, since the branch predictor functions depend on the exact implementation of the processor executioncore [6, Section 18.12.3].

• The processor contains a branch target buffer that caches the target addresses of recent branch instructions.The branch target buffer is indexed by the linear address of the branch instruction [1, Section 2.1.2.1].

• The return stack buffer has 16 entries [1, page 2-6].

• The indirect branch predictor is capable of predicting indirect branches with constant targets and indirectbranches with varying targets based on the execution history [1, page 2-6].


A description of the branch prediction function on PlatformAMD Serveris provided in [13, page 224].

• The processor contains a conditional branch predictor based on a global history bimodal counter table in-dexed by the global conditional branch history and the conditional branch address.

• The branch target buffer has 2048 entries.

• The return stack buffer has 24 entries.

• The indirect branch predictor contains a separate target array to predict indirect branches with multipledynamic targets. The array has 512 entries.

• Mispredicted branches incur a penalty of 10 or more cycles.

4.2.2 Sharing Effects

The effects that can influence the quality attributes when multiple components share a branch predictor include:

Execution history change The processor execution core can only keep a subset of the execution history. As-suming that the predictions based on tracking the executionhistory work best with a specific subset of theexecution history, then the more code is executed, the higher the chance that the predictions will not havethe necessary subset of the execution history available.

However, if the overhead associated with the missed optimizations is to affect the prediction precision at thecomponent level, the combination of its size and frequency should be on the scale of the component qualityattributes. Since the optimizations are performed at the machine instruction level, the overhead associatedwith the missed optimizations is likely to be of similar scale as the machine instructions themselves. Themissed optimizations therefore have to be very frequent to become significant, but that also makes it morelikely that the necessary subset of the execution history will be available and the optimizations will not bemissed in the first place.




4.2.3 Pipelined Composition

In the pipelined composition scenario, the effects of sharing the branch predictor can be exhibited as follows:

• Assume components that execute many different branch instructions in each invocation. A pipelined compo-sition of such components will increase the number of different branch instructions executed each invocation.When the number of branch instructions exceeds the capacityof the branch target buffer, the target addressesof the branch instructions will not be predicted.

With the typical branch target buffer capacity of thousandsof items, the overhead of missing the branchpredictions would only become significant if the branches made up a significant part of a series of thousandsof instructions. This particular workload is therefore notconsidered in further experiments.

• Assume components that contain nested call instructions. Apipelined composition of such components willincrease the nesting. When the nesting exceeds the depth of the return stack buffer, the return addresses ofnested call instructions beyond the depth of the return stack buffer will not be predicted.

With the typical return stack buffer depth of tens of items, the overhead of missing a single branch predictionin a series of tens of nested calls is unlikely to become significant. This particular workload is therefore notconsidered in further experiments.

• Assume a component that invokes virtual functions implemented by other components. The virtual functionsare invoked by indirect branch instructions. A pipelined composition of such components can increase thenumber of targets of each indirect branch instruction, eventually exceeding the ability of the indirect branchpredictor to predict the targets.

4.2.4 Artificial Experiments

4.2.4.1 Experiment: Indirect Branch Misprediction Overhe ad

The experiment to determine the overhead associated with the indirect branch predictor invokes a virtual functionon a varying number of targets. The experiment instantiatesTotalClassesobjects, each a child of the same baseclass and each with a different implementation of an inherited virtual function. An array ofTotalRe f erencesreferences is filled with references to theTotalClassesobjects so that entryi points to objecti modTotalClasses.A single virtual function call iterates over the array of references to invoke the inherited virtual functions of theobjects.

Listing 4.1: Indirect branch prediction experiment.

1 // Object implementations2 class VirtualBase {3 virtual void Invocation() = 0;4 };5

6 class VirtualChildOne : public VirtualBase {7 virtual void Invocation()8 {9 // ...

10 }11 };12

13 VirtualChildOne oVirtualChildOne;14 // ...15

16 VirtualBase * apChildren[ TotalClasses] = { & oVirtualChildOne, ... };17 VirtualBase * apReferences[ TotalReferences];18




19 // Array initialization20 for ( int i = 0; i < TotalReferences; i++) {21 apReferences[ i] = apChildren[ i % TotalClasses];22 }23

24 // Workload generation25 for ( int i = 0; i < TotalReferences; i++) {26 apReferences[ i]-> Invocation();27 }

The real implementation of the experiment uses multiple virtual function invocations in the workload gener-ation to make sure the effect of the indirect branches that implement the invocation outweighs the effect of theconditional branches that implement the loop.

The implementations of the inherited virtual function perform random memory accesses that trigger cachemisses, assuming high cost of their speculative execution.

Purpose Determine the maximum overhead related to the ability of theindirect branch predictor to predict thetarget of a virtual function invocation.

Measured Time to perform a single invocation from Listing4.1 depending on the number of different virtualfunction implementations.

Parameters TotalClasses: 1-4;TotalRe f erences: 12 (chosen as the least common multiple of the actual num-ber of classes).

Expected Results With the growing number of different virtual function implementations invoked by the sameindirect branch instruction, the ability of the indirect branch predictor to predict the targets will be exceeded.

Measured Results On PlatformIntel Server, the difference between an invocation time with one and fourvir-tual function implementations is 112 cycles, see Figure4.1. The indirect branch prediction miss counter onFigure4.2 indicates that the indirect branch predictor predicts 100 %of the branches with a single target,12 % of the branches with two targets, and less than 5 % of the branches with three and four targets. Thereturn branch prediction miss counter on Figure4.3indicates that the miss on the indirect branch instruction,coupled with a speculative execution of the return branch instruction, makes the return stack buffer miss aswell.

The comparison of the cycles spent stalling due to branch prediction miss on Figure4.4and the cycles spentstalling due to loads and stores on Figure4.5 illustrates the high cost of performing the random memoryaccesses speculatively. With four targets, out of the 401 cycles that the invocation takes, 350 cycles are spentstalling due to branch prediction miss.

On PlatformAMD Server, the difference between an invocation time with one and three virtual functionimplementations is 75 cycles, see Figure4.6. The indirect branch prediction miss counter on Figure4.7indicates that the indirect branch predictor predicts 100 %of the branches with a single target, 46 % of thebranches with two targets, and less than 20 % of the branches with three and four targets. The return branchprediction hit counter on Figure4.8indicates that the miss on the indirect branch instruction,coupled with aspeculative execution of the return branch instruction, does not make the return stack buffer miss.

The comparison of the cycles spent stalling due to branch prediction miss on Figure4.9and the cycles spentstalling due to loads and stores on Figure4.10illustrates the high cost of performing the random memoryaccesses speculatively. With three targets, out of the 353 cycles that the invocation takes, 264 cycles arespent stalling due to branch prediction miss.

Effect Summary The overhead can be visible in workloads with virtual functions of size comparable to theaddress prediction miss, however, it is unlikely to be visible with larger virtual functions.




1 2 3 4

280

300

320

340

360

380

400

420

Number of different virtual function implementations

Dur

atio

n of

virt

ual f

unct

ion

invo

catio

n [c

ycle

s −

10

Avg

]

Figure 4.1: Indirect branch misprediction overhead onIntel Server.

1 2 3 4

0.0

0.2

0.4

0.6

0.8

1.0


Cou

nts

of B

R_I

ND

_MIS

SP

_EX

EC

[eve

nts

− 1

00 A

vg]

Figure 4.2: Indirect branch prediction counter per virtual function invocation onIntel Server.




1 2 3 4

0.0

0.2

0.4

0.6

0.8

1.0


Cou

nts

of B

R_R

ET

_MIS

SP

_EX

EC

[eve

nts

− 1

00 A

vg]

Figure 4.3: Return branch prediction counter per virtual function invocation onIntel Server.

1 2 3 4

010

020

030

040

0


Cou

nts

of R

ES

OU

RC

E_S

TA

LLS

.BR

_MIS

S_C

LEA

R [e

vent

s −

100

Avg

]

Figure 4.4: Stalls due to branch prediction miss counter per virtual function invocation onIntel Server.




1 2 3 4

150

200

250

300


Cou

nts

of R

ES

OU

RC

E_S

TA

LLS

.LD

_ST

[eve

nts

− 1

00 A

vg]

Figure 4.5: Stalls due to loads and stores counter per virtual function invocation onIntel Server.

1 2 3 4

280

300

320

340

360


Dur

atio

n of

virt

ual f

unct

ion

invo

catio

n [c

ycle

s −

100

Avg

]

Figure 4.6: Indirect branch misprediction overhead onAMD Server.




1 2 3 4

0.0

0.2

0.4

0.6

0.8

1.0

Number of different virtual function implementationsRE

TIR

ED

_IN

DIR

EC

T_B

RA

NC

HE

S_M

ISP

RE

DIC

TE

D [e

vent

s −

10

Avg

]

Figure 4.7: Indirect branch prediction counter per virtual function invocation onAMD Server.

1 2 3 4

1.0

1.5

2.0

2.5

3.0


Cou

nts

of R

ET

UR

N_S

TA

CK

_HIT

S [e

vent

s −

10

Avg

]

Figure 4.8: Return branch prediction counter per virtual function invocation onAMD Server.




1 2 3 4

050

100

150

200

250

300

Number of different virtual function implementationsCou

nts

of D

ISP

AT

CH

_ST

ALL

_FO

R_B

RA

NC

H_A

BO

RT

[eve

nts

− 1

0 A

vg]

Figure 4.9: Stalls due to branch prediction miss counter per virtual function invocation onAMD Server.

1 2 3 4

5010

015

020

025

030

0


Cou

nts

of D

ISP

AT

CH

_ST

ALL

_FO

R_L

S_F

ULL

[eve

nts

− 1

0 A

vg]

Figure 4.10: Stalls due to loads and stores counter per virtual function invocation onAMD Server.




4.2.5 Modeling Notes

For the branch predictor resource, the only source of overhead investigated by the experiments is the reaction ofthe indirect branch predictor to an increase in the number ofindirect branch targets.

When the number of potential targets of an indirect branch increases, especially from one target to more thanone target, the indirect branch predictor might be more likely to mispredict the target. The misprediction introducestwo sources of overhead, one due to the need to cancel the speculatively executed instructions, and one due to theinterruption in the pipelined execution. Modeling this effect therefore requires modeling the probability of theindirect branch predictor mispredicting the target, and modeling the overhead of canceling speculatively executedinstructions and the overhead of interrupting the pipelined execution.

The existing work on performance evaluation of branch predictors provides contribution in these major topicgroups:

Evaluation of a particular branch predictor is usually done when a new branch predictor is proposed, ob-serving the branch predictor behavior under specific workloads. Because a hardware implementation of thebranch predictor would be expensive, a software simulationis used instead, with varying level of detail andtherefore varying precision.

Examples in this group of related work include [38], which simulates the behavior of several conditionalbranch predictors and indirect branch predictors over traces of the SPEC benchmark, reporting mispredictionrates, and [36], which simulates the behavior of several indirect branch predictors over traces of the SPECbenchmark, reporting misprediction counts.

With respect to the modeling requirements of the Q-ImPrESS project, the evaluations of a particular branchpredictor do not attempt to characterize the workload and donot attempt to model the overhead associatedwith misprediction.

Evaluation of the worst case execution time is done to improve the overestimation rates of the worst caseexecution time on modern processors. The models consider all execution paths in the control flow graph,assuming a general implementation of the branch predictor based on execution history and branch counters.

Examples in this group of related work include [37], where the worst case execution time is estimated bymaximizing the accumulated misprediction overhead over all execution paths in the control flow graph,assuming constant overhead of single misprediction. The assumption of constant overhead is challengedin [35], which extends an earlier approach to modeling certain predictable branches by including varyingmisprediction overhead, but the overhead itself is not enumerated.

Given the requirements of the Q-ImPrESS project and the state of the related work, it is unlikely that thereaction of the indirect branch predictor to an increase in the number of indirect branch targets could be modeledprecisely. It is, however, worth considering whether a potential for incurring a resource sharing overhead could bedetected by identifying cases of composition that increasethe number of indirect branch targets in performancesensitive code.




Chapter 5

System Memory Architecture

When considering the shared resources associated with the system memory architecture, we assume a commonarchitecture with virtual memory and multiple levels of caches, potentially shared between multiple processors,with support for coherency.

5.1 Common Workloads

Because the artificial workloads used in the experiments on the individual memory subsystem resources are prac-tically the same, only with different parameters, we describe them together.

In most of the experiments with data access, the measured workload reads or writes data from or to memoryusing various access patterns. Apart from the access instructions themselves, the code of the workload also containsadditional instructions that determine the access addressand control the access loop. Although it is only thebehavior of the access instructions that is of interest in the experiment, measuring the access instructions alone isnot possible due to measurement overhead. Instead, the entire access loop, containing both the access instructionsand the additional instructions, is measured.

To minimize the distortion of the experiment results, the measured workload should perform as few additionalmemory accesses and additional processor instructions as possible. To achieve this, we create the access patternbefore the measurement and store it in memory as the very datathat the experiment accesses. The access patternforms a chain of pointers and the measured workload uses the pointer that it reads in each access as an address forthe next access. When writing is needed in addition to reading, the measured workload also flips and masks thelowest bit of the pointers. The workload is illustrated in Listing5.1.

Listing 5.1: Pointer walk workload.

1 // Variable start is initialized by an access pattern genera tor2 uintptr_t *ptr = start;3

4 // When measuring the duration of the whole pointer walk,5 // the loop is surrounded by the timing code and the6 // loopCount variable is set to a multiple of the7 // pointer walk length.8 //9 // When measuring the duration of a fixed number of iteration s,

10 // the loop is split in two, with the inner loop surrounded11 // by the timing code and performing the fixed number of12 // iterations.13

14 for ( int i = 0; i < loopCount; i++) {15 if ( writeAccess) {16 uintptr_t value = * ptr;17

18 // Write access flips the least significant bit of the pointe r




19 *ptr = value ˆ 1;20

21 // The least significant bit is masked to get next access addr ess22 ptr = ( uintptr_t) ( value & -2)23 } else {24 // Read access just follows the pointer walk25 ptr = ( uintptr_t *) * ptr;26 }27 }

The pointer walk code from Listing5.1 serves to emphasize access latencies, since each access hasto finishbefore the address of the next access is available. In experiments that need to assess bandwidth rather than latency,the dependency between accesses would limit the maximum speed achieved. Such experiments therefore use avariant of the pointer walk code with multiple pointers illustrated in Listing5.2. The multipointer walk is similar tothe pointer walk, except that in each iteration, it advancesmultiple pointers in independent memory regions insteadof just one pointer in one memory region. The processor can therefore execute multiple concurrent accesses in eachiteration. When enough pointers are used in multipointer walk, the results will be limited by the access bandwidthrather than the access latency since, at any given time, there will be an outstanding access.

Listing 5.2: Multipointer walk workload.

1 uintptr_t ** ptrs = new ( void *)[ numPointers];2

3 for ( int i = 0; i < numPointers; i++) {4 // Variable startAddress is an array variant of start in poin ter walk5 ptrs[ i] = startAddress[ i];6 }7

8 // The same considerations as in pointer walk9 // apply for measuring access duration.

10

11 for ( int i = 0; i < loopCount; i++) {12 for ( int j = 0; j < numPointers ; j++) {13 // Read access just follows the pointer walk14 // Write access is the same as in pointer walk15 ptrs[ j] = ( uintptr_t *) *( ptrs[ j]);16 }17 }

In some experiments, the multipointer walk is used as an interfering workload running in parallel with themeasured workload. When that is the case, the intensity of accesses performed by the interfering workload iscontrolled by inserting a sequence of NOP instructions intoeach iteration, as illustrated in Listing5.3. The lengthof the inserted sequence of NOP instructions is a parameter of the experiment, with an upper limit to preventtrashing the instruction cache. If the number of inserted NOP instructions needs to be higher than this limit,a shorter sequence is executed repeatedly to achieve a reasonably homogeneous workload without trashing theinstruction cache.

Listing 5.3: Multipointer walk workload with delays.

1 // Create the sequence of NOP instructions dynamically2 void (* nopFunc)() = createNopFunc( nopCount, nopLimit);3

4 uintptr_t ** ptrs = new ( void *)[ numPointers];5

6 for ( int i = 0; i < numPointers; i++) {




7 ptrs[ i] = startAddress[ i];8 }9

10 for ( int i = 0; i < loopCount; i++) {11 // Execute the NOP instructions as a delay12 (* nopFunc)();13

14 for ( int j = 0; j < numPointers ; j++) {15 // Read access just follows the pointer walk16 // Write access is the same as in pointer walk17 ptrs[ j] = ( uintptr_t *) *( ptrs[ j]);18 }19 }

To initialize a memory region for the pointer walk, or multiple independent memory regions for the multipointerwalk, we use several access pattern generators. The very basic access pattern is the linear pattern, where the pointerwalk consists of addresses increasing with a constant stride, starting at the beginning of the allocated buffer. Thecode to generate the access pattern is presented in Listing5.4and has the following parameters:

allocSize Amount of memory both allocated and accessed through the pointers.

accessStride The stride between two consecutive pointers.

Listing 5.4: Linear access pattern generator.

1 // To simplify the pointer arithmetics2 accessStride /= sizeof( uintptr_t);3

4 // Create the linear pointer walk5 uintptr_t *start = buffer;6 uintptr_t *ptr = start;7

8 while ( ptr < buffer + allocSize) {9 uintptr_t *next = ptr + accessStride;

10 (* ptr) = next;11 ptr = next;12 }13

14 // Wrap the pointer walk15 (*( ptr - accessStride)) = ( uintptr_t) start;

In experiments that use the linear access pattern, care needs to be taken to avoid misleading interpretationof experiment results. When both the measured workload and the interfering workload use the linear accesspattern, a choice of the buffer addresses can make the workloads exercise different associativity sets, distorting theexperiment results.

For experiments that require uniform distribution of accesses over all cache entries with no hardware prefetch,a random access pattern is used instead of the linear one. Thecode to generate the access pattern is presented inListing 5.5. First, an array of pointers to the buffer is created. Next, the array is shuffled randomly. Finally, thearray is used to create the pointer walk of given length. The parameters of the code follow:

allocSize The range of addresses spanned by the pointer walk.

accessSize The amount of memory accessed by the pointer walk.




accessStride The stride between the pointers beffore shuffling.

accessOffset Offset of pointers within the stride.

Listing 5.5: Random access pattern generator.

1 // All pointers are shifted by requested offset2 buffer += accessOffset;3

4 // Create array of pointers in the allocated buffer5 int numPtrs = allocSize / accessStride;6 uintptr_t ** ptrs = new ( uintptr_t *)[ numPtrs];7 for ( int i = 0; i < numPtrs; i++) {8 ptrs [ i] = buffer + i * accessStride;9 }

10

11 // Randomize the order of the pointers12 random_shuffle( ptrs, ptrs + numPtrs);13

14 // Create the pointer walk from selected pointers15 uintptr_t *start = ptrs[0];16 uintptr_t ** ptr = ( uintptr_t **) start;17 int numAccesses = accessSize / accessStride;18 for ( int i = 1; i < numAccesses; i++) {19 uintptr_t *next = ptrs[ i];20 (* ptr) = next;21 ptr = ( uintptr_t **) next;22 }23

24 // Wrap the pointer walk25 (* ptr) = start;26 delete[] ptrs;

Some experiments need to force set collision by accessing only those addresses that map to a single associativityset of a translation buffer or a memory cache. Although this could be done by using the random access patterngenerator with the stride parameter set to the size of the buffer or cache divided by the associativity, we have decidedto create a a specialized access pattern generator instead.The set collision access pattern generator operates onentire memory pages rather than individual cache lines, andit can randomize the offset of accesses within a pageto make it possible to avoid memory cache misses while still triggering translation buffer misses. The code of theset collision access pattern generator, presented in Listing5.6, accepts these parameters:

allocPages The range of page addresses spanned by the pointer walk.

accessPages The number of pages accessed by the pointer walk.

cacheSets The number of associativity sets to consider.

accessOffset Offset of pointers within the page when not randomized.

accessOffsetRandom Tells whether the offset will be randomized.

Listing 5.6: Set collision access pattern generator.

1 uintptr_t ** ptrs = new ( uintptr_t *)[ allocPages];2




3 // Create array of pointers to the allocated pages4 for ( int i = 0; i < allocPages; i++) {5 ptrs[ i] = ( uintptr_t *) buf + cacheSets * PAGE_SIZE;6 }7

8 // Cache line size is considered in units of pointer size9 int numPageOffsets = PAGE_SIZE / cacheLineSize;

10

11 // Create array of offsets in a page12 offsets = new int[ numPageOffsets];13 for ( int i = 0; i < numPageOffsets; i++) {14 offsets[ i] = i * cacheLineSize;15 }16

17 // Randomize the order of pages and offsets18 random_shuffle( ptrs, ptrs + allocPages);19 random_shuffle( offsets, offsets + numPageOffsets);20

21 // Create the pointer walk from pointers and offsets22 uintptr_t *start = ptrs[0];23 if ( accessOffsetRandom)24 start += offsets[0];25 else26 start += accessOffset;27

28 uintptr_t ** ptr = ( uintptr_t **) start;29 for ( int i = 1; i < accessPages; i++) {30 uintptr_t *next = ptrs[ i];31 if ( accessOffsetRandom)32 next += offsets[ i \% numPageOffsets];33 else34 next += accessOffset;35 (* ptr) = next;36 ptr = ( uintptr_t **) next;37 }38

39 // Wrap the pointer walk40 (* ptr) = start;41 delete[] ptrs;

So far, only experiments with data access were considered. Experiments with instruction access use a similarapproach, except for replacing chains of pointers with chains of jump instructions. A necessary difference fromusing the chains of pointers is that the chains of jump instructions must not wrap, but must contain additionalinstructions that control the access loop. To achieve a reasonably homogeneous workload, the access loop ispartially unrolled, as presented in Listing5.7.

5.2 Resource: Address Translation Buffers

An address translation buffer is a shared resource that provides address translation caching to multiple componentsrunning on the same processor core. The essential function provided by the address translation buffer to thecomponents is caching of virtual-to-physical address mappings.




Listing 5.7: Instruction walk workload.

1 int len = testLoopCount / 16;2

3 while ( len --) {4 // The jump_walk function contains the jump instructions5 jump_walk();6 // The jump_walk function is called 16 times7 }

Whenever a component accesses a virtual address, the processor searches the address translation buffer. If thetranslation of the virtual address is found in the buffer, the corresponding physical address is fetched from thebuffer, allowing the processor to bypass the translation using the paging structures. When the virtual address is notfound in the buffer, a translation using the paging structures is made and the result of the translation is cached inthe address translation buffer.

Both PlatformIntel Serverand PlatformAMD Serverare configured to use 64 bit virtual addresses, but outof the 64 bits, only 48 bits are used [7, Section 2.2]. When only the common page size of 4 KB is considered,the paging structures used to translate virtual addresses to physical addresses have four levels, with every 9 bits ofthe virtual address serving as an index of a table entry containing the physical address of the lower level structure[7, page 10, Figure 1]. From top to bottom, these structures arethe Page Mapping Level 4 (PML4) Table, PageDirectory Pointer (PDP) Table, Page Directory with Page Directory Entries (PDE) and Page Table with Page TableEntries (PTE).



Each processor core is equipped with its own translation lookaside buffer (TLB), which caches a limited numberof most recently used [5, page 10-5] translations referenced by the virtual page number [7, Section 3], allowing itto skip the relatively slow page walks.

TLB is split in two separate parts for instruction fetching and data access.

• The replacement policy of all TLBs behaves as a true LRU for the access patterns of our experiments.

• The instruction TLB (ITLB) is 4-way set associative, has 128entries, and an ITLB miss incurs penalty ofapproximately 18.5 cycles (Experiment5.2.2.6).

• The data TLB (DTLB) has two levels, not exclusive according to Experiment5.2.2.2. Both are 4-way setassociative, where the smaller and faster DTLB0 with 16 entries is used only for load operations [1, page2-13] and the larger and slower DTLB1 with 256 entries is usedfor stores and DTLB0 misses.

• The penalty of a DTLB0 miss which hits in the DTLB1 is 2 cycles as stated in [6, page A-9] and confirmedin Experiment5.2.2.3.

• Translations that miss also in the DTLB1 and hit the PDE cache(see below) incur a 9 cycles penalty (7 cyclesbeyond the DTLB0 miss) in case the page walk step hits the L1 data cache (Experiment5.2.2.2).

In addition to the multiple TLBs, [7, Section 4] states that the processor may use extra caches for the PML4,PDP, and PDE entries to prevent page walks through the whole page structures hierarchy in case of a TLB miss.For example, PDE cache entries are indexed by 27 virtual address bits (47:21) and contain the Page Table address,so the translation has to look only in the Page Table. Similarly, the PDP and PML4 caches are indexed by less bits(18 and 9) and result in more paging structures lookups (2 and3, respectively). Only misses in (or absence of) allthese caches result in all four page walk steps.




Specific details about which of these caches are implementedin this particular processor and their parametersare however implementation dependent and not specified precisely in the vendor documentation. The documen-tation [5, page 10-5] states only that TLBs store ”page-directory andpage-table entries”, which suggests that aPDE cache is implemented. The Intel presentation [9, slide 15] mentions a PDE cache with 32 entries and 4-wayassociativity.

Experiment5.2.2.5indicates that PDE and PDP caches are present, the PDE cache has 64 entries and 4-wayassociativity, and the PDP cache has 4 or less entries. Thereseems to be no PML4 cache – a miss in the PDP cacheneeds two extra page walk steps. A PDE miss adds 4 cycles and a PDP cache miss adds 8 cycles penalty in theideal case – since the paging structures are cached as any other data in the main memory, duration of the addresstranslation depends on whether the entries are present in the data cache. The results of Experiment5.2.2.5indicatethat cache misses caused by page walk steps have the same penalties as the penalties of data access (see5.3) andadd up to other penalties.


Each processor core is equipped with its own TLB, which caches a limited number of most recently used virtualto physical address translations [13, page 225]. The TLB is split in two separate parts for instruction fetching anddata access, both parts have two levels.

• The L1 DTLB has 48 entries for 4 KB pages and is fully associative [13, page 225]. A miss in the L1 that hitsin the L2 DTLB incurs a penalty of 5 cycles, the replacement policy behaves as true LRU for our workload(Experiment5.2.2.2).

• The L2 DTLB has 512 entries with 4-way associativity for 4 KB pages [13, page 226] and seems to be non-exclusive with the L1 DTLB. A miss that needs only one page walk step (PDE cache hit) incurs a penalty of35 cycles (if the page walk step hits in the L2 cache) beyond the L1 DTLB miss penalty (Experiment5.2.2.4).

• The L1 ITLB has 32 entries for 4 KB pages and is fully associative [13, page 225]. A miss in the L1 that hitsin the L2 ITLB incurs a penalty of 4 cycles, the replacement policy behaves as true LRU for our workload(Experiment5.2.2.6).

• The L2 ITLB has 512 entries with 4-way associativity for 4 KB pages [13, page 226] and seems to be non-exclusive with the L1 ITLB. A miss that needs only one page walk step (PDE cache hit) incurs a penalty of40 cycles (if the page walk step hits in the L2 cache) beyond the L1 DTLB miss penalty (Experiment5.2.2.7).

In addition to the two levels of DTLB and ITLB, Experiment5.2.2.5indicates that all the extra translationcaches (PDE, PDP and PML4) are present and used for data access translations, although not mentioned in thevendor documentation. Although we did not determine their sizes and associativity in the experiment, we observedthat additional penalties for misses in these caches are 21 cycles for every extra page walk step. A data accessthat misses all these caches therefore has a penalty of 103 cycles, which is subsequently more than on PlatformIntel Server. The difference is partially caused by the fact that on the AMD processor, the page walk steps godirectly to the L2 cache and not in the L1 data cache first. Thisis a consequence of the paging structures beingaddressed by physical frame numbers and the L1 data cache on this platform being virtually indexed.


We perform the following experiments to verify parameters of the translation caches derived from the documenta-tion or CPUID queries, and experimentally determine parameters that could not be derived.

In particular, we are interested in the following properties:

TLB miss penalty Determines the maximum theoretical slowdown of a single memory access due a TLB shar-ing. It is not always specified.

TLB associativity Determines the number of translations for pages with addresses of a particular fixed stride,that may simultaneously reside in the cache. It is generallywell specified, our experiments that determinethe miss penalty also confirm these specifications.




Extra translation caches The processor may implement extra translation caches that are used in the case of aTLB miss to reduce the number of page walk steps needed. Details about their implementation are generallyunspecified or model-specific. We determine their presence and miss penalties, that add up to the TLB misspenalties.

5.2.2.1 Miss Penalties

To measure the penalty of a TLB miss, we use a memory access pattern that accesses a given number of memorypages using the code in Listing5.1and5.6. The stride is set so that the accesses map to the same TLB entry set inorder to trigger associativity misses. Because we generally access only few pages, we repeat the pointer walk 1000times to amortize the measurement overhead. By varying the number of accessed pages we should observe noTLB misses until we reach the TLB associativity. The exact behavior after the associativity is exceeded dependson the TLB replacement policy. For LRU, our access pattern should trigger TLB miss on each access as soon asthe number of accesses exceeds the number of ways, because the page that is accessed next is the page that has itsTLB entry just evicted. Depending on the LRU or LRU approximation variant being implemented in the particularTLB, our access pattern may or may not exhibit the same behavior.

5.2.2.2 Experiment: L1 DTLB miss penalty

Purpose Determine the penalty of an L1 DTLB miss.

Measured Time to perform a single memory access in set collision pointer walk from Listing5.1and5.6.

Parameters Intel ServerPages allocated: 32; access pattern stride: 64 pages (256 DTLB entries divided by4 ways); pages accessed: 1-32; offset in page: randomized.

AMD ServerPages allocated: 64; access pattern stride: 1 page (full associativity); pages accessed: 1-64;offset in page: randomized.

Expected Results We should observe no L1 DTLB misses until the number of pages reaches the number ofways in the L1 DTLB, or the number of all entries in case of fullassociativity. Then we should observe L1DTLB misses, with exact behavior depending on the replacement policy. The difference between the accessdurations with and without miss is the L1 DTLB miss penalty.

Measured Results The results from PlatformIntel Server(Figure 5.1) show an increase in access durationfrom 3 to 12 cycles at 5 accessed pages, which confirms the 4-way set-associativity of the L1 DTLB. Thereplacement policy behaves as a true LRU for our access pattern.

The event counts of the L0 DTLB miss (DTLBMISSES:L0MISS LD) and L1 DTLB miss (DTLB-MISSES:ANY) events both change from 0 to 1 simultaneously. This indicates that the policy betweenL0 and L1 DTLB is not exclusive, otherwise their 4-way associativity would add up. The penalty of an L1DTLB miss is thus 9 cycles. We will determine the penalty of a purely L0 DTLB miss (L1 DTLB hit) in thenext experiment.

The counts of PAGEWALKS:COUNT (number of page walks executed) increase from 0to 1, confirmingthat a page walk has to be performed to perform the address translation in case of a DTLB miss. ThePAGE WALKS:CYCLES (cycles spend in page walks) event counter shows increase from 0 to 5 cycles,which means that the counter does not capture the whole 9 cycles penalty observed. The L1DALL REF(L1 data cache accesses) event counter shows increase from 1to 2 following the change in DTLB misses.This indicates that (1) page tables are cached in the L1 data cache and (2) a PDE cache is present and theaccesses hit there, thus only the last level page walk step isneeded. Experiment5.2.2.5examines the PDEcache more and determines whether there is also a PDP and other caches present.

The results from PlatformAMD Server(Figure5.3) show change from 3 to 8 cycles at 49 accessed pages,which confirms the full associativity and 48 entries in the L1DTLB. The replacement policy behaves as atrue LRU for our access pattern. The performance counters (Figure5.4) show a change from 0 to 1 in the L1DTLB miss (L1 DTLB MISS AND L2 DTLB HIT:L2 4K TLB HIT) event and that the L2 DTLB miss




1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

46

810

12

Number of accessed pages

Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg]

Figure 5.1: L1 DTLB miss penalty onIntel Server.

0 5 10 15 20 25 30

01

23

45


Num

ber

of e

vent

s pe

r ac

cess

[eve

nts

− 1

000

wal

ks A

vg T

rim]

Event counters

DTLB_MISSES:ANYL1D_ALL_REFDTLB_MISSES:L0_MISS_LDPAGE_WALKS:COUNTPAGE_WALKS:CYCLES

Figure 5.2: Performance event counters related to L1 DTLB misses onIntel Server.




1 4 7 10 14 18 22 26 30 34 38 42 46 50 54 58 62

34

56

78


Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg]

Figure 5.3: L1 DTLB miss penalty onAMD Server.

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0


Num

ber

of e

vent

s pe

r ac

cess

[eve

nts

− 1

000

wal

ks A

vg T

rim]

Event counters

L1_DTLB_MISS_AND_L2_DTLB_HIT:L2_4K_TLB_HITL1_DTLB_AND_L2_DTLB_MISS:4K_TLB_RELOADL1_DTLB_HIT:L1_4K_TLB_HIT

Figure 5.4: Performance event counters related to L1 DTLB misses onAMD Server.

(L1 DTLB AND L2 DTLB MISS:4K TLB RELOAD) event does not occur, which confirms the expecta-tion. The penalty of the L1 DTLB miss which hits in the L2 DTLB is thus 5 cycles. Note that the value ofL1 DTLB HIT:L1 4K TLB HIT event counter being always 1 indicates a possible problem with this eventcounter – either it is supposed to count DTLB accesses instead of hits, or has an implementation error.

5.2.2.3 Experiment: DTLB0 miss penalty, Intel Server

Purpose Determine the penalty of a pure L0 DTLB (DTLB0) miss on Platform Intel Server





0 5 10 15 20 25 30

01

23

45

Number of accessed pagesDur

atio

n of

acc

ess/

coun

ts o

f eve

nts

[cyc

les/

even

ts −

100

0 w

alks

Avg

Trim

]

Cycles / event counters

Clock cyclesDTLB_MISSES:ANYDTLB_MISSES:L0_MISS_LD

Figure 5.5: DTLB0 miss penalty and related performance events onIntel Server.

Parameters Pages allocated: 32; access pattern stride: 4 pages (16 DTLB0 entries divided by 4 ways); pagesaccessed: 1-32; offset in page: randomized.

Expected Results We should observe no DTLB0 misses until the number of pages reaches the DTLB0 asso-ciativity, after which we should start seeing misses, depending on the replacement policy. Because the strideis 4 pages, the accesses should not cause associativity misses in the L1 DTLB which has 64 sets.

Measured Results The results (Figure5.5) show that the access duration increases from 3 to 5 cycles at5accessed pages and the related performance events show thatit is caused by only DTLB0 misses and L1DTLB hits. The penalty of a DTLB0 miss is thus 2 cycles, which confirms the description of the DTLB-MISSES.L0MISS LD event in [6, page A-9].

5.2.2.4 Experiment: L2 DTLB miss penalty, AMD Server

Purpose Determine the penalty of an L2 DTLB miss and the inclusion policy between L1 and L2 DTLB onPlatformAMD Server.


Parameters Pages allocated: 64; access pattern stride: 128 pages (512 L2 DTLB entries divided by 4 ways);pages accessed: 1-64; offset in page: randomized.

Expected Results We should observe no L1 or L2 DTLB misses until the number of pages reaches the numberof entries in the L1 DTLB. Depending on the inclusion policy between L1 and L2 DTLB we should then startobserving L2 DTLB hits first (exclusive policy) or immediately L2 DTLB misses (non-exclusive policy),with exact behavior depending on the L2 DTLB replacement policy. The difference between the accessduration with L2 DTLB hit and L2 DTLB miss is the L2 DTLB miss penalty.

Measured Results The results (Figure5.6) show an increase from 3 to 43 cycles at 49 accessed pages, whichmeans we observe L2 DTLB misses and indicates non-exclusivepolicy. The performance counters (Fig-ure 5.7) show a change from 0 to 1 in the L1DTLB AND L2 DTLB MISS:4K TLB RELOAD event.




1 4 7 10 14 18 22 26 30 34 38 42 46 50 54 58 62

010

2030

40


Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg]

Figure 5.6: L2 DTLB miss penalty onAMD Server.

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0


Num

ber

of e

vent

s pe

r ac

cess

[eve

nts

− 1

000

wal

ks A

vg T

rim]

Event counters

L1_DTLB_MISS_AND_L2_DTLB_HIT:L2_4K_TLB_HITL1_DTLB_AND_L2_DTLB_MISS:4K_TLB_RELOADREQUESTS_TO_L2:TLB_WALK

Figure 5.7: Performance event counters related to L2 DTLB misses onAMD Server.

The L1 DTLB MISS AND L2 DTLB HIT:L2 4K TLB HIT does not occur, which confirms immediateL2 DTLB misses and no hits. The penalty of the L2 DTLB miss is thus 35 cycles beyond the L1 DTLB misspenalty (40 cycles in total).

On this processor, paging structures are cached only in the L2 cache or L3 cache and not in the L1 data cache.The REQUESTSTO L2:TLB WALK event counter shows that each L2 DTLB miss in this experimentresults in one page walk step that accesses the L2 cache. Thismeans that a PDE cache is present, which isfurther examined in Experiment5.2.2.5. Note that the value of the L1DTLB HIT:L1 4K TLB HIT eventcounter is still always 1, even in case of L2 DTLB misses.




5 10 15

1020

3040


Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg T

rim] Access stride [pages]

5121k2k4k8k16k32k64k128k256k

Figure 5.8: Extra translation caches miss penalty onIntel Server.

5.2.2.5 Experiment: Extra translation caches

Purpose Determine the presence and latency of translation caches other than TLB.

Measured Time to perform a single memory access in set collision pointer walk from Listing5.1and5.6withvarious strides.

Parameters Intel ServerPages allocated: 16; access pattern stride: 512 pages-256Kpages (exponential step);pages accessed: 1-16; offset in page: randomized.

AMD ServerPages allocated: 64; access pattern stride: 128 pages-128Mpages (exponential step); pagesaccessed: 32-64; offset in page: randomized.

Expected Results As we increase the access stride on PlatformIntel Serverwe should see the increase in L1cache references caused by page walk steps due to misses in the translation caches. Up to 4 accessed pages,the translation should hit in the DTLB as in the previous experiments.

On PlatformAMD Serverwe use the REQUESTSTO L2:TLB WALK event counter to determine the num-ber of page walk steps. Up to 48 accessed pages, the translation should hit in the L1 DTLB.

The results after exceeding the L1 DTLB capacity depends on parameters of the extra caches, which is notspecified in vendor documentation for both platforms.

Measured Results The results from PlatformIntel Serverare presented in Figure5.8. Figure5.9shows eventcounts for L1 data cache references, which correspond to thepage walk steps being performed.

For the 512 pages stride, the related performance events arepresented in Figure5.10. The access durationchanges from 3 to 12 cycles and the L1DALL REF event counter shows a change from 1 to 2 events at 5accessed pages, which means we hit the PDE cache as in the previous experiment. We see also an increase ofthe duration from 12 to 23 cycles and a change in L1DREPL counter from 0 to 1 events at 9 accessed pages.These L1 data cache misses are not caused by the accessed databut by the page walks – with this stride andalignment we always read the first entry of a page table, thus the same cache set. We see that the penalty ofthis miss is 11 cycles, also reflected in the value of PAGEWALKS:CYCLES counter, which changes from5 to 16. The experiments in the memory caches section show that an L1 data cache miss penalty for dataload on this platform is indeed 11 cycles, which means it is the same as for the miss during the page walkstep and this penalty simply adds up to the DTLB miss penalty.




5 10 15

12

34

5


L1 c

ache

acc

esse

s pe

r pa

ge [e

vent

s −

100

0 w

alks

Avg

Trim

]

Access stride [pages]

5121k2k4k8k16k32k64k128k256k

Figure 5.9: L1 data cache references events related to misses in the extra translation caches onIntel Server.

5 10 15

05

1015


Num

ber

of e

vent

s pe

r ac

cess

[eve

nts

− 1

000

wal

ks A

vg T

rim]

Event counters

L1D_ALL_REFL1D_REPLPAGE_WALKS:CYCLES

Figure 5.10: Performance event counters with access stride of 512 pages on Intel Server.




5 10 15

05

1015

20


Num

ber

of e

vent

s pe

r ac

cess

[eve

nts

− 1

000

wal

ks A

vg T

rim]

Event counters

L1D_ALL_REFL1D_REPLPAGE_WALKS:CYCLES

Figure 5.11: Performance event counters with access stride of 8192 pageson Intel Server.

As we increase the stride, we start to cause conflict misses also in the PDE cache. With the stride of 8192pages (16 PDE entries) and 5 or more accessed pages, the PDE cache is missed on each access. The L1D-ALL REF event counter shows that there are 3 L1 data cache references per access, 2 of them are thereforecaused by page walk steps. This means that a PDP cache is also present. (Figure5.11). The increasein cycles per access (and the PAGEWALKS:CYCLES event) compared to a PDE hit (thus the PDE misslatency) is 4 cycles. If we assume that the PDE cache has 4-wayassociativity, the fact we need a stride of 16PDE entries to reliably miss indicates that it has 16 sets, thus 64 entries in total, which is slightly differentfrom the 32 entries (8 sets of 4 ways) mentioned in the Intel presentation [9, slide 15].

Further increase of stride results in increase of PDP misses. With the 256 K (512×512) pages stride, eachaccess maps to a different PDP entry. We see that at 5 accessedpages, the L1DALL REF event counterincreases to 5 L1 data cache references per access. This indicates that there is no PML4 cache (all four levelsof page tables are walked) and that the PDP cache has only 4 or less entries. Compared to the 8192 pagesstride, the PDP miss adds approximately 19 cycles per access. Out of those cycles, 11 cycles are added byan extra L1 data cache miss, as both PDE and PTE entries miss the L1 data cache due to being mapped tothe same set – the L1DREPL event counter shows 2 cache misses. The remaining 8 cycles is the cost ofwalking two additional levels of page tables due to the PDP miss.

The observed access durations on PlatformAMD Serverare shown in Figure5.12and values of the RE-QUESTSTO L2:TLB WALK event counter in Figure5.13. We can see that for a stride of 128 pages westill hit the PDE cache as in the previous experiment, strides of 512 pages and more need 2 page walk stepsand thus hit the PDP cache, strides of 256 K pages need 3 steps and thus hit the PML4 cache and finallystrides of 128 M pages need all 4 steps. The duration per access increases by 21 cycles for each additionalpage walk step. In the case of 128 M stride we see additional penalty caused by the page walks triggeringL2 cache misses as the L2CACHE MISS:TLB WALK event counter shows (Figure5.14). Determining theassociativity and size of the extra translation caches is not possible from the results of this experiments as thefully associative L1 DTLB hides the behavior for 48 and less accesses and we therefore can not determine ifthe experiment causes associativity misses or capacity misses by 49 and more accesses.




35 40 45 50 55 60 65

050

100

150


Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg T

rim] Access stride [pages]

1282565121k2k4k8k16k32k64k128k256k512k1M2M4M8M16M32M

Figure 5.12: Extra translation caches miss penalty onAMD Server.

35 40 45 50 55 60 65

01

23

4

Number of accessed pagesCou

nts

of R

EQ

UE

ST

S_T

O_L

2.T

LB_W

ALK

[eve

nts

− 1

000

wal

ks A

vg T

rim]



Figure 5.13: Page walk requests to L2 cache onAMD Server.




35 40 45 50 55 60 65

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Number of accessed pagesCou

nts

of L

2_C

AC

HE

_MIS

S.T

LB_W

ALK

[eve

nts

− 1

000

wal

ks A

vg T

rim]



Figure 5.14: L2 cache misses caused by page walks onAMD Server.

5.2.2.6 Experiment: L1 ITLB miss penalty

To measure the penalty of an ITLB miss, we use the same access pattern generator as for the DTLB (Listing5.6),only modified to create chains of jump instructions executedby the code in Listing5.7.

Purpose Determine the penalty of an L1 instruction TLB miss.

Measured Time to execute a jump instruction in jump instruction chainfrom Listing5.7and5.6.

Parameters Intel ServerPages allocated: 32; access pattern stride: 32 pages (128 ITLB entries divided by4 ways); pages accessed: 1-32; offset in page: randomized.

AMD ServerPages allocated: 64; access pattern stride: 1 page (full associativity); pages accessed: 1-64;offset in page: randomized.

Expected Results We should observe no L1 ITLB misses until the number of pages reaches the number ofways in the L1 ITLB (or all entries in case of full associativity). Then we should observe L1 ITLB misses,with exact behavior depending on the replacement policy. The penalty is likely to be similar to the L1 DTLBpenalty.

Measured Results The results from PlatformIntel Server(Figure5.15) show an increase from approximately3.5 cycles per jump instruction to 22 cycles, which is a penalty of 18.5 cycles (note that this could be probablyeven more if we could eliminate the measurement overhead). The related performance event counters areshown in Figure5.16.

The CYCLESL1I MEM STALLED event counters shows that the ITLB misses cause 19 cycles duringwhich instruction fetches are stalled, which therefore should be the penalty with measurement overheadeliminated. The ITLB:MISSES event counter increases from 0to 1, as well as the PAGEWALKS:COUNTand L1DALL REF event counters, which means that only the last level pagetable is accessed (and cachedin the L1 data cache) and a PDE cache is used for instruction fetches as well as for the data accesses. Thenumber of PAGEWALKS:CYCLES increases from 0 to 5 cycles which is the same asin the case of a L1DTLB miss, but in this case the observed penalty is twice the penalty of a L1 DTLB miss. Note that this




1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

510

1520

Number of JMP instructions executed

Dur

atio

n of

JM

P in

stru

ctio

n [c

ycle

s −

102

4 w

alks

Avg

]

Figure 5.15: ITLB miss penalty onIntel Server.

0 5 10 15 20 25 30

05

1015

20


Num

ber

of e

vent

s pe

r JM

P in

stru

ctio

n [e

vent

s −

102

4 w

alks

Avg

Trim

]

Event counters

L1D_ALL_REFPAGE_WALKS:COUNTPAGE_WALKS:CYCLESITLB:MISSESCYCLES_L1I_MEM_STALLED

Figure 5.16: Performance event counters related to ITLB miss onIntel Server.




1 4 7 10 14 18 22 26 30 34 38 42 46 50 54 58 62

23

45

6


Dur

atio

n pe

r JM

P in

stru

ctio

n [c

ycle

s −

102

4 w

alks

Avg

]

Figure 5.17: L1 ITLB miss penalty onAMD Server.

0 10 20 30 40 50 60

0.0

0.5

1.0

1.5

2.0

2.5


Num

ber

of e

vent

s pe

r JM

P in

stru

ctio

n [e

vent

s −

102

4 w

alks

Avg

Trim

]

Event counters

L1_ITLB_MISS_AND_L2_ITLB_HITL1_ITLB_MISS_AND_L2_ITLB_MISS:4K_PAGE_FETCHESINSTRUCTION_CACHE_FETCHES

Figure 5.18: Performance event counters related to L1 ITLB miss onAMD Server.

should not be caused by the L1 instruction cache misses nor branch prediction misses – their event countersare close to zero.

The results from PlatformAMD Server(Figure5.17) show an increase from 2 to 6 cycles per jump instruc-tion at 32 instructions, which confirms the full associativity and 32 entries in the L1 ITLB. The L1ITLB -MISS AND L2 ITLB HIT event counter (Figure5.18increases from 0 to 1 and the L1ITLB MISS AND -L2 ITLB MISS:4K PAGE FETCHES event counter stays zero. The penalty of an L1 ITLB miss that hits inthe L2 ITLB is therefore 4 cycles.




1 4 7 10 14 18 22 26 30 34 38 42 46 50 54 58 62

010

2030

4050


Dur

atio

n pe

r JM

P in

stru

ctio

n [c

ycle

s −

102

4 w

alks

Avg

Trim

]

Figure 5.19: L2 ITLB miss penalty onAMD Server.

5.2.2.7 Experiment: L2 ITLB miss penalty, AMD Server

Purpose Determine the penalty of an L2 ITLB miss on PlatformAMD Server.


Parameters Pages allocated: 64; access pattern stride: 128 pages (512 L2 ITLB entries divided by 4 ways);pages accessed: 1-64; offset in page: randomized.

Expected Results We should observe no ITLB misses until the number of jump instructions reaches the numberof entries in the L1 ITLB. Depending on the inclusion policy between L1 and L2 ITLB we should then startobserving L2 ITLB hits first (exclusive policy) or immediately L2 ITLB misses (non-exclusive policy) withexact behavior depending on the replacement policy. The difference between the jump duration with L2ITLB hit and L2 ITLB miss is the L2 ITLB miss penalty.

Measured Results The results (Figure5.19) show a change from 2 to 46 cycles at 32 accessed pages, which isthe same amount of accesses than for the L1 ITLB and thus indicates non-exclusive policy. The L1 ITLBmiss (L1 ITLB MISS AND L2 ITLB MISS:4K PAGE FETCHES) and ITLBRELOADS events show achange from 0 to 1 events per access, which confirms there are L2 ITLB misses and no hits (Figure5.20).The penalty of the L2 ITLB miss is thus 40 cycles beyond the L1 ITLB miss penalty (44 cycles in total),which is similar to the L2 DTLB miss penalty in Experiment5.2.2.4. The REQUESTSTO L2:TLB WALKevent counter shows that each L2 ITLB causes only one page walk step, which indicates a PDE cache beingused also for instruction fetches.


Translation lookaside buffers are shared in pipelined composition. This can have performance impact as evictionsfrom a shared buffer make components compete for the buffer capacity:

• Compared to the isolated scenario, accesses to memory can trigger more misses in the translation lookasidebuffer. An access that triggers a miss in the translation lookaside buffer requires an additional addresstranslation, which entails searching the paging structurecaches and traversing the paging structures.

Code most sensitive to these effects is:




0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0


Num

ber

of e

vent

s pe

r JM

P in

stru

ctio

n [e

vent

s −

102

4 w

alks

Avg

Trim

]

Event counters

L1_ITLB_MISS_AND_L2_ITLB_MISS:4K_PAGE_FETCHESITLB_RELOADSREQUESTS_TO_L2:TLB_WALKL2_CACHE_MISS:TLB_WALK

Figure 5.20: Performance event counters related to L1 ITLB misses onAMD Server.

• Access to memory where addresses belonging to the same page are accessed only once, and where theaccessed addresses would fit into the translation lookasidebuffer (or a TLB set in case they have a badstride, and the other pipelined code too) in the isolated scenario. Note that prefetching should not mask TLBmisses (at least on Intel Core where prefetching works only within a page).


The following experiments resemble the scenario with the greatest expected sharing overhead of the TLB in thepipelined scenario, where executions of individual components are interleaved and thus a component may evictTLB entries of a different component during its execution.

5.2.4.1 Experiment: DTLB sharing

For the DTLB, the measured workload accesses data in a memorybuffer so that each access uses an address indifferent memory page, i.e. different TLB entry. This is done by executing the set collision pointer walk code inListing 5.1and5.6with stride of 1 page so the accesses map to all entries in the TLB and not just to entries of oneparticular associativity set. The number of pages to accessequals to the number of TLB entries, and offsets of theaccesses in pages are randomly spread over cache lines to prevent associativity misses in the L1 data cache.

The interfering workload is the same as the measured workload, but accessing a different memory buffer, sothat the its address translations evict the TLB entries occupied by the measured workload. The number of accessedpages varies from 0 to the number of TLB entries, which influences the amount of TLB misses in the measuredcode from 0 % to 100 %

Purpose Determine the impact of DTLB sharing on the most sensitive code.


Parameters Intel ServerPages allocated: 256; access pattern stride: 1 page; pages accessed: 256 (all L1 DTLBentries); offset in page: randomized.

AMD ServerPages allocated: 512; access pattern stride: 1 page; pages accessed: 512 (all L2 DTLB entries);offset in page: randomized.




0 16 32 48 64 80 96 112 136 160 184 208 232 256

510

1520

Number of pages accessed by the interfering workload

Dur

atio

n of

acc

ess

[cyc

les

− 5

12 A

vg]

Figure 5.21: DTLB sharing with pipelined composition of most sensitive code onIntel Server.

Interference The same as the measured workload.Intel ServerPages allocated: 256; access pattern stride:1 page; pages accessed: 0-256 (8 pages step); offset in page:randomized.

AMD ServerPages allocated: 512; access pattern stride: 1 page; pages accessed: 0-512 (16 pages step);offset in page: randomized.

Expected Results The measured workload should fit in the last level DTLB (L1 onIntel Serverand L2 onAMD Server) and therefore hit on each access when executed with no interference. The interfering workloadshould increasingly evict the DTLB entries occupied by the measured workload, until the whole DTLB isevicted and the measured workload should miss the DTLB on each access.

On PlatformIntel Server, 256 accesses of the measured workload will also occupy 16 KBof the L1 datacache. The interfering workload will also occupy 16 KB with 256 accesses. Because DTLB misses causeadditional accesses to the L1 data cache for the page walks and the L1 data cache is 32 KB large, we shouldalso see L1 data cache misses as a direct consequence of the DTLB sharing.

On PlatformIntel Server, 512 accesses of the measured workload will also occupy 32 KBof the L1 datacache. The interfering workload will also occupy 32 KB with 512 accesses. Because page walks on DTLBmisses access only the L2 cache and not the L1 data cache, we should see no extra L1 data cache capacitymisses on this platform.

Measured Results The results from PlatformIntel Server(Figure5.21) show an increase from 6 to 18 cyclesdue to the L1 DTLB sharing. With 256 L1 DTLB entries, the totaloverhead is approximately 3000 cycles.The event counters (Figure5.22) confirm that the number of DTLBMISSES:ANY events increases up to1 per access due to the interfering workload. The number of L1D REPL events increases to 0.9 L1 cachemisses per access due to the page walks exceeding the L1 cachesize which is already fully occupied by theworkload accesses.

The results from PlatformAMD Server(Figure5.23) show an increase from 11 to 42 cycles due to the L2DTLB sharing. With 512 L2 DTLB entries, the total overhead isapproximately 16000 cycles. The eventcounters (Figure5.23-papi) show that number of the L1DTLB AND L2 DTLB MISS.4K TLB RELOADevent increases due to the interfering workload, but does not reach 1 L2 DTLB miss per access. This couldmean that the replacement policy of the L2 DTLB is not true LRU. Note that the DATACACHE MISSESevent counter shows L1 data cache misses that are non-zero even with no interference and increase with theinterference. This is because the access pattern initialization (Listing5.6) randomizes access offsets to cache




0 50 100 150 200 250

01

23

45


Eve

nt c

ount

s [e

vent

s −

512

Avg

Trim

]

Event counters

DTLB_MISSES:ANYL1D_ALL_REFL1D_REPLL2_LINES_IN:SELFDTLB_MISSES:L0_MISS_LDPAGE_WALKS:COUNTPAGE_WALKS:CYCLES

Figure 5.22: DTLB sharing with pipelined composition of most sensitive code onIntel Server– performance events.

0 32 64 96 128 176 224 272 320 368 416 464 512

1020

3040


Dur

atio

n of

acc

ess

[cyc

les

− 5

12 A

vg]

Figure 5.23: DTLB sharing with pipelined composition of most sensitive code onAMD Server.

lines uniformly in individual pages, however the L1 cache size of this platform divided by associativity is32 KB (8 pages), which means some cache sets are getting more accesses than others.

Note that the values of the performance counters (on both platforms) are affected by measurement overhead,which means the observed values might be higher. However theoverhead here cannot be reduced by repeatedexecution of the experiment inside one event counters collection, otherwise events of for both measured andinterfered code would be counted together.

Effect Summary The overhead can be visible in workloads with very poor locality of data references to virtualpages that fit in the address translation buffer when executed alone. Depending on the range of accessedaddresses, the workload can also cause additional cache misses when traversing the address translationstructures. The translation buffer miss can be repeated only as many times as there are the address translation




0 100 200 300 400 500

0.2

0.4

0.6

0.8


Eve

nt c

ount

s [e

vent

s −

512

Avg

Trim

]

Event counters

DATA_CACHE_MISSESL1_DTLB_MISS_AND_L2_DTLB_HIT:L2_4K_TLB_HITL1_DTLB_AND_L2_DTLB_MISS:4K_TLB_RELOAD

Figure 5.24: DTLB sharing with pipelined composition of most sensitive code onAMD Server– performance events.

buffer entries, the overhead will therefore only be significant in workloads where the number of data accessesper invocation is comparable to the size of the buffer.

5.2.4.2 Experiment: ITLB sharing

The ITLB sharing experiment is similar to the DTLB one exceptfor executing chains of jump instruction (List-ing 5.6) as both measured and interfering workload.

Purpose Determine the impact of ITLB sharing on the most sensitive code.


Parameters Intel ServerPages allocated: 128; access pattern stride: 1 page; pages accessed: 128 (all L1 ITLBentries); offset in page: randomized.

AMD ServerPages allocated: 512; access pattern stride: 1 page; pages accessed: 512 (all L2 ITLB entries);offset in page: randomized.

Interference The same as the measured workload.Intel ServerPages allocated: 128; access pattern stride:1 page; pages accessed: 0-128 (4 pages step); offset in page:randomized.

AMD ServerPages allocated: 512; access pattern stride: 1 page; pages accessed: 0-512 (16 pages step);offset in page: randomized.

Expected Results The measured workload should fit in the ITLB (L2 ITLB onAMD Server) and thereforehit on each access when executed with no interference. The interfering workload should increasingly evictthe ITLB entries occupied by the measured workload, until the whole ITLB is evicted and the measuredworkload should miss the ITLB on each jump instruction.

On PlatformIntel Server, 128 accesses of the measured workload will also occupy 8 KB of the L1 instructioncache. The interfering workload will also occupy 8 KB with 128 accesses. As the L1 instruction cache is32 KB large, there should be no misses. Also the amount of memory accessed by page walks due to ITLBmisses should fit in the L1 data cache and cause no misses.




0 8 16 24 32 40 48 56 64 72 80 88 96 104 116 128

510

1520

2530

Number of JMP instructions executed by the interfering workload

Dur

atio

n pe

r JM

P in

stru

ctio

n [c

ycle

s −

128

Avg

]

Figure 5.25: ITLB sharing with pipelined composition of most sensitive code onIntel Server.

0 20 40 60 80 100 120

01

23

45


Cou

nts

of e

vent

s pe

r JM

P in

stru

ctio

n [e

vent

s −

128

Avg

Trim

]

Event counters

L1D_ALL_REFL1D_REPLPAGE_WALKS:COUNTPAGE_WALKS:CYCLESL1I_MISSESITLB:MISSES

Figure 5.26: ITLB sharing with pipelined composition of most sensitive code onIntel Server– performance events.

On PlatformIntel Server, 512 accesses of the measured workload will also occupy 32 KBof the L1 instruc-tion cache. The interfering workload will also occupy 32 KB with 512 accesses. The L1 instruction cache is64 KB large, we should therefore see no L1 capacity misses on this platform.

Measured Results The results from PlatformIntel Server(Figure5.25) show an increase from 6 to 28 cy-cles due to the ITLB sharing. With 128 ITLB entries, the totaloverhead is approximately 2800 cycles.The event counters (Figure5.26) confirm that counts of the DTLBMISSES event as well as the PAGE-WALKS:COUNT event increase up to 1 event per access due to theinterfering workload.

The results from PlatformAMD Server (Figure5.27) show an increase from 14 to 59 cycles due to theL2 ITLB sharing. With 512 L2 ITLB entries, the total overheadis approximately 23000 cycles. The event




0 32 64 96 128 176 224 272 320 368 416 464 512

1020

3040

5060


Dur

atio

n pe

r JM

P in

stru

ctio

n [c

ycle

s −

128

Avg

]

Figure 5.27: L2 ITLB sharing with pipelined composition of most sensitive code onAMD Server.

0 100 200 300 400 500

0.2

0.4

0.6

0.8

1.0

1.2

1.4


Cou

nts

of e

vent

s pe

r JM

P in

stru

ctio

n [e

vent

s −

128

Avg

Trim

]

Event counters

L1_ITLB_MISS_AND_L2_ITLB_HITL1_ITLB_MISS_AND_L2_ITLB_MISS:4K_PAGE_FETCHESINSTRUCTION_CACHE_FETCHESINSTRUCTION_CACHE_MISSES

Figure 5.28: L2 ITLB sharing with pipelined composition of most sensitive code onAMD Server– performance events.

counters (Figure5.28) show an increase of the L1ITLB MISS AND L2 ITLB MISS:4K PAGE FETCHESevents, it does not however reach 1 miss per access. This could mean that the replacement policy of theL2 ITLB is not true LRU. The number of the INSTRUCTIONCACHE MISSES events is not zero, andincreases with interference for similar reason as in the previous DTLB experiment.

Note that the as in the previous experiment, the performancecounters on both platforms may be showingsomewhat higher values due to measurement overhead which here cannot be reduced by repeated execu-tion.

Effect Summary The overhead can be visible in workloads with very poor locality of instruction references tovirtual pages that fit in the address translation buffer whenexecuted alone. The translation buffer miss can




be repeated only as many times as there are the address translation buffer entries, the overhead will thereforeonly be significant in workloads where the number of instruction accesses per invocation is comparable tothe size of the buffer.

5.2.5 Parallel Composition

Translation lookaside buffers are never shared, however, accessing the same virtual addresses from multiple pro-cessors causes translation entries to be replicated. In theparallel composition scenario, the effects of replicatingthe translation entries can be exhibited as follows:

• Assume components that change the mapping between virtual and physical addresses. A change of themapping causes the corresponding translation entries to beinvalidated, penalizing all other components thatrely on them.

The effect of invalidating a translation entry can be further amplified due to limited selectivity of the invali-dation operation. Rather than invalidating the particulartranslation entry, all translation entries in the sameaddress space can be impacted. Similarly, rather than delivering the invalidation request to the processorsholding the particular translation entry, all processors mapping the same address space can be impacted.


A component can change the mapping between virtual and physical addresses by invoking an operating systemfunction that modifies its address space. Among such functions that invalidate translation entries on the experimen-tal platforms aremprotect, mremapandmunmap. The experiment to determine the overhead associated with theaddress space modification uses a component that keeps modifying the address space using themmapandmunmapfunctions from Listing5.8 as the interfering workload, and a component that performs the random pointer walkfrom Listing5.1and5.5as the measured workload.

Listing 5.8: Address space modification.

1 // Workload generation2 while ( true) {3 // Map a page4 void *pPage = mmap( NULL, 4096,5 PROT_READ | PROT_WRITE,6 MAP_ANONYMOUS | MAP_PRIVATE,7 -1, 0);8

9 // Access the page to create the translation entry10 *(( char *) pPage) = 0;11

12 // Unmap the page13 munmap( pPage, 4096);14 }

Due to limited selectivity, the interfering workload invalidates the entire address translation buffer on all pro-cessors that map its address space. The invalidation involves sending an interrupt to all the processors that map theaddress space. The processors invalidating their entire address translation buffer on receiving the interrupt. Thisbehavior is specific to the operating system of the experimental platforms [28].

5.2.6.1 Experiment: Translation Buffer Invalidation Over head

Purpose Determine the maximum overhead associated with intensive address space modifications that triggertranslation buffer invalidations.




0 1 2 3 4 5 6 7

1520

2530

35

Number of interfering processors

Dur

atio

n of

sin

gle

acce

ss [c

ycle

s −

100

0 A

vg]

Figure 5.29: Translation buffer invalidation overhead onIntel Server.

Measured Time to perform a single memory access in random pointer walkfrom Listing5.1and5.5.

Parameters Intel ServerAllocated: 512 KB; accessed: 512 KB; stride: 4 KB.

AMD ServerAllocated: 256 KB; accessed: 256 KB; stride: 4 KB.

Interference Address space modification from Listing5.8. Processors: 0-7.

Expected Results The measured workload is configured to access different pages in an address range thatfits into the cache, each access therefore uses a different translation entry. Invalidating the entire addresstranslation buffer should impose the data TLB miss penalty on the entire cycle of accesses by the measuredworkload.

Measured Results On PlatformIntel Server, the time to perform a single memory access in random pointerwalk is shown on Figure5.29. The figure indicates that running the interfering workloadon one processorextends the average time for a single access in the measured workload from 16 cycles to 28 cycles. Whenrunning the interfering workload on more than one processor, the average time for a single access settlesbetween 18 and 20 cycles. The values of the data TLB miss counter on PlatformIntel Serveron Figure5.33show a maximum miss rate of 33 %, observed when the interfering workload executes on one processor.When the interfering workload executes on more than one processor, the miss rate drops below 5 %.

On PlatformAMD Server, the time to perform a single memory access in random pointerwalk is shownon Figure5.31. The figure indicates that running the interfering workloadon one processor extends theaverage time for a single access in the measured workload from 18 cycles to 48 cycles. When running theinterfering workload on more than one processor, the average time for a single access settles between 20and 21 cycles. The values of the data TLB miss counter on Platform AMD Serveron Figure5.33show amaximum miss rate of 29 %, observed when the interfering workload executes on one processor. When theinterfering workload executes on more than one processor, the miss rate drops below 3 %.

The reason for the drop in the miss rate when running the interfering workload on more than one processoris the synchronization in the operating system. Figure5.33shows the duration of the syscalls on PlatformIntel Serverincreasing from an average of 12100 cycles when running on one processor to 99500 cycleswhen running on two processors. Figure5.34shows the duration of the syscalls on PlatformAMD Serverincreasing from an average of 10700 cycles when running on one processor to 118700 cycles when runningon two processors. With more processors, the duration of thesyscalls grows almost linearly.




0 1 2 3 4 5 6 7

0.0

0.1

0.2

0.3

0.4


Cou

nts

of D

TLB

_MIS

SE

S.A

NY

[eve

nts

− 1

0000

Avg

]

Figure 5.30: Address translation miss counter per access onIntel Server.

0 1 2 3 4 5 6 7

2030

4050

60


Dur

atio

n of

sin

gle

acce

ss [c

ycle

s −

100

0 A

vg]

Figure 5.31: Translation buffer invalidation overhead onAMD Server.

On PlatformIntel Server, Experiment5.2.2.2estimates the penalty for data TLB miss to 9 cycles, with a missrate of 33 % this would make the penalty 3 cycles on average. OnPlatformAMD Server, Experiment5.2.2.4estimates the penalty for data TLB miss to 40 cycles, with a miss rate of 29 % this would make the penalty12 cycles on average. The increase in the average time for a single access from 16 cycles to 28 cycles onPlatformIntel Serverand from 18 to 48 cycles on PlatformAMD Serveris therefore not entirely due to dataTLB misses. Part of the increase is due to handling the interrupt used to invalidate the address translationbuffer.

Effect Summary The overhead can be visible in workloads with very poor locality of data references to virtualpages that fit in the address translation buffer, when combined with workloads that frequently modify itsaddress space.




0 1 2 3 4 5 6 7

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35


Cou

nts

of L

1_D

TLB

_AN

D_L

2_D

TLB

_MIS

S.A

LL [e

vent

s −

100

00 A

vg]

Figure 5.32: Address translation miss counter per access onAMD Server.

1 2 3 4 5 6 7 8

0e+

002e

+05

4e+

056e

+05

Number of processors executing syscalls

Dur

atio

n of

sin

gle

sysc

all s

eque

nce

[cyc

les]

Figure 5.33: Duration of interfering workload cycle onIntel Server.




1 2 3 4 5 6 7 8

0e+

001e

+05

2e+

053e

+05

4e+

055e

+05

Number of processors executing syscalls

Dur

atio

n of

sin

gle

sysc

all s

eque

nce

[cyc

les]

Figure 5.34: Duration of interfering workload cycle onAMD Server.

5.2.7 Modeling Notes

For the address translation buffer resource, two sources ofoverhead were investigated by the experiments:

Overhead due to competition for capacity in the pipelined composition scenario. When the number of dif-ferent pages accessed by the workload increases, the likelihood of the addresses missing in the translationcaches also increases. The miss introduces an overhead due to the need to perform a page walk, and, possiblyalso an overhead due to the page walk missing in the memory content caches. Modeling this effect thereforerequires modeling the number of different pages accessed bythe workload, and modeling the overhead ofthe page walk.

Overhead due to loss of cached content in the parallel composition scenario. When the content of the trans-lation caches is lost, the addresses miss in the translationcaches. Both the associated overhead and themodeling requirements are similar to the previous effect.

The existing work on performance evaluation of address translation buffers agrees that certain workloads ex-hibit overhead due to sharing of the translation caches [39, 41]. The overhead of misses in the translation caches isgenerally considered similar enough to the overhead of misses in the memory caches to warrant modeling the twocaches together [40].

Given the requirements of the Q-ImPrESS project and the state of the related work, it is likely that the overheadof sharing the address translation buffers can be modeled ina manner similar to the overhead of sharing the memorycontent caches. It is also worth considering whether a potential for incurring a resource sharing overhead due toloss of cached content could be detected by identifying components that change the mapping between virtual andphysical addresses and therefore trigger the loss of cachedcontent.

5.3 Resource: Memory Content Caches

Memory content cache is a shared resource that provides fastaccess to a subset of data or instructions in the mainmemory for all components running on the same processor coreor the same processor.

Whenever a component reads data from memory, the processor searches the memory caches for the data. If thedata is present, it is fetched from the cache rather than directly from memory, speeding up the read. The cache is




said to hit. If the data is not present, it is fetched directlyfrom memory and copied into the cache. The cache issaid to miss.

What happens when a component writes data to memory depends on whether the cache is a write-back cacheor a write-through cache. With a write-back cache, the processor stores the data into the cache, speeding up thewrite. With a write-through cache, the processor stores thedata directly in memory.

Since instructions and data tend to occupy unrelated addresses, separate caches can be used for instructionsand data.

An access to a cache is different from an access to memory in the way data is looked up. A memory entryholds only data – there are as many memory entries as there aredifferent addresses and given an address, the entryholding the data at that address is selected directly. A cache entry holds both data and address – there are fewercache entries than there are different addresses and given an address, the entry holding the data at that address isfound by searching multiple entries.

Since circuits that search for an address are more complex and therefore slower than circuits that select basedon an address, a compromise is between the two is often implemented in the form of a limited associativity cache.A limited associativity cache is organized in sets of entries. Given an address, the limited associativity cache firstselects a single set that can hold the data at that address, and then searches only this set for the data at that address.The cache is said to have as many ways as there are entries in a set, a fully associative cache can be viewed ashaving an equal number of sets and entries.

Both virtual and physical addresses can be used when searching for data. Since the virtual address is availablesooner than the physical address, faster caches are more likely to be indexed using virtual addresses, while slowercaches are more likely to be indexed using physical addresses.

As a general rule, larger caches tend to be slower and smallercaches tend to be faster. The conflict betweensize and speed leads to the construction of memory architectures with multiple levels of caches. The cache levelsare searched from the smallest and fastest to the largest andslowest, the higher level cache only accessed when thelower level cache misses, and the memory only accessed when the last level cache misses. The cache levels arenumbered in the search order.

The separation into caches for instructions and data, as well as cache sharing, may vary with the cache level.Typically, there is a separate L1 instruction cache and L1 data cache for each processor core. In contrast, the L2cache tends to be unified, storing both instructions and data, and shared by multiple processor cores. The samegoes for the L3 cache, if present.

With multiple levels of caches, an inclusion policy defines how multiple levels share data. In strictly inclusivecaches, the data present in a lower level must always be present in the higher level. In strictly exclusive caches,the data present in one level must never be present in anotherlevel. A mostly inclusive cache stores the data in alllevels that have missed when handling a miss, but the data canlater be evicted from higher levels while staying inlower levels.

Rather than operating with individual bytes, caches handledata in fixed size blocks, called cache lines – atypical cache line is 64 bytes long and aligned at 64 bytes boundary. Since a cache line cannot be filled partially,a write of less than an entire cache line requires a read of thecache line. Only entire cache lines are transferredbetween cache levels and between cache and memory.

The bus transaction that transfers a cache line from memory takes several bus cycles, each cycle transferringpart of the cache line. Rather than waiting for the entire cache line to be transferred, the accessed data is deliveredas soon as it arrives, with the rest of the cache line filled afterwards. With a memory subsystem implementation thattransfers the cache line linearly, this would mean that the access latency depends on the position of the accesseddata within the cache line. To avoid the dependency, memory subsystem implementations can transfer the cacheline starting with the accessed data, employing what is called a critical word first policy.

Coherency between multiple caches is enforced using various coherency protocols. A common choice is theMESI protocol, in which each cache tracks the state of each cache line by snooping the activity of the other caches.A cache that reads a line is notified by the other caches whether it is the only cache holding the particular line. Acache that writes a line notifies the other caches that it mustbe the only cache holding the particular line. A cachethat holds a modified line flushes the line on access from othercaches.






Cache details.The memory caches are organized in a hierarchy of two levels,with dedicated L1 instruction anddata caches in each processor core, and two unified L2 caches,each shared by a pair of processor cores.

• Both L1 caches are 32 KB large and 8-way associative.

• Both L1 caches are virtually indexed and physically tagged [1, page 8-30].

• The caches are write-back and non-inclusive [1, page 2-13] and also not exclusive (Experiment5.3.3.6).

• The cache line size is 64 bytes for all caches [5, page 10-5].

• A cache miss always causes the entire line to be transfered [5, page 10-5]. Under certain circumstances (seeHardware prefetching), two cache lines can be transfered atonce. The critical word first protocol [8, page16] is used (Experiment5.3.3.5).

• The measured L1 data cache latency is 3 cycles and penalty fora miss is 11 cycles (Experiment5.3.3.5),which confirms the L1 latency of 3 cycles and L2 latency of 14 cycles in [1, page 2-19].

• The measured L1 instruction cache penalty is approximately30 cycles including the penalty of branchmisprediction (Experiment5.3.3.5).

• The L2 cache is 4 MB large with 16-way associativity. It is physically indexed and tagged [1, page 8-30].

• In experiments5.3.3.6and5.3.3.8we observed a L2 cache miss penalty of 256-286 cycles beyond the L1data cache miss penalty, including penalty of a DTLB miss andone L1 cache miss during page walk. Thepenalty differs with the cache line set where misses occur.

• The penalties of TLB misses and cache misses simply add up according to experiments5.3.3.5and5.3.3.6.

An important observation related to the exact parameters ofthe cache hierarchy is that a collision due to limitedassociativity does not depend on virtual addresses but onlyon the addresses of the physical pages allocated by theoperating system. The possibility of aliasing with 256 KB strides, mentioned in [1, page 3-61], therefore onlyapplies to operating systems that support deterministic page allocation techniques such as page coloring [15],which is currently not the case of the experimental platforms.

Cache coherency.To maintain cache coherency, the processor employs the MESIprotocol [5, Section 10-4].An important feature of the MESI protocol is that transfer ofmodified lines between caches only happens throughmain memory. In particular, modified lines from L1 caches aretransferred through main memory even betweenprocessors that share an L2 cache. Accessing modified lines from L1 caches of other cores is therefore similar inperformance to accessing main memory [1, page 8-21].

Hardware prefetching.The mechanism which tries to detect data access patterns andprefetch them from mainmemory to either L1 or L2 cache automatically is described in[1, pages 2-15, 3-73 and 7-3].

There are two L1 prefetchers.

• The DCU prefetcher (also known as streaming prefetcher) detects ascending data access and fetches thefollowing cache line.

• The IP-based strided prefetcher detects regular forward and backwards strides (up to 2 KB) of individualload instructions by their IP.

• The prefetches are not performed under certain conditions,including when many other load misses are inprogress.

There are also two L2 prefetchers.




• The Streamer prefetcher causes an L2 miss to fetch not only the line that missed, but the whole 128-bytealigned block of two lines.

• The DPL (Data Prefetch Logic) prefetcher tracks regular patterns of requests that are coming from the L1data cache.

• It supports 12 ascending and 4 descending streams (entries of different cores are handled separately) andstrides exceeding the cache line size, but within 4 KB memorypages.

• The prefetching can get up to 8 lines ahead, depending on the available memory bus bandwidth.

• The prefetches are not guaranteed to be performed if the memory bus is very busy. Results of experiment5.3.8.4indicate that prefetches are discarded also when the L2 cache itself is busy.


Cache details.The memory caches are organized in a hierarchy of three levels, with dedicated L1 instruction anddata caches, unified L2 caches, and an unified L3 cache.

• The cache line size is 64 bytes for all of the caches [13, page 189].

• Both L1 caches are 64 KB large, 2-way associative with LRU replacement policy [13, page 223] and virtuallyindexed (Experiment5.3.3.3).

• The L1 data cache has a 3 cycles latency and an L1 miss that hitsin the L2 cache incurs an 9 cycles penaltyaccording to the vendor documentation [13, page 223]. In our experiments5.3.3.5and5.3.3.7we howeverobserved 12 cycles penalty when missing in random cache lineset and 27-40 cycles penalty when whenfrequent misses occur in a single cache line set.

• The L1 instruction cache miss incurs a penalty of 20 cycles when missing in random cache line set, in-cluding partial penalty of branch misprediction and L1 ITLBmiss (Experiment5.3.3.7). Repeated missesin a single cache line set incur 25 cycles penalty each, including partial penalty of branch misprediction(Experiment5.3.3.5).

• The L2 and L3 caches are physically indexed according to Experiment5.3.3.3.

• The L2 unified cache is 512 KB large with 16-way associativityaccording to CPUID [12, page 291] and isan exclusive victim cache, i.e. stores only cache lines evicted from the L1 caches [13, page 223].

• For misses in random L2 cache line sets, we observed 32 cyclespenalty (including a penalty of 0.7 L1 DTLBmisses) beyond the L1 miss penalty, with additional up to 3 cycles depending on the access offset in a cacheline (Experiment5.3.3.7).

• The observed penalty when repeatedly missing in a single L2 cache line set is 16-63 cycles beyond the L1miss penalty (Experiment5.3.3.6).

• The size of unified L3 cache is 2 MB, 32-way associative [12, page 291] and is a non-inclusive victim cache,i.e. stores cache lines evicted from the L2 caches. It is however not always exclusive – on hits, a copy ofthe data can be kept in the L3 cache if it is likely to be requested also by other cores [13, page 223] (with nofurther details on how this is determined).

• For misses in random L3 cache line sets, we observed 208 cycles penalty (including a penalty of L2 DTLBmiss) beyond the L2 miss penalty (Experiment5.3.3.9).

• The observed penalty when repeatedly missing in a single L3 cache line set is 159-211 cycles beyond the L2miss penalty (Experiment5.3.3.9).

Cache coherency.To maintain cache coherency, the processor employs the MOESI protocol [10, Section 7.3].The important difference between the MOESI protocol and theMESI protocol is that transfer of modified lines




between caches happens directly rather than through main memory. In particular, modified lines are transferreddirectly even between processors that do not share a package, using a direct processor interconnect bus.

Hardware prefetching.

• The L1 instruction cache miss triggers prefetch of the next sequential line along with the requested line [13,page 223].

• The L1 data cache has a unit-stride prefetcher, triggered bytwo consecutive L1 cache line misses, initiallyprefetching fixed number of lines ahead. Further accesses with the same stride increase the number of linesthe prefetcher gets ahead [13, page 100].


The effects that can influence the quality attributes when multiple components share a memory cache include:

• Components compete for the cache capacity. This is evidenced both in increased overhead, when data thatwould normally be cached for a component are evicted due to activity of another, and decreased overhead,when data that would normally be flushed by a component are flushed due to activity of another.

Competition for cache capacity can be emphasized when associativity sets are not used evenly by the com-ponents. Since the choice of associativity sets is often made incidentally during data allocation, the effect ofcompetition for cache capacity can change each data allocation.

• Components compete for the cache bandwidth. This is evidenced in increased overhead in workloads thatdo not compensate the sharing effects by parallel execution, and in workloads that employ prefetching.

Even when memory caches are not shared by multiple processors, workload influence can still occur due tocache coherency protocols. Accessing the same data from multiple processors causes data that would normallyreside in exclusively owned cache lines to reside in shared cache lines. This can have performance impact:

• Writes to a shared cache line are slower than writes to what would otherwise be an exclusively owned cacheline. The writer has to announce the write on the memory bus and thus invalidate the other copies of theshared cache line first.

• Subsequent reads from an invalidated cache line are slower than reads from what would otherwise be anexclusively owned cache line.

• On some architectures, a read from a cache line invalidated by a remote write also causes the remote modi-fications to be flushed to memory. This can cause apparent performance changes on the remote node sinceflushing would otherwise be done synchronously with other operations of the remote node.

• On some architectures, a read from a cache line invalidated by a remote write fetches data from the remotenode rather than from the memory bus.


Experiments in this section are performed to determine or validate our understanding of both quantitative aspectsof memory caches and various details affecting their operation, in order to determine how sharing might affect thecaches and what performance impact we can expect.

In particular, we are interested in the following properties:

Cache line size Determines what amount of data needs to be transferred as a result of a single cache miss. Wealso need to know it to set the right access stride in the laterexperiments. Although the line size is specifiedin the vendor documentation, the adjacent line prefetch feature may result in different perceived cache linesize, as our experiment shows.




Cache set indexing Caches line sets may be indexed using either virtual or physical addresses. This knowledgeis needed for experiments based on repeated accesses in the same cache line set. It is not always specified inthe vendor documentation.

Cache miss penalty Determines the maximum theoretical slowdown of a single memory access due to cachesharing. It is not specified for all caches in the vendor documentation, in some cases it is specified using acombination of processor and memory bus cycles. The penaltymay also depend on the offset of the accessin a cache line, if critical word first protocol is not used, which is also not definitely specified.

Cache associativity Determines the number of cache lines with addresses of a particular fixed stride, that maysimultaneously reside in the cache. It is generally well specified, our experiments that determine the misspenalty also confirm these specifications.

Inclusion policy Determines whether the cache sizes effectively add up, as well as their associativity. It is notalways specified in the vendor documentation.

5.3.3.1 Experiment: Cache line sizes

The first experiment determines the cache line size of all available caches. It does so by an interleaved executionof a measured workload that randomly accesses half of the cache lines with an interfering workload that randomlyaccesses all cache lines, using code from Listing5.1and5.5for data caches and5.7and5.5for instruction caches.The measured workload uses the lowest possible access stride, which is 8 B for 64 bit aligned data reads and 16 Bto fit the jump instruction opcode. The interfering workloadvaries its access stride. When the stride exceeds thecache line size, the interference should stop accessing some cache lines, which should be observed as a decreasein the measured workload duration, compared to the situation when the interfering workload accesses all cachelines.

Purpose Determine or confirm the cache line sizes of all memory cachesin the processor.

Measured Time to perform a single memory access in random pointer walkfrom Listing 5.1 and5.5 for thedata caches and unified caches. Time to execute a jump instruction in random jump instruction chain fromListing 5.7and5.5for the L1 instruction cache.

Parameters Intel ServerAllocated: 16 KB, 2 MB; accessed: 16 KB (code and data), 2 MB (data only); stride:8 B data, 16 B code.

AMD ServerAllocated: 32 KB, 320 KB, 1600 KB; accessed: 32 KB (code and data), 320 KB, 1600 KB (dataonly); stride: 8 B data, 16 B code.

Interference The same as the measured workload.Intel ServerAllocated: 32 KB, 4 MB; accessed: 32 KB,4 MB; stride: 8 B (16 B for code) to 512 B (exponential step).

AMD ServerAllocated: 64 KB, 576 KB, 2624 KB; accessed: 64 KB, 576 KB, 2624 KB; stride: 8 B (16 B forcode) to 512 B (exponential step).

Expected Results The memory range sizes (16 KB and 2 MB on PlatformIntel Server, 32 KB, 576 KB and2624KB on PlatformAMD Server) for the measured pointer walk are set so that they fit in a halfof the L1,L2 and L3 (on PlatformAMD Server) cache, respectively. The range sizes for the interfering code are setso that they evict the whole L1, L2 and L3 cache (although not entirely due to associativity). Note that onthe PlatformAMD Serverwe add up cache sizes of all lower levels to the cache size of a given level due tothe exclusive policy. The interfering code should therefore evict the data accessed by the measured code aslong as its access stride is lower or equal to the cache line size and the measured code should be affected thesame. When the access stride of the interfering pointer walkexceeds the cache line size, some lines will notbe evicted and we should see an increased performance of the measured code.

Measured Results The results with 16 KB measured / 32 KB interfering memory ranges accessed by the datavariant on PlatformIntel Server(Figure5.35) show a decrease in memory access duration in the measured




8 16 32 64 128 256

5.0

5.5

6.0

6.5

Interfering workload access stride [bytes]

Dur

atio

n of

acc

ess

[clo

cks

− 2

048

Avg

]

Figure 5.35: The effect of interfering workload access stride on the L1 data cache eviction onIntel Server.

10 20 50 100 200

0.04

0.06

0.08

0.10

0.12


Eve

nt c

ount

s [e

vent

s −

204

8 A

vg T

rim]

Event counters

L1D_REPLL2_LD:SELF:MESI

Figure 5.36: The effect of interfering workload access stride on the L1 data cache eviction onIntel Server– relatedperformance events.

code when the access stride of the interfering code is 128 B ormore. The L1 miss and L2 hit event counters(Figure5.36) show a similar decrease, the number of L2 misses and all prefetch counters are always zero.We can conclude that the line size of the L1 data cache is 64 B, as the vendor documentation states [1, page2-13].

The results of the instruction cache variant (Figure5.37) show a decrease of the L1 instruction cache misseswith 128 B and higher stride of the interfering code and thus also confirm the 64 B cache line size.

The results with 2 MB / 4 MB memory ranges on PlatformIntel Server(Figure5.38) indicate that the cacheline size of the L2 cache is 128 B, which should not be the case according the vendor documentation. Thereason for the results is the Streamer prefetcher [1, page 3-73] which causes the interfering code to fetch twolines to the L2 cache in case of a miss, even though the second line is not being accessed. This therefore




16 32 64 128 256

0.05

0.10

0.15

0.20

0.25

0.30

0.35


Cou

nts

of L

1I_M

ISS

ES

[eve

nts

− 1

024

Avg

]

Figure 5.37: The effect of interfering workload access stride on the L1 instruction cache eviction onIntel Server.

8 16 32 64 128 256

2830

3234

3638

4042


Dur

atio

n of

acc

ess

[cyc

les

− 2

56K

Avg

]

Figure 5.38: The effect of interfering workload access stride on the L2 cache eviction onIntel Server.

causes two cache lines occupied by the measured code to be evicted, which is the same effect as an interfer-ence with 64 B stride. The L2LINES IN:PREFETCH event counter values obtained during the interferingcode execution rather than the measured code execution (Figure 5.39) also confirm that L2 cache missestriggered by prefetches occur.

The results from PlatformAMD Serverfor all cache levels and types show a decrease in memory accessduration in the measured code when access stride of the interfering code is 128 B or more. The changes inaccess duration and performance counters are similar to those in Figure5.35from PlatformIntel Server. Wecan conclude that the line size is 64 B for all cache levels, asthe vendor documentation states [13, page 189],and that there is no adjacent line prefetch as on PlatformIntel Server.




8 16 32 64 128 256

0.1

0.2

0.3

0.4

0.5


Cou

nts

of L

2_LI

NE

S_I

N:P

RE

FE

TC

H [e

vent

s −

256

K A

vg]

Figure 5.39: Streamer prefetches triggered by the interfering workloadduring the L2 cache eviction onIntel Server.

5.3.3.2 Experiment: Streamer prefetcher, Intel Server

To examine the behavior of the Streamer prefetcher and to verify that the Streamer prefetcher could indeed accountfor the results of Experiment5.3.3.1, we have performed a slightly modified experiment to determine which twolines are fetched together.

Purpose Determine whether the L2 cache Streamer prefetcher onIntel Serverfetches always the cache line thatforms a 128 B aligned pair with the requested line.


Parameters Allocated: 4 MB; accessed: 4 MB; stride: 256 B; offset: 0 B

Interference Random pointer walk: Allocated: 4 MB; accessed: 4 MB; stride: 256 B; offset: 0, 64, 128, 192 B

Expected Results Depending on the offset that the interfering workload uses,its accesses either map to thesame associativity set as the measured workload, or not. Theoffset of 0 B should always evict lines accessedby the measured code, the offset of 128 B should always avoid these lines. If the Streamer prefetcher alwaysfetches lines a 128 B aligned pair of cache lines, using the 64B offset should also evict lines of the measuredcode, and the 192 B offset should avoid them.

Measured Results The results (Figure5.40and5.41) show that with 128 B and 192 B offsets the interferingcode does not evict lines of the measured code while with 0 B and 64 B offsets it does. This indicates that theStreamer prefetch does always fetch 128 B aligned pair cachelines. Note that the relatively small differencebetween 0 B and 64 B offsets does not contradict this. Settinga 64 B the offset to the measured code (insteadof 0 B) just exchanges the results of 0 B and 64 B interfering code offsets, not affecting the 128 B and 192 Boffsets. The difference can be instead explained by the factthat the Streamer prefetch is triggered onlyby a L2 miss and because the interfering code does not always miss with these parameters, there are lessaccesses (and thus less evictions) in the pair line than to the requested line. When the memory accessed bythe interfering workload is increased to 8 MB and its number of L2 misses approaches 1 miss per access, thedifference between 0 B and 64 B offset diminishes.

The presence of the Streamer prefetcher onIntel Serverposes a problem for selecting the right access stridefor the random pointer walk (Listing5.5).




0 64 128 192

100

150

200

Interfering workload offset [bytes]

Dur

atio

n of

acc

ess

[cyc

les

− 1

6K A

vg]

Figure 5.40: The effect of access offset on the L2 streamer prefetch,Intel Server.

0 64 128 192

0.2

0.4

0.6

0.8

Interfering workload offset [bytes]

Cou

nts

of L

2_LI

NE

S_I

N:S

ELF

[eve

nts

− 1

6K A

vg]

Figure 5.41: The effect of access offset on the L2 streamer prefetch - L2 cache misses,Intel Server.

Using a stride of 64 B could cause additional lines to be fetched to the cache in case the parameters are set sothat only a subset of the allocated memory is accessed. It is however no issue when the parameters are set to accessall cache lines in the allocated memory. Using a 128 B stride would cause only even lines to be fetched to the L1cache, which does not use this kind of prefetcher. It also does not guarantee the pair line to be fetched to the L2 incase the line being accessed does not miss, or the prefetch isdiscarded.

A universal solution would be to use a stride of 128 B to work with the whole pair, but access both lines inthe pair to ensure that both are fetched in both cases. This would however trigger the stride-based prefetcher andaccess additional cache lines.

We will therefore use the 64 B access stride in most of the experiments, keeping in mind the extra accesses thismight cause.




5.3.3.3 Experiment: Cache set indexing

The following experiments test whether the caches are virtually or physically indexed. We need to know thisinformation for the later experiments that determine cachemiss penalties by triggering cache misses in a singlecache line set by accesses with a particular stride. On physically indexed caches however the cache line set isdetermined by a physical frame number rather than a virtual address. This is a problem on our experimentalplatforms where the operating system does not assign physical frames in a deterministic or directly controllableway.

To work around this limitation, we have developed a special memory allocation function based on page coloring[15], which assigns virtual colors to both virtual pages and physical frames and ensures that each virtual page ismapped to a physical frame with the same color. The color is determined by the least significant bits in the virtualpage or physical frame number, the number of colors is selected so that cache lines in pages with the same colormap to the same cache line sets in the particular cache. For example, the L2 cache on PlatformIntel Serveris 4 MBlarge with 16-way associativity, which yields a stride of 256 KB needed to be mapped to the same cache line set[1, page 3-61]. With 4 KB page size, this yields 64 different colors, i.e. the 6 least significant bits in the page orframe number determine the color.

Although the operating system on our experimental platforms does not support page coloring or other determin-istic physical frame allocation, recent Linux kernel versions provide a way for the executed program to determineits current mapping by reading the special/proc/self/pagemapfile. Our page color aware allocator thus uses thisinformation together with themremapfunction to (1) allocate a continuous virtual memory area, (2) determine itsmapping and (3) remap the allocated pages one by one to a different virtual memory area with the target virtualaddresses having the same color as the determined physical frame numbers. This way the allocator construct acontinuous virtual memory area with virtual pages having the same color as the page frames being mapped to.Note that (1) this is possible thanks to themremapkeeping the same physical frame and (2) the memory area weallocate in the first step provides enough physical frames ofeach color. In our experiments, allocating twice theneeded memory in the first step showed to be sufficient.

This allocator is used in the following experiment to determine which caches are virtually or physically indexed,and in all experiments that rely on stride of accesses to a physically indexed cache. Note that we do not have toperform this experiment for the L1 caches on PlatformIntel Server– the 32 KB size and 8-way associativity meanthat all pages map to the same cache line sets.

Purpose Determine whether the caches are virtually or physically indexed.

Measured Time to perform a single memory access in set collision pointer walk from Listing5.1 and5.6 forthe data caches and unified caches. Time to execute a jump instruction in set collision jump instruction chainfrom Listing5.7and5.5for the L1 instruction cache.

The buffer for the experiment using either standard allocator or page coloring allocator. The number ofallocated and accessed pages is selected so that it exceeds the cache associativity.

Parameters Intel Server, L2 Pages allocated: 32; access stride: 64 pages (4 MB L2 cache size divided by16 ways); pages accessed: 1-32; page colors: none / 64.

AMD Server, L1 and L2Pages allocated: 32; access stride: 8 pages (64 KB L1 cache size divided by 2 waysand 512 MB L2 cache size divided by 16 ways); pages accessed: 1-32; page colors: none / 8.

Expected Results If a particular cache is virtually indexed, the results should show an increase in access du-ration when the number of accesses exceeds the associativity both when using and not using page coloringaware allocation. If the cache is physically indexed and page color based allocation is not used, there shouldbe no such increase in access duration because the stride in virtual addresses does not imply the same stridein physical addresses.

Measured Results The results from PlatformIntel Server(Figure5.42) show that page color based allocationis needed to trigger L2 cache misses – the L2 cache is therefore physically indexed. We can also see that32 accesses in one cache line set are not enough to achieve 100% L2 cache misses, probably due to a




0 5 10 15 20 25 30

050

100

150

200

Number of accesses mapping to the same L2 cache set

Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg T

rim] Page colors

064

Figure 5.42: Dependency of associativity misses in L2 cache on page coloring onIntel Server.

0 5 10 15 20 25 30

020

4060

8010

0

Number of accesses mapping to the same L1/L2 cache set

Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg T

rim] Page colors

None8

Figure 5.43: Dependency of associativity misses in L1 data and L2 cache onpage coloring onAMD Server.

replacement policy not being close to LRU enough – the experiment measuring its miss penalty will thereforeaccess 128 pages.

The results from PlatformAMD Server(Figure5.43) also show that page coloring is needed to trigger L2cache misses with 19 and more accesses. Page coloring also seems to make some difference for the L1 datacache, but values of the events counters (Figure5.44) show that L1 data cache misses occur both with andwithout page coloring and the difference in the observed duration is therefore caused by something else.The L1 data cache is therefore virtually indexed and the L2 cache is physically indexed, which implies thatthe L3 cache is also physically indexed. The results of code fetching variant yield similar results for the L1instruction cache as for the data cache, it is therefore alsovirtually indexed.




0 5 10 15 20 25 30

0.0

0.5

1.0

1.5

2.0

Number of accesses mapping to the same L1/L2 cache set

Eve

nt c

ount

er v

alue

s [e

vent

s −

100

0 w

alks

Avg

Trim

]

Event counters

DATA_CACHE_MISSESDATA_CACHE_MISSES (page coloring)L2_CACHE_MISS:DATAL2_CACHE_MISS:DATA (page coloring)

Figure 5.44: Dependency of associativity misses in L1 data and L2 cache onpage coloring onAMD Server– performancecounters.

5.3.3.4 Miss Penalties

The following experiments determine the penalties of misses in all levels of the cache hierarchy and their possibledependency on the offset of accesses triggering the misses.We again use the pointer walk (Listing5.1) for themeasured workload and create the access pattern so that all accesses map to the same cache line set. For this wecan reuse the same pointer walk initialization code as for the TLB experiments (Listing5.6) because the stride weneed (cache size divided by number of ways) is always a multiple of the 4 KB page size on all of our platforms.The difference here is that we do not use the offset randomization because we need the same cache line offset in apage. Some experiments however set a fixed non-zero offset todetermine whether it influences the miss penalty.

5.3.3.5 Experiment: L1 cache miss penalty

Purpose Determine the cache miss penalty in L1 instruction and L1 data caches and whether it depends on theoffset of the accessed word in the cache line.

Measured Time to perform a single memory access in set collision pointer walk from Listing5.1 and5.6 forthe data caches and unified caches. Time to execute a jump instruction in set collision jump instruction chainfrom Listing5.7and5.5for the L1 instruction cache.

Parameters Intel ServerPages allocated: 32; access pattern stride: 1 page (32 KB cache size divided by8 ways).

Miss penaltyPages accessed: 1-32; access offset: 0.

Offset dependencyPages accessed: 10; access offset: 0-128B (8 B step).

AMD ServerPages allocated: 32; access pattern stride: 8 pages (64 KB cache size divided by 2 ways).


Offset dependencyPages accessed: 1-4; access offset: 0-64 B (8 B step).

Expected Results The accesses should not cause L1 cache misses until the number of accessed pages reachesthe number of associativity ways, and then start causing L1 cache misses. The exact behavior depends onhow the replacement policy behaves with our access pattern.




1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

510

15


Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg]

Figure 5.45: L1 data cache miss penalty onIntel Server.

0 5 10 15 20 25 30

0.0

0.2

0.4

0.6

0.8

1.0


Num

ber

of e

vent

s pe

r ac

cess

[eve

nts

− 1

000

wal

ks A

vg T

rim]

Event counters

DTLB_MISSES:L0_MISS_LDL1D_ALL_REFL1D_REPLL2_LINES_IN:SELFL2_LD:SELF:MESI

Figure 5.46: Performance event counters related to L1 data cache miss penalty onIntel Server.

Measured Results The results of the data access variant from PlatformIntel Server(Figure 5.45) show anincrease from 3 to 14 cycles between 8 and 10 accesses. The fact that this increase is not immediate between8 and 9 accesses suggests that the replacement policy is probably not a true LRU. The L1DREPL (L1 datacache misses) event counter (Figure5.46) increases from 0 to 1 as well as the L2LD (L2 cache loads) eventcounter and there are no L2 cache miss events (L2LINES IN). The L1 cache miss penalty is thus 11 cycles.The subsequent increase from 14 to 16 cycles per access is caused by DTLB0 misses (unavoidable due to itslimited size, as the respective event counter shows), whichmeans the DTLB0 miss penalty of 2 cycles (seeExperiment5.2.2.3) adds up to the L1 data cache miss penalty.

As Figure5.47shows, the penalty does not depend on the access offset.

The results of the code access variant (Figure5.48) show an increase from approximately 3 cycles per jump




0 8 16 24 32 40 48 56 64 72 80 88 96 104 120

12.5

13.0

13.5

14.0

14.5

15.0

15.5

16.0

Offset of access in a cache line [bytes]

Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg]

Figure 5.47: Dependency of L1 data cache miss penalty on access offset in acache line onIntel Server.

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

010

2030

40

Number of accesses mapping to the same L1 instruction cache line set

Dur

atio

n of

acc

ess

[cyc

les

− 1

024

wal

ks A

vg]

Figure 5.48: L1 instruction cache miss penalty onIntel Server.

instruction to 33 cycles between 6 and 11 jump instructions.The performance counters (Figure5.49) showthat this is caused by L1 instruction cache misses and also mispredicted branches. The penalty of the L1miss with the misprediction is thus approximately 30 clock cycles.

The results of the data access from PlatformAMD Server(Figure5.50) show an increase in access durationfrom 3 to 43 cycles between 2 and 3 accesses which confirms the 2-way associativity. The 3 cycles latency foran L1 cache hit confirms the vendor documentation [13, page 223]. The performance counters (Figure5.51)confirm that 3 and more accesses cause L1 misses and no L2 misses. The penalty is however excessivelyhigh for an L1 miss – the vendor documentation states a 9 cycles L2 access latency [13, page 223].

The duration per access then decreases at 4 accesses, after which it remains approximately at 30 cycles. Notethat in Experiment5.3.3.3we saw that this decrease does not occur when page color allocation is used. The




0 5 10 15 20 25 30

0.0

0.5

1.0

1.5


Cou

nts

of e

vent

s [e

vent

s −

102

4 w

alks

Avg

Trim

] Event counters

L1I_MISSESBR_IND_MISSP_EXEC

Figure 5.49: Performance event counters related to L1 instruction cachemiss penalty onIntel Server.

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

010

2030

40


Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg]

Figure 5.50: L1 data cache miss penalty when accessing a single cache lineset onAMD Server.

DISPATCH STALL FOR LS FULL event counter (Figure5.52) indicates what causes these penalties andas Experiment5.3.3.7shows, the penalty is not so high if the misses are not concentrated in a single cacheline set. Another unexpected results is that the REQUESTSTO L2:DATA event counter shows 2 L2 cacheaccesses per one pointer access for 3 accesses.

For this platform we will therefore keep using the single cache line set access pattern to confirm numberof associativity ways and to determine the upper limit of thecache miss penalty, and a lower bound of thecache miss penalty will be determined by accessing random cache line sets instead of a single one.

Note that we observed no dependency of the penalty on the access offset when accessing a single cache lineset, however as it could be masked by the anomaly, we will alsomeasure it in Experiment5.3.3.7.

The result of instruction fetch variant on PlatformAMD Server(Figure5.53) show an increase from 5 to 30




0 5 10 15 20 25 30

0.0

0.5

1.0

1.5

2.0


Cou

nts

of e

vent

s [e

vent

s −

100

0 w

alks

Avg

Trim

] Event counters

DATA_CACHE_MISSESL2_CACHE_MISS:DATAREQUESTS_TO_L2:DATA

Figure 5.51: Performance event counters related to L1 data cache misses when accessing a single cache line set onAMD Server.

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

010

2030

40

Number of accesses mapping to the same L1 cache setCou

nts

of D

ISP

AT

CH

_ST

ALL

_FO

R_L

S_F

ULL

[eve

nts

− 1

00 w

alks

Avg

]

Figure 5.52: Dispatch Stall for LS Full events when accessing a single cache line set onAMD Server.




1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

010

2030

4050


Dur

atio

n of

acc

ess

[cyc

les

− 1

024

wal

ks A

vg]

Figure 5.53: L1 instruction cache miss penalty when accessing a single cache line set onAMD Server.

0 5 10 15 20 25 30

0.0

0.5

1.0

1.5


Cou

nts

of e

vent

s [e

vent

s −

102

4 w

alks

Avg

Trim

] Event counters

INSTRUCTION_CACHE_MISSESRETIRED_MISPREDICTED_BRANCH_INSTRUCTIONS

Figure 5.54: Performance event counters related to L1 instruction cachemiss penalty when accessing a single cache lineset onIntel Server.

cycles at 3 jump instructions, confirming the 2-way associativity of the L1 instruction cache. The perfor-mance counters (Figure5.54) show that instruction cache misses occur along with some amount of mispre-dicted branch predictions. Note that the INSTRUCTIONCACHE MISSES event counter shows somewhathigher values than expected, which could be caused by speculative execution. The penalty of L1 instruc-tion miss (when accessing a single cache line set) is 25 cycles including the mispredicted branch predictionpenalty.

Open Issues The exact reason why cache misses in a single L1 cache line setcause significantly higher penaltiesthan misses spreading over multiple sets remains an open question.




5.3.3.6 Experiment: L2 cache miss penalty

Purpose Determine the L2 cache miss penalty and whether it depends onthe offset of the accessed word in thecache line.


Parameters Intel ServerPages allocated: 128; access stride: 64 pages (4 MB cache size divided by 16 ways);page colors: 64.

Miss penaltyPages accessed: 1-128; access offset: 0 B.

Offset dependencyPages accessed: 128; access offset: 0-128 B (8 B step).

AMD ServerPages allocated: 32; access stride: 8 pages (512 KB cache size divided by 16 ways); pagecolors: 8.

Miss penaltyPages accessed: 1-32; access offset: 0 B.


Expected Results The accesses should not cause L2 cache misses until the number of accessed pages reachesthe number of associativity ways, and then start causing cache misses. Because the stride needed to map intothe same L2 cache line set is the same as for the L1 DTLB on PlatformIntel Server(see Experiment5.2.2.2),we should also observe L1 DTLB misses and will have to subtract their already known penalty to obtain theL2 cache miss penalty. On PlatformIntel Serverthe fully associative L1 DTLB with its 48 entries should beable to hold translations for all 32 accesses at once.

Measured Results The results from PlatformIntel Server(Figure5.55) show an increase from 3 to 12 cyclesat 5 accesses due to the DTLB misses (event counters in Figure5.56), matching the results of Experi-ment5.2.2.2. Between 8 and 10 accesses, the duration per access increases to 23 cycles due to the L1 datacache misses, which thus add up to the DTLB misses. Starting from 17 accesses we can see a rapid increasein access duration, matched by the L2LINES IN:SELF (L2 cache misses) event counter. This confirms the16-way associativity of the L2 cache and that the policy is not exclusive, otherwise the number of ways inthe L1 and L2 would effectively add up. The change from zero misses per access to one miss per access ishowever not immediate, which suggests that the replacementpolicy is not true LRU.

Aside from the L2 cache misses, the access latency is furtherincreased by the L1 data cache misses causedby page walks, as the L1DREPL and PAGEWALKS:CYCLES (Figure5.57) event counters show.

At around 80 accesses we see another sudden increase in latency up to 300 cycles per access. However, noperformance event counter related to caches explains this change.

The results of the cache line offset dependency (Figure5.58) show that the L2 cache miss penalty does notdepend on the offset of the access inside one cache line. However, the results hint at a possible dependencyof the L2 miss penalty on the cache line set used. This is further investigated in the next experiment.

The results from PlatformAMD Server(Figure5.59) again exhibit unusually high durations as in Exper-iment 5.3.3.5. We can see again an increase from 3 to 43 cycles when the number of L1 cache ways isexceeded. Between 18 and 19 accessed pages we see an increasefrom 43 to 106 cycles followed by adecrease to around 59 cycles, which is the penalty of extra 16-63 cycles for misses in a single cache lineset. The performance event counters (Figure5.60) show that the increase is caused by L2 cache misses, butstrangely reports 2 per access misses for 19 accesses. The fact that the increase occurs between 18 and 19accessed pages confirms the 16-way associativity and the exclusive policy, where the effective cache sizesof caches at different levels add up.

We observed no dependency of the penalty on the access offsetwith 18 and 19 accessed pages, there ishowever some interesting dependency with 20 accessed pages. As Figure5.61shows, the duration of accessvaries between 54 and 59 cycles depending on the offset.

Open Issues The exact reason why cache misses in a single L2 cache line setcause significantly higher penaltiesthan misses spreading over multiple sets remains open, as well as the reported 2 misses per access for 19accesses.




1 6 12 19 26 33 40 47 54 61 68 75 82 89 96 104 113 122

050

100

150

200

250

300

Number of accesses mapping to the same L2 cache line set

Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg]

Figure 5.55: L2 cache miss penalty onIntel Server.

0 20 40 60 80 100 120

0.0

0.5

1.0

1.5

2.0


Cou

nts

of e

vent

s [e

vent

s −

100

0 w

alks

Avg

Trim

]

Event counters

DTLB_MISSES:ANYL1D_ALL_REFL1D_REPLL2_LINES_IN:SELF

Figure 5.56: Performance events related to L2 cache miss penalty onIntel Server.

5.3.3.7 Experiment: L1 and L2 cache random miss penalty, AMD Server

Purpose Determine the L1 and L2 cache miss penalty when accessing random cache line sets, and whether itdepends on the offset of the accessed word in the cache line onPlatformAMD Server

Measured Time to perform a single memory access in random pointer walkfrom Listing 5.1 and5.5 for thedata caches and unified caches. Time to execute a jump instruction in random jump instruction chain fromListing 5.7and5.5for the L1 instruction cache.

Parameters L1 Miss penaltyAllocated: 128 KB; accessed: 16-128KB (16 KB step); stride:64 B.

L1 Offset dependencyAllocated: 128 KB; accessed: 128 KB; stride: 64 B; access offset: 0-56 B (8 B step).L2 Miss penaltyAllocated: 640 KB; accessed: 64-640KB (32 KB step); stride:64 B.




1 6 12 19 26 33 40 47 54 61 68 75 82 89 96 104 113 122

05

1015

20


Cou

nts

of P

AG

E_W

ALK

S:C

YC

LES

[eve

nts

− 1

000

wal

ks A

vg]

Figure 5.57: Cycles spent by page walks when accessing a single L2 cache line set onIntel Server.

0 8 16 24 32 40 48 56 64 72 80 88 96 104 120

270

275

280

285

290

295

300

Offset of access in 3 adjacent cache line sets [bytes]

Dur

atio

n of

acc

ess

[cyc

les

− 1

000

Wal

ks A

vg]

Figure 5.58: Dependency of L2 cache miss penalty on access offset in a cache line in two adjacent cache line sets onIntel Server.




1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

020

4060

8010

0


Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg]

Figure 5.59: L2 cache miss penalty when accessing a single cache line set on AMD Server.

0 5 10 15 20 25 30

0.0

0.5

1.0

1.5

2.0


Cou

nts

of e

vent

s [e

vent

s −

100

0 w

alks

Avg

Trim

]

Event counters

DATA_CACHE_MISSESL1_DTLB_MISS_AND_L2_DTLB_HIT:L2_4K_TLB_HITL2_CACHE_MISS:DATA

Figure 5.60: Performance event counters related to L2 cache misses when accessing a single cache line set onAMD Server.




0 8 16 24 32 40 48 56 64 72 80 88 96 104 120

5254

5658

60

Offset of access in 3 adjacent cache line sets [bytes]

Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg]

Figure 5.61: Dependency of L2 cache miss penalty on access offset in a cache line in two adjacent cache line sets, whenaccessing 20 cache lines in the same set onAMD Server.

L2 Offset dependencyAllocated: 640 KB; accessed: 640 KB; stride: 64 B; access offset: 0-56 B (8 B step).

Expected Results The amount of allocated memory is selected to exceed the L1 cache size but fit in the L2cache size or to exceed the L1 + L2 cache size but fit in the L3 cache size, respectively. As we increasethe amount of accessed cache lines, the ratio of L1 or L2 cachemisses and thus the duration per one accessshould increase. Accessing the whole allocated memory buffer should cause 100 % L1 or L2 cache misses.After subtracting the 3 cycles duration per L1 data cache hitthat we observed in the previous experiment,we should obtain the L1 data cache miss penalty for accesses to random cache lines. Similarly we obtain theL2 cache miss penalty.

Measured Results The results for the L1 data cache (Figure5.62) show an expected increase of duration peraccess as the number of accesses increases. Values of related performance event counters (Figure5.63)confirm that we observe L1 data cache misses, L2 cache hits andno L1 DTLB misses. Accessing the wholememory buffer causes an L1 miss for each access and costs 15 cycles, which yields a penalty of 12 cycles.Note that this is still somewhat higher than the 9 cycles stated in vendor documentation [13, page 223].

The results for the L1 instruction cache (Figure5.64) show a gradual increase up to 25 cycles per access.Values of related performance event counters (Figure5.65) show that it is caused by L1 instruction cachemisses and partially also mispredicted branch instructions and L1 ITLB misses. The penalty of the L1 missis thus 20 cycles when accessing random cache line sets, including the overhead of the partial ITLB missesand branch mispredictions.

The results for the L2 cache (Figure5.66) show an increase to 47 cycles per access when accessing thewhole 640 KB of allocated buffer. The performance event counters (Figure5.67) show that it is caused byL2 cache misses as expected but also by 0.7 L1 DTLB misses in average. One L1 DTLB miss per accesscould theoretically add 5 cycles to the penalty (Experiment5.2.2.2). The L1 DTLB misses are inevitablewhen accessing such amount of memory, the L2 DTLB is however sufficient. The penalty of the L2 cachemiss including the L1 DTLB miss overhead is therefore 32 cycles in addition to the L1 cache miss.

The L1 miss penalty does not depend on the access offset. We however observed a small dependency on theaccess offset in a L2 cache line (Figure5.68). The access duration increases with each 16 B of the offset andcan add almost 3 cycles to the L2 miss penalty.




16384 32768 49152 65536 81920 98304 114688 131072

46

810

1214

16

Amount of data accessed [bytes]

Dur

atio

n of

acc

ess

[cyc

les

− 1

wal

k A

vg]

Figure 5.62: L1 data cache miss penalty when accessing random cache line sets onAMD Server.

20000 40000 60000 80000 100000 120000

0.0

0.2

0.4

0.6

0.8

1.0


Cou

nts

of e

vent

s [e

vent

s −

1 w

alk

Avg

Trim

]

Event counters

DATA_CACHE_MISSESL2_CACHE_MISS:DATAREQUESTS_TO_L2:DATA

Figure 5.63: Performance event counters related to L1 data cache misses when accessing random cache line sets onAMD Server.




16384 32768 49152 65536 81920 98304 114688 131072

510

1520

25

Amount of code executed [bytes]

Dur

atio

n of

acc

ess

[cyc

les

− 1

wal

k A

vg]

Figure 5.64: L1 instruction cache miss penalty when accessing random cache line sets onAMD Server.

20000 40000 60000 80000 100000 120000

0.2

0.4

0.6

0.8

1.0

1.2

1.4


Cou

nts

of e

vent

s [e

vent

s −

1 w

alk

Avg

Trim

]

Event counters

L1_ITLB_MISS_AND_L2_ITLB_HITINSTRUCTION_CACHE_MISSESRETIRED_MISPREDICTED_BRANCH_INSTRUCTIONS

Figure 5.65: Performance event counters related to L1 instruction cachemisses when accessing random cache line setsonAMD Server.




65536 131072 229376 327680 425984 524288 622592

1020

3040

50


Dur

atio

n of

acc

ess

[cyc

les

− 1

wal

k A

vg]

Figure 5.66: L2 data cache miss penalty when accessing random cache line sets onAMD Server.

1e+05 2e+05 3e+05 4e+05 5e+05 6e+05

0.0

0.2

0.4

0.6

0.8

1.0


Cou

nts

of e

vent

s [e

vent

s −

1 w

alk

Avg

Trim

]

Event counters

DATA_CACHE_MISSESL1_DTLB_MISS_AND_L2_DTLB_HIT:L2_4K_TLB_HITL1_DTLB_AND_L2_DTLB_MISS:4K_TLB_RELOADL2_CACHE_MISS:DATA

Figure 5.67: Performance event counters related to L2 data cache misses when accessing random cache line sets onAMD Server.




0 8 16 24 32 40 48 56

46.5

47.0

47.5

48.0

48.5

49.0

49.5

Offset of access in a cache line [bytes]

Dur

atio

n of

acc

ess

[cyc

les

− 1

0K A

vg]

Figure 5.68: Dependency of L2 cache miss penalty on access offset in a cache line when accessing random cache linesets onAMD Server.

5.3.3.8 Experiment: L2 cache miss penalty dependency on cac he line set, Intel Server

In Experiment5.3.3.6, we saw that the penalty of the L2 miss when accessing a singleL2 cache line set on PlatformIntel Servermay differ depending on the cache line set used for the experiment. The following experiment aimsto further explore this dependency. It repeats Experiment5.3.3.6on (1) all cache line sets that have a fixed pagecolor of 0 and (2) cache lines sets with different page color and a fixed offset in the page. This is done by varyingthe offset of accesses relatively to the memory buffer, which is aligned to the beginning of a page with color 0.

Purpose Determine the dependency of a L2 cache miss penalty on the cache line set being accessed on PlatformIntel Server.


Parameters Pages allocated: 128; access pattern stride: 64 pages (4 MB cache size divided by 16 ways); pagesaccessed: 16-128 (exponential step); page colors: 64.

One page color (0)Access offset: 0-4096B (64 B step).

Different page colorsAccess offset: 256-258304B (4 KB step).

Expected Results There might be some differences on access duration depending on the cache line set due tosome cache line sets getting additional accesses other thenaccesses caused by the pointer walk itself, whichcause more evictions in that set. A possible cause are the page walk accesses due to DTLB misses the pointerwalk causes on such number of accessed pages. Other differences may be caused by memory bus, controller,or the system memory itself, because each L2 cache miss results in a system memory access.

Measured Results The results of varying cache line offset when accessing pages with color 0 show that for 16accessed pages, accesses with offsets that are multiples of512 B are slightly slower than others (Figure5.69).This is accompanied by increases in L1 data cache misses and page walk cycles and can be explained bythe DTLB misses causing page table lookups. Since there are 64 page colors and we access pages withcolor 0, the address translation reads page directory entries with numbers being also multiple of 64 in a pagedirectory. Since the entries are 8 B large offsets of the entries being read are therefore multiple of 512 B.

With 32 accessed pages (Figure5.71) this effect on access duration diminishes (although the performancecounters still show the difference). Instead, we see that accesses to odd cache lines are approximately




0 256 576 896 1216 1600 1984 2368 2752 3136 3520 3904

2324

2526

27

Offset of access in a page [bytes]

Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg]

Figure 5.69: Dependency of L2 cache miss penalty on accessed cache line sets with page color 0 and 16 accesses onIntel Server.

0 1000 2000 3000 4000

12

34

56


Cou

nts

of e

vent

s [e

vent

s −

100

0 w

alks

Avg

Trim

] Event counters

PAGE_WALKS:CYCLESL1D_REPL

Figure 5.70: Performance event counters related to L2 cache misses on accessed cache line sets with page color 0 and16 accesses onIntel Server.

6 cycles slower than accesses to even cache lines. Interestingly, this effect is reversed when we furtherincrease the number of accessed pages – accesses to even cache lines are slower than to the odd lines.Finally at 128 accessed pages we see that accesses to the evencache lines take 300 cycles, which is 30cycles slower than the 270 cycles for odd cache lines. The L2 miss penalty is thus 256-286 cycles beyondL1 cache miss. This includes the penalty DTLB miss and L1 datacache miss during page walk, as describedin Experiment5.3.3.6– these events are hard to avoid when accessing memory range so large to not fit inthe L2 cache.

The experiment variant with different color pages uses 256 Boffset in a page to avoid the abovementionedcollisions with page table entries. The results showed no dependency on the page color.




0 1000 2000 3000 4000

220

240

260

280

300


Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg]

test_pages

3264128

Figure 5.71: Dependency of L2 cache miss penalty on accessed cache line sets with page color 0 and 32-128 accesses onIntel Server.

We can conclude that there seems to be difference only between odd and even cache line sets. No eventcounter that we sampled would however explain this difference, which indicates that this might be a propertyof parts of the memory subsystem beyond the processor caches.

5.3.3.9 Experiment: L3 cache miss penalty, AMD Server

Purpose Determine the miss penalty of the L3 cache (only present on PlatformAMD Server) both when ac-cessing a single cache line set and random cache line sets. Also determine whether it depends on the offsetof the accessed word in the cache line.

Measured Time to perform a single memory access in set collision pointer walk from Listing5.1.

Parameters Single setPages allocated: 64; access stride: 16 pages (2 MB L3 cache size divided by 32 ways);page colors: 16; pattern: set collision (Listing5.6).



Random setsAllocated: 4096 KB; stride: 64 B; page colors: 16; pattern: random (Listing5.5)

Miss penaltyAccessed: 256-4096KB (256 KB step); access offset: 0.

Offset dependencyAccessed: 4096 KB; access offset: 0-56 B (8 B step).

Expected Results When accessing a single cache line set, we should see L3 cachemisses as we exceed thenumber of ways in all 3 levels due to the exclusive policy. When accessing random cache line sets, weshould see L3 cache misses as the amount of accessed memory increases.

Measured Results The results when accessing a single cache line set (Figure5.72) show an increase in accessduration from 60 to 80 cycles at 49 accesses. With 51 and more accesses, we observe a duration of 265-270cycles. The performance event counters (Figure5.73) show that the first increase in duration is caused by L2DTLB misses that are inevitable with that many accesses. Thesecond increase in duration is accompanied byan increase from 0 to 1 of the DRAMACCESSESPAGE:ALL event count which confirms system memoryaccesses due to L3 cache misses. The L3CACHE MISSES:ALL event counter however shows less than 0.5




1 4 7 10 14 18 22 26 30 34 38 42 46 50 54 58 62

050

100

150

200

250

300


Dur

atio

n of

acc

ess

[cyc

les

− 1

000

wal

ks A

vg]

Figure 5.72: L3 cache miss penalty when accessing a single cache line set on AMD Server.

0 10 20 30 40 50 60

0.0

0.2

0.4

0.6

0.8

1.0


Cou

nts

of e

vent

s [e

vent

s −

100

0 w

alks

Avg

Trim

] Event counters

L1_DTLB_MISS_AND_L2_DTLB_HIT:L2_4K_TLB_HITL1_DTLB_AND_L2_DTLB_MISS:4K_TLB_RELOADL3_CACHE_MISSES:ALLDRAM_ACCESSES_PAGE:ALL

Figure 5.73: Performance event counters related to L3 cache misses when accessing a single cache line set onAMD Server.

misses per access in average, which could be an implementation error. The fact that the increase occurs at51 accesses confirms the 32 ways in the L3 cache and its exclusivity.

When determining the offset dependency, with 48 accessed pages we observed similar results as in Experi-ment5.3.3.7– in both cases there are just L2 cache misses involved. Thereis however no dependency with49 and 50 accessed pages, the additional overhead of L2 DTLB misses probably hides it. We also observedno dependency for 51 and more accessed pages, where the L3 cache misses occur.

The results when accessing random cache line sets (Figure5.74) show a gradual increase of access dura-tion up to 255 cycles. The DRAMACCESSESPAGE:ALL event counter (Figure5.75) confirms that thisis caused by system memory accesses and the L3CACHE MISSES:ALL shows unexpectedly low values




262144 786432 1310720 1835008 2359296 2883584 3407872 3932160

5010

015

020

025

0


Dur

atio

n of

acc

ess

[cyc

les

− 1

wal

k A

vg]

Figure 5.74: L3 cache miss penalty when accessing random cache line sets on AMD Server.

1e+06 2e+06 3e+06 4e+06

0.0

0.2

0.4

0.6

0.8

1.0


Cou

nts

of e

vent

s [e

vent

s −

1 w

alk

Avg

Trim

]

Event counters

L1_DTLB_MISS_AND_L2_DTLB_HIT:L2_4K_TLB_HITL1_DTLB_AND_L2_DTLB_MISS:4K_TLB_RELOADL3_CACHE_MISSES:ALLDRAM_ACCESSES_PAGE:ALL

Figure 5.75: Performance event counters related to L3 cache misses when accessing random cache line sets onAMD Server.

again. We can also see that almost each access results in L1 orL2 DTLB miss – this is inevitable whenaccessing in such memory range. The penalty of an L3 miss including the L2 DTLB miss is thus 208 cyclesbeyond the L2 miss penalty, when accessing random cache linesets.

We observed no dependency of the penalty on the access offsetin a cache line when accessing random cacheline sets.


Code most sensitive to pipelined sharing of memory content caches includes:




• Accesses to memory where addresses belonging to the same cache line are accessed only once in a patternthat does not trigger prefetching, and where the accessed data would fit into the processor cache in theisolated scenario.

• A mix of reads and writes where modifications would not be flushed to memory by the end of each cycle inthe isolated scenario.


The following experiments resemble the pipelined sharing scenario with the most sensitive code. Executions oftwo workloads are interleaved and thus the interfering workload evicts code or data of the measured workload fromthe caches during its execution.

For data caches, the measured workload accesses data in a memory buffer so that each access uses an addressin a different cache line. This is done by executing the random pointer walk code in Listing5.1and5.5with strideset to the cache line size. The memory buffer size is set to fit in the particular cache.

The interfering workload is the same as the measured workload, but accessing a different memory buffer inorder to evict the data or code of the measured workload. The amount of memory accessed by the interferingworkload varies from none to an amount that guarantees full eviction of the measured workload. For data caches,the interfering workload uses either read accesses or writeaccesses in order to determine the overhead of dirtycache lines write-back in the measured workload.

5.3.5.1 Experiment: L1 data cache sharing

Purpose Determine the impact of L1 data cache sharing on the most sensitive code.


Parameters Intel ServerAllocated: 32 KB; accessed: 32 KB; stride: 64 bytes.

AMD ServerAllocated: 64 KB; accessed: 64 KB; stride: 64 bytes.

Interference The same as the measured workload. Allocated: 128 KB; accessed: 0-128KB (8 KB step); stride:64 B; access type: read-only, write.

Expected Results The measured workload should fit in the L1 data cache and therefore hit on each access whenexecuted with no interference. The interfering workload should increasingly evict the data of the measuredworkload until the whole buffer is evicted and the measured workload should miss the L1 data cache on eachaccess.

Measured Results The results confirm the expected slowdown due to the interfering workload. On PlatformIntel Server, the average duration of a memory access in the measured workload increases from 5.5 cyclesto 14 cycles with read-only interference and 14.5 cycles with write interference (Figure5.76). With 512 L1cache lines the total overhead is approximately 4350 and 4600 cycles, respectively.

On PlatformAMD Serverwe observed no difference between read-only and write interference, due to theexclusive caches. The average duration increases from 3 to 15 cycles due to the sharing, see Figure5.77forresults of the read-only variant. With 1024 cache lines thismeans an overhead of 12300 cycles.

Effect Summary The overhead can be visible in workloads with very good locality of data references that fitin the L1 data cache when executed alone. The cache miss can berepeated only as many times as there arethe L1 data cache entries, the overhead will therefore only be significant in workloads where the number ofdata accesses per invocation is comparable to the size of theL1 data cache.




0 20000 40000 60000 80000 100000 120000

68

1012

14

Amount of memory accessed by the interfering workload [bytes]

Dur

atio

n of

acc

ess

[cyc

les

− 5

12 A

vg T

rim]

Interfering accesses

readwrite

Figure 5.76: L1 data cache sharing impact on code accessing random cache lines onIntel Server.

0 8192 24576 40960 57344 73728 90112 106496 122880

46

810

1214

16


Dur

atio

n of

acc

ess

[cyc

les

− 1

024

Avg

]

Figure 5.77: L1 data cache sharing impact on code accessing random cache lines onAMD Server.

5.3.5.2 Experiment: L2 cache sharing

Purpose Determine the impact of L2 cache sharing on the most sensitive code.


Parameters Intel ServerAllocated: 4 MB; accessed: 4 MB; stride: 128 bytes.


Interference The same as the measured workload. Access type: read-only, write. Intel ServerAllocated:16 MB; accessed: 0-16 MB (1 MB step); stride: 128 B;

AMD ServerAllocated: 1024 KB; accessed: 0-1024 KB (64 KB step); stride: 64 B.




0.0e+00 5.0e+06 1.0e+07 1.5e+07

100

150

200

250


Dur

atio

n of

acc

ess

[cyc

les

− 3

2K A

vg T

rim]


readwrite

Figure 5.78: L2 cache sharing impact on code accessing random cache lineson Intel Server.

Expected Results The measured workload should fit in the L2 cache and thereforehit on each access when ex-ecuted with no interference, except for the associativity misses due to the physical indexing. The interferingworkload should increasingly evict the data of the measuredworkload until the whole buffer is evicted andthe measured workload should miss the L2 cache on each access.

Measured Results On PlatformIntel Server, the average duration of memory access in the measured workloadincreases from 80 cycles to 247 cycles, resp. 258 cycles withwrite interference (Figure5.78). With all32768 accessed pairs of L2 cache entries, the total overheadis 5.5, resp. 5.8 millions of cycles.

On PlatformAMD Serverwe again saw no difference between the read-only and write interference. Theaverage duration of memory access in the measured workload increases from 26 cycles to 47 cycles (Fig-ure5.79for the read-only variant). With all 8192 accessed L2 cache entries, the total overhead is 172000cycles.

Effect Summary The overhead can be visible in workloads with very good locality of data references that fit inthe L2 cache when executed alone. The cache miss can be repeated only as many times as there are the L2cache entries (or pairs of entries on platform with adjacentline prefetch), the overhead will therefore onlybe significant in workloads where the number of data accessesper invocation is comparable to the size ofthe L2 cache.

5.3.5.3 Experiment: L3 cache sharing, AMD Server

Purpose Determine the impact of L3 cache sharing on the most sensitive code on PlatformAMD Server


Parameters Allocated: 2 MB; accessed: 2 MB; stride: 64 B.

Interference The same as the measured workload. Allocated: 4 MB; accessed: 0-4 MB (256 KB step); stride:64 B; access type: read-only, write.

Expected Results The only difference from the previous experiments is that the write interference should makesome difference on this platform – only dirty cache lines have to be written back to the system memory uponeviction.




0 131072 262144 393216 524288 655360 786432 917504 1048576

2025

3035

4045


Dur

atio

n of

acc

ess

[cyc

les

− 8

192

Avg

]

Figure 5.79: L2 cache sharing impact on code accessing random cache lineson AMD Server.

0e+00 1e+06 2e+06 3e+06 4e+06

100

150

200


Dur

atio

n of

acc

ess

[cyc

les

− 3

2K A

vg T

rim]


readwrite

Figure 5.80: L3 cache sharing impact on code accessing random cache lineson AMD Server.

Measured Results The results (Figure5.80) show that the expected difference between read-only and writeinterference exists but is very small. The average durationof memory access in the measured workloadincreases from 59 to 238, resp. 240 cycles. With all 32768 L3 cache entries accessed, the total overhead isapproximately 5.9 millions of cycles.

Effect Summary The overhead can be visible in workloads with very good locality of data references that fit inthe L3 cache when executed alone. The cache miss can be repeated only as many times as there are the L3cache entries, the overhead will therefore only be significant in workloads where the number of data accessesper invocation is comparable to the size of the L3 cache.




0 8192 24576 40960 57344 73728 90112 106496 122880

510

1520

2530


Dur

atio

n of

acc

ess

[cyc

les

− 5

12 A

vg]

Figure 5.81: L1 instruction cache sharing impact on code that jumps between random cache lines onIntel Server.

5.3.5.4 Experiment: L1 instruction cache sharing

The L1 instruction cache sharing experiment is similar to the L1 data cache one except for executing chains ofjump instruction from Listing5.7as both measured and interfering workload.

Purpose Determine the impact of L1 instruction cache sharing on the most sensitive code.

Measured Time to execute a jump instruction in random jump instruction chain from Listing5.7and5.5.

Parameters Intel ServerAllocated: 32 KB; accessed: 32 KB; stride: 64 bytes.


Interference The same as the measured workload. Allocated: 128 KB; accessed: 0 KB-128 KB (8 KB step);stride: 64 B.

Expected Results The measured workload should fit in the L1 instruction cache and therefore hit on each jumpinstruction when executed with no interference. The interfering workload should increasingly evict the codeof the measured workload until it is all evicted and the measured workload should miss the L1 instructioncache in each executed jump instruction.

Measured Results The results confirm the expected slowdown due to the interfering workload. On PlatformIntel Server, the average duration of one jump instruction in the measured workload increases from 5 to 28cycles (Figure5.81). With 512 L1 cache lines the total overhead is approximately 11800 cycles.

On PlatformAMD Server, the average duration of one jump instruction in the measured workload increasesfrom 9 to 21 cycles (Figure5.82). With 1024 L1 cache lines the total overhead is approximately 12300cycles.

Effect Summary The overhead can be visible in workloads that perform many jumps and branches and that fitin the L1 instruction cache when executed alone. The cache miss can be repeated only as many times asthere are the L1 instruction cache entries, the overhead will therefore only be significant in workloads wherethe number of executed branch instructions per invocation is comparable to the size of the L1 instructioncache.




0 8192 24576 40960 57344 73728 90112 106496 122880

1015

2025


Dur

atio

n of

acc

ess

[cyc

les

− 1

024

Avg

]

Figure 5.82: L1 instruction cache sharing impact on code that jumps between random cache lines onAMD Server.

5.3.6 Real Workload Experiments: Fourier Transform

A Fast Fourier Transform implementation takes a memory buffer filled with input data as its input and transformsit either in-place or with a separate memory buffer for the output. It is an example of a memory intensive operationand might be therefore affected by the data cache sharing. Specifically in the pipelined scenario, the performanceof a component performing the FFT transformation might be affected by the amount of the input data being cachedupon the components invocation. If another memory intensive component is executed in the pipeline between acomponent that fills the buffer from e.g. disk, network or by aprevious processing, and the FFT component, theinput data is evicted from the cache.

The following experiments model this scenario by repeated execution of sequence of the following operations– buffer initialization, data cache eviction and the in-place FFT calculation, whose duration is measured.

A slightly different variant of the experiment invokes FFT calculation with separate input and output buffersinterleaved with the data cache eviction. In this scenario,the buffer is initialized only once before the experiment.

We use FFTW 3.1.2 [27] as the FFT implementation. For the data eviction we executethe pointer walk code(Listing 5.1) accessing random cache lines (Listing5.5) using both read-only and write access variants.

5.3.6.1 Experiment: FFT sharing data caches

Purpose Determine the impact of data cache sharing on performance ofthe FFT calculation.

Measured Duration of a FFT calculation with varying input buffer size.

Parameters FFT method: FFTW in-place, FFTW separate buffers; FFT buffer size: 4-8192KB (exponentialstep)

Interference Pointer walk (Listing5.1) accessing random cache lines (Listing5.5). Allocated: 16 MB; ac-cessed: 0-16 MB (1 MB step); access stride: 64 B; access type:read-only, write.

Expected Results Increasing the amount of data accessed by the cache evictionshould increase the duration ofthe FFT transformation due to more cache misses. The effect should diminish as the FFT buffer size exceedsthe size of the last-level cache, because the FFT calculation is already causing capacity misses during itsoperation. The effect should be stronger in some configurations, when the interfering code is performingmemory writes, due to the write-back of dirty cache lines.




0.0e+00 5.0e+06 1.0e+07 1.5e+07

050

100

150

200

250

Amount of memory read by interfering workload [bytes]

Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

FFT buffer size

4 KB8 KB16 KB32 KB64 KB128 KB256 KB512 KB1 MB2 MB4 MB8 MB

Figure 5.83: In-place FFT slowdown by data cache sharing with read-only interference onIntel Server.

0.0e+00 5.0e+06 1.0e+07 1.5e+07

050

100

150

200

250

300

Amount of memory written by interfering workload [bytes]

Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

FFT buffer size


Figure 5.84: In-place FFT slowdown by data cache sharing with write interference onIntel Server.

Measured Results The results from PlatformIntel Servershow a significant slowdown due to cache evictionin all cases where the FFT buffer (or the two separate buffers) fit in half of the L2 memory cache. Writeinterference and using separate buffers both generally increase the slowdown. See Figures5.83and5.84forthe in-place variant,5.85and5.86for the separate-buffers variant.

The slowdown diminishes with 4 MB and larger in-place bufferor two 2 MB input and output buffers. Thelargest slowdown occurs with 8 KB buffer – almost 260 % with read-only interference and almost 300 %with write interference in the in-place variant, 400 % and 500 % in the separate-buffers variant.

In the variant with separate buffers and buffer sizes that donot fit in the L2 cache we also observed asituation where the read-only interfering workload improves, albeit slightly, the perceived performance ofthe FFT calculation (Figure5.87). This is a case where the calculation leaves dirty cache lines in the caches




0.0e+00 5.0e+06 1.0e+07 1.5e+07

010

020

030

040

0


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

FFT buffer size


Figure 5.85: FFT with separate buffers slowdown by data cache sharing with read-only interference onIntel Server.

0.0e+00 5.0e+06 1.0e+07 1.5e+07

010

020

030

040

050

0


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

FFT buffer size


Figure 5.86: FFT with separate buffers slowdown by data cache sharing with write interference onIntel Server.




0 2097152 5242880 8388608 11534336 14680064

6750

0000

6850

0000

6950

0000


Dur

atio

n of

FF

T c

alcu

latio

n [c

ycle

s]

Figure 5.87: FFT with separate buffers speedup thanks to dirty lines eviction by read-only interference with 8 MB bufferson Intel Server.

0.0e+00 5.0e+06 1.0e+07 1.5e+07

050

100

150

200

250


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

FFT buffer size


Figure 5.88: In-place FFT slowdown by data cache sharing with read-only interference onAMD Server.

when it finishes, but accesses some different memory when it is executed again. The read-only interferenceevicts these dirty cache lines and replaces them with clean cache lines, which decreases the perceived cachemiss penalty in the FFT calculation. This effect naturally does not occur with write interference.

The results from PlatformAMD Serverare similar, with less significant effect of write interference andgenerally smaller relative slowdown. The most significant slowdown observed with in-place transformationis 250 % and 270 % with 8 KB FFT buffer using read-only and writeinterference, respectively. With separatebuffers, 4 KB FFT buffer yields the most significant slowdown– up to 300 %, respectively 350 %.

Similarly to the results from PlatformIntel Server, we also observed very small speedup with separate 8 MBbuffers and read-only interference, which does not occur with write interference (Figure5.92).




0.0e+00 5.0e+06 1.0e+07 1.5e+07

050

100

150

200

250


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

FFT buffer size


Figure 5.89: In-place FFT slowdown by data cache sharing with write interference onAMD Server.

0.0e+00 5.0e+06 1.0e+07 1.5e+07

050

100

150

200

250

300


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

FFT buffer size


Figure 5.90: FFT with separate buffers slowdown by data cache sharing with read-only interference onAMD Server.




0.0e+00 5.0e+06 1.0e+07 1.5e+07

050

100

150

200

250

300

350


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

FFT buffer size


Figure 5.91: FFT with separate buffers slowdown by data cache sharing with write interference onAMD Server.

0 2097152 5242880 8388608 11534336 14680064

8550

0000

8650

0000

8750

0000


Dur

atio

n of

FF

T c

alcu

latio

n [c

ycle

s]

Figure 5.92: FFT with separate buffers speedup thanks to dirty lines eviction by read-only interference with 8 MB buffersonAMD Server.




Listing 5.9: Shared variable overhead experiment.

1 // Workload generation2 while ( true) {3 asm volatile (4 "lock incl (%0)"5 : "=r" ( pShared)6 : "r" ( pShared)7 :8 );9 }

Effect Summary The overhead is visible in FFT as a real workload representative. The overhead depends onthe size of the buffer submitted to FFT. In some cases, the interfering workload can flush modified data,yielding apparently negative overhead of the measured workload.


Code most sensitive to parallel sharing of memory content caches includes:

• Accesses to memory where accessed data would just fit into thememory cache in the isolated scenario andwhere the access pattern does not trigger prefetching.

• Accesses to shared memory where the access pattern triggersflushing of modified data.

• Assume components that transfer data at rates close to the memory cache bandwidth. A parallel compositionof such components will reduce the memory cache bandwidth available to each component, increasing thememory access latencies.

• Assume components that benefit from prefetching at rates close to the memory cache bandwidth. A par-allel composition of such components will reduce the memorycache bandwidth available for prefetching,unmasking the memory access latencies.


5.3.8.1 Experiment: Shared variable overhead

The experiment to determine the overhead associated with sharing a variable performs an atomic increment opera-tion on a variable shared by multiple processors, with the standard cache coherency and memory ordering rules ineffect. The workload is common in the implementation of synchronization primitives and synchronized structures.

To determine the overhead associated with sharing the variable, the same workload is also executed on avariable local to each processor.

Purpose Determine the overhead associated with sharing a variable.

Measured Time to perform a single atomic increment operation from Listing 5.9.

Parameters Different pairs of processors used, local and shared variables used.

Expected Results Operations on the shared variable should exhibit an overhead compared to the operationson the local variable. The overhead can differ between configurations with shared L2 caches and separateL2 caches, but it should be present in both configurations since L1 caches are always separate and cachecoherency needs to be enforced.




0 1

2040

6080

100

120

Memory sharing [boolean]

Dur

atio

n of

sin

gle

acce

ss [c

ycle

s −

100

Avg

]

Figure 5.93: Shared variable overhead with shared L2 cache onIntel Server.

0 1

2040

6080

100

120


Dur

atio

n of

sin

gle

acce

ss [c

ycle

s −

100

Avg

]

Figure 5.94: Shared variable overhead with separate L2 cache and shared package onIntel Server.

Measured Results For a configuration where the two processors share an L2 cache, the results on Figure5.93show that a shared access takes an average of 95 cycles, compared to the 23 cycles of the local access.For a configuration where the two processors do not share an L2cache but share a package, the results onFigure5.94show that a shared access takes an average of 113 cycles, compared to the 23 cycles of the localaccess. Finally, for a configuration where the two processors share neither the L2 cache nor the package, theresults on Figure5.95show that a shared access takes an average of 55 cycles, compared to the 23 cycles ofthe local access.

Also notable is the fact that the more tightly coupled the twoprocessors are, the more likely it is that theaccesses are strictly interleaved and thus always exactingthe maximum variable sharing penalty.

Effect Summary The overhead can be visible in workloads with frequent blindaccess to a shared variable.




0 1

2040

6080

100


Dur

atio

n of

sin

gle

acce

ss [c

ycle

s −

100

Avg

]

Figure 5.95: Shared variable overhead with separate L2 cache and separate package onIntel Server.

5.3.8.2 Experiment: Cache bandwidth limit

The experiment to determine the bandwidth limit associatedwith shared caches performs the random multipointerwalk from Listing 5.2 and 5.5 as the measured workload and the random multipointer walk with delays fromListing 5.3and5.5as the interfering workload. The measured workload is configured to access the shared cache atmaximum speed over a range of addresses that is likely to hit in the cache. The interfering workload is configuredto access the shared cache at varying speeds in two experiment configurations, one over a range of addresses thatis likely to hit in the cache and one over a range of addresses that is likely to miss in the cache.

Purpose Determine the bandwidth limit associated with shared caches.

Measured Time to perform a single memory access in random multipointer walk from Listing5.2and5.5.

Parameters Allocated: 128 KB; accessed: 128 KB; stride: 64 B; pointers:8.

Interference Random multipointer walk with delays from Listing5.3and5.5.

Allocated: 128 KB, 4MB; accessed: 128 KB, 4MB; pointers: 8; delay: 0-512 K operations.

Expected Results If there is a competition for the shared cache access bandwidth between the measured work-load and the interfering workload, the time to perform a single memory access will change with the interfer-ing workload delay. Depending on the cache architecture, the interfering workload that causes mostly cachehits might behave differently from the interfering workload that causes mostly cache misses.

Measured Results Considering PlatformIntel Server. When both workloads stay in the cache, the competitionfor the shared cache access bandwidth is visible as an increase in the average access time from 6.2 to 7cycles on Figure5.96. When the measured workload stays in the cache but the interfering workload missesin the cache, the competition is visible as an increase in theaverage access time from 6.2 to 7.4 cycles onFigure5.97. Figures5.98and5.99serve to estimate how close the workload is to the shared cache accessbandwidth. The figures show the values of the cache idle counter, suggesting that the workload utilizes thecache very close to the bandwidth limit.

Effect Summary The limit can be visible in workloads with high cache bandwidth requirements and workloadswhere cache access latency is not masked by concurrent processing.




0 1 2 4 8 16 64 256 1024 4096 16384 65536 524288

6.0

6.2

6.4

6.6

6.8

7.0

7.2

7.4

Idle length [number of NOP instructions]

Dur

atio

n of

sin

gle

acce

ss [c

ycle

s −

100

Avg

]

Figure 5.96: Shared cache bandwidth limit where interfering workload hits in the shared cache onIntel Server.

0 1 2 4 8 16 64 256 1024 4096 16384 65536 524288

6.0

6.5

7.0

7.5

8.0


Dur

atio

n of

sin

gle

acce

ss [c

ycle

s −

100

Avg

]

Figure 5.97: Shared cache bandwidth limit where interfering workload misses in the shared cache onIntel Server.




0 1 2 4 8 16 64 256 1024 4096 16384 65536 524288

01

23

45

67


Cou

nt o

f L2_

NO

_RE

Q.B

OT

H_C

OR

ES

[eve

nts

− 1

000

Avg

]

Figure 5.98: Shared cache idle counter per access where interfering workload hits in the cache onIntel Server.

0 1 2 4 8 16 64 256 1024 4096 16384 65536 524288

4.0

4.5

5.0

5.5

6.0


Cou

nt o

f L2_

NO

_RE

Q.B

OT

H_C

OR

ES

[eve

nts

− 1

000

Avg

]

Figure 5.99: Shared cache idle counter per access where interfering workload misses in the cache onIntel Server.




5.3.8.3 Experiment: Cache bandwidth sharing

The experiment that determines the impact of sharing cache bandwidth by multiple parallel requests executes therandom multipointer walk from Listings5.2 and5.5 as the measured workload. The workload is configured toaccess the shared cache over a range of addresses that is large enough to miss in all private caches, yet smallenough to likely hit in the shared cache.

There are two variants of the interfering workload, one thathits and one that misses in the shared cache:

• The variant that hits uses the random multipointer walk overa range of addresses that is likely to hit in theshared cache. With this variant, both workloads compete forthe number of requests the cache can handlesimultaneously when only hits occur.

• The variant that misses uses the set collision multipointerwalk from Listings5.2 and5.6, configured sothat each pointer accesses cache lines from a different randomly selected associativity set, over a range ofaddresses that is large enough to miss in the selected sets. With this variant, both workloads compete for thenumber of requests the cache can handle simultaneously whenboth hits and misses occur.

In both variants, the interfering workload evicts only a small portion of the shared cache, making it possible toassess the impact of sharing cache bandwidth without competing for the shared cache capacity.

Both workloads vary the number of pointers, which determines the number of simultaneous requests to theshared cache.

Purpose Determine the impact of sharing cache bandwidth.

Measured Time to perform a single memory access in random multipointer walk from Listing5.2and5.5.

Parameters Allocated and accessed: 256 KB, 8 MB (Intel Server), 1 MB, 8 MB (AMD Server); stride: 64 B;pointers: 1-64 (exponential step).

Interference Random multipointer walk from Listing5.2and5.5to hit in the shared cache.

Allocated and accessed: 64 KB (Intel Server), 768 KB (AMD Server); stride: 64 B; pointers: 1-64 (expo-nential step).

Set collision multipointer walk from Listing5.2and5.6to miss in the shared cache.

Pages allocated and accessed: 64; access stride: 64 pages (4MB L2 cache size divided by 16 ways); pagecolors: 64; pointers: 1-64 (exponential step) onIntel Server.

Pages allocated and accessed: 64; access stride: 16 pages (2MB L3 cache size divided by 32 ways); pagecolors: 16; pointers: 1-64 (exponential step) onAMD Server.

Expected Results If there is a competition for the shared cache bandwidth between the measured workload andthe interfering workload, the time to perform a single memory access should increase with the number ofpointers used by the interfering workload. Depending on thecache architecture, the measured workload thatcauses mostly cache hits might be affected differently thanthe workload that causes mostly cache misses.Similarly, the interfering workload that causes mostly shared cache hits might have different impact than theworkload that causes mostly shared cache misses.

Measured Results Considering PlatformIntel Server. All workload variants show slowdown of the measuredworkload due to sharing, depending on the number of pointersused by the interfering workload (in figures,results for different number of pointers used by the measured workload are plotted as different lines).

The results for the variant where both workloads hit in the shared L2 cache are illustrated on Figure5.100.In general, increasing the number of pointers in the interfering workload increases the performance impact.The event counter for outstanding L1 data cache misses at anycycle (L1D PEND MISS) also increases withthe number of interfering workload pointers, confirming that the slowdown is caused by a busy shared L2cache. For the measured workload with one pointer, the slowdown is less than 2 %. With four or more




Measured workload pointers

148163264

0 1 2 4 8 16 32 64

05

1015

Interfering workload pointers [pointers]

Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

Figure 5.100: Slowdown of random multipointer walk in the parallel cache sharing scenario where both measured andinterfering workload hit in the shared cache onIntel Server.

pointers used by both workloads, the slowdown is above 10%. The maximum observed slowdown is 16 %.Results with two pointers were too unstable to be presented.

When the measured workload misses and the interfering workload hits in the shared cache, the results withone pointer yield more significant slowdown than the previous variant – up to 13 %, as evidenced in Fig-ure5.101. The impact, however, decreases as the number of pointers inthe measured workload increasesThe slowdown caused by the shared cache bandwidth sharing isthus less significant compared to the penaltycaused by the L2 cache misses. Note that the observed L2 miss rates increase slightly with the numberof pointers in the interfering workload – the workloads alsocompete for cache capacity and the observedslowdown can be partially attributed to these extra L2 cachemisses.

Finally, the variant where the measured workload hits and the interfering workload misses in the sharedcache is illustrated on Figure5.102. Here, we observe the most significant slowdown – up to 107% when themeasured workload uses four pointers. The pending misses block concurrent hits in the shared L2 cache.

Considering PlatformAMD Server. The results of the variant where both workloads hit in the shared L3cache are illustrated on Figure5.103for one pointer in the measured workload, which yields 5 % slowdown.Results with more pointers were too unstable to be presented.

When the measured workload misses and the interfering workload hits in the shared cache, the results showno measurable slowdown on this platform.

Finally, the variant where the measured workload hits and the interfering workload misses in the sharedcache is illustrated on Figure5.104. The maximum observed slowdown is 49 % with four pointers in themeasured workload. We could not, however, verify whether the interfering workload causes L3 cache missesin the measured workload due to problems with the L3CACHE MISSES event counter, which reported thesame results regardless of the processor core mask setting.It is therefore possible that some of the overheadshould be attributed to the L3 cache misses caused by the interfering workload, not just to to the cache beingbusy.

Open Issues The problem with the L3CACHE MISSES even counter onAMD Serverprevented confirmingthat the observed overhead is not due to L3 cache misses.





148163264

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

Figure 5.101: Slowdown of random multipointer walk in the parallel cache sharing scenario where the measured work-load misses in the shared cache onIntel Server.


148163264

0 1 2 4 8 16 32 64

020

4060

8010

012

0


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060

8010

012

0


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060

8010

012

0


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060

8010

012

0


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060

8010

012

0


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060

8010

012

0


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

Figure 5.102: Slowdown of random multipointer walk in the parallel cache sharing scenario where the interfering work-load misses in the shared cache onIntel Server.




0 1 2 4 8 16 32 64

5253

5455

5657

5859


Acc

ess

dura

tion

[cyc

les

− 1

00 A

vg]

Figure 5.103: Slowdown of random pointer walk in the parallel cache sharing scenario where both measured and inter-fering workload hit in the shared cache onAMD Server.


1248163264

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

Figure 5.104: Slowdown of random multipointer walk in the parallel cache sharing scenario where the interfering work-load misses in the shared cache onAMD Server.




Effect Summary The impact can be visible in workloads with many pending requests to the shared cache,where cache access latency is not masked by concurrent processing. The impact is significantly larger whenone of the workloads misses in the shared cache.

5.3.8.4 Experiment: Shared cache prefetching

The experiment with shared cache prefetching is similar to the experiment with cache bandwidth sharing (Experi-ment5.3.8.3) in that it configures the measured and interfering workloads to hit or miss in the shared cache withoutcompeting for its capacity. The difference is that the measured workload uses the linear multipointer walk fromListing 5.2and5.4, which benefits from prefetching. The interfering workloadmay disrupt prefetching by makingthe shared cache busy.

Purpose Determine the impact of shared cache prefetching.

Measured Time to perform a single memory access in linear multipointer walk from Listing5.2and5.4.

Parameters Allocated and accessed: 256 KB, 8 MB (Intel Server), 1 MB, 8 MB (AMD Server); stride: 64 B;pointers: 1-64 (exponential step).






Expected Results The memory accesses of the measured linear walk workload should trigger and benefit fromprefetches into the L1 cache. If the L1 prefetches are discarded due to demand requests of the interferingworkload, the linear walk should be affected more than the random walk in Experiment5.3.8.3, which isused as a reference. We should also be able to verify this effect by examining the L1 prefetch event counter.

When the accessed memory range of the measured workload exceeds the shared cache capacity, the linearpattern should also trigger and benefit from prefetches fromthe system memory into the shared cache. Theinterfering workload may cause some of these prefetches to be discarded, increasing the number of cachemisses due to demand requests and introducing the associated penalty in the measured workload, withoutcompeting for the cache capacity.

Measured Results Considering PlatformIntel Server. The results of the variant where both workloads hit inthe shared cache, illustrated on Figure5.105, show only a slightly larger slowdown than that of the randomwalk workload on Figure5.100, with the maximum observed slowdown being 16 %. The counter of L1 datacache prefetch events (L1DPREFETCH:REQUESTS) shows that prefetches occur only when the measuredworkload uses one pointer, at a rate of one prefetch event perdata access, regardless of the number of pointersused by the interfering workload.

The slowdown when the interfering workload misses in the shared cache, illustrated on Figure5.106, is alsoonly slightly larger than that of the random walk workload inFigure5.102. Using four pointers in the mea-sured workload again yields the maximum slowdown, up to 108 %. There is, however, a visible differencewith one pointer used in the measured workload and eight or more pointers used in the interfering workload.The results of the L1DPREFETCH:REQUESTS event counter (Figure5.107) reveal that almost half of theprefetches to the L1 cache are discarded when the interfering workload uses eight or more pointers, whichinterestingly does not occur when it hits in the shared cache.





148163264

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

05

1015


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

Figure 5.105: Slowdown of linear multipointer walk in the parallel cache sharing scenario where both measured andinterfering workload hit in the shared cache onIntel Server.

The slowdown when the memory range accessed by the measured workload exceeds the shared cache capac-ity is illustrated on Figure5.108. The counts of the L2 cache demand request misses (L2LINES IN:SELF)and the prefetch request misses (L2LINES IN:SELF:PREFETCH) show that when the measured workloaduses 16 or less pointers and there is no interference, the L2 cache misses occur mostly during prefetchingand the demand requests thus mostly hit. The prefetches are discarded as the number of pointers in the inter-fering workload increases, causing the accesses of the measured workload to miss. This slowdown is mostsignificant with 16 pointers in the measured workload, up to 63 %. Figures5.109and5.110illustrate thechanges of the prefetch and demand request misses in the L2 cache, respectively. Using 32 or more pointersin the measured workload seems to exceed the number of prefetch streams the shared cache is able to track,the measured workload thus misses on each access and the interfering workload does not add any significantslowdown.

Considering PlatformAMD Server. The results of the variant where both workloads hit in the shared cache,illustrated on Figure5.111, show very unexpected results, where the interfering workload actually seems tospeed up the measured workload with two or more pointers, in some cases very significantly. This couldnot be explained by any of the related event counters. In somecases, we have observed an increase of theprefetch requests to the L2 cache, as illustrated on Figure5.112for 16 pointers in the measured workload.This is accompanied by a decrease of the L1 cache misses. It isnot, however, clear why sharing the L3cache would affect prefetches from the private L2 cache to the private L1 cache. This increase of prefetchesand decrease of L1 misses also disappears when 32 or more pointers are used by the measured workload, asillustrated on Figure5.113for 32 pointers, when some speedup still remains.

The results of the variant where the measured workload exceeds the shared cache capacity also show anunexpected speedup similar to the previous variant, as illustrated on Figure5.114.

Finally, the impact of interfering workload missing in the shared cache is illustrated on Figure5.115. Theseresults also exhibit the unexpected speedup due to the interfering workload, as seen in the previous variants.The speedup, however, diminishes or even changes to slowdown as the number of pointers in the interferingworkload increases, similarly to the random multipointer walk in Experiment5.3.8.3on Figure5.104. Thissuggest that there are two different effects influencing theresults in opposite directions.

Open Issues The unexpected speedup caused by the interfering workload on PlatformAMD Serverremains anopen issue.





148163264

0 1 2 4 8 16 32 64

020

4060

8010

012

0


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060

8010

012

0


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060

8010

012

0


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060

8010

012

0


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060

8010

012

0


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060

8010

012

0


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

Figure 5.106: Slowdown of linear multipointer walk in the parallel cache sharing scenario where interfering workloadmisses in the shared cache onIntel Server.

0 1 2 4 8 16 32 64

0.4

0.5

0.6

0.7

0.8

0.9

1.0


L1D

_PR

EF

ET

CH

:RE

QU

ES

TS

per

acc

ess

[eve

nts

− 1

000

Avg

]

Figure 5.107: Decrease of L1 prefetch events per memory access in the parallel cache sharing scenario where interferingworkload misses in the shared cache onIntel Server.





148163264

0 1 2 4 8 16 32 64

020

4060


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

0 1 2 4 8 16 32 64

020

4060


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Trim

]

Figure 5.108: Slowdown of linear multipointer walk in the parallel cache sharing scenario where measured workloadexceeds the cache capacity onIntel Server.

0 1 2 4 8 16 32 64

0.1

0.2

0.3

0.4

0.5

0.6

0.7


L2_L

INE

S_I

N:S

ELF

:PR

EF

ET

CH

per

acc

ess

[eve

nts

− 1

000

Avg

]

Figure 5.109: Decrease of L2 prefetch request misses per memory access in the parallel cache sharing scenario wheremeasured workload with 16 pointers exceeds the cache capacity onIntel Server.




0 1 2 4 8 16 32 64

0.2

0.3

0.4

0.5

0.6

0.7

0.8


L2_L

INE

S_I

N:S

ELF

per

acc

ess

[eve

nts

− 1

000

Avg

]

Figure 5.110: Increase of L2 demand misses per memory access in the parallel cache sharing scenario where measuredworkload with 16 pointers exceeds the cache capacity onIntel Server.


1248163264

0 1 2 4 8 16 32 64

−50

−40

−30

−20

−10

010


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−50

−40

−30

−20

−10

010


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−50

−40

−30

−20

−10

010


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−50

−40

−30

−20

−10

010


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−50

−40

−30

−20

−10

010


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−50

−40

−30

−20

−10

010


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−50

−40

−30

−20

−10

010


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

Figure 5.111: Unexpected speedup of linear multipointer walk in the parallel cache sharing scenario where both mea-sured and interfering workload hit in the shared cache onAMD Server.




0 1 2 4 8 16 32 64

0.4

0.5

0.6

0.7


RE

QU

ES

TS

_TO

_L2.

HW

_PR

EF

ET

CH

_FR

OM

_DC

per

acce

ss [e

vent

s −

100

0 A

vg]

Figure 5.112: Increase of prefetch requests to L2 cache per access in linear multipointer walk with 16 pointers in theparallel cache sharing scenario where both measured and interfering workload hit in the shared cache onAMD Server.

0 1 2 4 8 16 32 64

0.00

0.01

0.02

0.03


RE

QU

ES

TS

_TO

_L2.

HW

_PR

EF

ET

CH

_FR

OM

_DC

per

acce

ss [e

vent

s −

100

0 A

vg]

Figure 5.113: Negligible change of requests to L2 cache per access in linear multipointer walk with 32 pointers in theparallel cache sharing scenario where both measured and interfering workload hit in the shared cache onAMD Server.





1248163264

0 1 2 4 8 16 32 64

−80

−60

−40

−20

020


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−80

−60

−40

−20

020


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−80

−60

−40

−20

020


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−80

−60

−40

−20

020


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−80

−60

−40

−20

020


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−80

−60

−40

−20

020


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−80

−60

−40

−20

020


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

Figure 5.114: Unexpected speedup of linear multipointer walk in the parallel cache sharing scenario where the measuredworkload exceeds the cache capacity onAMD Server.


1248163264

0 1 2 4 8 16 32 64

−20

−10

010

2030

40


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−20

−10

010

2030

40


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−20

−10

010

2030

40


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−20

−10

010

2030

40


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−20

−10

010

2030

40


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−20

−10

010

2030

40


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

0 1 2 4 8 16 32 64

−20

−10

010

2030

40


Slo

wdo

wn

of a

cces

ses

[% −

100

Avg

Med

ian]

Figure 5.115: Slowdown and speedup of linear multipointer walk in the parallel cache sharing scenario where the inter-fering workload misses in the shared cache onAMD Server.




Effect Summary The impact can be visible in workloads with working sets thatdo not fit in the shared cache,but employ hardware prefetching to prevent demand request misses. Prefetching can be disrupted by demandrequests of the interfering workload, even if those requests do not miss in the shared cache.

5.3.9 Real Workload Experiments: Fourier Transform

The following experiment uses the FFT workload described inSection5.3.6and measures the slowdown of theFFT workload when sharing caches in parallel composition. Both the in-place variant and the variant with separateinput and output buffers are used, with varying buffer sizes.

The interfering workload is the same as in the artificial experiments with cache bandwidth sharing and sharedcache prefetching (Experiments5.3.8.3and5.3.8.4), namely the random multipointer walk from Listing5.2and5.5that hits in the shared cache, and the set collision multipointer walk from Listing5.2 and5.6 that misses in theshared cache. The number of pointers used by the interferingworkload varies.

5.3.9.1 Experiment: FFT sharing data caches

Purpose Determine the impact of data cache sharing on performance ofthe FFT calculation.

Measured Duration of a FFT calculation with varying input buffer size.

Parameters FFT method: FFTW in-place, FFTW separate buffers; FFT buffer size: 128 KB-16 MB (exponen-tial step).






Expected Results In general, increasing the number of pointers in the interfering workload should yield a largerslowdown. The slowdown should be similar or lower than the one observed in the artificial experiments(Experiments5.3.8.3and5.3.8.4), depending on how intensively FFT accesses the shared L2 cache and howmuch it benefits from prefetching.

Measured Results Considering PlatformIntel Server. All workload variants show slowdown of the FFT cal-culation due to sharing, depending on the number of pointersused by the interfering workload.

The results where the interfering workload hits in the shared cache are illustrated on Figure5.116for thein-place variant and on Figure5.117for the separate-buffers variant. The observed slowdown isbelow theslowdown observed in Experiments5.3.8.3and5.3.8.4and does not significantly depend on the FFT buffersize, except for the 2 MB and larger buffer sizes. The maximumobserved slowdown with the smaller buffersizes is 10 % with one 128 KB buffer and 6 % with two 128 KB buffers.

The difference between the results with smaller buffers andthe results with the 2 MB and larger buffer sizesis that the latter causes L2 cache misses even with no interference, which means that the workload does notfit in the L2 cache. The number of the L2 cache demand request misses also increases with the number ofpointers in the interfering workload. This is because the interfering workload causes L2 prefetches to bediscarded, similar to Experiment5.3.8.4. This is illustrated by figures5.118and5.119for the variant withone 4 MB buffer, which yields the most significant slowdown – 27 % for the in-place variant and 43 % forthe separate-buffers variant.




FFT buffer size

128 KB256 KB512 KB1 MB2 MB4 MB8 MB16 MB

0 1 2 4 8 16 32 64

05

1015

2025

30


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

05

1015

2025

30


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

05

1015

2025

30


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

05

1015

2025

30


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

05

1015

2025

30


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

05

1015

2025

30


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

05

1015

2025

30


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

05

1015

2025

30


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

Figure 5.116: In-place FFT slowdown in the parallel cache sharing scenario where the interfering workload hits in theshared cache onIntel Server.

FFT buffer size


0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

010

2030

4050


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

Figure 5.117: FFT with separate buffers slowdown in the parallel cache sharing scenario where the interfering workloadhits in the shared cache onIntel Server.




0 1 2 4 8 16 32 64

2e+

044e

+04

6e+

048e

+04

1e+

05


L2_L

INE

S_I

N:S

ELF

eve

nts

durin

g F

FT

cal

cula

tion

[eve

nts]

Figure 5.118: Increase of L2 demand misses in the parallel cache sharing scenario during FFT with one 4 MB buffer onIntel Server.

0 1 2 4 8 16 32 64

6000

080

000

1000

0012

0000


L2_L

INE

S_I

N:S

ELF

:PR

EF

ET

CH

eve

nts

durin

g F

FT

cal

cula

tion

[eve

nts]

Figure 5.119: Decrease of L2 prefetch misses in the parallel cache sharingscenario during FFT with one 4 MB buffer onIntel Server.




FFT buffer size


0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

Figure 5.120: In-place FFT slowdown in the parallel cache sharing scenario where the interfering workload misses inthe shared cache onIntel Server.

FFT buffer size


0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

0 1 2 4 8 16 32 64

050

100

150


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Trim

]

Figure 5.121: FFT with separate buffers slowdown in the parallel cache sharing scenario where the interfering workloadmisses in the shared cache onIntel Server.

The results where the interfering workload misses in the shared are illustrated on Figures5.120and5.121.The observed slowdown for 1 MB and smaller FFT buffer sizes isnot considerably lower than for the arti-ficial workload with 8 or more pointers in Experiments5.3.8.3and5.3.8.4, with negligible dependency onthe FFT buffer size. The maximum observed slowdown is 63 % with one 512 KB FFT buffer and 70 % withtwo 256 KB buffers.

With 2 MB or larger buffer sizes, the FFT calculation again does not fit in the L2 cache even without interfer-ence, it therefore competes for the memory bus as well as for the shared cache and the observed slowdown islarger. Again, the number of L2 prefetch requests and missesdecreases and thus the number of L2 demandmisses increases with the number of pointers in the interfering workload. The maximum observed slowdown




0 1 2 4 8 16 32 64

3.8e

+07

4.0e

+07

4.2e

+07

4.4e

+07


FF

T d

urat

ion

[cyc

les]

Figure 5.122: In-place FFT slowdown in the parallel cache sharing scenario during FFT with one 4 MB buffer onAMD Server.

FFT buffer size

512 KB1 MB2 MB8 MB

0 1 2 4 8 16 32 64

−4

−2

02

46


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Med

ian]

0 1 2 4 8 16 32 64

−4

−2

02

46


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Med

ian]

0 1 2 4 8 16 32 64

−4

−2

02

46


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Med

ian]

0 1 2 4 8 16 32 64

−4

−2

02

46


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Med

ian]

Figure 5.123: In-place FFT slowdown in the parallel cache sharing scenario where the interfering workload misses inthe shared cache onAMD Server.

is 133 % with one 4 MB FFT buffer and 148 % with two 4 MB buffers.

Considering PlatformAMD Server. The results where the interfering workload hits in the shared L3 cachewere generally unstable, the highest slowdown among the stable results was 6 % observed with one 4 MBbuffer, illustrated on Figure5.122.

The results where the interfering workload misses in the shared cache are illustrated on Figures5.123and5.124, with the maximum observed slowdown being 6 %. However, we have also observed up to 13 %speedup, similar to the speedup observed in Experiment5.3.8.4.

Open Issues The reasons of the speedup on PlatformAMD Server, observed also in Experiment5.3.8.4, remain




FFT buffer size

512 KB1 MB2 MB4 MB8 MB

0 1 2 4 8 16 32 64

−15

−10

−5

05


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Med

ian]

0 1 2 4 8 16 32 64

−15

−10

−5

05


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Med

ian]

0 1 2 4 8 16 32 64

−15

−10

−5

05


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Med

ian]

0 1 2 4 8 16 32 64

−15

−10

−5

05


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Med

ian]

0 1 2 4 8 16 32 64

−15

−10

−5

05


Slo

wdo

wn

of F

FT

cal

cula

tion

[% −

Med

ian]

Figure 5.124: FFT with separate buffers slowdown in the parallel cache sharing scenario where the interfering workloadmisses in the shared cache onAMD Server.

an open question.

Effect Summary The overhead is visible in FFT as a real workload representative. The overhead is smallerwhen FFT fits in the shared cache and the interfering workloadhits, and larger when FFT does not fit in theshared cache or the interfering workload misses.

5.3.10 Real Workload Experiments: SPEC CPU2006

SPEC CPU2006 [30] is an industry-standard benchmark suite that comprises ofa spectrum of processor intensiveand memory intensive workloads based on real applications.The following experiment measures the slowdown ofthe SPEC CPU2006 workloads when sharing caches in parallel composition.

As the measured workload, the SPEC CPU2006 suite version 1.1is executed via the includedrunspectool,which measures the median execution time of each workload from 10 runs. Due to time constraints, we only runthe benchmarks with thebasetuning and thetest input data, an eventual confirmation of the results with therefinput data should be performed. The execution of the measured workload has been pinned to a single core usingthetasksetutility.

The interfering workload runs on a single core that shares the cache with the measured workload, and is thesame as the interfering workload in Experiments5.3.8.3and5.3.8.4. Namely, the experiments use the randommultipointer walk from Listings5.2 and5.5 to cause hits in the shared cache, and the set collision multipointerwalk from Listings5.2and5.6to cause misses in the shared cache. The number of pointers used by the interferingworkload is set to the value that caused the largest slowdownin the artificial experiments.

5.3.10.1 Experiment: SPEC CPU2006 sharing data caches

Purpose Determine the impact of data cache sharing on performance ofthe SPEC CPU2006 workloads.

Measured Median execution time of the SPEC CPU2006 workloads.

Parameters Runs: 10; tuning: base; input size: test.





CINT2006 benchmark Isolated [s] Hit [s] Slowdown Miss [s] Slowdown

400.perlbench 3.89 3.95 1.5 % 4.16 6.9 %401.bzip2 9.67 10.5 8.6 % 13.1 35 %403.gcc 1.85 2.07 12 % 2.32 25 %429.mcf 5.72 6.19 8.2 % 10.1 77 %445.gobmk 26.9 28.0 4.1 % 36.4 35 %456.hmmer 6.27 6.28 0.2 % 6.38 1.8 %458.sjeng 5.96 6.07 1.8 % 6.54 9.7 %462.libquantum 0.0753 0.0768 2.0 % 0.0788 4.6 %464.h264ref 22.0 23.2 5.5 % 23.8 8.2 %471.omnetpp 0.599 0.65 7.7 % 0.664 11 %473.astar 12.6 12.7 0.8 % 14.9 18 %483.xalancbmk 0.111 0.120 8.1 % 0.152 37 %

CINT2006 geomean 5.0 % 21 %Table 5.1: Slowdown of the SPEC CPU2006 integer benchmarks (CINT2006)in the parallel cache sharing scenario

where the interfering workload hits or misses in the shared cache onIntel Server.

Allocated and accessed: 64 KB (Intel Server), 768 KB (AMD Server); stride: 64 B; pointers: 64 (Intel Server),32 (AMD Server)..


Pages allocated and accessed: 64; access stride: 64 pages (4MB L2 cache size divided by 16 ways); pagecolors: 64; pointers: 64 onIntel Server.

Pages allocated and accessed: 64; access stride: 16 pages (2MB L3 cache size divided by 32 ways); pagecolors: 16; pointers: 32 onAMD Server.

Expected Results The slowdown should be similar to but lower than the slowdownobserved in Experiments5.3.8.3and5.3.8.4, depending on how intensively the workloads access the shared cache and how much they benefitfrom prefetching.

Measured Results Considering PlatformIntel Server. The observed median durations of the workloads whenrun in isolation and when run with the two variants of parallel cache sharing are presented in Table5.1 forthe integer benchmarks and Table5.2 for the floating point benchmarks, along with the relative slowdown.The sensitivity to cache sharing varies greatly from workload to workload, the interfering workload thatmisses in the shared cache has a significantly higher impact than the interfering workload that hits. Theslowdown due to the interfering workload hitting in the shared cache ranges from 0.2 % (456.hmmer) to26 % (437.leslie3d). The interfering workload missing in the shared cache causes slowdown ranging from1.8 % (456.hmmer) to 90 % (459.GemsFDTD).

Considering PlatformAMD Server. The results, summarized in Tables5.3 and5.4, show that the inter-ference can result in a slight slowdown (up to 9.4 % for 470.lbm and missing interference), but also in asignificant speedup (up to 21 % for 464.h264ref). The speeduphas already been observed for the linearmultipointer workload in Experiment5.3.8.4. The difference between the two experiments is that here, thespeedup is only manifested in cases when the interfering workload misses in the shared cache.

Open Issues The reasons of the speedup on PlatformAMD Server, observed also in the artificial experimentswith linear multipointer walk, remain an open question.

Effect Summary The overhead is visible in both the integer and floating pointworkloads. The overhead variesgreatly from benchmark to benchmark, an interfering workload that misses in the shared cache has a largerimpact than an interfering workload that hits.




CFP2006 benchmark Isolated [s] Hit [s] Slowdown Miss [s] Slowdown

410.bwaves 35.2 40.1 14 % 46.0 31 %416.gamess 0.546 0.582 6.6 % 0.630 15 %433.milc 15.8 17.7 12 % 24.5 55 %434.zeusmp 26.3 28.4 8.0 % 38.4 46 %435.gromacs 2.11 2.22 5.2 % 2.38 13 %436.cactusADM 5.29 5.50 4.0 % 6.91 31 %437.leslie3d 27.0 33.9 26 % 45.4 68 %444.namd 18.4 18.5 0.5 % 20.2 10 %447.dealII 25.2 27.3 8.3 % 29.5 17 %450.soplex 0.0252 0.0265 5.2 % 0.0307 22 %453.povray 0.910 0.940 3.3 % 1.10 21 %454.calculix 0.0613 0.0657 7.2 % 0.0675 10 %459.GemsFDTD 4.22 4.73 12 % 8.03 90 %465.tonto 1.39 1.44 3.6 % 1.48 6.5 %470.lbm 16.0 17.3 8.1 % 28.8 80 %481.wrf 7.73 8.36 8.2 % 9.48 23 %482.sphinx3 3.08 3.40 10 % 4.17 35 %

CFP2006 geomean 8.2 % 32 %Table 5.2: Slowdown of the SPEC CPU2006 floating point benchmarks (CFP2006) in the parallel cache sharing scenario

where the interfering workload hits or misses in the shared cache onIntel Server.

CINT2006 benchmark Isolated [s] Hit [s] Slowdown Miss [s] Slowdown

400.perlbench 4.08 4.11 0.7 % 3.97 -2.7 %401.bzip2 12.3 12.8 4.1 % 12.1 -1.6 %403.gcc 2.29 2.35 2.6 % 1.98 -14 %429.mcf 11.0 11.2 1.8 % 10.9 -0.9 %445.gobmk 29.9 30.7 2.7 % 27.0 -9.7 %456.hmmer 5.61 5.62 0.2 % 5.59 -0.4 %458.sjeng 7.59 7.65 0.8 % 6.70 -12 %462.libquantum 0.0684 0.0685 0.1 % 0.0677 -1.0 %464.h264ref 30.9 30.7 -0.6 % 24.5 -21 %471.omnetpp 0.627 0.638 1.8 % 0.602 -4.0 %473.astar 13.9 13.9 0.0 % 13.8 -0.7 %483.xalancbmk 0.141 0.145 2.8 % 0.126 -11 %

CINT2006 geomean 1.4 % -6.7 %Table 5.3: Slowdown of the SPEC CPU2006 integer benchmarks (CINT2006)in the parallel cache sharing scenario

where the interfering workload hits or misses in the shared cache onAMD Server.




CFP2006 benchmark Isolated [s] Hit [s] Slowdown Miss [s] Slowdown

410.bwaves 50.3 50.5 0.4 % 50.5 0.4 %416.gamess 0.955 0.989 3.6 % 0.886 -7.2 %433.milc 26.0 26.3 1.2 % 28.1 8.1 %434.zeusmp 42.3 42.5 0.5 % 38.0 -10 %435.gromacs 2.08 2.14 2.9 % 1.87 -10 %436.cactusADM 8.05 8.04 -0.1 % 6.18 -23 %437.leslie3d 33.4 33.7 0.9 % 34.1 2.1 %444.namd 36.4 36.9 1.4 % 32.6 -10 %447.dealII 37.2 38.2 2.7 % 34.8 -6.5 %450.soplex 0.0297 0.0303 2.0 % 0.0282 -5.1 %453.povray 1.08 1.1 1.9 % 1.07 -0.9 %454.calculix 0.0767 0.0780 1.7 % 0.0728 -5.1 %459.GemsFDTD 4.85 4.88 0.6 % 5.05 4.1 %465.tonto 1.52 1.53 0.7 % 1.51 -0.7 %470.lbm 7.63 7.72 1.2 % 8.35 9.4 %481.wrf 7.99 8.02 0.4 % 7.81 -2.3 %482.sphinx3 3.97 3.99 0.5 % 4.05 2.0 %

CFP2006 geomean 1.3 % -3.6 %Table 5.4: Slowdown of the SPEC CPU2006 floating point benchmarks (CFP2006) in the parallel cache sharing scenario

where the interfering workload hits or misses in the shared cache onAMD Server.

5.4 Resource: Memory Buses

In the system memory architecture, the memory bus connects the processors with caches to the memory controllerswith memory modules. Typically, multiple agents are connected to the bus and an arbitration protocol is used todetermine ownership. An agent that owns the bus can initiatebus transactions, which are either atomic or split intorequests and replies. To avoid the memory bus becoming a bottleneck, architectures with multiple memory busescan be introduced.



The two processor packages are connected to a shared memory controller hub by separate front side busses runningat 333 MHz. Each front side contains 36 bit wide address bus and 64 bit wide data bus. The address bus cantransfer two addresses per cycle, but the address bus strobesignal is only sampled once every two cycles, yieldinga theoretical limit of 166 M addresses per second. The data bus can transfer four words per cycle, yielding atheoretical throughput of 1.33 G transfers per second or 10.7 GB per second. Split transactions are used [8, Section5.1].


Each of the two processor packages is equipped with an integrated dual-channel DDR2 memory controller, sharedby all four processor cores of the package. Each of the two channels is 64 bit wide, 72 bit wide with ECC. Thechannels can operate either independently, or as a single 128 bit wide channel, for a theoretical throughput of10.7 GB per second with DDR2-667 memory [13, page 230].

The two processors are also connected by a HyperTransport 3.0 link with the theoretical throughput of up to7.2 GB per second in each direction [14]. Each processor is connected to dedicated memory and the HyperTrans-port link between the processors is used when code running onone processor wants to access memory connectedto the other processor.





When a memory bus is shared, memory access is necessarily serialized. When multiple components share amemory bus, they compete for its capacity. The effects that can influence the quality attributes therefore resemblethe effects of sharing a server in a queueing system, except for the ability of the components to compensate thememory bus effects by prefetching and parallel execution.

Rather than investigating the effects of sharing a memory bus in more detail, we limit ourselves to determiningthe combined memory access bandwidth, which includes the memory bus bandwidth, memory controller band-width and memory modules bandwidth. This allows us to estimate whether further investigation within the scopeof the Q-ImPrESS project is warranted.


In the parallel composition scenario, the effects of sharing the memory bus can be exhibited as follows:

• Assume components that transfer data at rates close to the memory bus bandwidth. A parallel compositionof such components will reduce the memory bus bandwidth available to each component, increasing thememory access latencies.

• Assume components that benefit from prefetching at rates close to the memory bus bandwidth. A parallelcomposition of such components will reduce the memory bus bandwidth available for prefetching, unmask-ing the memory access latencies.


5.4.4.1 Experiment: Memory bus bandwidth limit

The experiment to determine the bandwidth limit associatedwith shared memory bus performs a pointer walkfrom Listing5.1. Multiple configurations of the experiment initialize the pointer chain by linear initialization code(Listing 5.4) and random initialization code (Listing5.5). One or two processors perform the workload to seewhether the bandwidth limit associated with shared memory bus has been exceeded.

Purpose Determine the bandwidth limit associated with shared memory bus.

Measured Time to perform a single memory access in pointer walk from Listing5.1.

Parameters Allocated: 64 MB; accessed: 64 MB; pattern: linear (Listing5.4), random (Listing5.5); stride:64 B; processors: 1, 2.

Expected Results Assuming that there is a limit on the shared memory bus accessbandwidth that the work-load exceeds, the average time to perform a single memory access will be halved from the one processorconfiguration to the two processors configuration.

Measured Results For PlatformIntel Server, Figure5.125shows an access time of 50 cycles per cache line forlinear workload running on one processor and 76 cycles per cache line on two processors. The correspondingfigures for random workload are 261 and 289 cycles per cache line, see Figure5.126. The values suggestthat while the linear workload approaches the memory bus capacity with rates of 3920 MB/s, the randomworkload does not nearly approach the same rates.

5.4.4.2 Experiment: Memory bus bandwidth limit

Since the experiment to determine the bandwidth limit associated with shared memory bus reveals that the randompointer walk from Listing5.1and5.5does not saturate the shared memory bus, another experimentis performedwhere the pointer walk code (Listing5.1 has been replaced with multipointer walk code (Listing5.2) and moreprocessors have been added.

Purpose Determine the bandwidth limit associated with shared memory bus.




1 2

4050

6070

8090

Number of processors generating workload

Dur

atio

n of

sin

gle

acce

ss [c

ycle

s −

100

Avg

]

Figure 5.125: Memory bus bandwidth limit in linear workload onIntel Server.

1 2

250

260

270

280

290

300

310

Number of processors generating workload

Dur

atio

n of

sin

gle

acce

ss [c

ycle

s −

100

Avg

]

Figure 5.126: Memory bus bandwidth limit in random workload onIntel Server.


Parameters Allocated: 64 MB; accessed: 64 MB; stride: 64 B; pointers: 1-16; processors: 4, 8.

Expected Results Assuming that there is a limit on the shared memory bus accessbandwidth that the work-load exceeds, the average time to perform a single memory access will be halved from the four processorconfiguration to the eight processors configuration.

Measured Results For PlatformIntel Serverand 16 pointers, Figure5.127shows an access time of 111 cyclesper cache line for the workload running on four processors and 203 cycles per cache line on eight processors.The values suggest that the workload approaches the memory bus capacity with rates of 5880 MB/s. The




4 processors8 processors

1 2 4 8 16

100

200

300

400

500

Number of pointers

Dur

atio

n of

sin

gle

acce

ss [c

ycle

s −

100

Avg

Trim

]

1 2 4 8 16

100

200

300

400

500

Number of pointers

Dur

atio

n of

sin

gle

acce

ss [c

ycle

s −

100

Avg

Trim

]

Figure 5.127: Memory bus bandwidth limit in random workload onIntel Server.

access times for 8 and 16 pointers differ by less than one percent, indicating that an individual processordoes not issue more than 8 outstanding accesses to independent addresses.

Effect Summary The limit can be visible in workloads with high memory bandwidth requirements and work-loads where memory access latency is not masked by concurrent processing.




Chapter 6

Operating System

When considering the shared resources associated with an operating system, we assume a common operatingsystem with threads and processes and a device driver layer supporting the file system and the network stack.Examples of such operating systems include Linux and Windows.

6.1 Resource: File Systems

The file system is a shared resource that creates the abstraction of directories and files over a disk resource con-sisting of equal sized and directly addressable blocks. Theessential functions provided by the file system to thecomponents are reading and writing of files.

Whenever a component reads a file, the file system locates the blocks containing the data, after which the datais read and returned to the component. Whenever a component writes a file, the file system locates a block availablefor writing the data, after which the block is assigned to thefile and the data is written.

By virtue of its position above the disk resource, the file system can be viewed as a resource that transformsrequests to read and write files into requests to read and write blocks. To separate the complexity of modeling thefile system from the complexity of modeling the disk, it is therefore helpful to quantify the behavior of the filesystem in terms of disk operation counts rather than in termsof file operation times. This separation also makesit possible to use models of various disk configurations (single disks, disk arrays, solid state disks) with the samemodel of the file system.

Models of disks that allow calculating the disk operation times are readily available. Important operationsrecognized by the models of disks include seeking to a particular block, reading of a block and writing of a block.Both reading and writing can be subject to queueing and reordering that minimizes seeking. While reading isnecessarily synchronous, writing can be asynchronous withbuffering.

The operations recognized by the models of disks are the operations that the models of the file system must useto quantify the behavior of the file system.

Reading of a file is synchronous, reading of multiple files therefore requires seeking between the blocks wherethe files are stored. Seeking can also be required when reading a single file, either to read data spanning multipleblocks or to read metadata associated with the file. Data blocks belonging to a single file are usually allocated in away that minimizes seeking during sequential reading.

Writing of a file can be asynchronous, writing of multiple files therefore does not necessarily require seek-ing. Writing of multiple files can interfere with the allocation of data blocks belonging to a single file, causingfragmentation that increases seeking during sequential reading.


This section describes platform dependent details of the operating systems used in the experiments for platformsintroduced in Section3.2.

6.1.1.1 Platform RAID Server

The operating system uses the standard file system configuration.





The effects that can influence the quality attributes when multiple components share a file system include:

Directory structure Components share the directory structure. The dimensions of the directory structure caninfluence the efficiency of locating a file.

Read fragmentation Reading multiple files can introduce additional seeking operations into otherwise mostlysequential workload. Seeking is inherently slow and therefore even a small number of additional seekingoperations degrades performance.

Write fragmentation Writing multiple files can introduce fragmentation into otherwise compact allocation.Fragmentation introduces seeking both in writing and in reading, a fragmented file is therefore not onlyslower to write, but also slower to read.

6.1.3 General Composition

The effects of sharing the file system can be exhibited as follows:

• A composition of components that read files will introduce additional seeking operations, which bring sig-nificant overhead especially if each component alone would read its files sequentially.

• A composition of components that write files will introduce fragmentation, which will in turn introduceadditional seeking operations both in writing and in reading.


In general, the artificial experiments for file system sharing first create and write a number of files using the POSIXfunctionsopenandwrite. The number of files and their size is varied. The files are created and written eitherone after another (we will refer to this method asindividualwriting), or concurrently. In the second case, all filesare created upfront and writes to the individual files are interleaved using a fixed order — asegmentof data iswritten to the first file, then to the second file, and after the last file another segment is appended to the first fileand so on. The segment size for this concurrent writing is given as a parameter. We will call this method asconcurrentwriting. Note that concurrent writing where the segment size equals the file size would be identical tothe individual writing.

When all files have been written, thesyncfunction is called to flush all delayed writes to the disk, andthe filebuffers are dropped from the system memory by writing to thedrop cachespseudo-file in theprockernel interface.This ensures that subsequent reads are not cached. The wholefile are then read using theread function. Similarlyto writing, reading is also done eitherindividuallyor concurrently, with an analogous read segment size parameter.Durations of read operations and block IO traces are collected.

Experiment resultsIn all subsequent experiments, the experiment results showratio between measurements from two benchmarks,

one of which is considered to represent thebaseline. Each benchmark consists of a set of measurements corre-sponding to workload configuration from benchmark parameter space. Each measurement in turn contains valuesof multiple attributes. The pairing between workload configurations is determined by a particular experiment.

Benchmark parameter spaceEach workload in a benchmark is determined by a set of parameters that describe the activity the workload

generator should perform. The set of allowed variations in those parameters determines the benchmark parameterspace. One run of a benchmark iterates over all possible states in benchmark parameter space. Considering thedescription of the artificial experiments, each workload configuration has the following parameters:

• file count, fcThe number of files written (and read).




• file size, fsThe size of files written (and read).

• write segment size, wssThe number of bytes written to a single file before switching to another file in concurrent workloads.

• concurrent write, cwDetermines whether the files are written concurrently (cw=1) or individually (cw=0).

• read segment size, rssThe number of bytes read from a single file before switching toanother file in concurrent workloads.

• concurrent read, rssDetermines whether the files are read concurrently (cr=1) orindividually (cr=0).

• random read, rrDetermines whether the files are read sequentially (rr=0) orrandomly (rr=1).

The results of each experiment are summarized in a table, which captures part of the experiment parameterspace. For each experiment, there is a set of configuration parameters that are common to both compared bench-marks. Then for each column of the table, there is a parametertuple which is unique for each table column andwhich captures the variability of the workloads. Finally, there is typically a single parameter that is common forthe baseline workloads but different for the other workloads.

Measured attributesFor each workload configuration, a set of attributes is directly measured or indirectly derived from collected

data. In many benchmarks, the measured quantity corresponds to time, either in seconds or processor clocks, eventcounts, etc.

In case of file system workloads, time provides only limited amount of information, such as duration of theentire workload or parts of it. This is useful for quick comparison of statistical summaries or histograms, but itdoes not help with prediction of performance on a different device. Since the file system is a resource that basicallytranslates file system workload to disk device workload, we need a way to capture the properties of the workloadimposed by file system on a device.

There are many ways to characterize disk workload and we haveopted to use attributes that have been success-fully used [42] to characterize workloads for performance prediction using an analytic model for disk drives withread-ahead and request reordering.

Shriver et al. define [42] two main attribute classes and an auxiliary class for attributes from neither class. Thefirst class containstemporal locality measures, the second class containsspatial locality measures, and the thirdclass containsother measures.

Temporal locality measures Besides the usual temporal description of incoming requests in terms of arrivalprocess and request rate isburstinessof the incoming requests, which is commonly present in many workloads.A burst is a group of consecutive requests with short interarrival times, typically such that there are still requestspending to be processed by the device while a request arrives. In many cases,mean device service timemay serveas the minimum interarrival time for which requests are considered to belong in the same burst. The specificationof burstiness then specifies the fraction of all requests arriving in bursts as well as mean number of requests in aburst.

• requestrate, requests/secondRate at which requests arrive at the storage device.

• requestper burst, requestsSize of a burst.

• burst f raction, 0-1Fraction of all requests that occur in a burst.




Spatial locality measures Another class of attributes captures spatial locality of the workload. The principal con-cept here is arun, which captures the notion of sequentiality of the workload, which allows eliminating positioningtime and process consecutive requests faster, possibly using read-ahead cache to service the requests. Similar toburst, a run is a group of requests, but in this case the requests have to be spatially contiguous. Such a run is de-scribed bystride, which is the mean distance between the starting sectors of consecutive runs, andlocality fraction,which determines the fraction of requests that are part of a run.

Since purely contiguous runs are relatively rare, it makes sense to allow small (incremental) gaps between theindividual requests. Such a run is then calledsparseand has similar attributes. Besides number of requests in asparse run and fraction of requests that are part of sparse runs, we also characterize the mean length of a run interms of sectors serviced within the span of the run.

• data span, sectorsSpan (range) of data accessed during the workload.

• requestsize, sectorsLength of a host read or host write request.

• run length, sectorsLength of a run, a contiguous set of requests.

• run stride, sectorsDistance between the start points of two consecutive runs.

• locality f raction, 0-1Fraction of requests that occur in a run.

• requestsper sparserun, requestsNumber of requests in a sparse run.

• sparserun length, sectorsNumber of sectors serviced within the span of a sparse run.

• sparserun f raction, 0-1Fraction of requests that are in sparse runs.

Other measures Besides temporal and spatial measures, there are other attributes that we may want to use tocharacterize a workload. Typically, such an attribute would be fraction of read or write requests, but since ourbenchmarks are targeted at reading performance, we have omitted such an attribute. However, for simple com-parison of results on the same platform, we use a time span attribute which allows us to determine how long aworkload took to finish.

• time span, secondsTime the measured system took to finish the workload.

6.1.4.1 Sequential access

This group of experiments assesses impact of sharing when each individual file is accessed sequentially and mul-tiple files are written and/or read either individually or concurrently. The quantitative parameters are the same forall these experiments. The number of files that is written andread is two, three or four, the size of each file being64 MB, 128 MB or 256 MB. Thewrite andreadfunctions are called on a single 256 KB memory buffer, initializedwith random data for writing. For concurrent writes and or/reads, the respective segment size is set to 256 KB,1 MB or 16 MB.




Baseline only:cr=0 fs

=64

MB

rss=

256K

Bcr

=1

fs=

64M

Brs

s=1M

Bcr

=1

fs=

64M

Brs

s=16

MB

cr=

1

fs=

128M

Brs

s=25

6KB

cr=

1

fs=

128M

Brs

s=1M

Bcr

=1

fs=

128M

Brs

s=16

MB

cr=

1

fs=

256M

Brs

s=25

6KB

cr=

1

fs=

256M

Brs

s=1M

Bcr

=1

fs=

256M

Brs

s=16

MB

cr=

1

requestrate 0.57±0.06 0.93±0.12 0.46±0.05 0.61±0.05 0.95±0.10 0.50±0.05 0.68±0.16 0.97±0.21 0.50±0.10bursty f raction 3.51±1.31 1.12±0.36 3.32±1.47 3.16±0.82 1.11±0.15 3.21±0.73 2.51±1.48 1.10±0.64 3.64±2.53requestsper burst 1.04±0.03 1.02±0.05 0.96±0.03 1.03±0.02 1.00±0.03 0.95±0.02 1.00±0.06 1.00±0.07 0.94±0.07data span 6.87±12.8 4.49±8.97 7.80±13.9 1.39±1.73 1.50±1.79 2.46±3.51 1.31±2.23 0.69±1.04 0.85±1.19requestsize 1.32±0.14 1.00±0.13 1.34±0.14 1.32±0.11 1.01±0.10 1.35±0.11 1.19±0.16 1.00±0.10 1.33±0.14requestsper run 0.60±0.06 0.97±0.13 0.46±0.05 0.60±0.05 0.97±0.10 0.45±0.04 0.68±0.08 0.98±0.11 NA±NArun length 0.80±0.01 0.98±0.02 0.35±0.01 0.81±0.01 0.98±0.01 0.35±0.01 0.81±0.03 0.99±0.04 NA±NArun stride 122±11.4 11.6±10.5 78.4±182 241±4.24 17±5.23 194±50.4 59.5±3.39 3.99±0.13 NA±NAlocality f raction 0.85±0.01 0.99±0.00 0.04±0.00 0.85±0.01 1.00±0.00 0.04±0.00 0.87±0.02 0.99±0.01 0.01±0.01requestsper sparserun 0.01±0.00 0.18±0.04 0.00±0.00 0.01±0.00 0.14±0.02 0.00±0.00 NA±NA NA±NA NA±NAsparserun length 0.01±0.00 0.18±0.03 0.00±0.00 0.01±0.00 0.14±0.01 0.00±0.00 NA±NA NA±NA NA±NAsparserun f raction 1.39±0.30 1.34±0.29 0.26±0.10 1.18±0.06 1.15±0.05 0.22±0.04 1.94±1.34 1.74±1.21 0.34±0.31time span 1.32±0.11 1.07±0.10 1.62±0.12 1.24±0.05 1.04±0.05 1.47±0.07 1.23±0.20 1.03±0.18 1.49±0.22

Table 6.1: Slowdown of concurrent vs. individual sequential reading of 2 individually written files. Common parameters:fc=2, wss=256KB, cw=0, rr=0


=64

MB

rss=

256K

Bcr

=1

fs=

64M

Brs

s=1M

Bcr

=1

fs=

64M

Brs

s=16

MB

cr=

1

fs=

128M

Brs

s=25

6KB

cr=

1

fs=

128M

Brs

s=1M

Bcr

=1

fs=

128M

Brs

s=16

MB

cr=

1

fs=

256M

Brs

s=25

6KB

cr=

1

fs=

256M

Brs

s=1M

Bcr

=1

fs=

256M

Brs

s=16

MB

cr=

1

requestrate 0.60±0.07 0.96±0.14 0.51±0.06 0.61±0.02 0.94±0.05 0.50±0.02 0.57±0.07 0.93±0.14 0.47±0.06bursty f raction 3.39±1.45 1.10±0.24 3.38±1.04 3.35±0.77 1.20±0.20 3.37±0.74 3.53±1.19 1.02±0.38 3.58±1.08requestsper burst 1.02±0.03 1.00±0.03 0.95±0.02 1.03±0.02 1.00±0.02 0.94±0.02 1.00±0.05 1.01±0.06 0.92±0.04data span 1.42±1.80 1.52±1.85 1.52±1.85 2.42±3.70 4.11±7.39 2.10±3.35 4.76±9.49 2.33±4.33 5.13±10.3requestsize 1.32±0.14 1.01±0.15 1.35±0.13 1.33±0.04 1.02±0.05 1.36±0.04 1.31±0.14 1.00±0.14 1.35±0.13requestsper run 0.61±0.06 0.97±0.15 0.46±0.05 0.60±0.02 0.96±0.04 0.45±0.01 0.61±0.07 0.98±0.15 NA±NArun length 0.81±0.01 0.98±0.01 0.36±0.01 0.80±0.01 0.98±0.01 0.35±0.00 0.82±0.02 0.99±0.02 NA±NArun stride 226±127 13.4±9.67 59.6±30.1 303±34.5 19.6±2.22 299±83.8 81.1±3.20 5.86±1.83 NA±NAlocality f raction 0.85±0.01 1.00±0.00 0.03±0.01 0.85±0.00 0.99±0.00 0.04±0.01 0.84±0.02 0.99±0.00 0.01±0.01requestsper sparserun 0.01±0.00 0.14±0.03 0.00±0.00 0.01±0.00 0.13±0.02 0.00±0.00 NA±NA NA±NA NA±NAsparserun length 0.01±0.00 0.14±0.03 0.00±0.00 0.01±0.00 0.13±0.02 0.00±0.00 NA±NA NA±NA NA±NAsparserun f raction 1.11±0.07 1.07±0.06 0.20±0.05 1.04±0.03 1.02±0.03 0.20±0.03 1.39±0.42 1.28±0.40 0.25±0.09time span 1.25±0.07 1.03±0.06 1.44±0.08 1.24±0.03 1.04±0.03 1.48±0.05 1.34±0.14 1.08±0.13 1.56±0.18

Table 6.2: Effect of concurrent vs. individual sequential reading of 3individually written files. Common parameters:fc=3, wss=256KB, cw=0, rr=0

6.1.4.2 Experiment: Concurrent reading of individually wr itten files

Purpose Determine the effect of concurrent reading compared to individual reading of files.

Measured Individual and concurrent sequential reading of individually written sequential files.

Parameters file count: 2-4; file size: 64MB, 128MB, 256MB; writing: individual; read segment size: 256KB,1MB, 16MB; reading: individual, separate

Expected Results Concurrent reading should be slower than individual reading, because the buffer cache in thesystem memory and the disk cache are shared, and the disk needs to seek between the files. Read-aheadbuffering could reduce these seeks, because individual files are read sequentially. Lower segment sizesshould result in more rapid seeking and thus yield more significant slowdown. Increasing the number andsize of files extends the area occupied on the disk, which can prolong the latency of seeks. Buffers for morefiles and larger sizes also occupy more memory and therefore can reduce the read-ahead buffering.

Measured Results The results of the experiment are shown in tables6.1, 6.2, and6.3 for 2, 3, and 4 files,respectively. Looking at thetime spanattribute we can see that the most significant slowdown occurs for16 MB read segment size, while only a negligible overhead occurs for 1 MB. While this is somewhat counter-intuitive, it corresponds e.g. to the values of thelocality f ractionattribute, which shows almost no localitycompared to baseline benchmark. This is reflected in other attributes as well. As for the 1 MB read segmentsize, we can observe increased locality, shorter data span,longer run length, reduced burstiness, and otherattributes compared to other read segment sizes.





=64

MB

rss=

256K

Bcr

=1

fs=

64M

Brs

s=1M

Bcr

=1

fs=

64M

Brs

s=16

MB

cr=

1

fs=

128M

Brs

s=25

6KB

cr=

1

fs=

128M

Brs

s=1M

Bcr

=1

fs=

128M

Brs

s=16

MB

cr=

1

fs=

256M

Brs

s=25

6KB

cr=

1

fs=

256M

Brs

s=1M

Bcr

=1

fs=

256M

Brs

s=16

MB

cr=

1

requestrate 0.60±0.03 0.96±0.07 0.50±0.03 0.61±0.03 0.94±0.04 0.51±0.03 0.61±0.07 0.98±0.14 0.52±0.05bursty f raction 3.08±0.80 0.97±0.20 3.04±0.81 3.34±0.81 1.17±0.18 3.52±0.76 3.13±1.30 1.09±0.30 3.25±0.82requestsper burst 1.01±0.03 0.98±0.04 0.94±0.03 1.03±0.02 1.00±0.02 0.95±0.02 1.01±0.04 1.02±0.06 0.94±0.03data span 1.79±2.27 1.79±2.27 1.79±2.27 2.92±3.37 2.64±3.19 2.73±3.36 0.93±1.33 0.78±1.12 0.78±1.12requestsize 1.33±0.06 1.01±0.08 1.35±0.06 1.32±0.04 1.02±0.04 1.35±0.04 1.32±0.10 1.01±0.12 1.35±0.09requestsper run 0.60±0.03 0.98±0.08 0.46±0.02 0.60±0.02 0.97±0.04 0.45±0.01 0.61±0.05 0.97±0.12 0.46±0.03run length 0.81±0.01 0.98±0.01 0.35±0.01 0.80±0.01 0.98±0.00 0.35±0.00 0.81±0.01 0.98±0.02 0.36±0.01run stride 180±2.60 15±8.68 240±104 304±99.1 24.4±12.8 381±155 101±23.2 15.5±19 191±286locality f raction 0.86±0.01 1.00±0.01 0.04±0.01 0.85±0.01 1.00±0.00 0.05±0.01 0.85±0.01 0.99±0.00 0.04±0.01requestsper sparserun 0.01±0.00 0.17±0.08 0.00±0.00 0.01±0.00 0.13±0.02 0.00±0.00 0.01±0.00 0.16±0.05 0.00±0.00sparserun length 0.01±0.01 0.17±0.08 0.00±0.00 0.01±0.00 0.13±0.02 0.00±0.00 0.01±0.00 0.17±0.05 0.00±0.00sparserun f raction 1.11±0.12 1.09±0.12 0.21±0.04 1.06±0.06 1.05±0.06 0.21±0.04 1.24±0.28 1.19±0.27 0.24±0.07time span 1.26±0.05 1.04±0.04 1.48±0.07 1.23±0.03 1.05±0.03 1.46±0.06 1.24±0.10 1.02±0.09 1.43±0.11

Table 6.3: Effect of concurrent vs. individual sequential reading of 4individually written files. Common parameters:fc=4, wss=256KB, cw=0, rr=0

Baseline only:cw=0 fs

=64

MB

wss

=25

6KB

cw=

1

fs=

64M

Bw

ss=

1MB

cw=

1

fs=

64M

Bw

ss=

16M

Bcw

=1

fs=

128M

Bw

ss=

256K

Bcw

=1

fs=

128M

Bw

ss=

1MB

cw=

1

fs=

128M

Bw

ss=

16M

Bcw

=1

fs=

256M

Bw

ss=

256K

Bcw

=1

fs=

256M

Bw

ss=

1MB

cw=

1

fs=

256M

Bw

ss=

16M

Bcw

=1

requestrate 0.80±0.09 0.90±0.10 0.80±0.09 0.86±0.08 0.94±0.08 0.85±0.08 0.78±0.15 0.90±0.17 0.81±0.17bursty f raction 1.76±0.42 1.23±0.33 1.81±0.44 1.66±0.21 1.13±0.15 1.65±0.22 1.67±0.75 1.09±0.52 1.74±0.83requestsper burst 1.03±0.04 1.04±0.05 1.03±0.04 1.03±0.03 1.03±0.04 1.04±0.03 1.06±0.09 1.01±0.07 1.06±0.09data span 6.68±12.6 2.29±4.19 7.78±14.4 2.03±3.17 2.67±3.95 1.39±1.72 2.55±3.69 3.13±4.30 2.40±3.65requestsize 0.99±0.10 1.00±0.11 0.99±0.10 0.99±0.08 1.00±0.08 0.99±0.08 1.01±0.09 0.99±0.12 1.01±0.09requestsper run 0.92±0.10 0.97±0.11 0.91±0.10 0.93±0.07 0.97±0.08 0.92±0.07 0.91±0.09 0.99±0.13 0.90±0.09run length 0.92±0.01 0.98±0.01 0.91±0.01 0.92±0.01 0.98±0.01 0.91±0.01 0.92±0.03 0.98±0.03 0.91±0.03run stride 2.80±0.03 5.04±10.5 2.80±0.03 2.79±0.02 5.18±7.25 3.90±4.94 16.5±33.5 2.45±0.06 2.78±0.07locality f raction 0.98±0.00 1.00±0.00 0.98±0.00 0.98±0.00 1.00±0.00 0.98±0.00 0.98±0.01 0.99±0.01 0.98±0.01requestsper sparserun 0.05±0.01 0.18±0.03 0.05±0.01 0.04±0.00 0.14±0.01 0.04±0.00 NA±NA NA±NA NA±NAsparserun length 0.05±0.01 0.18±0.03 0.05±0.01 0.04±0.00 0.14±0.01 0.04±0.00 NA±NA NA±NA NA±NAsparserun f raction 1.39±0.30 1.31±0.28 1.39±0.30 1.18±0.06 1.15±0.06 1.18±0.06 1.91±1.33 1.73±1.20 1.91±1.32time span 1.25±0.11 1.10±0.10 1.25±0.11 1.17±0.05 1.05±0.05 1.18±0.05 1.25±0.18 1.10±0.15 1.21±0.19

Table 6.4: Effect of individual sequential reading of 2 files written concurrently vs. individually. Common parameters:fc=2, rss=256KB, cr=0, rr=0

6.1.4.3 Experiment: Individual reading of concurrently wr itten files

Purpose Determine the residual effect of concurrently written fileson individual sequential reading.

Measured Individual sequantial reading after individual or concurrent writing.

Parameters file count: 2-4; file size: 64MB, 128MB, 256MB; writing: concurrent, separate; write segmentsize: 256KB, 1MB, 16MB; reading: separate

Expected Results The concurrent writing is expected to result in physically interleaved (fragmented) files onthe disk, and the resulting fragmentation should prolong the individual reading due to seeks between thefragments. The slowdown will probably be not too significantbecause the seeks occur in only one directionand skip relatively small areas depending on number of files and block size.

Measured Results The results of the experiment are shown in tables6.4, 6.5, and6.6 for 2, 3, and 4 files,respectively. As expected, the impact of fragmentation during writing is lower than in case of concurrent vs.individual reading of contiguous files. The results show similar results as in previous case with respect tovarying read segment size. Increasing file size appears to beincreasingbursty f raction, but the increase isnot very significant, because it is coupled with increasing deviation making the numbers less precise. On theother hand, increasing the number of files decreasesbursty f raction, leading to slightly improved times.

6.1.4.4 Experiment: Concurrent reading of concurrently wr itten files

Purpose Determine the residual effect of concurrently written fileson concurrent sequential reading.





=64

MB

wss

=25

6KB

cw=

1

fs=

64M

Bw

ss=

1MB

cw=

1

fs=

64M

Bw

ss=

16M

Bcw

=1

fs=

128M

Bw

ss=

256K

Bcw

=1

fs=

128M

Bw

ss=

1MB

cw=

1

fs=

128M

Bw

ss=

16M

Bcw

=1

fs=

256M

Bw

ss=

256K

Bcw

=1

fs=

256M

Bw

ss=

1MB

cw=

1

fs=

256M

Bw

ss=

16M

Bcw

=1

requestrate 0.86±0.10 0.95±0.13 0.86±0.10 0.86±0.04 0.95±0.05 0.85±0.03 0.81±0.10 0.92±0.13 0.78±0.09bursty f raction 1.73±0.26 1.18±0.21 1.80±0.28 1.82±0.29 1.17±0.17 1.72±0.25 1.73±0.44 1.09±0.37 1.82±0.48requestsper burst 1.04±0.04 1.02±0.03 1.04±0.03 1.03±0.03 1.02±0.03 1.03±0.02 1.07±0.06 1.03±0.06 1.06±0.06data span 1.73±2.64 1.40±1.78 1.40±1.78 4.76±7.90 3.96±7.05 5.83±9.25 2.32±4.32 6.83±13.1 5.70±10.7requestsize 0.98±0.10 0.99±0.12 0.98±0.09 0.99±0.02 1.00±0.04 0.99±0.02 0.99±0.10 1.00±0.12 0.99±0.10requestsper run 0.93±0.09 0.98±0.12 0.92±0.09 0.93±0.02 0.98±0.04 0.93±0.02 0.92±0.10 0.98±0.13 0.91±0.10run length 0.92±0.01 0.98±0.01 0.91±0.01 0.92±0.01 0.98±0.01 0.92±0.00 0.92±0.02 0.98±0.02 0.91±0.02run stride 4.64±0.05 4.42±0.04 4.66±0.07 5.87±6.69 4.39±0.50 4.40±0.50 4.54±0.12 3.87±0.13 10.5±18.2locality f raction 0.98±0.00 1.00±0.00 0.98±0.00 0.98±0.01 1.00±0.00 0.98±0.00 0.98±0.00 1.00±0.00 0.98±0.01requestsper sparserun 0.04±0.01 0.14±0.03 0.03±0.01 0.03±0.00 0.13±0.02 0.03±0.00 NA±NA NA±NA NA±NAsparserun length 0.04±0.01 0.14±0.03 0.03±0.01 0.03±0.00 0.13±0.02 0.03±0.00 NA±NA NA±NA NA±NAsparserun f raction 1.10±0.07 1.07±0.06 1.11±0.07 1.04±0.03 1.02±0.03 1.04±0.03 1.37±0.42 1.29±0.40 1.37±0.42time span 1.17±0.07 1.06±0.06 1.18±0.07 1.18±0.04 1.06±0.04 1.18±0.03 1.24±0.14 1.09±0.13 1.29±0.13



=64

MB

wss

=25

6KB

cw=

1

fs=

64M

Bw

ss=

1MB

cw=

1

fs=

64M

Bw

ss=

16M

Bcw

=1

fs=

128M

Bw

ss=

256K

Bcw

=1

fs=

128M

Bw

ss=

1MB

cw=

1

fs=

128M

Bw

ss=

16M

Bcw

=1

fs=

256M

Bw

ss=

256K

Bcw

=1

fs=

256M

Bw

ss=

1MB

cw=

1

fs=

256M

Bw

ss=

16M

Bcw

=1

requestrate 0.85±0.05 0.94±0.07 0.86±0.05 0.86±0.03 0.95±0.04 0.86±0.04 0.84±0.09 0.95±0.15 0.84±0.09bursty f raction 1.59±0.32 1.10±0.24 1.62±0.32 1.74±0.23 1.21±0.17 1.80±0.26 1.68±0.30 1.07±0.26 1.73±0.30requestsper burst 1.04±0.04 1.02±0.04 1.04±0.04 1.04±0.02 1.02±0.02 1.04±0.02 1.03±0.05 1.01±0.06 1.06±0.05data span 2.78±4.44 4.11±6.02 2.97±4.63 1.27±1.39 2.36±3.18 2.45±3.35 0.75±1.11 1.26±1.50 1.12±1.49requestsize 0.99±0.05 1.00±0.07 0.99±0.05 0.99±0.03 1.00±0.03 0.99±0.03 1.00±0.07 1.01±0.13 1.00±0.07requestsper run 0.92±0.04 0.98±0.06 0.92±0.04 0.93±0.03 0.98±0.03 0.92±0.03 0.91±0.07 0.98±0.13 0.91±0.07run length 0.92±0.01 0.98±0.01 0.91±0.01 0.91±0.00 0.98±0.00 0.91±0.01 0.91±0.01 0.98±0.01 0.90±0.01run stride 7.59±5.02 6.11±0.05 7.66±5.29 5.45±1.78 5.45±1.78 5.92±2.84 10.8±13.8 7.65±10.5 15.3±18.3locality f raction 0.99±0.01 1.00±0.01 0.98±0.01 0.98±0.00 1.00±0.00 0.98±0.00 0.98±0.00 1.00±0.00 0.98±0.01requestsper sparserun 0.05±0.02 0.17±0.08 0.04±0.02 0.04±0.00 0.13±0.02 0.03±0.00 0.04±0.01 0.16±0.05 0.04±0.01sparserun length 0.05±0.02 0.17±0.08 0.04±0.02 0.04±0.00 0.13±0.02 0.03±0.00 0.04±0.01 0.16±0.05 0.04±0.01sparserun f raction 1.11±0.12 1.08±0.11 1.11±0.12 1.06±0.06 1.05±0.06 1.06±0.06 1.23±0.27 1.17±0.26 1.22±0.27time span 1.19±0.05 1.07±0.05 1.18±0.05 1.17±0.03 1.05±0.03 1.17±0.03 1.18±0.10 1.05±0.08 1.19±0.10


Measured Concurrent reading after individual or concurrent writing, using the same segment size where bothreading and writing is concurrent.

Parameters file count: 2-4; file size: 64MB, 128MB, 256MB; writing: individual, concurrent; reading: con-current; read/write segment size: 256KB, 1MB, 16MB

Expected Results The previous experiment showed that concurrent writing results in fragmented files, asshown by the fact that it takes longer to read the same file.. Were the physical fragments exactly as largeand ordered as the writes issued, concurrent reading of the same segment size should be similar to individ-ual reading of individually written files. We have seen that this is not the case because physical fragmentsare usually larger. Still, interleaved files should result in faster concurrent read, because the fragments aresmaller than the files themselves, reducing the seek latency.

Measured Results The results of the experiment are shown in tables6.7, 6.8, and6.9 for 2, 3, and 4 files,respectively. The results show that concurrent writing canreally improve concurrent reading performance,which is however still far from the performance of individual sequential reading of individually writtenfiles. The results follow a seemingly established trend in that results for 1 MB segment sizes show oppositetendencies than those of the other two segment sizes. This experiment is no exception – the performancefor 1 MB segment sizes is actually slightly worse for concurrently vs. individually written files. Lookingat other workload attributes, we can notice that when increasing the number of concurrently written filesas well as file size, the slight performance gain slowly diminishes. Apart from increasingbursty f ractionattribute there appears to be no other explanation for the progressive leveling of performance.





=64

MB

wss

=25

6KB

cw=

1rs

s=25

6KB

fs=

64M

Bw

ss=

1MB

cw=

1rs

s=1M

B

fs=

64M

Bw

ss=

16M

Bcw

=1

rss=

16M

B

fs=

128M

Bw

ss=

256K

Bcw

=1

rss=

256K

B

fs=

128M

Bw

ss=

1MB

cw=

1rs

s=1M

B

fs=

128M

Bw

ss=

16M

Bcw

=1

rss=

16M

B

fs=

256M

Bw

ss=

256K

Bcw

=1

rss=

256K

B

fs=

256M

Bw

ss=

1MB

cw=

1rs

s=1M

B

fs=

256M

Bw

ss=

16M

Bcw

=1

rss=

16M

B

requestrate 1.52±0.13 0.95±0.10 1.60±0.13 1.50±0.07 1.00±0.07 1.56±0.06 1.27±0.25 0.99±0.18 1.56±0.24bursty f raction 0.66±0.21 1.05±0.26 0.88±0.35 0.70±0.17 1.10±0.13 0.88±0.19 0.88±0.38 1.07±0.50 0.73±0.42requestsper burst 0.97±0.02 0.99±0.04 1.05±0.02 0.97±0.02 1.01±0.03 1.05±0.02 1.02±0.06 1.03±0.08 1.06±0.07data span 0.65±0.94 1.49±2.04 0.86±1.01 1.08±1.14 1.78±2.36 1.13±1.68 1.44±2.06 0.92±1.02 0.81±0.73requestsize 0.75±0.03 1.00±0.09 0.75±0.03 0.75±0.02 0.98±0.07 0.74±0.02 0.84±0.09 1.01±0.12 0.76±0.05requestsper run 1.26±0.03 0.99±0.09 1.01±0.02 1.25±0.03 1.00±0.07 1.02±0.01 1.12±0.09 0.99±0.11 NA±NArun length 0.92±0.01 1.00±0.02 0.96±0.03 0.91±0.01 0.99±0.01 0.96±0.01 0.94±0.03 0.99±0.03 NA±NArun stride 0.02±0.00 0.17±0.16 0.15±0.41 0.01±0.00 0.20±0.31 0.01±0.00 0.12±0.31 4.06±8.72 NA±NAlocality f raction 1.06±0.02 1.00±0.00 12.6±1.66 1.05±0.01 1.00±0.00 12.7±1.62 1.02±0.03 1.00±0.01 54.6±48requestsper sparserun 1.26±0.05 0.66±0.07 1.03±0.01 1.23±0.05 0.66±0.05 1.03±0.01 1.11±0.12 0.75±0.20 1.01±0.05sparserun length 0.94±0.02 0.67±0.04 1.01±0.01 0.92±0.02 0.65±0.02 1.01±0.01 0.94±0.02 0.75±0.18 1.04±0.04sparserun f raction 0.98±0.00 0.98±0.04 3.28±1.04 0.98±0.00 1.00±0.01 3.26±0.59 0.98±0.01 1.02±0.08 3.49±2.15time span 0.88±0.07 1.04±0.08 0.84±0.05 0.89±0.03 1.01±0.04 0.86±0.03 0.92±0.13 1.01±0.17 0.84±0.10

Table 6.7: Effect of concurrent sequential reading of 2 files written concurrently vs. individually. Common parameters:fc=2, cr=1, rr=0


=64

MB

wss

=25

6KB

cw=

1rs

s=25

6KB

fs=

64M

Bw

ss=

1MB

cw=

1rs

s=1M

B

fs=

64M

Bw

ss=

16M

Bcw

=1

rss=

16M

B

fs=

128M

Bw

ss=

256K

Bcw

=1

rss=

256K

B

fs=

128M

Bw

ss=

1MB

cw=

1rs

s=1M

B

fs=

128M

Bw

ss=

16M

Bcw

=1

rss=

16M

B

fs=

256M

Bw

ss=

256K

Bcw

=1

rss=

256K

B

fs=

256M

Bw

ss=

1MB

cw=

1rs

s=1M

B

fs=

256M

Bw

ss=

16M

Bcw

=1

rss=

16M

B

requestrate 1.40±0.09 1.00±0.13 1.50±0.09 1.40±0.04 1.01±0.05 1.55±0.05 1.35±0.13 0.98±0.17 1.54±0.19bursty f raction 0.74±0.31 1.08±0.24 0.91±0.26 0.74±0.15 1.06±0.15 0.92±0.17 0.71±0.20 1.11±0.43 0.82±0.20requestsper burst 0.98±0.02 1.01±0.03 1.05±0.02 0.96±0.01 1.01±0.03 1.04±0.01 1.02±0.04 1.03±0.05 1.10±0.04data span 1.92±2.65 0.92±0.99 1.87±2.40 2.21±2.50 1.45±2.10 1.00±1.07 1.91±2.36 1.96±2.76 0.89±1.43requestsize 0.75±0.03 0.98±0.13 0.74±0.02 0.75±0.02 0.98±0.05 0.74±0.02 0.76±0.04 1.01±0.16 0.75±0.03requestsper run 1.25±0.03 1.00±0.13 1.01±0.02 1.25±0.02 1.01±0.05 1.02±0.01 1.24±0.04 0.99±0.16 NA±NArun length 0.91±0.02 0.99±0.01 0.95±0.02 0.91±0.01 0.99±0.01 0.97±0.01 0.92±0.02 0.99±0.02 NA±NArun stride 0.03±0.04 0.24±0.17 0.08±0.04 0.01±0.00 0.28±0.37 0.01±0.00 0.12±0.21 0.53±0.17 NA±NAlocality f raction 1.05±0.01 1.00±0.00 15.8±2.83 1.05±0.01 1.00±0.01 12.2±1.74 1.05±0.02 1.00±0.01 64.6±58.8requestsper sparserun 1.23±0.06 0.72±0.10 1.03±0.01 1.21±0.04 0.64±0.07 1.03±0.01 1.21±0.07 0.69±0.14 1.02±0.02sparserun length 0.93±0.01 0.71±0.05 1.00±0.01 0.91±0.02 0.63±0.06 1.00±0.00 0.93±0.02 0.69±0.07 1.02±0.02sparserun f raction 0.98±0.00 1.00±0.02 3.29±0.74 0.98±0.00 1.00±0.01 3.25±0.57 0.98±0.01 0.97±0.07 3.40±0.84time span 0.95±0.05 1.02±0.06 0.90±0.04 0.95±0.02 1.02±0.03 0.87±0.02 0.97±0.08 1.02±0.12 0.87±0.09



=64

MB

wss

=25

6KB

cw=

1rs

s=25

6KB

fs=

64M

Bw

ss=

1MB

cw=

1rs

s=1M

B

fs=

64M

Bw

ss=

16M

Bcw

=1

rss=

16M

B

fs=

128M

Bw

ss=

256K

Bcw

=1

rss=

256K

B

fs=

128M

Bw

ss=

1MB

cw=

1rs

s=1M

B

fs=

128M

Bw

ss=

16M

Bcw

=1

rss=

16M

B

fs=

256M

Bw

ss=

256K

Bcw

=1

rss=

256K

B

fs=

256M

Bw

ss=

1MB

cw=

1rs

s=1M

B

fs=

256M

Bw

ss=

16M

Bcw

=1

rss=

16M

B







=64

MB

rss=

256K

Bcr

=1

fs=

64M

Brs

s=1M

Bcr

=1

fs=

64M

Brs

s=16

MB

cr=

1

fs=

128M

Brs

s=25

6KB

cr=

1

fs=

128M

Brs

s=1M

Bcr

=1

fs=

128M

Brs

s=16

MB

cr=

1

fs=

256M

Brs

s=25

6KB

cr=

1

fs=

256M

Brs

s=1M

Bcr

=1

fs=

256M

Brs

s=16

MB

cr=

1

requestrate 0.84±0.11 1.05±0.20 0.99±0.08 0.93±0.05 1.02±0.10 1.01±0.05 0.87±0.14 1.04±0.29 0.96±0.14bursty f raction 0.84±0.30 1.03±0.29 1.06±0.26 0.88±0.17 1.08±0.19 1.00±0.16 0.97±0.29 0.97±0.62 0.76±0.36requestsper burst 1.00±0.03 0.99±0.05 1.02±0.06 0.99±0.03 1.00±0.04 1.00±0.03 0.98±0.05 NA±NA 1.00±0.05data span 1.40±1.38 0.49±0.49 0.66±0.80 0.53±0.50 0.89±0.53 0.80±0.48 0.79±0.50 0.88±0.64 0.75±0.44requestsize 1.13±0.15 1.00±0.13 1.00±0.09 1.05±0.05 1.01±0.06 1.00±0.04 1.10±0.12 1.02±0.17 1.06±0.12requestsper run 0.91±0.10 1.01±0.14 1.00±0.01 0.97±0.04 0.99±0.06 1.00±0.01 0.93±0.09 0.99±0.16 0.94±0.05run length 1.06±0.04 1.01±0.02 1.00±0.09 1.03±0.02 1.00±0.01 1.00±0.06 1.03±0.04 0.99±0.05 1.04±0.16run stride 2.63±0.20 1.67±3.52 2.65±1.85 2.38±0.47 1.99±1.81 2.24±0.47 2.53±0.32 0.52±1.66 2.28±0.76locality f raction 0.95±0.04 1.00±0.01 0.98±0.19 0.98±0.03 1.00±0.00 1.01±0.14 0.96±0.04 1.00±0.01 0.87±0.31requestsper sparserun 0.89±0.11 0.98±0.25 1.00±0.02 0.96±0.05 0.95±0.11 1.00±0.02 0.91±0.10 0.91±0.20 0.94±0.06sparserun length 1.02±0.02 0.98±0.21 0.99±0.08 1.01±0.03 0.96±0.09 1.00±0.05 0.99±0.05 0.91±0.17 1.01±0.15sparserun f raction 0.99±0.01 0.99±0.05 0.99±0.15 1.00±0.01 0.99±0.02 1.01±0.11 0.99±0.01 1.10±0.23 0.94±0.31time span 1.06±0.05 0.97±0.09 1.01±0.05 1.03±0.03 0.99±0.14 1.04±0.02 1.04±0.10 0.88±0.15 0.99±0.07

Table 6.10: Effect of concurrent vs. individual random reading of 2 fileswritten individually. Common parameters: fc=2,wss=256KB, cw=0, rr=1

6.1.4.5 Random access

This group of experiments assesses impact of sharing when each individual file is accessed randomly instead ofsequentially, i.e. a seek to a random position inside the fileis issued before each block read. The sizes of blocksthat are read between two seek operations are the same as the block sizes of concurrent reading. Again, multiplefiles are written and/or read either individually or concurrently.

6.1.4.6 Experiment: Concurrent random reading of individu ally written files

Purpose Determine the effect of concurrent vs. individual random reading of individually written files.

Measured Individual or concurrent random reading after individual writing.

Parameters file count: 2-4; file size: 64MB, 128MB, 256MB; writing: individual; reading: random individual;read segment size: 256KB, 1MB, 16MB

Expected Results Random concurrent reading should not affect performance compared to random individualreading as much as in the case of individual and concurrent sequential reading, because random reading ofa file already causes disk seeks and does not benefit from read-ahead buffering like the sequential readingdoes. Concurrent reading should just increase the length ofseeks and cause more buffer cache sharing,which should not be such difference without read-ahead.

Measured Results The results of the experiment are shown in tables6.10, 6.11, and6.12for 2, 3, and 4 files,respectively. As expected, the slowdown was much smaller than with sequential reading. However, theimpact is still notable with 256 KB segment size. Again, for 1MB segment size there is actually a veryslight speedup. The assumption that concurrent reading should just increase the length of seeks is reflectedin the measured values of therun strideattribute.

6.1.4.7 Experiment: Individual random reading of concurre ntly written files

Purpose Determine the residual effect of concurrently written fileson individual random reading.

Measured Individual random reading after individual or concurrent writing.

Parameters file count: 2-4; file size: 64MB, 128MB, 256MB; writing: individual, concurrent; reading: randomindividual; write segment size: 256KB, read segment size: 256KB, 1MB, 16MB

Expected Results As in previous experiment, the slowdown should be smaller when compared to the analogousExperiment6.1.4.3with sequential reading, because seeking already occurs due to random reading, andfragmented files should increase it only a little.





=64

MB

rss=

256K

Bcr

=1

fs=

64M

Brs

s=1M

Bcr

=1

fs=

64M

Brs

s=16

MB

cr=

1

fs=

128M

Brs

s=25

6KB

cr=

1

fs=

128M

Brs

s=1M

Bcr

=1

fs=

128M

Brs

s=16

MB

cr=

1

fs=

256M

Brs

s=25

6KB

cr=

1

fs=

256M

Brs

s=1M

Bcr

=1

fs=

256M

Brs

s=16

MB

cr=

1




=64

MB

rss=

256K

Bcr

=1

fs=

64M

Brs

s=1M

Bcr

=1

fs=

64M

Brs

s=16

MB

cr=

1

fs=

128M

Brs

s=25

6KB

cr=

1

fs=

128M

Brs

s=1M

Bcr

=1

fs=

128M

Brs

s=16

MB

cr=

1

fs=

256M

Brs

s=25

6KB

cr=

1

fs=

256M

Brs

s=1M

Bcr

=1

fs=

256M

Brs

s=16

MB

cr=

1



Measured Results The results of the experiment are shown in tables6.13, 6.14, and6.15for 2, 3, and 4 files,respectively. As expected, the results mostly show slowdown which is most pronounced in case of 2 files,files size 128 MB and read segment size 16 MB. Other than that, the timing information varies without visibletrend. We can also observe that thedata spanof the various workload configurations does not grow much,which suggests that random seeking during reads already covers most of the data area. Still, there is anincrease inrun stride for the fraction of requests that actually occur in a run, which means that the runs arefather apart.





=64

MB

cw=

1rs

s=25

6KB

fs=

64M

Bcw

=1

rss=

1MB

fs=

64M

Bcw

=1

rss=

16M

B

fs=

128M

Bcw

=1

rss=

256K

B

fs=

128M

Bcw

=1

rss=

1MB

fs=

128M

Bcw

=1

rss=

16M

B

fs=

256M

Bcw

=1

rss=

256K

B

fs=

256M

Bcw

=1

rss=

1MB

fs=

256M

Bcw

=1

rss=

16M

B


Table 6.13: Effect of concurrent vs. individual writing of 2 files on individual random reading. Common parameters:fc=2, wss=256KB, cr=0, rr=1


=64

MB

cw=

1rs

s=25

6KB

fs=

64M

Bcw

=1

rss=

1MB

fs=

64M

Bcw

=1

rss=

16M

B

fs=

128M

Bcw

=1

rss=

256K

B

fs=

128M

Bcw

=1

rss=

1MB

fs=

128M

Bcw

=1

rss=

16M

B

fs=

256M

Bcw

=1

rss=

256K

B

fs=

256M

Bcw

=1

rss=

1MB

fs=

256M

Bcw

=1

rss=

16M

B




=64

MB

cw=

1rs

s=25

6KB

fs=

64M

Bcw

=1

rss=

1MB

fs=

64M

Bcw

=1

rss=

16M

B

fs=

128M

Bcw

=1

rss=

256K

B

fs=

128M

Bcw

=1

rss=

1MB

fs=

128M

Bcw

=1

rss=

16M

B

fs=

256M

Bcw

=1

rss=

256K

B

fs=

256M

Bcw

=1

rss=

1MB

fs=

256M

Bcw

=1

rss=

16M

B






Chapter 7

Virtual Machine

When considering the shared resources associated with a virtual machine, we assume a common desktop andserver-based virtual machine with just-in-time compilation and collected heap. Examples of such virtual machinesinclude CLI and JVM.

Same as a physical processor is an important source of implicitly shared resources for the components it hosts,so does a virtual machine provide another set of implicitly shared resources unique to the components it hosts.Since many components and services use languages hosted by virtual machines, such as C# or Java, it is natural toextend the scope of implicitly shared resources to resources involved in the virtual machine operation.

7.1 Resource: Collected Heap

The collected heap is a shared resource that provides dynamic memory management to multiple componentsrunning on the same virtual machine. The essential functions provided by the collected heap to the componentsare allocation and freeing of memory blocks.

Whenever a component requests a memory block, a free block ofthe requested size is found on the heap. Theblock is marked as used and its address is returned to the component. The component can use the memory blockas long as it retains its address. When a free block of a sufficient size is not found on the heap, used blocks whoseaddresses are no longer retained by the components are identified as garbage and reclaimed as free.

There are many algorithms to manage collected heap, with various tradeoffs in the way the memory blocks areallocated, identified as garbage and reclaimed as free. The resource sharing experiments will focus on compactinggenerational garbage collectors, which are the garbage collectors of choice in many virtual machines. Althoughthe algorithms used by the virtual machines differ, the principles of the compacting generational garbage collectorsprovide common ground for the experiments.

A compacting collector tackles the problem of heap fragmentation. During garbage collection, used and freeblocks tend to be mixed on the heap. The free space is fragmented into many small blocks instead of one largeblock. This decreases memory utilization, since many smallblocks are unable to satisfy large allocation requeststhat one large block would, even if the total free space staysthe same. It also increases the overhead of lookingup the free blocks during allocation, since a free block of the appropriate size needs to be located among manyfree blocks, rather than simply cut off from one large block.Finally, it also decreases the efficiency of caching,since the granularity of caching does not match the granularity of the used blocks and parts of the free blocks willtherefore be cached alongside the used blocks.

A compacting collector moves the used blocks to avoid fragmenting of the free space. Besides the obviousoverhead of moving the used blocks, the compacting collector must also take care of referential integrity, updatingthe addresses of the used blocks retained by the application.

A generational collector tackles the problem of collectionoverhead. During garbage collection, all used blockson the heap are traversed, starting from root objects that are always accessible to an application and locating objectsthat are accessible transitively. A large heap can take a long time to traverse, bringing significant overhead not onlyin terms of computation time, but also in terms of synchronization time, since modifications of the heap are limitedduring traversal.




A generational collector relies on the objects being more likely to become garbage earlier in their lifetime.Objects are separated into generations, starting at the youngest and gradually moving towards the oldest as theysurvive collections. Younger generations are collected more often than older ones, since they are more likely tocontain garbage. In order to facilitate independent collection of individual generations, references that cross theoutside of generation boundary are treated as roots for the purpose of the particular generation collection.


This section describes platform dependent details of the virtual machines used in the experiments for platformsintroduced in Section3.2.

7.1.1.1 Platform Desktop

The virtual machine is equipped with a generational garbagecollector framework that distinguishes young andtenured generations. A configurable set of collector algorithms is available, with defaults selected depending onwhether a client class platform or a server class platform isdetected [17, 19, 20, 21].

The default choice on a client class platform is the Serial Collector. The Serial Collector algorithm usescopying in the young generation and marking and compacting in the tenured generation. The collector uses asingle processor and stops the mutator for the entire collection.

The default choice on a server class platform is the ParallelCollector, also called the Throughput Collector.The Parallel Collector differs from the Serial Collector byintroducing a multiprocessor copying algorithm forthe young generation collection. Optionally, the ParallelCollector can be configured to use a multiprocessorcompacting algorithm for the tenured generation collection.

An optional choice on both platform classes is the Mostly Concurrent Collector, also called the Low LatencyCollector. The Mostly Concurrent Collector algorithm differs from the Parallel Collector by introducing a multi-processor mark and sweep algorithm for the tenured generation collection [18]. The algorithm does not stop themutator for the entire collection.

By default, the virtual machine attempts to limit the total overhead of the garbage collection to 1 % of theexecution time, extending the total heap size if this goal cannot be met. The virtual machine also imposes a limiton the total heap size, which is by default 64 MB on a client class platform and 1/4 of physical memory but at most1 GB on a server class platform. The experiments were normalized to never exceed the limit on the total heap size.

To facilitate better utilization of the available heap, thevirtual machine introduces three special reference typesin addition to the regular object references. A soft reference is used with objects that should not be collected aslong as there is no shortage of memory. A weak reference is used with objects that should be collected but need tobe tracked until they are collected. A phantom reference is used with objects that are being collected but need tobe tracked while they are collected


The same virtual machine as on PlatformDesktopis used here. Only a server class machine is available though,hence using the Parallel Collector by default. To make the time measurements of the measured workload compa-rable with the time measurements of the garbage collector, the virtual machine was limited to run on a single coreonly.


The same virtual machine as on PlatformDesktopis used here. Only a server class machine is available though,hence using the Parallel Collector by default. To make the time measurements of the measured workload compa-rable with the time measurements of the garbage collector, the virtual machine was limited to run on a single coreonly.





The effects that can influence the quality attributes when multiple components share a collected heap include:

Heap dimensions The dimensions of the heap can influence both the collection efficiency and the collectionoverhead. The exact dependency, however, depends on the particular garbage collector implementation.Assuming that backtracking is used to traverse the heap, then the longer the sequences of references on theheap, the more backtracking information is required duringcollection.

Reference aliasing Assuming that a compacting collector with direct references is used, then the more refer-ences to the same object, the more work needs to be done updating references on compaction.

Object lifetimes The collection efficiency and the collection overhead differs for each generation. Changes inobject lifetimes can cause changes in the assignment of objects to generations.

7.1.3 General Composition

The effects of sharing the collected heap can be exhibited asfollows:

• Any composition of components that allocate memory on the heap will change the heap dimensions, poten-tially influencing both the collection efficiency and the collection overhead.

• Assume components that allocate temporary objects of relatively short lifetimes. Assume further that the life-times are tied to the processing speed of the components. A composition of such components can decreasethe processing power available to each component, thus decreasing the processing speed of the componentsand increasing the lifetime of the temporary objects. When the lifetimes of the temporary objects cross ageneration boundary, the collection efficiency and the collection overhead will change.

• Assume a component that uses soft references. A compositionof such a component with any componentsthat allocate memory on the heap will change the heap dimensions, potentially influencing the conditionsthat trigger the collection of soft references.

7.1.4 Artificial Experiments: Overhead Dependencies

The overhead of the garbage collector can depend on the allocation speed. To filter out the impact of the allocationspeed, the experiments are configured for a constant allocation speed, chosen to be practically reasonable butotherwise arbitrary. The range of practically reasonable allocation speeds has been determined by profiling thederby, serial, sunflow and compiler benchmarks from the SPECjvm2008 suite [29]. On PlatformDesktop, thethree benchmarks allocate from 260 K to 320 K objects per second, with the average object size from 35 B to 40 B.

7.1.4.1 Experiment: Object lifetime

The experiment to determine the dependency of the collectoroverhead on the object lifetime uses componentsthat allocate temporary objects in a queue. A constant number of componentsTotalComponentsis allocated onthe heap, each component allocates a constant number of payload objectsOb jectsPerComponent. Every timea component is invoked, it releases the reference to its oldest allocated object and allocates a new object. Theexperiment invokes each component a given number of timesConsecutiveInvocationsbefore advancing to thenext component. When the number of consecutive invocationsof a component is below the number of objectsper component, the lifetime of each object isTotalComponents×Ob jectsPerComponentinvocations. When thenumber of consecutive invocations of a component exceeds the number of objects per component, the lifetimeof Ob jectsPerComponent/ConsecutiveInvocationspercent of objects remains the same, while the lifetime of theremaining objects changes toOb jectsPerComponentinvocations.

Random accesses to the allocated objects are added to the workload to regulate the allocation speed.

Purpose Determine the dependency of the collector overhead on the object lifetime.




Listing 7.1: Object lifetime experiment.

1 // Component implementation2 class Component {3 // List initialized with ObjectsPerComponent instances of Payload4 ArrayList<Payload> oObjects;5

6 void Invoke () {7 oObjects. remove (0);8 oObjects. add ( new Payload ());9 }

10 }11

12 // Array initialized with TotalComponent instances of Comp onent13 Component[] aoComponents;14

15 // Workload generation16 for ( int iComponent = 0;17 iComponent < TotalComponents;18 iComponent ++) {19 for ( int iInvocation = 0;20 iInvocation < ConsecutiveInvocations;21 iInvocation ++) {22 aoComponents[ iComponent]. Invoke ();23 }24 }

Measured Time to perform a single component invocation from Listing7.1, and the overhead spent by theyoung and the tenured generation collector.

Parameters TotalComponents: 1-32 K; ObjectsPerComponent: 16; ConsecutiveInvocations: 1 for mostly longobject lifetimes, 256 for mostly short object lifetimes; Normalization: 300 K objects/s, 40 B size for 1 Kcomponents.

Expected Results A certain minimum number of components is necessary for the object lifetimes to cross theboundary between the young and the tenured generation. As soon as this number of components is reached,a difference in the collection efficiency and the collectionoverhead should be observed between the twoobject lifetime configurations. ForConsecutiveInvocationsset to 1, 100 % of objects has the long lifetime.ForConsecutiveInvocationsset to 256, 94 % of objects has the short lifetime and 6 % of objects has the longlifetime.

Measured Results With the client configuration of the virtual machine, the collection overhead on Figures7.2and 7.3 starts growing sharply when the object lifetime exceeds 64× 16 invocations, stabilizing arounda level of 35-40 % overhead for 2048× 16 invocations. With the server configuration of the virtualma-chine, the collection overhead on Figures7.5and7.6grows similarly, stabilizing around a level of 40-45%overhead for 2048×16 invocations. In both configurations, the point where the collection overhead startsincreasing corresponds exactly with the point where the invocation throughput starts decreasing, see Fig-ures7.1and7.4.

The results of the experiment suggest that the collection overhead of objects with long lifetime is significantlylarger than the collection overhead of objects with short lifetime.




1 10 100 1000 10000

2e+

053e

+05

4e+

056e

+05

8e+

05

Allocated components

Thr

ough

put [

1/s

− 1

0 se

c A

vg T

rim]

oldyoung

Figure 7.1: Invocation throughput in client configuration onDesktop.

1 10 100 1000 10000

02

46

810


Col

lect

or o

verh

ead

[% −

10

sec

Avg

Trim

]

all generationsyoung generationtenured generation

Figure 7.2: Collector overhead for short lifetimes in client configuration onDesktop.




1 10 100 1000 10000

010

2030

40


Col

lect

or o

verh

ead

[% −

10

sec

Avg

Trim

]


Figure 7.3: Collector overhead for long lifetimes in client configuration onDesktop.

1 10 100 1000 10000

4000

0060

0000

1000

000


Thr

ough

put [

1/s

− 1

0 se

c A

vg T

rim]

oldyoung

Figure 7.4: Invocation throughput in server configuration onDesktop.




1 10 100 1000 10000

05

1015


Col

lect

or o

verh

ead

[% −

10

sec

Avg

Trim

]


Figure 7.5: Collector overhead for short lifetimes in server configuration onDesktop.

1 10 100 1000 10000

010

2030

40


Col

lect

or o

verh

ead

[% −

10

sec

Avg

Trim

]


Figure 7.6: Collector overhead for long lifetimes in server configuration onDesktop.




Effect Summary The overhead change can be visible in components with dynamically allocated objects keptaround for short durations, especially durations determined by outside invocations.

7.1.4.2 Experiment: Heap depth

The experiment to determine the dependency of the collectoroverhead on the depth of the heap uses objectsarranged in a doubly linked list. A constant number of objectsTotalOb jectsis allocated on the heap and arrangedin a doubly linked list. The experiment releases referencesto randomly selected objects and replaces them withnew objects. In addition to the doubly linked list, an array of TotalOb jectreferences, which is a root object,is also maintained. In a shallow configuration of the experiment, the array contains references to all the objectsof the doubly linked list, effectively setting the average distance of each object from the root to one. In a deepconfiguration of the experiment, the array contains references to the first object of the doubly linked list, effectivelysetting the average distance of each object from the root toTotalOb ject/2. An additional array ofTotalOb jectweak references to the objects is maintained to allow randomselection of objects with constant complexity.

Random accesses to the allocated objects are added to the workload to regulate the allocation speed.

Purpose Determine the dependency of the collector overhead on the depth of the heap.

Measured Time to perform a single object replacement from Listing7.2, and the overhead spent by the youngand the tenured generation collector.

Parameters TotalObjects: 2-128 K; Normalization: 200 K objects/s, 40 Bsize for 128 K objects.

Expected Results With the growing number of objects, the average object lifetime also grows, and so shouldthe average collector overhead, as demonstrated in Experiment7.1.4.1. If the collection overhead dependson the depth of the heap, a difference in overhead between theshallow and the deep configuration shouldalso appear.

Measured Results With both the client and the server configurations of the virtual machine, the collectionoverhead of the shallow heap is below the collection overhead of the deep heap for 16-512 objects, byas much as 60 %, see Figures7.8 and 7.9 for client configuration and Figures7.11 and 7.12 for serverconfiguration.

The results of the experiment suggest that the young generation collector is not able to collect garbage in thedeep configuration.

Effect Summary The overhead change can be visible in components whose dynamically allocated objects arelinked to references provided by outside invocations, especially when such references connect the objects indeep graphs.

7.1.4.3 Experiment: Heap size

The experiment to determine the dependency of the collectoroverhead on the size of the heap uses objects arrangedin a doubly linked graph. A constant number of objectsTotalOb jectsis allocated on the heap and arranged in adoubly linked graph with each object randomly selectingNeighborsPerOb jectneighbor objects. The experimentreleases references to randomly selected objects and replaces them with new objects. An array ofTotalRootsrefer-ences to randomly selected objects, which is a root object, is also maintained. An additional array ofTotalOb jectweak references to the objects is maintained to allow randomselection of objects with constant complexity.

Purpose Determine the dependency of the collector overhead on the size of the heap.

Measured Time to perform a single object replacement from Listing7.3, and the overhead spent by the youngand the tenured generation collector.




1e+01 1e+02 1e+03 1e+04 1e+05

2e+

055e

+05

1e+

062e

+06

Allocated objects

Thr

ough

put [

1/s

− 1

0 se

c A

vg T

rim]

deepshallow

Figure 7.7: Replacement throughput in client configuration onDesktop.

1e+01 1e+02 1e+03 1e+04 1e+05

020

4060

80

Allocated objects

Col

lect

or o

verh

ead

[% −

10

sec

Avg

Trim

]


Figure 7.8: Collector overhead with shallow heap in client configuration onDesktop.




1e+01 1e+02 1e+03 1e+04 1e+05

020

4060

80

Allocated objects

Col

lect

or o

verh

ead

[% −

10

sec

Avg

Trim

]


Figure 7.9: Collector overhead with deep heap in client configuration onDesktop.

1e+01 1e+02 1e+03 1e+04 1e+05

2e+

055e

+05

1e+

062e

+06

Allocated objects

Thr

ough

put [

1/s

− 1

0 se

c A

vg T

rim]

deepshallow

Figure 7.10: Replacement throughput in server configuration onDesktop.




1e+01 1e+02 1e+03 1e+04 1e+05

020

4060

80

Allocated objects

Col

lect

or o

verh

ead

[% −

10

sec

Avg

Trim

]


Figure 7.11: Collector overhead with shallow heap in server configuration onDesktop.

1e+01 1e+02 1e+03 1e+04 1e+05

020

4060

80

Allocated objects

Col

lect

or o

verh

ead

[% −

10

sec

Avg

Trim

]


Figure 7.12: Collector overhead with deep heap in server configuration onDesktop.




Listing 7.2: Heap depth experiment.

1 // Object implementation2 class Payload {3 Payload oPrev;4 Payload oNext;5 }6

7 // In shallow configuration, array initialized8 // with references to TotalObjects objects9

10 // In deep configuration, array initialized11 // with TotalObjects references to one object12 Payload[] aoRoot;13

14 // Array initialized with weak references15 // to TotalObjects objects16 WeakReference<Payload>[] aoObjects;17

18 // Workload generation19 while ( true) {20 // Pick a victim for replacement and create the replacement21 int iVictim = oRandom. nextInt ( aoObjects. length - 1) + 1;22 Payload oVictim = aoObjects[ iVictim]. get ();23 Payload oReplacement = new Payload ();24 aoObjects[ iVictim] = new WeakReference<Payload> ( oReplacement);25

26 // Connect the replacement in place of the victim.27 oReplacement. oPrev = oVictim. oPrev;28 oReplacement. oPrev. oNext = oReplacement;29 oReplacement. oNext = oVictim. oNext;30 oReplacement. oNext. oPrev = oReplacement;31

32 // In shallow configuration, connect to root33 if ( Shallow) aoRoot[ iVictim] = oReplacement;34 }

Parameters TotalObjects: 1 K-128 K; Normalization: 160 K objects/s, 40B size for 128 K objects.

Expected Results With the growing number of objects, the average object lifetime also grows, and so shouldthe average collector overhead, as demonstrated in Experiment7.1.4.1. If the collection overhead dependson the size of the heap, the dependency should also appear.

Measured Results The results of the experiment on Figures7.13to 7.16show the collection overhead stayingconstant for the young generation collector and growing slowly with the heap size for the tenured generationcollector.

Effect Summary The overhead change can be visible in any components with dynamically allocated objects.




1024 2048 4096 8192 16384 32768 65536 131072

3000

040

000

5000

070

000

Allocated objects

Thr

ough

put [

1/s

− 1

0 se

c A

vg]

base

Figure 7.13: Replacement throughput in client configuration onDesktop.

1e+03 2e+03 5e+03 1e+04 2e+04 5e+04 1e+05

010

2030

40

Allocated objects

Col

lect

or o

verh

ead

[% −

10

sec

Avg

Trim

]


Figure 7.14: Collector overhead in client configuration onDesktop.




1024 2048 4096 8192 16384 32768 65536 131072

4000

060

000

8000

012

0000

Allocated objects

Thr

ough

put [

1/s

− 1

0 se

c A

vg]

base

Figure 7.15: Replacement throughput in server configuration onDesktop.

1e+03 2e+03 5e+03 1e+04 2e+04 5e+04 1e+05

010

2030

4050

Allocated objects

Col

lect

or o

verh

ead

[% −

10

sec

Avg

Trim

]


Figure 7.16: Collector overhead in server configuration onDesktop.




Listing 7.3: Heap size experiment.

1 // Object implementation2 class Payload {3 ArrayList<Payload> oForward;4 ArrayList<Payload> oBackward;5

6 // Establishes references to neighbor objects7 void Link () { ... };8

9 // Releases references to neighbor objects10 void Unlink () { ... };11 }12

13 // Array initialized with references to TotalRoots objects14 Payload[] aoRoots;15

16 // Array initialized with weak references to TotalObjects o bjects17 WeakReference<Payload>[] aoObjects;18

19 // Workload generation20 while ( true) {21 // Pick a victim for replacement22 // and create the replacement23 int iVictim = oRandom. nextInt ( aoObjects. length);24 Payload oVictim = aoObjects[ iVictim]. get ();25 Payload oReplacement = new Payload ();26 aoObjects[ iVictim] = new WeakReference<Payload> ( oReplacement);27

28 // Disconnect the victim and connect the replacement29 oVictim. Unlink ();30 oReplacement. Link ();31

32 // Update root list if necessary33 if ( oRoots. remove ( oVictim)) oRoots. add (...);34 }

7.1.4.4 Varying Allocation Speed

Sharing the collected heap influences the allocation speed.Additional experiments are therefore introduced toexamine the impact of the allocation speed on the collector overhead, noting that the allocation speed was constantin the previous experiments. The experiments use the workloads from Listings7.1, 7.2 and7.3, adjusting theallocation speed by changing the number of random accesses to the allocated objects, which are performed by theworkload to regulate the allocation speed.

7.1.4.5 Experiment: Allocation speed with object lifetime

Purpose Determine the dependency of the collector overhead on the allocation speed in combination withvarying object lifetimes from Listing7.1.

Measured The overhead spent by the young and the tenured generation collector.




0.0e+00 5.0e+06 1.0e+07 1.5e+07 2.0e+07 2.5e+07 3.0e+07

05

1015

20

Allocation speed [objects/s]

Col

lect

or o

verh

ead

[% −

10

sec

Avg

]

Young generationTenured generationAll generations

Figure 7.17: Collector overhead with 1 components and short object lifetimes onIntel Server.

0.0e+00 5.0e+06 1.0e+07 1.5e+07 2.0e+07 2.5e+07

05

1015


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.18: Collector overhead with 512 components and short object lifetimes onIntel Server.

Parameters TotalComponents: 1-16 K objects; ObjectsPerComponent: 16; ConsecutiveInvocations: 1 formostly long object lifetimes, 256 for mostly short object lifetimes; Random accesses: 0-1024; Maximumheap size: 64 MB.

Expected Results With the growing allocation speed, the number of objects that need to be collected per unit oftime also grows and the collector overhead should thereforealso increase. The results of Experiment7.1.4.1suggest that the overhead can also differ for different object lifetimes.

Measured Results The collector overhead on a heap with mostly short object lifetimes is displayed on Fig-ures7.17to 7.20.

The collector overhead on a heap with mostly long object lifetimes is displayed on Figures7.21to 7.24.




0.0e+00 5.0e+06 1.0e+07 1.5e+07 2.0e+07

010

2030

40


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.19: Collector overhead with 4 K components and short object lifetimes onIntel Server.

0.0e+00 5.0e+06 1.0e+07 1.5e+07

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.20: Collector overhead with 16 K components and short object lifetimes onIntel Server.




0.0e+00 5.0e+06 1.0e+07 1.5e+07 2.0e+07 2.5e+07 3.0e+07

05

1015

20


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.21: Collector overhead with 1 component and long object lifetimes onIntel Server.

0.0e+00 5.0e+06 1.0e+07 1.5e+07 2.0e+07

05

1015

2025


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.22: Collector overhead with 512 components and long object lifetimes onIntel Server.

The results suggest that for large heaps, the dependency of the collector overhead on the allocation speed isclose to linear. For small heaps, however, the dependency isanomalous in the sense that a higher allocationspeed can result in a smaller collector overhead.

Open Issues The results for one allocated component in both experiment configurations, displayed on Fig-ures7.17and7.21, show the overhead peaking near the middle of the plot. For higher allocation speeds, thecollection is triggered aproximately two times more often than in the peak case, but each collection takes25-30 times less time. This anomaly can have many causes, including internal optimizations.




0e+00 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06

010

2030

4050

60


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.23: Collector overhead with 4 K components and long object lifetimes onIntel Server.

200000 400000 600000 800000 1000000 1200000 1400000 1600000

020

4060

80


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.24: Collector overhead with 16 K components and long object lifetimes onIntel Server.

7.1.4.6 Experiment: Allocation speed with heap depth

Purpose Determine the dependency of the collector overhead on the allocation speed in combination withvarying depths of the heap from Listing7.2.


Parameters TotalObjects: 2-64 K; Random accesses: 0-1024; Maximum heap size: 64 MB.

Expected Results With the growing allocation speed, the number of objects that need to be collected per unit oftime also grows and the collector overhead should thereforealso increase. The results of Experiment7.1.4.2suggest that the overhead can also differ for different depths of the heap.

Measured Results The results of the experiment for shallow heap configurations are on Figures7.25to 7.28.




0.0e+00 5.0e+06 1.0e+07 1.5e+07 2.0e+07 2.5e+07 3.0e+07

05

1015

2025


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.25: Collector overhead with 2 objects and shallow heap onIntel Server.

0e+00 1e+06 2e+06 3e+06 4e+06

020

4060


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.26: Collector overhead with 1 K objects and shallow heap onIntel Server.

The results of the experiment for deep heap configurations are on Figures7.29to 7.32.

The results suggest that for large heaps, the dependency of the collector overhead on the allocation speed isclose to linear. For small heaps, however, the dependency isanomalous in the sense that a higher allocationspeed can result in a smaller collector overhead.

Open Issues The results for two objects in both experiment configurations, displayed on Figures7.25and7.29,show the overhead peaking near the middle of the plot. For higher allocation sppeds, the collection istriggered aproximately three times more often than in the peak case, but each collection takes 20-30 timesless time. This anomaly can have many causes, including internal optimizations or experiment errors, and isnot yet explained.

In the results for 1024 objects in the shallow heap configuration on Figure7.26, an outlier is displayed




0 500000 1000000 1500000 2000000 2500000 3000000

010

2030

4050

6070


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]



2e+05 4e+05 6e+05 8e+05

020

4060


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]



among the higher allocation speeds. This outlier is observed on a configuration with 16 random accessesthat regulate the allocation speed, while two neighboring values are observed on configurations with 0 and 1random accesses. It is surprising that increasing the number of random accesses in between allocations canactually increase the allocation speed. This anomaly can have many causes, including internal optimizationsor experiment errors, and is not yet explained.

7.1.4.7 Experiment: Allocation speed with heap size

Purpose Determine the dependency of the collector overhead on the allocation speed in combination withvarying sizes of the heap from Listing7.3.





0.0e+00 5.0e+06 1.0e+07 1.5e+07 2.0e+07 2.5e+07 3.0e+07

05

1015

20


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.29: Collector overhead with 2 objects and deep heap onIntel Server.

0e+00 1e+06 2e+06 3e+06 4e+06

020

4060


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.30: Collector overhead with 1 K objects and deep heap onIntel Server.

Parameters TotalObjects: 1-64 K; Random accesses: 0-1024; Maximum heap size: 64 MB.

Expected Results With the growing allocation speed, the number of objects that need to be collected per unitof time also grows and the collector overhead should therefore also increase.

Measured Results The results of the experiment on Figures7.33to 7.36show that the collection overhead isgrowing with both the allocation speed and the heap size.

The correlation coefficients indicate that the dependency of the collector overhead on the allocation speed isclose to linear. The correlation coefficients are given for more heap sizes than the dependency graphs, whichwould take too much space.




0 500000 1000000 1500000 2000000 2500000 3000000

010

2030

4050

6070


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]



2e+05 4e+05 6e+05 8e+05 1e+06

020

4060


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]



Number of objects Correlation coefficient1024 0.98592048 0.99344096 0.99648192 0.999516384 0.999432768 0.999765536 0.9989

If the plots are interpreted as linear, the dependency of their linear coefficients on the number of objects isalso linear, as displayed on Figure7.37. The correlation coefficient is 0.9987.




0 500000 1000000 1500000 2000000 2500000

02

46

810

12


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.33: Collector overhead with 1 K objects onIntel Server.

0 500000 1000000 1500000

05

1015


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]



7.1.4.8 Varying Maximum Heap Size

Previous work [43] indicates that the collector overhead depends on the relationship between the occupied heapsize and the maximum heap size. Additional experiments are therefore introduced to examine the impact of themaximum heap size on the collector overhead, noting that themaximum heap size was constant in the previousexperiments. The experiments use the workloads from Listings7.1, 7.2and7.3.

7.1.4.9 Experiment: Maximum heap size with object lifetime

Purpose Determine the dependency of the collector overhead on the maximum heap size in combination withvarying object lifetimes from Listing7.1.




1e+05 2e+05 3e+05 4e+05 5e+05

05

1015

2025

3035


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]



50000 100000 150000 200000 250000

010

2030

4050

60


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]




Parameters TotalComponents: 16 K; Random accesses: 0-1024; Maximum heap size: 64-2048MB.

Expected Results The collector overhead should decrease with increasing maximum heap size. The results ofExperiment7.1.4.1suggest that the overhead can also differ for different object lifetimes.

Measured Results The results on Figures7.38to 7.40show that for a heap with mostly short object lifetimes,the collection overhead generally decreases with increasing maximum heap size. Some exceptions to thetrend can be noted for small maximum heap sizes.

The results on Figures7.41to 7.42show similar behavior for a heap with mostly long object lifetimes.




0 10000 20000 30000 40000 50000 60000

0.00

000

0.00

005

0.00

010

0.00

015

0.00

020

0.00

025

Allocated objects

Line

ar c

oeffi

cien

ts

Figure 7.37: Linear coefficient of collector overhead with 1-64 K objectson Intel Server.

0 500 1000 1500 2000

010

2030

40

Maximum heap size [MB]

Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.38: Collector overhead with different maximum heap sizes, 0 random accesses and short object lifetimes onIntel Server.




0 500 1000 1500 2000

02

46

810


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.39: Collector overhead with different maximum heap sizes, 64 random accesses and short object lifetimes onIntel Server.

0 500 1000 1500 2000

0.0

0.5

1.0

1.5

2.0


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.40: Collector overhead with different maximum heap sizes, 1024random accesses and short object lifetimes onIntel Server.




0 500 1000 1500 2000

020

4060

80


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.41: Collector overhead with different maximum heap sizes, 0 random accesses and long object lifetimes onIntel Server.

0 500 1000 1500 2000

02

46

810


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.42: Collector overhead with different maximum heap sizes, 1024random accesses and long object lifetimes onIntel Server.




0.0e+00 5.0e+06 1.0e+07 1.5e+07

010

2030

40


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.43: Collector overhead with maximum heap size 64 MB, different allocation speed and short object lifetimes onIntel Server.

0.0e+00 5.0e+06 1.0e+07 1.5e+07 2.0e+07 2.5e+07 3.0e+07

05

1015


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.44: Collector overhead with maximum heap size 2048 MB, different allocation speed and short object lifetimeson Intel Server.

The same results are presented in graphs arranged to plot thecollection overhead against the allocation speedrather than the maximum heap size, for heap with mostly shortobject lifetimes on Figures7.43and7.44,and for heap with mostly long object lifetimes on Figures7.45and7.46.




200000 400000 600000 800000 1000000 1200000 1400000 1600000

020

4060

80


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.45: Collector overhead with maximum heap size 64 MB, different allocation speed and long object lifetimes onIntel Server.

0e+00 2e+06 4e+06 6e+06 8e+06

01

23

45


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.46: Collector overhead with maximum heap size 2048 MB, different allocation speed and long object lifetimeson Intel Server.




7.1.4.10 Experiment: Maximum heap size with heap depth

Purpose Determine the dependency of the collector overhead on the maximum heap size in combination withvarying depths of the heap from Listing7.2.


Parameters TotalObjects: 128 K; Random accesses: 0-1024; Maximum heapsize: 64-2048MB.

Expected Results The collector overhead should decrease with increasing maximum heap size. The results ofExperiment7.1.4.2suggest that the overhead can also differ for different depths of the heap.

Measured Results The results on Figures7.47and7.48show the expected decrease of overhead for shallowheap configuration.

The results on Figures7.49and7.50show similar behavior for deep heap configuration.

The same results are presented in graphs arranged to plot thecollection overhead against the allocation speedrather than the maximum heap size, for shallow heap configuration on Figures7.51and7.52, and for deepheap configurations on Figures7.53and7.54.

7.1.4.11 Experiment: Maximum heap size with heap size

Purpose Determine the dependency of the collector overhead on the maximum heap size in combination withvarying sizes of the heap from Listing7.3.


Parameters TotalObjects: 1-64 K; Random accesses: 0-1024; Maximum heap size: 64-2048MB.

Expected Results The collector overhead should decrease with increasing maximum heap size.

Measured Results The results on Figures7.55to 7.57show the expected decrease of collection overhead.

The same results are presented in graphs arranged to plot thecollection overhead against the allocation speedrather than the maximum heap size on Figures7.58to 7.61.

7.1.4.12 Constant Heap Occupation Ratio

With various parameters of the garbage collector, such as generation sizes or collection triggers, being relative tothe maximum heap size, it is possible that the collector overhead depends on the ratio between the occupied heapsize and the maximum heap size. Additional experiments are therefore introduced to examine the stability of thecollector overhead when both the occupied heap size and the maximum heap size change, but their ratio staysconstant. The experiments use the workloads from Listings7.1, 7.2and7.3.




0 500 1000 1500 2000

020

4060


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.47: Collector overhead with different maximum heap sizes, 0 random accesses and shallow heap configurationon Intel Server.

0 500 1000 1500 2000

02

46

8


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.48: Collector overhead with different maximum heap sizes, 1024random accesses and shallow heap configura-tion onIntel Server.




0 500 1000 1500 2000

020

4060


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.49: Collector overhead with different maximum heap sizes, 0 random accesses and deep heap configuration onIntel Server.

0 500 1000 1500 2000

02

46

8


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.50: Collector overhead with different maximum heap sizes, 1024random accesses and deep heap configurationon Intel Server.




2e+05 4e+05 6e+05 8e+05

020

4060


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.51: Collector overhead with maximum heap size 64 MB, different allocation speed and shallow heap onIntel Server.

500000 1000000 1500000 2000000 2500000

010

2030

40


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.52: Collector overhead with maximum heap size 2048 MB, different allocation speed and shallow heap onIntel Server.




2e+05 4e+05 6e+05 8e+05

020

4060


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.53: Collector overhead with maximum heap size 64 MB, different allocation speed and deep heap onIntel Server.

0 500000 1000000 1500000 2000000 2500000

010

2030

40


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.54: Collector overhead with maximum heap size 2048 MB, different allocation speed and deep heap onIntel Server.




0 500 1000 1500 2000

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.55: Collector overhead with different maximum heap sizes and 0 random accesses onIntel Server.

0 500 1000 1500 2000

05

1015

2025


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.56: Collector overhead with different maximum heap sizes and 64random accesses onIntel Server.




0 500 1000 1500 2000

01

23


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.57: Collector overhead with different maximum heap sizes and 1024 random accesses onIntel Server.

50000 100000 150000 200000 250000

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.58: Collector overhead with maximum heap size 64 MB and different allocation speed onIntel Server.




0e+00 1e+05 2e+05 3e+05 4e+05

05

1015

2025


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]



0e+00 1e+05 2e+05 3e+05 4e+05 5e+05 6e+05

01

23

4


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]






0e+00 1e+05 2e+05 3e+05 4e+05 5e+05 6e+05

01

23


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]






7.1.4.13 Experiment: Constant heap occupation with object lifetime

Purpose Determine the dependency of the collector overhead on the maximum heap size with constant occu-pation ratio, in combination with varying object lifetimesfrom Listing7.1.


Parameters TotalComponents: 16-256K; Random accesses: 0-1024; Maximum heap size: 64-1024MB. To-talComponents and maximum heap size are doubled together.

Expected Results If enough parameters of the garbage collector are relative to the heap occupation, expressedas a ratio of the occupied heap size and the maximum heap size,the dependency of the collector overheadon the maximum heap size with constant occupation ratio should be rather small.

Measured Results The results for a heap with mostly short object lifetimes aredisplayed on Figures7.62and7.63.

The results for a heap with mostly long object lifetimes are displayed on Figures7.64and7.65.

For both configurations, the overhead remains almost constant regardless of the maximum heap size, andalso regardless of the allocation speed. The results for varying allocation speeds are not presented in detailbut match this general observation as well.

7.1.4.14 Experiment: Constant heap occupation with heap de pth

Purpose Determine the dependency of the collector overhead on the maximum heap size with constant occu-pation ratio, in combination with varying depths of the heapfrom Listing7.2.


Parameters TotalObjects: 64-1024K; Random accesses: 0-1024; Maximumheap size: 64-1024MB. TotalOb-jects and maximum heap size are doubled together.


Measured Results The results for deep heap configuration are displayed on Figures7.66and7.67.

The results for shallow heap configuration are displayed on Figures7.68and7.69.

For both configurations, the overhead remains almost constant regardless of the maximum heap size, andalso regardless of the allocation speed. The results for other allocation speeds are not presented in detail butmatch this general observation as well.

7.1.4.15 Experiment: Constant heap occupation with heap si ze

Purpose Determine the dependency of the collector overhead on the maximum heap size with constant occu-pation ratio, in combination with varying sizes of the heap from Listing7.3.


Parameters TotalObjects: 64-1024K; Random accesses: 0-1024; Maximumheap size: 64-1024MB. TotalOb-jects and maximum heap size are doubled together.


Measured Results The results are displayed on Figures7.70to 7.72.

The overhead remains almost constant regardless of the maximum heap size, and also regardless of theallocation speed.




200 400 600 800 1000

010

2030

4050

Maximum heap size [MB] and total number of components [1/4 K]

Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.62: Collector overhead with different maximum heap sizes and object counts, 0 random accesses and objectswith short lifetimes onIntel Server.

200 400 600 800 1000

02

46

810

12


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.63: Collector overhead with different maximum heap sizes and object counts, 64 random accesses and objectswith short lifetimes onIntel Server.




200 400 600 800 1000

020

4060


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.64: Collector overhead with different maximum heap sizes and object counts, 0 random accesses and objectswith long lifetimes onIntel Server.

200 400 600 800 1000

010

2030

40


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.65: Collector overhead with different maximum heap sizes and object counts, 64 random accesses and objectswith long lifetimes onIntel Server.




200 400 600 800 1000

010

2030

4050

60

Maximum heap size [MB] and total number of objects [K]

Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.66: Collector overhead with different maximum heap sizes and object counts, 0 random accesses and deep heapon Intel Server.

200 400 600 800 1000

05

1015


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.67: Collector overhead with different maximum heap sizes and object counts, 256 random accesses and deepheap onIntel Server.




200 400 600 800 1000

010

2030

4050

60


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.68: Collector overhead with different maximum heap sizes and object counts, 0 random accesses and shallowheap onIntel Server.

200 400 600 800 1000

05

1015

20


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.69: Collector overhead with different maximum heap sizes and object counts, 256 random accesses and shallowheap onIntel Server.




200 400 600 800 1000

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.70: Collector overhead with different maximum heap sizes and object counts, 0 random accesses onIntel Server.

200 400 600 800 1000

05

1015

2025


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]






200 400 600 800 1000

01

23


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]



7.1.5 Artificial Experiments: Workload Compositions

The experiments with collector overhead dependencies suggest that the character of the dependency on heap sizeand allocation speed is similar for different workloads, but constants describing the dependency change. Additionalexperiments investigate whether this behavior persists for composition of workloads from Listings7.3and7.2.

7.1.5.1 Experiment: Allocation speed with composed worklo ad

The experiment runs the workloads from Listings7.3 and7.2 in composition, changing the ratio of allocationsperformed by each of the two workloads.

Purpose Determine the dependency of the collector overhead on the allocation speed in the composed work-load. .

Measured The overhead spent by the young and the tenured generation collector in code combined from List-ings7.3and7.2.

Parameters TotalObjects: 64 K for both workloads; Random accesses: 0-1024; Maximum heap size: 128 MB;Workload allocation ratio: 4:1-1:4.

Expected Results Experiments7.1.4.7and7.1.4.6show that the collector overhead depends on the workload.The results for the composed workload can provide additional insight:

• If the collector overhead does not change with different allocation ratios, it would indicate that thecollector overhead depends only on the live objects and not on the garbage. This is because the numberof live objects maintained by each workload stays constant during the experiment, changing the ratioof allocations performed by the two workloads influences only the garbage.

• If the collector overhead does change with different allocation ratios, the collector overhead dependsalso on distribution of the live objects or the garbage.

Measured Results The results of the experiment for deep heap configuration aredisplayed on Figures7.73to 7.75.

For shallow heap configuration, the results are displayed onFigures7.76to 7.78.




1e+05 2e+05 3e+05 4e+05

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.73: Collector overhead with allocation ratio 4:1 in favor of heap size workload and deep configuration onIntel Server.

1e+05 2e+05 3e+05 4e+05

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.74: Collector overhead with allocation ratio 1:1 and deep configuration onIntel Server.




1e+05 2e+05 3e+05 4e+05

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.75: Collector overhead with allocation ratio 1:4 in favor of heap depth workload and deep configuration onIntel Server.

1e+05 2e+05 3e+05 4e+05

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.76: Collector overhead with allocation ratio 4:1 in favor of heap size workload and shallow configuration onIntel Server.




1e+05 2e+05 3e+05 4e+05

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.77: Collector overhead with allocation ratio 1:1 and shallow configuration onIntel Server.

1e+05 2e+05 3e+05 4e+05

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.78: Collector overhead with allocation ratio 1:4 in favor of heap depth workload and shallow configuration onIntel Server.




50000 100000 150000 200000 250000 300000 350000

010

2030

4050

60


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.79: Collector overhead with 16 K objects heap depth workload and112 K from other workload and deep con-figuration onIntel Server.

The results are almost independent of the allocation ratio,suggesting that the collector overhead does notdepend on the garbage as produced by the two workloads. The dependency on the allocation speed isvery close to linear. For the deep heap configuration, the linear coefficients are ranging from 0.0001171 to0.0001179. For the shallow heap configuration, the range is from 0.0001205 to 0.0001213.

7.1.5.2 Experiment: Heap size with composed workload

The experiment runs the workloads from Listings7.3and7.2in composition, changing the number of live objectsmaintained by each workload while keeping the total number of live objects constant.

Purpose Determine the dependency of the collector overhead on the ratio of live objects maintained by indi-vidual workloads.

Measured The overhead spent by the young and the tenured generation collector in code combined from List-ings7.3and7.2.

Parameters TotalObjects: 16-112K for first workload, 112-16K for second workload; Random accesses: 0-1024; Maximum heap size: 128 MB; Workload allocation ratio:1:1.

Expected Results Experiments7.1.4.7and7.1.4.6show that the collector overhead depends on the workload.Experiment7.1.5.1suggests that the collector overhead does not depend on the garbage. Assuming that thecollector overhead is associated with traversing the live objects, the overhead of traversing the objects in thecombined workload should consist of the overheads of traversing the objects in the individual workloads.The expectation therefore is that the overhead will be highest with most live objects maintained by workloadfrom Listing7.3and lowest with most live objects maintained by workload from Listing7.2.

Measured Results The results of the experiments for deep heap configuration are displayed on Figures7.79to 7.81.

For shallow heap configuration, the results are displayed onFigures7.82to 7.84.

All results, inluding those not displayed here for sake of brevity, exhibit a linear dependency of the collectoroverhead on the allocation speed. The dependency of the linear coefficient on the ratio of live objects




1e+05 2e+05 3e+05 4e+05

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.80: Collector overhead with 64 K objects from both workloads anddeep configuration onIntel Server.

1e+05 2e+05 3e+05 4e+05 5e+05 6e+05

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.81: Collector overhead with 112 K objects heap depth workload and 16 K from other workload and deep con-figuration onIntel Server.




50000 100000 150000 200000 250000 300000 350000

010

2030

4050

60


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.82: Collector overhead with 16 K objects heap depth workload and112 K from other workload and shallowconfiguration onIntel Server.

1e+05 2e+05 3e+05 4e+05

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.83: Collector overhead with 64 K objects from both workloads andshallow configuration onIntel Server.




1e+05 2e+05 3e+05 4e+05 5e+05 6e+05

010

2030

4050


Col

lect

or o

verh

ead

[% −

10

sec

Avg

]


Figure 7.84: Collector overhead with 112 K objects heap depth workload and 16 K from other workload and shallowconfiguration onIntel Server.

2e+04 4e+04 6e+04 8e+04 1e+05

0.00

008

0.00

010

0.00

012

0.00

014

Heap depth workload objects

Line

ar c

oeffi

cien

ts

Figure 7.85: Linear coefficients as dependency of object count from workload with heap depth and deep configuration onIntel Server.




2e+04 4e+04 6e+04 8e+04 1e+05

0.00

008

0.00

010

0.00

012

0.00

014

Heap depth workload objects

Line

ar c

oeffi

cien

ts

Figure 7.86: Linear coefficients as dependency of object count from workload with heap depth and shallow configurationon Intel Server.

maintained by the individual workloads is displayed on Figure 7.85 for deep heap configuration and onFigure7.86for shallow heap configuration.

The dependency of the linear coefficient on the ratio of live objects maintained by the individual workloadsis almost linear again. The results also correspond with theexpectations in that presence of live objects fromListing 7.3is associated with smaller overhead than presence of live objects from Listing7.2.




Chapter 8

Predicting the Impact of ProcessorSharing on Performance

In real-time systems, it is essential to be able to measure where the processor is spending its cycles on afunction-by-functionor task-by-task basis, in order to evaluate the impact of processor sharing on system’s (timing)performance. Hence, the shared processor’s utilization determines the load to the (legacy) real-time system.

In this chapter, we investigate the response time of a processor sharing model. Analysis of such model ismotivated by capacity planning and performance optimization of multi-threaded systems. A typical illustration ofa system that adopts a multi-threaded real-time system architecture is given below, in Example8.2.

Typical examples of quality of service (QoS) of such systemsare, e.g, that the mean response time must be lessthan 4 seconds, and at least 90 % of requests must be respondedwithin 10 seconds. Therefore, accurate evaluationof response times is a key to choose appropriate number of threads and meet the performance requirements.

Earlier research at MRTC has produced a discrete event simulation framework for prediction of response timesand other measurable dynamic systems properties, such as queue lengths. The original purpose of this simula-tor was impact analysis with respect to such runtime properties for existing complex embedded systems. Thissimulator is however more general than its original purpose.

The simulated model, which is specified in C, describes the threads executing on a single processor. The centralpart of the simulation framework is an API that provide typical operating system services to the threads, such asinter-process communication, synchronization and threadmanagement, using an explicit notion of processor time.This means that the simulated threads are responsible for advancing the simulation clock (an integer counter) bycalling a specific API function,execute, which corresponds to consumption of processor time. The amount ofprocessor time to consume is specified as a probability distribution, which is derived from measurements of themodeled system, when available.

The simulator records a trace of the simulation, which is displayed by the Tracealyzer tool [24]. Apart fromthe visualizations possible in the Tracealyzer (scheduling trace, processor load graph, communication graph), theTracealyzer can also export data to text format to allow for analysis in other tools.

8.1 Simulation Example

A trivial model contain only two C functions,modelinit and a thread function, containing only a singleexecutestatement, as presented in Example8.1. This will produce a model containing two threads (or tasks,as oftenreferred in the RT community). One application-thread named task1, which executes periodically every 40000time-units with a constant execution time of 10000 time-units and one idle-thread, which always is implicitlycreated. Since the simulator uses fixed-priority scheduling (very common for embedded systems), the idle-threadis only executed whentask1is dormant.

Listing 8.1: Example of trivial model

1 void threadfunc()2 {3 while (1) {4 execute(10000);




5 sleep(30000);6 }7 }8

9 void model_init()10 {11 createTask( "TASK1", 1, 0, 0, 0, 0, threadfunc);12 }

A constant execution time is however not very realistic, in practice there are variations due to input data andhardware effects. This can be modeled in two ways, either by specifying the execution time as a uniform probabilitydistribution, or by specifying a data-file containing an empirical distribution, i.e., measurement from which valuesare sampled. For typical service oriented systems, heavy-tailed distributions such as Pareto or Log-Normal wouldbe appropriate service time approximations.

The relevant example requires several threads/tasks. In Example8.2, we have specified a client-server casecontaining one server thread, two other threads of higher priority and a client, which runs on a different processor.The clientsession thread stimulates the server without consuming processor time, i.e. using theexecutestatement.This thread thereby becomes invisible in this simulation.

ThecreateTaskfunction require several parameters which needs explanation. The first parameter is the nameof the thread/task, for display in the Tracealyzer. The second parameter is the scheduling priority, where lower ismore significant. The third is periodicity. If -1 is specified, the thread function (which is the last parameter) is onlycalled once, on simulation start. The fourth parameter is release time offset and the fifth parameter is release jitter,i.e. a random variation in release time. The sixth and last parameter is the thread function.

A simulation of this model is very fast due to the high level ofabstraction and the C implementation. Inthis experiment, 100.000.000 time units is simulated in 62 ms on a HP 6220 laptop, 2 GHz, 2 GB RAM. Thiscorresponds to about 110.000 simulator events (i.e. context switches, messages, etc). The simulation result can beinspected in the Tracealyzer, as depicted in Figure8.1. The processor usage graph presents the processor usageover the whole recording, which gives a complete overview. The processor usage is presented in an accumulatedmanner, so for each time interval both the processor usage ofthe individual tasks and the total processor usageis illustrated. It is possible to view only specific selectedtasks, and study their processor usage in isolation. TheTracealyzer can generate a report with the summary of the processor usage and timing properties, such as maximumor average execution time, of all or selected tasks. The report can then serve for evaluating the impact of the degreeof processor sharing on system performance (real-time response times of tasks). For our client-server example(Example8.2), the report is presented in Section8.3.

In addition, the Tracealyzer can export response time data and other properties to text format. It is thereby pos-sible to generate diagrams on properties like response times, as illustrated by Figure8.2, Figure8.3and Figure8.4.The histograms are generated using XLSTAT, a plugin for Microsoft Excel.

8.2 Simulation Optimization

Traditional probabilistic simulation is good for estimating the average case behavior, but however not suitablefor finding extreme values of a task’s response time. Due to the often vast state space of the models concerned,this random exploration is unlikely to encounter a task response-time close to the worst case. We have thereforeproposed a new approach [34] for best-effort response time analysis targeting extremevalues in timing properties.

The proposed approach usesmetaheuristicsearch algorithm, namedMABERA, on top of traditional probabilis-tic simulation, in order to focus the simulations on parts ofthe state-space that are considered more interesting,according to a heuristic selection method. Meta-heuristics are general high level strategies for iterative approx-imation of optimization problems. The search technique used by MABERA is related to two commonly usedtechniques, genetic algorithms and evolution strategies.We do not claim that MABERA is optimal, many im-provements are possible. The ambition with MABERA is to demonstrate the potential of extending probabilisticsimulation with a meta-heuristic search technique for the purpose of best-effort response-time analysis.




Listing 8.2: Example, client-server case

1 MBOX ServerQ;2 MBOX ClientQ;3

4 void server()5 {6 while (1) {7 int request = recvMessage( ServerQ, FOREVER);8 switch( request) {9 case START:

10 execute( uniform(1000, 1200));11 break;12 case STOP:13 execute( uniform(400, 500));14 break;15 case GETDATA:16 execute( uniform(9900, 12300));17 break;18 }19 }20 }21

22 void sensorPoll()23 {24 execute( uniform(200, 600));25 }26

27 void client_session()28 {29 int i;30

31 sendMessage( ServerQ, START, 0);32 delay( uniform(20000, 30000));33 for ( i = 0; i < uniform(3, 5); i++) {34 sendMessage( ServerQ, GETDATA, 0);35 delay( uniform(20000, 30000));36 }37 delay( uniform(20000, 30000));38 sendMessage( ServerQ, STOP, 0);39 }40

41 void model_init()42 {43 ServerQ = createMBOX( "ServerQ", 10);44 ClientQ = createMBOX( "ClientQ", 10);45

46 createTask( "Server", 10, -1, 0, 0, server);47 createTask( "Client", 0, 100000, 0, 50000, client_session);48

49 createTask( "sensor1", 1, 4000, 10000, 0, sensorPoll);50 createTask( "sensor2", 1, 4000, 0, 0, sensorPoll);51 }




Figure 8.1: Simulation trace in the Tracealyzer tool.

We have compared this approach with traditional probabilistic simulation, with respect to the discovered re-sponse times. The comparison was based on 200 replications of each simulation method on a fairly complexmodel, containing 4 application tasks and 2 environment tasks which emulate a remote system. The results fromthe MABERA analysis is presented in Figure8.5, as a histogram of relative frequency of response times, groupedin steps of 50 ms.

The highest response time discovered by MABERA was 8349 and the mean value was 8045. The highest peakcorresponds to values of 8324. Note that 47 % of the replications gave this result, so this result would most likelybe detected in only 2 or 3 replications, which only takes about 20 minutes on the relatively slow computer we used.

The corresponding result from traditional simulation is presented in Figure8.6. The highest response timediscovered by probabilistic simulation was 7929 and the mean value was 7593. Moreover, the results from prob-abilistic simulation follows a bell-shaped curve, where most results are found close to the mean value and only0.5 % of the results are above 7800, while 47 % of the MABERA results are close to the highest discovered value,8349.

These results indicate that the proposed MABERA approach issignificantly more efficient in finding extremeresponse times for a particular task than traditional probabilistic simulation.

8.3 Generated Statistics Report for Performance Analysis

The parameters that can influence the system’s performance,when multiple tasks share the same processor, include:

Priority The OS scheduling priority of the task, lower value is ”better”.




0

0,05

0,1

0,15

0,2

0,25

0,3

0,35

0 500 1000 1500 2000

Response Time

Rela

tive fre

qu

en

cy

Figure 8.2: Response time histogram, Server START.

0

0,05

0,1

0,15

0,2

0,25

0,3

0 5000 10000 15000 20000

Response time

Rela

tive fre

qu

en

cy

Figure 8.3: Response time histogram, Server GETDATA.




0

0,05

0,1

0,15

0,2

0,25

0,3

0,35

0,4

0 200 400 600 800 1000 1200

Response Time

Rela

tive fre

qu

en

cy

Figure 8.4: Response time histogram, Server STOP.

0

0,05

0,1

0,15

0,2

0,25

0,3

0,35

0,4

0,45

0,5

7000 7200 7400 7600 7800 8000 8200 8400

Response time

Rel

ativ

e fre

quen

cy

Figure 8.5: Results – MABERA.




0

0,05

0,1

0,15

0,2

0,25

0,3

0,35

0,4

7000 7200 7400 7600 7800 8000 8200 8400

Response time

Rel

ativ

e fr

eque

ncy

Figure 8.6: Results – probabilistic simulation.

Processor usage The amount of processor time used by the task (in percent).

Count The number of instances (jobs/executions) of the task.

Fragments The number of fragments of each task instance (uninterrupted execution segments), not countinginterrupts.

Interrupts, density The number of interrupt requests in relation to the execution time of the task (requests/sec-ond).

Interrupts, average The average number of interrupt requests per task instance

Interrupts, max The maximum number of interrupt requests during a single task instance.

Figure8.7 shows the generated statistics report for Example8.2. Note that by execution time we mean theactual CPU time used for each task instance (in microseconds), and by response time we mean the real timebetween start and completion of each task instance (in microseconds). Also, in this report we omit the averagevalues for execution and response times, as well as the number of fragments.

Based on the results in the above table, we can conclude that if the client sendsGETDATAmessages to bufferServerQwith a higher frequency, the processor usage of the task server increases, entailing an increase of the task’sworst-case execution time.




Task Priority CPU

usage

Count Exec. Time

(Max)

Resp.Time

(Max)

Interrupts

(Max)

SERVER 10 43,152 1 12300 14000 2000

SENSOR 1 1 7,500 12500 600 600 0

SENSOR 2 1 7,500 12500 600 600 0

IDLE 255 41,848 1 0 0 0

Task Priority CPU

usage

Count Exec. Time

(Max)

Resp. Time

(Max)

Interrupts

CLIENT 0 100 2000 210014 210014 0

Figure 8.7: Task statistics.




Chapter 9

Conclusion

The purpose of this document is to quantify the impact of resource sharing on quality attributes, thus docu-menting the choice of resources to model in task T3.3 of the Q-ImPrESS project. Here, we summarize the impactof resource sharing on quality attributes in a table that lists, for each resource, what is the size and frequency of theobserved effects, and therefore what is the scale at which resource sharing impacts the quality attributes.

Register Content

Register content change any composition

Visibility: The overhead of register content change is unlikely to be influenced by componentcomposition, and therefore also unlikely to be visible as aneffect of component com-position.

Branch Predictor

Virtual function address prediction miss pipelined composition

Duration: 112 cyclesIntel Server, 75 cyclesAMD ServerFrequency: 1 per function invocationSlowdown: 39 %Intel Server, 27 %AMD ServerVisibility: The overhead can be visible in workloads with virtual functions of size comparable

to the address prediction miss, however, it is unlikely to bevisible with larger virtualfunctions.

Address Translation Buffers

Data address translation buffer miss pipelined composition

Duration: 2-21 cyclesIntel Server, 5-103 cyclesAMD Server20-65 cycles with L1 cache missIntel Server229-901 cycles with L2 cache missIntel Server

Frequency: 1 per page accessSlowdown: 200 % with L1 cache missIntel Server, 280 % with L1 cache missAMD Server




Visibility: The overhead can be visible in workloads with very poor locality of data references tovirtual pages that fit in the address translation buffer whenexecuted alone. Dependingon the range of accessed addresses, the workload can also cause additional cachemisses when traversing the address translation structures. The translation buffer misscan be repeated only as many times as there are the address translation buffer entries,the overhead will therefore only be significant in workloadswhere the number of dataaccesses per invocation is comparable to the size of the buffer.

Instruction address translation buffer miss pipelined composition

Duration: 19 cyclesIntel Server, 4-40 cyclesAMD ServerFrequency: 1 per page accessSlowdown: 367 %Intel Server, 320 %AMD ServerVisibility: The overhead can be visible in workloads with very poor locality of instruction ref-

erences to virtual pages that fit in the address translation buffer when executed alone.The translation buffer miss can be repeated only as many times as there are the addresstranslation buffer entries, the overhead will therefore only be significant in workloadswhere the number of instruction accesses per invocation is comparable to the size ofthe buffer.

Invalidating address translation buffer parallel composition

Slowdown: 75 %Intel Server, 160 %AMD ServerVisibility: The overhead can be visible in workloads with very poor locality of data references to

virtual pages that fit in the address translation buffer, when combined with workloadsthat frequently modify its address space.

Memory Content Caches

L1 data cache miss pipelined composition

Duration: 11 cyclesIntel Server12 cycles random sets, 27-40 cycles single setAMD Server

Frequency: 1 per line accessSlowdown: 150 % clean, 160 % dirtyIntel Server

400 %AMD ServerVisibility: The overhead can be visible in workloads with very good locality of data references

that fit in the L1 data cache when executed alone. The cache miss can be repeatedonly as many times as there are the L1 data cache entries, the overhead will thereforeonly be significant in workloads where the number of data accesses per invocation iscomparable to the size of the L1 data cache.

L2 cache miss pipelined composition

Duration: 256-286 cyclesIntel Server32-35 cycles random set, 16-63 cycles single setAMD Server

Frequency: 1 per line access without prefetchingSlowdown: 209 % clean, 223 % dirtyIntel Server

81 %AMD Server




Visibility: The overhead can be visible in workloads with very good locality of data referencesthat fit in the L2 cache when executed alone. The cache miss canbe repeated onlyas many times as there are the L2 cache entries (or pairs of entries on platform withadjacent line prefetch), the overhead will therefore only be significant in workloadswhere the number of data accesses per invocation is comparable to the size of the L2cache.

L3 cache miss pipelined composition

Duration: 208 cycles random set, 159-211 cycles single setAMD ServerFrequency: 1 per line accessSlowdown: 303 % clean, 307 % dirtyAMD ServerVisibility: The overhead can be visible in workloads with very good locality of data references

that fit in the L3 cache when executed alone. The cache miss canbe repeated only asmany times as there are the L3 cache entries, the overhead will therefore only be sig-nificant in workloads where the number of data accesses per invocation is comparableto the size of the L3 cache.

L1 instruction cache miss pipelined composition

Duration: 30 cyclesIntel Server20 cycles random sets, 25 cycles single setAMD Server

Frequency: 1 per line accessSlowdown: 460 %Intel Server, 130 %AMD ServerVisibility: The overhead can be visible in workloads that perform many jumps and branches

and that fit in the L1 instruction cache when executed alone. The cache miss can berepeated only as many times as there are the L1 instruction cache entries, the overheadwill therefore only be significant in workloads where the number of executed branchinstructions per invocation is comparable to the size of theL1 instruction cache.

Real workload data cache sharing (FFT) pipelined composition

Slowdown: 400 % read, 500 % writeIntel Server300 % read, 350 % writeAMD Server

Visibility: The overhead is visible in FFT as a real workload representative. The overhead de-pends on the size of the buffer submitted to FFT. In some cases, the interfering work-load can flush modified data, yielding apparently negative overhead of the measuredworkload.

Blind shared variable access overhead parallel composition

Duration: 72 cycles with shared cacheIntel Server90 cycles with shared packageIntel Server32 cycles otherwiseIntel Server

Frequency: 1 per accessSlowdown: 490 %Intel ServerVisibility: The overhead can be visible in workloads with frequent blindaccess to a shared vari-

able.

Reaching shared cache bandwidth limit parallel composition

Slowdown: 13 % with hits, 19 % with missesIntel ServerVisibility: The limit can be visible in workloads with high cache bandwidth requirements and

workloads where cache access latency is not masked by concurrent processing.

Sharing cache bandwidth parallel composition




Slowdown: 16 % both hit, 13 % misses with hits, 107 % hits with missesIntel Server5 % both hit, 0 % misses with hits, 49 % hits with missesAMD Server

Visibility: The impact can be visible in workloads with many pending requests to the sharedcache, where cache access latency is not masked by concurrent processing. The impactis significantly larger when one of the workloads misses in the shared cache.

Prefetching to the shared cache parallel composition

Slowdown: 63 %Intel ServerVisibility: The impact can be visible in workloads with working sets thatdo not fit in the shared

cache, but employ hardware prefetching to prevent demand request misses. Prefetch-ing can be disrupted by demand requests of the interfering workload, even if thoserequests do not miss in the shared cache.

Real workload data cache sharing (FFT) parallel composition

Slowdown: 10 % hitting, 70 % missing interference, small buffer43 % hitting, 148 % missing interference, large bufferIntel Server6 % hitting interference, large buffer, 6% missing interference, small bufferAMD Server.

Visibility: The overhead is visible in FFT as a real workload representative. The overhead issmaller when FFT fits in the shared cache and the interfering workload hits, and largerwhen FFT does not fit in the shared cache or the interfering workload misses.

Real workload data cache sharing (SPEC CPU2006) parallel composition

Slowdown: 26 % hitting, 90 % missing interferenceIntel ServerVisibility: The overhead is visible in both the integer and floating pointworkloads. The overhead

varies greatly from benchmark to benchmark, an interferingworkload that misses inthe shared cache has a larger impact than an interfering workload that hits.

Memory Buses

Reaching memory bus bandwidth limit parallel composition

Limit: >5880MB/sIntel ServerVisibility: The limit can be visible in workloads with high memory bandwidth requirements and

workloads where memory access latency is not masked by concurrent processing.

File Systems

Collected Heap

Collector overhead when increasing object lifetime any composition

Overhead: change of 32% on client, 28% on serverDesktopVisibility: The overhead change can be visible in components with dynamically allocated objects

kept around for short durations, especially durations determined by outside invoca-tions.

Collector overhead when increasing heap depth any composition




Overhead: change of 60%DesktopVisibility: The overhead change can be visible in components whose dynamically allocated ob-

jects are linked to references provided by outside invocations, especially when suchreferences connect the objects in deep graphs.

Collector overhead when increasing heap size any composition

Overhead: change of 20%DesktopVisibility: The overhead change can be visible in any components with dynamically allocated

objects.

As summarized in the table, the results show that resource sharing indeed impacts quality attributes – and whilethis statement alone is hardly surprising, the table contributes explicit limits of this impact for a wide spectrumof resources from memory content caches through file system resources to collected heap resources. The importof this statement becomes apparent when the work from task T3.3 is combined with task T3.1, which has beeninvestigating the prediction models to be used in the context of the Q-ImPrESS project [33].

A typical assumption made when applying a prediction model is that the quality annotations used to populatethe model are constant (this assumption is also visible in the simplified running example in deliverable D3.1, but isnot inherent to the sophisticated quality annotations outlined also in deliverable D3.1). Unless the entire spectrumof resources whose sharing impacts the quality attributes is included in the prediction model (which is not typicallydone), this assumption naturally does not hold.

Typically, the changes in quality attributes due to resource sharing therefore directly translate into the loss ofprecision within the prediction model, on a scale that can reach up to the limits listed in the summary table.

Acquiring detailed knowledge on the impact of resource sharing is the first step towards modeling this impact,as planned in the Q-ImPrESS project. Other issues to be solved include:

Modeling of individual resources. Modeling the impact of resource sharing requires modeling of individualresources. The availability of existing work on modeling ofindividual resources varies greatly from resourceto resource, from some apparently rarely modeled (collected heap) to some modeled rather frequently (mem-ory caches). Not all models of individual resources are directly applicable in the Q-ImPrESS project though,often because their output is expressed in units that are noteasily convertible into quality attributes (forexample cache miss count rather than cache miss penalty).

Description of resource utilization. The complex working of some resources requires a detailed description ofthe resource utilization by the modeled workload (for example stack distance profile of memory accesses).This description might not be readily available in the prediction scenarios envisioned by the Q-ImPrESSproject, since most of these scenarios take place during design stages of the software development process.

Integration into the prediction model. Integrating the models of individual resources directly into the predic-tion model is, in most cases, unlikely to work well, either due to incompatibility of modeling paradigms ordue to potential for state explosion. An iterative solutionof the models, or an integration into the simulationtechnologies available to the project partners, is envisioned as a feasible solution.

The listed issues were anticipated in the Q-ImPrESS projectproposal. The project plan provides the necessarycapacity for addressing the issues in both WP3 and WP4, the two workpackages that deal with prediction modelspecification, prediction model generation and model basedquality prediction.




Terminology

A glossary of frequently used terms follows. It should be noted that the definitions of the terms are used forconsistency throughout this document, but they can differ slightly from the definitions used elsewhere.

Processor Execution Core

Pipeline Queue of partially processed machine instructions.

Speculative Execution Program execution where some preconditions have been guessed and the executioneffects might need to be discarded should the guesses turn out to be wrong.

Superscalar Execution Program execution where multiple operations can be processed at the same time bydifferent processing units.

System Memory Architecture

Associativity Determines in which cache lines the data can be placed. The extremes are direct mapped (oneplace) and fully associative (all places) caches. A common compromise are theN-way set-associative caches.

Coherency When multiple caches of the main memory are used, the caches are said to be coherent if theircontent is synchronized to eliminate outdated copies of themain memory.

Critical Word First (CWF) Because transfers between memory caches and main memory need multiple mem-ory bus cycles to fetch a cache line, the data further in the line could take longer to become available, and acache miss penalty would therefore depend on the offset of the accessed data in the line. To mitigate this, aprocessor may employ the Critical Word First protocol, which transfers the accessed data first and the restof the cache line afterwards [8, page 16].

Index A number assigned to a cache line set. When searching for datain the cache, the address of the data isused to obtain the index of cache line set that is searched.

Least Recently Used (LRU) A cache replacement policy that always evicts the least recently accessed cacheentry.

Page Walk The process of traversing the hierarchical paging structures to translate a virtual address to a physi-cal address after an address translation buffer miss.

Pseudo Least Recently Used (PLRU) A cache replacement policy that mostly evicts the least recently ac-cessed cache entry, an approximation of LRU.

Replacement Policy Determines the cache line to store the data that is being brought the cache, possibly evict-ing the previously stored data. In a limited associativity cache, the policy is split into choosing a set in thecache and then choosing a way in the set.

Set In a limited associativity cache, each set contains a fixed number of cache lines. Given a pair of data andaddress, the data can be cached only in a single set selected directly by the address.

Translation Lookaside Buffer (TLB) A cache of translations from virtual addresses to physical addresses.

Way In a limited associativity cache, cache entries of a set are called ways.




Virtual Machine

Compacting Collector A garbage collector that moves live objects closer togetherwithin a heap area to avoidfragmentation.

Copying Collector A garbage collector that evacuates live objects from one heap area to another to avoidfragmentation.

Garbage An object that occupies space on the heap but can not be used bythe application, typically because itis not reachable from any root object.

Generation A group of objects with similar lifetime.

Generational Collector A garbage collector that introduces optimizations based onstatistical properties ofgenerations. Frequently applied optimization is separatecollection of individual generations.

Live Object An object that occupies space on the heap and can be used by theapplication, typically because itis reachable from some root object.

Mark And Sweep Collector A garbage collector that uses the marking pass to mark live objects and the sweep-ing pass to free garbage.

Mutator A process that modifies the heap, potentially interfering with the progress of the garbage collector.

Root An object that can be directly accessed by an application. A global variable or an allocated local variableis considered a root for the purpose of garbage collection.




References

[1] Intel 64 and IA-32 Architectures Optimization Reference Manual, Order Number 248966-016, Intel Corpo-ration, Nov 2007

[2] Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture, Order Num-ber 253665-027, Intel Corporation, Apr 2008

[3] Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A: Instruction Set Reference, A-M,Order Number 253666-027, Intel Corporation, Apr 2008

[4] Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B: Instruction Set Reference, N-Z,Order Number 253667-027, Intel Corporation, Apr 2008

[5] Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming, Part 1,Order Number 253668-027, Intel Corporation, Jul 2008

[6] Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B: System Programming, Part 2,Order Number 253669-027, Intel Corporation, Jul 2008

[7] Intel 64 and IA-32 Architectures Application Note: TLBs, Paging-Structure Caches, and Their Invalidation,Order Number 317080-002, Intel Corporation, Apr 2008

[8] Intel 5000P/5000V/5000ZChipset Memory Controller Hub (MCH): Datasheet, Document Number 313071-003, Intel Corporation, Sep 2006.

[9] Doweck, J.:Inside Intel Core Microarchitecture, Intel Corporation, 2006.

[10] AMD64 Architecture Programmer’s Manual Volume 2: System Programming, Publication Number 24593,Revision 3.14, Advanced Micro Devices, Inc., Sep 2007.

[11] AMD CPUID Specification, Publication Number 25481, Revision 2.28, Advanced Micro Devices, Inc., Apr2008.

[12] AMD BIOS and Kernel Developer’s Guide For AMD Family 10h Processors, Publication Number 31116,Revision 3.06, Advanced Micro Devices, Inc., Mar 2008.

[13] AMD Software Optimization Guide for AMD Family 10h Processors, Publication Number 40546, Revision3.06, Advanced Micro Devices, Inc., Apr 2008.

[14] AMD Family 10h AMD Phenom Processor Product Data Sheet, Publication Number 44109, Revision 3.00,Advanced Micro Devices, Inc., Nov 2007.

[15] Kessler, R. E., Hill, M. D.:Page placement algorithms for large real-indexed cache, ACM Transactions onComputer Systems, Vol. 10, No. 4, 1992.

[16] Drepper, U.: What every programmer should know about memory,http://people.redhat.com/drepper/cpumemory.pdf , 2007.

[17] Memory Management in the Java HotSpot Virtual Machine, Sun Microsystems, Apr 2006.

[18] Detlefs, D., Printezis, T.:A Generational Mostly-Concurrent Garbage Collector, Report Number SMLITR-2000-88, Sun Microsystems, Jun 2000.


http://people.redhat.com/drepper/cpumemory.pdf



[19] Java SE 5 HotSpot Virtual Machine Garbage Collection Tuning, Sun Microsystems.

[20] Java SE 6 HotSpot Virtual Machine Garbage Collection Tuning, Sun Microsystems.

[21] Printezis, T.:Garbage Collection in the Java HotSpot Virtual Machine, Sun Microsystems, 2005.

[22] http://sourceforge.net/projects/x86info

[23] http://ezix.org/project/wiki/HardwareLiSter

[24] http://www.tracealyzer.se

[25] http://icl.cs.utk.edu/papi

[26] http://user.it.uu.se/ mikpe/linux/perfctr

[27] http://www.fftw.org

[28] http://www.kernel.org

[29] http://www.spec.org/jvm2008/

[30] http://www.spec.org/cpu2006/

[31] Becker, S., Desic, S., Doppelhamer, J., Huljenic, D., Koziolek, H., Kruse, E., Masetti, M., Safonov, W.,Skuliber, I., Stammel, J., Trifu, M., Tysiak, J., Weiss, R.:Requirements Document, Q-ImPrESS Deliverable1.1, Jun 2008.

[32] Becker, S., Bulej, L., Bures, T., Hnetynka, P., Kapova,L., Kofron, J., Koziolek, H., Kraft, J., Mirandola, R.,Stammel, J., Tamburrelli, G., Trifu, M.:Service Architecture Meta Model, Q-ImPrESS Deliverable 2.1, Sep2008.

[33] Ardagna, D., Becker, S., Causevic, A., Ghezzi, C., Grassi, V., Kapova, L., Krogmann, K., Mirandola, R.,Seceleanu, C., Stammel, J., Tuma, P.:Prediction Model Specification, Q-ImPrESS Deliverable 3.1, Nov2008.

[34] Kraft, J., Lu, Y., Norstrom, C., Wall, A.:A metaheuristic approach for best effort timing analysis targetingcomplex legacy real-time systems, Proceedings of RTAS 2008.

[35] Burguiere, C., Rochange, C.:A Contribution to Branch Prediction Modeling in WCET Analysis, Proceed-ings of DATE 2005.

[36] Fagin, B., Mital, A.: The Performance of Counter- and Correlation-Based Schemesfor Branch TargetBuffers, IEEE Transactions on Computers, Vol. 44, No. 12, 1995.

[37] Mitra, T., Roychoudhury, A.:A Framework to Model Branch Prediction for Worst Case Execution TimeAnalysis, Proceedings of WCET 2002.

[38] Pino, J. L., Singh, B.:Performance Evaluation of One and Two-Level Dynamic BranchPrediction Schemesover Comparable Hardware Costs, University of California at Berkeley Technical Report ERL-94-045,1994.

[39] Chen, J. B., Borg, A., Jouppi, N. P.:A Simulation Based Study of TLB Performance, Proceedings of ISCA1992.

[40] Saavedra, R. H., Smith, A. J.:Measuring Cache and TLB Performance and Their Effects on BenchmarkRuntimes, IEEE Transactions on Computers, Vol. 44, No. 10, 1995.

[41] Tickoo, O., Kannan, H., Chadha, V., Illikkal, R., Iyer,R., Newell, D.:qTLB: Looking Inside the Look-AsideBuffer, Proceedings of HiPC 2007.




[42] Shriver, E., Merchant, A., Wilkes, J.:An Analytic Behavior Model for Disk Drives with Readahead Cachesand Request Reordering, Proceedings of SIGMETRICS 1998.

[43] Blackburn, S. M., Cheng, P., McKinley, K. S.:Myths and Realities: The Performance Impact of GarbageCollection, Proceedings of SIGMETRICS 2004.


Project Deliverable D3.3 Resource Usage Modeling...Abstract The Q-ImPrESS project deals with...

Documents

Transcript of Project Deliverable D3.3 Resource Usage Modeling...Abstract The Q-ImPrESS project deals with...