
  • Proceedings

    Work-In-Progress Session of the 16th Euromicro Conference on

    Real-Time Systems

    June 30 - July 2, 2004

    Catania, Italy

    Organized by the

    Euromicro Technical Committee on Real-time Systems

    Edited by Steve Goddard

  • © Copyright 2004 by the authors


  • Table of Contents

    Message from the WIP Chair iv

    ECRTS'04 WIP 1: Wednesday June 30, 2004 / 15:30 – 16:30

    High Performance Memory Architectures with Dynamic Locking Cache for Real-Time Systems E. Tamura, F. Rodríguez, J.V. Busquets-Mataix, and A. Martí Campoy

    1

    Towards an Efficient Use of Caches in State of the Art Processors for Real-Time Systems Alexander von Bülow, Jürgen Stohr, and Georg Färber

    5

    Using State of the Art Multiprocessor Systems as Real-Time Systems - The RECOMS Software Architecture Jürgen Stohr, Alexander von Bülow, and Georg Färber

    9

    Specification of Real-Time Schedulers Odile Nasr, Miloud Rached, Jean-Paul Bodeveix, and Mamoun Filali

    13

    SymTA/S - Symbolic Timing Analysis for Systems Arne Hamann, Rafik Henia, Razvan Racu, Marek Jersak, Kai Richter, and Rolf Ernst

    17

    Mechanisms for Reflection-based Monitoring of Real-Time Systems Ricardo Barbosa and Luís M. Pinho

    21

    Ethernet-based Systems: Contributions to the Holistic Analysis Nuno Pereira and Eduardo Tovar

    25

    Using Existing Infrastructure as Proxy Support for Sensor Networks Jonas Neander, Mikael Nolin, and Mats Björkman

    29


  • ECRTS'04 WIP 2: Friday July 2, 2004 / 14:30 – 15:30

    QoS Control Framework: Improving Flexibility and Robustness in Consumer Terminals Laurenţiu M. Păpălău, Clara M. Otero Pérez, and Liesbeth Steffens

    33

    Tool Supported Development of Energy-aware or Quality-aware Real-time Applications for Embedded Systems Markus Ramsauer, Michael Coduro, and Raphael Hoffmann

    37

    A Novel Approach to Dynamic Voltage Scaling for Real-Time Systems L. Lo Bello, A. Di Gangi, R. Caponetto, and O. Mirabella

    41

    Load-Balancing for a Real-Time System Based on Asymmetric Multi-Processing Éric Piel, Philippe Marquet, Julien Soula, and Jean-Luc Dekeyser

    45

    A New Generalized Approach to Application-Defined Scheduling Mario Aldea Rivas and Michael González Harbour

    49

    Probability Distribution Function of Time Between Interrupt Requests Wojciech Noworyta

    53

    Exact Worst-Case Response Times of Real-Time Tasks under Fixed-Priority Scheduling with Deferred Preemption Reinder J. Bril, Wim F.J. Verhaegh, and Johan J. Lukkien

    57

    (m,k)-Firm Real-Time Distributed Transactions J. Haubert, B. Sadeg, and L. Amanton

    61


  • Message from the WIP Chair

    Dear Colleagues:

    Welcome to Catania and to the Work In Progress (WIP) sessions of the 16th Euromicro Conference on Real-Time Systems (ECRTS'04). I am pleased to present to you 16 excellent papers on work in progress that cover the topics of computer architectures, formal methods, real-time and energy-aware scheduling, distributed transactions, multi-processor real-time systems, embedded networks, performance monitoring, and quality of service support for real-time systems.

    The purpose of the ECRTS WIP sessions is to provide researchers in academia and industry an opportunity to discuss their evolving ideas and to gather feedback from the real-time community at large. I hope these sessions will prove beneficial to both the WIP presenters and the ECRTS'04 attendees.

    If you would like to reference any article included in the ECRTS'04 WIP Proceedings, please note that these proceedings are published as a Technical Report from the University of Nebraska-Lincoln, Department of Computer Science and Engineering (TR-UNL-CSE-2004-0010).

    I would like to thank the researchers who submitted their work to the ECRTS'04 WIP sessions, as well as the members of the ECRTS'04 Program Committee and others who reviewed the submissions.

    Steve Goddard
    ECRTS'04 WIP Chair
    June 2004



  • High Performance Memory Architectures with Dynamic Locking Cache for Real-Time Systems

    E. Tamura¹, F. Rodríguez², J.V. Busquets-Mataix², A. Martí Campoy²

    ¹ Grupo de Automática y Robótica, Pontificia Universidad Javeriana - Cali, Cali, Colombia
    [email protected]

    ² Departamento de Informática de Sistemas y Computadores, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia, España
    {prodrig, vbusque, amarti}@disca.upv.es

    Abstract

    In modern computers, memory hierarchies play a paramount role in improving the average execution time. However, this fact is not so important in real-time systems, where the worst-case execution time is what matters the most. System designers must use complex analyses to guarantee that all tasks meet their deadlines. As an alternative to making those complex analyses, it is proposed to build a memory hierarchy such that it provides high performance coalesced with high predictability. At the same time, the memory assist should imply small-scale modifications in the hardware. The solution is to be centred on instruction fetching since it represents the highest number of memory accesses.

    1. Introduction

    Cache memories are extensively used to reduce the increasing speed gap between processor and main memory because they provide lower average execution times by minimising the number of accesses to main memory. However, their use in real-time systems raises hard problems in calculating the Worst Case Execution Time (WCET) because cache memories present a dynamic, adaptive and non-predictable behaviour. This lack of determinism precludes the use of well-known schedulability analysis techniques in order to validate the temporal correctness of the system. Several methods and algorithms have been proposed to model the cache behaviour and include it in the WCET [1], and to take into account cache effects in the schedulability analysis [2, 3]. As an alternative to cache modelling, other cache schemas have been proposed to simplify the analysis, like the use of locking caches [4]. A locking cache presents several advantages:

    • Implementation of a locking cache is feasible and presents no construction complexity

    • A locking cache offers a fully predictable behaviour, allowing the use of simple and well-known techniques for schedulability analyses

    • Performance of a locking cache is similar to that of conventional, non-predictable caches

    Two ways of using locking caches have been proposed: static use of locking caches [5] and dynamic use of locking caches [6].

    In the static use of locking caches, cache contents, determined via the use of a genetic algorithm, are loaded and locked during system start-up, and those contents never change during system operation. Since the static use of locking cache is fully predictable, it is easy to analyse its behaviour.

    On the other hand, in the dynamic use of locking caches, cache contents are modified during system operation in a controlled way. Every time a task begins or resumes execution, the cache is flushed and then reloaded with a set of instructions belonging to the newly scheduled task. This set of instructions is always the same for each task, so the dynamic use of locking cache presents not just a high degree of predictability; at the same time, it improves on the performance of the static use of locking cache [7], because each task may use all of the space available in the whole cache for its own instructions, while in static use the cache space is shared by all of the tasks.

    Furthermore, in more than 210 experiments with assorted caches (varying parameters like cache size, line size, degree of associativity, miss and hit times) and several sets of tasks, the dynamic use of locking caches provided the same or better performance than conventional caches in about 60% of the tests [6]; in some cases, however, performance falls significantly. This loss of performance is due to the cost of loading and locking the cache contents every time a task begins or resumes its execution. The cost of loading and locking one memory block in cache is about five times the time needed to transfer a block from main memory to cache in a conventional cache, because the load is accomplished through the execution of a small routine, which is included in the scheduler.

    This paper explores several memory architectures in order to load and lock the cache via hardware mechanisms, thus reducing the time needed to load each block. Loading and locking memory blocks automatically demands modifications both in the main memory organization and in the cache memory.

    2. Rationale

    The system operation of the dynamic use of locking cache as proposed in [6] is the following:

    Every time the operating system schedules a new task, a small routine is executed. This routine reads from main memory the predefined set of addresses whose contents must be loaded in cache, and then loads the blocks associated to them; each block has the same size as the cache line size. After all of the blocks have been loaded, the locking cache is locked and the task begins/resumes execution. Both load and lock operations are accomplished through cache-management instructions.
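    As an illustration only, the reload step could look like the following sketch; the primitives cache_flush_all() and cache_load_and_lock(), and the per-task table of block addresses, are hypothetical names, not the paper's API:

        /* Hypothetical reload-and-lock routine, run by the scheduler each
         * time a task begins or resumes execution (names are illustrative). */
        #include <stddef.h>

        #define MAX_LOCKED_BLOCKS 64

        struct task_cache_map {
            size_t n_blocks;                  /* blocks to lock for this task */
            void  *block[MAX_LOCKED_BLOCKS];  /* predefined block addresses   */
        };

        /* Assumed wrappers around the cache-management instructions. */
        extern void cache_flush_all(void);
        extern void cache_load_and_lock(void *block_addr);

        void reload_locked_cache(const struct task_cache_map *map)
        {
            cache_flush_all();                      /* discard previous contents */
            for (size_t i = 0; i < map->n_blocks; i++)
                cache_load_and_lock(map->block[i]); /* one costly transfer each  */
        }

    Every call to cache_load_and_lock() implies the roughly fivefold block-transfer cost mentioned above, which is exactly what the hardware mechanisms explored next are meant to remove.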

    This way of operation is inefficient mainly for two reasons: first, several accesses to main memory are required for each block of main memory that must be transferred to cache, imposing a significant overhead. Second, the new task cannot be dispatched immediately after it is ready to execute, suffering a significant latency because all of its to-be-locked blocks must be loaded before it begins/resumes execution.

    In order to improve the performance of the dynamically locking cache, two basic requirements must be satisfied:

    a) Main memory blocks must be transferred to cache when fetched by the processor. That is, instructions will be loaded and locked in cache as the control flow of the program requires them, without invoking any piece of extra code.

    b) The latter implies that the cache memory controller must be able to identify the instructions to be locked in an automatic way. Blocks (belonging to any task) to be locked in cache must be marked before the system begins execution, during system design or system start-up.

    Should the previous requirements hold, task execution will begin as soon as possible, since there is no penalty involved by load and lock instructions, and there are no delays involved in identifying the blocks to be locked.

    3. Cache memory requirements

    Several commercial processors include locking caches in their memory hierarchies, but to the authors' knowledge there is none adequate enough to achieve the characteristics needed to get a predictable and high performance cache schema. The most interesting is the IDT 79RC64574 family of standalone processors, which include a two-way set associative cache with a locking mechanism that can be enabled/disabled on a per-line basis. This is all that is needed for the static use of locking cache, where the instructions locked in cache never change. In the dynamic use of locking cache, however, every time a new task is scheduled to run, cache contents must be reloaded from main memory. With a locking cache like the one provided in the IDT, it would be necessary to execute a small loop that writes the tag of each cache line, hence delaying task execution. So, in order to get the maximum performance, the following characteristics are proposed for a locking cache:

    • Cache should be locked on a per-line basis. That is, the system designer should be able to select the main memory block to be loaded in each cache line. However, the whole cache must be locked to guarantee predictability.

    • Every cache line includes an extra bit named Lock State Flag (henceforth, LSF) to signal whether the cache line is locked or not. If a cache line is locked, its content will not be replaced; otherwise, its behaviour is the same as that of a non-locking cache.

    • The LSF is automatically adjusted when the processor fetches the block, since the value of the LSF is somehow embedded in the instruction stream read from main memory.

    • By executing a processor instruction, the LSF is cleared for all the cache lines simultaneously, unlocking the entire cache in one operation. This is the only mechanism available to a programmer to clear the LSFs.

    • There exists a temporal buffer (or prefetch queue) with size equal to one cache line and the same behaviour as a conventional cache. This buffer improves the execution time of non-locked instructions by taking into account the spatial locality of non-locked memory blocks.

    When the scheduler dispatches a new task or resumes the execution of a pre-empted task, it only needs to clear the LSF of every cache line. After this, the cache is loaded and locked as instructions are fetched; the only penalty experienced is one miss in cache for every sequence of L instructions that were selected to be locked, where L is the cache line size.

    The described behaviour for the proposed locking cache is simpler than that provided by current, high-performance commercial processors, so implementation issues are not expected. Nonetheless, in order to have the desired behaviour without compromising the operation time (which should remain identical to the one provided by a conventional cache), the design of the locking cache memory must be performed very carefully.

    4. Main memory requirements

    The success of the proposed use of dynamic locking cache also relies on the main memory's ability to embed information about whether or not to lock its contents. The system designer must somehow add the value of the LSFs to the tasks' instructions. Embedding this flag is the major issue of the proposed schema. In the following, several alternatives are described:

    1. Embed the LSF in the instruction opcode. This proposal poses many problems, since the designer has to face the need to modify the instruction set repertory, and hence the processor decoding stage. Furthermore, sometimes it is not possible to accomplish this in a simple manner, e.g. by just rearranging the bits; in those cases where every combination of bits is already used, it is mandatory to increase the processor instruction word size, which might lead, in turn, to wider data buses. Additionally, it also requires modifying the compiler back-end.

    2. The next approach is more software-oriented. Since the blocks to lock are known beforehand, it is feasible to embed some sort of header at the beginning of the binary image with the corresponding map of blocks that should be loaded and locked for every task: if a task occupies m blocks of main memory, it requires a map of size m bits, plus some delimiter to mark the beginning and end of each map; if a bit is set, it means that the block must be locked into the cache, and not loaded into cache otherwise. The main advantage of this proposal is that it requires a minimum amount of storage. Its main disadvantage is that every time the processor fetches an instruction, the cache memory controller needs to access the map, which represents a delay. This delay, however, would be shorter if the map is stored in some sort of look-aside memory; in case the whole memory map does not fit into the look-aside memory, this schema may introduce a significant lack of determinism and increase the complexity of schedulability analysis. In addition, the implementation of this alternative would require modifications in the compiler back end, the linker, and the loader.

    3. Increase the memory word size by one bit, which will store the Instruction Lock State Flag (ILSF). Each ILSF is adjusted when the program is loaded into main memory; if it is set, it means that the corresponding instruction must be locked. Whenever the processor fetches an instruction, the ILSF is also copied to the instruction cache, so there is no penalty involved in execution time. The main disadvantage is that the data memory bus between the main memory and the cache memory has to be one bit wider; also, the compiler, linker, and loader require modifications. Besides, notice that space is wasted: in this schema each instruction has a corresponding bit, but just one bit is required per memory block; hence m(L - 1) bits are wasted, where L is the number of instructions per main memory block and m is the number of blocks.

    4. Use a memory organization following the layout proposed in Stanford's TORCH [8]. In this architecture, groups of eight instructions are preceded by eight extension bytes that provide information about dynamic instruction scheduling. In a similar vein, it is possible to group together some instructions into a parcel. There are two possibilities in terms of gathering together the instructions: a) the number of memory blocks in a parcel is constant, and b) it is feasible to have parcels of varying size. In any case, each parcel is preceded by one Locking State Word (LSW), which contains the locking state information for every memory block in the parcel. Each bit in the LSW determines the state of one memory block. The number of blocks per parcel cannot exceed the instruction word size, w. In this approach the main disadvantage is that every time there is a cache miss, the worst-case penalty will be equal to two main memory accesses, one to read the instructions and one more to read the concerning LSW. Furthermore, in case b), calculating the LSW address requires an extra access to a table to know the size of the current parcel. In both cases, the drawback could be alleviated by caching the LSW into the cache controller, but doing so would require a more detailed WCET analysis to get tighter results. On the processor side, every datum located at an address corresponding to an LSW should be considered as a no-operation instruction. In addition, the compiler back end, and maybe the linker, have to suffer considerable changes to patch the resulting binary image. Last, but not least, whenever the amount of extra information required per instruction is substantial, the TORCH approach has its merits; yet in this case, for each main memory block just one bit is required.

    5. This approach stems from a combination of the previous two and provides an easy-to-model architecture, space efficiency, and no delays. Furthermore, it does not require any modification in the processing element, just in the memory system. In this approach an extra, dedicated memory, the Locking State Memory (LSM), is added to the memory subsystem. Its depth should be the same as the number of memory blocks in the main memory, and its width may be one. However, in order to use off-the-shelf 8-bit wide memories, the information for eight blocks (comprising 8 LSFs) will be stored in one LSW. Hence, given a main memory of depth d_MM = mL, where m is the number of blocks, the required LSM has depth d_LSM = m / 8.

    Let b_I be the number of bytes per instruction and L the cache line size (then each memory block has L instructions). Each parcel has eight memory blocks, so an LSW has information for eight memory blocks. The number of instructions I that corresponds to each LSW is then given by I = 8L, and the address a_LSW of any LSW is a multiple of I * b_I.

    Now, every time an instruction I_r at address a_r is referenced, the cache memory controller has to access the memory system and, at the same time, the LSM, to check the LSW at address a_LSWr, which is the address of the LSW that corresponds to m_r, the memory block that stores I_r; a_LSWr is obtained by stripping off the log2(I * b_I) least significant bits of a_r. Finally, it is necessary to extract the corresponding LSF within the LSW to determine whether to load and lock the memory block or not; it is given by the 3 bits to the left of the log2(L * b_I) least significant bits of a_r.

    In order to keep the same operation time provided by a conventional cache system, it is necessary to add an extra line to the cache memory data bus to carry the locking state information. On the other hand, only one 8-way multiplexer and some decoding stage are needed to address the LSM.
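    A small numeric sketch of this address arithmetic, assuming 4-byte instructions (b_I = 4) and 8-instruction cache lines (L = 8); these values and the helper names are ours, not the paper's:

        #include <stdint.h>

        #define B_I 4u               /* bytes per instruction (assumed)     */
        #define L   8u               /* instructions per block / cache line */
        #define I   (8u * L)         /* instructions covered by one LSW     */

        /* Address of the LSW for the instruction at address ar: strip the
         * log2(I * B_I) least significant bits (I * B_I is a power of two). */
        static uint32_t lsw_address(uint32_t ar)
        {
            return ar & ~(I * B_I - 1u);
        }

        /* LSF position inside the LSW: the 3 bits just above the
         * log2(L * B_I) offset bits of ar. */
        static uint32_t lsf_index(uint32_t ar)
        {
            return (ar / (L * B_I)) & 7u;
        }

    With these values a parcel covers 256 bytes, so an instruction at address 0x1234 belongs to the LSW at 0x1200, and its block's LSF is bit (0x1234 / 32) mod 8 = 1 of that LSW.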

    5. Conclusions and Future Work

    Automating the locking process in the dynamic use of the locking cache offers, in an intuitive way, faster execution times. Unfortunately, it is not so easy to provide the necessary mechanisms to embed the required information about locking states into the program code, since many components, in both the hardware and software areas, might be involved: on one side, the locking cache structure, the main memory organization, the data bus width, and even the processor itself; on the other, the compiler, the linker and the loader.

    The goal to pursue in the design of the memory system is then to incorporate this information without increasing the memory requirements in terms of storage, or slowing down the general operation of the memory hierarchy. Furthermore, the processor itself should not require significant modifications to its architecture.

    The last approach illustrated seems very promising in trying to adhere to the previous requirements. The resulting memory system has to be diligently tested and verified by means of thorough analyses and simulations and, in the end, by its implementation on an FPGA. Besides, it is necessary to evaluate the improvements and the amount of resources involved in order to ponder the cost-benefit of the proposed solution.

    6. References

    [1] F. Mueller, "Timing Analysis for Instruction Caches", Real-Time Systems Journal, 18(2-3), Kluwer Academic Publishers, Boston, USA, May 2000, pp. 217-247.

    [2] J.V. Busquets-Mataix, J.J. Serrano, R. Ors, P. Gil, and A. Wellings, "Adding instruction cache effect to schedulability analysis of preemptive real-time systems", in Proceedings of the 1996 Real-Time Technology and Applications Symposium, IEEE Computer Society, Boston, USA, June 1996, pp. 204-213.

    [3] C.G. Lee, J. Hahn, Y.M. Seo, S.L. Min, R. Ha, S. Hong, C.Y. Park, M. Lee, and C.S. Kim, "Analysis of Cache-Related Preemption Delay in Fixed-Priority Preemptive Scheduling", IEEE Transactions on Computers, 47(6), IEEE Computer Society, Los Alamitos, USA, May 2000, pp. 217-247.

    [4] A. Martí, A. Perles, and J.V. Busquets-Mataix, "Using Locking Caches in Preemptive Real-Time Systems", in Proceedings of the 12th Real-Time Congress on Nuclear and Plasma Sciences, IEEE Computer Society, Valencia, Spain, June 2001, pp. 157-159.

    [5] A. Martí, A. Perles, and J.V. Busquets-Mataix, "Static Use of Locking Caches in Multitask Preemptive Real-Time Systems", in IEEE/IEE Real-Time Embedded Systems Workshop (Satellite of the 22nd IEEE Real-Time Systems Symposium), London, UK, December 2001.

    [6] A. Martí, A. Perles, and J.V. Busquets-Mataix, "Dynamic Use of Locking Caches in Multitask, Preemptive Real-Time Systems", in Proceedings of the 15th World Congress of the International Federation of Automatic Control, Elsevier Science, Barcelona, Spain, July 2002.

    [7] A. Martí, S. Sáez, A. Perles, and J.V. Busquets-Mataix, "Performance Comparison of Locking Caches under Static and Dynamic Schedulers", in Proceedings of the 27th IFAC/IFIP/IEEE Workshop on Real-Time Programming, Lagow, Poland, May 2003.

    [8] M. Smith, M. Horowitz, and M. Lam, "Efficient Superscalar Performance Through Boosting", in Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM/IEEE, Boston, USA, October 1992, pp. 248-259.

    This work is supported by the Comisión Interministerial de Ciencia y Tecnología under project CICYT DPI 2003-08320-C02-01.


  • Towards an Efficient Use of Caches in State of the Art Processors for Real–Time Systems *

    Alexander von Bülow, Jürgen Stohr, Georg Färber

    Institute for Real–Time Computer Systems, Prof. Dr.–Ing. Georg Färber, Technische Universität München, Germany

    {Alexander.Buelow,Juergen.Stohr,Georg.Faerber}@rcs.ei.tum.de

    * The work presented in this paper is supported by the Deutsche Forschungsgemeinschaft as part of a research programme on "Real-time with Commercial Off-the-Shelf Multiprocessor Systems" under Grant Fa 109/15-1.

    Abstract

    State of the art processors like the Intel Pentium or AMD Athlon implement large cache memories. These caches bridge the gap between the high speed processors and the comparatively slow main memories. However, for use in real–time systems, caches are a source of predictability problems. A lot of progress has been achieved in coping with caches in real–time systems to determine safe and precise bounds of the execution times in the presence of cache memories. In this paper we present an extension to algorithms that aim to place code and data efficiently into cache memory. Our approach takes the architectural features of the implementation of the cache memory of state of the art processors into account. Furthermore, this approach makes it possible to lock regions of memory in cache although these processors do not provide this feature in hardware. The benefits and the problems of this approach are discussed.

    1. Introduction

    Present-day processors like the AMD Athlon or the Intel Pentium 4 are not designed to act in hard real-time systems. They are optimized to deliver a good performance in the average case. Nevertheless, there are some properties which make them interesting for use in real-time systems: they are very fast in comparison to other processor architectures, they are cheap in price, the technological progress goes on rapidly, and they are in most cases downwardly compatible.

    In a real–time system the correctness of a result depends on the date at which it is produced. Therefore it is essential to know these dates exactly. In order to prove the real–time capabilities of a system, one has to choose an adequate scheduling policy and to perform a real–time analysis. One determining parameter is the worst–case execution time (WCET) of each task of the system. The significant part of the formation of these WCETs on modern processors is whether the corresponding code and data are in cache when a task is executing.

    Determining the WCET on processors with caches is a challenging task, and extensive studies have been done on this. Our approach focuses not only on one level of cache or only on instruction caches; it treats the cache subsystem as an integrated whole. It can be seen as an extension to the algorithms presented in [6] and [7]. As a result of our extensions it is possible to predict tighter estimations of the WCET of real–time software. Furthermore, it is possible to lock certain code and data blocks in cache even though this feature is not supported in hardware. Throughout this paper we focus on the architecture of the AMD Athlon processor, but the results can easily be adapted to similar architectures like the Intel Pentium 4 or PowerPC processors.

    This paper is organized as follows: Section 2 gives an overview of current approaches in conjunction with cache management and methods that deal with the consideration of cache effects in real–time analysis. Section 3 shows how the cache memory system is organized in present-day processors and what difficulties arise for estimating execution times. In section 4 our approach is presented, and section 5 concludes the paper.

    2. Current Approaches

    Current approaches which deal with caches in real–time systems can be divided into two categories. The first category is cache analysis, where caches are used without any restrictions; these analysis techniques predict the worst-case impact of the caches on the schedulability of the system [4]. The second category is cache partitioning techniques, where portions of the cache are assigned to certain tasks to make their timing behavior predictable [5] [8] [9] [17]. Another approach is the use of cache locking techniques to lock several tasks in cache for the whole lifetime of the system (static cache locking) or to change the mapping dynamically (dynamic cache locking) [3] [13].


    Petrank and Rawitz showed in [10] that the problem of finding an optimal placement of contents in a cache memory, in the sense that it minimizes the number of cache misses, is NP-hard. The consequence is to find algorithms that optimize a target other than the minimization of cache misses.

    Hashemi et al. present in [7] an approach to optimize instruction cache usage. The optimization takes place at compile time. The method is based on a weighted call graph which represents the call structure and call frequency of the software. Additionally, it takes the procedure size, cache size and cache line size into account. This approach works for direct mapped and set associative caches. It can be extended to deal with basic blocks instead of procedures.

    Calder et al. present in [6] an approach, based on the one in [7], to place data in cache with the aim of minimizing the number of cache misses. This approach considers data on the stack, global data and dynamically allocated data.

    Sebek deals in [14] with the architecture of cache memories in general and its influence on real–time systems. A method to measure the cache related preemption delay (CRPD) on multiprocessor systems is presented in [15]. Because of the complexity of state of the art processors, it is impossible to get precise execution times without measuring them on the target system [11].

    Our approach extends the algorithms presented in [6] and [7] to consider the features of the cache implementation of the x86 processors. It enables cache locking even though it is not supported in hardware by these processors.

    3. Cache Memory System

    The AMD Athlon [1] processor is equipped with a large cache memory system, which is divided into two levels (L1 and L2 Cache), where the L1-Cache is split into one cache used for instructions only and one used for data only (Harvard architecture). The L2-Cache is unified, which means that it can be used for instructions and data simultaneously. The Athlon features an exclusive cache design, which means the whole size of the L2-Cache can be used for both code and data. Both caches are organized as n-way set associative caches, so n cache lines share one set. A cache line is a continuous block of memory which is loaded at once into cache when needed.

    The assignment of a position in memory to a place in cache corresponds with the address in main memory. The lower o bits of the physical address in memory form the offset within one cache line (cache line size = 2^o). The following s bits denote the set in cache, so the total number of sets in the cache is 2^s. The place for a cache line within the set depends on the replacement strategy of the cache and is transparent to software. The AMD Athlon uses a pseudo least recently used strategy. The remaining bits of the physical address are used as a tag to identify a block of memory in cache. Therefore software can assign a block of code to a certain set by putting it at the corresponding address in memory.
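    As a concrete sketch of this decomposition, with example values o = 6 (64-byte cache line) and s = 9 (512 sets); these constants are illustrative, not taken from the paper:

        #include <stdint.h>

        #define OFFSET_BITS 6u   /* o: 64-byte cache line (example value) */
        #define SET_BITS    9u   /* s: 512 sets           (example value) */

        static uint32_t line_offset(uint32_t paddr)  /* offset within line */
        {
            return paddr & ((1u << OFFSET_BITS) - 1u);
        }

        static uint32_t set_index(uint32_t paddr)    /* which set          */
        {
            return (paddr >> OFFSET_BITS) & ((1u << SET_BITS) - 1u);
        }

        static uint32_t tag_bits(uint32_t paddr)     /* identifies block   */
        {
            return paddr >> (OFFSET_BITS + SET_BITS);
        }

    Two addresses with the same set_index() compete for the same n ways, which is exactly the property the arrangement in section 4 exploits.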

    If there is no space left in the L1-Cache, one cache line is displaced into the L2-Cache and another from the L2-Cache to main memory. This means that one transfer from L1- to L2-Cache is needed and, much worse, one from the L2-Cache to main memory. A transfer from main memory to L1-Cache and a transfer from L1-Cache to L2-Cache are concurrent. If the contents of a cache line did not change while it was in cache, it only needs to be invalidated and no transfer to a lower memory level is needed.

    [Figure 1: Cache architecture of the AMD Athlon. Between main memory, L1-Cache and L2-Cache sits a victim buffer; a displacement involves 1. clearing the victim buffer (8 cycles), 2. a turnaround (2 cycles), 3. loading the cache line from the L2-Cache (8 cycles), and 4. another turnaround (2 cycles).]

    Between the L1- and L2-Cache the AMD Athlon processor implements a victim buffer, which is a small cache (8 entries) to compensate the differences in speed between the cache levels. Figure 1 shows the cache architecture of the AMD Athlon processor together with the execution times for certain transfers as stated by AMD [2] and verified by us. The time needed for a transfer from or to main memory depends on the underlying connection to the memory controller and the RAM architecture itself.

    Accesses to main memory not only last much longer than accesses to cache memory, they are also much more unpredictable. They are dependent on the physical type of main memory being used and on the concurrent data traffic from or to main memory caused by other devices.

    4. Main Memory Arrangement

    Scenario. The real–time system consists of a set of tasks T = {τ_1, ..., τ_n}. Each task τ_i consists of basic blocks b with attributes a_b ∈ {ro, rw, locked} and sizes s_b in bytes, together with the data items d used by τ_i, with attributes a_d and sizes s_d. The number of bytes s' needed by each item is its size rounded up to the alignment a (usually four bytes). The total number of cache lines needed by this system is

        n_lines = Σ_{i=1..n} ( Σ_{b ∈ τ_i} ⌈s'_b / s_line⌉ + Σ_{d ∈ τ_i} ⌈s'_d / s_line⌉ )

    where s_line denotes the size of a cache line in bytes. In the following we refer to each such block or datum as an item.
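    A minimal sketch of this bookkeeping, with an assumed 64-byte cache line and the 4-byte alignment mentioned above (the function names are ours):

        #include <stddef.h>

        #define ALIGN      4u    /* alignment a (usually four bytes)  */
        #define LINE_SIZE 64u    /* s_line: cache line size (assumed) */

        static size_t aligned_size(size_t s)   /* round s up to ALIGN */
        {
            return (s + ALIGN - 1) / ALIGN * ALIGN;
        }

        static size_t lines_needed(size_t s)   /* lines for one item  */
        {
            return (aligned_size(s) + LINE_SIZE - 1) / LINE_SIZE;
        }

        /* Total cache lines for all code blocks and data items. */
        static size_t total_lines(const size_t *item_sizes, size_t n_items)
        {
            size_t total = 0;
            for (size_t i = 0; i < n_items; i++)
                total += lines_needed(item_sizes[i]);
            return total;
        }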


    Our Approach. We extend the approaches presented in [6] and [7] to consider the attributes of the different items. These attributes reflect additional properties of the cache architecture. We arrange our items in such a way that only items which

    • are read-only (attribute ro) or

    • are writeable (attribute rw) or

    • should be locked (attribute locked)

    share one set in cache. Items which are writeable can cause a memory access when dismissed, but they do not need to do so. Therefore we can never underestimate the WCET if we state that all accesses to these sets cause a main memory access.

    The assembler code is used to analyze the real–time system. During this step we automatically locate all items, their sizes and attributes. We treat code in units of procedures (no code splitting). Data is treated in units of variables or structures, or as a single stack unit. Dynamic data allocation during task execution is not useful for real–time tasks and is not considered here. All these optimizations take place at link time or before the real–time system is started, in an initialization phase.

    Code is always read-only (self-modifying code is undesirable for predictable execution times), stack data is always read-write, and the remaining global data objects can be divided into read-only and read-write according to their section. In the next step we arrange all items according to the scheme mentioned above. We have to take care of the correct alignment and not to spread code and data items of one procedure (spatial locality). This is very important for the use of the Translation Lookaside Buffers (TLBs). These TLBs are small caches used for the calculation of the physical address from the virtual address. One entry can address 4 kbytes of memory (other page sizes are not used here), so it is important not to increase the TLB usage due to an unfavorable memory arrangement.
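    A sketch of how such an arrangement could be validated, under the assumption of a simple set mapping set = (addr / line_size) mod n_sets; all names and the checker itself are illustrative, not the paper's algorithm:

        #include <stddef.h>

        #define MAX_SETS 512            /* must be >= n_sets below */

        enum attr { RO, RW, LOCKED };

        struct item {
            enum attr attr;             /* ro, rw or locked        */
            size_t    addr;             /* assigned load address   */
            size_t    size;             /* aligned size in bytes   */
        };

        /* Returns 1 if no cache set is shared by items of different
         * attributes, i.e. the arrangement obeys the scheme above. */
        static int arrangement_ok(const struct item *items, size_t n,
                                  size_t line_size, size_t n_sets)
        {
            enum attr owner[MAX_SETS];
            int       used[MAX_SETS] = {0};

            for (size_t i = 0; i < n; i++) {
                size_t first = items[i].addr / line_size;
                size_t last  = (items[i].addr + items[i].size - 1) / line_size;
                for (size_t ln = first; ln <= last; ln++) {
                    size_t set = ln % n_sets;         /* simple set mapping   */
                    if (used[set] && owner[set] != items[i].attr)
                        return 0;                     /* set mixes attributes */
                    used[set] = 1;
                    owner[set] = items[i].attr;
                }
            }
            return 1;
        }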

    Discussion. This approach has two major advantages: displacements of read-only items never cause an access to main memory, which simplifies the analysis of cache related preemption delays (CRPD). This makes the real–time system more predictable and reduces the WCETs when executing code (remember that code is always read-only).

    The second advantage is that not only the places in code where a displacement can take place are exactly known, but also the type of displacement, with or without write-back to main memory. This is true for both code and data items. This knowledge, paired with the ascertainable times for a cache miss or a cache hit, can be used by known methods for calculating the WCET.

    This approach makes it possible to lock certain items (code or data) in cache. This feature is not supported in hardware by the AMD Athlon or Intel Pentium processors, but it can be very useful for hard real–time systems (see [13]). A drawback of this kind of cache locking is that all locations in main memory that would be placed in a set containing locked cache lines cannot be used any more. So if you reserve 10% of the cache for locked items, you lose 10% of main memory minus the space needed for the locked items. Therefore cache locking in this way should be used sparingly and only for small items like interrupt service routines.
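    A back-of-the-envelope sketch of that cost, with purely illustrative parameter values (512 MB of main memory, a 64 KB cache):

        #include <stdio.h>

        int main(void)
        {
            const double main_mem_mb = 512.0;   /* assumed main memory size */
            const double cache_kb    = 64.0;    /* assumed cache size       */
            const double locked_frac = 0.10;    /* 10% of the sets locked   */

            /* Every main-memory location mapping to a locked set becomes
             * unusable, except the locked items themselves. */
            double lost_mb = locked_frac * main_mem_mb
                           - locked_frac * cache_kb / 1024.0;
            printf("unusable main memory: %.2f MB\n", lost_mb);
            return 0;
        }

    Under these assumptions roughly 51 MB of main memory become unusable in exchange for 6.4 KB of locked cache, which is why the text recommends locking only small items.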

    Multiprocessor Systems. On a multiprocessor system this approach provides new options for the software architecture of a real–time system. With multiple processors the cache of each processor can be used for real–time jobs individually. This requires that each processor has its own part of main memory. In SMP (Symmetric Multi Processing) systems this can be achieved by a real–time operating system (RTOS) that splits the main memory into parts used by only one processor each. In computer systems that feature the NUMA (Non Uniform Memory Access) architecture, each processor has its own physical main memory, which makes it much easier to arrange certain jobs in the cache of certain processors. The usage of separate memory regions for each processor also minimizes the waste of main memory when locking sets into the cache.

    Furthermore, the effect of cache snooping can be avoided when using separate main memory regions. Cache snooping is a software-transparent technique to maintain data coherence between the processors. That means, if one processor changes the contents of one cache line and another processor has the same location of main memory in its cache, this technique automatically updates the contents in cache.

    5. Application Potential

    To study the run-time behavior we used a dual AMD Athlon computer (Athlon MP 1800+, 1533 MHz, 512 kbytes DDR RAM). We implemented the method of arranging the items of a real–time system as described in section 4. The computer system we use implements the RECOMS Software Architecture as described in [16].

    The execution times of the real–time software are determined by measurement. The measurement approach is based on the work described in [12]. We extended this approach to multiprocessor systems and optimized it to measure with as low an influence on the real–time software as possible. Describing all details of this approach is beyond the scope of this paper.

    One important result is that the difference in speed between the L1-Cache and the L2-Cache is very small. Therefore it is justified to regard the cache as one memory and not to distinguish between different levels of cache (provided the cache levels are on-die).

    Furthermore, we observed execution times of about 250 cycles for accesses to main memory (without any blocking times caused by parallel bus transfers to main memory). This means a factor of about 25 for the L1-Cache and a factor of about 12 for the L2-Cache in comparison to cache hits. This shows the need to avoid unnecessary main memory accesses caused by an unsuitable memory arrangement. This explicitly includes memory accesses to refill the TLBs.

    Conclusions. The approach presented in this paper extends the algorithms presented in [6] and [7] to consider the implementation of the cache architecture of the AMD Athlon and Intel Pentium processors. It assigns attributes to different code and data items and places only items with identical attributes into one set in cache. Furthermore, this method makes the locking of items in the cache possible. All these optimizations take place at link time or before the real–time system is actually started (stack allocation). This approach makes the execution times of the real–time software more predictable and becomes especially interesting if the whole real–time system fits into the cache, which is not unlikely regarding the cache sizes of state of the art processors.

    Future Work. The next steps will be a deeper study of the influence of different cache architectures (in particular trace caches) and main memory types on the WCET, to enhance the presented memory placement strategies for different hardware and software architectures. The correlation of cache contents and techniques like branch prediction or hyper-threading should be examined.

    References

    [1] AMD, Sunnyvale, CA, USA. AMD Athlon Processor x86 Code Optimization Guide, September 2000.

    [2] AMD, Sunnyvale, CA, USA. Whitepaper: AMD Athlon Processor and AMD Duron Processor with Full-Speed On-Die L2 Cache, June 2000.

    [3] A. Arnaud and I. Puaut. Towards a predictable and high performance use of instruction caches in hard real–time systems. In Proceedings of the Work-in-Progress Session of the 15th Euromicro Conference on Real-Time Systems, pages 61-64, Porto, Portugal, July 2003.

    [4] J.V. Busquets-Mataix, A. Wellings, J.J. Serrano, R. Ors, and P. Gil. Adding instruction cache effect to schedulability analysis of preemptive real–time systems. In IEEE Real–Time Technology and Applications Symposium (RTAS'96), pages 204-213, Washington-Brussels-Tokyo, June 1996.

    [5] J.V. Busquets-Mataix and A. Wellings. Hybrid instruction cache partitioning for preemptive real–time systems. In Proceedings of the 9th Euromicro Workshop on Real–Time Systems, pages 56-63, Toledo, Spain, June 1997.

    [6] B. Calder, K. Chandra, S. John, and T. Austin. Cache-conscious data placement. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), San Jose, 1998.

    [7] Amir H. Hashemi, David R. Kaeli, and Brad Calder. Efficient procedure mapping using cache line coloring. In SIGPLAN Conference on Programming Language Design and Implementation, pages 171-182, 1997.

    [8] David B. Kirk. SMART (strategic memory allocation for real–time) cache design. In Proceedings of the Real–Time Systems Symposium, pages 229-239, Santa Monica, California, USA, December 1989.

    [9] Jochen Liedtke, Hermann Härtig, and Michael Hohmuth. OS–controlled cache predictability for real–time systems. In Proceedings of the Third IEEE Real-Time Technology and Applications Symposium (RTAS'97), Montreal, Canada, June 9-11, 1997.

    [10] Erez Petrank and Dror Rawitz. The hardness of cache conscious data placement. In Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 101-112, Portland, Oregon, 2002.

    [11] Stefan M. Petters. Bounding the execution of real–time tasks on modern processors. In Proc. of the 7th Int. Conf. on Real–Time Computing Systems and Applications, Cheju Island, South Korea, December 12-14, 2000.

    [12] Stefan M. Petters. Worst Case Execution Time Estimation for Advanced Processor Architectures. PhD thesis, Institute for Real–Time Computer Systems, Technische Universität München, September 2002.

    [13] I. Puaut. Cache analysis vs static cache locking for schedulability analysis in multitasking real–time systems. In Proceedings of the 2nd International Workshop on Worst–Case Execution Time Analysis, in conjunction with the 14th Euromicro Conference on Real–Time Systems, Vienna, Austria, June 2002.

    [14] Filip Sebek. Cache memories and real–time systems. Technical report, Department of Computer Engineering, Mälardalen University, Västerås, Sweden, 2001.

    [15] Filip Sebek. Measuring cache related pre-emption delay on a multiprocessor real-time system. In Proceedings of the IEEE Workshop on Real-Time Embedded Systems (RTES'01), London, December 3, 2001.

    [16] Jürgen Stohr, Alexander von Bülow, and Georg Färber. Using State of the Art Multiprocessor Systems as Real-Time Systems – The RECOMS Software Architecture. In Proceedings of the 16th Euromicro Conference on Real-Time Systems – Work in Progress Session, Catania, Italy, June 2004.

    [17] Andrew Wolfe. Software-based cache partitioning for real–time applications. In Proceedings of the Third International Workshop on Responsive Computer Systems, September 1993.


  • Using State of the Art Multiprocessor Systems as Real–Time Systems – The RECOMS Software Architecture ∗

    Jürgen Stohr, Alexander von Bülow, Georg Färber

    Institute for Real–Time Computer Systems, Prof. Dr.–Ing. Georg Färber, Technische Universität München, Germany

    {Juergen.Stohr,Alexander.Buelow,Georg.Faerber}@rcs.ei.tum.de

    ∗ The work presented in this paper is supported by the Deutsche Forschungsgemeinschaft as part of a research programme on "Real-time with Commercial Off-the-Shelf Multiprocessor Systems" under Grant Fa 109/15-1.

    Abstract

    In this paper a software architecture for using a general purpose multiprocessor system based on the Intel x86 architecture as a real–time system is presented. This approach allows the execution of standard and real–time tasks in parallel. It is possible to separate and to minimize the influences of the hardware on worst–case execution times of real–time software; they are mainly caused by the processor architecture, the chip set and the bus systems being used. Compared to single processor systems, computation times become more deterministic and response times are shorter.

    1. Introduction

    Modern general purpose computer systems are not designed to act as hard real–time systems. The focus of optimization of these systems is to provide good average computing power. The minimization of maximum execution times of software remains unconsidered, as in most cases the users will not notice them. Optimization techniques which improve average performance but may result in poor worst case behavior are, for example, the use of caches, pipelines and fairness communication protocols. However, general purpose computer systems are very low priced compared to other architectures, combined with a high computing power, which makes them interesting for real–time usage. In addition, a computer system based on the Intel x86 architecture is in most cases downwardly compatible; therefore existing real–time software can easily be ported to newer hardware.

    If a multiprocessor system is used for hard real–time tasks, the worst case execution time (WCET) of software is affected in several ways: the optimization techniques used by a certain processor result in varying execution times; these are, for example, the use of caches and the branch prediction. The interdependencies of the CPUs and the chip set have to be taken into account, too. If, for instance, two CPUs access the main memory at the same time on an SMP (symmetric multiprocessing) system, one CPU may have to wait for the other. If a processor has to perform I/O, its execution is delayed until it is granted access to the bus.

    In this paper we present a software architecture for multiprocessor systems that allows the execution of a general purpose operating system (GPOS) and a real–time operating system (RTOS) in parallel on different CPUs. Thus it is possible to use all the comfort provided by the GPOS and to perform hard real–time tasks in parallel. This software architecture makes it possible to control and to minimize response times and the WCET of real–time software. It is shown that a general purpose multiprocessor system can be configured in a way which leads to more deterministic computation times and faster response times compared to single processor systems. Our main objective is to make the computing power of modern general purpose multiprocessor systems available for real–time systems.

    This paper is organized as follows: Section 2 gives an overview of current approaches for estimating the worst–case execution time of real–time software. Section 3 describes an existing method of running standard and real–time tasks in parallel, which was the origin of our research work. This method is improved by our software architecture presented in section 4. To prove the advantages of our concept, measurement results are presented in section 5. The paper ends with a conclusion in section 6.

    2. Current Approaches

    The estimation of the best and worst case execution time can be done by modelling the hardware or by a measurement based approach. Colin and Puaut investigate in [2] the branch prediction unit of the Intel Pentium processor. In order to reduce complexity, they switched off the caches provided by the processor. Stappert and Altenbernd use a processor model including caches and pipelines in [7] and [8], but they simplified the analysis by assuming linear code only. A methodology to analyze cache and pipeline behavior by abstract interpretation and pipeline modelling is presented in [4]. Here, only the inter-processor effects are investigated, without taking the influences of the chip set into account. Petters describes in [6] a measurement based approach to estimate worst case execution times. A program is divided into basic blocks. The computation time of each block is measured on a given hardware. In [1] an extreme value statistics approach is described in order to deal with the measured worst case execution time. A large number of measurements is taken and an extreme value statistics density function is matched to the measured values.

    In contrast to the above–mentioned publications, we try to adjust the real–time system in order to make its best and worst case execution times more deterministic. We take the whole architecture into account, regarding all the relevant aspects.

    3. Underlying Concept

    As we intend to provide a hard real–time system by adjusting the hardware and software of general purpose multiprocessor systems based on the Intel x86 architecture, we have chosen Linux with the real–time extension RTAI [3] as the basis of our work. The reason for this is the availability of the source code of both Linux and RTAI, and the functionality already provided by RTAI: it has real–time schedulers that support multiprocessor systems, and it can deal with semaphores, mailboxes and FIFOs.

    RTAI itself uses the concept of running a GPOS as the idle task of a real–time system. A tiny layer (HAL) is inserted between the hardware and the GPOS, adding the real–time functionality to the existing operating system. All interrupts are intercepted by the RTAI kernel, which decides whether an incoming interrupt is relevant for real–time software or not. All non-relevant interrupts are forwarded to the GPOS and processed in idle time only; the relevant interrupts are handled immediately. The GPOS itself is not allowed to enable or disable interrupts; this functionality is performed by the RTOS.
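    A schematic sketch of such an interception layer; this is not RTAI's actual code, and the handler table and function names are invented for illustration:

        #include <stdbool.h>

        #define NUM_IRQS 16

        typedef void (*irq_handler_t)(int irq);

        static irq_handler_t rt_handler[NUM_IRQS];    /* real-time ISRs      */
        static bool pending_for_gpos[NUM_IRQS];       /* deferred interrupts */

        extern void gpos_replay_irq(int irq);         /* hand IRQ to GPOS    */

        /* Every hardware interrupt lands here first. */
        void hal_dispatch_irq(int irq)
        {
            if (rt_handler[irq])
                rt_handler[irq](irq);         /* relevant: handle immediately */
            else
                pending_for_gpos[irq] = true; /* irrelevant: defer to GPOS    */
        }

        /* Called when no real-time task is runnable: replay deferred IRQs. */
        void hal_run_idle(void)
        {
            for (int irq = 0; irq < NUM_IRQS; irq++) {
                if (pending_for_gpos[irq]) {
                    pending_for_gpos[irq] = false;
                    gpos_replay_irq(irq);
                }
            }
        }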

    When using this architecture on a multiprocessor system, as shown in figure 1, each CPU has its own part of the RTAI kernel. On each processor, the GPOS is executed as the idle task.

    There are three major disadvantages when using the basic architecture of RTAI, regarding the influence on the WCET of real–time software:

    • The GPOS is executed as an idle task on every processor. So if a real–time task is resumed on a certain CPU, the status of this processor is undefined. It cannot be said which cache lines are in cache and which have to be reloaded. The status of the Translation Lookaside Buffers (TLBs) and of the branch prediction is not known either. In addition, the time needed to switch from user mode tasks into kernel mode has to be taken into account, as the privilege level of the processor has to be changed here.

    [Figure 1: RTAI on a multiprocessor system. Each CPU runs its own RTAI kernel with real-time tasks; the Linux kernel and non-RT applications run as the idle task on both CPUs, which synchronize and share I/O, DMA and interrupts from the peripheral devices.]

    • Each processor has to deal with irrelevant interrupts. Normally, there are only a few real–time devices and a lot of devices used by the GPOS exclusively. In most cases the generated interrupts are irrelevant for real–time usage, but the RTOS has to react to all of them, interrupting the execution of real–time tasks. If, for example, a key is pressed on the keyboard, an interrupt is generated which disrupts the execution of a real–time task.

    • Each processor is able to perform I/O and is thus able to initiate DMA transfers. Previous studies have shown that the interaction between the chip set and the processors on a multiprocessor system cannot be neglected [5]. So if a real–time task wants to perform I/O, e.g. when writing to a hard disk, it may have to wait until the pending requests of other CPUs and peripheral devices are processed. This latency depends on the chip set and the communication protocols being used.

    4. The RECOMS Software Architecture

    In order to deal with the constraints mentioned in section 3, we have extended the architecture of the RTAI microkernel. Hardware and software of the real–time system are adjusted to get predictable best and worst case execution times. The influence of the GPOS on the real–time tasks is minimized, allowing the utilisation of the GPOS and RTOS in parallel.

    Our enhancements to RTAI are shown in figure 2. The CPUs are divided into two groups: one CPU is called the General Purpose Unit (GPU); on this processor the GPOS and its tasks are executed. The remaining CPUs are called the Real Time Units (RTUs). All real–time tasks run on these processors exclusively. As the GPOS is bound to the GPU, the RTUs are available for real–time usage only. If a dual CPU computer is used, there is one GPU and one RTU; on a quad CPU system there are one GPU and three RTUs.

    The interrupts are divided into two groups: those that are irrelevant for real–time usage and those triggered by real–time devices.

    [Figure 2: RECOMS Software Architecture. The GPU runs the Linux kernel, the non-RT applications and its part of the RTAI kernel, handling the GPOS I/O, DMA and interrupts; the RTU runs the RTAI kernel with real-time tasks and an idle task, receiving the RT interrupts and performing RT I/O and DMA. A supervision/control path lets the RTU block GPOS I/O accesses; interrupt routing is configured a priori.]

    gered by real–time devices. The first group is routed to theGPU directly, so this processor has to react to these interruptsonly. If for example the user moves the mouse or presses a keyon his keyboard only the GPU handles these interrupts. Theinterrupts needed in real–time context are routed to the RTUsonly. Here, a real-time handler has to respond to these inter-rupts. If there is more than one RTU the relevant interruptshave to be routed to the CPU executing the appropriate real–time interrupt service routine. The routing of the interrupts canbe adjusted a priori, therefore there is no overhead at runtime.

    Similar to the mechanism of enabling and disabling interrupts provided by RTAI, I/O accesses of the GPOS are supervised by the RECOMS Software Architecture. At any point in time it is possible for a real–time task to block the I/O accesses of the GPOS. I/O regions can be defined that should be excluded from blocking, as not all I/O accesses result in a performance penalty for real–time tasks. If an I/O access gets blocked on the GPU, only the task performing the I/O operation is suspended. All other tasks of the GPOS keep on running. For real–time tasks the overhead to perform the blocking of I/O accesses is negligible, as only a semaphore has to be changed, which is always non-blocking.
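    The supervision mechanism might be sketched as follows; the flag, the exemption check and the suspend primitive are assumed names, not RECOMS code:

        #include <stdatomic.h>
        #include <stdbool.h>

        static atomic_bool gpos_io_blocked;   /* set by real-time tasks */

        extern int  io_region_is_exempt(unsigned long port); /* assumed */
        extern void suspend_current_task_until_unblocked(void);

        /* Real-time side: negligible overhead, never blocks. */
        void rt_block_gpos_io(void)   { atomic_store(&gpos_io_blocked, true);  }
        void rt_unblock_gpos_io(void) { atomic_store(&gpos_io_blocked, false); }

        /* GPU side: called before every supervised I/O access of the GPOS.
         * Only the task performing the I/O is suspended; all other GPOS
         * tasks keep on running. */
        void gpos_io_gate(unsigned long port)
        {
            while (atomic_load(&gpos_io_blocked) && !io_region_is_exempt(port))
                suspend_current_task_until_unblocked();
        }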

    In addition to disabling I/O, PCI devices can be hindered from initiating DMA transfers. Here, the Master Enable Bit of the PCI devices is cleared during critical real-time operations [9].
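    For illustration, clearing the Master Enable Bit amounts to clearing bit 2 of the PCI command register at configuration offset 0x04; the configuration-space access wrappers below are assumed, not a specific kernel API:

        #include <stdint.h>

        #define PCI_COMMAND        0x04u   /* config-space offset      */
        #define PCI_COMMAND_MASTER 0x04u   /* bit 2: Bus Master Enable */

        /* Assumed wrappers around PCI configuration-space accesses. */
        extern uint16_t pci_cfg_read16(int bus, int dev, int fn, uint32_t off);
        extern void     pci_cfg_write16(int bus, int dev, int fn, uint32_t off,
                                        uint16_t val);

        /* Stop a device from initiating DMA during a critical real-time
         * operation; the caller restores the returned value afterwards. */
        uint16_t pci_disable_bus_master(int bus, int dev, int fn)
        {
            uint16_t cmd = pci_cfg_read16(bus, dev, fn, PCI_COMMAND);
            pci_cfg_write16(bus, dev, fn, PCI_COMMAND,
                            cmd & (uint16_t)~PCI_COMMAND_MASTER);
            return cmd;
        }

        void pci_restore_command(int bus, int dev, int fn, uint16_t cmd)
        {
            pci_cfg_write16(bus, dev, fn, PCI_COMMAND, cmd);
        }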

    The RECOMS software architecture leads to the following advantages for real–time software:

    • The privilege level does not need to be changed. As the GPOS is executed on the GPU exclusively, the RTUs are always running in kernel mode. Thus there is no additional latency for interrupts caused by switching from user mode to kernel mode.

    • Real–time tasks are not interrupted by irrelevant demands. As only the interrupts needed for real–time tasks are routed to the RTUs, there is no additional delay to the real–time software.

    • The caches and the TLBs are not displaced by the GPOS. As the GPOS with its tasks and interrupt handlers is executed on the GPU exclusively, the caches and the TLBs of the RTUs are always in a well known state. Thus it is possible to arrange the code of the real–time software a priori in order to minimize the displacement in caches and TLBs. If the relevant real–time code is very small, with this architecture it is now possible to lock this code in the cache of an RTU [10].

  • There is no remaining interaction between the chip set with its communication protocols and the real-time software. The I/O operations of the GPU can be postponed, which gives an RTU unhindered access to a specific real-time device. A real-time process that wants to perform I/O can now access the chip set without interference from non-real-time devices; the real-time tasks themselves have to synchronize their I/O requests by using semaphores.

    5. Results

    The advantages of our architecture are demonstrated by its central mechanism, the postponing of the I/O accesses of the GPU. A real-time task executing on the RTU of a dual-CPU machine performs a series of 1000 I/O accesses, as needed, for instance, when a real-time task writes data to a hard disk. Each measured value reflects the time needed to execute these 1000 I/O instructions. The code was fully arranged in the cache of the RTU and interrupts were disabled during each measurement. In parallel to each measurement, the GPU accessed local and network file systems heavily.
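    The measurement code itself is not listed in the paper; a sketch of such a loop, timing 1000 port accesses with the x86 time stamp counter, is shown below. The port number and the TSC frequency used for the conversion to microseconds are placeholders.

    /* Sketch of one measurement: time 1000 I/O port accesses with the
     * x86 TSC. Port 0x80 and CPU_MHZ are placeholders; the real setup
     * ran with interrupts disabled and the code locked in cache. */
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/io.h>      /* inb(), iopl(); x86, needs root */

    #define CPU_MHZ 1000.0   /* assumed TSC frequency in MHz */

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        if (iopl(3)) { perror("iopl"); return 1; }

        uint64_t t0 = rdtsc();
        for (int i = 0; i < 1000; i++)
            (void)inb(0x80);            /* one supervised I/O access */
        uint64_t t1 = rdtsc();

        printf("%.3f usec\n", (t1 - t0) / CPU_MHZ);
        return 0;
    }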

    Figure 3 shows the measured results for the situation in which the GPU can perform its I/O accesses without any restriction. The time needed to execute the 1000 I/O instructions varies from 195 microseconds to 940 microseconds; compared to the best case, the execution time is increased by 382%.

    [Figure 3: Execution times with I/O. Histogram of 300000 measured values, number of values (log scale) over time in microseconds; minimum 195.123 usec, maximum 940.074 usec, average 207.033 usec.]

    If the I/O accesses of the GPU are postponed and the PCI devices are prevented from initiating DMA transfers during each measurement, we obtain the results shown in figure 4. The measured values now vary only from 195 microseconds to 199 microseconds; compared to the best case, the execution times are increased by at most 2%.

    [Figure 4: Execution times without I/O. Histogram of 300000 measured values, number of values (log scale) over time in microseconds; minimum 195.123 usec, maximum 199.155 usec, average 195.414 usec.]

    6. Conclusion

    In this paper we present a software architecture for multiprocessor systems based on the Intel x86 architecture that allows standard and real-time tasks to execute in parallel. Hardware and software of the real-time system are configured in a way that leads to more deterministic computation times.

    As the interdependencies of the CPUs and the chip set have a major influence on the execution time of real-time software, these influences have to be prevented: standard and real-time tasks are executed on different CPUs, and interrupts are routed to the CPUs that execute the corresponding tasks. I/O accesses of the GPOS are supervised by the RECOMS Software Architecture and can be postponed. It is therefore now possible for real-time tasks to access their peripherals without competing requests.

    Compared to single-processor systems, the computation times of real-time software are more deterministic. There are no displacements in the caches due to the execution of the GPOS, so the caches of the RTUs can always be held in a well-known state. Response times get shorter because the privilege level does not need to be changed on a real-time request: real-time handlers execute immediately and can be locked in the cache, leading to faster response times.

    Future work deals with the synchronization issues of real-time tasks when performing I/O and initiating DMA transfers; in particular, the I/O accesses of the RTUs should not occur simultaneously. Furthermore, the advantages of using a NUMA architecture as a real-time system ought to be examined.

    References

    [1] Alan Burns and Stewart Edgar. The use of extreme statistics to predict computation time for advanced processor architectures with branch prediction. Technical report, University of York, United Kingdom, Department of Computer Science, Real-Time Research Group, 2000.

    [2] Antoine Colin and Isabelle Puaut. Worst case execution time analysis for a processor with branch prediction. Journal of Real-Time Systems, 18:249–274, 2000.

    [3] DIAPM, Dipartimento di Ingegneria Aerospaziale, Politecnico di Milano. A Hard Real Time support for LINUX, 2002.

    [4] Reinhold Heckmann, Marc Langenbach, Stephan Thesing, and Reinhard Wilhelm. The Influence of Processor Architecture on the Design and the Results of WCET Tools. In Proceedings of the IEEE, volume 91, July 2003.

    [5] Thomas Hopfner, Jürgen Stohr, Wolfram Faul, and Georg Färber. RTCPU – Realzeitanwendungen auf Dual-Prozessor PC Architekturen. it+ti — Informationstechnik und Technische Informatik, 43(6):291, December 2001.

    [6] Stefan M. Petters. Worst Case Execution Time Estimation for Advanced Processor Architectures. PhD thesis, Institute for Real-Time Computer Systems, Technische Universität München, September 2002.

    [7] F. Stappert and P. Altenbernd. Complete worst-case execution time analysis of straight-line hard real-time programs. Technical Report 27/97, C-Lab, Fürstenallee 11, Paderborn, Germany, December 1997.

    [8] F. Stappert and P. Altenbernd. Complete worst-case execution time analysis of straight-line hard real-time programs. Journal of Systems Architecture, The EUROMICRO Journal, 46:339–355, 2000.

    [9] Jürgen Stohr, Alexander von Bülow, and Georg Färber. Controlling the Influence of PCI DMA Transfers on Worst Case Execution Times of Real-Time Software. In Proceedings of the 4th International Workshop on Worst-Case Execution Time Analysis, in conjunction with the 16th Euromicro Conference on Real-Time Systems, Catania, Italy, June 2004.

    [10] Alexander von Bülow, Jürgen Stohr, and Georg Färber. Towards an Efficient Use of Caches in State of the Art Processors for Real-Time Systems. In Proceedings of the 16th Euromicro Conference on Real-Time Systems – Work in Progress Session, Catania, Italy, June 2004.


Specification of Real-Time Schedulers

    Odile NASR, Miloud RACHED, Jean-Paul BODEVEIX, Mamoun FILALI

    IRIT, Paul Sabatier University, 118 Route de Narbonne, 31062 Toulouse, France

    E-mail: {nasr,rached,bodeveix,filali}@irit.fr

    Abstract

    The purpose of this paper is to specify, within the framework of the Cotre project, different scheduling policies with the aim of checking user-defined real-time properties. We show how to specify a scheduler using the basic timed automata of Uppaal to express scheduling policies.

    1. Introduction

    The work presented here was carried out within the framework of the Cotre project¹. The objective of this project is to define an architectural language (inspired by HRT-HOOD [2]) for the modeling and validation of real-time systems. The Cotre language supports several kinds of analysis: untimed analyses, where time is abstracted away; timed analyses with maximum parallelism, where each process has its own processor; and timed analyses with scheduling. In this paper we are interested in this last kind of analysis, and thus in checking safety properties of a real-time system in the presence of the scheduler. Our goal is to express scheduling algorithms in the Cotre language and to check systems describing processes controlled by a scheduler. Model checking is performed by translating the model into basic timed automata of Uppaal [8]; schedulability is described by safety properties in timed logic.

    2. The Cotre language

    The Cotre language is an architectural description language which can be seen as a common interface to the several model-checking formalisms considered within the framework of the Cotre project (timed automata [1], Petri nets, ...) [5].

    ¹ RNTL project (COmposant Temps Réel), running over two years (2002-2004) (http://www.laas.fr/COTRE). The consortium consisted of AIRBUS, TNI-Valiosys, ENST-Bretagne, and the three laboratories of FéRIA (LAAS, IRIT, CERT).

2.1. Characteristics of the Cotre language

    The static architecture is described with components and connectors. The structure of components is hierarchical. The interface of a component consists of I/O ports. The interconnection of components is described by multi-transmitter and multi-receiver connectors, which consist of one-way ports.

    The dynamic architecture is described by transition systems attached to elementary components. These transition systems are automata communicating by rendez-vous. Their elements address real-time aspects (bounded waiting, periodicity, ...).

    Cotre allows the specification of qualitative and quantitative properties. A family of generic properties has been defined, and syntactic constructs have been associated with them.

2.2. Real time in Cotre

    In Cotre, real time appears at two levels: at the structural level, with constructors for periodic and sporadic processes, and at the instruction level.

    Constructors of real-time processes. Periodic and sporadic processes have the following execution attributes:

    – for periodic processes: Æ is an integer expressing the release time (the relative date of the first execution), P the activation period, and deadline the maximum delay between the release event and the end of execution;

    – for sporadic processes: min P is an integer indicating the minimal duration between two executions, and deadline the maximum duration until the end of execution.

    The concrete syntax of Cotre real-time processes is:

    real_time_process ::=
        periodic '(' Æ, P, deadline, priority ')' behavior
      | sporadic ['(' min P, deadline, priority ')'] behavior

    Real-time instructions. Cotre provides three instructions: timeout, delay, and computation. Timeout limits the waiting duration of a synchronization. Delay imposes a waiting time on the process. Computation indicates the computing time of a process; it has two parameters: min, an integer indicating the minimal duration of the computation, and delta, the authorized maximum variation. The computation thus takes between min and min+delta units of time.

    real_time_instruction ::=
        delay '(' min[,delta] ')'
      | computation '(' min[,delta] ')'

    3. Real-time scheduling

    A real-time application is made of a set of tasks which can be periodic (control tasks reactivated at a precise interval called the period) or aperiodic (tasks reactivated by events). The execution of these tasks must respect a maximum delay: no task should finish after its absolute deadline. The absolute deadline di of a task τi is the instant by which the execution of the task must finish; missing it entails a timing failure [4]. The principal goal of scheduling is to ensure that these temporal constraints are respected. It thus consists in defining a policy for allocating the processor so that no timing failure occurs. A task τi has four parameters (ri, Ci, Di, Ti), where ri is the date of the first activation, Ci the WCET, Di the relative deadline (the maximum delay between the request of execution and its effective end), and Ti the period.
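    To make the notation concrete, here is a minimal C rendering of this task model (field names are ours, not Cotre's). For EDF, discussed below, the instance with the earlier absolute deadline di = ri + k*Ti + Di wins.

    /* Minimal rendering of the task model (r, C, D, T); illustrative only. */
    #include <stdio.h>

    typedef struct {
        int r;  /* date of first activation (release) */
        int C;  /* worst-case execution time (WCET)   */
        int D;  /* relative deadline                  */
        int T;  /* period                             */
    } task;

    /* Absolute deadline of the k-th instance of a periodic task. */
    static int abs_deadline(const task *t, int k)
    {
        return t->r + k * t->T + t->D;
    }

    /* EDF: the instance with the earlier absolute deadline has priority. */
    static int edf_prior(const task *a, int ka, const task *b, int kb)
    {
        return abs_deadline(a, ka) <= abs_deadline(b, kb);
    }

    int main(void)
    {
        task t1 = {0, 4, 8, 10};   /* (r, C, D, T) */
        task t2 = {0, 2, 12, 12};

        /* First instances: d1 = 8, d2 = 12, so t1 runs first under EDF. */
        printf("t1 before t2: %d\n", edf_prior(&t1, 0, &t2, 0));
        return 0;
    }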

    Periodic scheduling [3] handles periodic tasks and is based on the concept of priority: the processor is allocated to the ready task with the greatest priority, and a task can be preempted by a higher-priority task. There are three principal periodic algorithms: Rate Monotonic (RM) [7], Deadline Monotonic (DM), and Earliest Deadline First (EDF). The characteristics of these policies are summarized in the following table:

    Policy   Prio(τi) >= Prio(τj) if   Condition
    RM       Ti <= Tj                  Di = Ti
    DM       Di <= Dj                  Di <= Ti
    EDF      di <= dj                  di the absolute deadline of the current request (dynamic)

    There is also joint scheduling, which handles aperiodic tasks together with periodic tasks scheduled by one of the three policies presented above.

    However, another class of algorithms can be defined for aperiodic tasks: by declaring a maximum arrival frequency, aperiodic tasks can then be treated like periodic tasks.

    Remark. The scheduling analyses generally used take worst-case times into account but do not treat the behaviors of, and the dependencies (message exchanges) between, the various tasks.

    [Figure 1: Production/Consumption. A periodic producer Prod, periodic(10){ when c1! -> compute(4); p_w [] when c2! -> compute(4); p_w }, is connected through its channels c1 and c2 to the channel c of two sporadic consumers Cons1 and Cons2 (P=10), each defined as sporadic(10)(c){ when c? -> compute(4); skip }.]

    Figure 1 illustrates an example extracted from an avionics model. It represents three processes: a periodic producer (Prod) of period P=10 and two sporadic consumers (Cons1 and Cons2) of period P=10. Prod has two channels, c1 and c2, to communicate with channel c of Cons1 and Cons2 respectively. When it sends a signal on c1 or c2, it performs a computation of 4 units of time and waits for the next period. When Cons1 or Cons2 receives a signal, it also performs a computation of 4 units of time. A static analysis considers that all three processes compute in each cycle; the worst case thus requires 12 units of time, and we have a violation of the period. A dynamic analysis detects that the producer and only one of the two consumers compute in each cycle, which takes at most 8 units of time per cycle: scheduling the tasks is thus possible.
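    The arithmetic of this example can be checked mechanically; the sketch below simply contrasts the static bound (all three computations per cycle) with the dynamic one (producer plus a single consumer), under the alternation assumption stated above.

    /* Worked check of the Production/Consumption example: per 10-unit
     * cycle, static analysis charges all three computations, dynamic
     * analysis only the producer plus the single consumer signalled. */
    #include <stdio.h>

    int main(void)
    {
        const int P = 10, comp = 4;

        int static_load  = 3 * comp;   /* Prod + Cons1 + Cons2 = 12 > P  */
        int dynamic_load = 2 * comp;   /* Prod + one consumer  =  8 <= P */

        printf("static : %d/%d -> %s\n", static_load, P,
               static_load <= P ? "schedulable" : "period violated");
        printf("dynamic: %d/%d -> %s\n", dynamic_load, P,
               dynamic_load <= P ? "schedulable" : "period violated");
        return 0;
    }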

    4. Scheduling policies in Cotre

    The algorithms presented above can treat synchronization and communication interactions between tasks [6]. The goal of the analysis presented here is to propose a more precise method for treating these synchronization interactions, with the aim of checking properties while taking into account both synchronization and processor allocation. We place ourselves in a uniprocessor context and thus consider the allocation of a single processor resource. The modelled system can be represented overall as follows:

    P1 ∥ ... ∥ Pn ∥ scheduler

    Cotre processes transformed into tasks send requests for access to the processor. These requests are managed by the scheduler, which applies a scheduling policy to allocate the processor to the highest-priority task so that no timing failure occurs. Each process has a channel through which it dispatches its requests to the scheduler. The scheduler has an array request on which it receives the requests of the various processes, and an array exit through which it sends end-of-execution notifications back to the processes. Each of these arrays has size NBPROC, the number of processes of the application; exit is an array of ports, and request is an array of ports through which integers corresponding to the minimal computation duration can transit. When the scheduler sends an end-of-execution notification, it chooses the next process to execute. The scheduler is represented by the component policy_scheduler:

    component policy_scheduler
        (request: array NBPROC of port of nat;
         exit: out array NBPROC of port)
      ...
    end policy_scheduler

    A Cotre process can perform several computations, each resulting in a request for processor time. A computation has the parameters self, min, delta, R, prio, P, where self is the identifier of the process, min the minimal duration of the computation, delta the authorized maximum variation, R the relative deadline, prio the priority, and P the period. These parameters come either from those passed to the process constructors (section 2.2) or from those of the computation itself. A computation request results in:

    – sending a request on the channel request of the scheduler (request!min), initializing the clock cl_d of the request for execution (cl_d[self]:= 0), and assigning the parameters of the computation to the corresponding arrays declared as global variables (R[self]:= R; prio[self]:= prio; P[self]:= P; dcet[self]:= delta);

    – waiting for the exit signal, which indicates that the required processor time has been spent.

    The global variables of the system are the following:

    var R: array NBPROC of nat       -- Relative deadlines
    var prio: array NBPROC of nat    -- Priorities
    var P: array NBPROC of nat       -- Periods
    var cl_d: array NBPROC of clock  -- Clocks indicating the delay elapsed
                                     -- since the request for execution
    var dcet: array NBPROC of nat    -- Latitudes of computing time

    The verification of the following global invariant ensures that the temporal constraints are respected (ready being the array of processes that have requested access to the processor):

    property ct: invariant all i=1..NBPROC.
        ready[i] => cl_d[i] <= R[i]

    In what follows, we present the modeling of scheduling policies in Cotre by expressing different Scheduler processes.

    Remarks:

    – In the Cotre environment, the Scheduler process is generated automatically once the scheduling policy has been selected.

    – The instructions delay and timeout do not give rise to any interaction with the scheduler because they do not require processor allocation.

    4.1. Non-preemptive scheduling

    To illustrate non-preemptive scheduling, we consider the First Come First Served (FCFS) policy, where requests are treated in order of arrival and no preemption is possible. The request being treated must satisfy the following invariant, stating that the active process is the one whose request is oldest:

    property is_FCFS: invariant running > 0 =>
        all i:1..NBPROC. ready[i] => cl_d[i] <= cl_d[running]

    This property must be satisfied by all executions respecting the temporal constraints (specified by the constraint clause).

    The local variable running indicates the active process; cl_e is a clock measuring the time elapsed since the beginning of a request's execution; comp is an array holding the computing times; and ready indicates whether a process is ready. To ensure the correct functioning of FCFS, the clock cl_e must not exceed the bound on the computing time comp. When a request arrives at the scheduler, it is treated immediately by a succession of initializations and tests. When the active process finishes its execution, an exit signal is sent and the process becomes inactive. The next process to be served must be the one that has waited longest among all those ready to execute.

    init(p,min) == comp[p]:= min; ready[p]:= true

    component FCFS_scheduler
        (request: array NBPROC of port of nat;
         exit: out array NBPROC of port)
      var running: 0..NBPROC            -- Active process
      var cl_e: clock                   -- Clock indicating the delay elapsed
                                        -- since the beginning of the execution
      var comp: array NBPROC of nat     -- Computing times
      var ready: array NBPROC of bool   -- Ready processes
      constraint running > 0 =>
          cl_e <= comp[running] + dcet[running]
      property is_FCFS: invariant running > 0 =>
          all i:1..NBPROC. ready[i] => cl_d[i] <= cl_d[running]
      initially all j:1..NBPROC. ready[j] = false & comp[j] = 0
      cyclic {
        -- Execution of a request
        [p:1..NBPROC] when request[p]? min ->
            init(p,min);
            if running = 0 then cl_e:= 0; running:= p end
        -- End of execution
        [] from cl_e >= comp[running] when exit[running]! ->
            ready[running]:= false; running:= 0
        -- Choice of the next process
        [] [p:1..NBPROC] from running = 0 & ready[p] &
            (all q:1..NBPROC. ready[q] => cl_d[p] >= cl_d[q]) ->
            cl_e:= 0; running:= p
      }
    end FCFS_scheduler

    Remark. We model the consumption of time, not the system view of the scheduler and the different processes. Each computation is considered to execute until its term (to within delta) and cannot terminate prematurely.

    4.2. Preemptive scheduling

    The general principle of preemption rests on postponing the computation bound: when a process is preempted, its computation bound must be delayed by a value equal to the computation bound of the process that preempted it:

    comp[next] := comp[next] + comp[running]

    where next represents the preempted process and running the active process.

    The various preemptive scheduling policies can be described by a single algorithm, given by the following pseudo-code. The scheduler must test whether the new request has priority in order to perform the preemption.

    ECRTS WIP 2004 15 Catania, Italy

    component preemptive_scheduler
        (request: array NBPROC of port of nat;
         exit: out array NBPROC of port)
      var running: 0..NBPROC                    -- Active process
      var cl_e: array NBPROC of clock           -- Delay elapsed since the
                                                -- beginning of the execution
      var duration: nat                         -- Computing time of a process
      var comp: array NBPROC of nat             -- All computing times
      var ready: array NBPROC of bool           -- Ready processes
      var next: 0..NBPROC                       -- Process continuation
      var preempted: array NBPROC of 0..NBPROC  -- Chaining of preempted processes
      constraint running != 0 =>
          cl_e[running] <= comp[running] + dcet[running]
      initially next = 0 & running = 0 &
          all j:1..NBPROC. ready[j] = false & comp[j] = 0
      -- First request, no process ready
      [p:1..NBPROC] from (all q:1..NBPROC. not ready[q])
          when request[p]? duration ->
          initialize(p,duration); active(p,0)
      -- Further requests, possibly preempting
      [] [p:1..NBPROC] from running != 0
          when request[p]? duration ->
          initialize(p,duration);
          if preempte(p) then active(p,running)  -- Preemption
          else preempted[p]:= 0 end
      -- Termination
      [] from running != 0 & ended(running)
          when exit[running]! ->
          if next != 0 then comp[next]:= comp[next] + comp[running] end;
          ready[running]:= false; running:= 0
      -- Election
      [] [p:1..NBPROC] from running = 0 & ready[p] & priority(p) ->
          if p = next then running:= p; next:= preempted[p]
          else active(p,next) end
    end preemptive_scheduler

    The constraint is a hypothesis that the model must respect; it is translated into Uppaal invariants. When a process becomes active for the first time, its date of termination lies between comp[running] and comp[running] + dcet[running], as measured by the clock cl_e[running]. When a process p gains the processor while no other process is requesting it, p becomes active with continuation 0; it has thus not preempted any other process. When a process is preempted by a higher-priority process, it is chained to the interrupting one. When a stopped process regains control, its date of termination is delayed by the time during which it was stopped. The scheduler does not block a request; it always serves a pending request immediately (time does not pass once the request has been expressed).

    The guard of the first branch (all q:1..NBPROC. not ready[q]) ensures that this branch, which performs the action active(p,0) (and thus overwrites next), cannot be taken when next != 0.

    Instantiation. The algorithm defined above is applicable to various scheduling policies. We use the EDF policy to illustrate it.

    To ensure the correct functioning of the EDF policy, the process running must respect the following constraint:

    cl_e[running] <= comp[running] + dcet[running]

    that is, the execution clock (cl_e) must not exceed the sum of the computing-time bound (comp) and the latitude (dcet).

    When the scheduler receives a request from p for access to the processor while there are already active processes, the parameter duration is stored in the array comp and p becomes ready. initialize(p,duration) is thus defined by:

    initialize(p,duration) == comp[p]:= duration; ready[p]:= true

    The process p has priority, and the process running is preempted, if R[p] < R[running] - cl_d[running]:

    preempte(p) == cl_d[running] < R[running] - R[p]

    The process p becomes active with continuation n; n is memorized and the execution clock is initialized:

    active(p,n) == next:= n; running:= p;
                   preempted[p]:= n; cl_e[p]:= 0

    A process can finish once its execution clock reaches the minimal computation bound:

    ended(running) == cl_e[running] >= comp[running]

    A process p has maximum priority for EDF if it has the nearest absolute deadline compared to any other ready process q (R[p] - cl_d[p] <= R[q] - cl_d[q]):

    priority(p) == all q:Process. ready[q] =>
                   cl_d[p] - cl_d[q] >= R[p] - R[q]
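    As an illustration (not part of the Cotre model), the clock-based EDF test can be exercised directly: for each ready process, R[p] - cl_d[p] is its remaining time to deadline, and priority(p) holds exactly for the process minimizing it. The values below are invented.

    /* Sketch checking the clock-based EDF test: p has priority over every
     * ready q iff cl_d[p] - cl_d[q] >= R[p] - R[q], i.e. iff its time to
     * deadline R - cl_d is minimal. Arrays are 1-indexed as in the model. */
    #include <stdio.h>

    #define NBPROC 3

    int R[NBPROC + 1]     = {0, 8, 12, 20};  /* relative deadlines */
    int cl_d[NBPROC + 1]  = {0, 3,  1,  6};  /* time since request */
    int ready[NBPROC + 1] = {0, 1,  1,  1};

    int priority(int p)
    {
        for (int q = 1; q <= NBPROC; q++)
            if (ready[q] && !(cl_d[p] - cl_d[q] >= R[p] - R[q]))
                return 0;   /* some q has an earlier absolute deadline */
        return 1;
    }

    int main(void)
    {
        for (int p = 1; p <= NBPROC; p++)
            printf("priority(%d) = %d (time to deadline: %d)\n",
                   p, priority(p), R[p] - cl_d[p]);
        return 0;   /* process 1 wins here: 8 - 3 = 5 is the minimum */
    }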

    5. Conclusion

    In this study we presented the Cotre language, which allows real-time behaviors as well as properties to be expressed. The semantics of the language constructs was given by translation into the timed automata of Uppaal. Coupling with the Uppaal tool then makes it possible to check whether a Cotre model satisfies its properties. In particular, we expressed scheduling algorithms in Cotre. We are currently working on the management of critical resources in the presence of the scheduler. A way to continue this work would be to carry out experiments in order to test the efficiency of the suggested methods and to validate them.

    References

    [1] R. Alur and D. Dill. A theory of timed automata. Theoretical Comput. Sci., 126(1):183–235, February 1994.

    [2] A. Burns and A. Wellings. HRT-HOOD: a structured design method for hard real-time Ada systems. Elsevier, 1995.

    [3] A. Choquet-Geniet. Panorama de l'ordonnancement temps réel monoprocesseur. In Ecole d'été Temps Réel, ETR2003, Systèmes, réseaux et Applications, pages 213–226. IRIT, Toulouse, September 2003.

    [4] F. Cottet, J. Delacroix, C. Kaiser, and Z. Mammeri. Ordonnancement temps réel. HERMES Science Publications, 2000.

    [5] J.-M. Farines, B. Berthomieu, J.-P. Bodeveix, P. Dissaux, P. Farail, M. Filali, P. Gaufillet, H. Hafidi, J.-L. Lambert, P. Michel, and F. Vernadat. The Cotre project: rigorous development for real-time systems in avionics. In WRTP'03, 27th IFAC/IFIP/IEEE Workshop on Real-Time Programming, Logow (Poland), pages 51–56. IEEE, May 14-17, 2003.

    [6] M. Klein et al. A Practitioners' Handbook for Real-Time Analysis: Guide to Rate Monotonic Analysis for Real-Time Systems. Kluwer Academic Publishers, 1993.

    [7] C. Liu and J. Layland. Scheduling algorithms for multiprogramming in a hard real-time environment. J. ACM, 20(1):46–61, January 1973.

    [8] UPPAAL – A Tool Suite for Verification of Real-Time Systems. http://www.docs.uu.se/docs/rtmv/uppaal/.


SymTA/S - Symbolic Timing Analysis for Systems

    Arne Hamann, Rafik Henia, Razvan Racu, Marek Jersak, Kai Richter, Rolf Ernst

    Institute of Computer and Communication Network Engineering
    Technical University of Braunschweig
    D-38106 Braunschweig / Germany

    {hamann|henia|racu|jersak|richter|ernst}@ida.ing.tu-bs.de

    ABSTRACT

    SymTA/S is a performance and timing analysis tool based on formal scheduling analysis techniques and symbolic simulation. The tool supports heterogeneous architectures, complex task dependencies, and context-aware analysis, and it combines optimization algorithms with system sensitivity analysis for rapid design space exploration. This paper gives an overview of the current and future research interests in the SymTA/S project.

    I. INTRODUCTION

    Although countless approaches to formal performance and timing analysis are known from real-time systems research, only very few have been adopted in the design of heterogeneous SoCs and distributed systems. The SymTA/S approach enables a completely new view of system-level analysis and explicitly supports the combination and integration of heterogeneous subsystems.

    In the first part of the paper we give a brief overview of the formal core of SymTA/S. Afterwards, we briefly introduce current research interests in the SymTA/S project: context-aware analysis, optimization, and sensitivity analysis. We conclude the paper with an example demonstrating the relevance of these aspects.

    II. THE SYMTA/S APPROACH

    SymTA/S [1] is a software tool for formal performance analysis of heterogeneous SoCs and distributed systems. The core of SymTA/S is our recently developed technique for coupling scheduling analysis algorithms using event streams [9], [11]. Event streams describe the possible I/O timing of tasks and are characterized by appropriate event models, such as periodic events with jitter or bursts, and sporadic events. At the system level, event streams are used to connect local analyses according to the application and communication structure of the system.

    In contrast to all known work, SymTA/S explicitly supports the combination and integration of different kinds of analysis techniques known from real-time research. For this purpose, it is essential to translate between the often incompatible event stream models resulting from the dissimilarity of the local techniques. This kind of incompatibility appears, for instance, between an analysis technique assuming periodic events with jitter and an analysis technique requiring sporadic events. In SymTA/S we use event model interfaces (EMIFs) and event adaptation functions (EAFs) to realize these essential transitions [9].

    However, the integration of heterogeneous systems is not the sole domain of application for EMIFs and EAFs. In SymTA/S, so-called shapers can be connected to any event stream. Shapers are basically EMIF-EAF combinations which manipulate an event stream and thus the interaction between two components. More precisely, they provide control over the timing of exchanged events and data. Consequently, they enable the user to model buffering and to perform traffic shaping. This is important because buffering and traffic shaping break up non-functional dependency cycles and can tremendously reduce transient load peaks in dynamic systems [10]. In other words, thanks to the event model transformation provided by EMIFs and EAFs, SymTA/S is able to analyze many real-world examples that holistic approaches [12], [8] cannot handle.
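    As a hedged illustration of such event models (standard formulations from the compositional analysis literature, not the SymTA/S API): a "periodic with jitter" stream with period P and jitter J admits at most eta+(dt) = ceil((dt + J)/P) events in any time window dt, and a conservative interface to a purely sporadic model keeps the minimum event distance P - J.

    /* Sketch (assumed formulas, not the SymTA/S API): bounds for a
     * "periodic with jitter" event stream and a conservative EMIF-style
     * abstraction into a sporadic model. */
    #include <stdio.h>
    #include <math.h>

    /* Upper bound on events seen in any window dt, for period P, jitter J. */
    static unsigned eta_plus(double dt, double P, double J)
    {
        if (dt <= 0.0) return 0;
        return (unsigned)ceil((dt + J) / P);
    }

    /* Conservative sporadic abstraction: minimum distance between events. */
    static double sporadic_dmin(double P, double J)
    {
        return (P > J) ? P - J : 0.0;  /* J >= P means bursts are possible */
    }

    int main(void)
    {
        double P = 10.0, J = 4.0;
        printf("eta+(25) = %u events\n", eta_plus(25.0, P, J)); /* ceil(29/10) = 3 */
        printf("d_min    = %.1f\n", sporadic_dmin(P, J));       /* 6.0 */
        return 0;
    }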

    In order to perform a system level analysis, Sy