
Bachelor Informatica

Trap handling with hardware multi-threading

Koen Koning
6301452 / 10004235
[email protected]

June 12, 2013

Supervisor(s): Raphael ‘kena’ Poss (UvA)

Signed:

Informatica — Universiteit van Amsterdam


Abstract

Processors often use traps to notify a program about events which are too exceptional to check in software. This thesis analyzes the possibilities of implementing such mechanisms in hardware multithreaded architectures. Using these architectures allows for more efficient trap handling, resulting in better overall performance of the software. One such system was implemented in the Microgrid architecture for further examination, showing that the system works as expected. This system is shown to have no impact on other software running on the same core, and to outperform contemporary architectures during instruction emulation. This thesis shows there is potential in highly parallel architectures for trap handling, while still leaving many design decisions to be made, depending on the requirements for such a system.


Contents

1 Introduction 7
1.1 Problem 8
1.2 Contribution 8
1.3 Prior work 8

2 Microgrid 11
2.1 Overview 11
2.2 Architecture 12
2.2.1 Dataflow 13
2.3 MGSim 13
2.4 Instruction Set Architecture 14

3 Design space exploration 15
3.1 Traps in other architectures 16
3.2 Preciseness 17
3.3 Handler 17
3.3.1 Inside the victim’s context 18
3.3.2 Separate thread 18
3.4 Causes 19
3.4.1 In-pipeline exceptions 19
3.4.2 Asynchronous exceptions 20
3.4.3 Memory exceptions 20
3.4.4 Events 21

4 Implementation 23
4.1 Overview 23
4.2 Detection 24
4.3 Handler thread 25
4.3.1 More efficient chkex 25
4.3.2 Nested exceptions 26
4.3.3 Alternatives 26
4.4 Thread inspection and modification 27
4.4.1 Dealing with control bits 29

5 Experiments 31
5.1 Tests 31
5.1.1 Multiple handlers and victims 31
5.1.2 Disassembler 32
5.2 Benchmarks 33
5.2.1 Empty handler 33
5.2.2 Performance impact of a handler interleaved with other threads 34
5.2.3 Instruction emulation 34


6 Conclusions 35
6.1 Limitations and future work 35

A Usage 39


CHAPTER 1

Introduction

Over the past fifty years computers have evolved enormously in terms of size, performance and usage. Although the complexity of computers also increased dramatically during this time, the core principles have not changed much since the 1970s. One of the common properties is the implementation of storage and processing elements as logic circuits on silicon, with the gate (transistor) as basic element. One of these elements is the Central Processing Unit (CPU), which forms a central role in the computer, executing instructions and controlling most other components. Traditionally, the performance of a computer is related to how fast these circuits operate on chip. A well known term in this area is “Moore’s law”, stating that the number of transistors on a processor doubles approximately every two years [1]. While this exponential growth has held since it was published (in 1965), it has become much harder to sustain in single processors [2] and often results in extreme power dissipation. This has led to alternative ways to improve processor performance, such as parallel execution.

Parallelism in processors means two or more instructions are executed at the same time, often accomplished by duplicating most components, resulting in multiple cores per processor. Another way of using parallelism in processors is by using multiple threads of execution. The execution of these threads is interleaved on a single core, giving the illusion of parallel execution. To see how this could actually result in better performance, it is important to note that the biggest bottleneck in modern-day computers is memory. Reading data from memory may take as long as it takes the processor to execute hundreds of instructions. When the processor needs this data for its execution, it must wait, causing the execution to be suspended for a significant amount of time. During this time, the processor would be idle, unless another program, independent of the first and its memory operation, could run on it. Thus, by using multiple threads a processor can be kept active most of the time, which results in better performance.

There are, however, a few problems. First and foremost, concurrency is hard. It can result in a huge range of problems, often involving shared resources, leading to indeterminacy. Because of this it can be hard to make efficient use of parallel computing facilities. Another problem is that most processor architectures in use today were not designed with concurrency in mind. While most architectures support multi-core and multi-threading capabilities, these are very limited (e.g. a high-end consumer processor contains four cores, each capable of running two threads). An example of an architecture using multi-threading and multiple cores is Sun’s SPARC architecture. This problem also led to the research architecture by the Computer Systems Architecture group at the University of Amsterdam: the Microgrid.

When designing a processor architecture from the ground up with the ideas of parallelism, there may be aspects of the processor that can be done more efficiently than is the case in traditional processors. One of these aspects is trap handling: there are many occasions where a notification about something that happened in the processor is desirable, such as exceptions. The delivery


of such notifications in classical architectures often involves stopping and saving the currently executing code to make room for code to handle the notification. It may be possible to do this more efficiently using highly parallel processor architectures, which is what this thesis focuses on.

1.1 Problem

Prior to this project, the Microgrid architecture did not support the handling of traps using software running on the chip itself. There was no mention of such a system in the specification, and the simulator software only implemented a basic system where the entire system was halted. One could then inspect and modify the processor’s state from outside of the chip. While some of the needs for a more advanced trap handling system were solved in other ways (e.g. for I/O notifications), there was no way for a running program to know an exception had occurred. Although some of these exceptions could be detected locally in software (by explicitly checking all intermediate results), this is both inefficient and tiresome for the programmer. Therefore, a system was desired that would inform the program if something went wrong. While a lot of research has been done in the area of traps, this project focuses on a hardware multi-threaded architecture, where the design space is larger than on traditional, mostly sequential, processors.

The requirements for such a system in the Microgrid are that it fits in with the current architecture and its concepts, such as hardware multithreading and dataflow (see chapter 2). Furthermore, the system must be able to provide at least a mechanism for most exception handling, with one of its goals the possibility of instruction emulation. It must be flexible enough to allow for further research and modifications, and perform well enough to be usable (at least as well as classical approaches).

Although a working system with these requirements is expected of this project, it should by no means yield a completely finished analysis and solution to every aspect involved with trap handling. This research should however provide a basis both for further research into the Microgrid architecture and for other (hardware multithreaded) architectures looking to implement a trap handling system.

1.2 Contribution

This thesis’ main goal is to look at the considerations for implementing exceptions in massively hardware multithreaded architectures. For this, a study was done to determine previous implementations in other architectures. Based upon this study an analysis of possibilities will be presented in chapter 3, with their advantages and disadvantages. This research could be used by other architectures based on hardware multithreading to determine the best approach for their implementation of exceptions.

Out of these possibilities an implementation for the Microgrid is presented and implemented in its simulator, MGSim. This allows for even further considerations, both because of the analysis of problems encountered during the implementation and because it allows for performance tests. It also allows research into the Microgrid architecture to further explore the possibilities of this new system.

1.3 Prior work

Over the past fifty years a lot of research has been done into exceptions. A big focus was the concept of precise exceptions, which will be discussed later, in section 3.2. While that research can most certainly be used for this thesis, the use of a hardware multithreaded architecture opens up a variety of new possibilities.


The 1995 paper by Walker and Cragon discusses a number of methods that can be used for saving and serializing processor state [3]. In their research they look at stacks, shadow registers and stacks, checkpointing hardware, and handling exceptions on different processors. They also look at serialization solutions for a number of different architecture designs, such as reordering instructions and keeping a history of certain state. They conclude that the solution differs per architecture, and that by designing the architecture in clever ways many of the problems can be avoided. This includes the strategy of only changing state at the end of the pipeline, or at least after the point where an exception could still cause the instruction to be flushed.

A 1995 paper regarding the M-Machine Multicomputer discusses many aspects of the architecture, including how events are handled asynchronously in a separate thread. This concept was further explored in a 1999 paper by Keckler et al. [4, 5].

Another 1999 study, by Zilles et al., examined speeding up exceptions (mostly TLB misses) by handling the exception in a separate thread in order to avoid squashing in-flight instructions [6]. However, their mechanism is limited to exceptions that return to the excepting instruction, and limits access to registers from the other thread. Because of this the system is not suited as a general exception handling system. Furthermore, while achieving a performance boost over traditional TLB misses, a hardware solution is still faster.

Most relevant to this research is the work done by Michiel van Tol in 2006 [7], which also looked at a possible implementation for exceptions in the Microgrid architecture. That thesis explores two strategies: using a separate handler thread either per family of threads, or per job.

The original idea of having a handler thread assigned to another thread comes from the Mach Exception Handling Facility [8]. Mach is an operating system kernel which features an inter-process communication (IPC) method very different from that of traditional Unix-based systems. This system is also used for its exception handling, where a handler thread can be registered to another thread. When an exception occurs, a message is sent over this IPC message queue (called a port) to the handler, which can send messages back over this port to the excepting thread.

This thesis will first introduce the Microgrid architecture in chapter 2, which serves as a reference point for the later chapters. Then an analysis of traps and their possibilities is made in chapter 3. From this, a single solution is discussed in more detail in chapter 4, as it is implemented in the Microgrid architecture. Finally, chapter 5 discusses experiments performed on the implementation to prove its correctness and allow for comparison to other architectures.


CHAPTER 2

Microgrid

This chapter discusses the details of the Microgrid architecture as it was before this research project, to provide a context for the implementation of traps. The original research and more details can be found in multiple publications [9].

2.1 Overview

The Microgrid is a research processor architecture. It is a project of the Computer Systems Architecture (CSA) Group at the University of Amsterdam, which studies concurrency on many-core processor chips. It aims to be a scalable general-purpose chip which heavily depends on hardware multi-threading. It features a relatively simple in-order, single-issue RISC pipeline with some additional components for multi-threading. The entire thread management, including thread scheduling and creation, takes place inside the core. Besides using many threads inside a single core, the architecture also supports multi-core setups where a core can easily dispatch work to other cores. The architecture aims to be very scalable, ranging from embedded microcontrollers to entire datacenters. An example configuration could be a chip with 256 cores, each capable of running 256 threads.

The threads are managed by the cores themselves instead of by an operating system. This greatly improves performance since all operations can be implemented in hardware. However, it does not allow for the more complex or dynamic scheduling strategies normally employed by an operating system. The threading in the Microgrid makes heavy use of dataflow scheduling. Every register has dataflow state bits, giving each register the ability to be either full (i.e. contain data) or empty (e.g. waiting for a long latency operation). When such a register is later read by another instruction, the thread is suspended until the data is available and the register is filled. Every core contains a Thread Management Unit which enables bulk creation and synchronization of threads over the network connecting the cores. The core also contains a single Register File in which every thread has its own window. This allows threads to easily share registers, and threads can have different window sizes (and thus a different number of available registers). This, in combination with the bulk operations, gives the architecture a very lightweight approach to threads.

In order to be scalable in the number of cores, an easy way to communicate with these other cores is required. For this, a network is present on the chip which connects all cores: a Network-on-Chip (NoC). This network allows for easy sending and receiving of data and instructions from other cores. It is used for the remote creation and synchronization of threads, and for the inspection and modification of remote registers. For requests that are actually local (i.e. have the same destination as source core) a loopback interface is present, so these messages do not


[Figure 2.1 appears here, showing the components of a single core: the pipeline stages (FETCH & SWITCH, DECODE & REGADDR, READ & ISSUE, ALU, LSU, WB), the register files (IRF, FRF), the caches (L1I, L1D & MCU), the asynchronous ALU and FPU, the TMU & SCHEDULER with TT & FT, the NCU, and the NoC and memory connections.]

Figure 2.1: A single core in the Microgrid architecture. The pipeline is shown on the left side. The Execute stage is represented by the ALU and FPU components (blue). The green components together represent the Memory stage. This picture shows two register files: an integer and a floating point version. Note that the FPU may actually exist outside of the core, where it is shared between multiple cores. Image from [10].

actually leave the core. Due to this abstraction these local messages only take a single cycle tobe received again.

2.2 Architecture

Every core has an in-order pipeline consisting of six stages: Fetch, Decode, Read, Execute, Memory and Write Back. It is similar to the classic 5-stage RISC pipeline, except that the Decode stage is split into two stages: Decode (which determines what registers are required) and Read (which actually fetches the registers from the Register File). It has no branch prediction.

There are several other components in the core. There are caches for instructions and memory operations, the I-Cache and D-Cache respectively. There is a single big Register File; every thread owns a (variable) number of registers inside it (its window). The Thread Table stores per-thread information, such as the Program Counter and the start of the window in the Register File. Every thread belongs to a Family, which has its own attributes, stored in the Family Table. These include the initial program counter upon creation and the size of the window in the Register File. The Thread Management Unit facilitates bulk operations such as thread creation and synchronization.

Finally, the Scheduler simply contains a number of lists with thread IDs. When a thread is activated, its instructions are probably not yet available in the I-Cache. Once the data has been fetched from memory and is present in the I-Cache, the thread ID is moved to another list, which holds all threads that are ready to enter the pipeline. The Fetch stage will take one of these threads and put its instruction through the pipeline. When a thread is suspended it is removed from these lists.


2.2.1 Dataflow

Whenever a long latency operation is issued, the target register of that operation is marked as pending in its state bits. This means the register will be filled after an unknown amount of time (it is, however, guaranteed to be filled eventually). Meanwhile, the thread will continue to execute instructions until one of its operands is a pending register. This is detected in the Read stage and causes the thread to suspend. Once the long-latency operation completes, any thread waiting for this register is resumed. The instruction that needed the register as its operand is reissued and normal execution continues.
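The full/pending register behavior described above can be sketched in a few lines of C. This is an illustrative model only, not the Microgrid's actual encoding; all names are ours:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the per-register dataflow state described above: a register
   is either FULL (data available) or PENDING (a long-latency operation
   will eventually fill it). Names are illustrative assumptions. */
typedef enum { REG_FULL, REG_PENDING } reg_state;

typedef struct {
    reg_state state;
    long      value;
} dataflow_reg;

/* Issuing a long-latency operation marks the target register pending. */
static void issue_long_latency(dataflow_reg *r) { r->state = REG_PENDING; }

/* The Read stage: returns false when the operand is pending, which would
   suspend the thread until the completion below fires. */
static bool try_read(const dataflow_reg *r, long *out) {
    if (r->state == REG_PENDING)
        return false;          /* thread suspends here */
    *out = r->value;
    return true;
}

/* Completion of the long-latency operation fills the register; waiting
   threads would be resumed and the blocked instruction reissued. */
static void complete(dataflow_reg *r, long v) {
    r->value = v;
    r->state = REG_FULL;
}
```

A read attempted while the register is pending fails (the thread suspends); after `complete` fires, the reissued read succeeds.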

There are a few cases of long-latency operations currently present in the architecture. The most common one is a read operation from memory. If the requested data is not present in the D-Cache, it must be fetched from a slower part of the memory hierarchy (worst case something like a hard drive or tape), which may take the time equivalent of hundreds of cycles to complete. Thus, if a read instruction reaches the Memory stage and the data is not present in the D-Cache, the register is marked as pending. Any instruction that then uses this data causes the thread to suspend.

Another case of long-latency operations is the Floating Point Unit (FPU). This unit is shared among two cores and may take several cycles to complete its operation. Thus, if a floating point instruction is issued, the Execute stage will queue this operation to the FPU and mark the destination register as pending.

In order to optimize the scheduling, instructions can be annotated with additional bits, one of which forces the Fetch stage to switch to another thread. These annotations are statically added at compile time, and are grouped in four bytes (with 2 bits per following instruction). This switch bit is added when it is highly likely that the instruction would cause a switch anyway because of missing operands. The same system is used for ending a thread: the other bit in these annotations tells that the thread should be killed. The last instruction in a thread has both bits set, causing the thread to be switched away and (when it reaches the Write Back stage) killed.
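A possible decode of such an annotation word can be sketched as follows. The exact bit layout is not specified here, so the ordering and constant names below are assumptions for illustration (a 32-bit word carrying 2 bits for each of the 16 following instructions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bits 2i and 2i+1 of the word annotate the i-th
   following instruction; bit 0 of the pair is "switch", bit 1 is "end". */
enum { ANNOT_SWITCH = 0x1, ANNOT_END = 0x2 };

static unsigned annotation_for(uint32_t word, unsigned instr_index) {
    return (word >> (2 * instr_index)) & 0x3u;  /* instr_index in 0..15 */
}

/* The Fetch stage would switch to another thread after this instruction. */
static bool should_switch(uint32_t word, unsigned i) {
    return (annotation_for(word, i) & ANNOT_SWITCH) != 0;
}

/* The thread is killed once this instruction reaches Write Back. */
static bool ends_thread(uint32_t word, unsigned i) {
    return (annotation_for(word, i) & ANNOT_END) != 0;
}
```

The last instruction of a thread would have both bits of its pair set, so both predicates hold for it.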

2.3 MGSim

In order to perform tests on this new architecture, a full-system simulator was written in C++ at the University of Amsterdam [11, 10]: MGSim¹. Its major purpose is to support scientific research on many-core general purpose processor architectures, which is also the context in which it is used for this research project. However, MGSim was also written for educational purposes on the subjects of computer architecture, parallel programming, compiler construction and operating system design. The simulator is a software implementation at the component level, meaning components and the relations between them are the same as in the actual architecture. This level of abstraction provides a balance between performance and fidelity to the actual hardware implementation. On the one hand, since this thesis aims to study the performance of a solution at the level of individual instructions, a cycle-accurate simulation is required. For this requirement, a higher-level, functional core model would be inappropriate because most existing models do not describe the combination of dataflow scheduling with hardware multithreading and would thus be too inaccurate. MGSim provides cycle-level core simulation and thus satisfies this requirement. On the other hand, since this thesis only explores the design considerations, a more detailed simulation (e.g. at the gate level) would be inappropriate. Such a simulation would have more overhead and therefore run slower, leading to experiments that do not fit in the time frame for this project. The ability of MGSim to run C programs within a few seconds of real time provides the appropriate balance between accuracy and run time of experiments.

¹ http://github.com/svp-dev/mgsim


2.4 Instruction Set Architecture

The Instruction Set Architecture (ISA) is an abstraction layer between the hardware and the software running on top of it. It defines all instructions, registers, I/O and exception mechanisms, etc. Software is written for a specific ISA, and the hardware is an implementation of an ISA. Two different chips can implement the same ISA (and are therefore both able to run the same software) but can have different inner workings (affecting, for example, performance and production costs) [12]. A well known example of this is the x86-64 instruction set, which has many different implementations (like the Pentium 4 and Core i7 families by Intel and the Athlon 64 and FX families by AMD).

For the Microgrid two ISAs exist, which are extensions of the Alpha and SPARC ISAs. The extensions to both ISAs are the same; only the basis is different. However, due to the different instruction formats, the opcodes are also different. Other than the instruction decoding and opcode evaluation (in the Decode and Execute stage respectively), all other logic is shared. This is clearly visible in MGSim: there are ISA-specific Decode and Execute stages, while all other components are shared between ISAs. The exception system that will be defined by this thesis will also be applicable to both ISAs. The implementation, on the other hand, will only be delivered for the Alpha ISA. Note that not all aspects of the original ISAs are followed. For example, the PALcode functions of the Alpha instruction set are not available.

One of the concepts defined in the ISA is how I/O currently works. Since this is normally achieved with interrupts, it is worth noting how this problem was solved. All I/O, meaning communication outside of the processor, is achieved through memory mapped I/O. There are regions of memory reserved for certain devices. When data is written to such a region, it is sent to the corresponding external device. The same happens when such a location is read. However, such reads will not complete until the device actually sends data. This means the register is marked pending, and when the register is later used the thread will suspend because of the dataflow scheduling. In practice this means a thread is dedicated to receiving I/O events, and is woken up when such an event happens. This works almost like a handler that runs completely concurrently with the other threads [13].
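The address-decoding half of this scheme can be sketched as follows. The region bounds and the toy device model below are invented for illustration; the actual reserved regions are not specified here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reserved device region (assumed bounds, for illustration). */
#define MMIO_BASE 0xFFFF0000u
#define MMIO_SIZE 0x1000u

static bool is_mmio(uint32_t addr) {
    return addr >= MMIO_BASE && addr < MMIO_BASE + MMIO_SIZE;
}

/* Toy bus: a store to the reserved region is forwarded to the external
   device instead of landing in memory. */
typedef struct {
    uint32_t last_device_write;
    int      device_writes;
} bus;

static void bus_write(bus *b, uint32_t addr, uint32_t data) {
    if (is_mmio(addr)) {
        b->last_device_write = data;  /* sent out to the device */
        b->device_writes++;
    }
    /* else: a normal store into memory (omitted in this sketch) */
}
```

A read of such a region behaves symmetrically, except that it leaves the destination register pending until the device responds, which is what lets the dedicated I/O thread suspend cheaply.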

A developer toolchain for the ISAs is provided by a modified version of GCC² and GNU Binutils³ (containing, among other things, the assembler). These modifications allow compilation of SL [14] (C with Microgrid-specific extensions) programs for the Microgrid architecture.

² http://github.com/svp-dev/gcc
³ http://github.com/svp-dev/binutils


CHAPTER 3

Design space exploration

Traps are a mechanism for giving notifications that something has happened to the program currently executing on the processor. They are sometimes also called exceptions, faults or interrupts; the terminology differs among the literature. Therefore, we define the terms as follows: an Exception or Fault indicates something went wrong during execution. This generates an Interrupt: a signal sent upon the detection of an event, such as an exception. These interrupts can come from either inside or outside of the chip. This signal is finally received by a trap handler, which will perform some action in response, such as pausing execution and triggering some handler code. An example of an exception would be an arithmetic error such as division by zero, or an invalid memory access. The code that runs when an exception occurs is called the handler, sometimes also called the Interrupt Service Routine (ISR).

The general overview of how exceptions change the control flow often resembles:

1. An exception occurs and is detected.

2. Normal execution is stopped.

3. A context switch to the handler occurs, saving the old state somewhere.

4. Handler code runs.

5. The original program is restored and normal execution continues.

Traps are useful for event-driven programs. The alternative is to manually check every so often whether some event occurred (polling). Especially in the case of exceptions this is inefficient: one would, for example, need a check after every division (often consisting of multiple instructions, including branches). While it is entirely possible to do without exceptions, it is often convenient to have them. In the case of exceptions, the general rule is that the hardware cost outweighs the cost of software checking, since these conditions are so exceptional.
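The software-checking alternative mentioned above can be made concrete with a small sketch: every division carries an explicit branch on the divisor, whereas a trap-based design pays nothing on the common path. The function name is ours:

```c
#include <assert.h>
#include <stdbool.h>

/* Explicit per-operation check: this branch runs on EVERY division, even
   though the divisor is almost never zero. With traps, the hardware
   detects the rare case and the common path carries no extra code. */
static bool checked_div(long a, long b, long *quot) {
    if (b == 0)       /* the check a trap mechanism would make implicit */
        return false; /* would instead raise a divide exception */
    *quot = a / b;
    return true;
}
```

Multiplied over every division in a program, these branches are exactly the per-operation cost the trap mechanism amortizes into rare handler invocations.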

The handler can do multiple things when an exception occurs. It can try to fix the situation and resume normal execution. It may also simply report the problem, or stop the offending code from running any further. In order to do this the handler must have access to some of the context of the offending instruction, such as its registers. There may be a single handler which is called for any kind of exception, or a different handler per exception (Vectored Interrupt). When an exception occurs inside the handler, this is often called a double fault. Because the implementation of trapping mechanisms differs per architecture, the behavior of double faults also differs. They are, however, quite rare in most architectures. For example, they are so exceptional that in the x86 architecture multiple occurrences of this event (i.e. a triple fault) reset the entire processor.
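A vectored-interrupt dispatch of the kind described above amounts to a table of function pointers indexed by exception type. The trap types and names below are illustrative assumptions, not any particular architecture's layout:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical exception types for this sketch. */
enum { TRAP_DIV_ZERO, TRAP_INVALID_MEM, TRAP_COUNT };

typedef int (*trap_handler)(void *context);

/* One handler slot per exception type (the "vector table"). */
static trap_handler vector_table[TRAP_COUNT];

static void register_handler(int type, trap_handler h) {
    vector_table[type] = h;
}

/* Dispatch on an exception: run the registered handler for this type,
   or return -1 when none is installed. */
static int dispatch_trap(int type, void *context) {
    if (type < 0 || type >= TRAP_COUNT || vector_table[type] == NULL)
        return -1;
    return vector_table[type](context);
}

/* Example handler: just reports that the fault was seen. */
static int report_div_zero(void *context) {
    (void)context;
    return 1;
}
```

A single-handler design is the degenerate case of this table with one entry, with the type passed to the handler instead.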

Since exceptions are often expensive to handle, in some cases it is possible to filter out certain exceptions, or disable them altogether. The first concept is often called exception masking. It allows the program to define per exception type whether it should cause an interrupt or should


be ignored. This concept could be used in hardware multithreaded designs, although it must be considered whether the cost of implementing such a mechanism (requiring, in the worst case, per-thread exception mask storage and logic to alter it) outweighs the performance cost of simply installing an empty handler. This was often not the case in classic architectures, but may be possible due to efficient concurrent exception handling. A system to enable and disable interrupts altogether is often implemented besides an exception mask. The implementation cost of this system is much lower (requiring at most a single bit of extra administration per thread), but may not be of much use.
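The two mechanisms just discussed, a per-type mask and an all-or-nothing enable bit, can be sketched together; the one-bit-per-type layout is our assumption:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread trap control: one mask bit per exception type,
   plus the cheaper single global-enable bit discussed above. */
typedef struct {
    uint32_t mask;    /* bit i set: exception type i causes an interrupt */
    bool     enabled; /* clear: all interrupts for this thread disabled */
} trap_control;

/* Decide whether an exception of the given type should trap. */
static bool should_trap(const trap_control *tc, unsigned type) {
    return tc->enabled && (tc->mask & (UINT32_C(1) << type)) != 0;
}
```

The mask costs one word of storage per thread plus update logic; the enable bit alone costs a single bit, which is why the text weighs them separately.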

3.1 Traps in other architectures

The most common implementation of traps is to pause execution, dump the state of the current context (such as status words) and then set the program counter to the location of the handler. In the x86 architecture this is exactly what happens: the current state (status words and program counter) is pushed onto the stack and the program counter is modified. The handler runs in the same context, so any registers that will be used by the handler itself must first be manually saved. When these are modified by the handler without being saved, the original program will suddenly see different values. At the end of the handler a special instruction is used which informs the processor to restore the status words and program counter from the stack.
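This push-and-restore sequence can be modeled in a few lines. This is a toy simulation of the idea, not x86 semantics; all structure and field names are ours:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal context: the state the text says gets saved on a trap. */
typedef struct { uint64_t pc, flags; } cpu_state;

/* Small stack of saved contexts (nesting traps pushes more entries). */
typedef struct { cpu_state saved[8]; int top; } trap_stack;

/* Trap entry: push PC and status word, then jump to the handler. */
static void enter_trap(cpu_state *cpu, trap_stack *st, uint64_t handler_pc) {
    st->saved[st->top++] = *cpu;
    cpu->pc = handler_pc;
}

/* The special return instruction: pop the saved state back into place. */
static void return_from_trap(cpu_state *cpu, trap_stack *st) {
    *cpu = st->saved[--st->top];
}
```

Note what the model leaves out: general-purpose registers are not saved here, which is exactly why, on x86, a handler must save any register it touches before using it.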

In ARM there are two levels of interrupts: normal and fast interrupts. For the Fast Interrupt Mode a number of registers are banked: registers 8 through 13 are replaced by different registers, only available in this mode. This has the benefit of not having to push registers to the stack (with expensive memory operations), and being able to do calculations immediately. Furthermore, since these registers are only available to the (fast) interrupt handlers, they may contain some persistent state. The downside of this system is that the replaced ordinary registers are not accessible from within these fast interrupts, and that this solution requires more complex hardware [15]. Apart from the banked registers, normal interrupts work the same way, except that fast interrupts have a higher priority.

The SPARC architecture uses a technique called a register window, which is also used for its interrupts. The register window allows a procedure to share registers with its caller and callee. Every procedure has access to 24 registers inside a bigger register file (the register window). When a procedure call takes place, this window slides 16 places, which causes an overlap of 8 registers between the two procedures. So at any point in time there are 8 registers shared with the caller, 8 local registers and 8 registers that will be shared with the callee. This mechanism is also used in the case of an interrupt: the register window simply slides 16 places (and any other state, such as status words, is saved).
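The sliding arithmetic can be captured in a small C model (a sketch; the constants follow the numbers above, the function name is our own): a window-relative register number maps to an index in the larger register file by adding the current window base, and a call moves the base by 16, so the caller's last 8 registers coincide with the callee's first 8.

```c
#include <assert.h>

/* Model of SPARC-style register windows: each procedure sees 24 registers
 * of a large register file; a call slides the window by 16 places, so the
 * caller's registers 16..23 are the callee's registers 0..7. */
#define WINDOW_SIZE 24
#define SLIDE 16

/* Translate a window-relative register number (0..23) to an index in the
 * global register file, given the current window base. */
int physical_reg(int window_base, int reg) {
    assert(reg >= 0 && reg < WINDOW_SIZE);
    return window_base + reg;
}
```

For example, a caller at base 0 and its callee at base 16 (= 0 + SLIDE) share the physical registers 16 through 23.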

In other, more hardware-multithreading-oriented architectures such as the M-Machine, the handler code runs concurrently with the code that caused the exception. This way (almost) no overhead is incurred by exceptions, although it may make the handler code harder to write. These approaches often offer an exception barrier, which pauses the thread until any outstanding exceptions are resolved.

While not a trap-like system in a processor architecture, signals in POSIX-compliant operating systems are still relevant. In most Unix-like systems they provide the only method for userspace programs to receive exceptions, and form one of the major methods available for inter-process communication. A program can register, per type of signal, a function that should be called when that signal occurs. The execution is stopped by the operating system, and the signal handler is called within the same context as the program itself. Because for most processor architectures threading is handled by the operating system, this mechanism can be substituted by more low-level constructions in the Microgrid architecture, where threading is handled in the processor itself.


3.2 Preciseness

Since processors are rather complex, it may not be possible to call the handler at the exact moment an exception occurs: instructions after the offending instruction may already have altered some state. Therefore, we use the definition of precise exceptions as given by Smith and Pleszkun [16]:

• All instructions preceding the offending instruction have finished and will not alter the state anymore.

• No instructions following the offending instruction have altered the state.

• The exact instruction that caused the exception is known (e.g. via the program counter).

When these conditions are not met by an exception, it is imprecise. There is often a trade-off between the preciseness of exceptions and the complexity and performance of the architecture. It may be necessary to roll back certain state, especially in architectures featuring out-of-order execution. When a rollback is not possible but precise exceptions are still required, it may be necessary to (preemptively) stall the pipeline in case an exception occurs. This is of course a big performance loss for something that should not happen often, and may therefore be enabled only in a debug mode.

Since the Microgrid architecture uses in-order execution, precise exceptions are rather trivial in most cases: when an exception occurs inside the pipeline, the pipeline can simply be flushed. Flushing the pipeline ensures that all instructions after the offending one are immediately removed, and that all instructions before it finish, since they are allowed to leave the pipeline normally. There are, however, exceptions to this: exceptions that are not caused by the pipeline, i.e. asynchronous exceptions. The primary example of this is the FPU. FPU operations are issued to the FPU in the execute stage, but may be executed hundreds of cycles later. While some of these problems can be detected before they are sent to the FPU, such as division by zero, some problems like over- and underflows are only known during the actual calculation. For these exceptions, there are three possibilities:

• Define these exceptions to be imprecise.

• Stall the execution (of the thread) until the floating-point operation has been completed.

• Introduce a rollback mechanism.

The second option, stalling the thread, may be very expensive performance-wise, especially since the FPU is shared between multiple cores. The introduction of a rollback mechanism, on the other hand, may be very expensive hardware-wise (more complexity and more components). It is probably advisable to have a debug mode which stalls the execution until the result returns, since precise exceptions are probably not often required for this special case. To verify this, a number of popular open source packages1 were checked, showing that this mechanism is rarely used in software.

3.3 Handler

The context in which the handler runs determines much of the workings of the trap handling system: it determines what needs to be saved when launching the handler, and what possibilities the handler has. The major options for the context of the handler in hardware multithreaded architectures such as the Microgrid are:

• The exact same context (change program counter to handler).

• A slightly altered context, e.g. provide (a few) different registers.

• A completely different context (where both contexts exist at the same time).

1 GCC, Binutils, Linux, Awesome

3.3.1 Inside the victim’s context

The first option is a proven strategy, and the cheapest in area but more complex in logic. Since the handler runs on the same hardware as normal programs, no hardware has to be added for this. However, any required state such as status words and the program counter must be stored to memory, which requires additional logic for a state machine and coordination with the memory stage. When the state is saved, the program counter is simply changed to the address of the handler code. If the handler needs any more registers it will probably have to save these to memory too. The main problem with this is that context switches are expensive: the memory must be accessed both for entering and for leaving the handler. This is what x86 and the DEC Alpha use.

The second option tries to reduce the number of memory operations required for a context switch to the handler by using specialized hardware. This is what ARM and SPARC do: they (partly) replace the visible registers so these do not have to be stored to memory. The original program counter is stored similarly. While this is a more expensive solution hardware-wise, it is still relatively easy to add to an architecture. It behaves almost the same as the first option, only making state saving cheaper performance-wise.

3.3.2 Separate thread

The last option, which is conceptually very different from the other two, is to create an entirely new context for the handler. This way nothing has to be stored in memory. This approach is, however, the most costly in chip area, at least in classical processor architectures. But when an architecture already provisions silicon to support many hardware threads, it becomes advantageous to reuse them for trap handling as well. In the Microgrid, every thread already has its own context, and switching between these is almost free (without any memory operations). Thus, if the victim thread (where the exception occurred) is suspended and another thread runs the handler code, performance is greatly improved. This also fixes another problem the other approaches would have: in the Microgrid architecture, every thread can have a different number of registers available. So when the handler needs 30 registers (which is defined at compile time) and the victim thread only has 10 registers, the handler code simply cannot run in that context. By running the handler in its own dedicated thread, there will always be enough registers allocated for the code to run.

This strategy can be implemented by spawning a new thread upon every exception, or by having a thread dedicated to handling exceptions, requiring only a wakeup. The first approach, spawning new threads for every exception, has multiple problems. Thread creation may have to wait until a thread entry is available, and can thus cause unbounded latency before the fault is handled, since in the worst case the required resources never become available. On the other hand, when there already is a thread waiting, a comparatively simple wake-up is sufficient, and has a bounded latency. In the (very unlikely) event that all threads on a core have an exception, and each one wants to spawn a handler, not a single handler can ever be spawned. This causes all threads to remain suspended, deadlocking the core. When pre-allocating a thread for handling exceptions, this scenario is much less likely to happen. The downside of having such a thread is that it always occupies resources, even though it is idle most of the time.

Both spawning threads and having dedicated handler threads allow for having a single handler thread per core (at most), having a handler thread per thread, or something in between. The trade-off here is that with fewer handler threads, an exception may have to be buffered before it can be processed. Furthermore, exception handlers cannot be interleaved when there is only a single thread doing all the work, causing the processor to potentially sit idle during memory I/O. Conversely, with a low threads-per-handler ratio, the core gets clogged with large amounts of (mostly idle) handler threads taking up resources. Therefore, a middle ground is probably preferred: have a handler thread per family or group of families.


For the handler to do anything useful, it must have access to at least some of the state of the victim thread. This must at least include which exception occurred and the current program counter; access to the victim's registers may also be desirable. Depending on the architecture, these operations may or may not already be possible. In the case of the Microgrid, these actions were not possible before this project and were therefore added. When introducing such a mechanism, security measures must be considered, since otherwise any thread may inspect or modify any other thread. It may also be desirable to perform memory operations from the context of the victim thread, so that its permissions are adhered to. If this is overlooked, the handler may write to memory the victim thread did not have permission to access.

When letting another thread handle an exception, it is not required that this thread runs on the same core. It may be desirable to run these handlers on a different core to balance the load of the processor. It is even possible to dedicate a core entirely to handling exceptions. This is especially interesting for systems with non-uniform cores: when an exception occurs on a core with special-purpose components on it, it is a waste to occupy this core with exception handlers that do not need this hardware. In this case, a lightweight core would be more appropriate. The downside of relocating handlers to different cores is that this may introduce significant overhead, which may be incompatible with real-time event processing with tight latency deadlines.

3.4 Causes

There are many places in the processor where something can go wrong. Depending on where the exception occurs, different mechanisms may have to be used. These places can be generalized into three different situations. On top of those, it may be desirable to use the trap handling system for more general-purpose situations.

3.4.1 In-pipeline exceptions

The first case is when an exception occurs in any stage of the pipeline, or is directly triggered by the pipeline. When this happens, it is known exactly which instruction caused the exception, and which instructions came before and after it (in in-order architectures like the Microgrid). In the case of the Microgrid, these exceptions are:

• Invalid I-Cache read (from the Fetch stage).

• Unknown instruction format (from the Decode stage).

• Invalid register (from the Read stage).

• Arithmetic errors (divide by zero, under-/overflow, from the Execute stage).

• Software breakpoints (from the Execute stage).

• Invalid D-Cache operations (from the Memory stage).

With in-pipeline exceptions, when the execution needs to be stopped, this is often as easy as flushing the pipeline. This mechanism is already in place in most architectures, since for example a branch causes the program counter to change, invalidating all instructions currently in the pipeline before the execute stage. A flush causes all previous stages (including the causing stage) to be cleared, effectively turning their instructions into nops. The only problem with this approach may come from external state that is influenced by an earlier stage of the pipeline. In the Microgrid architecture there are fortunately not many of these scenarios: most actions only modify (external) state in the WriteBack stage. The only exceptions to this are the issuing of floating point operations to the FPU (in the Execute stage) and D-Cache operations (in the Memory stage). This can be solved by having a rollback mechanism in place, or by stalling the processor in certain dangerous situations.


For the Microgrid, D-Cache operations are not a problem, since the WriteBack stage cannot generate any exceptions, and thus cannot cause the Memory stage to be flushed. In the case of floating point operations, it should be observed that the Execute stage can only receive a flush when the Memory stage has an exception. In turn, the Memory stage can only have an exception if the instruction before the floating point instruction is a memory operation from the same thread. The proposed solution is to simply avoid this situation, which can be achieved by stalling the pipeline when this specific situation is detected. This approach may be further optimized by having the compiler either avoid these situations, or at least annotate the memory instruction with the switch bit.

3.4.2 Asynchronous exceptions

Floating point operations are often more expensive than most other operations: they take multiple cycles to complete. In the Microgrid the FPU is an asynchronous component with its own pipeline, running alongside the normal pipeline. FPUs can even be shared by multiple cores, so it is possible for floating point operations to end up in a buffer, causing them to take even longer to complete. Floating point operations are issued to the FPU in the Execute stage, while the thread continues to run normally. The thread only stalls when the output register of the FPU operation is required as input for another operation.

When the FPU generates an exception, it is therefore not trivial to know which instruction caused it. More importantly, the exception is most likely not precise. When precise exceptions are critical, there must either be a rollback mechanism in place (which may have to roll back tens of instructions), or the issuing thread must be suspended until the floating point operation completes. It should be carefully analyzed whether these precise exceptions are actually required; they may be offered in the form of a debug mode.

3.4.3 Memory exceptions

The last category of exception causes are those caused by memory components. These components may be located outside of a core, which introduces a range of new considerations regarding exceptions, on top of those of asynchronous processes.

It may not be known which thread caused the memory operation; it may even have been caused by multiple threads. In this case, it is hard to decide which handler should be called and which threads should be suspended (if any). When the threads are known and multiple threads are responsible for the exception, it may not be desirable to start multiple handlers for what is basically the same exception: when both handlers try to fix the situation in the memory component, they may cause even more problems. This problem becomes even more complicated when there are multiple cores involved in the exception. The exception could be sent to a single core, or to all cores. In order to avoid problems with multiple handlers trying to fix the same situation, a possible solution is to have a single thread on a single core dedicated to handling such exceptions. However, this approach lacks flexibility and may cause specific problems not to be fixed correctly, or at all.

An even bigger problem can occur when something fails during an I-Cache read or any of its resulting operations in the memory hierarchy. This would cause the thread(s) to stall until this problem is solved. More importantly, this may render the handler unrunnable since its code cannot be loaded into the I-Cache, causing a deadlock. This was, however, outside the scope of this research, and thus requires further research to find a suitable solution.

Translation lookaside buffer

One of the major uses of exceptions in traditional systems is the handling of virtual memory. Virtual memory is an abstraction of the memory system where, from the perspective of running processes, there is a single, contiguous memory space. In reality, this memory may be located anywhere, and may not even be loaded at all. This mapping may exist either per process or globally. The entire mapping of virtual addresses to physical memory is usually kept in the page table. This structure can be quite big, and is stored in memory. In order to speed up address translation for memory accesses, a cache structure is often used which contains a subset of the page table: the translation lookaside buffer (TLB). This structure allows for fast lookups by using content addressable memory (CAM), which, given a value, can return the index at which it occurs in a single cycle. The downside of this piece of hardware is that it is rather expensive to produce.

Since the TLB is of limited size, a mapping may not be present; this is called a “TLB miss”. When such an event occurs, either the hardware itself looks up the virtual address in the page table, or an exception is generated. Using the first approach, it is possible for an entry to also be absent from the page table, which will also generate an exception (often called a page fault). In both cases, some handler code (provided by the operating system) is executed to fix the situation. In case of a TLB miss, the handler must consult the page table itself. When the entry is not present in the page table, the handler must load the corresponding page into memory (e.g. from a hard disk).
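The translation path just described can be sketched as a toy software model (all sizes and names here are illustrative, not taken from any particular architecture): try the TLB first, fall back to the page table on a miss and refill the TLB, and report a page fault when the page table has no mapping either.

```c
/* Toy model of address translation with a TLB and page-table fallback. */
#define TLB_ENTRIES 4
#define PT_ENTRIES  64
#define PAGE_SHIFT  12

typedef struct { int valid; unsigned vpn, pfn; } TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];
static unsigned page_table[PT_ENTRIES];  /* vpn -> pfn; 0 means unmapped */
static unsigned next_victim;             /* round-robin TLB replacement */

/* Returns the physical address, or 0 to signal a page fault. */
unsigned translate(unsigned vaddr) {
    unsigned vpn = vaddr >> PAGE_SHIFT;
    unsigned off = vaddr & ((1u << PAGE_SHIFT) - 1);
    for (int i = 0; i < TLB_ENTRIES; i++)            /* TLB hit? */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].pfn << PAGE_SHIFT) | off;
    if (vpn >= PT_ENTRIES || page_table[vpn] == 0)   /* page fault */
        return 0;
    TlbEntry *e = &tlb[next_victim++ % TLB_ENTRIES]; /* TLB miss: refill */
    e->valid = 1; e->vpn = vpn; e->pfn = page_table[vpn];
    return (e->pfn << PAGE_SHIFT) | off;
}
```

In a software-managed-TLB design the refill step in the middle is exactly what the operating system's TLB miss handler performs; in a hardware-walked design only the page-fault case reaches software.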

Since this action is often expensive, and most likely critical for the execution of a program, as little overhead as possible is preferred. Furthermore, this action is not specific to a single thread (or even core): the operating system should provide a handler which is called for every TLB miss or page fault. While it may be possible to run multiple of these handlers at the same time, synchronization between them is critical to preserve page table consistency.

3.4.4 Events

While the Microgrid does not need an interrupt-like system for its I/O events, it may still be desirable to use the trap system for such things. Since it is already in place for exceptions, it can serve as a very generic platform for these events. The discussed concept of having a handler thread per thread or family can easily be adjusted to have a handler per source, where the source can either be a thread/family or another component on the chip. This way, there could be a single handler for every I/O event or per I/O device, depending on the desired level of control. This system could also be used for timer events such as present on many current chips, or for core-to-core messages, building on the already present NoC.


CHAPTER 4

Implementation

Out of all the discussed possibilities for trap handling in hardware multithreaded architectures, a single one was chosen to be implemented in the Microgrid architecture and its simulator, MGSim. The result of this implementation, and all modifications made to MGSim, can be found in git1.

4.1 Overview

Figure 4.1: The first part of the new Exception Table, where per thread its handler and exception-related status are stored.

The general idea of the implemented design is that every thread can appoint another thread, on the same core, as its handler. Multiple threads can appoint the same thread as handler. This handler thread should be pre-allocated by the program; it will not be created automatically when appointing a handler, nor when an exception occurs. The handler is suspended when it has no exceptions to handle. When an exception occurs, the victim thread is suspended and the handler thread is woken up if necessary. The handler thread can then use (the new) instructions to inspect, modify and resume the victim thread. Most of the exceptions are defined to be precise at no extra cost, except for floating point exceptions that come from the FPU itself, as discussed in section 3.2. There is currently no mechanism in place to make these exceptions precise. Both for testing purposes and to provide functionality for software breakpoints, the first and most basic addition is the trap instruction. This instruction simply always generates an exception.

The administration for this system is all located in a single new component, the Exception Table. The component actually contains two hardware memory structures, both indexed by thread ID:

• The handler/status table is a structure which maps a TID to its handler thread, its current exception flags and whether the thread is currently waiting for exceptions to occur (if the thread is a handler).

• The active handler table maps a handler TID to any victim TIDs, that is, threads with exceptions that are unhandled and should be handled by that handler. A lookup in this table by the handler will return a single victim TID, if any. This is a separate structure so it can be efficiently searched for where a handler is needed: it is implemented as content-addressable memory (CAM). This table contains either nothing (no exception) or the handler TID (unhandled exception) for every TID. Conceptually, this table is a per-handler queue of pending exceptions (threads), merged together in a (very fast) structure.

1 On GitHub: http://github.com/koenk/mgsim

Figure 4.2: The second part of the new Exception Table, where for every thread it is stored whether it needs a handler to handle new exceptions. This structure is implemented using content addressable memory, so the handler can look up its own TID and immediately know which thread needs that handler.

The currently active exceptions are denoted by a bitmask in this table. It is possible for multiple exceptions to be active at the same time, such as a pipeline exception together with an asynchronous exception from the FPU. Every bit corresponds to a type of exception; these can be found in table 4.1.

Bit  Exception
0    Software breakpoint
1    Arithmetic error: divide by zero
2    Arithmetic error: under-/overflow
3    Unknown instruction format or opcode
4    Invalid I-Cache read
5    Invalid (unmapped) register access
6    Unaligned memory access
7    Invalid memory address
8    Invalid memory permissions
9    Translation lookaside buffer (TLB) miss
10   Floating point exception (if enabled)

Table 4.1: The possible exceptions with their corresponding bit in the bitmask. For example, to check whether an unknown opcode exception occurred, bit 3 should be tested, using a bitmask of 8 (binary 1000).

4.2 Detection

When an exception occurs, the associated pipeline stage will detect it. This causes a flush of the pipeline, and information about the exception is added to the outgoing latch of the stage. This flush clears all following instructions (in earlier stages of the pipeline) and forces a thread switch. The exception information flows through the pipeline until it reaches the WriteBack stage. When this stage detects the information, it suspends the thread and passes the information to the Exception Handler unit. This unit (as shown in figure 4.3) updates the currently active exceptions of the victim thread, retrieves the thread's handler thread and marks the victim thread as having an unhandled exception. Lastly, the handler thread is activated if it was suspended waiting for an exception.

Letting this information flow to the WriteBack stage solves a few problems. First and foremost, it makes it a lot easier to have only one (synchronous) exception per thread. An instruction in an earlier stage can also have an exception, but this will be flushed by any exception in a later stage. It also greatly reduces the complexity of the hardware: only one connection between the Exception Handler and the pipeline is required. The implementation cost of the current solution is rather small: a single latch has to be added between every stage to hold the exception information. Arbitration for the Exception Handler is still required, however, when both the pipeline and an asynchronous component cause an exception.

Figure 4.3: The new Exception Handler unit with its connections to the pipeline (WriteBack stage), Exception Table and Scheduler.

There is no form of exception masking present. If such a system were required, it would be rather trivial to implement: the mask could be stored per thread in the Exception Table as a bitmask, and could then easily be fetched and compared inside the Exception Handler. Similarly, there is no global way to disable interrupts. Such a mechanism is more useful when traps are used for more common events such as I/O events, and when they cause the entire core to stall, which is not the case in this implementation.

4.3 Handler thread

A handler thread is a thread that may be expected to handle an exception at any point in time. Any thread can be used as a handler. There is a new instruction that a thread uses to check whether there are any exceptions it should handle: chkex. This instruction returns the thread ID of a (single) thread that had an exception and appointed this thread as handler. This thread ID is written to the register passed as an argument to chkex. Any subsequent call will return another pending exception, until there are no more exceptions to handle for this thread. At that point the target register is marked as pending, which causes any subsequent instruction using that register to suspend the thread. When a new exception occurs, the register is filled and the thread is resumed. This mechanism functions the same way as the I/O system discussed in section 2.4, using dataflow. The suggested format for a handler is to run a loop, starting with the chkex instruction.
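The suggested handler structure can be modelled in software as follows (a sketch; all names are hypothetical, and the suspension of the handler thread is modelled by a -1 return instead of an actual thread suspend):

```c
/* Software model of the suggested handler loop around chkex. */
#define MAX_PENDING 8
static int pending[MAX_PENDING];   /* ring buffer of victim TIDs */
static int head, tail;

/* Model of the chkex instruction: return a victim TID with a pending
 * exception, or -1 where the real instruction would suspend the thread. */
static int chkex(void) {
    if (head == tail) return -1;
    return pending[head++ % MAX_PENDING];
}

/* Model of the Exception Handler unit enqueueing a victim. */
static void raise_exception(int victim_tid) {
    pending[tail++ % MAX_PENDING] = victim_tid;
}

/* The handler loop: drain pending victims; returns how many were handled. */
int handler_loop(void) {
    int handled = 0, tid;
    while ((tid = chkex()) >= 0) {
        /* here the real handler would inspect, fix and resume thread tid */
        handled++;
    }
    return handled;
}
```

In the real design the loop never terminates: the chkex at its head simply blocks the handler thread via the dataflow mechanism until the next victim arrives.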

The chkex instruction is mostly handled in the execute stage of the pipeline. Here, the Exception Table is queried (more specifically, the active handler table part). Because of the CAM implementation this lookup can be done in a single cycle. It returns a single TID of a thread that has an unhandled exception, and automatically clears the entry in that table. In order to ensure fairness in the CAM lookup process, the lookup should start from the last returned TID; otherwise, threads with higher thread IDs might experience starvation when it comes to exception handling. When no victim thread is found, the target register is given the pending state and its address is recorded in the Exception Table, so that when an exception occurs it is known which register should be written to.
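The fairness argument can be made concrete with a small model of the lookup (a sketch with hypothetical names): the search over the active handler table resumes just past the last TID returned, so a low-numbered victim cannot permanently shadow a high-numbered one.

```c
/* Model of the round-robin CAM lookup in the active handler table. */
#define NUM_THREADS 8

/* active_handler[tid] holds the handler TID of an unhandled exception
 * of thread tid, or -1 when that thread has no pending exception. */
static int active_handler[NUM_THREADS] = {-1, -1, -1, -1, -1, -1, -1, -1};
static int last_returned = NUM_THREADS - 1;

/* Return a victim TID appointing `handler`, clearing its entry (as the
 * hardware does automatically), or -1 when there is none. The scan
 * starts after the last returned TID to avoid starvation. */
int cam_lookup(int handler) {
    for (int i = 1; i <= NUM_THREADS; i++) {
        int tid = (last_returned + i) % NUM_THREADS;
        if (active_handler[tid] == handler) {
            active_handler[tid] = -1;
            last_returned = tid;
            return tid;
        }
    }
    return -1;   /* no victim: the target register would go pending */
}
```

With a fixed scan order starting at TID 0 instead, a thread that keeps faulting could starve every thread behind it; the rotating start point bounds the wait of every victim.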

4.3.1 More efficient chkex

In order to optimize the latency and implementation cost of the handlers, and to provide a more intuitive interface to the software, the actual implementation of the chkex instruction is somewhat different from what was previously discussed. It does not use the existing dataflow mechanisms, and is therefore conceptually more complex; during its implementation it introduced several more problems that had to be addressed. The main difference is that the chkex instruction itself suspends the thread when no exceptions were found, instead of marking the target register as pending. When a thread is suspended because of the chkex instruction, this is marked in the Exception Table. When, during the occurrence of an exception, this is detected by the Exception Handler, the handler thread is rescheduled. This reruns the chkex instruction, which now returns the thread ID of the thread that just had the exception.

Looking at the details of this implementation: when no exceptions are found, chkex causes a pipeline flush (forcing an immediate thread switch) and marks the thread as suspended. This suspended state flows to the WriteBack stage, where it is propagated to the scheduler. There is, however, a number of cycles during which the thread is marked as suspended but not actually suspended yet. When an exception reaches the Exception Handler during this time, it will see the thread as suspended and try to reactivate it, which is not possible. Therefore, the actual marking of a thread as waiting in the Exception Table is done in the WriteBack stage. The remaining problem is that an exception arriving in the meantime would leave the thread suspended by chkex without the Exception Handler ever waking it. It is not possible to do all the logic in the WriteBack stage: a cycle is not long enough to query the Exception Table, get a result, and send this result to the Register File. So the WriteBack stage redoes the CAM lookup, but only reschedules the thread, without actually writing the result to the register. When the thread is reissued and the chkex reaches the execute stage, it sees this new exception as it normally would.

While more efficient and using less storage (the Exception Table does not need to hold a register address to write to when an exception occurs), this does bypass most of the existing mechanisms for thread scheduling. Normally a thread is only suspended because of a missing operand (i.e. an empty register), a memory barrier or, since this project, an exception. Adding this behaviour to the chkex instruction adds more complexity and possibilities for race conditions. Future research on the architecture may opt to implement the old mechanism, which should provide a simpler design.

4.3.2 Nested exceptions

In the discussed implementation, there is no difference between a ‘normal’ thread and a handler thread: handlers are simply threads that happen to execute a chkex instruction. Therefore, no special semantics are necessary for exceptions inside a handler thread; these ‘just work’. One could create a chain of threads, each handler having its own handler. This chain will eventually end, however: it either encounters itself, a thread that was already somewhere in the chain, or a completely unrelated thread.

When a thread has an exception and its handler is itself, either nothing could happen (hoping some other thread will detect and resolve it), a special handler could be triggered (although the usefulness of this is to be determined), or the processor could simply reset itself (like triple faults do in x86). The current implementation uses the first option: nothing special happens. It should be noted that such situations are very rare (or at least should be); the implementation therefore assumes they simply do not occur.

The other discussed options (a cycle in a chain of handlers, or a completely unrelated thread which is not a handler) are very hard, if not impossible, to detect. It is probably safe to assume these never occur at all either, since they most likely indicate a bug somewhere in the handler or its appointment process in the software.

4.3.3 Alternatives

With this design, it is not possible to have the handler for a thread on a different core, as discussed in section 3.3.2. When such a feature is required in the future, it may require some modification of the current system. First of all, the core ID should be stored along with the current handler thread ID. A handler thread could be dedicated to a single core, or to any number of cores. Implementing handlers that are triggered from a single remote location requires that the CAM lookup is done over the inter-core network. This makes chkex an asynchronous operation, making it more attractive to change its semantics. Currently, chkex itself causes a thread suspend, but when it becomes an asynchronous operation there is no more benefit to that; its operand register could simply be marked as pending. When a message comes back from the remote core, this register is filled. When the response is that there are no new exceptions, the register is filled once an exception occurs. This strategy thus also requires a slight modification to the Exception Handler, to send its wakeup over the network.

Having a handler that can receive exceptions from any core on the chip would require even more changes: the chkex instruction cannot easily query for unhandled exceptions, since every core could respond. A possible solution would be to have a buffer on the core where the handlers are located. When an exception occurs, the Exception Handler sends it over the network to the handler core, where it either wakes up the handler thread or is placed in the buffer. This may require a buffer per (handler) thread. Both this and the previously discussed approach for off-core handlers have the drawback that more overhead is incurred for every exception and its handling.

Most of the time, the threads in a family share the same code and behaviour. Therefore, a logical choice may be to have a single handler thread per family. This would save some administration space and require less initialization when creating a number of threads: currently, every thread created by a batch operation needs to set its handler thread separately2. This approach was not chosen for reasons of both flexibility and complexity. A family of threads can be created across a number of cores, so this approach requires either that handlers can be located on different cores, or that the handler code is automatically copied to the other core when no such handler is present. Both solutions present a far more complex situation than is currently implemented, without providing much benefit. It would save a few administration records (every family would have a handler thread ID, instead of every thread), but other administration such as per-thread exception information would still be present.

4.4 Thread inspection and modification

The chkex is followed by the actual code of the handler, which inspects, and optionally modifies, the victim thread. The handler finishes by either resuming or killing the victim thread. When a handler does neither, the victim thread stays suspended indefinitely. For the inspection and modification of other threads, two new instructions are introduced: rget and rput. These instructions can respectively retrieve and modify any state of a thread. This includes the currently active exceptions, its program counter, any of its registers, and whether it is suspended. For a full list of these properties, see table 4.2. When a new exception occurs, the first thing the handler will likely do is determine which exception(s) occurred, since a vectored interrupt handler is not used. Note that when an exception is handled, the handler should manually unset the correct bit in the exception mask. The handler can then look at the instruction that caused the exception and its operands.

These state inspection and modification operations are, in order to provide more flexibility, implemented as network operations. This means any rget or rput operation is sent over the NoC. However, when the thread is located on the same core as the handler (as is often the case with the current system), the operation is caught by a loopback interface before it leaves the core. This method costs a single extra cycle for every network access.

Both rget and rput are detected in the Execute stage of the pipeline, where the arguments are unpacked and the remote message for the network is created. Both operations take a core-id

2 Although this could be fixed in different ways, such as being able to define a default handler thread for the create.


Field 0 (Handler TID): Handler thread ID for the thread.
Field 1 (Exception flags): A bitmask of the currently active exceptions.
Field 2 (Program Counter): The current program counter of the thread. In case of an exception, this points to the instruction after the faulting instruction.
Field 3 (Number of registers): Any family can have a different number of registers available.
Field 4 (Number of floating-point registers): Any family can have a different number of floating-point registers available.
Field 5 (Family ID): The ID of the family the thread was created in.
Field 6 (Resume): Writing a nonzero value to this field will resume execution of the thread. Should only be used when the thread was suspended because of an exception. Cannot be read.
Field 7 (Terminate): Writing a nonzero value to this field will kill the thread. Cannot be read.
Field 8 (Break family): Has the same effect as the break instruction, which stops the creation of a family currently in progress. It will not kill any existing threads of the family.
Fields 256+ (Registers): Fields 256 through 256+n each contain one of the n registers available to the thread, with the first register located at 256.
Fields 512+ (FP Registers): Fields 512 through 512+n each contain one of the n floating-point registers available to the thread.
Fields 768+ (Register statuses): Fields 768 through 768+n each contain the status (full or empty) of one of the n registers available to the thread.
Fields 1024+ (FP Register statuses): Fields 1024 through 1024+n each contain the status (full or empty) of one of the n floating-point registers available to the thread.
Fields 1280+ (Status words): Fields 1280 through 1280+n each contain an ISA-specific status word.

Table 4.2: All fields of a thread’s state that can be inspected and/or modified using rget and rput.


Figure 4.4: Diagram showing the interaction over time between the victim and the handler when an exception occurs.

and thread-id as parameters, as well as a field to be inspected or modified (see table 4.2). The rget instruction takes the core and thread IDs in its first argument register, and the field in its second argument. The third argument is the register to which the requested value should be written. The rput takes both IDs and the field in its first register, and the value to be written in the second3. The remote message formed from these parameters flows through all stages of the pipeline as normal, until it reaches the WriteBack stage. Here it is sent to the NoC, where the network component on the other core (or the same core) receives it and forwards the message to the new Thread Inspector component present in every core. This component has access to any other component on the core that it might need in order to fetch or modify state. In the case of an rget operation the result is then sent back over the NoC.

The system discussed so far has a major flaw: security. The two new instructions, rget and rput, make it possible to do anything to any other thread on the chip. This can be abused by malicious software to circumvent any other security measures put in place by the operating system. Therefore, access controls are almost unavoidable before this system can be used in any production environment. When a thread is the handler of another thread, it must obviously have permission to use these operations. In most cases it is also not a problem for a thread to have access to its own state, and it is required in order to allow a thread to modify its handler. These permissions could also be extended to any thread in the same family, or to the parent thread (the thread that created it). It should be noted that the current implementation in MGSim does not have any of these measures in place, since this was outside the scope of this project.

4.4.1 Dealing with control bits

The end of a thread is marked by the presence of a kill bit in the statically added control bits, as discussed in section 2.2.1. The exception system throws these annotations away: the thread is always switched (because it is suspended) and is never killed. A handler thread cannot operate on a thread that no longer exists. A problem arises when the handler resumes the thread at the program counter after the instruction that raised the exception: the end-of-thread kill bit is placed only on the trapping instruction. When the handler ignores this fact and simply continues executing, the thread will continue to execute garbage data, and eventually other functions present in the binary.

3 It is not trivial, in the current architecture, to have an instruction read from its third argument. This register is only intended for writeback, and changing this would require significant modification of the current architecture.

A simple solution would be to ignore the problem, and place the responsibility on the software to prevent this situation: it would then be up to the programmer or compiler to place a nop at the end of the thread whenever the last instruction could trigger an exception. However, this is not a very elegant solution.

Therefore, this problem should not be the responsibility of the software, but of the handler or the architecture. The handler could simply check the control bits, and kill the thread when it detects the kill bit for the victim’s program counter. This solution, while functional, is neither elegant nor efficient: since such logic would always be required in a handler, it would introduce a lot of overhead (reading memory for the control bits). The final solution for this problem is to save the state of the kill bit somewhere. Since this state should be cleared when the program counter is modified by the handler, and all instructions are aligned to multiples of four, this information can be stored in the lowest bit of the program counter. When this bit is still set when a thread is resumed by an rput instruction, the thread is automatically killed.


CHAPTER 5

Experiments

The reference implementation as discussed in chapter 4 has been implemented in the full-system Microgrid simulator, MGSim, in order to test whether it works, to detect any flaws and finally to perform benchmarks on it. First a number of tests were made to see whether the implementation behaved as it should. Then a number of benchmarks were run to see if it performed as expected, and to allow comparison with other architectures. All tests can be found on GitHub1.

5.1 Tests

5.1.1 Multiple handlers and victims

The most basic test to determine whether the system behaves correctly is to spawn a number of handlers and a number of threads that issue the trap instruction. Since in such a limited test scenario all thread IDs are deterministic, they do not have to be looked up at runtime. The test is demonstrated in pseudocode below.

main:
    count1 := 0, count2 := 0, count3 := 0
    spawn thread with function ‘handler‘ with reference to ‘count1‘
    spawn thread with function ‘handler‘ with reference to ‘count2‘
    spawn thread with function ‘handler‘ with reference to ‘count3‘
    spawn 30 threads with function ‘victim‘ on the same core
    wait until all 30 threads finish
    print values of count1, count2 and count3

handler (reference to counter):
    repeat:
        tid := chkex
        counter := counter + 1
        resume tid using rput

victim:
    index := thread index of spawn (i.e. 0..29)
    handler := index % 3 + 1
    set handler to ‘handler‘ using rput
    trap

1 http://github.com/koenk/afstudeerproject

This demonstrates that the system supports multiple handlers and multiple simultaneous exceptions. The calculation inside the victim threads ensures that every handler receives 10 exceptions to handle, with successive exceptions alternating between the handlers. When the actual version of this test2 runs, its output is indeed three times ‘10’, as expected.

5.1.2 Disassembler

To further test the capabilities of the handler, a handler was written which disassembles the faulting instruction, printing the instruction and the contents of all its register operands3. The basis of this disassembler was taken from the modified version of the GNU Binutils4. Furthermore, when a divide-by-zero exception occurs, the destination register is filled with a distinct value that can easily be recognized when printed afterwards.

The code used is given below in pseudocode form:

main:
    spawn thread at function ‘handler‘
    spawn thread at function ‘victim‘
    wait for thread ‘victim‘

handler:
    repeat:
        vtid := chkex
        excp := rget exception flags
        pc := rget program counter
        print vtid, excp, pc
        disassemble at pc and print instruction and registers
        if excp = divide by zero:
            rput value ’42’ to write-operand of instruction at program counter
        resume thread using rput

victim:
    a := 1
    b := 0
    c := a / b
    print "Victim:", a, b, c

This causes an arithmetic error (divide by zero) in the victim thread. When this code is executed, the output is:

HT1 exception 2 (div0) in T2 @ 0x22ec
0x22ec: 00 01 20 4c    divl $1.rv,$0.at,$0.at
reg $1.rv: 1
reg $0.at: 0
reg $0.at: 0
Victim: 1 / 0 = 42

This indicates that the handler thread with thread ID 1 received an exception from thread 2, with exception flags 2, which is the bit for ‘divide by 0’ errors. The program counter at which this exception occurred was 0x22ec. This is passed to the disassembler, which prints a divl instruction with its operands. The contents of these registers are then printed. Note that the second and third operand are the same register, indicating the instruction overwrites the denominator with the result of the expression. The handler finally overwrites the output register with the value 42, after which it resumes the execution of the victim thread. When the result of the division is printed, it is easy to see the handler worked, since 42 is printed.

2 test1.c
3 dis.c
4 https://github.com/svp-dev/binutils

5.2 Benchmarks

All Microgrid benchmarks were performed on the latest version of the MGSim fork for this project (commit 1613b436a246cc050a236bf4d06f48d2ab6c600c). All other benchmarks were performed on Linux 3.9.4-1-ARCH x86_64 on an Intel i7-2600K running at 4.5GHz, and on Mac OS X 10.8.3 (Darwin 12.3.0) on an Intel i5-3427U at 1.8GHz.

5.2.1 Empty handler

The most basic benchmark determines how many cycles it takes to return to the victim thread when a trap occurs. The handler is in this case as empty as possible. In pseudocode, the setup looks as follows:

handler:
    repeat:
        vtid := chkex
        resume thread using rput

victim:
    repeat n times:
        t1 := clock()
        trap
        t2 := clock()
        print t2 - t1

The clock function used here returns the current processor cycle. The results for this test are compared to an empty Unix signal handler on Linux. Note that the Unix signal handler has to traverse more software logic before control is handed to the handler and back. Assuming most of that time is spent on context switches, their cost can be estimated from the results: an empty handler can cause up to six context switches, whereas a SIG_IGN handler can cause up to four. The same tests were also run on a Raspberry Pi, running Linux on an ARMv6 700MHz processor, to examine the differences between x86-64 and ARM. It should finally be noted that Darwin uses more layers of abstraction: Unix signals are implemented on top of Mach’s exception system. The results can be found in table 5.1.

                             minimum   average   maximum
Linux, empty handler            2130      2182     18883
Linux, SIG_IGN                   305       320     25342
Linux, ARM, empty handler          -     14112         -
Linux, ARM, SIG_IGN                -      2153         -
Darwin, empty handler          16256     34393   5847538
Darwin, SIG_IGN                10095     20406   4854794
Microgrid                         23        23        23

Table 5.1: Number of cycles spent between causing a trap and returning to the victim thread, using an empty handler.


5.2.2 Performance impact of a handler interleaved with other threads

To see how much overhead an exception generates, we interleave a normal algorithm with a continuous stream of exceptions from a separate thread. The algorithm used is an approximation of the Mandelbrot fractal. This algorithm is fairly calculation-heavy (instead of waiting on memory operations), and thus keeps the pipeline busy most of the time. It can also be parallelized quite well, with a thread per row of the image. When there are n threads working, the resolution of the Mandelbrot approximation is n*n pixels. The instructions per cycle (IPC) are measured for this test, which is a reasonable measure of how busy a core is. Ideally the IPC would be 1, meaning an instruction completes every cycle; every stall causes the IPC to drop.

The test is done once with a thread continuously executing nop instructions, and once with the same thread continuously executing trap instructions. All threads run on the same core, and the maximum number of simultaneous threads per core is 10. If absolutely no interleaving could occur, one would expect the Mandelbrot calculation to take approximately twice as long. Similarly, one would expect the IPC to drop drastically if the handler activation requires a lot of time during which nothing actually happens (i.e. no instructions are executed). The source code can be found under the name benchmark_empty_interleaved.c, and the results are displayed in table 5.2.

                    IPC
Threads   no traps     traps   Slowdown
1            0.298     0.291     -1.64%
2            0.371     0.350     -5.59%
5            0.470     0.463      1.14%
10           0.727     0.730      3.84%
20           0.905     0.904      4.32%
50           0.982     0.983      4.46%
100          0.993     0.998      4.46%

Table 5.2: Instructions-per-cycle and slowdown for different resolutions of a Mandelbrot fractal approximation, with and without simultaneously having a thread generating traps.

5.2.3 Instruction emulation

One of the major benefits of an exception system is that it allows for instruction emulation: when an instruction is encountered that is not implemented in the architecture, an exception is raised, and the handler can then implement the workings of this instruction in software. This allows mechanisms to be tested before they are actually implemented, or legacy code to remain compatible when the architecture changes. However, when a single instruction is replaced by a piece of software, as little overhead as possible is preferred in order to still get reasonable performance.

For this experiment the fabs instruction was implemented in software5. This instruction, which returns the absolute value of its floating-point operand, is simple to implement in software. The handler first has to check that the exception is an illegal-instruction exception, and that the instruction is fabs. Then the handler extracts the input and output registers from the opcode, and fetches the input value using rget. The uppermost bit is switched off if present, and the value is finally written to the output register using rput. The results of this test are shown in table 5.3.

            minimum   average   maximum
Microgrid        99        99       218

Table 5.3: Cycles the fabs operation took using instruction emulation.

5 instr_emu_fabs.c


CHAPTER 6

Conclusions

Classical trap handling systems gain many new possibilities when a hardware multithreaded architecture is used. These solutions can reduce the overhead of switching to handler code significantly. Depending on the details of the architecture, the desired level of control and other specifics, there are multiple solutions. One such implementation was worked out in more detail and tested in the Microgrid architecture. As demonstrated by the experiments in chapter 5, it both works as expected and performs very well in comparison with other modern architectures, which require hundreds of cycles per context switch. In the case of a single-threaded application, the overhead of the exception handling mechanism is 24 cycles. Moreover, when more threads are present on the core, most of these cycles can be filled with useful instructions from other threads due to interleaving. The instruction emulation benchmark provides a good comparison: the detection, handling and resuming of an illegal instruction took less than 100 cycles, whereas the minimum cost of a single context switch on x86 and ARM is more than 150 cycles.

The chosen implementation was able to handle all requirements thanks to the flexibility of its mechanisms, while still outperforming most other architectures because of the free context switching. The system is relatively simple, both in its implementation and in the interface for the user. It is also easy to modify to support remote handlers, exception masking and a different level of handler scope (e.g. per family), as discussed in section 4.3.3. It should therefore also form a good basis for future research and further improvements.

6.1 Limitations and future work

The most significant problem present in the current implementation is that exceptions from outside a core are not possible. While it should be trivial to modify the NoC to support the delivery of exceptions, more research is needed to decide where such an exception should be delivered in the first place.

It may also be desirable to do further research into simplifying the current system in place for I/O, to reduce the complexity of the architecture overall.

Finally, the current system can easily be modified or extended when more specific needs arise in later research. This may range from relatively simple modifications such as remote handlers and handler-specific cores, to mechanisms where a handler runs concurrently with the original thread, without suspension. Such mechanisms would require far more in-depth research, both into the complexities that may arise in the architecture and into the semantics for the user. This research has laid out a number of options, and tested a single approach more thoroughly in order to provide a (working) basis for future research into more complex and problem-specific optimizations and extensions.



APPENDIX A

Usage

The discussed features are currently only available as assembly instructions; no SL constructs are defined to access any of the exception-related mechanisms. Nevertheless, it is still possible to use these features in SL programs through inline assembly, using the asm statement. While not the best solution, it allows the features to be tested. Most test programs need these instructions often, and therefore a file was created containing a set of common definitions and helpers. The contents of this file are displayed below.

// Exception flags
enum
{
    EXCP_NONE                 = 0,
    EXCP_BREAKPOINT           = (1u << 0),
    EXCP_ARITH_DIVIDE         = (1u << 1),
    EXCP_ARITH_OVERFLOW       = (1u << 2),
    EXCP_INVALID_OPCODE       = (1u << 3),
    EXCP_ICACHE_READ          = (1u << 4),
    EXCP_INVALID_REGISTER     = (1u << 5),
    EXCP_UNALIGNED_MEM_ACCESS = (1u << 6),
    EXCP_INVALID_MEM_ADDR     = (1u << 7),
    EXCP_MEM_ACCESS_VIOLATION = (1u << 8),
    EXCP_TLB_MISS             = (1u << 9),
    EXCP_FPU                  = (1u << 10),
};

// RGET/RPUT fields
enum Field
{
    F_HTID                 = 0,
    F_EXCP                 = 1,
    F_PC                   = 2,
    F_NUM_REGS             = 3,
    F_NUM_FP_REGS          = 4,
    F_FID                  = 5,
    F_RESUME               = 6,
    F_TERMINATED           = 7,
    F_FAMILY_BREAK         = 8,
    F_REGISTERS            = 0x100,
    F_FP_REGISTERS         = 0x200,
    F_REGISTER_STATUSES    = 0x300,
    F_FP_REGISTER_STATUSES = 0x400,
    F_STATUS_WORDS         = 0x500,
};

#define INSTRUCTION_SIZE 4

// Easy interface for rput, abstracting the merging of multiple fields to a
// single register.
inline void rput(int vtid, int field, int value)
{
    int t = ((vtid) << 16) | (field);
    unsigned long long v = (unsigned long long)value;
    asm volatile ("rput %0, %1"
                  :
                  : "r"(t), "r"(v));
}

Using this file, a simple demo of using exceptions in SL is given below. Because of the controlled environment of this test, thread IDs have been hardcoded: the main function runs in thread 0, and the handler and victim receive thread IDs 1 and 2 respectively.

#include <stdio.h>
#include "common.h"

sl_def(handler, void)
{
    int vtid, excp, pc;
    while (1)
    {
        // Block until exception
        asm volatile ("chkex %0"
                      : "=r"(vtid));
        // Exception!
        // Check what exception(s) and at what instruction (program counter)
        asm volatile ("rget %2, %3, %0;\n"
                      "rget %2, %4, %1"
                      : "=r"(excp), "=r"(pc)
                      : "r"(vtid), "I"(F_EXCP), "I"(F_PC));
        // Mark any exceptions as handled
        rput(vtid, F_EXCP, EXCP_NONE);
        // The pc currently contains the address of the instruction *after* the
        // instruction causing the exception.
        pc -= INSTRUCTION_SIZE;
        printf("Handler: exception %d in T%d @ %#x\n", excp, vtid, pc);
        // Resume thread.
        rput(vtid, F_RESUME, 1);
    }
}
sl_enddef

sl_def(victim, void)
{
    printf("New victim\n");
    asm volatile ("trap");
    printf("End of victim\n");
}
sl_enddef

sl_def(t_main, void)
{
    sl_create(,,,,,,,handler);
    sl_detach();
    sl_create(,,,,,,,victim);
    sl_sync();
}
sl_enddef
