ARCHITECTURAL SUPPORT FOR
USER-LEVEL INPUT/OUTPUT
by
Lambert Schaelicke
A dissertation submitted to the faculty of The University of Utah
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Computer Science
School of Computing
The University of Utah
December 2001
Copyright © Lambert Schaelicke 2001
All Rights Reserved
THE UNIVERSITY OF UTAH GRADUATE SCHOOL
SUPERVISORY COMMITTEE APPROVAL
of a dissertation submitted by
Lambert Schaelicke
This dissertation has been read by each member of the following supervisory committee and by majority vote has been found to be satisfactory.
Alan L. Davis
Erik L. Brunvand
John B. Carter
Sally A. McKee
Ulrich Brüning
Chair:
THE UNIVERSITY OF UTAH GRADUATE SCHOOL
FINAL READING APPROVAL
To the Graduate Council of the University of Utah:
I have read the dissertation of Lambert Schaelicke in its final form and have found that (1) its format, citations, and bibliographic style are consistent and acceptable; (2) its illustrative materials including figures, tables, and charts are in place; and (3) the final manuscript is satisfactory to the supervisory committee and is ready for submission to The Graduate School.
Alan L. Davis, Chair, Supervisory Committee
Date
Approved for the Major Department
Thomas C. Henderson, Director
Approved for the Graduate Council
David S. Chapman, Dean of The Graduate School
ABSTRACT
The performance of the input/output subsystem is becoming increasingly important for many applications. Commercial I/O intensive applications are a fast growing market segment and experience constantly increasing performance demands. Many of these applications exploit concurrency to overlap the latency of I/O operations to improve throughput. At the same time, semiconductor technology trends result in a growing gap between application and operating system performance. Consequently, operating system overhead increasingly limits the efficiency of latency-hiding techniques to improve throughput. This dissertation develops and evaluates a novel I/O architecture that, by providing user-level access to the I/O subsystem, minimizes I/O overhead while maintaining the level of protection and programming flexibility of conventional kernel-based architectures. Inexpensive hardware mechanisms in the I/O device and host processor implement protected user-level request initiation, user-space data transfers, and user-level notifications. Together, these mechanisms are able to reduce I/O overhead by up to two orders of magnitude. As a result, applications are able to efficiently overlap long-latency I/O operations to maximize throughput and to exploit the scalable bandwidth of next-generation distributed I/O architectures. The flexibility of the basic mechanisms facilitates library implementations of a variety of standard I/O programming models with low overhead, as the architecture does not restrict the allocation and use of I/O buffers.
A prototype of the user-level I/O architecture is implemented and evaluated in an execution-driven system simulator. The simulation system combines detailed models of a modern microprocessor and caches, which are based on an existing simulator, a memory controller and I/O devices, with a UNIX-compatible operating system. Validation of the simulator against a real workstation shows that the tool accurately captures the performance characteristics of existing computer systems. Synthetic benchmarks demonstrate that the user-level I/O architecture achieves twice the aggregate bandwidth on 23 request streams compared to kernel-based I/O, while at the same time reducing CPU occupancy by 98 percent. The MySQL database server is able to improve throughput by up to 25 percent, without requiring any program modifications.
CONTENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENTS
1. INTRODUCTION
   1.1 Organization
   1.2 I/O Intensive Applications
   1.3 Operating System Performance Trends
   1.4 Low-Overhead I/O
   1.5 Contributions
2. MODERN I/O ARCHITECTURE OVERVIEW
   2.1 Kernel-based I/O
   2.2 File I/O Overhead Characterization
   2.3 Network Attached Disks
   2.4 InfiniBand
   2.5 User-level Network Architectures
   2.6 Virtual I/O Devices
   2.7 Operating System Support for High-Performance I/O
   2.8 Summary
3. THE L-RSIM ARCHITECTURAL SIMULATOR
   3.1 Simulator Machine Model
   3.2 LAMIX Kernel
   3.3 Simulator Validation
   3.4 Summary
4. USER-LEVEL I/O ARCHITECTURE
   4.1 UIO Architecture Overview
   4.2 Application Interface
   4.3 Basic UIO Operation
   4.4 UIO Device Architecture
   4.5 Summary
5. ATOMIC DEVICE ACCESS
   5.1 The Conditional Store Buffer
   5.2 Conditional Store Buffer at the I/O Device
   5.3 Performance Evaluation
   5.4 Summary
6. DIRECT USER-SPACE TRANSFER
   6.1 Device TLB Design
   6.2 TLB Misses and Faults
   6.3 TLB Coherence and Consistency
   6.4 TLB Miss Handling with Kernel Interrupts
   6.5 TLB Miss Handling with a Programmable TLB Fill Engine
   6.6 Device TLB Performance Evaluation
   6.7 Summary
7. USER-LEVEL NOTIFICATION
   7.1 Lightweight Notification Mechanism
   7.2 Processor Notification Queue
   7.3 Multiprocessor Considerations
   7.4 Performance Evaluation
   7.5 Summary
8. PERFORMANCE EVALUATION
   8.1 Prototype Architecture
   8.2 UIO Bandwidth Scaling
   8.3 Application Throughput Scaling
   8.4 Summary
9. CONCLUSIONS
   9.1 User-level I/O Architecture
   9.2 Limitations and Future Work
REFERENCES
LIST OF TABLES
1. Experimental Platforms
2. Read System Call Overhead with Disk Access
3. Cache Misses
4. Simulator Configuration
5. LMBench Average Latency Ratios
6. LMBench Average Bandwidth Ratios
7. System Call Latencies in Microseconds
8. File System Performance in Creations/Deletions per Second
9. SPEC 2000 Runtime
10. UIO Request Structure
11. System Configurations
12. Kernel TLB Miss Handling Performance
13. Table Walk Engine Instruction Set Summary
14. Table Walk Engine Area Requirements
15. PowerPC Page Table Lookup Latencies
LIST OF FIGURES
1 Read System Call with Disk Access
2 Measuring Disk Read Overhead
3 Structure of Interrupt Overhead Experiments
4 I/O Bandwidth Scaling
5 MySQL Database Throughput Scaling
6 Network-attached Secure Disk Architecture
7 InfiniBand Distributed I/O Architecture
8 L-RSIM Machine Architecture
9 LMBench Memory Load Latency for 256-byte Stride
10 LMBench Memory Read Bandwidth
11 LMBench Disk Seek Time and Read Bandwidth
12 User-level I/O Architecture
13 User-level I/O Architecture Components
14 UIO Device Structure
15 Architectural Model with Conditional Store Buffer
16 Conditional Store Buffer
17 Conditional Store Buffer Implementation
18 Request Overhead
19 Context Switch Latency
20 Device TLB Design
21 TLB Refill Engine Architecture
22 32-bit PowerPC Page Table Lookup Overview
23 32-bit PowerPC Page Table Lookup Code
24 32-bit MIPS Page Table Lookup Overview
25 32-bit MIPS Page Table Lookup Code
26 IA-32 Page Table Lookup Overview
27 IA-32 Page Table Lookup Code
28 TLB Miss Handler Latency
29 TLB Miss Handler CPU Utilization
30 Effective DMA Bandwidth
31 Notification Handling
32 Queued Notification Handling
33 Notification Latencies
34 File Descriptors in a User-level I/O Library
35 I/O Bandwidth Scaling Comparison
36 I/O Bandwidth Scaling Comparison with Network Effects
37 MySQL Database Throughput Scaling
ACKNOWLEDGMENTS
This dissertation would not have been possible without the support and encouragement of many individuals, only some of whom can be mentioned here. My advisor Al Davis was always available for technical discussions and knew when to give me the freedom I needed to find fulfillment in this work. My supervisory committee members John Carter, Sally McKee, Erik Brunvand and Ulrich Brüning frequently gave advice on technical problems as well as helped improve the presentation of this dissertation. In addition, John Carter supported me through a research assistant position over several years. Through his genuine interest in this work despite the geographical distance, my external committee member Ulrich Brüning made my effort even more meaningful.
Frequent discussions with my fellow student Mike Parker helped solve many technical challenges and understand unexpected results. Seemingly unsolvable problems disappeared in the presence of his listening ears. Uros Prestor and Mike Hibler were instrumental in my understanding of modern operating system structure. The fruitful discussions with many of my fellow students helped my own understanding of the problem domain.
My family provided the support needed to complete a project of this scale. My wife Rita tirelessly encouraged me when I needed it and never made me feel guilty for spending long hours at the lab. My son Indri, although too young to know, gave me reasons to smile at the end of many long days.
This work was supported in part by the Defense Advanced Research Projects Agency under agreement numbers N0003995C0018 and F306029810101, by an NSF Research Infrastructure Grant Number CDA9623614 and by a Graduate Research Fellowship from the University of Utah Graduate School for the academic year 2000/2001.
1. INTRODUCTION
The efficiency of the input/output (I/O) subsystem plays a large role in the overall performance of many applications, and its relative importance is increasing. I/O intensive commercial applications are a significant and fast growing market for computer systems. Growing data volumes and users' performance demands lead to increasing pressure on the I/O subsystem. Responding to the importance of high I/O performance, next-generation I/O architectures are based on a distributed I/O device organization. A system-area network replaces the conventional I/O bus to connect multiple clients with a number of distributed I/O devices, providing better bandwidth scalability and greater expandability. At the same time, semiconductor technology trends result in a growing gap between application and operating system performance. The low cache and TLB locality of operating system code, together with low degrees of available instruction-level parallelism, makes operating system code mostly memory performance bound. As a result, many I/O intensive applications spend increasing amounts of time executing operating system code. This trend severely limits applications' ability to overlap long-latency I/O operations with independent work to improve throughput.
This dissertation introduces and evaluates a novel I/O architecture that minimizes I/O overhead by bypassing the operating system for I/O requests in the context of a distributed I/O architecture. It is based on innovative hardware features located in the host processor and the system-area network interface that allow user-level processes to directly access the I/O subsystem. As a result, I/O overhead is reduced by a factor of one hundred, allowing applications to more efficiently apply latency hiding techniques to improve I/O performance and resulting in better throughput scalability. Bypassing the operating system for I/O operations means that even in the face of the widening gap between memory and processor performance, I/O intensive applications are able to achieve high performance through concurrent I/O.
1.1 Organization
This dissertation is organized in nine chapters. The following sections discuss trends in I/O performance requirements and operating system behavior and then give an overview of the goals and mechanisms of the proposed user-level I/O architecture. The next chapter discusses the contemporary I/O architecture commonly found in workstations and servers, describes the sources of operating-system induced I/O overhead, presents methodologies to quantify the overhead, measures its impact on system performance, and discusses a variety of optimizations previously developed. Chapter 3 describes the functionality of the simulation system used for this work and presents results of its validation against a real workstation. Chapter 4 gives a detailed overview of the user-level I/O architecture, and Chapters 5 through 7 discuss and evaluate the individual hardware and software mechanisms. Chapter 8 expands on the microbenchmark measurements of the preceding chapters and presents performance results for a synthetic I/O intensive workload and a realistic database application. Finally, Chapter 9 summarizes the results, draws conclusions and explores areas of possible future work.
1.2 I/O Intensive Applications
Commercial applications such as databases, email and Web servers make up the largest and fastest growing market segment for multiprocessor computer systems. Many of these workloads are considered I/O intensive, because of the amount of data they process and because of users' performance demands. Most commercial applications operate on large data sets residing on secondary or tertiary storage, while communicating with distributed client systems over networks. Work environments are becoming more collaborative, thus placing higher performance demands on file servers [107], email and news servers [97]. The increased file server demands stem from a growth in data volume that outpaces the servers' caching capability, and from increasing request rates from more users. Although email and news server performance for the individual end user is usually not considered critical, the amount of data and number of files passing through the server can lead to significant I/O throughput demands. Distributed systems such as Sequoia [79] allow researchers to collaborate across geographically distributed laboratories. Data from satellite observations or scientific simulations can take on enormous volume, and exchanging such data between machines requires significant I/O bandwidth.
With the constant growth of the World Wide Web, Web servers have become a significant market. Web sites are also becoming larger and more complex, leading to increasing demands on web server performance [76] due to the larger data sets and higher request rates from more clients. In many cases the server performs no or only minimal operations on the data, thus emphasizing the importance of the I/O subsystem for overall performance. However, the increasing popularity of dynamically created web content leads to higher CPU performance requirements for web servers as well.
Multimedia applications such as video conferencing or video-on-demand not only tax the performance of current I/O subsystems but also require entirely new services such as guaranteed bandwidth and limited latency variation of I/O data streams. These requirements are only partly met by current I/O architectures.
Although databases traditionally emphasize reliability and availability over performance, the dramatic growth of commercial transactions performed over networks leads to higher performance requirements for database engines as well as to a growing market for commercial server systems. Databases, like Web and file servers, achieve high I/O throughput by overlapping I/O requests of independent transactions [9][85]. Commercial database servers usually run on shared-memory or clustered multiprocessor systems with a large number of disks. Hundreds of disks are required not only to store the database tables but also to provide sufficient parallelism and redundancy in the storage system to allow efficient overlap of requests to hide access latencies.
1.3 Operating System Performance Trends
At the same time as applications increase the demands on the I/O subsystem in terms of performance and services provided, semiconductor technology trends lead to I/O performance being more and more limited by operating system overheads. Because in most general-purpose systems the operating system is involved in all I/O operations, its performance directly affects I/O performance. Although the performance of the central processing unit (CPU) is generally improving at a faster rate [44] than I/O device performance [57], operating systems do not benefit from this trend as much as applications [74][88]. Because operating system code is executed relatively infrequently, it incurs many cache and translation lookaside buffer (TLB) misses and is hence more limited by main memory performance. OS code also does not benefit from dynamic instruction scheduling as much as application software because operating system code does not exhibit large amounts of instruction-level parallelism and frequently executes privileged instructions that serialize the superscalar processor pipeline.
I/O related operating system overhead can be categorized as either control overhead or data transfer overhead. Control overhead is a result of context switches and the crossing of protection domains. When an application issues an I/O request to the kernel, it performs a system call that saves process state and performs various protection checks as part of the user-to-kernel mode transition. Handling the I/O request may involve multiple kernel processes, in which case additional context switches and scheduling operations are performed. Finally, I/O completions are traditionally signaled to the host processor via interrupts, which incur additional context switches and disrupt the currently executing application. Data transfer related overhead is incurred when the kernel provides intermediate buffering for I/O data. In this case data are copied between application and kernel buffers, which consumes valuable processor cycles and negatively affects the application's cache and TLB behavior.
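The per-request costs just described can be observed directly from user space. The sketch below is my own illustration, not the dissertation's actual measurement harness: it times repeated read() system calls against a file that is warm in the buffer cache, so the measured time is dominated by control overhead and the kernel-to-user copy rather than by disk latency.

```python
import os
import tempfile
import time

def read_overhead(block_size, iterations=10000):
    """Average cost of one read() call on a buffer-cache-resident file."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(block_size))
        path = f.name
    fd = os.open(path, os.O_RDONLY)
    try:
        os.read(fd, block_size)              # warm the buffer cache
        start = time.perf_counter()
        for _ in range(iterations):
            os.lseek(fd, 0, os.SEEK_SET)     # rewind: every read hits the cache
            os.read(fd, block_size)          # syscall + kernel-to-user copy
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return elapsed / iterations              # seconds per call

# The fixed control cost dominates for small requests; the copy cost
# grows with the block size.
for size in (64, 4096, 65536):
    print(f"{size:6d} bytes: {read_overhead(size) * 1e6:.2f} us per read()")
```

Plotting the per-call time against the block size separates the two components: the intercept approximates the fixed control overhead, while the slope reflects the per-byte copy cost.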
Since both I/O device performance and I/O related operating system overhead improve at a slower rate than application performance, I/O overhead may become the dominating factor for many I/O intensive applications in the future. For instance, commercial database systems have until now been able to hide I/O latency almost completely. This is made evident by the fact that databases are now often memory performance limited [9]. High-performance microprocessors have so far allowed applications to offset the growing gap between I/O and processor performance through concurrency and latency hiding. Modern database systems spend between 10 and 30 percent of the total CPU cycles in operating system code [9]. Given the growing discrepancy between processor and memory performance, and hence between application and operating system performance, operating system overhead increasingly limits the ability of applications to exploit parallelism in I/O requests, which results in decreased throughput.
1.4 Low-Overhead I/O
The ability to hide any kind of latency depends on the availability of sufficient concurrency so that independent work can be performed during long-latency operations. Many server-class applications are able to exploit concurrency among requests to overlap I/O operations. However, the CPU overhead incurred by each long-latency operation limits the throughput improvements achievable by latency hiding, since overhead constitutes work that cannot be overlapped and that is executed sequentially. For example, a 10 percent overhead to schedule and complete a disk request means that a single processor is able to overlap at most 10 disk requests, at which point no CPU cycles are available for application processing. Although both the I/O latency and overhead vary widely between individual requests, different I/O devices, and request types, the overall effect is that overhead limits the scalability of I/O latency-hiding techniques.
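The arithmetic behind this ceiling can be made explicit (the notation here is mine, not the dissertation's):

```latex
% Let L be the latency of one I/O request and o the CPU overhead it incurs,
% so that f = o/L is the overhead as a fraction of the request latency.
% N overlapped requests keep the processor busy for a fraction N*o/L of the
% time; the processor saturates when this fraction reaches one:
N_{\max} \cdot \frac{o}{L} = 1
\quad\Longrightarrow\quad
N_{\max} = \frac{L}{o} = \frac{1}{f}.
% With f = 0.1 (the 10 percent overhead of the example in the text),
% N_max = 10 requests; halving the overhead doubles the achievable overlap.
```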
This work presents and evaluates an I/O architecture that minimizes I/O overhead by bypassing the operating system for performance-critical I/O operations. The user-level input/output (UIO) architecture addresses all sources of I/O overhead, from request initiation to data transfers to completion notifications, through a combination of novel hardware and software mechanisms. These mechanisms are inexpensive in terms of required chip area and do not impact the cycle time of the microprocessor or the I/O device. A stateless communication protocol is used to avoid scalability limitations in the I/O device. Extensions to the microprocessor are restricted to the bus interface where they do not impact the critical path of the execution core.
A software-controlled combining buffer in the processor bus interface enables applications to issue requests directly to the I/O subsystem, thus avoiding the cost of system calls and context switches. Data are transferred directly to and from application buffers with the help of an I/O device TLB to make optimal use of the available system bus bandwidth and to minimize CPU occupancy. A lightweight interrupt handler processes completion notifications from a notification queue in the processor bus interface and signals them to the user process, allowing applications to implement optimized low-overhead notification handling without expensive context switches. The overhead of an I/O request in this architecture is at least two orders of magnitude lower than that of a kernel-based I/O operation, and unlike for a kernel-mediated request, it is independent of the request size. As a result, applications are able to more efficiently overlap I/O operations with independent work, leading to higher and more scalable system throughput. The throughput improvement can be realized directly by the application through latency hiding techniques, or indirectly by independent processes that benefit from the increased CPU idle time during I/O operations. In many cases applications can realize performance improvements without extensive software restructuring as the user-level I/O architecture does not restrict the use or management of I/O buffers as previous work has often done [24][27][29][62][75][99].
The prototype implementation of this architecture is able to improve the aggregate bandwidth of 23 disk I/O streams of a distributed storage architecture by a factor of two while reducing CPU occupancy by almost a factor of 100. The significantly reduced overhead enables a database server to improve throughput for 15 requests by over 25 percent, without requiring changes to the program structure. The fact that the operating system is not involved in I/O operations means that the performance advantages of the UIO architecture can be retained in the presence of the widening processor/memory performance gap.
Previous work to minimize operating system overhead for improved I/O performance has often required changes of the I/O programming interface or has led to complex and nonscalable hardware implementations. The proposed user-level I/O architecture exports a flexible low-overhead nonblocking programming interface to applications, on top of which user-level libraries can implement a wide variety of application programming interfaces. These interfaces can range from a standard UNIX interface to user-level multithreading that implements conventional blocking I/O calls on a per-thread basis. The ability to implement these libraries with little overhead allows applications to realize performance improvements without modifications to the application structure. Alternatively, application programmers can directly access the low-level primitives to implement specialized synchronization schemes.
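As a concrete illustration of how a blocking call can be layered over a nonblocking request/notification interface, the following sketch emulates the idea in ordinary Python. The class and method names are invented for illustration only, not the dissertation's actual interface, and the "device" is played by a thread rather than real hardware.

```python
import queue
import threading

class EmulatedUIODevice:
    """Toy stand-in for a UIO-style device: accepts nonblocking requests and
    posts completion notifications to a user-visible queue."""
    def __init__(self, storage):
        self.storage = storage
        self.completions = queue.Queue()   # stands in for the notification queue

    def submit_read(self, request_id, offset, length):
        # A real UIO device would receive this request via a protected
        # user-level store; here a thread plays the device's role.
        def work():
            data = self.storage[offset:offset + length]
            self.completions.put((request_id, data))
        threading.Thread(target=work).start()

def blocking_read(device, offset, length):
    """User-level library layer: issue a nonblocking request, then wait for
    its completion notification -- conventional read() semantics without a
    kernel call on the request path."""
    device.submit_read(request_id=1, offset=offset, length=length)
    _rid, data = device.completions.get()  # block on the notification queue
    return data

dev = EmulatedUIODevice(b"hello, user-level I/O")
print(blocking_read(dev, 7, 14))           # b'user-level I/O'
```

A user-level threading package would instead switch to another thread while the notification is pending, preserving the calling thread's blocking semantics without idling the processor.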
A prototype of the user-level I/O architecture is implemented in an execution-driven simulator that combines detailed hardware models with a fully functional operating system. Execution-driven simulation is a commonly used tool to investigate the performance implications of novel hardware features on realistic workloads. The simulator used in this study accurately models hardware components of a workstation such as the dynamically scheduled processor with caches, a memory controller and I/O devices. On top of the simulated hardware runs a UNIX compatible multitasking operating system that includes a complete filesystem with buffer cache and device drivers. The simulation environment is extended with models of the proposed hardware features and software mechanisms to evaluate the feasibility and performance impact of the UIO architecture. Alternative implementations of individual mechanisms are evaluated using microbenchmarks and fine-grain measurements of latency and overhead. A synthetic I/O-intensive benchmark is used to demonstrate the impact of I/O overhead on system throughput, and to separate I/O overhead effects from application-specific behavior. At the same time, a database server is simulated to demonstrate how a real I/O intensive application is able to realize performance improvements without any modifications.
Implementing models of new hardware in an architectural simulator allows researchers to investigate performance implications of these mechanisms, to explore alternatives and to investigate the interaction with the rest of the system. However, this approach does not indicate the cost of such mechanisms in terms of hardware complexity, critical timing paths and required chip area. To this end, this study presents hardware implementations of the proposed mechanisms that show that even moderately optimized designs do not negatively affect the cycle time and that the area requirement is small.
1.5 Contributions
This dissertation introduces novel hardware and software mechanisms that give applications low-overhead access to I/O devices without operating system involvement, while maintaining the level of protection and most of the semantics of traditional kernel-based I/O with copy semantics. To evaluate the performance benefit of the new architecture, a detailed execution-driven system simulator is developed and validated. Various implementations of the proposed mechanisms are evaluated and compared with respect to overhead, hardware complexity and ease of integration into existing system organizations. The impact of the overhead reductions realized by these mechanisms is experimentally quantified using a synthetic benchmark and a database server.
In addition, this dissertation presents portable methodologies to quantify various aspects of I/O overhead in current systems and applies these methodologies to two different hardware platforms.
To facilitate the performance evaluation of the prototype user-level I/O architecture, this work introduces an architectural system-level simulator that combines detailed models of a microprocessor, caches, memory controller and I/O devices with a realistic UNIX compatible operating system. The simulation system is an extension of the RSIM architectural simulator. The validation of the simulator against an existing workstation shows that it is able to closely approximate the performance of real computer systems.
2. MODERN I/O ARCHITECTURE OVERVIEW
The term I/O architecture refers to the organization of I/O devices, host processors and main memory, and the methods by which these components exchange control information and data. Traditionally, most I/O architectures are based on a shared I/O bus, with memory-mapped device access mediated by the operating system [30]. Although relatively simple and cost effective, bus-based architectures are inherently limited in terms of scalability and bandwidth. In addition to these hardware limitations, the software overhead associated with I/O operations can have a large impact on I/O performance. This chapter discusses current and next-generation I/O architectures and shows how existing optimizations relate to the proposed user-level I/O architecture. The following two sections identify and measure the sources of operating system overhead in current systems, and show how overhead affects application throughput scalability. Sections 2.3 and 2.4 describe details of next-generation distributed I/O architectures that eliminate the bandwidth bottleneck of the I/O bus and replace it with a scalable network. Sections 2.5 and 2.6 discuss several approaches to reduce operating system overhead in the context of user-level communication architectures. Finally, Section 2.7 gives an overview of operating system optimizations that reduce the cost of data copy operations and control overhead.
2.1 Kernel-based I/O
Two main tasks of an operating system are managing hardware resources and providing a convenient and hardware-independent layer of abstraction. Resource management relates to fair and secure sharing of hardware resources among processes. To share hardware resources, the operating system multiplexes requests from competing processes onto the actual hardware. The operating system can assign hardware resources directly to an application and revoke them forcefully when they are needed by another process. For instance, applications execute directly on the microprocessor. Timer interrupts are used to preempt the current process and assign the CPU to another process. Main memory is logically divided into pages which are assigned to different processes. The operating system can remove a physical page from one process and assign it to another process. In both cases, applications have direct access to the resource once it has been assigned. Protection checks and resource assignment and revocation are supported by hardware features in the microprocessor.
In the case of input/output, the operating system multiplexes requests from multiple applications to the same device by performing them on behalf of the applications. This design allows the operating system to perform protection and integrity checks on requests, and to implement scheduling strategies to maximize I/O device utilization. For instance, disk drivers often reorder requests to minimize head seek times, and delay writes to give priority to read requests. Involving the operating system in every I/O transaction is necessary partly because processors do not provide hardware support for direct application access to the I/O subsystem as is the case with virtual memory. Memory accesses performed by applications are transparently checked and translated by the processor TLB. The operating system is involved in memory accesses only in exceptional cases such as access violations or page faults. I/O device accesses, on the other hand, do not enjoy similar hardware support in the processor, because such accesses consist of complex sequences of load and store operations.
Another service the operating system provides is abstraction. It bridges the semantic gap between high-level I/O requests and the capabilities of the I/O device hardware. For instance, the operating system translates an application's access to a file segment to control register reads and writes that trigger a disk access to individual blocks, and it may cache disk blocks in memory for faster access. In the case of network I/O, the operating system may implement reliable in-order delivery of packets on top of an unreliable network. A side effect of this design is that the OS is able to hide the details of many different hardware implementations under a common standard interface. Often, this standard interface also hides the variable latency of requests from applications by blocking processes until the desired I/O operation completes.
The standard I/O interface in UNIX and many other operating systems specifies copy semantics. Applications are not able to observe incorrect or partial data in an input buffer; in other words, the input routine does not return until the input operation has completed successfully. Given that in the case of network I/O, input data may arrive before an application posts a request, buffering input data in intermediate kernel buffers and copying it upon request is necessary. For output operations, it is guaranteed that the output buffer can be modified after the output routine returns without affecting the result of the output operation. If the system chooses to delay the output operation, intermediate buffering is needed. Copying data between application and kernel buffers is also necessary when the I/O device has buffer alignment restrictions which are generally not known by applications, or when the operating system performs transformations on the data, for instance by packetizing network data. Copying data between user and kernel buffers can be viewed as another service provided by the operating system, giving applications the flexibility to use arbitrary I/O buffers allocated under program control.
Although the services provided by the operating system simplify application programs, they can introduce significant overhead. The overhead can be classified into two components: 1) control overhead from context switches and long code paths inside the operating system; and 2) data transfer overhead associated with copying. Unfortunately, both components are limited to a large degree by main memory performance and hence do not scale proportionally to the processor-oriented performance of compute-intensive applications. As discussed in Section 1.3, the primary reason for this is that operating system code is executed infrequently enough to incur many cache and TLB misses, which are dominated by main memory performance. Many instructions executed during context switches exhibit short data dependencies and cannot take advantage of the out-of-order execution capabilities of modern processors. In addition, many privileged instructions access global processor state and are implemented such that they serialize the superscalar pipeline of the processor. Copying data between address spaces frequently incurs many cache misses, since either the source or destination buffer was previously invalidated in the processor cache by a direct memory access (DMA) operation. Furthermore, unlike a DMA engine, an individual processor is usually not able to exploit the full memory bandwidth available to pipelined burst bus transactions since it supports only a small number of outstanding memory operations.
2.2 File I/O Overhead Characterization
File I/O is one of the most common forms of I/O operations performed by applications. The access latency and transfer rates of hard disks have not kept pace with the performance of other I/O systems, e.g., local area networks. Disks are found in virtually any general purpose computer system. Many I/O intensive applications use the storage system as either the source or sink of their operations.
Figure 1 shows the different phases of a read system call and the different processor execution contexts encountered. Due to the long latency of a disk access, a file read transaction is performed in two phases. When the read system call detects that the requested disk blocks are not in the buffer cache, it initiates a disk read request and suspends the calling process. After the disk controller has transferred the data into the buffer cache, it interrupts the host CPU. The interrupt handler reschedules the application process, which then resumes execution in the system call. After being woken up, the system call copies the requested data from the buffer cache into the application buffer and returns to user mode.
Figure 1: Read System Call with Disk Access (execution contexts over time: application, kernel, interrupt handler; the system call latency spans disk request initiation, idle/other process execution, the disk interrupt and the status read)
Although the host processor can execute another process during the long-latency disk access, the involvement of the operating system during the transfer setup and the interrupt handling and process rescheduling is considered overhead. The following sections describe experiments that quantify this overhead.
2.2.1 File Read System Call Overhead
2.2.1.1 Methodology. The challenge in measuring the overhead incurred by a disk read is that the initiating process is suspended during the actual disk read. However, it is possible to observe the effective idle time during a read system call and subtract this time from the read system call latency observed by the calling application. The setup for this experiment consists of two closely cooperating processes, as shown in Figure 2.
An I/O process performs the read system call while an observation process measures the idle time between the disk transfer initiation and the completion, as well as the end-to-end context switch time. The two processes communicate and synchronize through a shared memory structure that provides storage for a synchronization flag and several time variables. Before entering the read system call, the I/O process increments the synchronization flag and saves the current time in a shared variable. The system call suspends the I/O process and switches to the observation process, which is spinning on the synchronization flag. It detects that the system call has suspended the I/O process and saves the current time, read via a system call, in another shared variable. It then continuously reads the current time and stores it in a third shared variable. The observation process runs until the disk transfer completes, at which point the disk interrupt handler wakes the I/O process up and causes a context switch. The value in the third shared variable represents the time of the disk interrupt. After returning from the read system call, the I/O process determines the total system call latency, as well as the difference between the various timestamps, and computes the system call overheads.
Figure 2: Measuring Disk Read Overhead (execution contexts over time for the I/O process, kernel and observation process, showing setup overhead, idle time and completion overhead)
This methodology represents a portable way of measuring disk I/O overhead. It relies only on the fork() system call and the System V shared memory interface. It also requires that the experiment is performed on a uniprocessor system and that no other processes are active during the experiment. Note that measuring individual setup and completion overheads requires that the system provide a global high-resolution timer, while the total overhead can be measured using process-local timers.
2.2.1.2 Experimental setup. The experiments using the microbenchmark are performed on an SGI Origin 200 [56] and a Sun Ultra-1 workstation, both running commercial UNIX variants. Table 1 summarizes the relevant features of these platforms. The two systems represent very different architectural approaches. The Ultra-1 is a uniprocessor workstation using a superscalar in-order microprocessor [102], while the Origin 200 is a distributed shared memory system with four MIPS R10000 processors [68] on two nodes. The operating systems, on the other hand, are both modern, internally multithreaded UNIX System V variants [37].
2.2.1.3 Results. All experiments access a local SCSI disk. Before each experiment, the operating system buffer cache is flushed by reading a file of the same size as main memory, thus ensuring that the buffer cache is completely filled with blocks from this file. The flush file must reside on the same device as the measurement files, since the buffer cache is indexed using the device number.
Table 1: Experimental Platforms

                SGI Origin 200                     Sun Ultra-1
OS              IRIX 6.5                           Solaris 2.6
CPU             4 x 225 MHz R10000                 143 MHz UltraSPARC-1
L1 D-Cache      32 Kbyte, 32-byte blocks,          16 Kbyte, 32-byte blocks,
                2-way set-associative,             direct-mapped, write-through,
                write-back, virtually indexed      virtually indexed
L1 I-Cache      32 Kbyte, 64-byte blocks,          16 Kbyte, 32-byte blocks,
                2-way set-associative,             2-way set-associative,
                virtually indexed,                 virtually indexed,
                physically tagged                  virtually tagged
L2 Cache        2 Mbyte, 128-byte blocks,          0.5 Mbyte, 64-byte blocks,
                2-way set-associative,             direct-mapped,
                physically indexed & tagged        physically indexed & tagged
TLB             64 entries, unified with           64-entry D-TLB,
                micro I-TLB                        64-entry I-TLB
Main Memory     1024 Mbyte                         256 Mbyte

Table 2 summarizes the average of 16 measurements for the two different systems. The results clearly show that both the setup overhead and the total overhead scale with the request size. Before accessing the disk, the read system call checks if the requested file blocks are in the buffer cache; as the file size grows, more blocks need to be checked. At the end of the disk transfer, the requested amount of data is transferred from the buffer cache to the user buffer. This copy operation is performed by the host processor and the overhead is dependent on the amount of data. The slightly better scaling of the total overhead on the Ultra-1 may indicate a more efficient implementation of the copy operation in Solaris.
2.2.2 Interrupt Overhead
Interrupts are an important part of the I/O subsystem of modern computer systems. During normal operation, interrupts are used to signal external events from I/O devices, such as the completion of a disk transaction, the successful transmission or the arrival of a network packet, or a periodic timer event to the kernel. Because of this frequent use, interrupts affect all aspects of OS performance, and they represent an increasingly important bottleneck in modern systems. Indeed, interrupt performance becomes crucial for gigabit networking, or highly parallel or pipelined I/O. For instance, Gallatin et al. [40] find that interrupt handling accounts for between 8 and 25 percent of receiver overhead in their measurements of TCP/IP performance on an Alpha 21164 workstation. The author has observed the MySQL database server to spend over 35 percent of its CPU cycles in the kernel, and 20 to 25 percent of the kernel time can be related to interrupt handling. The trend towards multithreaded and modular operating systems further increases the interrupt handling cost.

Table 2: Read System Call Overhead with Disk Access

System           Request  System Call     Setup          Total
                 Size     Latency (ms)    Overhead (µs)  Overhead (µs)
Sun Ultra-1      8 K      10.5 - 21.6     260 - 480      300 - 700
                 32 K     20.0 - 25.5     260 - 500      400 - 800
                 256 K    63.7 - 102.4    200 - 480      350 - 800
SGI Origin 200   8 K      11.6 - 17.7     240 - 280      300 - 400
                 32 K     11.2 - 19.8     240 - 340      400 - 600
                 256 K    23.2 - 34.2     330 - 420      1200 - 1500
This section presents a portable methodology for measuring the cache impact of interrupts, which is subsequently used to study disk interrupts on a Sun uniprocessor workstation and on an SGI shared memory multiprocessor. The methodology can be applied to both disk and network interrupts; this section contains only a summary of the results for disk interrupts [92].
2.2.2.1 Methodology. To measure the cache effects of interrupts from a user program's perspective, the methodology employs an application with perfect cache behavior that repeatedly touches all cache lines. An I/O interrupt disturbs the application by replacing some number of cache lines in each level of the cache hierarchy. In addition, the interrupt handler itself incurs cache misses. To measure these effects, the experimental application first performs an operation that will lead to an I/O interrupt at a later point in time (phase 1 in Figure 3(a)).
It then starts the event counters and fills the cache with application data (phase 2) by reading a data array with the stride equal to the cache block size and a size matching the cache size, without actually consuming the data. Filling the instruction cache can be similarly accomplished with a routine that repeatedly branches forward, touching every instruction cache block exactly once. After a fixed time period, during which the I/O interrupt has been handled (phase 3), the application touches every cache line again, and stops the event counters (phase 4). The number of cache misses measured by the event counters corresponds to the number of cache lines that have been replaced by the interrupt handler. By varying the size of this array, one can measure L1, L2, or TLB effects.
Counting cache misses in different counter modes allows one to observe a variety of effects. In user mode, the number of cache misses indicates how many cache lines have been replaced by the interrupt handler. Since the experimental application touches the entire cache, this represents the worst-case cost of interrupt handling, and can be used to estimate the cache footprint of the interrupt handler. When counting in kernel or exception mode, the experiments measure the number of cache misses incurred by the interrupt handler itself.
Initial experiments using this methodology revealed that normal periodic system activity introduced a significant number of cache misses in an otherwise idle system. For instance, waiting for a disk event takes 10 to 20 ms. If the application does not touch the cache during this time, many of the application's L1 data cache lines will be evicted by system threads before the I/O interrupt handler runs. This system activity is caused by various periodic clock interrupts and related handler threads and network broadcast messages.
Figure 3(b) illustrates a refined approach that isolates the effects of the disk or network interrupt. While waiting for the particular I/O interrupt to occur, the application repeatedly touches all cache lines, forcing them into the cache. This guarantees that despite the periodic clock interrupts, the interrupt handler being measured incurs close to the maximum number of cache misses. Omitting the interrupt-scheduling system call measures the number of cache misses in an idle system over a fixed period of time.
To generate a disk interrupt, the application issues an asynchronous read system call to a small file residing on a local disk. The asynchronous read allows the application to continue executing while the data is transferred from disk. To guarantee that the file is read from disk, the file cache is flushed by reading a large file (equal to the main memory size of the experimental machine). Note that the amount of data transferred is smaller than or equal to the smallest cache line size in the system, so that the measured interrupt handler overhead is not dominated by the data copy to or from user space.
Figure 3: Structure of Interrupt Overhead Experiments. a) Basic Experiment; b) Refined Experiment to Eliminate Effects of Other System Activity

All experiments are repeated at least 16 times, until the 95 percent confidence interval is less than 10 percent of the arithmetic mean of the samples (±5 percent). Before calculating the mean, high and low outliers are removed. The results presented here are the arithmetic mean of the remaining data points.
2.2.2.2 Experimental setup. The interrupt overhead measurements are performed on a Sun Ultra-1 and an SGI Origin 200 workstation. These are the same systems used previously to measure overall disk I/O overhead. The relevant machine characteristics are summarized in Table 1.
2.2.2.3 Results. Table 3 presents the results for both platforms for the L1 data and instruction caches and the L2 cache. As expected, when the kernel delivers a signal to the application at the end of the disk transfer, the number of cache misses increases slightly. The L1 data cache effects are very similar for both platforms. The data cache footprint of the interrupt handler is approximately 3-4 kilobytes. Note that since the Ultra-1 cache line size is 16 bytes, the number of cache misses incurred by the application is twice that of the Origin 200.
Table 3: Cache Misses

                                 L1 D-Cache       L1 I-Cache       L2 Cache
Mode       Description           O-200  Ultra-1   O-200  Ultra-1   O-200  Ultra-1
User       no signal delivered   104    215       169    402       298    253
           signal delivered      130    216       156    444       300    243
Kernel     no signal delivered   52     273       58     100       70     245
           signal delivered      63     228       61     110       65     264
Exception  no signal delivered   48     n/a       80     n/a       64     n/a
           signal delivered      53     n/a       86     n/a       74     n/a
Since the Ultra-1 performance counters do not distinguish between kernel and exception mode, the Solaris kernel mode cache misses should correspond to the sum of the kernel and exception cache misses in IRIX. However, because Solaris events are counted regardless of the process context, the results are higher than the corresponding IRIX results. In IRIX, on the other hand, the cache misses incurred by the interrupt handler thread are not included in these measurements; hence the sum of kernel and exception misses is less than the total number of replaced cache lines.
The L1 instruction cache results follow the same trend as for the data cache. Both platforms show an instruction cache footprint of about 10-13 kilobytes. The Solaris interrupt handler replaces about 15-18 percent more instruction cache lines than the IRIX handler. However, the Solaris results show a large variation (especially in kernel mode) and occasionally do not reach the 95 percent confidence interval of ±5 percent of the mean. This is due to the Origin 200 being a four-processor system, where clock interrupt and network packet processing can be moved to other processors. On the Ultra-1, all system activity is handled by the single processor. This introduces more noise into the experiments, especially when measuring over the relatively long period of 25 ms.
Note that due to the inclusion property of caches, instruction cache lines may be evicted when the corresponding L2 cache line is replaced, regardless of whether it is replaced by data or instructions. This explains why the number of kernel instruction cache misses is lower than the total number of L2 cache lines replaced by the interrupt handler.
The number of L2 cache lines replaced by the interrupt handler is approximately the same for both platforms, with the Sun results being slightly lower. Since the L2 cache line size on the UltraSPARC processor is half that of the MIPS R10000, this indicates that the interrupt handler does not exhibit sufficient spatial locality to benefit from a larger cache line size. This is confirmed by the observation that the sum of the number of L1 instruction and data cache lines replaced is almost equal to the number of replaced L2 cache lines on the SGI platform.
On the Sun, on the other hand, the sum of L1 instruction and data cache misses in user mode is higher than the number of L2 misses, possibly because in the smaller L2 cache of the UltraSPARC instructions and data overlap and conflict with each other, creating a smaller footprint.
These observations confirm that from the application's perspective, interrupts in multithreaded operating systems have a higher cost than in traditional operating systems. The additional thread scheduling activity and context switches cause many more application cache misses [78].
2.2.3 Latency Hiding and I/O Bandwidth Scaling
The varying latency of I/O requests is hidden from applications by the operating system by blocking the requesting application in the system call while the kernel performs a context switch to another process. This latency hiding technique improves overall system throughput, and it can also improve I/O throughput if requests to independent I/O devices can be overlapped. The throughput improvement is in part limited by the operating system overhead associated with each I/O request.
A synthetic I/O intensive benchmark serves to demonstrate the effect of operating system overhead on I/O throughput. The benchmark issues multiple streams of read requests to a collection of independent disks. Each request stream extracts the maximum bandwidth from a disk, while overlapping requests maximize overall I/O throughput. To eliminate bandwidth limiting effects of the SCSI bus or host adapter, each disk is attached to a separate SCSI host adapter. The purpose of the benchmark is to measure the maximum obtainable I/O throughput under ideal conditions, where streams are directed at separate disks and the application performs no operation on the data. As such, the benchmark does not represent real application behavior, but it demonstrates an upper bound of obtainable I/O performance when applications overlap I/O requests with other work.
The results shown here are obtained by running the benchmark on the L-RSIM simulation system [90]. L-RSIM is a detailed execution-driven simulator that accurately models an out-of-order processor with caches, a main memory controller and a number of I/O devices. The simulator executes a BSD-based operating system that implements a complete I/O subsystem, including filesystem and device drivers. Section 3 contains a more detailed description of the simulation system, and presents results of a validation against a real workstation. This experiment, as well as the experiments in Chapter 8, use the system configuration summarized in Table 4. This configuration is an approximation of a hypothetical 400 MHz R12000 based workstation such as an SGI Origin 200.
Table 4: Simulator Configuration

Parameter     Value
Processor     400 MHz, dynamically scheduled, 48-entry reorder buffer,
              4-way dispatch, issue & graduation
L1 caches     32 Kbyte 2-way set-associative instruction & data caches
L2 cache      2 Mbyte 2-way set-associative
System Bus    100 MHz, 64-bit multiplexed address & data
Main Memory   100 MHz SDRAM, 4 banks
I/O bus       66 MHz PCI, 320 ns read latency
Disk          9 Gbyte, 10000 rpm, 5.3 ms average seek time

Figure 4 shows aggregate I/O bandwidth for varying numbers of request streams. The top graphs show the total I/O bandwidth if each stream requests data in a pseudorandom sequence with two different request sizes. Bandwidth saturates when the number of streams approaches 10, and remains largely constant as the pressure on the I/O system increases further. The saturation is due to the fact that the host processor occupancy reaches 100 percent; adding more requests does not lead to increased overlap. It should be noted that the saturation point remains almost unchanged for different request sizes, while the aggregate bandwidth is slightly higher for larger requests.
The bottom graphs in Figure 4 show aggregate bandwidth for sequential access patterns. Issuing requests sequentially leads to higher aggregate bandwidth because both the disk controllers and the operating system are able to successfully prefetch disk blocks for subsequent requests. However, the operating system overhead remains unchanged. As a result, bandwidth saturates at an even smaller number of requests, because prefetching reduces the effective latency of I/O requests, and thus reduces the potential for overlap.

Figure 4: I/O Bandwidth Scaling (aggregate bandwidth in Mbyte/s versus number of streams, for nonsequential and sequential access with 16 Kbyte and 64 Kbyte blocks)

The benchmark used for these experiments issues streams of I/O requests from independent processes. Many modern operating systems also provide a nonblocking I/O interface that allows applications to continue executing while an I/O request is processed. The traditional UNIX kernel, however, is based on a blocking I/O model. To implement the nonblocking I/O interface with minimal changes to the kernel structure, most operating systems use kernel threads to perform the I/O-related work in the background. Upon issuing a nonblocking I/O request, a library routine spawns an I/O thread, which executes a normal blocking system call and signals the original thread when it completes. Although this interface avoids the overhead of switching between separate processes, it executes several system calls to create and remove the I/O thread. As a result, nonblocking requests are at least as costly in terms of operating system overhead as traditional blocking requests.
2.2.4 Application Throughput Scaling
Many applications with high I/O throughput requirements exploit latency hiding techniques to improve throughput. Databases and Web servers process multiple requests concurrently and use locking to synchronize conflicting accesses. Data are distributed across many disks to not only provide the necessary storage capacity but also to support overlapping disk accesses. The experiment described in this section simulates the MySQL database server [98] on the L-RSIM simulation system to demonstrate the impact of operating system overhead on I/O throughput scaling.
MySQL is a public-domain database server that supports the SQL database query language. It is internally multithreaded and can take advantage of nonblocking I/O operations if supported by the kernel. If used with a user-level thread library, I/O operations block the entire process including all threads. Databases are represented as regular directories and files, making the server extremely portable. However, performing disk accesses through the buffer cache is not representative of many commercial databases, which traditionally access the raw disk device. On the other hand, modern network-attached fileserver appliances [63] export only an NFS filesystem interface and require commercial database systems to move towards a regular file I/O interface as well [22]. In this experiment, the MySQL database system serves as a representative of a number of I/O intensive applications that are characterized by data sizes on disk that exceed main memory capacity and high I/O throughput requirements. These applications include file servers, mail and news servers, web servers and databases.
Each experiment runs a number of identical copies of the database server that execute the same queries on separate disks to approximate the behavior of a multithreaded database server. Executing multiple copies of the database server to improve I/O throughput is not necessarily representative of real I/O intensive applications. However, even though MySQL is internally multithreaded and can exploit overlapping I/O requests among threads, the simulator operating system does not provide the necessary thread system calls and kernel thread support. To more closely approximate the performance of a multithreaded database, the level one and level two cache size and associativity of the simulated architecture is scaled with the number of server processes, thus minimizing the impact of cache conflicts among processes.
Each copy of the server executes a sequential query on a 30 Mbyte database, performed mostly as a series of 16 Kbyte read requests. The buffer cache is configured with 20 Mbytes for all experiments. Since the cache is initially empty and the database is too large to fit in the cache, buffer cache effects do not influence I/O performance.
In this experiment, a single query executes in approximately 5 seconds real time on the simulated architecture. Due to the large number of disk seek operations, the CPU is busy for about 8 percent of the time. Over one third of these CPU cycles are spent executing operating system code related to I/O operations. This number is higher than the percentage of operating system cycles of commercial database engines, partly because MySQL is not as highly optimized to avoid I/O operations. On the other hand, other types of servers that perform fewer data manipulations, such as file servers or web servers, exhibit an even higher operating system component of the execution time.
Figure5 showsthespeedupof parallelqueriesoverasinglequeryaswell asuserand
kernelCPUutilization components.Thesimulatorconfigurationfor theseexperimentsis
the same as in the previous experiment and is summarized in Table 4.
The left graph shows the throughput relative to a single query. The database throughput increases almost linearly up to nine queries. Beyond this point, throughput saturates at close to a speedup of eight. The right graph plots CPU utilization in terms of busy cycles and cycles spent in the operating system. The busy cycles curve closely follows the throughput curve. As the processor utilization approaches 100 percent, throughput saturates. However, even with a large number of queries, the processor is never completely utilized. This effect is due to the fact that disk latency depends on when a request is issued with respect to the position of the disk head relative to the requested sector. This nonlinear behavior prevents the system from fine-grain multiprogramming with perfect processor utilization. The operating system component of the processor utilization remains almost constant between 34 and 36 percent, slightly increasing towards larger numbers of queries. The operating system overhead of about one third of the processor cycles can be attributed almost completely to I/O related activity. Reducing this overhead makes these cycles available for application processing and improves the throughput scalability of I/O intensive applications such as the MySQL database server.
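The saturation behavior described above can be approximated by a simple back-of-the-envelope model; this sketch is not from the dissertation, and the function name and the `overlap_efficiency` parameter are illustrative assumptions introduced here to capture the nonlinear disk-latency effects informally:

```python
def modeled_speedup(n_queries, busy_fraction, overlap_efficiency=1.0):
    """Rough upper bound on throughput speedup when n I/O-bound queries overlap.

    With one query keeping the CPU busy for fraction `busy_fraction` of the
    time, perfect overlap is limited either by the number of queries or by
    CPU saturation at 1/busy_fraction. `overlap_efficiency` < 1 crudely
    models the nonlinear disk-latency effects that keep utilization below
    100 percent.
    """
    return min(n_queries, overlap_efficiency / busy_fraction)

# Single query: ~8 percent CPU busy -> an ideal ceiling of 12.5x.
ideal = modeled_speedup(16, 0.08)
# With imperfect overlap (here assumed ~65 percent effective), the ceiling
# drops to roughly eight, in the vicinity of the saturation point observed.
observed = modeled_speedup(16, 0.08, overlap_efficiency=0.65)
```

Under this toy model, reducing the operating system share of the busy cycles lowers `busy_fraction` directly, which raises the saturation ceiling — the scalability argument made in the text.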
2.3 Network Attached Disks
Network-attached secure disks (NASD) [42] are a promising architecture for scalable file servers that addresses some of the shortcomings of current storage architectures by separating storage functionality from storage management, and by delegating lower-level device management functionality to the disk controller, as shown in Figure 6.
Separating storage from access management permits data transfers directly between disks and clients, thus bypassing the main memory bottleneck of conventional file servers. This direct data transfer is made possible by the high-level interface that the network-attached disks provide, and by the disks' ability to check access permissions for each request. NASD devices store and operate on objects, rather than individual blocks, eliminating the need for a block-management entity that is found in conventional file systems. Access to these objects is managed by a separate file manager, which grants capabilities to clients to access objects on the disks. The file manager is also responsible for mapping filenames to disks, volumes and objects, and for metadata updates such as object creation and attribute changes. Data transfers for read and write operations are performed autonomously between clients and disks over a scalable system area network.

Figure 5: MySQL Database Throughput Scaling (left: speedup versus number of queries; right: CPU utilization in percent, showing busy cycles and kernel I/O cycles)
To maintain the integrity of the disks, and to provide UNIX-level protection, the clients and disks use capabilities in combination with public/private key encryption when transmitting requests and data. When first connecting to the distributed storage system, a client requests a set of capabilities from the file manager that then allow the client to directly contact the disks to read or write data.
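The split between the control path (file manager) and the data path (disks) can be sketched as follows; the class names, the token format, and the HMAC-based capability check are illustrative inventions for this sketch, not the actual NASD protocol:

```python
import hmac, hashlib

class FileManager:
    """Grants capabilities and manages the namespace; never touches the data path."""
    def __init__(self, secret):
        self.secret = secret          # shared with the disks out of band
        self.namespace = {}           # filename -> (disk, object_id)

    def create(self, name, disk, object_id):
        self.namespace[name] = (disk, object_id)

    def grant(self, name, rights):
        disk, obj = self.namespace[name]
        token = hmac.new(self.secret, f"{obj}:{rights}".encode(),
                         hashlib.sha256).hexdigest()
        return disk, obj, rights, token

class NasdDisk:
    """Stores objects and checks each request's capability itself."""
    def __init__(self, secret):
        self.secret = secret
        self.objects = {}

    def read(self, obj, rights, token):
        expect = hmac.new(self.secret, f"{obj}:{rights}".encode(),
                          hashlib.sha256).hexdigest()
        if "r" not in rights or not hmac.compare_digest(token, expect):
            raise PermissionError("bad capability")
        return self.objects[obj]

secret = b"shared-key"
disk = NasdDisk(secret)
disk.objects[7] = b"payload"
fm = FileManager(secret)
fm.create("/data/log", disk, 7)

# The client asks the file manager once, then talks to the disk directly.
d, obj, rights, token = fm.grant("/data/log", "r")
data = d.read(obj, rights, token)
```

The point of the sketch is that after the one-time `grant`, reads and writes bypass the file manager entirely, which is what removes the server's memory system from the data path.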
A study by Gibson et al. [42] finds that network-attached secure disks indeed lead to scalable distributed file systems with performance at least comparable to existing server-based solutions. However, the communication overhead between clients and file manager is identified as a major bottleneck, potentially limiting the scalability of their experimental prototype.
Figure 6: Network-attached Secure Disk Architecture (the file manager handles create, delete, and capability requests from clients; read/write transfers go directly between clients and disks)
2.4 InfiniBand
InfiniBand [51] is a recently developed industry standard for a next-generation distributed I/O architecture. It addresses several shortcomings of current I/O architectures, including bandwidth limitations imposed by shared I/O buses and the overhead incurred by operating system mediated access to I/O devices.
InfiniBand centers around a distributed I/O architecture with a scalable system-area network consisting of switches, routers and host channel adapters (HCA) that connect client systems to autonomous I/O devices. Figure 7 shows an example of the switched network with a variety of client systems and I/O devices. Compared to shared I/O buses, a switched network is able to provide scalable bandwidth for larger numbers of devices, and is able to provide higher transfer rates through the use of point-to-point links that can operate at higher signalling frequencies. It also permits direct device-to-device data transfers and can be used to communicate between client systems.

Figure 7: InfiniBand Distributed I/O Architecture (client systems with CPUs, memory, and host channel adapters connected through switches and a router to I/O devices and other subnets)
Similar to NASD, InfiniBand requires autonomous I/O devices that are able to export a high-level interface to clients and that perform complex buffer management functions and access protection checks. However, the exact semantics of requests sent to I/O devices are not part of the InfiniBand standard, and remain to be defined.
The InfiniBand architecture recognizes the need to bypass the client operating system for I/O requests to achieve high I/O throughput. The mechanism used to communicate with the local I/O network interface, and ultimately with remote devices, is based on the U-Net user-level network architecture [32]. Each client application establishes at least one network endpoint consisting of send, receive, and completion queues located in memory. The host channel adapter (HCA) polls all work queues and processes requests deposited in these queues.
To enable the HCA to access user memory, the work queues and data buffers reside in memory that has been registered with the network interface through an operating system service. The operating system ensures that the corresponding physical pages are contiguous and nonpageable, and it communicates the virtual-to-physical address mapping to the interface. These prearranged buffers facilitate low-cost DMA operations without operating system involvement, but restrict the application's use of I/O buffers. This approach can also limit scalability, as all registered buffer pages must be present in physical memory. Applications may conservatively allocate more buffer space than is required, thus increasing the pressure on the physical memory pool further. Restricting the HCA to use pinned and prearranged memory regions simplifies the DMA engine hardware and the integration with existing system software, but places restrictions on application programmers. On the other hand, bypassing the operating system for request initiation moves the responsibility to multiplex and schedule multiple I/O requests to the HCA hardware. This task requires either complex finite state machines to control the HCA operation or a programmable on-chip controller, both of which increase the HCA's hardware complexity. In addition, in a system with many active processes, polling many queues can increase request latency significantly, even if most queues are frequently empty.
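The interaction between memory registration and user-level request posting can be illustrated with a toy model; the class and method names here are invented for the sketch and only loosely mirror InfiniBand's verbs-style interface:

```python
from collections import deque

class HostChannelAdapter:
    def __init__(self):
        self.registered = set()       # pages the OS has pinned and mapped
        self.send_queues = []         # one work queue per endpoint

    def register_memory(self, pages):
        # OS service: pins the pages and installs the address mappings.
        self.registered.update(pages)

    def create_endpoint(self):
        q = deque()
        self.send_queues.append(q)
        return q

    def poll(self):
        """The HCA scans all work queues; only registered buffers are DMA-able."""
        completions = []
        for q in self.send_queues:
            while q:
                page, length = q.popleft()
                status = "sent" if page in self.registered else "error"
                completions.append((status, page))
        return completions

hca = HostChannelAdapter()
hca.register_memory({0x1000, 0x2000})
qp = hca.create_endpoint()
qp.append((0x1000, 256))   # user-level post: no system call involved
qp.append((0x3000, 64))    # unregistered buffer -> rejected by the HCA
results = hca.poll()
```

Note that `poll` walks every queue regardless of occupancy, which is exactly the latency concern raised above for systems with many mostly-empty endpoints.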
2.5 User-level Network Architectures
Communication subsystems for tightly coupled parallel machines, and more recently for clusters of commodity workstations, regularly bypass the operating system for common operations to reduce communication overhead and improve scalability. Most of these systems provide some form of protected and atomic user-level access to the network, minimize the number of data copies for data transfers, and implement lightweight notifications.
Many user-level communication architectures use a connection-oriented approach to minimize the amount of information that needs to be transferred to the device for each message. During the connection setup, the operating system can pin memory, communicate address mappings to the device and perform protection checks. Application software is then able to initiate message transfers using the established endpoint while bypassing the operating system. Privileged arguments are inferred by the network interface from the connection endpoint. This approach is very effective in reducing the per-message overhead by moving a portion of the software overhead to the connection setup time. However, its scalability to a large number of connections can be problematic as the network device needs to manage and store the state of each connection. It also does not reduce overhead for applications that require frequent connection setup and teardown operations.
U-Net [32] and the virtual network interface in NOW [62] use existing memory management facilities to provide multiple applications with the illusion of exclusive access to the network. User-level software never directly accesses the device hardware but communicates with the device through shared data structures or network endpoints in virtual memory. These data structures allow applications to assemble the request arguments in memory without concern for atomicity before informing the device hardware of the new request. Virtualizing the device hardware through per-application endpoints in memory simplifies the software interface and eliminates the need to provide hardware support for atomicity. However, it increases the device hardware complexity significantly by putting the burden of multiplexing the endpoints on the network interface.
Other designs utilize the inherent atomicity of a bus transaction to implement atomic device access. ATOLL [17] uses the unused upper address bits of an uncached write transaction as an index into a routing table to initiate a DMA transfer. Avalanche places a request control structure in kernel memory and updates a queue counter in the NI using a single uncached write. The Medusa network adapter [7] combines a network packet start address and length into a 32-bit word that is written to a hardware transmit FIFO in a single bus transaction. Since a bus transaction is a natural unit of atomicity in most computer systems, it is tempting to use it to implement atomic message transfer setup. The small size of uncached transfers, however, limits the number of request parameters that can be transferred with each request. Cache line transactions are able to transfer a larger number of arguments atomically, but software has no control over these transactions because cache conflicts and replacements are handled entirely by the cache controller.
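A Medusa-style encoding can be sketched as a simple bit-packing exercise; the 20/12-bit split chosen here is an assumption for illustration, not the adapter's actual format:

```python
# Pack a packet start address and length into one 32-bit word, so that a
# single uncached store delivers the whole request atomically.
ADDR_BITS, LEN_BITS = 20, 12   # illustrative split, not Medusa's real layout

def pack_request(start, length):
    assert start < (1 << ADDR_BITS) and length < (1 << LEN_BITS)
    return (start << LEN_BITS) | length

def unpack_request(word):
    # The device decodes the same word on the other side of the FIFO.
    return word >> LEN_BITS, word & ((1 << LEN_BITS) - 1)

word = pack_request(0xABCDE, 1500)   # fits in one 32-bit bus transaction
assert unpack_request(word) == (0xABCDE, 1500)
```

The hard limit visible here — everything must fit in the width of one bus transaction — is precisely the parameter-count restriction noted in the text.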
Performing DMA operations that have been initiated by user-level software without operating system involvement is complicated by the fact that applications operate on virtual addresses, whereas I/O devices use physical addresses to access memory. Providing the required address translations and pinning the associated physical pages is often done during connection setup and implies that message buffers remain fixed for the duration of the connection. Overcoming the restriction of statically pinned DMA buffers requires adding a dynamic address translation capability to the DMA engine. Similar to a processor TLB, the DMA engine can cache address translations in on-chip memory and thus amortize the cost of address translations by exploiting spatial locality in applications. The operating system performs address translations and protection checks at the time a mapping is installed in the DMA TLB. Although this can be done on demand using interrupts, handling page faults at interrupt time is difficult at best because no context is available that can block waiting for the disk request. To avoid extensive modifications of the operating system kernel, the U-TLB mechanism performs address translations and pins pages in advance under application control [21]. Applications manage a table of translations located at the DMA device. To implement protected address translation, applications have no access to the physical address of a mapping but specify DMA addresses using an index into the translation table. System calls are used to install mappings in the table, at which time the physical page is pinned.
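The index-based indirection of the U-TLB scheme can be sketched as follows; the class and method names, and the flat `page_tables` dictionary standing in for the real page tables, are illustrative assumptions:

```python
class UTlb:
    """Toy model of an application-managed device translation table."""
    def __init__(self):
        self.table = []     # index -> pinned physical page

    def install(self, vaddr, page_tables):
        """'System call': translate, check protection, pin, return an index."""
        phys = page_tables[vaddr]       # translation done by the kernel
        self.table.append(phys)
        return len(self.table) - 1      # handle the application may use

    def dma(self, index, offset):
        """The DMA engine resolves an index, never a raw physical address."""
        return self.table[index] + offset

page_tables = {0x4000: 0x9F000}         # per-process virtual -> physical
tlb = UTlb()
handle = tlb.install(0x4000, page_tables)
target = tlb.dma(handle, 0x10)
```

Because the application only ever holds the opaque index, it can name DMA buffers without learning physical addresses, which is the protection property the mechanism is built around.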
The U-Net architecture has been extended with a similar demand pinning scheme and a TLB at the DMA engine [106]. Unlike the U-TLB, address mappings are requested by the network interface during a DMA transfer and are processed through kernel interrupts. Page faults are handled by a kernel thread or process that is triggered by the interrupt handler. Pages are pinned when the corresponding translation is installed in the device TLB, and remain pinned until the device evicts the TLB entry. To avoid stalling the network on a TLB miss, the network interface prefetches translations for receive buffers. Applications are not involved in managing the TLB or maintaining translations. The U-Net/MM mechanism provides transparent user-level DMA with arbitrary buffer spaces at the cost of added complexity in the operating system.
Virtual memory-mapped communication [14][39] is a user-level communication mechanism that avoids the user-level DMA problems by performing all communication between pairs of virtual memory pages. Instead of explicitly specifying data transfers, it forwards all modifications made to a local memory page to an associated remote page. This communication scheme reduces communication overhead and latency to a minimum, but since the host processor actively performs all data transfers, bandwidth for bulk transfers is limited by the uncached write bandwidth of the processor.
Many user-level communication architectures reduce message passing overhead by minimizing the number of copy operations for a message transfer. Avoiding copying data when receiving messages is complicated by the fact that messages may arrive before the receiver process has posted a receive buffer. If the communication library provides intermediate buffering for these unexpected messages, communication overhead increases due to the necessary data copy. U-Net and InfiniBand require that processes enqueue receive buffers in advance. Since the arrival order of messages is undefined, processes have no control over which message is deposited in which buffer. Active Messages [23][33] is a communication model that avoids copying at the receiver by invoking a message handler routine upon message arrival. The handler is specified in the message itself. It is responsible for removing the message from the network and integrating it into the flow of computation. Performing all communication in pairs of request/reply messages enables the sender to set up a receive buffer for the reply message as part of the message transmission process, thus avoiding intermediate buffering of incoming messages. However, only tightly coupled parallel programs can be required to follow this programming model, since the sender specifies a user-level routine in the receiver's address space when transmitting a message.
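The handler-in-the-message idea can be sketched in a few lines; the function names and the list standing in for the network are invented for this sketch:

```python
results = {}

def deposit_handler(src, payload):
    # The handler pulls the message straight into the computation's state,
    # so no intermediate receive buffer is needed.
    results[src] = payload

def am_send(network, handler, src, payload):
    # The message carries a reference to its own handler.
    network.append((handler, src, payload))

def am_poll(network):
    while network:
        handler, src, payload = network.pop(0)
        handler(src, payload)           # invoked directly on arrival

net = []
am_send(net, deposit_handler, src=3, payload=41)
am_poll(net)
```

The sketch also makes the trust assumption visible: the sender names a routine that runs in the receiver's address space, which is why the model suits only tightly coupled parallel programs.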
2.6 Virtual I/O Devices
One of the tasks of an operating system is to virtualize the machine hardware so that applications have the illusion of exclusive access to it. If the operating system is bypassed for some hardware accesses, the hardware must be virtualized by some other means. Trusting applications to synchronize among each other when accessing device hardware is only an option in restricted environments such as tightly-coupled multiprocessors. In general-purpose multiuser systems, processes must be protected from uncooperative applications.
User-level network interfaces are usually virtualized by providing multiple endpoints that are multiplexed by the device hardware. U-Net provides process-private queues that are polled by the device for requests. Memory Channel maps distinct physical pages into different processes' address spaces and multiplexes writes to the pages. The NOW virtual network interface uses virtual memory techniques to improve scalability of the network interface [62]. As in U-Net, network endpoints are located in main memory and mapped into the application address space. Endpoints can be resident in the network interface's outboard memory, or swapped out to main memory. The network interface multiplexes messages from the set of resident endpoints, and cooperates with the host operating system when swapping endpoints, thus extending the number of available endpoints beyond what is supported by the device hardware.
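The endpoint-swapping idea is structurally the same as demand paging, and can be sketched as a small cache of resident endpoints; the LRU policy and all names here are illustrative assumptions, not the NOW implementation:

```python
from collections import OrderedDict

class VirtualNI:
    """Toy model: a few hardware endpoint slots backed by host memory."""
    def __init__(self, hw_slots):
        self.hw_slots = hw_slots
        self.resident = OrderedDict()   # endpoint -> NI-resident state
        self.swapped = {}               # endpoint state kept by the OS
        self.swap_ins = 0

    def deliver(self, endpoint, msg):
        if endpoint not in self.resident:
            self.swap_ins += 1          # OS assistance required (slow path)
            if len(self.resident) >= self.hw_slots:
                victim, state = self.resident.popitem(last=False)  # LRU evict
                self.swapped[victim] = state
            self.resident[endpoint] = self.swapped.pop(endpoint, [])
        self.resident.move_to_end(endpoint)
        self.resident[endpoint].append(msg)

ni = VirtualNI(hw_slots=2)
for ep in ["a", "b", "c", "a"]:
    ni.deliver(ep, "msg")
```

As with demand paging, the scheme trades an unbounded endpoint count for occasional expensive swap-ins, which is the overhead/latency caveat the text raises.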
High-performance graphics adapters are another class of I/O devices that can benefit from bypassing the operating system. In many workstations, the X-server is the only entity with direct access to the graphics hardware. Like an operating system, it multiplexes graphics commands that it receives from applications onto the hardware while maintaining window boundaries and other application context. In high-performance applications, the X-server becomes a bottleneck because it requires multiple context switches between the application and the server to transmit a command to the graphics engine. Direct rendering bypasses the X-server and lets applications transmit commands directly to the graphics hardware, which needs to be virtualized to handle multiple client applications.
SGI’s direct rendering implementation virtualizes the graphics adapter by treating it as part of the process context [53]. To avoid the cost of loading and unloading the entire graphics engine state on every context switch, the operating system implements a lazy context switch strategy and performs the switch only if the new process attempts to access the graphics hardware. The DIGITAL PowerStorm 4DT graphics adapter takes a less aggressive approach to direct rendering and uses the X-server to multiplex graphics commands [59]. Instead of using heavyweight interprocess communication mechanisms to transfer high-level commands, applications queue hardware commands in shared memory pages and direct the X-server to these commands. Both approaches address the overhead involved in multiplexing application commands onto the graphics hardware by optimizing or eliminating the intermediate software entities.
2.7 Operating System Support for High-Performance I/O
Apart from the I/O device hardware capabilities, I/O performance is also affected by operating system overheads, consisting of control overhead such as system calls and of data transfers between protection domains. Operating system support for high-performance I/O addresses both of these sources by avoiding copying data between kernel and user buffers, by optimizing the overhead of layered and modular software design, and by providing mechanisms to customize operating system services for applications' needs.
2.7.1 Copy Avoidance
Data transfers between kernel and user space as well as within the kernel often amount to the largest component of I/O overhead and can limit the overall I/O performance of a system. The I/O interface implemented in UNIX and other operating systems implies a copy semantics for I/O operations by giving applications control over the I/O buffer allocation and deallocation. The system places no restrictions on the location or alignment of I/O buffers. In addition, the operating system maintains strong integrity of the application I/O buffers, such that applications never observe incorrect I/O data on input, and that output data can be modified without ill effects after the output operation has executed. To implement these semantics without placing restrictions on the application, most operating systems copy I/O data between user space and intermediate kernel buffers. Using kernel buffers for I/O also allows the kernel to exploit locality and sharing between applications, for instance through a file buffer cache. However, copying data between protection domains costs valuable processor cycles and displaces cache and TLB entries that later need to be reloaded by the application. Performance of copy operations is largely determined by main memory performance; increases in CPU compute performance generally do not translate into proportionally improved copy bandwidth [88]. The amount of overhead scales with the request size and limits the achievable I/O performance.
Copy-on-write [86] is an optimization that attempts to avoid copying of output data and uses the application buffer for I/O. Rather than copying the data, the kernel marks the pages that contain the output buffer as read-only for the application. If the application modifies the output buffer before the I/O operation completes, it incurs a page fault, upon which the virtual pages containing the buffer are copied to a kernel buffer. Alternatively, the application can be stalled until the I/O operation completes [8]. Copy-on-write introduces overheads that make this scheme beneficial only for larger transfers. The application buffers must be pinned, which can be more costly than pinning kernel buffers.
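The copy-on-write protocol for output buffers can be sketched as a toy state machine; this is an illustrative model with invented names, not a real virtual memory implementation:

```python
class CowBuffer:
    """Toy copy-on-write output buffer: copy only if the app writes early."""
    def __init__(self, data):
        self.app_page = bytearray(data)
        self.kernel_page = self.app_page   # shared until a write occurs
        self.protected = False
        self.copies = 0

    def start_output(self):
        self.protected = True              # kernel marks pages read-only

    def app_write(self, i, b):
        if self.protected:                 # simulated write-protection fault
            self.kernel_page = bytearray(self.app_page)  # copy the old data
            self.copies += 1
            self.protected = False         # app regains write access
        self.app_page[i] = b

buf = CowBuffer(b"abc")
buf.start_output()
buf.app_write(0, ord("X"))    # faults once: the kernel keeps the original
```

If the application never touches the buffer before the output completes, `copies` stays at zero, which is the case the optimization is betting on.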
Further optimizations are possible if the programming model is modified. Giving the kernel responsibility for I/O buffer allocation and deallocation allows it to use page remapping instead of copying for both input and output. Fbufs are a system-wide cross-domain data transfer facility that avoids data copying by remapping and sharing virtual buffer pages between protection domains [27]. To further reduce the transfer cost, fbufs can be cached in individual domains for later reuse, and buffers are mapped at the same virtual address in all domains. Fbufs are a general mechanism for efficiently transferring data between protection domains, including the I/O subsystem, but they imply significant changes to the programming model. Software explicitly allocates and deallocates I/O buffers, but it has no control over the alignment and the virtual address of input data, and cannot reuse the output buffer after an I/O operation. The required changes of the I/O programming interface are one possible reason why, apart from an experimental implementation in Solaris [99], fbufs have only slowly found their way into general purpose operating systems, despite their significant performance advantages over copying.
Container shipping is an I/O facility that, similar to fbufs, replaces data copy operations with page remappings [79][80]. It restricts the generality of fbufs by implementing move semantics when passing containers, and the actual data pages are remapped only if the receiver requires access to the data; otherwise only the container descriptor is passed. The more restricted semantics of container shipping compared to fbufs allows a more efficient implementation, since establishing shared page mappings is more costly than simply unmapping and mapping a page. It also simplifies the programming interface somewhat and lends itself better to strict producer-consumer I/O scenarios.
The virtual transfer technique employed by fbufs and container shipping can be further specialized and optimized for peer-to-peer I/O transfers. Such transfers move data from a source I/O device through the kernel to a sink I/O device without any data manipulation. Splicing is a technique that allows applications to establish such in-kernel data transfer paths [79]. By eliminating the data transfer to and from application space, splicing is able to share kernel buffers between source and sink device drivers, thus completely eliminating in-memory data copies. Peer-to-peer I/O with splicing can drastically improve I/O performance for a significant subset of I/O intensive applications, but is restricted to scenarios where applications do not inspect the data before forwarding it, and neither applications nor kernel modules modify the data.
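The contrast between the conventional copying path and a spliced path can be sketched with a toy kernel model; the classes here are invented for illustration (modern Linux exposes the same idea through the splice and sendfile system calls):

```python
class Kernel:
    """Toy model counting kernel<->user copies on two I/O paths."""
    def __init__(self):
        self.copies = 0

    def read_copying(self, src):
        kbuf = src.fill()
        self.copies += 1               # kernel -> user copy on the read path
        return bytes(kbuf)

    def splice(self, src, sink):
        kbuf = src.fill()
        sink.drain(kbuf)               # same buffer object: zero copies

class SourceDriver:
    def fill(self):
        return bytearray(b"packet")    # models DMA into a kernel buffer

class SinkDriver:
    def __init__(self):
        self.sent = None
    def drain(self, buf):
        self.sent = buf                # models DMA out of the same buffer

k, sink = Kernel(), SinkDriver()
k.splice(SourceDriver(), sink)
```

The spliced path never materializes the data in user space, which is why it only applies when the application neither inspects nor modifies the data.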
IO-lite [75] generalizes the sharing of kernel buffers into a unified buffering and caching scheme and extends the ability to share I/O buffers to applications. In addition to providing a copy-free cross-domain data transfer facility, it minimizes the amount of physical memory used for caching by integrating the file cache into the shared buffer pool. IO-lite thus extends the utility of schemes like fbufs and splicing to applications that inspect data before forwarding it to other I/O facilities, and which combine file and network I/O.
2.7.2 Control Overhead Reduction
Additional performance improvements can be achieved by minimizing the control overhead associated with I/O operations. This optimization is especially effective for small data transfers. Operating systems, like many complex pieces of software, use layering and abstraction to reduce complexity and improve modularity and flexibility. This approach is beneficial for the software development process, but it is often counterproductive for I/O performance. Passing requests through multiple layers, sometimes involving crossing protection domains, incurs overhead in the form of procedure calls, context switches and duplicate work performed in different layers.
The x-kernel operating system combines a framework for high-performance network protocol implementations with support for modular protocol composition [50]. It provides basic low-level services such as thread management, buffer management and event handling that are optimized for network protocol processing. High performance is achieved by avoiding context switches and by providing a streamlined interface both between protocols within the kernel and to applications. Threads are associated with messages rather than with protocols: a thread shepherds a message through a series of protocols without context switches. The Scout operating system takes the idea of minimizing the cost of multiple protocol layers one step further and explicitly specifies paths through the protocol stack [70]. Thus the system is able to apply software optimizations such as common subexpression and dead code elimination, inlining and constant folding at a large scope and with better results. In addition, scheduling decisions can be based on the entire path of a message, thus enabling the kernel to provide quality-of-service and other real-time guarantees. Scout's notion of a path aligns control flow and data transfer and enables optimizations that benefit both overhead components at the same time, while also introducing new optimization opportunities.
Extensible operating systems like Spin [12] or Exokernel [36] allow applications to implement specialized operating system functionality in an extensible kernel, which results in better performance for many applications for which the general-purpose interface of monolithic kernels is inappropriate. Consequently, the interface provided by such kernels offers only low-level primitives and places a greater burden on applications to implement resource management policies and other extensions. At the same time, it gives programmers the opportunity to apply domain-specific optimizations to improve performance or implement new application functionality.
Initiating an I/O request normally requires a system call. One of the reasons I/O device access is performed only by the operating system is to ensure atomicity of the access sequence. The operating system guarantees that processes competing for device access do not interfere with each other or interleave request arguments. Bypassing the operating system when initiating I/O requests avoids the cost of a system call and can lead to performance improvements, especially for low-latency I/O operations such as network communication. Software [11][71] and hardware [47][68] based synchronization mechanisms exist that implement user-level atomic sequences, but these solutions always require that software voluntarily uses the synchronization mechanism to avoid corruption of shared data. Uncooperative applications can bypass the synchronization mechanism and can affect the integrity of the entire system. Hence, these solutions are unsuitable to implement user-level I/O device access.
Synchronous or asynchronous events such as error conditions or I/O completion notifications are usually handled by the operating system, to allow the kernel to hide the low-level details of these conditions from applications. Applications can be informed of a subset of these events via signals. However, the cost of this mechanism prohibits its use for frequent events such as write protection faults in a garbage collector. Low-overhead user-level exception handling [100] reduces the cost of a general-purpose signal handler by saving only minimal state before executing the user-level handler, and by avoiding the system call normally required to resume program execution. However, the mechanism supports only synchronous exceptions that can be handled in the current process context. As such it is not directly applicable to asynchronous I/O interrupts.
2.8 Summary
The I/O interface of most contemporary operating systems specifies copy semantics, which means that applications can use I/O data as if they were copied between kernel and application buffers as part of an I/O operation. This model gives great flexibility to application programmers, but its implementation usually requires an actual copy operation. This copy operation, as well as the system calls and interrupts involved in executing an I/O request, introduces significant overhead that limits the effective bandwidth and throughput of I/O operations. Optimizing the copy operations in the general case without changing the application I/O interface is difficult. The operating system overhead associated with I/O operations not only limits the throughput of individual requests but also restricts the ability of applications to exploit latency hiding techniques to improve system throughput.
Recently proposed storage and I/O architectures address the bandwidth limitation of current I/O and memory buses by distributing I/O devices on a scalable system area network, and placing the responsibility for administrative oversight on a separate access control server. Data transfers can be performed directly between I/O devices and clients, bypassing the memory system bottleneck of traditional servers. By applying user-level communication techniques to I/O, InfiniBand is able to bypass the client operating system for many I/O requests. However, the connection-oriented design of the programming interface can lead to scalability problems, as the I/O network interface hardware needs to maintain state for every process.
User-level communication architectures minimize overhead by bypassing the operating system for message transfers. This can be accomplished by separating connection setup from message transfers. Connections are established under operating system supervision to perform protection checks and to communicate privileged information to the device hardware. Subsequent message transfers are initiated directly by the application using connection endpoint descriptors provided by the operating system. These endpoints are maintained and multiplexed by the network interface, which leads to complex and nonscalable hardware implementations.
Better scalability of such virtual I/O devices is achieved if the device context is treated as part of, or similar to, a process context. This design allows the operating system to swap an unlimited number of communication contexts onto a finite number of hardware contexts, similar to demand paging of physical memory or CPU time slices in a multiprogrammed system. However, triggering and handling device context switches introduces additional overheads and can negatively impact latency.
Operating system improvements targeted at high-performance I/O have addressed both data and control overhead while keeping the operating system involved in I/O requests. Reducing data transfer overhead may require drastic modifications to the programming interface, as applications and the kernel must cooperate closely in I/O buffer management. If the operating system is aware of the control path of I/O data, it is able to optimize these paths, significantly reducing or eliminating duplicate work performed in different modules and applying global optimization techniques. This optimization benefits mostly the latency of small data transfers since it does not necessarily eliminate copy operations.
Other research has focused on allowing applications to tailor both the I/O interface and the services that an operating system provides to their individual needs. Such extensible kernels can improve I/O performance significantly, but require a large effort from the programmer to provide the needed kernel extensions. Other optimizations targeting synchronization and exception handling are not directly applicable to I/O as they assume voluntary cooperation of applications and do not support asynchronous interrupts.
3. THE L-RSIM ARCHITECTURAL SIMULATOR
Execution-driven simulation is a valuable tool for evaluating new computer architectures, since it allows the researcher to modify virtually any part of a computer system without incurring the cost of developing new hardware. Designing a detailed execution-driven simulator is a trade-off between accuracy, simulation speed, and development effort. In many cases, assumptions are made that yield a simpler and faster implementation with sufficient accuracy, as long as these assumptions hold. For instance, it is often assumed that instruction cache misses have a negligible impact on performance and hence a perfect instruction cache is modeled. For similar reasons, many simulators ignore operating system and I/O activity. The resulting simulators are capable tools for evaluating new ideas for improving instruction-level parallelism or cache performance [19][77]. However, when it comes to workloads that exhibit a significant amount of operating system activity, multiprogramming or I/O, the assumptions made when such simulators were developed no longer hold, and a different approach must be taken.
To address these workloads, several full-system simulators such as SimOS [48] and SimICS [61] have been developed. These tools model I/O devices in such detail that an almost unmodified operating system can be booted in the simulation environment, which allows researchers to evaluate virtually any workload. However, to achieve acceptable simulation performance, the processor and cache models employed by these full-system simulators are often very simple, and do not simulate the complex interactions of a modern microarchitecture and memory hierarchy. In addition, the resulting simulation environment is a complex system consisting of the simulator and often a complete, off-the-shelf OS like Linux [20], making the learning curve for new users very steep. The L-RSIM simulator developed for this study combines detailed processor and cache models that are based on the RSIM architectural simulator [77] with the ability to simulate operating system effects and I/O device behavior [90].
3.1 Simulator Machine Model
The processor model implements a dynamically scheduled 32-bit SPARC V9 [104] processor with register renaming. The execution core uses a unified instruction issue window of configurable size, but it also allows further dispatch and issue restrictions for different instruction classes. The cache hierarchy includes L1 instruction and data caches of configurable size and associativity, a unified L2 cache, and a combining buffer for uncached stores. Cache coherency is maintained using the MESI protocol, with extensions for efficient DMA transfers such as read-current and write-purge bus transactions. The memory controller supports multiple banks of SDRAM memory.
The I/O subsystem consists of a PCI bridge that supports boot-time device detection and configuration as outlined in the PCI specification [81]. For simplicity, the PCI bus itself is not modeled. Instead, I/O devices are modeled as if they are directly attached to the system bus by configurable delay pipelines that approximate the latency of a PCI bridge. The real-time clock is modeled after the MOSTEK 48T02 clock chip [60]. It provides two independent interrupt sources with a programmable period ranging from 1 millisecond to 10 seconds. The SCSI adapter is compatible with the Adaptec AIC7770-based SCSI host adapter family [1]. It supports multiple outstanding requests, disconnects and request queueing. Each adapter controls one SCSI bus of configurable width and transfer rate. SCSI bus arbitration delay and idle times are modeled accurately, while data transfers always occur at the maximum synchronous transfer rate of the particular bus.
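The delay-pipeline attachment described above can be sketched as a fixed-depth circular buffer that makes a request visible to the device model only after a fixed number of cycles. The depth and the cycle-driven interface here are illustrative assumptions, not the L-RSIM implementation:

```c
#include <assert.h>
#include <string.h>

/* Sketch of a configurable delay pipeline: a request written by the
 * processor emerges at the device exactly PIPE_DEPTH cycles later,
 * approximating the latency a PCI bridge would add. */
#define PIPE_DEPTH 8            /* hypothetical bridge latency in cycles */

typedef struct {
    int slot[PIPE_DEPTH];       /* request IDs in flight, 0 = empty */
    int head;                   /* next slot to drain */
} delay_pipe;

void pipe_init(delay_pipe *p) { memset(p, 0, sizeof *p); }

/* Advance one cycle: accept a new request ID (0 for none) and return
 * the request that has completed its PIPE_DEPTH-cycle delay, if any. */
int pipe_step(delay_pipe *p, int incoming)
{
    int done = p->slot[p->head];     /* request reaching the device now */
    p->slot[p->head] = incoming;     /* emerges PIPE_DEPTH cycles later */
    p->head = (p->head + 1) % PIPE_DEPTH;
    return done;
}
```

This captures the one property the text relies on: the device sees each access a constant, configurable delay after the processor issues it.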
The modeled SCSI disk implements a prefetch and write cache of configurable size. The cache is divided into segments. Each of the segments can contain a stream of disk blocks starting at an arbitrary offset, up to the maximum size of the segment. After finishing a read request, the disk fetches blocks sequentially into the segment until either the segment is full or a new request arrives. If write caching is enabled, the disk buffers data in the cache and reports completion immediately. The actual write-back happens later when the disk is idle, when a new write request arrives and no write segment is available, or when a synchronize request is received. Blocks that are written to the disk are saved in a file on the simulation host, which allows the disk model to maintain its data across simulation runs. The internal operation of the disk drive as well as the methods for computing seek and transfer times are based on the very detailed descriptions by Ganger [41] and Lee [57].

Figure 8: L-RSIM Machine Architecture (CPU with split L1 instruction and data caches and a unified L2 cache on the system bus, a memory controller with DRAM banks, and a PCI bridge with PCI configuration space connecting the real-time clock and the SCSI adapter, which controls a disk on the SCSI bus)
The architecture of the modeled computer system does not represent any particular system. The processor uses the Sparc instruction set, but its microarchitecture resembles the MIPS R10000. This design choice was made in the original RSIM system, which is the basis of the current processor and cache models. Using the Sparc instruction set architecture opens the simulator to a variety of compilers and a large number of existing applications. Modeling a particular Sparc-based workstation would have required implementing the TLBs, interrupt delivery mechanisms and I/O subsystem in a way which might not always be the best choice for a dynamically scheduled processor. In addition, the goal of the design effort was to provide enough detail to simulate the effects of I/O and operating system activity, rather than to boot an existing operating system. The co-development of the simulator and operating system also allowed easier testing of new features while they were implemented. Most importantly, the simulator is intended to represent a class of machines and not necessarily one particular system. By developing a new machine architecture it was easier to avoid the idiosyncrasies of a particular system, which might affect the validity of future results.
3.2 LAMIX Kernel
The processor and device models provide enough detail to simulate a fully functional UNIX-compatible operating system with I/O subsystem, including system calls, buffer cache, device drivers and interrupts. Parts of the operating system are designed from scratch, based on Linux [20] and BSD [66] source code, while the filesystem and device driver code is taken from the NetBSD sources. The kernel is compatible with the 32-bit subset of Solaris, which means it is able to run executables compiled for Solaris without modification. Internally, the kernel is structured similarly to BSD 4.4. It implements multiprogramming, signal handling, process synchronization and shared memory, as well as a complete filesystem with vnode layer, a fixed-size file cache and the native BSD filesystems FFS and LFS. A new HostFS filesystem gives simulated applications access to files on the simulation host.
The hybrid design of the kernel simplified the simulator design drastically. Rather than having to understand a large portion of an existing OS such as Linux or BSD in order to port it to the simulator, the kernel was built up incrementally as the simulator became more detailed. For instance, early versions did not support context switches or time-related system calls. These features were only added when the real-time clock model was implemented. In addition, since the LAMIX kernel is targeted specifically at file system and disk I/O, it is less complex and easier to understand and modify than a complete off-the-shelf OS.
3.3 Simulator Validation
When designing and using a complex simulation environment with various hardware models and software components, care must be taken that the resulting system faithfully models not only the logical behavior but also the performance of existing computer systems. Although simulators are mostly used to evaluate different options in a hypothetical architecture, the simulation environment should be able to approximate the performance of a current system as closely as possible. Such validation increases confidence in the results of research that uses the simulation environment.
One way to validate a simulator is to run a wide variety of workloads on a real system and on the simulator under similar conditions and compare the execution times. In addition, further details such as the number of cache or TLB misses can be used, although this is less important if different workloads are known to stress different parts of the system.
The validation methodology used in this study consists of three steps. In the first step, basic architectural parameters are measured and compared to the system used for validation. If necessary, simulator parameters are changed to match the validation system. In the second step, basic operating system performance is measured. Assuming that the architectural parameters of the two systems match, these results show how closely the simulated operating system models the validation system. In the third step, real applications are simulated and executed and the total runtime for the two systems is compared. These applications should be sufficiently diverse to validate all performance aspects of the system.
This study uses LMBench [67] to validate basic architectural parameters and to find appropriate settings for unknown system parameters such as the memory controller configuration and the L2 cache latency. In addition, LMBench is a convenient tool to validate operating system parameters such as system call latencies, copy bandwidth and file system performance. A variety of applications from the SPEC 2000 [45] benchmark suite and the Andrew filesystem benchmark are then used to validate the simulator with realistic workloads.
An SGI Octane workstation running at 175 MHz acts as the reference platform for the validation. This system was chosen because its machine architecture is similar to the simulator architecture, although it uses a different instruction set and operating system. On both platforms, the LMBench tools were compiled using the native compilers in 32-bit mode with -O2 optimization and no other system-specific optimizations enabled. For this reason, the results for the SGI Octane may differ from other published numbers. The SGI workstation uses a 9 Gbyte IBM Ultrastar 9ZX hard disk [103] for local storage, while the simulated disk is configured with identical seek time and platter configuration parameters.
3.3.1 Memory Hierarchy Validation
Memory latency is measured using a pointer-chasing microbenchmark. The benchmark sets up an array of pointers of a specified size and stride that is walked backwards in an unrolled loop. Depending on the array and stride size, the measured latency corresponds to the L1, L2 or main memory latency. The graph in Figure 9 shows plots of the memory load latency for 256-byte strides for both platforms, and Table 5 summarizes the latency ratios for all measured strides. Note that since the stride is larger than the L2 cache line size, memory accesses exhibit no spatial locality and the test reveals the maximum latency for the memory system. The graphs show two distinct steps when the array size crosses the L1 and L2 cache size. The plateaus in between the steps correspond to the load latency for the respective level of the memory hierarchy. Results for other strides show similarly strong correlations, but are omitted here for brevity.
Figure 9: LMBench Memory Load Latency for 256-byte Stride (latency in ns versus array size from 1 Kbyte to 16 Mbytes for L-RSIM and the SGI Octane; the plateaus lie at roughly 11-12 ns, 66-78 ns and 550-565 ns)

Table 5: LMBench Average Latency Ratios

Stride    Latency Ratio lat(L-RSIM) / lat(Octane)
16        1.0957
32        0.9939
64        0.9839
128       1.0079
256       1.0026
512       0.9989
1024      1.0063
Figure 10 shows the read bandwidth for different array sizes with unit stride, and Table 6 contains the geometric means of the bandwidth ratio bw(L-RSIM) / bw(Octane) of the two systems for various read and write benchmarks. The two curves do not match as well as in the latency experiment, most likely because of load instruction issue restrictions or cache port contention in the R10000 that are not modeled in the simulator. The fact that the simulator is generally faster than the validation system means that the overhead of data transfers will likely be underestimated by the simulator.

Note that the bcopy and bzero tests use the native C-library routines, which are coded very differently for the two platforms. The Irix/MIPS versions are unrolled several times whereas the basic Solaris library performs one 32-bit memory access per loop iteration. An optimized version that is unrolled and uses floating point registers for 64-bit memory accesses is also available, but was not used for these tests, because it is even more optimized than the SGI version and would not provide a better correlation between the two systems.
Figure 10: LMBench Memory Read Bandwidth (bandwidth in Mbyte/s versus array size from 1 Kbyte to 16 Mbytes for L-RSIM and the SGI Octane)
3.3.2 Disk Validation
LMBench measures disk seek times and read bandwidth by reading from the raw device at various offsets. The resulting graphs show a large variation for seeks of similar distance due to varying rotational delay, which depends on when the read request was issued with respect to the target block position. Figure 11 shows cleaned-up versions of the seek time and bandwidth curves. The seek time curves match very well for the two systems, reflecting the accuracy of the seek time approximation used in the simulator model, which is based on work by Ganger [41] and Lee [57].

The read bandwidth of the simulation model is constant and does not follow the downward slope of the real disk because the model does not implement zones with different sector counts per cylinder. Instead, it uses a simple constant sectors-per-cylinder geometry, which results in a constant read bandwidth for all cylinders.
Table 6: LMBench Average Bandwidth Ratios

Parameter            Bandwidth Ratio bw(L-RSIM) / bw(Octane)
libc bcopy           0.9314
unrolled copy        1.0964
read                 1.0479
partial read         1.0041
write                1.2620
partial write        1.4995
partial read/write   1.4964
libc bzero           0.8179
3.3.3 Operating System Validation
Operating system performance is an important part of overall system performance. LMBench measures the latency of a number of basic system calls. Table 7 shows the individual results as well as the latency ratios for the two platforms.
The empty system call latency correlates well, despite the significant organizational differences of the two operating systems. This indicates that the LAMIX system call entry and exit code introduces overheads similar to the IRIX code. However, the remaining system calls do not correlate to the same extent, due to the different internal organization of the two operating systems. Generally, filesystem calls are significantly slower in IRIX. Only the 'stat' call and 'select' with large numbers of file descriptors produce comparable results. Since memory latencies and bandwidth show a much better correlation between the two systems, and these system calls do not move any significant amount of data, the difference can be explained only by the different code paths. Running the same tests on different Sparc systems also shows widely varying results, confirming this hypothesis. For instance, a nominally slower Ultra-1 workstation is faster than L-RSIM for the 'stat' and 'open/close' tests, and a 450 MHz Ultra-60 has higher latencies for the 'read' and 'select' tests. These results show that the LAMIX kernel cannot realistically be compared to commercial operating systems, due to the widely varying kernel organizations. Ideally, the reference system should be running a BSD variant, but at the time of this work no BSD port for MIPS R10000-based workstations existed. However, the results indicate that the simulator and its kernel are able to at least approximate the performance of commercial UNIX variants. LAMIX is frequently faster than its multithreaded commercial counterparts, leading to pessimistic results when it is used to evaluate the impact of operating system overhead on performance.

Figure 11: LMBench Disk Seek Time and Read Bandwidth (seek time in ms and read bandwidth in Mbyte/s versus seek distance in cylinders, for L-RSIM and the SGI Octane)

Table 7: System Call Latencies in Microseconds

System Call                  L-RSIM     Octane     lat(L-RSIM) / lat(Octane)
empty system call            2.5522     2.6097     0.9780
read from /dev/zero          6.6345     10.3270    0.6424
read from local disk         7.6376     26.7427    0.2856
write to /dev/null           6.3088     12.2352    0.5156
stat                         42.9370    49.4505    0.8683
fstat                        4.5793     7.9985     0.5725
open/close                   50.6481    74.0685    0.6838
select on 10 fds             7.7895     11.0776    0.7032
select on 60 fds             31.6243    38.7986    0.8151
signal handler installation  2.857      8.152      0.3505
signal handler overhead      8.622      31.856     0.2707
LMBench measures file read bandwidth by repeatedly reading files of different sizes. For small to moderate sizes, the file is resident in the buffer cache and read bandwidth is dominated by memory copy bandwidth. The correlation between the two systems is not as good as for the simple copy tests. The geometric mean of the file read bandwidth ratio bw(L-RSIM) / bw(Octane) is 1.4726, and improves slightly to 1.3946 when the open and close overhead is included. This discrepancy is most likely due to different implementations of the kernel-to-user-space copy routine. The LAMIX routine uses double-word loads and stores for large transfers, whereas the implementation details of the copy routines in IRIX are unknown.
Filesystem performance is measured as the number of file creations and deletions per second for varying file sizes, and is shown in Table 8.

Table 8: File System Performance in Creations/Deletions per Second

File Size    L-RSIM    Octane     Ultra-1/140    Ultra-60/450
0 Kbytes     68/120    528/350    38/87          56/119
1 Kbytes     37/73     210/247    37/35          63/59
4 Kbytes     35/66     207/229    28/37          52/63
10 Kbytes    27/73     218/227    27/38          48/68
XFS is a modern high-performance journaling 64-bit filesystem developed by SGI. Not surprisingly, file creation and deletion on XFS is significantly faster than on the original FFS, since this was one of the design goals. To make a more realistic comparison with other FFS implementations, Table 8 also includes results for two different Sparc platforms running Solaris. The L-RSIM/LAMIX performance is comparable to the 450 MHz Ultra-60, even though the Sun workstation is nominally faster than the simulated 200 MHz system, indicating that the LAMIX FFS implementation is closer to Sun's implementation than to SGI's.
3.3.4 SPEC 2000 Validation
A number of SPEC 2000 programs are used to validate the simulation system using realistic workloads that exercise both the processor core and memory hierarchy. The SPEC 2000 suite was chosen because it contains a wide variety of portable applications that are well understood by researchers. These applications range from compression and integrated circuit optimization to face recognition, and are written in several different programming languages. For the validation process, the training data sets provided with the benchmark suite were used, with the exception of lucas and sixtrack, for which the test data sets were used to keep the simulation time manageable. Table 9 summarizes the total runtime of the SPEC 2000 applications in seconds on the simulator and the Octane workstation.
SPEC 2000 applications are much more dependent on compiler optimization than LMBench. It was found that when using equivalent optimization levels, the MIPS compiler produces significantly faster code than the Sparc compiler. This effect is particularly pronounced for the floating point codes of SPEC 2000. To achieve qualitatively comparable executable code, the MIPS binaries are compiled with optimization level 2 and interprocedural optimization enabled, while the Sparc binaries are compiled with optimization level four.
The application runtimes generally show good correlation between the two systems, with a few exceptions. For applications with extremely large working sets such as mcf, the different page sizes of the two systems can lead to performance discrepancies due to TLB hit rate differences. The performance of the floating point application swim is generally considered to be very compiler dependent, and the widely differing runtimes in this experiment confirm this observation.

Table 9: SPEC 2000 Runtime

Benchmark         L-RSIM    Octane    t(L-RSIM) / t(Octane)
gzip              172.0     171.0     1.0058
mcf               235.0     154.0     1.5260
parser            47.0      41.0      1.1463
eon (kajiya)      49.0      50.0      0.9800
eon (cook)        9.6       9.5       1.0105
eon (rushmeier)   13.6      14.0      0.9714
gap               39.2      44.0      0.8909
vortex            87.0      72.0      1.2083
twolf             77.0      60.0      1.2833
wupwise           198.0     198.0     1.0000
swim              147.0     96.0      1.5313
mgrid             158.9     122.0     1.2951
applu             71.4      88.0      0.8114
art               95.0      80.0      1.1875
lucas (test)      26.0      41.0      0.6341
sixtrack (test)   45.2      44.0      1.0273
apsi              81.5      85.0      0.9588
3.4 Summary
The L-RSIM simulation environment used in this study combines a detailed dynamically scheduled processor model with a sufficiently detailed I/O subsystem to execute a realistic kernel. This unique combination of features allows researchers to explore I/O related performance effects in great detail, while keeping the complexity of the simulation system more manageable compared to other full-system simulators.
Validating the accuracy of the simulation system ensures that effects measured in the simulator represent real system behavior, thus increasing confidence in results obtained using the system. Microarchitectural parameters such as cache latency and bandwidth show very close correlation to an SGI workstation, indicating that the hardware models provide enough detail to accurately capture the major effects of modern microarchitectures. Operating system performance does not correlate as well between the simulator and the SGI workstation, because the internal organization of the two operating systems is too different. However, in most cases the simulator OS performs faster than IRIX. As a result, performance improvements measured with the simulator can be assumed to be pessimistic. The sometimes widely differing application performance when running SPEC 2000 points to the difficulty of comparing systems with different instruction set architectures and compilers.
The L-RSIM simulation system is used in this study both for detailed measurements using microbenchmarks and to measure overall system performance under a variety of applications. The generally good correlation found when validating the simulator gives confidence that the demonstrated performance improvements carry over to real systems, while the validation results show that the existing modeling error leads to generally pessimistic results when measuring the operating system overhead.
4. USER-LEVEL I/O ARCHITECTURE
Most I/O requests consist of three distinct phases: request initiation, data transfer and request completion. Each of these phases incurs overhead by using processing resources that are made unavailable to the application. If the kernel mediates access to the I/O device, entering into and exiting from the kernel in a protected way consumes considerable time. This context switch also replaces cache and TLB entries that need to be reloaded by the application. Data transfers between the I/O device and memory or between kernel and user space often constitute the largest component of I/O overhead. Using the host processor to copy data uses up many processing cycles and leads to cache and TLB pollution as data is moved across the cache hierarchy twice. Request completion is typically signaled via an interrupt, which is at least as disruptive for application performance as a system call, since it saves and restores comparable amounts of state.
This dissertation introduces a user-level I/O architecture that addresses all three sources of I/O overhead by eliminating the operating system almost completely from common I/O operations. The architecture is designed for a next-generation distributed I/O system with a scalable system-area network connecting client systems to autonomous I/O devices, as shown in Figure 12. The UIO device acts as a network interface to the I/O network. Novel hardware features in the host processor and the UIO device implement low-overhead mechanisms for atomic user-level device access, user-space data transfers and user-level notifications. I/O requests are issued from the application directly to the device with the support of a hardware buffer structure that combines the request arguments and transmits them atomically to the device. Data transfers occur directly to and from user space, eliminating the need for the host processor to use its valuable compute resources for this task. Completion notifications are handled almost completely by the application, enabling it to perform fine-grain user-level thread scheduling or other synchronizations with very low overhead.
Reducing I/O overhead allows applications to more efficiently overlap long-latency I/O operations with unrelated computation, thus improving overall throughput. Lower per-request cost enables application software to take full advantage of the bandwidth available from distributed I/O systems such as network-attached disks and to efficiently employ latency hiding techniques to maximize throughput. Bypassing the operating system for common I/O requests means that application performance is less sensitive to the growing gap between application and operating system performance.

Figure 12: User-level I/O Architecture (client systems, each with CPUs, memory, an I/O bridge on the system bus and a UIO device on the I/O bus alongside local I/O devices, connected through the I/O network to autonomous I/O devices and an I/O manager)
At the same time, the user-level I/O architecture maintains the level of protection found in most modern operating systems. The system provides separate address spaces to protect processes and the operating system from inadvertent or malicious accesses to invalid memory regions. Furthermore, the I/O architecture provides a level of programming flexibility usually not found in communication or I/O architectures that bypass the operating system, by allowing programmers to perform I/O operations to arbitrary memory locations without the need to pin or preallocate buffers. As a result, many applications can realize performance improvements without extensive software modifications, as user-level libraries can provide conventional programming interfaces with small additional overhead.
4.1 UIO Architecture Overview
The distributed I/O system is composed of client systems that connect to the network via a UIO device, a number of network-attached autonomous I/O devices, and an I/O manager overseeing access rights to the devices. I/O devices perform necessary low-level management tasks locally on the device controller. For instance, network-attached secure disks map byte streams to disk blocks and export an object interface to the clients. In addition, network-attached I/O devices perform protection checks for every I/O request, in coordination with a global I/O manager. To access a feature on a remote device, a client requests an encrypted capability from the I/O manager. Using the capability to identify itself and its permissions, the client issues requests directly to the remote device. Since this organization assumes that devices do not trust remote clients, removing the operating system from the client I/O path does not weaken the security model.
The network is a scalable system area network, such as Myrinet [15] or ServerNet [49]. The only assumption that the architecture makes is that the network error rate is sufficiently low that end-to-end error detection and correction is not necessary. Many system area networks operate successfully with this assumption, since the electrical characteristics of cables and connectors can be tightly controlled and wire lengths are limited. In addition, per-hop link-level error detection and correction can be implemented in hardware in the network interfaces and routers to further reduce the error probability [54].
The UIO device is at the core of the user-level I/O architecture. This device acts as a low-overhead user-level network interface to the I/O subsystem. Its design, in combination with certain processor features, provides protected and efficient user-level access to the remote I/O devices. The key components of the UIO architecture are shown in Figure 13. The conditional store buffer in the processor bus interface implements nonblocking user-level synchronization and flow control; it is used to atomically transfer the request arguments from applications to the UIO device. The design and evaluation of the conditional store buffer is described in more detail in Chapter 5. User-space data transfers are facilitated by the device TLB, which performs virtual-to-physical address translation and protection checks. The device TLB characteristics, its integration with the host operating system and different TLB miss handling mechanisms are discussed in Chapter 6. User-level notifications are delivered directly to the application by the host operating system based on information provided by the UIO device at the time the notification interrupt is triggered. Details of the notification mechanism are described in Chapter 7. The remainder of this chapter introduces the software interface of the UIO device, describes the basic operation of the architecture and gives an overview of a possible UIO device implementation.
The local UIO device determines how application programs communicate with the remote device at the lowest level. However, inherent differences in the object interfaces exported by different device classes are visible to application software and should be dealt with in system libraries. In addition, only the performance-critical subset of the I/O operations needs to be issued directly by user-level software. This includes any I/O operation that transfers a significant amount of data, such as storage read and write operations. Infrequently performed I/O operations such as network connection establishments or storage object metaoperations can involve the client operating system to hide some of the low-level details from applications. However, involving the client operating system does not change the basic security model, as remote devices generally do not trust client systems, regardless of the protection level of the initiating software entity.

Figure 13: User-level I/O Architecture Components (the processor's conditional store buffer (CSB) issues uncached stores carrying requests to the UIO device; the device contains send and receive engines connected to the I/O network, a TLB for DMA data transfers and a notification queue, and raises an interrupt to deliver notifications)
4.2 Application Interface
The basic structure used to communicate between application software, the UIO device and the remote devices is a request packet, or request structure. This structure contains all information needed to handle a request, both locally at the client and at the remote device. The request structure is passed from the application to the UIO device and then forwarded to the remote I/O device. When the I/O request completes, the structure is returned to the UIO device and eventually reaches the application in the form of a notification. Since the remote device returns the same structure when the request completes, the UIO device does not need to store any state while a request is pending. Having software provide all relevant information with every request slightly increases the amount of data transferred with every request, but it facilitates a simple and scalable UIO device design since no state information needs to be maintained by the hardware. In addition, the stateless device design simplifies failure recovery as no state needs to be reconstructed or synchronized in the event of a transient failure. The slight increase in per-request data transferred between software and the UIO device does not affect performance as long as the entire request structure does not exceed the size of a cache block, since that is the natural transfer size on the system bus. Table 10 lists the components of the UIO request structure, separated into remote and local information.
Remote information such as the capability, command and arguments is used only by the remote device; the UIO device simply forwards it. The capability is an encrypted value that identifies both the object to operate on as well as the access rights of the client for that object. It enables the remote device to check if the client is allowed to perform the requested operation. For instance, in the case of network-attached disks, a capability specifies which disk object to access and whether the client is allowed to read, write or delete it. Since the capability is encrypted when it is granted by the I/O manager, clients are not able to bypass the security checks of the remote devices by modifying the capability. The encryption mechanism is established dynamically between the I/O manager and the remote devices. Other remotely used information includes the desired operation and any other arguments and flags. For instance, when writing to a remote storage object that represents a file, additional arguments may include the amount of data and possibly a flag requesting synchronous operation. The return status field is used by the remote device to indicate if the operation was successful, similar to the return value of a system call. The format and coding of all the remotely used information must be standardized, so that different client systems can communicate with remote devices without regard to which particular device type they access. Such standard command sets are similar to existing SCSI or ATA commands used on local storage buses, but generally operate at a semantically higher level.

Table 10: UIO Request Structure

Group                Name                   Description
remote information   capability             identifies remote object and client's access rights
                     command                desired operation
                     arguments and flags    additional operation-specific arguments
                     return status          completion status of operation, returned to client
local information    context ID             identifies process to device
                     buffer & length        buffer address and length for data transfers
                     notification buffer    application-space buffer for notification
                     notification handler   user-level routine called during notification
                     request argument       application-defined value to identify request
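The logical contents of Table 10 could be expressed as a C structure along the following lines. All field names and widths here are assumptions for illustration; the text specifies only the fields' roles and the constraint that the whole structure fit in one cache block:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative layout of the UIO request structure (field widths assumed). */
typedef struct {
    /* remote information: forwarded to the remote device unmodified */
    uint64_t capability;       /* encrypted object id + client access rights */
    uint32_t command;          /* desired operation                          */
    uint32_t flags;            /* operation-specific flags                   */
    uint64_t argument;         /* operation-specific argument (e.g. length)  */
    int32_t  status;           /* completion status, filled in by the device */
    /* local information: consumed by the UIO device and the notification */
    uint32_t context_id;       /* identifies the issuing process             */
    void    *buffer;           /* user-space data buffer                     */
    uint32_t length;           /* data buffer length in bytes                */
    void    *notify_buffer;    /* where the returned structure is deposited  */
    void   (*notify_handler)(uint64_t);  /* user-level completion routine    */
    uint64_t request_arg;      /* application-defined request identifier     */
} uio_request;
```

On a 64-bit host this layout occupies well under 128 bytes, consistent with the requirement that the structure not exceed a cache block so it remains the natural transfer size on the system bus.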
Arguments in the local group in Table 10 are used only by the UIO device. However, since some of them are used when the request returns to the application, they are also forwarded to the remote device and returned upon completion. These arguments include a unique process identifier that enables the UIO device to perform local protection checks on the buffer addresses and to deliver completion notifications to the correct process. A buffer address and length is used by requests that transfer data to or from the remote device. This includes read and write requests as well as status inquiries that return data structures. The notification buffer address is a location in application address space where the request structure is deposited when it is returned from the remote device, so that the application can inspect the return value. The notification handler is a user-level routine that is executed when the request completes; it can perform almost arbitrary synchronization operations within the application. The request argument is an application-defined value that can be used to uniquely identify the request to the notification handler. For instance, in a user-level thread library this value may be the ID of the thread blocking on the I/O request, or it may be a pointer to the thread structure. The value is used neither by the UIO device nor by the remote I/O device; it is completely under application control. Theoretically, this value is redundant, since the notification buffer address should be unique and can be used to identify the request. However, allowing software to provide a more convenient identifier can simplify the notification handler.
4.3 Basic UIO Operation
To initiate a request, software assembles all required arguments into a request structure and sends it to the local UIO device. This process occurs in user space without invoking the operating system through a system call. Low-level details such as the address of the UIO device and the layout of the UIO request structure can be hidden by a library. After initiating the I/O request, the application continues executing since no kernel scheduling operation was involved. These nonblocking I/O requests enable the application to optimally overlap long-latency I/O operations with independent computation. A user-level thread library, for instance, can initiate a low-overhead thread switch and thus preserve the traditional blocking I/O programming model for individual threads, while avoiding the overhead of kernel support for nonblocking I/O.
After receiving the request structure, the UIO device performs any necessary DMA read operation (e.g., for a write request) and forwards the request and associated data to the remote I/O device. That device performs the applicable protection check, executes the request and returns the request structure with the return status, as well as any requested data, to the original UIO device. After writing any returned data into the application buffer, the UIO device deposits the request structure in the prearranged notification buffer where the return status can be inspected by the application. The same request structure is written to an externally visible host processor register to deliver the notification by executing the user-level notification handler specified with the request.
The basic programming model of the user-level I/O architecture combines nonblocking requests with a symmetric communication model. Each request receives a response, indicating that the request completed. This model applies well to passive I/O devices such as storage and to many output devices like frame buffers. These devices only operate when triggered by the host processor, making them a good match for the symmetric communication model. Other devices such as network adapters and many input devices operate when receiving external triggers and invoke software assistance through interrupts. This work focuses on storage device I/O, since the interface and functionality of network-attached disks are more clearly developed than for other high-performance I/O device classes.
4.4 UIO Device Architecture
The UIO device serves as a user-level I/O network interface. Figure 14 shows a possible implementation of a UIO device. The device acts as both a bus slave and bus master, as it accepts and responds to host processor read and write requests and performs DMA data transfers. Application software writes request structures to the device request queue via the conditional store buffer in the host processor, where they are buffered until processed by a transmit DMA engine. This simple interface reduces the complexity of the transmit state machines significantly. Requests are handled by one or more transmit DMA engines, which perform any necessary DMA read operation (e.g., for a write request) and forward the request structure and any data to the transmit buffer. DMA operations are translated and verified by the device TLB. Equivalent to a processor TLB, the device TLB translates virtual buffer addresses provided by applications into physical addresses, and performs access protection checks for every bus transaction. Replies from remote devices are handled by a set of receive DMA engines. These DMA engines write data into application buffers if applicable and deposit the returned request structure in a notification queue from where it is sent to the host CPU for processing.
The design decision to combine all request arguments in a common structure transferred between application software, the local UIO device and the remote device facilitates a simple hardware organization of the UIO device, compared to many other user-level network interfaces. The device stores the request state only for the duration of the local DMA operation, thus eliminating the need for large storage areas that have to be managed by the hardware or the operating system. Also, the device hardware does not impose any limits on the number of outstanding requests, the number of active processes or the amount of data involved in requests.

Figure 14: UIO Device Structure (request and notification queues on the I/O bus side, transmit and receive DMA engines with their transmit and receive buffers, the device TLB, and bus master/slave interfaces connecting to the network Tx/Rx logic on the I/O network side)
The depth of the request queue determines the number of requests that software can issue before new requests are rejected. Read requests do not require DMA transactions during transmission and hence occupy a DMA engine for only a short amount of time. Since in many environments I/O read requests dominate writes, even a shallow request queue should be able to provide sufficient buffering.
The optimal number of DMA engines and details of the buffer management scheme depend on the particular network architecture and flow control mechanism. If the network provides virtual channels, multiple DMA engines can take advantage of the available parallelism and process requests concurrently. Similarly, with networks that transmit data streams as a sequence of independently routed packets, multiple DMA engines can interleave requests on the network and should be able to hide main memory access latencies. The transmit buffer should provide the abstraction of one buffer per DMA engine, either statically or dynamically partitioned. Static partitioning implies a very simple management scheme where essentially the buffer is managed by the DMA engine. Dynamic partitioning eliminates inefficiencies due to fragmentation but requires higher management complexity. However, since only one DMA operation can return data from the I/O bus at a time, only one buffer port is needed, regardless of the partitioning scheme. The amount of buffering required depends on the flow control granularity. Each buffer must be able to hold enough data to continue transmitting data in the presence of a DMA stall until flow control can take effect.
The receive path of the UIO device is almost symmetric to the transmit path, with a unified or distributed buffer feeding data into a set of DMA engines. However, due to the notification scheme, the DMA engine control is slightly more complex. The notification queue holds the request structure that was returned from the remote device. It is used to decouple the DMA engines from the notification mechanism, which is flow controlled by the host processor. Considering that the amount of data returned per notification is often significant, the queue does not need to be very deep to provide the desired decoupling effect.
4.5 Summary
The user-level I/O architecture reduces I/O overhead in the context of a distributed I/O architecture by almost completely eliminating the operating system from common I/O operations. By providing user-level access to I/O devices and transferring data directly to and from application buffers, host processor occupancy for I/O requests is minimized. The overhead reductions allow applications to efficiently overlap long-latency I/O operations with other work and result in higher system throughput and I/O bandwidth. At the same time, the architecture maintains a level of process protection found in operating systems such as UNIX and provides flexibility by not restricting programmers to pinned or otherwise preallocated buffer spaces.
The local UIO device acts as a network interface to a distributed I/O system. Together with the host processor, it implements a simple low-overhead interface to initiate I/O requests and deliver notifications. User-level I/O requests are described by a request structure that contains all of the information needed to process the request both locally and at the remote I/O device. The structure passes through the local and remote I/O devices as the request is processed. This design significantly reduces the hardware complexity of the UIO device, as request state needs to be stored only for the duration of a DMA operation. The following chapters discuss the individual UIO mechanisms in greater detail and provide comparative overhead measurements for a variety of alternative mechanisms.
5. ATOMIC DEVICE ACCESS
The goal of the user-level I/O architecture to minimize overhead by bypassing the operating system for performance-critical I/O requests requires changes to all phases of an I/O transaction. Although data transfers are frequently the dominant source of I/O overhead, the control overhead involved in initiating I/O requests can often be considerable as well, especially for small transfers or low-latency requests.
When initiating an I/O request, software must ensure that the individual arguments and parameters are communicated to the device atomically with respect to other entities accessing the device. Many I/O devices, such as SCSI host adapters, export a set of control registers into which software writes the request arguments via a sequence of uncached stores. Writing to a dedicated register triggers the device to start processing the request. If multiple entities write to these registers simultaneously, arguments of different requests are interleaved, leading to invalid requests. In addition, software must ensure that the device is able to accept the request. Many devices provide status registers for this purpose that are checked by software before a request is initiated to detect flow control situations.
If I/O requests are initiated only by entities within the operating system kernel, both atomicity and flow control are achieved through software synchronization mechanisms. Disabling interrupts and acquiring a lock prevents other entities within the kernel from accessing the device. The resulting critical region allows the device driver to atomically check the device status and write the required arguments to the device control registers.
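As a concrete illustration of this conventional path, the following C sketch models the kernel-mediated sequence: check a status register, write the arguments via stores, and trigger the device with a doorbell write. The register names, the DEV_READY flag, and the plain struct standing in for memory-mapped registers are all assumptions for illustration, not an actual device interface.

```c
#include <stdint.h>

/* Plain struct standing in for memory-mapped device registers;
   register names and the DEV_READY flag are illustrative assumptions. */
struct dev_regs {
    volatile uint32_t status;   /* device flow-control status        */
    volatile uint32_t arg0;     /* request argument registers        */
    volatile uint32_t arg1;
    volatile uint32_t doorbell; /* writing here starts the request   */
};

#define DEV_READY 0x1

/* Must run inside a critical region (interrupts disabled, lock held)
   so the store sequence is atomic with respect to other kernel entities. */
int issue_request(struct dev_regs *dev, uint32_t a0, uint32_t a1)
{
    if (!(dev->status & DEV_READY))  /* flow control: device not ready */
        return -1;
    dev->arg0 = a0;                  /* sequence of uncached stores    */
    dev->arg1 = a1;
    dev->doorbell = 1;               /* dedicated register triggers processing */
    return 0;
}
```

The interleaving hazard described above is exactly what happens if two callers reach the argument stores without the surrounding lock.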
Restricting access to I/O devices to the device driver inside the kernel provides both atomicity and flow control, but such a design incurs significant overhead for every I/O transaction. In multiuser operating systems with protected address spaces, the system call entry code saves a large amount of process state on the kernel stack before executing the actual system call routine. When returning to user mode, the system call checks if a signal is pending for the current process, and if the scheduler needs to be invoked, before restoring the process state. These activities not only consume many CPU cycles but also have significant impact on caches and TLBs, which leads to degraded application performance after the system call is complete.
Eliminating system calls from the I/O request initiation helps to reduce I/O overhead. If applications are able to directly transfer the request control information to the device hardware without involving the operating system, multiple entities are competing for device access without global synchronization. Since user-level processes can be preempted at any time and cannot be trusted to perform synchronization that implements globally atomic sections, a low-overhead hardware mechanism is needed to provide atomic user-level device access. This chapter introduces the conditional store buffer, a software-controlled combining buffer that implements nonblocking synchronization and flow control.
5.1 The Conditional Store Buffer
The conditional store buffer (CSB) is a software-controlled, uncached, combining hardware buffer that allows I/O stores to be combined into a single bus transaction up to the maximum size of one cache line. It reduces I/O overhead with a minor increase in hardware complexity by implementing nonblocking atomicity and flow control at the user level. The CSB permits user-level code to explicitly control which stores will be combined and when the combined set of stores will be issued on the system bus. This guarantees that the sequence of combined store instructions is atomic and provides the necessary exactly-once semantics for the resulting bus transaction. In addition, the conditional store buffer provides a nonblocking flow control mechanism to reliably inform applications if requests are issued at a faster rate than the I/O device can process them.
Figure 15 illustrates the resulting system model. It shows a block diagram of a conventional system with a two-level cache hierarchy and a typical uncached load and store capability that has been enhanced with the addition of a CSB. Side-effect-free uncached loads and stores can be handled in the normal manner, while dedicated combining store instructions are handled by the CSB.
Figure 15: Architectural Model with Conditional Store Buffer (the processor with its L1 and L2 caches is augmented with a conditional store buffer alongside the conventional uncached buffer; the bus interface connects over the system bus to main memory and I/O devices)
5.1.1 Conditional Store Buffer Design
Figure 16 shows the structure of the conditional store buffer. The data buffer provides space for one cache line worth of data and the cache-line-aligned physical address of the most recent combining store. The hit counter implements the nonblocking conditional flush operation. It counts the number of consecutive stores that have been issued by a process without conflict.
Conflicts are detected in two different ways. When the buffer receives a combining store, it compares the destination address with the value that has been saved from the previous store instruction. On a match, it stores the data in the appropriate slot and increments the hit counter. If the comparison fails, the buffer is cleared, the hit counter is set to one, and the new data are stored.
Figure 16: Conditional Store Buffer (a data buffer with an address register and a hit counter; store and conditional flush instructions arrive from the CPU core, and the combined transaction is sent to the system interface)
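The conflict check just described can be expressed as a small behavioral model in C. This is a software approximation of the hardware, with invented names and an assumed 64-byte line size, not the actual implementation.

```c
#include <stdint.h>
#include <string.h>

#define CSB_LINE 64  /* assumed cache line size */

/* Behavioral model of the conditional store buffer state. */
struct csb {
    uint64_t addr;            /* line-aligned address of the last store */
    uint32_t hits;            /* consecutive conflict-free stores       */
    uint8_t  data[CSB_LINE];
};

/* A combining store: on an address match, buffer the data and bump the
   hit counter; on a mismatch, clear the buffer and restart at one. */
void csb_store(struct csb *b, uint64_t addr, uint8_t val)
{
    uint64_t line = addr & ~(uint64_t)(CSB_LINE - 1);
    if (b->hits == 0 || line != b->addr) {
        memset(b->data, 0, CSB_LINE);  /* conflict: clear and restart */
        b->addr = line;
        b->hits = 1;
    } else {
        b->hits++;
    }
    b->data[addr & (CSB_LINE - 1)] = val;
}

/* An interrupt or context switch clears the buffer and hit counter. */
void csb_clear(struct csb *b)
{
    memset(b->data, 0, CSB_LINE);
    b->hits = 0;
}
```

A conditional flush then succeeds only if the expected count supplied by software equals `hits` and the flushed address equals `addr`, mirroring the checks described in the text.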
To achieve interprocess atomicity, the conditional store buffer is cleared under any condition that may lead to a process context switch. This includes a number of traps and all external interrupts. Error traps or system calls may suspend or terminate the current process and switch to a different process, or perform a context switch if the time slice of the current process is exhausted. Similarly, external interrupts such as clock ticks or SCSI interrupts may invoke the scheduler and lead to context switches. When such an event occurs, the CSB hit counter is reset to zero and the data buffer is cleared.
At the end of the uncached store sequence, when all request arguments have been written, the application issues a conditional flush instruction to indicate that the sequence is complete and to check atomicity and flow control status. The conditional flush instruction communicates the expected value of the hit counter to the CSB, and returns both the atomicity and flow control status to the application. If the counter value is equal to the value provided by the instruction, and the destination address matches the value present in the CSB, then the data is sent to the system interface as a single burst transaction, and the buffer and hit counter are cleared. When receiving the burst transaction, the I/O device returns the flow control status to the processor, which becomes the return value of the conditional flush instruction. If either the addresses or counter values do not match, the data register and counter are cleared, nothing is issued to the system interface, and the conditional flush instruction returns a negative result to the application.
Software is responsible for checking the return value of the conditional flush instruction. The application may recover from a failed flush, for instance, by branching back to the beginning of the sequence of combining stores. The following pseudocode segment shows an example of how software might access the CSB.
.RETRY:
stc %r4, [%r1] ! sequence of combining stores
stc %r10, [%r1+40]
! ... 5 additional combining stores
stc %r12, [%r1+8]
flushcond [%r1], %r4 ! conditional flush
cmp %r4, SUCC ! check return value
bneq .RETRY ! retry on failure
Suppose, for example, that this process is interrupted before it executes the conditional flush instruction. The interrupt will clear the hit counter as well as all data stored so far by the current process. If the new process issues a sequence of combining stores, the hit counter is incremented for every store and the conditional flush instruction succeeds. When the original process attempts to flush the buffer, the expected counter value will not match the value stored in the buffer and the conditional flush instruction will return a 0 to signal the conflict. If, on the other hand, no interrupt occurred while the sequence of stores was executed, the conditional flush issues the buffer contents and the device returns its flow control status.
The nonblocking conflict detection policy removes the need to lock the CSB prior to access, and competing processes do not block on a conflict [27]. The policy is optimistic in its assumption that conflicts are rare and that it is more cost-effective to replace heavyweight synchronization on every sequence with a software recovery mechanism on a failed attempt. Since lock-free synchronization schemes do not prevent competing processes from accessing a resource, they do not lead to problems like priority inversion or the difficulty of deadlock avoidance.
Theoretically, it is possible for two processes to be scheduled such that each continuously conflicts with the other. There are numerous simple solutions for this livelock scenario. One can limit the number of failed conditional flushes, or use an exponential backoff algorithm to reduce the likelihood of a conflict.
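A bounded retry with exponential backoff might look like the following sketch. Here try_flush() stands in for the combining-store sequence ending in a conditional flush; the function name, the demo stand-in, and the retry bound are assumptions for illustration.

```c
/* Retry a conditional-flush sequence at most max_tries times, backing
   off exponentially between attempts; try_flush() stands in for the
   combining stores plus conditional flush and returns nonzero on success. */
int issue_with_backoff(int (*try_flush)(void), int max_tries)
{
    for (int attempt = 0; attempt < max_tries; attempt++) {
        if (try_flush())
            return 0;                        /* flush succeeded */
        for (volatile int spin = 0; spin < (1 << attempt); spin++)
            ;                                /* exponential backoff delay */
    }
    return -1;                               /* give up: report persistent conflict */
}

/* demo stand-in for testing: fails twice, then succeeds */
static int demo_calls;
static int demo_flush(void) { return ++demo_calls >= 3; }
```

Bounding the attempts and growing the delay keeps two conflicting processes from retrying in lockstep indefinitely.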
Similarly, the flow control scheme is optimistic in its assumption that in most cases the device is able to accept the request. Rather than preventing the processor from issuing requests, it notifies software in the event that the request queue of the device was full. This flow control scheme requires that both the system bus and I/O bus support a swap bus transaction, which writes data to a target device and returns a status value. Conceptually, it is a write transaction immediately followed by a read transaction. In fact, if the arbitration protocol can guarantee such back-to-back transactions, it may be implemented in the system interface as a write followed by a read without any other transactions in between.
The size of the data register is equal to the size of a cache line, since the system interface and system bus are already optimized to handle cache-line-sized burst transfers. Since most system buses do not allow arbitrary-length bursts, the CSB model in this study always issues a full cache line, regardless of the number of combining store instructions. This restriction could be relaxed in a CSB design for a bus that permits multiple burst sizes. Note that the performance improvement that is made possible by the CSB depends on the ability of the target I/O device to accept burst writes. In general, this increases the complexity of an I/O device. On the other hand, the potential performance gain more than justifies the increased cost. Note also that many modern I/O adapters already provide this capability.
It should also be noted that it is not strictly necessary to include the destination address in the conflict check. However, doing so allows detection of conflicts between competing user-level threads. It also serves as an aid to detect programming errors in which individual request arguments may be written to different target addresses.
5.1.2 Instruction Set Architecture Modifications
The CSB design requires two architectural modifications. First, software must be able to specify which store instructions should be combined. The obvious choice is to introduce a new instruction, store-combine. However, adding new instructions to an existing architecture should not be taken lightly. The CSB design in the prototype implementation therefore uses existing memory mapping hardware to indicate which addresses should be combined. Several architectures already encode cache policies and other memory attributes in page table entries. The PowerPC allows the specification of write-through or write-back caching, along with other attributes, on a per-page basis [83]. In the R10000, the accelerated uncached buffer is enabled by a bit in the page table entry [68]. Hence, the encoding of one additional attribute is a minor extension to existing TLB designs.
The second required modification involves the addition of a conditional-flush capability. The purpose of this instruction is twofold. It must trigger the flushing of the CSB (if no conflict was detected), and signal the success of the flush to the program. Rather than introducing a new instruction, the prototype implementation uses the Sparc swap instruction for the conditional flush. If the destination address is in uncached combining address space, the instruction is sent to the CSB. The conditional flush instruction returns 0 if the flush failed due to an atomicity violation; otherwise the flow control status returned from the I/O device becomes the instruction's result.
The semantics of the conditional flush instruction are very similar to the store-conditional instruction used in many architectures for interprocess synchronization. A store-conditional writes a new value to a memory location if no other CPU accessed that location since a load-linked was executed, and returns the success or failure status of the operation in a register or condition code bit. Architectures that provide such a store-conditional instruction but no swap may use this instruction as the conditional flush.
5.1.3 Process Context Identifier
When setting up a transfer at the user-level I/O device, the application process not only specifies the parameters and buffer addresses necessary for the particular request, but also needs to identify itself to the device. The context identification is needed when the device translates and verifies buffer addresses provided with the request, and when a completion notification is sent to the application.
This context identifier can take many different forms such as the process structure pointer, a page table pointer or the process ID. The best choice depends on the particular kernel organization and device structure. A common requirement is that the identifier is unique among all currently running processes. If the I/O device performs address translations without kernel assistance, the context identifier must provide enough information to find the process page table. In many cases the physical address of the page table, or a physical pointer to the process structure which in turn contains pointers to the page or segment table, can be used. If address translation is done with kernel assistance, the context identifier should enable the kernel to quickly find the associated process. This can be accomplished by providing the virtual process structure pointer with every request. Finally, low-overhead completion notifications require that the kernel be able to quickly confirm whether the notification targets the currently running process, and to find the correct process structure if this is not the case. Again, the virtual pointer to the process structure is a good choice.
However, no matter what form of context identification is used, allowing applications to identify themselves makes the entire system vulnerable to faulty or malicious applications, unless a secure and trusted way is devised in which the process identification can be communicated to the device. The most efficient way to communicate the context identifier to the I/O device is to have the CSB insert the needed information automatically on a flush. This scheme requires that the identifier is available in a privileged processor register. Several architectures store some form of process ID in a supervisor-mode register to detect aliasing of cache or TLB entries. For instance, the MIPS architecture defines an 8-bit address space identifier that helps to avoid flushing the TLB on every context switch; PA-RISC uses an 18-bit identifier to de-alias references to virtually addressed caches [55]; and the Alpha 21164 stores a 7-bit process ID in a privileged register [2]. Although these identifiers are unique among all processes, they may not be ideal for device-based virtual-to-physical address translation or low-overhead process notification. In addition, hardwiring a CSB entry to a privileged register limits flexibility.
A more flexible alternative, although associated with higher overhead, is to trap to the kernel when the CSB flush instruction is executed. The kernel can then insert the required context identifier and emulate the flush operation on behalf of the application. The overhead of this scheme depends largely on implementation details. If the flush instruction triggers a general privileged-instruction trap, the code sequence is very similar to a system call, and consequently introduces similar overheads. The only advantage of the CSB over a system call in this case is the improved bandwidth compared with a sequence of single-word stores, and the intra-CPU atomicity that eliminates the need for costly global synchronization.
A more optimized implementation is possible if a dedicated trap vector is used for the CSB flush instruction. The trap handler needs to save only minimal state, insert the needed information and execute the flush instruction. Unlike a general-purpose system call, it does not need to check if a signal needs to be delivered to the process, or if a context switch is scheduled. Both software-based solutions offer greater flexibility. The trap handler can insert any form of context identifier in any position in the CSB. The disadvantage is the increased CPU overhead, which can approach that of a system call.
A third option for communicating the context identifier to the I/O device is to decouple the context from the request. Rather than specifying a context with every request, the operating system informs the device of the new identifier during a context switch. When the device receives a request, it inserts the current context identifier in the request structure. Updating the current process context at the I/O device reduces the transfer setup overhead, but it requires modification of the context switch handler, which inevitably leads to increased context switch costs. To avoid making the context switch handler code dependent on the hardware configuration, device drivers would register context switch callback routines with the kernel that are called by the context switch routine. In a system with a small number of processes and few context switches but frequent I/O operations, the reduced request overhead may offset the increased context switch overhead. Decoupling the context from the request is particularly attractive if the CSB is located in the device and not in the processor, as discussed in a later section. In this case, updating the current process context implicitly notifies the device of a context switch and clears the CSB hit counter.
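The callback registration scheme described above could be sketched as follows. The interface names, the fixed-size table, and the demo callback are assumptions for illustration, not an actual kernel interface.

```c
#include <stdint.h>

/* Device drivers register context-switch callbacks with the kernel so the
   switch handler stays independent of the hardware configuration.
   Interface names and the fixed-size table are illustrative assumptions. */
typedef void (*ctxsw_cb)(void *dev, uint64_t new_context);

#define MAX_CTXSW_CB 8
static struct { ctxsw_cb fn; void *dev; } cb_table[MAX_CTXSW_CB];
static int n_cb;

int register_ctxsw_callback(ctxsw_cb fn, void *dev)
{
    if (n_cb >= MAX_CTXSW_CB)
        return -1;                 /* table full */
    cb_table[n_cb].fn  = fn;
    cb_table[n_cb].dev = dev;
    n_cb++;
    return 0;
}

/* Called by the context switch routine: inform every registered device
   of the new context identifier. */
void on_context_switch(uint64_t new_context)
{
    for (int i = 0; i < n_cb; i++)
        cb_table[i].fn(cb_table[i].dev, new_context);
}

/* demo callback for testing: records the identifier it was given */
static uint64_t last_seen;
static void demo_cb(void *dev, uint64_t ctx) { (void)dev; last_seen = ctx; }
```

Each callback invocation is extra work on every switch, which is the cost-benefit trade-off discussed in the text.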
5.1.4 CSB Hardware Implementation
A sample hardware implementation serves to assess the hardware complexity of the conditional store buffer when it is added to a processor system interface. Figure 17 shows a block diagram of the conditional store buffer with generic interfaces to the processor core and bus interface. Central to the design are the 64-byte data buffer, the address register and the access counter. A state machine increments the counter for every combining store that matches the current CSB address. If the data value provided by the CSB flush instruction matches the current counter value, the state machine requests a transaction from the bus interface. When acknowledged, the state machine sends the address and the data buffer contents in nine consecutive cycles to the bus interface, assuming a multiplexed 64-bit wide system bus. A data return from the bus interface is forwarded to the processor core. In case of a failed CSB flush, the value zero is immediately returned to the processor. To minimize the number of bits needed to count consecutive CSB accesses, the counter implements a sticky overflow bit. The counter provides only the number of bits needed for a sequence of unique byte stores. The sticky overflow bit indicates a counter overflow; it is reset during an interrupt or when a store with a conflicting address or a conditional flush is received. The conditional flush instruction fails if the bit is set. All data paths from and to the processor core are 32 bits wide.

Figure 17: Conditional Store Buffer Implementation (the data buffer, address register, process context register and access counter sit between registered processor-interface inputs and the system bus interface; address and counter comparators feed the control logic that issues the bus address and data)
In high-performance designs, it is critical that all inputs and outputs of a block are directly connected to registers. This design guideline simplifies global timing significantly, leaving almost the entire clock cycle for signal propagation between blocks that may be located relatively far from each other on the die. The CSB sample design follows this guideline and includes input and output registers for all data and control signals.
The design is implemented in Verilog and synthesized for a 0.25 µm technology using a commercial standard cell library with a target frequency of 400 MHz. Although this methodology is rarely used for high-performance microprocessor designs, it gives a good indication of the worst-case cycle time and area requirements. The Synopsys synthesis tool [25] reports an area of approximately 0.22 mm², including estimated routing area. The critical path of the design exceeds the target clock cycle by 0.6 ns, resulting in an operating frequency of 350 MHz. The address comparison logic leading to the main control logic comprises the critical path. An optimized custom-designed 32-bit comparator would eliminate this problem, resulting in a design that meets the timing requirements for microprocessors using this technology. The D flip-flops used for the data buffer and pipeline registers have relatively high setup time requirements. Custom designs of the data buffer and pipeline registers would further improve not only the area required for the implementation but also the cycle time.
5.2 Conditional Store Buffer at the I/O Device
The basic CSB principle of nonblocking synchronization can be applied at the I/O device level as well. In this case, the I/O device provides a data buffer and hit counter for every CPU in the system. For every store to the store-combining address space, the device stores the data in the corresponding data buffer entry and increments the hit counter. The hit counter and data buffer are cleared on every context switch, so that the device is able to detect if the sequence of stores was atomic. Similar to the CSB, a swap bus transaction communicates the expected hit counter value to the device and returns the success or failure status to the application. Alternatively, if the number of stores in a sequence is fixed, a read bus transaction is sufficient to start the I/O transaction and return the flow control status.
In a multiprocessor system, the device must provide a CSB for each processor. This requires that the device can determine which CPU issued a bus transaction, so that the transaction is associated with the correct CSB. Most modern system buses carry some form of processor identification in each transaction, either in some unused address bits or as part of the transaction ID used by split-transaction bus protocols.
The main advantage of moving the CSB to the I/O device is that it does not require any modifications of the processor system interface or memory management logic, beyond support for a swap bus transaction. However, the context switch handler of the operating system needs to be modified to clear the device CSB on every context switch. Such modification is usually not desirable because the context switch code is highly optimized for minimum code length and latency. Increasing the context switch latency penalizes all processes, not just applications that utilize the user-level I/O features.
5.3 Performance Evaluation
This section presents overhead results for the different transfer setup schemes on a set of current and near-future systems. All results are obtained from a prototype user-level I/O implementation in the L-RSIM architectural simulator. The overhead results are the average of approximately 60 requests of different types. In all experiments, the request structure is copied from an application buffer into the CSB before the CSB is flushed. To evaluate the impact of architectural advances on the various transfer setup schemes, the experiments vary the processor and cache subsystem independently from the system bus and main memory subsystem. For each subsystem, two configurations are tested, representing current and near-future architectures. Each configuration is characterized by a combination of CPU core and memory clock frequency. Table 11 summarizes the most relevant configuration parameters of the resulting four systems. Taken together, these combinations are designed to cover a wide range of processors and main memory subsystems.
Figure 18 shows both the total overhead of the different transfer setup schemes discussed before, as well as the overhead reductions relative to the baseline system call implementation. Each group of bar graphs corresponds to a set of measurements on the same system, starting from the baseline system call scheme to the CSB with hardwired process context. The bar graphs showing the absolute overheads are split into minimum values (shaded bottom portion) and average values (white box stacked on top). The maximum overhead incurred by each scheme is essentially unbounded due to cache and TLB misses and interrupts, and hence is not shown here. The second set of graphs shows minimum overheads normalized to the system call scheme.
The system call scheme represents a baseline that uses an ioctl() call handled by the UIO device driver. The system call code necessary to switch from user to kernel space and back usually includes several tens to hundreds of instructions. The code sequence saves most of the current process state, switches to a kernel stack and checks for error conditions. When returning to user mode, the kernel first checks if a signal needs to be delivered to the user process, or if a context switch is scheduled. If neither is the case, it restores the user state, switches back to the original stack and resumes executing in user mode. After saving the current process state, the UIO system call copies the request structure from user into kernel space, transfers the request to the device via a sequence of uncached stores and initiates the transaction with an uncached load that returns the device flow control status to the device driver. The entire device access sequence is protected by a global spin lock. Although this scheme may not necessarily represent an actual implementation, it approximates the cost of entering kernel mode and synchronizing the device access. The system call executes the minimum number of uncached loads and stores required for a request setup. Note that the UIO system call does not lock the physical pages in main memory, nor does it translate the virtual buffer space into a list of physical addresses, since it assumes that the I/O device is able to perform these operations independently. In systems that do not support user-level I/O, this assumption is not realistic, but it is used here to fairly compare only the cost of atomicity and flow control for I/O requests.

Table 11: System Configurations

Parameter                 400 / 100    400 / 200    2G / 100     2G / 200
CPU frequency             400 MHz      400 MHz      2 GHz        2 GHz
Superscalarity            4-way        4-way        6-way        6-way
Instruction window        48 entries   48 entries   96 entries   96 entries
L1 cache size / assoc.    32K / 2      32K / 2      256K / 4     256K / 4
L1 cache latency          1 cycle      1 cycle      2 cycles     2 cycles
L2 cache size / assoc.    2M / 2       2M / 2       8M / 2       8M / 2
L2 cache latency          14 cycles    14 cycles    26 cycles    26 cycles
System bus frequency      100 MHz      200 MHz      100 MHz      200 MHz
System bus width          8 bytes      16 bytes     8 bytes      16 bytes
Main memory latency       550 ns       225 ns       550 ns       225 ns
I/O latency               420 ns       340 ns       420 ns       340 ns

Figure 18: Request Overhead (absolute overhead in µs and overhead normalized to the system call scheme, for the system call, CSB with flush trap, CSB with fast flush trap, CSB at device, and CSB with hardwired context schemes on the 400/100, 400/200, 2G/100 and 2G/200 configurations)
The other schemes shown in Figure 18 use the conditional store buffer either in the processor or the device with a variety of process context mechanisms. In the case of the hardwired context scheme, the context is provided by the privileged tlb-context register which is normally used for fast TLB miss handling in the CPU. This register points to the physical page table root of the current process. Upon a CSB flush, the CSB inserts the current value of this register in location 0. Using the physical page table root enables very efficient address translation without kernel involvement at the device. However, for user-level notifications, the kernel needs to scan the process list for a matching process before delivering the notification. Alternatively, the physical address of the process structure may be used as a context identifier. TLB miss handlers need to perform at least one additional memory reference to find the associated page table, but notification handling may be faster than when the page table pointer is used.
The trap-based context schemes use the same context identifier as the hardwired scheme. The general privileged-instruction trap is dispatched to a C routine that handles all privileged instruction exceptions. If it determines that the trapped instruction is a CSB flush, it writes the current process context identifier to the CSB, performs the flush instruction and modifies the process state on the stack such that the return value of the conditional flush is set as expected by the application.
The fast trap scheme uses a dedicated CSB flush trap, which is handled by a short assembler routine. This routine writes the process context to the CSB, copies the trapped instruction to a kernel location and jumps to it, thus executing the instruction on the application's behalf.
Perhaps surprisingly, the system call scheme incurs significantly higher overhead than the general-purpose trap mechanism used to insert the process context, even though both save and restore the same amount of process state. The reason is that the system call passes through the file system vnode layer before entering the device-specific routine, while the privileged trap is immediately dispatched to the appropriate handler routine. In addition, the system call locks and unlocks the device and issues the request as a sequence of individual uncached stores, whereas the CSB provides atomicity in hardware and issues only one bus transaction, thus reducing overhead.
In systems with a slow CPU, the fast trap scheme performs only slightly worse than the device CSB or processor CSB with hardwired context. In these configurations, the overhead is dominated by copying the request into the CSB and waiting for the flow control status from the device, and the additional instructions of the fast trap handler do not add significantly to the overhead. However, in systems with a fast CPU, the special-purpose trap handler performs worse than the general-purpose trap. In order to minimize the number of instructions executed, the fast trap handler emulates the CSB flush by copying the instruction from user into kernel space and then jumping to it. Other methods of emulating the instruction would require more complex decoding, since the flush may take arbitrary effective addresses provided as base plus offset and arbitrary source and destination registers as arguments. However, copying the instruction into kernel space requires flushing the instruction cache, which leads to drastic latency increases for systems with fast CPUs and relatively slow main memory.
The performance advantage of the processor-side CSB is evident in the fast-processor experiments. For slow CPUs the device CSB performs comparably to the processor CSB with hardwired context, but only the latter scheme is able to reduce overhead further on fast CPUs, because it executes the fewest instructions, and because the sequence of stores into the CSB is not slowed down by the system bus.
Implementing the CSB and process context identifier in the device results in overheads slightly lower than the fast trap scheme, and comparable to the hardwired context scheme for slow processors. This makes it an attractive alternative to the other schemes because it does not involve modifications of the processor bus interface, while showing significant overhead reductions compared to a system call. The device CSB scheme, however, requires modifications of the context switch handler to update the current process context in the device and clear the CSB hit counter. In the LAMIX kernel this is implemented similarly to the shared interrupt mechanism. At boot time, the device driver registers a routine with the kernel that is called for every context switch. This routine performs device-specific operations to update the process context at the device. Before switching to the new process, the context switch handler calls all routines that have been registered by device drivers for this purpose. This method is flexible enough to fit in a general-purpose operating system structure, where a wide variety of hardware configurations and I/O devices is supported by a configurable kernel. A more optimized implementation might place the additional code directly in the context switch handler, thus making the context switch handler more efficient but device-configuration dependent.
Figure 19 shows context switch latencies for a variety of system configurations. The top set of bar graphs shows absolute latencies in microseconds, while the bottom graphs show the same results normalized to the unmodified context switch handler for each system. The microbenchmark measures context switch latency as the time it takes to switch between two processes using the yield() system call. Note that since no other operations are performed between context switches, few cache and TLB misses are incurred and the reported latencies are close to the minimum.
Figure 19: Context Switch Latency
(Bar graphs of absolute latency in µs and latency normalized to the unmodified switch, for the Unmodified Switch and the Modified handler with 0, 1, 2 and 3 devices, on the 400/100, 400/200, 2G/100 and 2G/200 configurations.)
Realistic applications may observe higher context switch latencies due to instruction and data cache misses incurred by the context switch routine. Nevertheless, a context switch in the LAMIX operating system has higher latency than on most commercial systems because the L-RSIM processor model requires flushing of the entire TLB during a context switch, whereas many commercial microprocessors tag TLB entries with a process identifier, eliminating the need to flush the TLB on a switch.
The context switch latency generally increases if the handler is modified to call device-driver-specific routines on every context switch, even if no such routine is installed (second bar graph). This additional latency is due to the additional instructions executed to check if any driver-specific routines are to be called. Some system configurations show a slight decrease in latency, probably because of changes in the instruction scheduling or layout in the instruction cache that lead to better utilization of the processor pipelines. Context switch latency increases by 5 to 20% when at least one device driver has installed a context switch routine. In this case the context switch handler calls an additional subroutine, which reads device-driver-specific data structures and issues an uncached store to inform the device of the switch. Adding more device-specific routines to the context switch handler increases latency further by about 2 to 5% per routine. The incremental cost of these routines is less than the initial latency increase because this experiment uses the same routine for every additional device, thus reducing cache misses for repeated invocations. Considering that basic context switches in many systems are significantly faster than in the simulator, a latency increase of 10 to 20% is significant and unacceptable for most general-purpose multiuser systems.
5.4 Summary
Initiating an I/O transaction involves atomically transferring multiple arguments and parameters to the I/O device. Unlike operating system code, applications cannot rely on software-based global synchronization mechanisms. The conditional store buffer is a hardware mechanism that allows software to control which I/O stores are combined, and when the resulting bus transaction is issued. To avoid overflowing the device request queue, the I/O device returns a status word to the application indicating if the request was accepted or rejected. These characteristics allow applications to use the CSB to atomically transfer all arguments of an I/O transaction to the device without involving the operating system. The CSB may be located in the processor bus interface or in the I/O device. In addition to the transaction parameters, each request packet contains a unique process context identifier which is used by the device to validate and translate virtual buffer addresses and to notify the process when the transaction completes. To securely and reliably communicate this process context to the device, either the CSB hardware or the kernel inserts the appropriate context in the request when it is issued to the device.
Microbenchmarks show that the processor-based CSB with hardwired process context incurs the least overhead of all the schemes, because it minimizes both the number of instructions executed and the number of bus transactions. However, hardwiring the process context field of the CSB to a processor register restricts the flexibility of system software and I/O devices to choose which form of context identifier to use. Although more flexible, software-based solutions increase the latency, which can approach that of a simple system call for unoptimized implementations.
As a result of these experiments, a processor-side conditional store buffer with a dedicated trap for the CSB flush instruction to insert the process context appears to be the best solution. It strikes a balance between overhead and flexibility, and does not require any modifications to the context switch handler.
6. DIRECT USER-SPACE TRANSFER
Copying data between kernel and user space is in many cases the largest component of I/O overhead. This operation not only occupies the host CPU for many cycles, it also has significant effects on the cache and TLB. The copy operation loads data into the cache twice, replacing other cache lines previously used by the application, and wastes precious system bus bandwidth for these extra data transfers. In addition, the host CPU copy performance is usually lower than that of a dedicated DMA engine in I/O devices, since microprocessors are generally optimized for single-word accesses to the level-one cache rather than large data movement operations.
In the case of file I/O operations, the data copy is necessary because the OS manages files in a cache of disk blocks in kernel memory, from where the requested data are copied into user space. This design gives the kernel better control over the physical memory allocated for the buffer cache, and it enables the kernel to perform the necessary address protection check for the application buffer during the copy operation. Network protocol code maintains kernel buffers for retransmissions in the event of transient network errors.
Next-generation I/O devices have sufficient computation resources and autonomy to perform some of the operations normally done by the operating system, thus allowing the kernel to be bypassed for data transfers. If data transfers are performed by the device independently from the host processor (DMA), the source or destination buffer must be specified as a physical address, since DMA transfers occur at the system bus level. Traditionally,
the device driver translates a contiguous virtual buffer space into a list of discontiguous physical pages that are communicated to the DMA engine in the I/O device. Before initiating the transfer, the device driver ensures that the physical pages corresponding to the buffer are present and valid, and marks them as nonpageable for the duration of the transfer. This pinning or locking of pages can either be done before every DMA transfer, or when the buffer is initially allocated.
Existing solutions for high-performance communication networks often require the application to specify the communication buffers in advance [24][32][39]. During the buffer setup, the kernel pins the pages in physical memory, and the address mapping is made available to the I/O device. In addition, the kernel can arrange the physical pages contiguously. The application is then able to initiate DMA transfers using this prearranged buffer without kernel assistance. This scheme places the burden of managing the limited buffer space on the programmer, and often forces the application to copy data in and out of the buffer to locations where the data is used. Furthermore, pinning pages takes these pages away from the general pool of available memory, possibly affecting the performance of the entire system, increasing physical memory pressure and limiting the scalability of this approach.
Enabling the I/O device to transfer data directly to and from user buffers reduces CPU occupancy and minimizes the cache and TLB effects of I/O transactions. At a minimum, this requires performing access protection checks on the entire application buffer, translating the virtual buffer region into a list of physical addresses and locking the physical pages in memory to avoid page faults. Executing a system call to perform these operations incurs extra overhead and defeats the goal of the user-level I/O architecture. In addition, specifying the source or destination buffer as a variable-length list of physical addresses requires a suitable location to store these addresses and increases the complexity of the DMA engine. The user-level I/O architecture takes an optimistic low-overhead approach to direct user-space DMA transfers. The UIO device is augmented with a translation lookaside buffer (TLB) that caches virtual-to-physical address mappings similar to a processor TLB. This approach enables the UIO device to translate virtual to physical addresses, perform protection checks and detect page faults without utilizing the host processor. TLB misses can be handled either by the host operating system via interrupts, or independently by the device with the help of a programmable page table walk engine. This section first describes the I/O device TLB design and discusses its integration with the host operating system. It then contrasts the kernel-based and hardware TLB miss handling mechanisms and evaluates the performance of these alternatives.
6.1 Device TLB Design
The user-level I/O architecture allows the UIO device to perform virtual-to-physical address translations autonomously, thus minimizing host CPU occupancy for data transfers. To achieve high DMA bandwidth, address translations are cached in a device translation lookaside buffer (TLB), as shown in Figure 20. For every bus transaction, the DMA engine presents the virtual address and a context identifier to the TLB, which returns the corresponding physical address and flags any access violations. The context identifier is needed to distinguish address mappings for different processes with the same virtual address. On a TLB miss, the device can either perform the necessary page table lookup independently, or invoke the kernel for assistance. Both options are discussed in the following sections.
Unlike a microprocessor TLB, the device TLB does not need to perform an address translation every cycle, since it is only accessed once per bus transaction. Furthermore, due to the relatively long latency of most I/O transactions, single-cycle access to the TLB is not as important as in a processor. In addition, most DMA operations access memory sequentially, thus reducing the need for a highly associative TLB design to achieve acceptable hit rates. These relaxed requirements differ from those for a high-performance microprocessor and enable TLB designs that use larger, less expensive commodity SRAM structures with low associativity.
Each TLB entry consists of a process context identifier, an address tag derived from the virtual page number, the corresponding physical page number, a valid bit and a number of protection bits. Since the user-level I/O device may be used in a variety of systems, the TLB must be compatible with different virtual memory architectures. Access protection bits are encoded in a system-independent way, and the TLB miss handler is responsible for converting the system-specific encoding appropriately. A configuration register specifies the base page size of the current system, and the TLB uses this setting to split the virtual address appropriately into tag and offset components. To keep the TLB design as simple as possible, variable page sizes or superpages are not supported; the TLB miss handler converts large page mappings to the base page size.

Figure 20: Device TLB Design
(The DMA engine presents a virtual address and context identifier to the TLB; on a miss, a TLB miss handler performs a table walk or raises an interrupt and fills the TLB; the translated address is used for the DMA transaction to the I/O bus.)
In its simplest form, the TLB uses only the virtual address to index into the array of TLB entries, and the process context and lower bits of the virtual address are compared with the tag. A small degree of set associativity helps to reduce the likelihood of conflicts due to virtual address aliases. If the number of sets is the same as the number of concurrent DMA streams supported by the device, conflicts between streams can be completely avoided. Alternatively, a hash value that is a function of both the virtual address and the process context can be used to index into the TLB.
6.2 TLB Misses and Faults
Due to its finite size, the TLB can act only as a cache of recently used address mappings. If a requested address translation is not found in the TLB (TLB miss), it must be loaded from the original page table. Device TLB misses either invoke the operating system via an interrupt, or trigger the device to perform the page table lookup on its own. For maximum flexibility, TLB misses are satisfied using the kernel's page tables. This method avoids the overhead of maintaining a separate set of page tables for the I/O device, but it can increase the cost of a TLB miss slightly compared to a special-purpose page table. In addition, sharing page tables with the host processor gives the device access to all relevant page information such as protection bits, page size and valid flags. For portability and flexibility, most modern operating systems maintain two sets of page tables. A hardware-independent structure is used for most high-level memory management operations, while a hardware-specific low-level page table is used to satisfy processor TLB misses. The processor-specific page table is usually simpler and can be traversed efficiently, while the high-level page table contains more detailed information about page usage and sharing. If device TLB misses are handled by the host operating system via interrupts, the interrupt handler can obtain the correct translation from the virtual memory subsystem and provide it to the device. If, on the other hand, the device performs page table lookups independently, it should use the low-level page table, as it keeps the algorithm executed by the device hardware simpler.
To simplify its design, the TLB stalls until the miss is handled. If no valid mapping was found, the TLB miss handler installs an invalid mapping for the current address in the TLB. When the TLB lookup is restarted, a matching entry is found, but the valid bit is not set and an access violation is signaled to the operating system. Similarly, if an illegal access, such as a write to a read-only page, is attempted, the TLB invokes the operating system.
For fatal access violations, the kernel interrupt handler terminates the process as it would for any other memory access violation. In addition, the DMA stream that caused the violation needs to be terminated. The exact mechanism to achieve this depends on the I/O network architecture. If the network transmits data as a single stream or in large packets, the interrupt handler can instruct the DMA engine to drain the current stream without any further bus transactions. Long I/O transactions might be transmitted as multiple streams to reduce network contention, in which case each additional stream causes another access violation that is handled in the same way. In this case, the per-stream interrupt approach is feasible only if packets are sufficiently large so that the number of kernel interrupts is small. If data are transferred in many small packets, the cost of invoking the kernel to handle access violations for each can negatively impact unrelated applications. In the prototype implementation with a 400 MHz CPU, a UIO device page fault takes 4.8 µs to process, while the interrupt handler occupies the host processor for over 9 µs. Assuming each 48-byte ATM cell triggers an access violation, the host processor reaches 100 percent saturation due to these interrupts at 5.2 Mbyte/s network bandwidth. At 200 Mbyte/s network bandwidth and 1 Kbyte packets, processor utilization due to access violations exceeds 50 percent. These examples demonstrate the performance impact a faulty application can have on unrelated processes. To reduce this impact in a packet-oriented network, the request state stored at the device for the duration of the DMA transfer should be used by the DMA engine to drop all packets belonging to a stream that caused an access violation.
Outgoing DMA transfers are somewhat easier to handle, since the DMA engine can consider the transfer complete as soon as the access violation is detected. However, if part of the request has been transmitted on to the network, the receiver must be notified to disregard the transfer.
Page faults occur when a process initiates an I/O transfer to a buffer with at least one of the physical pages swapped out. Since the device uses the same kernel page tables as the host processor, it is able to detect this condition and invoke the operating system in the same way as for an access violation. However, in this case the stream of data flowing to or from the network must be stalled until the page has been brought into memory by the operating system. This design has the side effect that the DMA engine that caused the page fault is blocked for the duration of the page fault, which may affect other I/O requests. It is possible to have the operating system remove the request with its current DMA state from the device and restart the transfer after the page fault is handled, but this would add significant complexity to the OS and the device hardware. Stalling an outgoing request may also block network resources such as a virtual circuit for the duration of the page fault. The system impact of these effects depends on the particular network organization. In addition to blocking network resources, stalling an incoming request requires that the sender or the network has sufficient buffering for the in-transit data, and that the network provides an appropriate flow control mechanism. Fortunately, most of these requirements are not unique to the user-level I/O architecture, and many system-area networks are prepared to handle stalls at both the data source and sink [15].
The asynchronous occurrence of I/O device page faults also requires some modifications to the operating system. Normally, page faults are detected synchronously and the process that causes the fault is blocked while the kernel handles it using mechanisms similar to synchronous system calls. Interrupt handlers, on the other hand, must never trigger a page fault since they cannot be blocked. Modern operating systems deal with these restrictions by off-loading some of the interrupt handler functionality to a regular (although high-priority) kernel thread that is subject to normal synchronization and scheduling activity. A similar approach can be used to process I/O page faults. After determining that a page fault is the cause of the interrupt, the interrupt handler unblocks a kernel thread or a separate process to handle the condition. A queue structure in kernel memory is used to communicate the virtual address and process identifier to the thread or process, which reads the page into memory on behalf of the I/O device and restarts the I/O transaction.
6.3 TLB Coherence and Consistency
As with any form of cache or TLB, coherency must be maintained between the device TLB, the processor TLB and the page tables in memory. Once the operating system removes a mapping for a physical page, it is free to use the same physical page for a different virtual address. To avoid corrupting data in the new page, the old mapping must be removed from any TLB in the system (TLB shootdown), which in effect makes the change visible to all TLBs.
Several architectures specify a bus protocol for TLB shootdown operations that invalidates and synchronizes all participating TLBs. Between processors, this is implemented with a special bus transaction that collects responses from all TLBs before graduating, but such transactions are normally not forwarded to the I/O bus, hence the I/O device cannot participate in this protocol. The same effect, however, can be achieved with an uncached read operation following the uncached write that triggers the invalidation. The device returns data for the read only after it has processed the invalidation request. Unfortunately, uncached read transactions incur latencies at least on the order of a main memory access, and using a read to synchronize the I/O device TLB increases the cost of global TLB shootdown operations even further. To keep the virtual memory management routines in the kernel modular and device independent, the device driver registers a device-specific tlb-flush routine with the kernel during initialization. The routine is called by the virtual memory subsystem whenever a page is unmapped; it performs any uncached writes and reads needed to invalidate the page table entry.
To minimize the number of I/O bus transactions when invalidating a device TLB entry, the kernel writes only the virtual address that is unmapped to a device control register. The device TLB removes all entries that match this virtual address. If the TLB uses a hash function of the virtual address and process context as an index, then searching for an entry with only a virtual address can take many cycles and may produce multiple matches. Since the TLB cannot be used by the DMA engines during the invalidation process, this scheme can introduce significant stall times for other I/O transactions. In addition, it may unnecessarily remove entries of unrelated processes because it invalidates any matching virtual address regardless of the process identifier. Alternatively, both the process context and virtual address can be communicated to the device for a TLB shootdown operation, thus eliminating the time-consuming linear search in this type of TLB, as well as the invalidation of alias entries. The disadvantage of this approach is that two bus transactions may be required, unless the two values can be written in a single, larger transaction.
If, on the other hand, the device uses only the virtual address to index into the TLB array, invalidation operations can be performed in a single TLB access, similar to normal address translations. In such a design, it is sufficient to communicate only the virtual address during the shootdown, unless invalidations of aliases are considered a problem.
To reduce the overhead of global TLB shootdown operations in the presence of large numbers of processors, modern systems use a lazy approach where it is guaranteed that a physical page is not reused within a certain time. Processors periodically invalidate TLB entries such that within the time period all entries have been invalidated at least once. Note that these invalidations are applied to physical TLB array entries, e.g., entries zero through three, entries four through seven and so on. When the TLB entry is reloaded, modifications made in the previous time period become visible to the processor. Since the process of invalidating TLB entries is local to each processor and does not require global communication, it scales to arbitrary numbers of TLBs, at the cost of some additional TLB misses due to unnecessary invalidations as well as increased physical memory pressure due to delayed reuse of physical pages. This delayed TLB shootdown protocol can easily be extended to include the I/O device TLB, with little or no additional overhead. The kernel clock interrupt handler can periodically invalidate I/O device TLB entries in addition to the CPU TLB entries. Since the device can guarantee that an invalidation completes within a few cycles after it has received the uncached write triggering the process, no explicit acknowledgment in the form of an uncached read is needed. To keep the clock interrupt handler system-configuration independent, device drivers can register an invalidation routine that is called at every clock interrupt. With little additional hardware the device can also invalidate parts or all of its TLB at programmable intervals, thus further reducing the cost of page table unmap operations. The latter scheme is preferable to the other two TLB invalidation methods, as it minimizes the cost of both the page unmap operations and the clock interrupt handler.
6.4 TLB Miss Handling with Kernel Interrupts
Since the device TLB is only a cache of recently used page table entries, a fraction of DMA transactions do not find a translation in the TLB at the time of the access. Like the other address-translation-related exceptions described earlier, TLB misses can be handled by the kernel through an interrupt. Such a design simplifies the device hardware and leaves the operating system designer free to use any page table organization without regard for the I/O device organization. This section discusses the design and performance of a kernel-based TLB miss handling scheme, while the next section discusses details of a hardware-based TLB miss handling mechanism.
In the prototype implementation, a general-purpose interrupt routine handles all device exceptions, including TLB faults, TLB misses and network errors. The particular error condition is recorded in a bit vector that is read by the interrupt routine. Exceptions may accumulate in the device, but only the first exceptional condition triggers the interrupt. After reading the interrupt cause register, the interrupt routine handles all accumulated exceptions, giving the highest priority to TLB misses. The kernel handles TLB misses by reading the virtual address and process identifier for the requested translation from I/O device registers, performing the page table lookup on the device's behalf, clearing the interrupt bit and writing the result of the translation to a device control register. When all pending interrupt conditions are handled, the routine checks the cause register again in case further exceptions have accumulated that were not reported when the register was read the first time.
Table 12 presents device TLB miss handling latencies and overheads for the four different systems presented in Table 11. The reported values are based on 64 samples from a single program issuing a variety of I/O requests. From the I/O device perspective, TLB miss latency is the most critical influence on performance. Even though a TLB miss is the first condition to be checked and handled in the kernel interrupt routine, the TLB miss handling latency observed by the device can be significant. The instruction sequence executed upon entry into the interrupt handler is essentially the same as for a system call. It involves switching to the kernel stack and saving the current context on it. Like a system call, many of the instructions operate on global processor state registers and throttle the dynamic scheduling core of the processor. In addition, a large variation of the device TLB miss handling latency can be observed if the kernel is executing critical sections during which external interrupts are disabled.
Table 12: Kernel TLB Miss Handling Performance

System               Minimum Latency   Average Latency   Average Overhead
400 MHz / 100 MHz    2.72 µs           3.89 µs           4.48 µs
400 MHz / 200 MHz    2.10 µs           2.50 µs           3.01 µs
2 GHz / 100 MHz      2.21 µs           3.07 µs           3.38 µs
2 GHz / 200 MHz      1.42 µs           1.85 µs           2.06 µs

6.5 TLB Miss Handling with a Programmable TLB Fill Engine
An alternative to handling TLB misses in the kernel is to enable the device to perform the page table lookup independently. This requires that the device has sufficient information to find and traverse the process page table. Virtual memory architectures and page table organizations vary widely between systems. A programmable table walk engine offers the flexibility needed for the device to operate in a wide variety of systems. This section introduces the design of such a table walk engine, evaluates the hardware complexity of a prototype implementation and presents the implementation and performance of various page table walk algorithms.
The proposed TLB fill engine is an accumulator-based one-address microprocessor with a very simple instruction set. Primary objectives of the architecture are a minimal size of the hardware implementation while maintaining reasonable execution efficiency and sufficient flexibility of the instruction set to implement a wide variety of address translation algorithms. These goals are accomplished with a variety of domain-specific optimizations. The instruction set supports only basic instructions that are needed for page table lookup operations, such as addition/subtraction, logical and shift operations and memory loads. Memory read accesses can be performed in a variety of sizes from one to 16 words. This is particularly useful since the data structures in many page table organizations are relatively big. The accumulator-based architecture simplifies the register file design, requiring only one read and one write port. Instructions are 16 bits, thus reducing the size of the required instruction memory. The relatively large register file of 32 registers supports page table lookup algorithms that require many temporary variables, or that manipulate large data structures.
The TLB fill engine works in conjunction with the on-chip TLB of the I/O device. Upon a miss, the TLB presents the virtual address and the process identification to the engine, which performs a table lookup and either returns a valid physical page number or a failure status. The UIO device TLB does not support variable page sizes. During startup it is configured to use the system's base page size. The TLB fill engine must be able to deal with variable-size pages and, if necessary, convert address mappings to the base page size.
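This conversion can be sketched in software (a simplified model, not the engine's actual microcode; it assumes large pages are aligned multiples of the base page size, so the base-page frame number is the large-page frame number with the low virtual page number bits merged in):

```python
# Sketch of converting a variable-size page mapping to a base-page
# mapping, as the fill engine must do for a TLB that supports only
# the base page size. A large page covering 2**extra_bits base pages
# is assumed to be aligned, so the base-page PPN is the large-page
# PPN with the low bits of the virtual page number merged in.

def to_base_mapping(vpn: int, large_ppn: int, extra_bits: int) -> int:
    """Return the base-page PPN covering virtual page `vpn`."""
    low = vpn & ((1 << extra_bits) - 1)   # position within the large page
    return large_ppn | low

# A page of 16 base pages (4 extra bits) with PPN 0x1230 maps virtual
# page 0x567 to physical page 0x1237.
assert to_base_mapping(0x567, 0x1230, 4) == 0x1237
```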
The TLB fill engine communicates with the TLB through three registers that are mapped into the engine's general-purpose register set as registers 0 through 2. Registers 0 and 1 are initialized by the TLB upon a miss and contain the virtual address and process context. Register 2 is written by the TLB fill engine at the end of the table lookup and contains the physical page number.
The host CPU has access to the instruction memory and register file. During system startup, the I/O device driver writes a system-specific instruction sequence into the instruction memory. Having access to the register file as well allows the driver to initialize the registers with constant values that are used by the page table walk algorithm.
6.5.1 Instruction Set
The TLB fill engine instruction set architecture defines a small number of 16-bit-wide instructions that include basic arithmetic and logic instructions, memory load operations for a variety of data sizes, and control transfer instructions. Table 13 summarizes the supported instruction classes. Instructions contain either an 8-bit immediate value or a 5-bit register identifier. Most instructions use a register value or the sign-extended immediate value and the accumulator as operands and update the accumulator with the result of the operation. Control transfer instructions use an absolute address specified in a register or as an immediate value as target. Note that branch targets are specified as instruction memory addresses, not byte addresses. Thus, an 8-bit immediate value is sufficient to span a 256-entry instruction memory. All instructions operate on 32-bit data items. Instructions can specify whether they update the accumulator, the condition code registers, or both, thus enabling a variety of comparison and test instructions.

Table 13: Table Walk Engine Instruction Set Summary

Class         Instructions
arithmetic    add, subtract
logic         and, or, xor, invert
shift         shift left, shift right
comparison    compare, bit test
control flow  branch on condition code
move          move to accumulator, move from accumulator, swap
memory read   load 1, 2, 4, 8 or 16 words
halt          halt with success, halt with failure
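The accumulator discipline described above can be illustrated with a small software model (a sketch for illustration only; the opcode names follow the listings later in this chapter, while this Python rendering of their semantics is an assumption):

```python
# Minimal software model of the accumulator-based TLB fill engine ISA.
# Opcode names follow the code listings in this chapter (MOV, MOVA,
# ADD, SHIFTLi, ...); the modeled semantics are a sketch, not the
# engine's specification.

class FillEngine:
    def __init__(self):
        self.acc = 0            # single accumulator
        self.regs = [0] * 32    # 32-entry register file; r0-r2 shadow the TLB
        self.cc_zero = True     # condition code set by compare instructions

    def _u32(self, v):
        return v & 0xFFFFFFFF

    # Each instruction reads a register or immediate plus the
    # accumulator, and writes the result back to the accumulator.
    def MOV(self, r):     self.acc = self.regs[r]
    def MOVA(self, r):    self.regs[r] = self.acc
    def MOVi(self, imm):  self.acc = imm
    def ADD(self, r):     self.acc = self._u32(self.acc + self.regs[r])
    def AND(self, r):     self.acc &= self.regs[r]
    def OR(self, r):      self.acc |= self.regs[r]
    def XOR(self, r):     self.acc ^= self.regs[r]
    def SHIFTLi(self, n): self.acc = self._u32(self.acc << n)
    def SHIFTRi(self, n): self.acc >>= n
    def SWAP(self, r):    self.acc, self.regs[r] = self.regs[r], self.acc
    def CMPi(self, imm):  self.cc_zero = (self.acc == imm)

# Example: extract the 16-bit virtual page number (bits 4..19) from a
# 32-bit PowerPC effective address, as the lookup code does with
# SHIFTLi 4 / SHIFTRi 16 / MOVA r8.
e = FillEngine()
e.regs[0] = 0x1234_5678   # r0: virtual address provided by the TLB
e.MOV(0)
e.SHIFTLi(4)
e.SHIFTRi(16)
e.MOVA(8)                 # store VPN (0x2345) in r8
```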
6.5.2 Hardware Implementation
Since many page table lookup algorithms exhibit a critical code path length of tens of instructions, it is important that implementations of the TLB fill engine be at least partially pipelined. The accumulator-based instruction set architecture and the relatively low complexity of the arithmetic instructions lend themselves easily to a simple three- or four-stage pipeline. Data need to be forwarded between the accumulator and the following instruction if a MOVA or SWAP instruction is followed by an instruction that reads the destination register.
Memory loads are clearly the dominant factor of most table walk algorithms. In many cases, performance improves if the TLB fill engine does not stall automatically on load instructions. This allows the instruction sequence to continue operating while the load occurs, for instance by performing additional checks on some operands or by preparing bit masks or other values that are needed when the load returns. Using this scheme, the TLB fill engine stalls only when it encounters an instruction that uses one of the destination registers of the pending load as its source or destination, or when it encounters a second load. To simplify the register conflict detection logic, data must be loaded into registers that are aligned at the request size.
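With aligned load destinations, the conflict check collapses to a single mask-and-compare, which is the point of the alignment restriction (a sketch of the logic, not the actual hardware):

```python
# Why aligned load destinations simplify conflict detection: a load of
# `size` words (a power of two) targets a register number aligned to
# `size`, so a later instruction referencing register r conflicts with
# the pending load iff masking off r's low bits yields the load's base
# register -- one AND and one compare in hardware.

def conflicts(pending_base: int, size: int, r: int) -> bool:
    assert size in (1, 2, 4, 8, 16) and pending_base % size == 0
    return (r & ~(size - 1)) == pending_base

# e.g. LDH 16 into r16..r31: any reference to those registers stalls.
```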
The sample implementation shown in Figure 21 features a three-stage pipeline. Data are forwarded from MOVA/SWAP instructions to following consumers. Loads stall only when an instruction using the load value is encountered. Not-taken branches (fall-through) incur no delay; taken branches result in a one-cycle pipeline bubble.
The design was implemented in the Verilog hardware description language and synthesized with the Synopsys Design Compiler [25] for a commercial 0.25 µm five-metal-layer process. The target frequency is 66 MHz, corresponding to the frequency of currently used high-end PCI buses.
The instruction memory is a single-ported synchronous SRAM generated by a VLSI foundry-specific tool. Similarly, the register file is a dual-ported synchronous SRAM. A special-purpose register file with dedicated read and write ports would have had a slightly smaller footprint, but this was not used because the register file generator did not function properly. Since the access times of the instruction SRAM and the register file are well below 3 ns, it is possible to perform both the instruction fetch and operand read operation in a single cycle, with the register file running off the negative clock edge. Ideally, an asynchronous register file would be used in its place. The arithmetic-logic unit and its surrounding multiplexers and pipeline registers were generated using Synopsys' module compiler.
In the first pipeline stage, the new PC (either provided by the PC increment logic or a branch target) is latched into the synchronous instruction memory and the PC register. The instruction memory provides an instruction which is used to access the register file at the falling clock edge. At the same time, the instruction's immediate bit controls the multiplexer that chooses between the sign-extended immediate value and a register value. In parallel with the register file, the set of externally visible registers is accessed. The output of these registers is chosen by the multiplexer when the instruction accesses a register in the range 0 through 3. Forwarding data has the highest priority, unless the instruction uses an immediate value. Data are forwarded if the oldest instruction writes a value to the register file and the following instruction reads from the same location.

Figure 21: TLB Refill Engine Architecture
[Block diagram: the PC logic and instruction memory feed the register file and the externally visible registers r0-r3; a multiplexer selects among a register operand, the sign-extended immediate, and the forwarded result; the ALU output drives the accumulator, condition codes, memory address, and branch target paths.]
The following pipeline stage latches the current instruction and the operand and performs the desired operation. The instruction controls whether the result will be latched in the accumulator and the condition code register. In addition, if the instruction is a memory access, the memory request signal is asserted. This is possible at this stage only because memory access instructions do not perform any arithmetic operation and the content of the accumulator remains unchanged from the previous instruction. The second pipeline stage also drives the register file write signal if the current instruction is a SWAP or MOVA. If memory happens to return data at this time, the register file write port is occupied and the entire pipeline is stalled. Finally, branch processing is performed in stage 2. The current operand is forwarded to the PC logic and the multiplexer chooses the new PC if the condition codes match the branch condition. The third pipeline stage writes the result into the accumulator and condition code register, if enabled by the current instruction.
The implementation as presented here is synthesized for a commercial 0.25 µm process, with an 11 ns clock cycle target to allow for routing delay not accounted for during synthesis. Table 14 summarizes the area requirement reported by the Synopsys synthesis tool. It is equivalent to the area of a 2 Kbyte SRAM created by an automatic memory generator for the same technology. The ALU as generated by the Synopsys Module Compiler [69] contains the critical path. A custom-designed ALU instead of the current standard-cell-based design would reduce both the delay and area of this component. Certain control paths to the register file, which is running off the inverted clock, are also critical. However, ideally one would choose an asynchronous register file, in which case the timing requirements could be relaxed.
6.5.3 Table Walk Algorithms
To investigate the feasibility of performing page table lookups with a programmable controller, this section describes three different page table organizations found in commercial systems, and presents implementations of the corresponding lookup algorithms.
Table 14: Table Walk Engine Area Requirements

Module                               Area in µm²
pc logic                             3318
instruction memory                   86093
register file                        156488
external registers (r0-r3)           39341
alu & pipeline registers             65457
control                              18079
total (including estimated routing)  388012

6.5.3.1 32-Bit PowerPC. The 32-bit PowerPC processor [83] presents a 4 Gbyte virtual address space which is split into 16 segments of 256 Mbyte each. The processor contains 16 segment descriptor registers, each consisting of a 24-bit segment ID and access protection information. The four most significant bits of the address select one of the 16 segment descriptors. The virtual page number, combined with the segment ID and a mask, forms an index into the hashed page table. Each page table bucket contains eight 64-bit page table entries (PTEs). If none of the eight primary PTEs matches the segment ID, a secondary hash function is applied to the virtual page number and the segment ID to access eight secondary PTEs. Figure 22 summarizes the page table lookup algorithm of the PowerPC architecture.
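The hashed index computation can be sketched as follows (a simplified model based on the SDR1 fields used by the lookup code in Figure 23: a 19-bit hash of the segment ID's low bits and the 16-bit virtual page number, masked by SDR1's 9-bit hash mask and combined with the table's 16-bit base; the secondary hash is taken to be the one's complement of the primary hash, and 64-byte buckets of eight PTEs account for the 6-bit shift):

```python
# Sketch of the PowerPC hashed page table bucket address computation
# performed by the lookup code in Figure 23. Simplified model; the
# SDR1 field positions follow the 32-bit PowerPC layout used there
# (upper 16 bits: table base, low 9 bits: hash mask).

MASK19 = (1 << 19) - 1

def pteg_address(sdr1: int, segment_id: int, vpn: int,
                 secondary: bool = False) -> int:
    base = sdr1 & 0xFFFF0000                # upper 16 bits: table base
    mask = ((sdr1 & 0x1FF) << 10) | 0x3FF   # 9-bit mask over a 19-bit hash
    h = (segment_id ^ vpn) & MASK19         # primary hash
    if secondary:
        h = (~h) & MASK19                   # secondary hash: one's complement
    return base | ((h & mask) << 6)         # 64-byte buckets of 8 PTEs
```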
Although the basic page table lookup algorithm is implemented in hardware, different operating systems may use the available features differently. For this discussion, it is assumed that both the segment descriptor register SDR1 and the 16 segment registers are changed upon a context switch. Consequently, both are part of the process context and can be accessed through the process structure in kernel space. The table lookup algorithm takes a virtual address and the physical pointer to the process structure as inputs and produces the corresponding physical page number upon success. The algorithm presented here does not support large pages that use the PowerPC block address translation feature.
Figure 23 presents an abbreviated version of the page table lookup algorithm implementation for the programmable TLB fill engine. Omitted is the code that checks the last six primary PTEs for a match, as well as the code segment that computes the secondary hash index, loads the secondary PTEs and searches for a match. The indented instructions are executed while a load is outstanding; they do not contribute to the total latency of the algorithm.
The latency of a page table lookup using this code depends on whether the desired PTE is located in the primary or secondary bucket, and where in that bucket it is found. The algorithm requires a minimum of three memory accesses, and a fourth access if the secondary bucket is needed. The latencies reported in Table 15 assume an average memory access latency of 30 PCI cycles at 66 MHz, composed of a 10-cycle memory controller latency (150 ns) and 10 cycles each from the PCI bus through the PCI bridge and back.
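These latency figures follow directly from the cycle counts; a quick cross-check of Table 15 under the stated 66 MHz clock and 30-cycle memory access:

```python
# Cross-check of the lookup latencies in Table 15, using the 66 MHz
# engine clock and the 30-cycle average memory access latency stated
# above (10 cycles memory controller + 10 cycles each way across the
# PCI bridge).

CYCLE_NS = 1e3 / 66   # one 66 MHz PCI cycle, ~15.15 ns

def lookup_latency_us(code_cycles: int, mem_accesses: int,
                      mem_cycles: int = 30) -> float:
    total_cycles = code_cycles + mem_accesses * mem_cycles
    return total_cycles * CYCLE_NS / 1e3

# First primary PTE match: 35 code cycles + 3 memory accesses.
print(round(lookup_latency_us(35, 3), 3))   # -> 1.894
# First secondary PTE match: 73 code cycles + 4 memory accesses.
print(round(lookup_latency_us(73, 4), 3))   # -> 2.924
```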
Figure 22: 32-bit PowerPC Page Table Lookup Overview
[The four most significant bits of the virtual address select one of the segment registers SR0-SR15, yielding a 24-bit segment ID. The low 19 bits of the segment ID are XORed with the 16-bit virtual page number to form the primary hash (the secondary hash is its negation); the hash is ANDed with the 9-bit mask and ORed with the 16-bit table base from SDR1 to produce the 32-bit PTE address. Each 64-bit PTE holds a valid bit, 24-bit segment ID, 6-bit API, protection bits, and a 20-bit physical page number, which is concatenated with the 12-bit offset to form the physical address.]
Figure 23: 32-bit PowerPC Page Table Lookup Code
MOV r1 // get context r1
ADD r15 // add SDR1 offset r15
LDD r6 // load SDR1 into r6
MOV r0 // get VA
SHIFTLi 4
SHIFTRi 16 // extract virtual page #
MOVA r8 // store VPN in r8
MOV r0 // get VA
SHIFTRi 28 // extract segment #
SHIFTLi 2 // 32 bit chunks
ADD r14 // add segment reg offset r14
ADD r1 // add context
LDD r4 // load segment reg into r4
MOV r6 // get SDR1 (r6)
SHIFTLi 23
SHIFTRi 13 // extract 9 bit mask
OR r13 // make lower 10 bits 1
MOVA r5 // store mask in r5
MOV r6 // get SDR1
SHIFTRi 16
SHIFTLi 16 // extract 16 bit base
MOVA r6 // store base in r6
MOV r4 // get segment ID
AND r11 // mask out segment #
XOR r8 // XOR segment with VPN
MOVA r7 // store hash 1 in r7
AND r5 // AND with mask
SHIFTLi 6 // shift left 6 bits
OR r6 // OR with base address
LDH 16 // load first 4 PTEs
MOVi 1
SHIFTLi 31 // create valid bit
MOVA r3 // store bit in r3
MOV r4 // read segment ID
AND r11 // mask out segment #
SHIFTLi 7 // shift left
OR r3 // set valid bit
MOVA r3 // store back in r3
MOV r8 // get VPN
SHIFTRi 10 // extract API
OR r3 // concat. with segment ID
CMP r16 // check PTE 0
BNEi next0 // skip if no match
MOV r17 // read PTE0
BAi succ // done
next0:
CMP r18 // check PTE 1
BNEi next1 // skip if no match
MOV r19 // read PTE1
BAi succ // done
next1: // continue up to PTE 7
... // if no match, load
... // secondary PTEs
... // check secondary PTEs
HALTFAIL // no match found
succ:
MOVA r5 // store PTE in r5
AND r12 // extract physical page #
SWAP r4 // store PPN, get segment ID
SHIFTRi 27 // extract Kp
OR r5 // combine with PTE
ANDi 0x07 // extract protection bits
CMPi 0x04 // privileged ?
BEi priv // jump ahead
CMPi 0x05 // read only ?
BEi sro // yes -> jump ahead
ANDi 0x03
CMPi 0x03 // read only ?
BEi sro // yes -> jump ahead
MOV r4 // read PPN
MOVA r2 // write TLB entry
HALTSUCC
sro:
MOV r4 // get PPN
ORi 0x01 // set Read-Only bit
MOVA r2 // write TLB entry
HALTSUCC
priv:
MOVi 0x02 // set privileged bit
OR r4 // merge with PPN
MOVA r2 // write TLB entry
HALTSUCC
6.5.3.2 MIPS-4 (32-Bit). Many 32-bit MIPS-based systems use a two-level address translation scheme. When a user process incurs a TLB miss, the virtual page number is used to index into a flat page table that starts at the virtual address specified in the XContext register. When the page table entry is loaded, its virtual address is in turn translated again. In case the second translation also misses in the TLB, a secondary TLB miss handler is invoked which uses a conventional two-tier page table organization to locate the required physical address. The physical address is then used to load the page table entry for the user process. Each page table entry is 64 bits in size and contains the physical page number, a valid bit, access protection information and a four-bit page size indicator. Figure 24 summarizes the page table lookup algorithm used in 32-bit MIPS processors.
In many MIPS systems, the primary user page table is located at a fixed virtual address which does not change during context switches. It is assumed that the base pointer of the root page table is part of the process structure and is provided as process context to the device. The algorithm presented in Figure 25 first calculates the virtual address of the user page table entry, which is then translated using the context provided by the TLB. Both the root page table entry and the user page table entry may map a page that is larger than the 4 Kbyte base page size. In this case, the page size specifier in the PTE indicates how many additional bits of the page offset are used, and by how many bits the physical page number shrinks. The algorithm converts all mappings to the 4 Kbyte base page size.

Table 15: PowerPC Page Table Lookup Latencies

Case                        Instructions  Code latency in cycles  Latency including memory access
first primary PTE match     31            35                      1.894 µs
second primary PTE match    35            39                      1.955 µs
eighth primary PTE match    59            63                      2.319 µs
first secondary PTE match   69            73                      2.924 µs
eighth secondary PTE match  97            101                     3.348 µs
no PTE match (fault)        98            101                     3.348 µs

Figure 24: 32-bit MIPS Page Table Lookup Overview
[The 17-bit virtual page number indexes a flat user page table at the base given by the XContext register; if that access misses in turn, the Context register and a two-level (L0/L1) page table locate the user PTE, with NULL pointer and valid bit checks at each level. Each 64-bit PTE contains a 26-bit physical page number, a valid bit, and a page size field that controls how the virtual page number is masked and merged for pages larger than the 4 Kbyte base size.]
The algorithm requires three memory accesses. With 35 instructions in the critical path, the total instruction latency is 38 cycles. Again assuming a 30-cycle memory latency, a page table lookup requires 128 cycles or 1.939 µs.
Figure 25: 32-bit MIPS Page Table Lookup Code
MOV r0 // get VA
SHIFTRi 11 // extract virtual page #
ANDi 0xF8 // align at dword
ADD r15 // add to XContext
SHIFTRi 22 // extract L0 offset
ANDi 0xF8 // align at dword
ADD r1 // add to UPT base
LDD r6 // load L0 ptr into r6/r7
MOV r0 // get VA
SHIFTRi 11 // extract VPN
ANDi 0xF8 // align at dword
ADD r15 // add XContext
MOVA r5 // store UPA in r5
SHIFTLi 7 // remove L0 offset
SHIFTRi 21 // extract L1 offset
SHIFTLi 3 // align at dword
SWAP r7 // save offset, get L1 base
CMPi 0 // check if ptr is NULL
BEi fail // fail if NULL
ADD r7 // add offset to L1 base
LDD r6 // load L1 ptr
MOV r0 // get VA
AND r11 // extract VPN (don’t shift)
MOVA r4 // store VPN in r4
MOV r12 // get large page mask
MOVA r8 // move to r8
MOVA r9 // move to r9
MOV r6 // get UPA PTE[0]
SHIFTRi 25 // get page size bits
ANDi 0x0F // mask out rest
SWAP r8 // swap with large page mask
SHIFTL r8 // shift large page mask
OR r12 // set lower bits
AND r5 // use mask on UPA
SWAP r7 // swap UPA offset with PTE[1]
CMPi 0 // check if ptr is NULL
BEi fail // fail if NULL
SHIFTLi 8 // shift out top bits
ADD r7 // add new VA offset
LDD r6 // load PTE into r6/r7
MOV r7 // read PTE[1]
BTSTi 0x02 // test valid bit
BEi fail // fail if not set
AND r10 // extract PPN
SHIFTLi 8 // shift out top bits
SWAP r6 // store in r6, get PTE[0]
SHIFTRi 25 // shift page size bits
ANDi 0x0F // mask out rest
SWAP r9 // swap with large page mask
SHIFTL r9 // shift left by N bits
AND r4 // extract lower bits of VPN
OR r6 // merge with VPN
MOVA r2 // write PPN
HALTSUCC // done
fail:
HALTFAIL
6.5.3.3 Intel IA-32. Similar to the PowerPC, the IA-32 architecture uses a combination of segmentation and page-based memory management, as presented in Figure 26. The processor provides six segment selector registers, which index into a global or local table of segment descriptors. Three of the six segments are defined by the architecture as text, data and stack segments, but instructions can override the default segment for memory accesses. The segment descriptor selected by the instruction contains a base address, size and protection bits. The base address is added to the virtual address to form a linear address. This linear address is optionally mapped through a conventional two-tier page table.
Figure 26: IA-32 Page Table Lookup Overview
[A 16-bit segment selector, chosen by the instruction, indexes the segment descriptor table at the base given by GDTR/LDTR; the segment base is added to the 32-bit virtual address to form the linear address. The linear address's 10-bit directory and 10-bit page indices walk a two-level page table rooted at the 20-bit PDBR base, and the resulting 20-bit physical page number is concatenated with the 12-bit offset to form the 32-bit physical address.]
Each page table entry contains the base address of the next-level page table, or the physical page number, as well as access protection bits. Alternatively, first-level page table entries may point to a superpage instead of the next page table level.
The IA-32 architecture allows many different combinations of segmentation and paging with different linear address sizes and page sizes. The algorithm presented in Figure 27 assumes the commonly used flat address space model, which bypasses segmentation completely. The six on-chip segment selector registers point to a segment that starts at address 0 and spans 4 Gbyte. The first-level page table base address is process specific and contains pointers to second-level page tables or 4 Mbyte superpages. The base page size is 4 Kbyte. The processor still accesses the segment table for every memory access, but the device page
Figure 27: IA-32 Page Table Lookup Code
MOV r0 // get VA
AND r12 // extract L0 offset
SHIFTRi 20 // shift right
ADD r1 // add to context
LDD r4 // load L0 ptr into r4/r5
MOV r0 // get VA
AND r13 // extract L1 offset
SHIFTRi 10 // shift right
MOVA r8 // store in r8
MOV r0 // get VA
AND r15 // get part of VPN for lrg.pg.
MOVA r9 // store in r9
MOV r4 // get L0 ptr
BTSTi 0x01 // check valid bit
BEi fail // skip to fail if not set
BTST r11 // check large page bit
AND r14 // extract L1 base/PPN
BNEi sucb // skip to success big if set
ADD r8 // add L1 offset
LDD r6 // load L1 ptr into r6/r7
MOV r6 // get L1 pointer
BTSTi 0x01 // check valid bit
BEi fail // skip to fail if not set
AND r14 // extract PPN
SWAP r6 // store in r6, get L1 ptr
AND r4 // combine with L0 pointer
succ:
INV r0 // negate bits
ANDi 0x06 // extract protection bits
SHIFTRi 0x01 // move to correct position
OR r6 // combine with PPN
MOVA r2 // write mapping
HALTSUCC // success
sucb:
SWAP r4 // save PPN in r4, get L0 ptr
INV r0 // negate bits
ANDi 0x06 // extract protection bits
SHIFTRi 0x01 // move to correct position
OR r4 // combine with PPN
OR r9 // merge with VPN offset
MOVA r2 // write mapping
HALTSUCC // success
fail:
HALTFAIL // failure
table lookup algorithm can ignore segmentation and use the virtual address as linear address. The algorithm takes the virtual address and page table base address as arguments and returns the physical page number. Mappings for 4 Mbyte superpages are converted to 4 Kbyte mappings. Each page table lookup requires two memory accesses.
Performing a two-tier lookup requires 25 instructions, or 28 cycles, plus two memory accesses. Assuming a 30-cycle main memory latency, the entire translation takes 1.333 µs. A lookup of a 4 Mbyte page requires only 19 instructions, or 23 cycles, which corresponds to a 0.803 µs latency.
6.6 Device TLB Performance Evaluation
Overall performance of the device TLB is determined by several components. TLB miss handling latency is measured from the moment a TLB miss is detected until the required entry is loaded into the TLB and the DMA operation restarts. TLB miss handling overhead is defined as the time during which the processor is not executing application or kernel code because of the TLB miss. In addition, the overall system performance impact of both measures is affected by the TLB miss ratio.
6.6.1 Miss Handling Latency Comparison
Figure 28 compares the TLB miss handling latencies of the hardware- and software-based TLB miss handling mechanisms for the set of systems described in Table 11. As expected, hardware-based TLB miss handling generally provides lower latency, especially when considering that the simulated kernel TLB miss handler uses a simple two-level page table organization comparable to the flat memory model in IA-32. Kernel interrupts not only incur at least twice the latency of the comparable device-sided TLB miss handling, they also introduce significant CPU overheads, as noted on top of the bar graphs. On the other hand, device-based TLB miss handlers can only benefit from faster buses and lower main memory latency, while kernel interrupt handlers also benefit from faster processors and are able to partially close the latency gap to the hardware-based handlers. However, this scaling effect may be less significant for many realistic applications because the microbenchmark does not utilize the cache as much as real applications would, and hence the interrupt handler does not incur as many cache and TLB misses. In addition, interrupt handlers can have non-negligible secondary effects on system performance by replacing cache and TLB entries that then need to be reloaded by the application after the interrupt is handled. The performance impact of these effects is largely dependent on the cache and TLB hit rates of the application and is not measured in these experiments. Generally, kernel-based TLB miss handling also shows a larger latency variation due to the fact that the interrupt might be delayed if the kernel is executing a critical section, while hardware-based handlers proceed independently and are only affected by bus and memory controller contention.

Figure 28: TLB Miss Handler Latency
[Bar chart comparing miss handling latency (0-4 µs) of the PowerPC (first primary and last secondary match), MIPS and IA-32 (4k page) table walk engine implementations against the L-RSIM/LAMIX kernel handler for the four simulated systems (400 MHz/100 MHz through 2 GHz/200 MHz). The kernel handler's CPU overhead, noted on top of each bar, ranges from 4.5 µs down to 2.1 µs.]
6.6.2 Host Processor Occupancy
The graphs in Figure 29 show CPU occupancy due to kernel TLB miss handling for a variety of I/O DMA bandwidths and TLB miss ratios. TLB hit ratios are given as a percentage of page accesses, i.e., a 0% hit rate means that every first access to a page incurs a TLB miss, but the remaining accesses to the same page hit. This definition of hit rate is used because it is independent of the size of a DMA transaction, and hence independent of the number of DMA transactions per page. It also implies that hit rates below 0% are possible if multiple streams interfere with each other, causing multiple TLB misses per page. CPU occupancy is determined analytically based on the effective bandwidth under a given page size and hit rate, and the resulting number of TLB misses per second. TLB miss overhead is assumed to remain constant over time. The distribution of the CPU occupancy along the y-axis for a given page size corresponds to varying TLB miss handler overheads ranging from 1.2 µs (lower bound) to 4 µs (upper bound). For instance, in a system with 400 Mbyte/s peak DMA bandwidth, a 1.2 µs TLB miss handler overhead leads to an 11 percent CPU utilization for a zero percent hit rate, while a 4 µs overhead leads to over 30 percent CPU utilization. These miss handler overheads are intended to cover the range observed in the prototype implementation for different processor and memory system configurations.
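The occupancy model can be written out directly (an analytic sketch of the model described above, not simulator output):

```python
# Analytic CPU occupancy model behind Figure 29, per the description
# above: at a given hit ratio, each page transferred costs
# (1 - hit_ratio) kernel miss handler invocations, each occupying the
# CPU for `overhead_s` seconds.

def cpu_occupancy(bandwidth_bps: float, page_bytes: int,
                  hit_ratio: float, overhead_s: float) -> float:
    misses_per_s = (bandwidth_bps / page_bytes) * (1.0 - hit_ratio)
    return misses_per_s * overhead_s   # fraction of CPU time consumed

# 400 Mbyte/s peak DMA, 4 Kbyte pages, 0% hit ratio:
low = cpu_occupancy(400e6, 4096, 0.0, 1.2e-6)   # ~0.12, roughly 11-12%
high = cpu_occupancy(400e6, 4096, 0.0, 4.0e-6)  # ~0.39, over 30%
```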
The graphs show that TLB miss overhead can be significant even for moderate transfer rates. Smaller pages lead to a higher rate of TLB misses given a fixed hit ratio, since the DMA engine crosses page boundaries more frequently. The high sensitivity to TLB hit ratios indicates that locality within a single request stream alone is not sufficient to achieve acceptably low CPU overheads. However, since I/O DMA streams exhibit almost completely sequential behavior, a simple TLB entry prefetching scheme is able to improve TLB hit ratios significantly. Upon a TLB miss, the miss handler not only resolves the outstanding miss but also loads the address translation for the subsequent virtual page into the TLB, thus effectively reducing the miss rate by a factor of two. Such a prefetch scheme is most effective for TLBs with some degree of associativity, since that reduces the likelihood that a prefetched TLB entry may replace an entry that is in use by another stream. Given the relatively high base overhead of a kernel interrupt, performing an additional translation incurs only a minor overhead increase.

Figure 29: TLB Miss Handler CPU Utilization
[Four graphs plotting host processor utilization (0-35 percent) against TLB hit ratio (0-100%) for peak DMA bandwidths of 50, 100, 200 and 400 Mbyte/s, with shaded regions for 4 Kbyte and 16 Kbyte pages.]
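The factor-of-two claim for a sequential stream can be checked with a toy simulation (illustrative only; the FIFO replacement policy and TLB size here are arbitrary assumptions):

```python
# Tiny simulation of the next-page prefetch scheme described above: on
# a miss, the handler installs translations for both the faulting page
# and the next sequential page, halving the miss count for a fully
# sequential DMA stream.

def count_misses(pages, tlb_size=8, prefetch=False):
    tlb, misses = [], 0
    for p in pages:
        if p not in tlb:
            misses += 1
            tlb.append(p)
            if prefetch:
                tlb.append(p + 1)      # prefetch the next virtual page
            while len(tlb) > tlb_size:
                tlb.pop(0)             # FIFO replacement for simplicity
    return misses

stream = list(range(100))              # fully sequential DMA stream
assert count_misses(stream) == 100
assert count_misses(stream, prefetch=True) == 50
```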
6.6.3 DMA Bandwidth Impact of TLB Miss Handling Latency
The impact of TLB miss handling latency on overall I/O performance depends on the peak DMA transfer bandwidth and the TLB miss rate. The graphs in Figure 30 plot the effective DMA bandwidth for a variety of peak bandwidths and TLB miss ratios. Each shaded area describes bandwidth for a particular page size for TLB miss handling latencies ranging from 0.8 µs (upper bound) to 3.0 µs (lower bound). For instance, in a system with 400 Mbyte/s peak DMA bandwidth, the effective bandwidth for a zero percent hit rate ranges from 305 Mbyte/s for a 3.0 µs miss latency to 370 Mbyte/s for a 0.8 µs TLB miss latency. Effective bandwidth is calculated with the assumption that in the absence of TLB misses, data is transferred between the device and main memory at peak bandwidth. During a TLB miss, the DMA engine stalls. Note that the y-axis of each graph represents only the upper 25% of the total bandwidth scale.
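The bandwidth model can be written out as follows (an analytic sketch; it assumes a Mbyte of 2**20 bytes, which best reproduces the quoted 305-370 Mbyte/s range):

```python
# Analytic model behind Figure 30, per the assumptions above: data
# moves at peak bandwidth except while the DMA engine stalls for
# (1 - hit_ratio) TLB misses per page. Assumes binary Mbyte (2**20).

MB = float(1 << 20)

def effective_bw_mb(peak_mb: float, page_bytes: int,
                    hit_ratio: float, miss_latency_s: float) -> float:
    transfer_s = page_bytes / (peak_mb * MB)        # time to move one page
    stall_s = (1.0 - hit_ratio) * miss_latency_s    # stall per page
    return page_bytes / (transfer_s + stall_s) / MB

# 400 Mbyte/s peak, 4 Kbyte pages, 0% hit ratio:
slow = effective_bw_mb(400, 4096, 0.0, 3.0e-6)  # ~306 Mbyte/s
fast = effective_bw_mb(400, 4096, 0.0, 0.8e-6)  # ~370 Mbyte/s
```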
Figure 30: Effective DMA Bandwidth
[Four graphs plotting effective DMA bandwidth against TLB hit ratio (0-100%) for peak bandwidths of 50, 100, 200 and 400 Mbyte/s, with shaded regions for 4 Kbyte and 16 Kbyte pages; each y-axis spans only the upper 25% of the bandwidth scale.]
The graphs clearly show that TLB miss latencies are more critical for higher bandwidth, smaller pages and higher TLB miss ratios. Larger pages incur relatively fewer TLB misses and are thus less sensitive to TLB miss latencies because they amortize the cost of the initial TLB miss over a larger number of subsequent hits. Similarly, a higher bandwidth increases the rate of TLB accesses, and hence the rate of TLB misses under a given miss ratio. The graphs suggest that for modest bandwidth, neither TLB miss ratio nor miss latency has a significant impact on overall performance. For high bandwidth and small pages, low TLB miss latency and low miss ratios become critical to achieve acceptable performance.
To minimize the TLB complexity, the analysis assumes that all requests are stalled while a miss is resolved. This simple design can negatively impact unrelated DMA streams since the TLB is shared among multiple send and receive DMA engines. With some additional hardware, it is possible to stall only the request stream that causes the TLB miss, thus reducing the impact of TLB misses on unrelated DMA streams.
6.7 Summary
Copying data between kernel and user space is in many cases the dominant component of I/O overhead. Eliminating the copy operation leaves the CPU available for application processing and can lead to improved system throughput and higher I/O bandwidth. Enabling the I/O device to transfer data directly to and from user space requires that either the operating system or the I/O device translate virtual addresses specified by the application into physical addresses, while performing the necessary protection checks and detecting page faults. Since invoking the operating system for every I/O request incurs significant overhead, the user-level I/O architecture lets the application provide virtual addresses directly to the I/O device. The device provides a TLB to cache recently used address translations and uses the existing kernel page tables to look up entries not present in the cache. The lookup operation can be performed with kernel assistance or independently by the device. Performing address translations on demand provides flexibility by enabling applications to use arbitrary memory regions for I/O operations, without the need to preallocate buffers.
Having the kernel perform address translations on demand via interrupts simplifies the device hardware and leaves the kernel free to use the most efficient page table organization without regard for the I/O device. However, this scheme incurs higher TLB miss latencies compared to the device-based lookup scheme, which leads to lower effective I/O bandwidth. In addition, under high TLB miss rates this scheme uses significant CPU resources, up to 30% of the available processor cycles.
A programmable page table lookup engine residing in the device combines the necessary flexibility with the ability to perform TLB refills without host processor involvement. Such a lookup engine can be implemented with a small amount of additional chip resources. It is able both to reduce the TLB miss latency observed by the DMA engine and to eliminate CPU involvement during data transfers.
Although TLB miss latencies are performance critical only for the highest DMA bandwidths, host processor utilization can become significant for low TLB hit rates. These results underscore the importance of performing TLB miss handling independently from the host processor.
7. USER-LEVEL NOTIFICATION
Interrupt or notification handling is an important part of most I/O transactions. Interrupts are asynchronous external events and are frequently used to notify the operating system that an I/O transaction has completed. In most general-purpose microprocessors, interrupts cause the processor to save its current state in special trap registers and transfer control to a predefined address. The interrupt handler routine at the predetermined location saves additional state on the kernel stack, determines the cause of the interrupt and performs the necessary operations. Before restoring the process state and resuming execution at the interrupted instruction, many interrupt handlers check whether a scheduling operation is necessary and whether a signal needs to be delivered to the current process.
When initiating an I/O request as a result of a system call performed by an application program, operating systems like UNIX transfer control to a device driver which performs the necessary uncached reads and writes to set up the I/O control information at the device. The application then blocks in the device driver waiting for an I/O interrupt from this device. When the I/O device delivers an interrupt, signalling that an I/O transfer completed, a device-specific interrupt handler unblocks all processes waiting for that transfer. Using these mechanisms, the operating system hides the nonblocking characteristics of I/O transactions from applications by context switching to another process until the I/O request completes. It also allows the operating system to consider I/O activity when making scheduling decisions and computing process priorities.
This operating system design simplifies the application interface to I/O, but it incurs significant overheads. For instance, a disk controller interrupt takes between 50 and 100 µs to execute, during which time it replaces over 10 Kbytes of instructions and between 3 and 4 Kbytes of data in the first-level caches [92]. This overhead is a result of the generality of the interrupt handler, which requires it to save and restore significant process context and which leads to a complex code structure to handle all possible interrupt causes.
Handling interrupts in dedicated device driver routines inside the operating system also limits flexibility in the way completion notifications are treated by individual processes. Application programs are blocked when an I/O operation is initiated and are not able to overlap the long-latency I/O operation with other work. To mitigate this shortcoming, most commercial UNIX variants offer an asynchronous I/O interface. This interface allows applications to continue executing after initiating an I/O request, at the cost of some additional overhead. To deal with the mismatch between blocking device drivers and the nonblocking programming interface, the application library creates an I/O thread which performs a blocking system call on behalf of the original application. When the blocking system call returns, the I/O thread sets a flag or raises a signal in the original application and exits. Creating and removing the I/O thread incurs the cost of additional system calls, but allows the library to implement asynchronous I/O with minimal kernel support. Flexibility in handling asynchronous I/O notifications is limited to a signal handler or polling on a shared flag.
The user-level I/O architecture expands on this scheme and provides a flexible, lightweight mechanism to invoke arbitrary user routines for I/O notifications with minimal kernel involvement. The user-level notification mechanism consists of a lightweight kernel interrupt handler that closely cooperates with applications to asynchronously execute user-level notification routines. A novel notification queue structure in the host processor bus interface reduces the kernel interrupt handler overhead by enabling the UIO device to write all required information to the host processor at the time of the interrupt. The interrupt handler can thus obtain all pertinent information locally from the queue without the cost of multiple uncached reads from the I/O device. The flexibility of the notification mechanism allows applications to use any number of specifically tailored routines to handle notifications, thus reducing the execution cost of each routine. Handling notifications almost exclusively in user space reduces overhead since no heavyweight context switch is necessary and cache and TLB pollution effects are minimized. The notification mechanism exploits the fact that the originating application process is likely to be currently executing, in which case heavyweight scheduling and context switching is not necessary. Note, however, that these lightweight notifications are used only for the case where an I/O request completes without exception; all error conditions and other exceptions such as DMA access violations are handled by the kernel using normal interrupts. This chapter first provides an overview of the user-level notification mechanism, discusses the host processor notification queue and some multiprocessor issues, and then presents results of a performance evaluation using microbenchmarks.
7.1 Lightweight Notification Mechanism
Each user-level I/O request structure contains three pieces of information needed to deliver notifications to the originating application. When initiating an I/O request, the application specifies a buffer location where the original request structure with the return status will be deposited by the I/O device. In addition, it specifies the address of a user-level notification routine that is to be executed when the I/O request completes. The kernel or CSB hardware inserts a process identifier into the request that is used to locate the target process for a notification.
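These three pieces of information can be pictured as fields of the request structure. The following C sketch is illustrative only; the field names and types are assumptions, not the dissertation's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout of a user-level I/O request structure. */
typedef struct uio_req {
    void *result_buf;     /* user buffer where the device deposits the
                             completed request with its return status */
    void (*notification)(unsigned ret_addr, struct uio_req *req);
                          /* user-level routine to run on completion */
    unsigned process_id;  /* inserted by the kernel or CSB hardware to
                             locate the target process */
    int status;           /* return status filled in by the remote device */
    struct uio_req *next; /* link used when notifications are queued */
} uio_req;
```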
Notifications are handled by the UIO device in two phases, as shown in Figure 31. When the UIO device receives the request structure as part of the response from a remote I/O device, it writes the request structure into the predetermined buffer in user space using the existing DMA mechanism. Storing the request in user space allows the application to examine the return value as well as any request-specific arguments as part of the notification handling. Software is responsible for ensuring that multiple outstanding notifications are deposited in distinct buffer locations; otherwise notifications overwrite each other and do not reach the application.
Figure 31: Notification Handling (diagram: a request returned from a remote device is (1) DMAed into the request buffer in the application address space, and (2) sent to the host processor, triggering an interrupt)
After writing the request structure into user space, the UIO device writes the same structure to the notification queue in the host processor bus interface to trigger an interrupt. This is similar to the way normal external interrupts are delivered in many architectures. However, the bus transaction causing the interrupt carries the complete request structure to the CPU, thus making all pertinent information available locally in the CPU. When receiving the notification, the host processor enters a low-priority interrupt handler that accesses the control registers in the CPU bus interface to process the notification. Using the process identifier in the request structure, it determines if the currently running process is the target process for the notification. If it is, the kernel interrupt handler saves a minimal amount of process state on the user stack and modifies the process state such that the application process starts executing the notification handler when returning from the interrupt. The address of the request structure in user space as well as the address of the interrupted instruction are passed to the notification routine as arguments. Upon completion, the user-space notification handler returns to the interrupted instruction by executing a jump to the address provided as one of the arguments.
If the current process is not the target process for the notification, or the current process is not executing in user mode (e.g., in a system call), the notification cannot be delivered immediately and must be queued for later delivery. Queuing ensures that each individual notification is forwarded to the application, since each notification carries a unique return status for a distinct request. To queue pending notifications, the kernel maintains a list of pointers to request structures in the process structure, as shown in Figure 32. Keeping the list in nonpageable memory allows the interrupt handler to access it without risking page faults. When receiving a notification for an inactive process, the kernel interrupt handler appends to this list the virtual address of the current request structure that was written to a user buffer. The maximum size of the list determines the maximum number of outstanding requests for a process and is a system-specific constant. On a context switch, the kernel checks if the new process has any pending notifications, similar to the check for pending signals in UNIX. In fact, one of the unused bits in the process signal mask may be used to flag this situation. If a process has pending notifications, the kernel uses the pointer array to create a linked list of request structures in user space. Unlike the interrupt handler, the kernel is able to access user memory at this time because a page fault only blocks the current process. The kernel then modifies the process state such that the application resumes execution at the first notification handler. Each notification handler is responsible for checking whether it is the last in the list, and otherwise calling the next notification routine with the correct arguments. The last notification handler returns to the instruction that the application was executing before the last context switch, using a jump instruction.

Figure 32: Queued Notification Handling (diagram: (1) the interrupt handler creates a list of pointers to pending notifications in the process structure; (2) when the process resumes with a modified application PC, the kernel builds a linked list of request structures in the application address space and resumes at the first notification handler)
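The context-switch step that turns the kernel's fixed-size pointer array into a linked list threaded through the user-space request structures can be sketched as follows; the function and field names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Abbreviated request structure: only the queue link matters here. */
typedef struct uio_req {
    struct uio_req *next;
} uio_req;

/* Link the pending request structures into a list, preserving arrival
 * order, and clear the kernel-side slots for reuse. */
static uio_req *link_pending(uio_req *pending[], int count)
{
    uio_req *head = NULL;
    for (int i = count - 1; i >= 0; i--) {
        pending[i]->next = head;  /* thread the list through user memory */
        head = pending[i];
        pending[i] = NULL;        /* slot in the process structure is free */
    }
    return head;                  /* first notification to deliver */
}
```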
The amount of state saved by the kernel upon entry into the notification handler includes only the registers that are used to pass arguments to the routine. The notification routine is responsible for saving and restoring any additional state that it modifies during its execution. Subsequently, the routine can execute arbitrary code to synchronize with the application, such as setting a flag or unblocking a user thread. Unlike a traditional UNIX signal, returning to the interrupted instruction is accomplished by executing a jump to the address provided as one of the arguments. The following pseudocode sequence illustrates a basic framework for user-level notifications.
void uio_notification(unsigned ret_addr, uio_req *req)
{
    save registers;                  /* only those this handler modifies */
    perform notification operation;  /* e.g., set a flag, mark a thread ready */
    if (req->next != NULL) {
        /* not the last pending notification: chain to the next handler */
        restore registers;
        req->next->notification(ret_addr, req->next);
    }
    /* last notification: return to the interrupted instruction */
    restore registers;
    jump(ret_addr);
}
Note that to execute correctly under all circumstances, the notification routine must be reentrant and may access only lock-free data structures [46], because it is not allowed to block on synchronization variables. However, since the routine executes completely in user space, violating these requirements affects only one application and not the entire system.
Reentry capability is needed because notifications can arrive at any time, including during the processing of a notification, and the system provides no mechanism to disable or mask notifications temporarily. Since notifications are local to a process, and notifications are lowest-priority interrupts, it is possible to provide a system call interface that raises the interrupt priority on behalf of an application. If the interrupt priority is part of the process context and is saved and restored on a context switch, applications can use the system call to disable notifications during critical sections. The kernel must also ensure that queued notifications are not delivered after a context switch if the interrupt priority prohibits this. However, the cost of a system call to mask notifications for relatively short critical sections does not seem justified, considering that in many cases lock-free data structures can be used to implement similar functionality. Alternatively, processors can implement a user-accessible control register that is used to enable or disable notifications and that becomes part of the process context. This scheme gives applications complete and lightweight control over notifications, similar to interrupt masks used by the kernel. However, adding registers to existing instruction set architectures is often difficult and costly and should not be taken lightly.
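As an illustration of the lock-free style such a routine must follow, the sketch below publishes completion through C11 atomics instead of blocking on a lock. This is a hedged example, not code from the dissertation; the flag and counter names are invented:

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared state between the notification routine and the application. */
static atomic_int io_done_flag;    /* completion flag polled by the app */
static atomic_uint completions;    /* running count of completed requests */

/* Body of a notification routine: both operations are lock-free on
 * common targets, so the routine never blocks even if it interrupts
 * arbitrary application code or another notification. */
static void on_io_complete(void)
{
    atomic_fetch_add_explicit(&completions, 1, memory_order_relaxed);
    atomic_store_explicit(&io_done_flag, 1, memory_order_release);
}
```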
7.2 Processor Notification Queue
To minimize the overhead of the user-level notification mechanism, the host processor provides a notification queue that holds request structures from the time the interrupt is triggered until it is processed by the kernel interrupt handler. The queue enables the UIO device to transfer the complete request structure to the processor at the time it signals the interrupt. The tail of the queue is exported by the processor as a set of control registers. Writing to the queue loads an entry with the data and triggers an interrupt. To determine the target process and notification handler address, the interrupt handler can access the head of the queue locally without issuing bus transactions, thus avoiding the cost of multiple uncached reads from device control registers.
The purpose of the notification queue is to provide sufficient storage for one notification from each possible UIO device. Consequently, its size limits the maximum number of UIO devices supported by a system. To avoid overflowing the notification queue, each notification needs to be acknowledged by the kernel interrupt handler via an uncached store to the device, similar to regular external interrupts. The kernel interrupt handler processes the notification at the head of the queue, acknowledges it to the device and explicitly shifts the queue before exiting. To avoid the cost of polling all UIO devices to match the current notification to a particular device, the originating UIO device inserts its physical base address in the request structure before writing it to the processor notification registers. The physical address is configured in a device control register by the kernel and is used by the interrupt handler to write the acknowledgment to the device.
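The head-processing and explicit shift can be sketched with a toy in-memory model of the queue. In the real design the queue lives behind memory-mapped registers in the bus interface; all names and the queue depth here are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define NQ_DEPTH 4   /* illustrative: one slot per possible UIO device */

/* Plain-array stand-in for the processor's notification queue. */
typedef struct {
    uint64_t entries[NQ_DEPTH];  /* request structures, abbreviated */
    int count;
} notif_queue;

/* Handler step: take the head entry and explicitly shift the queue.
 * The real handler would also issue one uncached store to the device
 * whose base address is carried inside the request structure, to
 * acknowledge the notification before exiting. */
static uint64_t nq_pop(notif_queue *q)
{
    uint64_t head = q->entries[0];
    for (int i = 1; i < q->count; i++)
        q->entries[i - 1] = q->entries[i];
    q->count--;
    return head;
}
```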
The notification queue minimizes overhead by avoiding the cost of uncached reads to UIO devices, but the basic user-level notification mechanism does not depend on it. Similar to conventional device interrupt handlers, the kernel notification interrupt handler can poll all UIO devices to determine which device triggered the interrupt, and then read the notification-specific information from control registers. However, the sequence of uncached reads required and the added complexity of invoking multiple interrupt handler routines to find the originating device lead to overhead that approaches that of a regular interrupt handler.
7.3 Multiprocessor Considerations
In multiprocessor systems, the operating system often considers processor affinity when scheduling a process for execution. By executing a process on the same processor as during the previous time slice, cache and TLB locality can be exploited across time slices to reduce the performance impact of multiprogramming. Under this scheduling policy, sending a notification to the processor that executed the application process when the request was issued increases the likelihood that the process is found to be active, thus reducing notification latency and overhead. If processes migrate between processors while UIO requests are outstanding, or if the UIO device sends a notification to the wrong processor, the user-level notification mechanism queues notifications more often than necessary, degrading performance to that of a specialized signal handling mechanism. Exploiting the performance benefits of processor-specific notifications requires that the request indicate from which processor it originated. In many system architectures, this information is available on the system bus, from where it can be inserted into the request structure by other devices [68][84]. However, current I/O buses such as PCI do not propagate this information to I/O devices. Future local I/O interconnect designs will possibly propagate more detailed information such as the processor ID to I/O devices, enabling a tighter integration of I/O devices and host processors. Alternatively, the CSB could insert the processor ID into each request at a fixed location, similar to the process context. Unfortunately, this scheme restricts the flexibility of the CSB even further.
7.4 Performance Evaluation
Figure 33 compares the latencies of the user-level notification mechanism under various conditions to a traditional interrupt-plus-signal scenario. Each group of graphs shows results averaged over 70 different I/O requests for the machine configurations described in Table 11. Graphs in the top group represent absolute values while the bottom graphs show overhead relative to the baseline. The leftmost graph of each group shows the latency of a regular interrupt handler (bottom portion) plus the cost of handling the signal in the application (top portion). The interrupt handler performs three uncached reads to determine the cause of the interrupt, to determine the target process for the notification and to load the virtual address of the request packet. It then sends a signal to the target process, acknowledges the notification via an uncached store and reads the device status register again to check if another interrupt is pending. The signal handling cost is measured using LMBench. Signal handling latencies measured on the simulator are significantly lower than on real systems, because unlike commercial operating systems, the LAMIX kernel does not support all UNIX signal handling features and because it is not multithreaded. With a 175 MHz microprocessor, the LAMIX kernel is almost four times faster when delivering a signal. Consequently, the results in Figure 33 are pessimistic as the baseline overhead is lower than it would likely be in a real system.

Figure 33: Notification Latencies (two groups of bar charts for the configurations 400/100, 400/200, 2G/100 and 2G/200; the top group shows absolute overhead in µs, the bottom group shows overhead normalized to the baseline; each group compares interrupt + signal, queued notification (worst and best case) and immediate notification (worst and best case))
Note that this baseline scheme of interrupt handler and signal delivery does not correspond to any existing I/O notification mechanism. Most I/O interrupt handlers unblock a process and invoke the scheduler, in addition to acknowledging the interrupt and sending a signal. However, this scheme is used because it corresponds best to the nonblocking I/O model of the user-level I/O architecture.
The second and third bar graphs in each group show the cost of a user-level notification if the target process is not running. The bottom portion of each graph represents the kernel interrupt handler costs, and the top portion corresponds to the cost of delivering one queued notification to the application. Note that the best and worst cases shown here are what has been observed in the microbenchmark; they do not necessarily correspond to the theoretically possible absolute best and worst case.
The last two bar graphs correspond to the cost of immediately delivering a user-level notification to the currently running process, again separated into observed best and worst case. Note that the user-level notification results do not include the cost of the actual application notification handler, which depends entirely on the amount of work performed in the handler. Similarly, the baseline signal handler result is obtained with an empty signal handler. Performing actual work in the signal handler or user-level notification handler increases the latencies shown here by a constant amount, assuming that both handlers perform a comparable amount of work.
The relative performance benefit of the user-level notification scheme is significant for all evaluated system configurations, for several reasons. The user-level notification interrupt handler is able to access all necessary information about the notification locally, whereas normal interrupt handlers issue a number of uncached loads for this purpose. The amount of state saved by the kernel when invoking the user-level notification handler consists only of the registers needed to pass the arguments to the handler. UNIX signals require that the kernel save the entire process state on the user stack. In addition, returning from a UNIX signal requires another system call to restore the original process state.
All notification schemes benefit from faster processors as well as faster buses. The regular interrupt-plus-signal scheme is able to exploit the higher compute performance of the faster processor better because of its longer code path, which contains more instruction-level parallelism compared to the user-level notifications. Due to the larger number of uncached loads, it also benefits more from faster system buses with lower latency. However, it should be noted that the presented results for the interrupt-plus-signal baseline are close to the achievable minimum, due to the measurement methodology of LMBench, whereas the user-level notification latencies are measured in a more realistic application environment.
One important performance benefit of the user-level notification scheme is not captured by the microbenchmark at all. In the user-level I/O architecture, applications can use different notification handlers for different requests. This allows programmers to write a set of notification handlers specifically optimized for one task, and to use the appropriate handler for each request. This flexibility can simplify the programming task and leaves room for optimizations in each handler. In contrast, because changing signal dispositions involves an expensive system call, signal handlers are usually more general and hence complex and slow. The performance benefit of this added flexibility of the user-level notification depends on the particular synchronization scheme used in an application and is not further investigated here.
7.5 Summary
In most current systems, the nonblocking characteristics of long-latency I/O operations are hidden from applications by the kernel. After initiating an I/O transaction, the application process is suspended until the request completes. Upon completion, the I/O device interrupts the kernel to unblock the application process and invoke the scheduler. Implementing nonblocking I/O in a UNIX-like kernel usually requires that the application spawn a separate thread which blocks on its behalf, a process which incurs additional overhead from system calls and costly inter-thread communication via signals.
The user-level I/O architecture provides a flexible, low-overhead mechanism to invoke arbitrary user routines when an I/O request completes. Application programmers are able to use different notification routines for different requests and can thus optimize each routine for its particular purpose. The notification mechanism assumes that it is likely that the initiating process is still running at the time of the notification and that a heavyweight context switch is not necessary. In this case the kernel saves only minimal state before invoking the user-level handler, which saves any additional state as necessary. Notification handlers return to the original instruction by executing a jump instruction. The queueing mechanism necessary for cases where notifications arrive for inactive processes adds only a small amount of execution overhead and requires the addition of a small fixed-sized array of pointers in nonpageable memory.
The performance advantages of the user-level notification scheme over normal interrupts with signals for synchronization stem from the fact that the UIO device pushes all necessary information to the processor, and that the kernel is not involved in returning from a notification handler. Other advantages include the flexibility of using specifically tailored notification handlers for different requests and the fact that notifications are processed almost completely in user space. As a result, user-level notifications are able to reduce notification overhead by more than a factor of five for the common case, and by more than a factor of two if the notification must be queued for later delivery.
8. PERFORMANCE EVALUATION
Previous chapters have presented performance evaluations of the individual mechanisms and have focused on comparing alternative designs. This chapter quantifies the performance of the entire user-level I/O architecture. The low request overhead of the UIO architecture directly benefits applications that use latency hiding to improve throughput. Such applications can implement user-level multithreading or other application-level synchronization schemes to overlap I/O latencies. In addition, other applications with low I/O requirements or without the ability to hide I/O latencies themselves may benefit indirectly as the reduced request overhead makes more processor cycles available. The ability of the user-level I/O architecture to perform I/O operations with arbitrary user buffers without the need to wire or preallocate those buffers means that applications can realize performance improvements without software modifications.
This chapter first describes details of the prototype implementation with respect to the remote I/O device models and the application libraries. It then presents results of experiments that use a synthetic benchmark to measure the maximum available I/O bandwidth under various request patterns, and shows how a database server can improve throughput without any modifications to the original program.
8.1 Prototype Architecture
The user-level I/O architecture presented in this work is evaluated with a distributed storage device organization similar to network-attached secure disks [42]. This storage architecture improves file server scalability by separating data storage from data management and by transferring data directly between disks and clients. At the heart of the system are autonomous storage devices connected to clients via a scalable system-area network. The storage devices export an object-based interface to clients on top of which existing file system abstractions can be built. A storage manager oversees access rights and metadata operations for the distributed disks, but is not involved in data transfers.
The network-attached disk architecture modeled follows the general outline of the proposed network-attached secure disks architecture [42]. However, because the user-level I/O architecture focuses on data transfers, a simpler approximation of the object-based interface has been implemented. Since no network-attached disk device prototype exists, this study uses workstation nodes to model these autonomous disks, similarly to other work on network-attached disks. Each disk node runs a complete operating system kernel with additional server code. The disk nodes export a standard UNIX filesystem instead of the generic object-based interface. Objects are referenced by filename instead of an encrypted capability. These approximations do not affect the performance of data transfers, which are the focus of this work. Modeling these details more accurately would have no impact on the results shown here. However, transferring filenames instead of encrypted capabilities adds a comparable amount of data to each request. Although the remote disk models do not decrypt capabilities, the higher overhead of the general-purpose kernel running on the disks compensates for the simplified control path. In addition, most file-manager-related activity affects only metadata modifications and not the actual data transfer, which is the focus of this work.
The distribution of responsibilities between the kernel and user code can vary, from performing only performance-critical data transfers at the user level to completely eliminating the operating system from any I/O operations. Since the remote disks do not trust client machines, the exact choice does not affect system security. In either case, a user-level library maintains a list of open files, similar to the process file descriptor table managed by the operating system, as shown in Figure 34. Each entry contains the appropriate user-level device address, routing information to access the corresponding remote disk, the object capability and the file offset. The first two items allow applications to issue requests directly to the UIO device, which forwards them to the correct disk. The file offset must be maintained in the library because read and write requests bypass the kernel, and for scalability reasons the remote disks do not keep information about open files. The file offset is transferred to the disk with every request, and is updated locally when the read or write operations succeed. Maintaining this information at the user level allows the library to present the familiar file interface, regardless of whether the file is stored locally or on a remote disk. To distinguish regular files from user-level files, UIO file descriptors have higher values than the highest possible kernel file descriptor.

Figure 34: File Descriptors in a User-level I/O Library (diagram: read(fd, buf, size) checks whether fd > M; if so, file-specific data from the UIO file descriptor table together with the remaining parameters forms a UIO request, otherwise the call is forwarded as a system call)
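The descriptor dispatch and local offset update might look as follows in C; the threshold constant and helper names are assumptions for illustration, not the prototype library's actual interface:

```c
#include <assert.h>

/* Illustrative "M": above any descriptor the kernel could hand out. */
#define UIO_FD_BASE 1024

typedef enum { VIA_SYSCALL, VIA_UIO } io_path;

/* Per-file state kept in the user-level descriptor table. */
typedef struct {
    long offset;   /* maintained locally since requests bypass the kernel */
} uio_file;

/* Decide which path a read() call takes, as in Figure 34. */
static io_path dispatch_read(int fd)
{
    return (fd >= UIO_FD_BASE) ? VIA_UIO : VIA_SYSCALL;
}

/* On completion, advance the local offset only if the transfer succeeded. */
static void complete_rw(uio_file *f, long bytes_done)
{
    if (bytes_done > 0)
        f->offset += bytes_done;
}
```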
If the operating system delegates all UIO file operations to user code, the library must also maintain a list of mount points or directory prefixes at which remote disk object hierarchies start. The list of prefixes is checked for every file operation that takes a filename as argument, and the operation is either forwarded to a system call or results in a UIO request. In addition, for completeness it is necessary to implement the notion of a current working directory and a root directory in the library.
Delegating the file descriptor management and possibly the UIO mount point management to user-level software does not weaken the security model, since remote storage devices do not trust clients regardless of whether requests are issued by the client operating system or user-level code. In addition, failure to correctly implement this functionality affects only the one application and has no impact on the correctness of other applications. One reason for this robustness is that remote disks do not maintain any per-client information about open files. Since clients do not consume any resources on remote I/O devices beyond what is required to handle pending requests, a client terminating with pending UIO requests has no impact on the remote disks or on other applications. However, the required functionality increases application software complexity, and should be implemented in a dynamically linked library where it is hidden from application programmers and can easily be reused. A dynamic library has the additional advantage that it can adapt different hardware implementation details of the UIO device to a common programming interface.
To completely hide the details of the user-level I/O architecture, the library must also suitably implement the blocking behavior of file operations, for instance by initiating a thread switch after issuing a UIO request. Since notification handlers must be reentrant and may not block, blocked threads cannot be unblocked directly by the notification handler as the thread run queue may be locked. Instead, the notification handler sets a flag in the thread structure marking the thread as ready, and signals the scheduler through a global flag that at least one thread can be unblocked. At the next thread scheduler invocation, the list of blocked threads is scanned and all ready threads are moved to the run queue.
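A minimal sketch of this deferred-unblock protocol, with invented names for the thread structure and flags; the scheduler-side function counts ready threads in place of actually moving them to a run queue:

```c
#include <assert.h>
#include <stddef.h>

typedef struct uthread {
    int ready;              /* set by the notification handler */
    struct uthread *next;   /* link in the blocked list */
} uthread;

static volatile int any_ready;   /* global hint checked by the scheduler */

/* Called from a notification handler: may not lock the run queue,
 * so it only marks the thread and raises the global hint. */
static void notify_ready(uthread *t)
{
    t->ready = 1;
    any_ready = 1;
}

/* At the next scheduler invocation: scan the blocked list and collect
 * every thread marked ready (stand-in for moving it to the run queue). */
static int collect_ready(uthread *blocked)
{
    int moved = 0;
    if (!any_ready)
        return 0;
    for (uthread *t = blocked; t != NULL; t = t->next) {
        if (t->ready) {
            t->ready = 0;
            moved++;
        }
    }
    any_ready = 0;
    return moved;
}
```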
A compatibility library gives single-threaded applications access to the user-level I/O interface. Although such applications cannot directly take advantage of the reduced I/O overhead, the increased idle time can be made available to other processes to improve overall system throughput. After transmitting an I/O request, the library initiates a context switch via the yield() system call. A global flag set by the notification handler indicates if the request is complete. The library repeatedly checks the flag and initiates context switches if it is not yet set. This scheme does not minimize overhead, as it wastes processor cycles polling the flag. However, if other processes are competing for the CPU, the poll interval is relatively large. Frequent polling occurs only if little or no other work is available, in which case the higher processor utilization does not affect throughput. The main disadvantage of this scheme is that the kernel scheduler is unaware of the completion notification and may delay rescheduling the process unnecessarily, causing longer I/O latencies.
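The poll-and-yield loop can be sketched as below. The yield function is passed as a parameter so a demo stand-in can take the place of the yield() system call; all names are illustrative:

```c
#include <assert.h>

static volatile int request_done;   /* set by the notification handler */

/* Poll the completion flag, yielding the CPU between checks.
 * Returns the number of yields for illustration. */
static int wait_for_completion(void (*yield_fn)(void))
{
    int yields = 0;
    while (!request_done) {
        yield_fn();     /* give the CPU to competing processes */
        yields++;
    }
    return yields;
}

/* Demo stand-in for yield(): pretend the notification handler sets the
 * flag after the third context switch. */
static int demo_yields_until_done = 3;
static void demo_yield(void)
{
    if (--demo_yields_until_done == 0)
        request_done = 1;
}
```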
8.2 UIO Bandwidth Scaling
A synthetic benchmark is used to measure the maximum I/O bandwidth attainable from a number of independent disks. The benchmark issues I/O requests as quickly as possible without consuming the data. Requests to the same disk are blocking, while requests to independent disks are overlapped to maximize throughput. This benchmark is identical to the one used in Section 2.2.3, with added library support for user-level I/O. The simulated architecture is also identical to the one described in Table 4, allowing a comparison of kernel-based and user-level I/O performance.
The results in Figure 35 compare the bandwidth of kernel-based I/O with the user-level I/O architecture for a variety of request sizes and access patterns and for different numbers of disks. The UIO model uses a single UIO network interface with a simplified model of a system-area network. The UIO interface implements six DMA engines: three send engines and three receive engines. This means that a maximum of three messages can be in transit in each direction. Note that the UIO device provides only one TLB that is shared between all DMA engines. Preliminary experiments showed that due to contention, fewer DMA engines affect the request latency negatively, decreasing overall performance. On the other hand, the kernel-based I/O measurements are performed with one DMA engine per disk. The bandwidth of the I/O network is 300 Mbyte/s, which corresponds to the bandwidth of InfiniBand over a serial copper link. The local I/O bus, on the other hand, has no bandwidth restrictions, thus approximating an architecture in which each I/O device resides on a private bus. In addition, each SCSI disk is attached to a separate SCSI bus to keep SCSI bus and host adapter contention from affecting the results.
Figure 35: I/O Bandwidth Scaling Comparison (four panels: nonsequential and sequential accesses with 16 Kbyte and 64 Kbyte blocks; aggregate bandwidth in Mbyte/s, up to 320 Mbyte/s for the sequential cases, versus the number of streams from 0 to 24, for user-level I/O and kernel-based I/O)
The graphs show that the UIO architecture achieves significantly improved bandwidth compared to the kernel-based I/O architecture, resulting in twice the bandwidth for 23 disks. For nonsequential accesses, scalability is almost linear since host processor occupancy is the only limiting factor. For sequential accesses, bandwidth starts to saturate as it approaches the 300 Mbyte/s limit of the I/O network.
Also note that kernel-based I/O never outperforms UIO, even though UIO requests experience a slightly higher latency due to the additional network traversal. These particular experiments were performed using the fast trap mechanism to insert the process context on a CSB flush. The TLB uses kernel-based interrupt handling. However, since the benchmark does not perform any other operations besides issuing I/O requests, the choice of hardware or software mechanism does not affect the overall bandwidth. In the UIO case, the CPU utilization for 23 request streams is below 2 percent, indicating that mechanisms with slightly higher overhead would not change these results. In the case of kernel-based I/O, however, CPU utilization is close to 100 percent and hence the overall bandwidth is saturated.
Figure 36 shows how a finite network bandwidth of 100 Mbyte/s affects bandwidth scaling. In the simulator prototype, bandwidth is limited by throttling the maximum rate at which the UIO device DMA engines issue bus transactions. The effect of a limited-bandwidth I/O network is similar to the operating system saturation effect for kernel-based I/O. Note that for nonsequential accesses, even small numbers of streams show a decrease in throughput. The lower network bandwidth increases overall request latency. Because the benchmark is not able to sufficiently overlap the long disk seek latencies with other requests even without a bandwidth-limited network, overall throughput decreases further.
Figure 36: I/O Bandwidth Scaling Comparison with Network Effects. [Four panels plot aggregate bandwidth in Mbyte/s against the number of streams (0 to 24) for nonsequential and sequential accesses with 16 Kbyte and 64 Kbyte blocks, comparing kernel-based I/O with user-level I/O over 100 Mbyte/s and 300 Mbyte/s networks.]
The sequential streams, on the other hand, are able to approach the 100 Mbyte/s limit quickly. Furthermore, throughput remains close to this limit for an increasing number of request streams, indicating that in the presence of sufficient concurrency, the UIO architecture is able to sustain throughputs close to the hardware limit. Note that, as described in Section 6.6, an increased TLB miss latency translates into lower effective bandwidth of the DMA engines, which leads to similar throughput reductions as if the network itself were bandwidth limited. Because the microbenchmark does not utilize the processor sufficiently, the impact of other overheads on system throughput cannot be measured.
8.3 Application Throughput Scaling
The experiment described in this section evaluates the throughput scalability of a realistic I/O-intensive application, the MySQL database server [98]. Similar to the measurements presented in Section 2.2.4, a number of database queries are issued to separate tables residing on independent disks. A single instance of the database server executes a thread for each query to hide I/O latency.
The MySQL database server remained unmodified; only the user-level thread library was extended to include support for user-level I/O. As outlined earlier, the library maintains a table of user-level file descriptors as well as a list of UIO mount points. When a file is opened, the library compares the absolute filename to the list of mount points to distinguish regular files from remote files. For remote files the library maintains the file offset as well as access and routing information for the corresponding remote disk. The thread structure is extended by a flag indicating whether the thread is blocked on a UIO request. Blocked threads are skipped by the scheduler. The flag is set when a request is issued and is reset by the notification handler. This design enables a simple lock-free implementation of thread blocking and scheduling. However, leaving blocked threads in the run queue increases the complexity of the scheduler slightly, and can lead to increased thread switch latencies if a large number of threads is blocked. In the experiments reported here the number of threads is small enough that this simple design is adequate. Alternatively, after issuing a UIO request, threads can be moved to a blocked-thread list. A global flag indicates whether at least one thread can be unblocked, in which case the scheduler scans the list of blocked threads and removes any that can be unblocked. This alternate scheme can result in lower thread switch latencies for large numbers of threads. However, for small to moderate numbers of threads, the performance impact on the scheduler due to the additional flag is likely comparable to skipping blocked threads.
Figure 37 compares the throughput scaling for varying numbers of concurrent queries for a conventional kernel-based I/O system and the user-level I/O architecture.
Figure 37: MySQL Database Throughput Scaling. [The graph plots speedup (0 to 10) against the number of queries (0 to 16) for user-level I/O and kernel-based I/O.]
In the kernel-based I/O system, throughput saturates at nine queries as the CPU utilization approaches 100 percent. Approximately 35 to 36 percent of the busy CPU time is spent executing operating system code directly or indirectly related to I/O processing. The user-level I/O architecture reduces this overhead by two orders of magnitude and as a result is able to scale to larger numbers of queries. Indeed, throughput for 15 queries is improved by close to 25 percent. The reason throughput does not improve by the theoretically possible 35 percent is that the database server uses additional internal synchronization that somewhat limits thread concurrency. In addition, slight changes in disk request arrival times can lead to significant differences in head seek times, thus affecting disk bandwidth. Finally, the simulated I/O network limits bandwidth to 300 Mbyte/s. The synthetic benchmark in the previous section demonstrates that this can lead to throughput saturation starting at 13 streams.
It should be noted that even for small numbers of queries, the kernel-based I/O system never outperforms UIO. Despite the fact that individual UIO requests incur a slightly higher latency due to the additional network traversal, the lower overhead allows better latency hiding and hence better throughput scaling.
The experiments also showed that thread concurrency is critical to achieve maximum throughput, and that application performance is very sensitive to the locking strategy. For instance, MySQL implements a global index cache that is shared by all threads, as well as a global cache of table data structures. Accessing these structures during indexed queries or when opening a table blocks all other threads and effectively eliminates thread concurrency, even for threads operating on different tables or databases. This design is acceptable for environments with a low degree of concurrency due to few requests or high I/O overhead, but it can have a significant performance impact on the highly parallel user-level I/O architecture. The queries in this experiment do not use the table index and are hence able to maximize concurrency for the data retrieval phase. As a result, these queries approximate the performance of an optimized database implementation with fine-grain locking of index and table data structures.
8.4 Summary
The effectiveness of the user-level I/O mechanisms is evaluated in the context of a distributed autonomous storage architecture similar to network-attached secure disks. Bypassing the kernel for disk I/O operations means that applications must maintain file state normally managed by the kernel. The per-file state may include access and routing information for the remote device and the file offset. Maintaining this information at user level rather than in the kernel does not affect the protection model, as remote disks do not trust clients regardless of whether requests are issued by the kernel or by applications. Dynamically linked libraries can be used to hide hardware-dependent details of the architecture, to maintain per-file state, and to present a conventional blocking I/O interface to applications.
A synthetic benchmark that issues I/O requests to a set of disks shows that due to significantly reduced overhead, the user-level I/O architecture is able to provide twice the bandwidth of a kernel-based I/O system while at the same time reducing CPU utilization to less than two percent. For the MySQL database server, these improvements translate into 25 percent increased throughput for 15 concurrent queries without any application restructuring.
9. CONCLUSIONS
The performance of the I/O subsystem is becoming increasingly important for a variety of applications. Many of these I/O-intensive applications are part of a growing market for multiprocessor servers and experience increasing performance requirements. At the same time, technological trends lead to a growing gap between application and operating system performance. As a result, latency hiding techniques employed to improve throughput in the presence of long I/O access latencies are becoming less effective as the operating system overhead of I/O operations limits their scalability. Bypassing the operating system for I/O operations while maintaining most of the characteristics of kernel-based I/O allows applications to maximize the performance benefit of overlapping I/O operations without requiring extensive software modifications.
9.1 User-level I/O Architecture
This thesis introduced an I/O architecture that reduces I/O overhead by giving applications direct and protected access to the I/O subsystem. The architecture builds on a distributed organization of autonomous I/O devices that connect to clients over a scalable system-area network. Clients communicate with remote devices through a local I/O network interface. Together with the host processor bus interface, this user-level I/O device provides mechanisms that implement protected user-level request initiation, user-space data transfer, and completion notification. The architectural cost of these mechanisms is low as they require only small amounts of chip area and do not affect the cycle time of the processor core or I/O device. A stateless communication protocol facilitates an efficient and scalable hardware implementation of the UIO device.
Building on the basic mechanisms, the user-level I/O architecture presents a lightweight nonblocking programming interface similar to many message-passing architectures. On top of this low-level interface a variety of standard I/O interfaces can be implemented that allow applications to take advantage of the low-overhead I/O mechanisms with no or minimal software restructuring. The flexibility of the underlying mechanisms keeps the overhead of such libraries low. Bypassing the operating system allows the architecture to provide scalable performance that is not limited by the growing CPU-memory performance disparity.
A prototype of the user-level I/O architecture has been implemented in an execution-driven system simulator. To this end, an existing simulation tool has been significantly extended to combine detailed processor, cache, and I/O device models with a UNIX-compatible operating system. Validation of the simulation system against an existing workstation demonstrates that the tool is able to accurately capture the microarchitectural characteristics as well as most of the operating system performance of a real computer system.
Microbenchmarks show that the user-level I/O architecture reduces I/O overhead by a factor of 100, and that the overhead is less sensitive to technological trends than that of an operating-system-based I/O architecture. For 23 independent request streams, these overhead reductions lead to bandwidth improvements of a factor of two, while at the same time reducing the processor occupancy by almost two orders of magnitude. Whereas a traditional kernel-based I/O system saturates at eight to ten streams due to software overhead, the UIO system exhibits almost linear scalability up to the bandwidth limit imposed by other factors such as the network or I/O bus.
A public-domain database server is able to improve its throughput for 15 concurrent queries by up to 25 percent. The UIO system is able to make almost 100 percent of the processor cycles available for application processing. Most importantly, these performance improvements are realized without any software modifications outside the user-level thread library.
9.2 Limitations and Future Work
Due to its limited scope, this study is not able to explore all aspects of the user-level I/O architecture. Future work includes a more thorough I/O characterization and performance analysis of a representative set of I/O-intensive applications. The throughput improvement resulting from the reduced I/O overhead depends on the I/O characteristics of individual applications. In general, applications with a high I/O-to-computation ratio will benefit more from the reduced I/O overhead. On the other hand, applications with a significant computation component benefit indirectly and less substantially from lower overhead as more CPU cycles become available to the application. A detailed analysis of common I/O-intensive workloads such as Web servers and collaborative business software is needed to quantify the realized performance improvements for a wider variety of applications.
The complexity versus overhead trade-off of the architectural mechanisms can only be explored under a wide variety of realistic workloads. This study has quantified the overhead contributions of the individual phases of an I/O transaction using fine-grain measurements and microbenchmarks. The impact of these factors on overall system performance depends on the CPU utilization of the application. In some cases it may be acceptable to incur the cost of a system call to initiate an I/O transfer, but other applications can benefit from further overhead reductions. In addition, the latency of I/O requests affects the ability of applications to overlap requests and to tolerate overhead. These factors need further investigation with a representative set of applications.
9.2.1 General-purpose User-level I/O
The prototype implementation of the user-level I/O architecture focuses on disk I/O. Expanding the architecture to include support for other high-performance I/O devices such as network adapters and graphics adapters is needed to make it a true alternative to kernel-based I/O. Fortunately, graphics device access is similar to disk I/O as requests are always initiated by the host processor. Furthermore, almost all transfers occur from memory to the device. Network I/O, on the other hand, can generate incoming messages without an explicit request by the host processor. This unexpected message arrival either requires intermediate buffering, or a modification of the programming interface similar to Active Messages. A combination of these two mechanisms would be able to provide the performance of direct user-space transfers in many cases, and fall back to intermediate buffering in exceptional situations. Furthermore, the inexpensive data transfer mechanism provided by the UIO architecture may be used to improve local data transfers between kernel and user buffers. Extending the scope of the user-level I/O architecture even further, the low-overhead communication mechanisms appear well suited to support message-passing communication between client systems. The significantly stricter latency requirements of most parallel programs amplify the need for low overhead.
9.2.2 Operating System Implications
Bypassing the operating system for I/O data transfers not only improves throughput, it also bypasses services normally provided by the operating system. In the case of disk I/O, one such service is the buffer cache. The buffer cache is a main memory cache of recently used disk blocks and is managed by the operating system. The cache exploits temporal locality both within an application and between applications, as well as spatial locality through prefetching. The user-level I/O architecture prototype bypasses the buffer cache on the client, but implements a similar cache on the remote disk. The remote disk cache exploits the fact that the autonomous disks operate on objects rather than disk blocks and employs the same prefetching and caching strategies as a client cache. This design allows a fair evaluation of the benefits of the user-level I/O architecture without cache effects biasing the results. However, the performance impact of a remote, distributed cache compared to a local cache needs to be evaluated in a realistic environment with a larger number of applications.
Bypassing the operating system for I/O requests also means that the kernel is unable to consider I/O activity in the process scheduler. Normally, a process is suspended by the kernel during an I/O operation. In the UIO architecture, if the initiating application is unable to overlap the I/O latency with other work, it needs to explicitly cause a context switch. However, since the kernel is unaware of the request completion, it may not reschedule the application immediately, resulting in longer I/O latency and poor throughput. Enabling the kernel to wake up a process waiting for a user-level notification requires that the application inform the kernel of this situation through a system call. The prototype UIO implementation already utilizes a fast kernel trap for initial notification handling. A possible solution is to treat user-level notifications as a special UNIX signal that is part of the normal signal mask but is handled via the UIO notification mechanism. This would allow applications to suspend themselves waiting for a signal, while the kernel wakes up the process as soon as the notification arrives. The implementation details of this scheme as well as possible performance penalties due to increased notification trap handler complexity require further investigation.
9.2.3 Next-generation Architectures
The increasing performance penalty of processor pipeline stalls has led to the development of various kinds of multithreaded microprocessors. These processors provide hardware support for inexpensive thread-level context switches to hide frequent but relatively short latencies. Combining the latency-hiding capability of these processors with the low-overhead user-level I/O architecture leads to new challenges as each processor supports multiple simultaneously active contexts. At the same time, the ability to concurrently execute instructions from multiple instruction streams enables novel ways to hide the remaining I/O overhead.
9.2.4 I/O Programming Paradigms
Traditionally, I/O operations have been considered expensive by programmers, both in terms of latency to complete the request and in terms of software overhead. Overhead in particular has often prevented aggressive application-level I/O request scheduling or prefetching. The low per-request overhead of the user-level I/O architecture warrants a reevaluation of the trade-off between computation and I/O. For instance, requesting data from a remote device where it has been cached can be at least as fast as copying the data from the local buffer cache, and may incur less processor overhead. Issuing nonbinding prefetch hints to an intelligent disk allows the application to communicate knowledge about access patterns to the disk with extremely low overhead. The low cost of initiating data transfers between clients and I/O devices may allow programmers to eliminate some computation in favor of I/O operations. Revisiting the current I/O programming paradigm in the face of highly parallel and distributed I/O architectures combined with low-overhead user-level device access may lead to innovative methods to improve performance of data-intensive applications.
REFERENCES
[1] AIC-7770 Data Book, Adaptec, 1992.
[2] Alpha 21164 Microprocessor: Hardware Reference Manual, Compaq Computer, Houston, Tex., 1995.
[3] J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?," Proc. 16th ACM Symp. Operating Systems Principles (SOSP-16), ACM Press, New York, N.Y., 1997, pp. 357-390.
[4] T.E. Anderson et al., "Serverless Network File Systems," Proc. 15th ACM Symp. Operating Systems Principles (SOSP-15), ACM Press, New York, N.Y., 1995, pp. 109-126.
[5] R.H. Arpaci-Dusseau et al., "The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs," Proc. 4th Int'l Symp. High-Performance Computer Architecture (HPCA-4), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 90-101.
[6] G. Banga and P. Druschel, "Measuring the Capacity of a Web Server Under Realistic Loads," World Wide Web Journal Special Issue on World Wide Web Characterization and Performance Evaluation, vol. 2, no. 1, May 1999, pp. 69-83.
[7] D. Banks and M. Prudence, "A High-performance Network Architecture for a PA-RISC Workstation," IEEE Journal on Selected Areas in Communications, vol. 11, no. 2, Feb. 1993, pp. 191-202.
[8] J.S. Barrera III, "A Fast MACH Network IPC Implementation," Proc. Usenix Mach Symp., Usenix Assoc., Berkeley, Calif., 1991, pp. 1-11.
[9] L.A. Barroso, K. Gharachorloo, and E. Bugnion, "Memory System Characterization of Commercial Workloads," Proc. 25th Int'l Symp. Computer Architecture (ISCA-25), ACM Press, New York, N.Y., 1998, pp. 3-14.
[10] D. Bhandarkar and J. Ding, "Performance Characterization of the Pentium Pro Processor," Proc. 3rd Int'l Symp. High-Performance Computer Architecture (HPCA-3), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 288-297.
[11] B.N. Bershad, D.D. Redell, and J.R. Ellis, "Fast Mutual Exclusion for Uniprocessors," Proc. 5th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), ACM Press, New York, N.Y., 1992, pp. 223-233.
[12] B.N. Bershad et al., "Extensibility, Safety and Performance in the SPIN Operating System," Proc. 15th ACM Symp. Operating System Principles (SOSP-15), ACM Press, New York, N.Y., 1995, pp. 267-284.
[13] M.A. Blumrich et al., "Protected, User-level DMA for the SHRIMP Multicomputer," Proc. 2nd Int'l Symp. High Performance Computer Architecture (HPCA-2), IEEE CS Press, Los Alamitos, Calif., 1996, pp. 154-165.
[14] M.A. Blumrich et al., "Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer," Proc. 21st Int'l Symp. Computer Architecture (ISCA-21), ACM Press, New York, N.Y., 1994, pp. 143-153.
[15] N.J. Boden et al., "Myrinet: A Gigabit-per-Second Local Area Network," IEEE Micro, vol. 15, no. 1, Feb. 1995, pp. 29-36.
[16] R. Bordawekar, Quantitative Characterization and Analysis of the I/O Behavior of a Commercial Distributed-shared-memory Machine, tech. report CACR 157, Center for Advanced Computing Research, California Inst. of Technology, Pasadena, Calif., 1998.
[17] U. Brüning and L. Schaelicke, "Atoll: A High-performance Communication Device for Parallel Systems," Proc. 1997 Conf. Advances in Parallel and Distributed Computing, IEEE CS Press, Los Alamitos, Calif., 1997, pp. 228-234.
[18] J.C. Brustoloni and P. Steenkiste, "Effects of Buffering Semantics on I/O Performance," Proc. Usenix 2nd Symp. Operating Systems Design and Implementation (OSDI '96), Usenix Assoc., Berkeley, Calif., 1996, pp. 227-291.
[19] D. Burger and T.M. Austin, The SimpleScalar Tool Set Version 2.0, tech. report #1342, Computer Science Dept., Univ. of Wisconsin-Madison, Madison, Wis., 1997.
[20] R. Card, É. Dumas, and F. Mével, The LINUX Kernel Book, John Wiley & Sons, New York, N.Y., 1998.
[21] Y. Chen et al., "UTLB: A Mechanism for Address Translation on Network Interfaces," Proc. 8th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), ACM Press, New York, N.Y., 1998, pp. 193-204.
[22] B. Clarke, "Net Appliance Keeps Database Current," EE Times, Issue 1051, Mar. 1999.
[23] D. Culler et al., The Generic Active Message Interface Specification, white paper, Computer Science Division, Univ. of California, Berkeley, Calif., 1994.
[24] A. Davis, M. Swanson, and M. Parker, "Efficient Communication Mechanisms for Cluster Based Parallel Computing," Proc. 1st Int'l Workshop Communication and Architectural Support for Network-based Parallel Computing (CANPC '97), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 1-15.
[25] Design Compiler Reference Manual v2000.05, Synopsys, Mountain View, Calif., 2000.
[26] K. Diefendorff and P.K. Dubey, "How Multimedia Workloads Will Change Processor Design," IEEE Computer, vol. 30, no. 9, Sept. 1997, pp. 43-45.
[27] P. Druschel and L.L. Peterson, "Fbufs: A High-bandwidth Cross-domain Transfer Facility," Proc. 14th ACM Symp. Operating Systems Principles (SOSP-14), ACM Press, New York, N.Y., 1993, pp. 189-202.
[28] P. Druschel, L.L. Peterson, and B.S. Davie, "Experiences with a High-Speed Network Adaptor: A Software Perspective," Proc. SIGCOMM '94 Symp., ACM Press, New York, N.Y., 1994, pp. 2-13.
[29] C. Dubnicki et al., "VMMC-2: Efficient Support for Reliable, Connection-Oriented Communication," Proc. Hot Interconnects V, IEEE CS Press, Los Alamitos, Calif., 1997.
[30] S.H. Duncan, C.D. Keefer, and T.A. McLaughlin, "High Performance I/O Design in the AlphaServer 4100 Symmetric Multiprocessing System," DEC Technical Journal, vol. 8, no. 4, June 1996, pp. 61-75.
[31] J.H. Edmondson et al., "Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC Microprocessor," Digital Technical Journal, vol. 7, no. 1, July 1995.
[32] T. von Eicken et al., "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," Proc. 15th ACM Symp. Operating Systems Principles (SOSP-15), ACM Press, New York, N.Y., 1995, pp. 40-53.
[33] T. von Eicken et al., "Active Messages: A Mechanism for Integrated Communication and Computation," Proc. 19th Int'l Symp. Computer Architecture (ISCA-19), ACM Press, New York, N.Y., 1992, pp. 256-266.
[34] R. Enbody, "Perfmon User's Guide," http://www.cse.msu.edu/~enbody/perfmon/index.html
[35] Y. Endo et al., "Using Latency to Evaluate Interactive System Performance," Proc. 2nd Symp. Operating System Design and Implementation (OSDI '96), Usenix Assoc., Berkeley, Calif., 1996, pp. 185-199.
[36] D.R. Engler, M.F. Kaashoek, and J. O'Toole Jr., "Exokernel: An Operating System Architecture for Application-Level Resource Management," Proc. 15th ACM Symp. Operating Systems Principles (SOSP-15), ACM Press, New York, N.Y., 1995, pp. 251-266.
[37] J.R. Eykholt et al., "Beyond Multiprocessing … Multithreading the SunOS Kernel," Proc. Summer '92 Usenix Conf., Usenix Assoc., Berkeley, Calif., 1992, pp. 11-18.
[38] B. Falsafi and D.A. Wood, "Scheduling Communication on an SMP Node Parallel Machine," Proc. 3rd Int'l Symp. High-Performance Computer Architecture (HPCA-3), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 288-297.
[39] M. Fillo and R.B. Gillett, "Architecture and Implementation of Memory Channel 2," DEC Technical Journal, vol. 9, no. 1, July 1997.
[40] A. Gallatin, J. Chase, and K. Yocum, "Trapeze/IP: TCP/IP at Near-Gigabit Speeds," Proc. 1999 Usenix Technical Conference, Usenix Assoc., Berkeley, Calif., 1999, pp. 109-120.
[41] G.R. Ganger, B.L. Worthington, and Y.N. Patt, The DiskSim Simulation Environment Version 1.0 Reference Manual, tech. report CSE-TR-358-98, Dept. Electrical Eng. and Computer Science, Univ. of Michigan, Ann Arbor, Mich., 1998.
[42] G. Gibson et al., "A Cost-Effective, High-Bandwidth Storage Architecture," Proc. 8th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), ACM Press, New York, N.Y., 1998, pp. 92-103.
[43] R.B. Gillett, "Memory Channel Network for PCI," IEEE Micro, vol. 16, no. 1, Feb. 1996, pp. 12-18.
[44] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, Morgan Kaufmann Publishers, San Francisco, Calif., 1995.
[45] J.L. Henning, "SPEC CPU2000: Measuring CPU Performance in the New Millennium," IEEE Computer, vol. 33, no. 7, July 2000, pp. 28-35.
[46] M.P. Herlihy, "A Methodology for Implementing Highly Concurrent Data Objects," Proc. 2nd ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, ACM Press, New York, N.Y., 1990, pp. 197-206.
[47] M.P. Herlihy and J.E.B. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," Proc. 20th Int'l Symp. Computer Architecture (ISCA-20), ACM Press, New York, N.Y., 1993, pp. 289-300.
[48] S.A. Herrod, Using Complete Machine Simulation to Understand Computer System Behavior, doctoral dissertation, Computer Science Dept., Stanford University, Stanford, Calif., 1998.
[49] R. Horst, "TNet: A Reliable System Area Network," IEEE Micro, vol. 15, no. 1, Feb. 1995, pp. 37-45.
[50] N.C. Hutchinson and L.L. Peterson, "The x-Kernel: An Architecture for Implementing Network Protocols," IEEE Trans. Software Engineering, vol. 17, no. 1, Jan. 1991, pp. 64-76.
[51] InfiniBand Architecture Specification Release 1.0, InfiniBand Trade Association, Portland, Ore., 2000.
[52] B.L. Jacob and T.N. Mudge, "A Look at Several Memory Management Units, TLB-Refill Mechanisms, and Page Table Organizations," Proc. 8th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), ACM Press, New York, N.Y., 1998, pp. 295-306.
[53] M.J. Kilgard, D. Blythe, and D. Hohn, "System Support for OpenGL Direct Rendering," Proc. Graphics Interface, Canadian Human-Computer Communications Soc., Toronto, Ont., Canada, 1995, pp. 116-127.
[54] J. Kluge et al., "The ATOLL Approach for a Fast and Reliable System Area Network," Proc. 3rd Int'l Workshop Advanced Parallel Processing Technologies (APPT '99), Publishing House of Electronics Industry, Beijing, China, 1999, pp. 99-105.
[55] A. Kumar, "The HP PA-8000 RISC CPU," IEEE Micro, vol. 17, no. 2, Mar. 1997, pp. 27-32.
[56] J. Laudon and D. Lenoski, System Overview of the SGI Origin 200/2000 Product Line, white paper, SGI, Mountain View, Calif., 1997.
[57] E.K. Lee, Performance Modeling and Analysis of Disk Arrays, doctoral dissertation, Computer Science Division, Univ. of California, Berkeley, Calif., 1993.
[58] B.-H. Lim et al., "Message Proxies for Efficient, Protected Communication on SMP Clusters," Proc. 3rd Int'l Symp. High-Performance Computer Architecture (HPCA-3), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 116-127.
[59] B.N. Lipchak et al., "PowerStorm 4DT: A High-performance Graphics Software Architecture," Digital Technical Journal, vol. 9, no. 4, June 1997, pp. 49-60.
[60] M48T02 Data Sheet, ST Microelectronics, Dallas, Texas, 1998.
[61] P.S. Magnusson et al., "SimICS/sun4m: A Virtual Workstation," Proc. Usenix Conf., Usenix Assoc., Berkeley, Calif., 1998, pp. 119-130.
[62] A.M. Mainwaring and D.E. Culler, "Design Challenges of Virtual Networks: Fast, General-Purpose Communication," Proc. 7th ACM SIGPLAN Symp. Principles and Practices of Parallel Programming, ACM Press, New York, N.Y., 1999, pp. 119-130.
[63] M.J. Marchi and A. Watson, The Network Appliance Enterprise Storage Architecture: System and Data Availability, tech. report TR3065, Network Appliance, Sunnyvale, Calif., 1999.
[64] E.P. Markatos and M.G.H. Katevenis, "User-Level DMA without Operating System Kernel Modifications," Proc. 3rd Symp. High-Performance Computer Architecture (HPCA-3), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 322-331.
[65] R.P. Martin et al., "Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture," Proc. 24th Int'l Symp. Computer Architecture (ISCA-24), ACM Press, New York, N.Y., 1997, pp. 85-97.
[66] M.K. McKusick et al., The Design and Implementation of the 4.4 BSD Operating System, Addison Wesley Longman, Boston, Mass., 1996.
[67] L. McVoy and C. Staelin, "lmbench: Portable Tools for Performance Analysis," Proc. Usenix Ann. Technical Conf., Usenix Assoc., Berkeley, Calif., 1998, pp. 279-294.
[68] MIPS R10000 Microprocessor User's Manual, Version 2.0, MIPS Technologies, Mountain View, Calif., 1996.
[69] Module Compiler Reference Manual v2000.05, Synopsys, Mountain View, Calif., 2000.
[70] D. Mosberger and L.L. Peterson, "Making Paths Explicit in the Scout Operating System," Proc. Usenix 2nd Symp. OS Design and Implementation (OSDI '96), Usenix Assoc., Berkeley, Calif., 1996, pp. 153-168.
[71] D. Mosberger, P. Druschel, and L.L. Peterson, "Implementing Atomic Sequences on Uniprocessors Using Rollforward," Software: Practice and Experience, vol. 26, no. 1, Jan. 1996, pp. 1-24.
[72] M. Moudgill and S. Vassiliadis, "Precise Interrupts," IEEE Micro, vol. 16, no. 1, Feb. 1996, pp. 58-67.
[73] S.S. Mukherjee and M.D. Hill, "The Impact of Data Transfer and Buffering Alternatives on Network Interface Design," Proc. 4th Int'l Symp. High-Performance Computer Architecture (HPCA-4), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 207-218.
[74] J.K. Ousterhout, "Why Aren't Operating Systems Getting Faster as Fast as Hardware?," Proc. Usenix Summer Conference, Usenix Assoc., Berkeley, Calif., 1990, pp. 247-256.
[75] V.S. Pai, P. Druschel, and W. Zwaenepoel, "IO-Lite: A Unified I/O Buffering and Caching System," Proc. Usenix 3rd Symp. Operating Systems Design and Implementation (OSDI '99), Usenix Assoc., Berkeley, Calif., 1999, pp. 15-28.
[76] V.S. Pai, P. Druschel, and W. Zwaenepoel, "Flash: An Efficient and Portable Web Server," Proc. Usenix Ann. Technical Conf., Usenix Assoc., Berkeley, Calif., 1999, pp. 199-212.
[77] V.S. Pai, P. Ranganathan, and S.V. Adve, RSIM Reference Manual, Version 1.0, tech. report 9705, Dept. Electrical and Computer Eng., Rice Univ., Houston, Tex., 1997.
[78] M.A. Pagels, P. Druschel, and L.L. Peterson, Cache and TLB Effectiveness in Processing Network I/O, tech. report 94-08, Dept. Computer Science, Univ. of Arizona, Tucson, Ariz., 1994.
[79] J. Pasquale et al., "High-Performance I/O and Networking Software in Sequoia 2000," Digital Technical Journal, vol. 7, no. 3, Mar. 1995, pp. 84-96.
[80] J. Pasquale, E.W. Anderson, and P.K. Muller, "Container-Shipping: Operating System Support for Intensive I/O Applications," IEEE Computer, vol. 27, no. 3, Mar. 1994, pp. 84-93.
[81] PCI Local Bus Specification, Revision 2.1, PCI Special Interest Group, Portland, Ore., 1995.
[82] G. Pfister, In Search of Clusters, Second Edition, Prentice Hall, Upper Saddle River, N.J., 1998.
[83] PowerPC Microprocessor Family: The Programming Environments for 32-Bit Microprocessors, Motorola, Schaumburg, Ill., 1997.
[84] PowerPC Microprocessor Family: The Bus Interface for 32-Bit Microprocessors, Motorola, Schaumburg, Ill., 1997.
[85] P. Ranganathan et al., "Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors," Proc. 8th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), ACM Press, New York, N.Y., 1998, pp. 307-318.
[86] R.F. Rashid et al., "Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures," IEEE Transactions on Computers, vol. 37, no. 8, Aug. 1988, pp. 896-908.
[87] M. Rosenblum et al., "Using the SimOS Machine Simulator to Study Complex Computer Systems," ACM TOMACS Special Issue on Computer Simulation, 1997, pp. 79-103.
[88] M. Rosenblum et al., "The Impact of Architectural Trends on Operating System Performance," Proc. 15th ACM Symp. Operating System Principles (SOSP-15), ACM Press, New York, N.Y., 1995, pp. 285-298.
[89] L. Rzymianowicz et al., "ATOLL: A Network on a Chip," Proc. 1999 Conf. Parallel and Distributed Processing Techniques and Applications, CSREA Press, Las Vegas, Nev., 1999, pp. 2307-2313.
[90] L. Schaelicke, "L-RSIM: A Simulation Environment for I/O Intensive Workloads," Proc. 3rd Ann. IEEE Workshop Workload Characterization 2000, IEEE CS Press, Los Alamitos, Calif., 2000, pp. 83-89.
[91] L. Schaelicke and A. Davis, "Improving I/O Performance with a Conditional Store Buffer," Proc. 31st Int'l Symp. Microarchitecture (MICRO-31), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 160-169.
[92] L. Schaelicke, A. Davis, and S.A. McKee, "Profiling I/O Interrupts in Modern Architectures," Proc. 8th Int'l Symp. Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2000), IEEE CS Press, Los Alamitos, Calif., 2000, pp. 115-123.
[93] I. Schoinas and M.D. Hill, "Address Translation Mechanisms in Network Interfaces," Proc. 4th Int'l Symp. High-Performance Computer Architecture (HPCA-4), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 219-230.
[94] SGI, "IRIX man pages," http://techpubs.sgi.com/library/
[95] SGI, "IRIXview User's Guide," http://techpubs.sgi.com
[96] K. Skadron and D. Clark, "Design Issues and Tradeoffs for Write Buffers," Proc. 3rd Int'l Symp. High-Performance Computer Architecture (HPCA-3), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 144-155.
[97] K. Swartz, "The Brave Little Toaster Meets Usenet," Proc. 10th Usenix Large Installation System Administration Conference (LISA X), Usenix Assoc., Berkeley, Calif., 1996, pp. 161-170.
[98] TcX AB, Detron HB, and Monty Program KB, "MySQL Reference Manual Version 3.2.1," http://www.mysql.com/documentation/mysql/
[99] M.N. Thadani and Y.A. Khalidi, An Efficient Zero-Copy I/O Framework for UNIX, tech. report SMLI TR95-39, Sun Microsystems Laboratories, Palo Alto, Calif., 1995.
[100] C.A. Thekkath and H.M. Levy, "Hardware and Software Support for Efficient Exception Handling," Proc. 6th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), ACM Press, New York, N.Y., 1994, pp. 110-119.
[101] J. Torrellas, A. Gupta, and J. Hennessy, "Characterizing the Caching and Synchronization Performance of a Multiprocessor Operating System," Proc. 5th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), ACM Press, New York, N.Y., 1992, pp. 162-174.
[102] UltraSPARC User's Manual, Sun Microsystems, Palo Alto, Calif., 1997.
[103] Ultrastar 9ZX Hardware/Functional Specification, 9.11 GB Model, 10020 RPM, Version 1.01, Document Number AS19-0217-01, IBM Storage Systems Division, San Jose, Calif., 1997.
[104] D.L. Weaver and T. Germond, The SPARC Architecture Manual, Version 9, Prentice Hall, Upper Saddle River, N.J., 1994.
[105] E.H. Welbon et al., "POWER2 Performance Monitor," IBM J. Research and Development, vol. 38, no. 5, May 1994, pp. 545-554.
[106] M. Welsh, A. Basu, and T. von Eicken, Incorporating Memory Management into User-Level Network Interfaces, tech. report TR97-1620, Dept. Computer Science, Cornell Univ., Ithaca, N.Y., 1997.
[107] M. Wittle and B.E. Keith, "LADDIS: The Next Generation of NFS File Server Benchmarking," Proc. Usenix Summer 1993 Technical Conf., Usenix Assoc., Berkeley, Calif., 1993, pp. 111-128.