ARCHITECTURAL SUPPORT FOR
USER-LEVEL INPUT/OUTPUT
by
Lambert Schaelicke
A dissertation submitted to the faculty of The University of Utah
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Computer Science
School of Computing
The University of Utah
December 2001
Copyright © Lambert Schaelicke 2001
All Rights Reserved
THE UNIVERSITY OF UTAH GRADUATE SCHOOL
SUPERVISORY COMMITTEE APPROVAL
of a dissertation submitted by
Lambert Schaelicke
This dissertation has been read by each member of the following supervisory committee and by majority vote has been found to be satisfactory.
Alan L. Davis
Erik L. Brunvand
John B. Carter
Sally A. McKee
Ulrich Brüning
Chair:
THE UNIVERSITY OF UTAH GRADUATE SCHOOL
FINAL READING APPROVAL
To the Graduate Council of the University of Utah:
I have read the dissertation of Lambert Schaelicke in its final form and have found that (1) its format, citations, and bibliographic style are consistent and acceptable; (2) its illustrative materials including figures, tables, and charts are in place; and (3) the final manuscript is satisfactory to the supervisory committee and is ready for submission to The Graduate School.
Alan L. Davis, Chair, Supervisory Committee
Date
Approved for the Major Department
Thomas C. Henderson, Director
Approved for the Graduate Council
David S. Chapman, Dean of The Graduate School
ABSTRACT
The performance of the input/output subsystem is becoming increasingly important for many applications. Commercial I/O intensive applications are a fast growing market segment and experience constantly increasing performance demands. Many of these applications exploit concurrency to overlap the latency of I/O operations to improve throughput. At the same time, semiconductor technology trends result in a growing gap between application and operating system performance. Consequently, operating system overhead increasingly limits the efficiency of latency-hiding techniques to improve throughput. This dissertation develops and evaluates a novel I/O architecture that, by providing user-level access to the I/O subsystem, minimizes I/O overhead while maintaining the level of protection and programming flexibility of conventional kernel-based architectures. Inexpensive hardware mechanisms in the I/O device and host processor implement protected user-level request initiation, user-space data transfers, and user-level notifications. Together, these mechanisms are able to reduce I/O overhead by up to two orders of magnitude. As a result, applications are able to efficiently overlap long-latency I/O operations to maximize throughput and to exploit the scalable bandwidth of next-generation distributed I/O architectures. The flexibility of the basic mechanisms facilitates library implementations of a variety of standard I/O programming models with low overhead, as the architecture does not restrict the allocation and use of I/O buffers.
A prototype of the user-level I/O architecture is implemented and evaluated in an execution-driven system simulator. The simulation system combines detailed models of a modern microprocessor and caches, which are based on an existing simulator, a memory controller and I/O devices, with a UNIX-compatible operating system. Validation of the simulator against a real workstation shows that the tool accurately captures the performance characteristics of existing computer systems. Synthetic benchmarks demonstrate that the user-level I/O architecture achieves twice the aggregate bandwidth on 23 request streams compared to kernel-based I/O, while at the same time reducing CPU occupancy by 98 percent. The MySQL database server is able to improve throughput by up to 25 percent, without requiring any program modifications.
CONTENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENTS
1. INTRODUCTION
   1.1 Organization
   1.2 I/O Intensive Applications
   1.3 Operating System Performance Trends
   1.4 Low-Overhead I/O
   1.5 Contributions
2. MODERN I/O ARCHITECTURE OVERVIEW
   2.1 Kernel-based I/O
   2.2 File I/O Overhead Characterization
   2.3 Network Attached Disks
   2.4 InfiniBand
   2.5 User-level Network Architectures
   2.6 Virtual I/O Devices
   2.7 Operating System Support for High-Performance I/O
   2.8 Summary
3. THE L-RSIM ARCHITECTURAL SIMULATOR
   3.1 Simulator Machine Model
   3.2 LAMIX Kernel
   3.3 Simulator Validation
   3.4 Summary
4. USER-LEVEL I/O ARCHITECTURE
   4.1 UIO Architecture Overview
   4.2 Application Interface
   4.3 Basic UIO Operation
   4.4 UIO Device Architecture
   4.5 Summary
5. ATOMIC DEVICE ACCESS
   5.1 The Conditional Store Buffer
   5.2 Conditional Store Buffer at the I/O Device
   5.3 Performance Evaluation
   5.4 Summary
6. DIRECT USER-SPACE TRANSFER
   6.1 Device TLB Design
   6.2 TLB Misses and Faults
   6.3 TLB Coherence and Consistency
   6.4 TLB Miss Handling with Kernel Interrupts
   6.5 TLB Miss Handling with a Programmable TLB Fill Engine
   6.6 Device TLB Performance Evaluation
   6.7 Summary
7. USER-LEVEL NOTIFICATION
   7.1 Lightweight Notification Mechanism
   7.2 Processor Notification Queue
   7.3 Multiprocessor Considerations
   7.4 Performance Evaluation
   7.5 Summary
8. PERFORMANCE EVALUATION
   8.1 Prototype Architecture
   8.2 UIO Bandwidth Scaling
   8.3 Application Throughput Scaling
   8.4 Summary
9. CONCLUSIONS
   9.1 User-level I/O Architecture
   9.2 Limitations and Future Work
REFERENCES
LIST OF TABLES
1. Experimental Platforms
2. Read System Call Overhead with Disk Access
3. Cache Misses
4. Simulator Configuration
5. LMBench Average Latency Ratios
6. LMBench Average Bandwidth Ratios
7. System Call Latencies in Microseconds
8. File System Performance in Creations/Deletions per Second
9. SPEC 2000 Runtime
10. UIO Request Structure
11. System Configurations
12. Kernel TLB Miss Handling Performance
13. Table Walk Engine Instruction Set Summary
14. Table Walk Engine Area Requirements
15. PowerPC Page Table Lookup Latencies
LIST OF FIGURES
1 Read System Call with Disk Access
2 Measuring Disk Read Overhead
3 Structure of Interrupt Overhead Experiments
4 I/O Bandwidth Scaling
5 MySQL Database Throughput Scaling
6 Network-attached Secure Disk Architecture
7 InfiniBand Distributed I/O Architecture
8 L-RSIM Machine Architecture
9 LMBench Memory Load Latency for 256-byte Stride
10 LMBench Memory Read Bandwidth
11 LMBench Disk Seek Time and Read Bandwidth
12 User-level I/O Architecture
13 User-level I/O Architecture Components
14 UIO Device Structure
15 Architectural Model with Conditional Store Buffer
16 Conditional Store Buffer
17 Conditional Store Buffer Implementation
18 Request Overhead
19 Context Switch Latency
20 Device TLB Design
21 TLB Refill Engine Architecture
22 32-bit PowerPC Page Table Lookup Overview
23 32-bit PowerPC Page Table Lookup Code
24 32-bit MIPS Page Table Lookup Overview
25 32-bit MIPS Page Table Lookup Code
26 IA-32 Page Table Lookup Overview
27 IA-32 Page Table Lookup Code
28 TLB Miss Handler Latency
29 TLB Miss Handler CPU Utilization
30 Effective DMA Bandwidth
31 Notification Handling
32 Queued Notification Handling
33 Notification Latencies
34 File Descriptors in a User-level I/O Library
35 I/O Bandwidth Scaling Comparison
36 I/O Bandwidth Scaling Comparison with Network Effects
37 MySQL Database Throughput Scaling
ACKNOWLEDGMENTS
This dissertation would not have been possible without the support and encouragement of many individuals, only some of whom can be mentioned here. My advisor Al Davis was always available for technical discussions and knew when to give me the freedom I needed to find fulfillment in this work. My supervisory committee members John Carter, Sally McKee, Erik Brunvand and Ulrich Brüning frequently gave advice on technical problems as well as helped improve the presentation of this dissertation. In addition, John Carter supported me through a research assistant position over several years. Through his genuine interest in this work despite the geographical distance, my external committee member Ulrich Brüning made my effort even more meaningful.
Frequent discussions with my fellow student Mike Parker helped solve many technical challenges and understand unexpected results. Seemingly unsolvable problems disappeared in the presence of his listening ears. Uros Prestor and Mike Hibler were instrumental in my understanding of modern operating system structure. The fruitful discussions with many of my fellow students helped my own understanding of the problem domain.
My family provided the support needed to complete a project of this scale. My wife Rita tirelessly encouraged me when I needed it and never made me feel guilty for spending long hours at the lab. My son Indri, although too young to know, gave me reasons to smile at the end of many long days.
This work was supported in part by the Defense Advanced Research Projects Agency under agreement numbers N0003995C0018 and F306029810101, by an NSF Research Infrastructure Grant Number CDA9623614 and by a Graduate Research Fellowship from the University of Utah Graduate School for the academic year 2000/2001.
1. INTRODUCTION
The efficiency of the input/output (I/O) subsystem plays a large role in the overall performance of many applications, and its relative importance is increasing. I/O intensive commercial applications are a significant and fast growing market for computer systems. Growing data volumes and users' performance demands lead to increasing pressure on the I/O subsystem. Responding to the importance of high I/O performance, next-generation I/O architectures are based on a distributed I/O device organization. A system-area network replaces the conventional I/O bus to connect multiple clients with a number of distributed I/O devices, providing better bandwidth scalability and greater expandability. At the same time, semiconductor technology trends result in a growing gap between application and operating system performance. The low cache and TLB locality of operating system code, together with low degrees of available instruction-level parallelism, makes operating system code mostly memory performance bound. As a result, many I/O intensive applications spend increasing amounts of time executing operating system code. This trend severely limits applications' ability to overlap long-latency I/O operations with independent work to improve throughput.
This dissertation introduces and evaluates a novel I/O architecture that minimizes I/O overhead by bypassing the operating system for I/O requests in the context of a distributed I/O architecture. It is based on innovative hardware features located in the host processor and the system-area network interface that allow user-level processes to directly access the I/O subsystem. As a result, I/O overhead is reduced by a factor of one hundred, allowing applications to more efficiently apply latency hiding techniques to improve I/O performance and resulting in better throughput scalability. Bypassing the operating system for I/O operations means that even in the face of the widening gap between memory and processor performance, I/O intensive applications are able to achieve high performance through concurrent I/O.
1.1 Organization
This dissertation is organized in nine chapters. The following sections discuss trends in I/O performance requirements and operating system behavior and then give an overview of the goals and mechanisms of the proposed user-level I/O architecture. The next chapter discusses the contemporary I/O architecture commonly found in workstations and servers, describes the sources of operating-system induced I/O overhead, presents methodologies to quantify the overhead, measures its impact on system performance, and discusses a variety of optimizations previously developed. Chapter 3 describes the functionality of the simulation system used for this work and presents results of its validation against a real workstation. Chapter 4 gives a detailed overview of the user-level I/O architecture, and Chapters 5 through 7 discuss and evaluate the individual hardware and software mechanisms. Chapter 8 expands on the microbenchmark measurements of the preceding chapters and presents performance results for a synthetic I/O intensive workload and a realistic database application. Finally, Chapter 9 summarizes the results, draws conclusions and explores areas of possible future work.
1.2 I/O Intensive Applications
Commercial applications such as databases, email and Web servers make up the largest and fastest growing market segment for multiprocessor computer systems. Many of these workloads are considered I/O intensive, because of the amount of data they process and because of users' performance demands. Most commercial applications operate on large data sets residing on secondary or tertiary storage, while communicating with distributed client systems over networks. Work environments are becoming more collaborative, thus placing higher performance demands on file servers [107], email and news servers [97]. The increased file server demands stem from a growth in data volume that outpaces the servers' caching capability, and from increasing request rates from more users. Although email and news server performance for the individual end user is usually not considered critical, the amount of data and number of files passing through the server can lead to significant I/O throughput demands. Distributed systems such as Sequoia [79] allow researchers to collaborate across geographically distributed laboratories. Data from satellite observations or scientific simulations can take on enormous volume, and exchanging such data between machines requires significant I/O bandwidth.
With the constant growth of the World Wide Web, Web servers have become a significant market. Web sites are also becoming larger and more complex, leading to increasing demands on web server performance [76] due to the larger data sets and higher request rates from more clients. In many cases the server performs no or only minimal operations on the data, thus emphasizing the importance of the I/O subsystem for overall performance. However, the increasing popularity of dynamically created web content leads to higher CPU performance requirements for web servers as well.
Multimedia applications such as video conferencing or video-on-demand not only tax the performance of current I/O subsystems but also require entirely new services such as guaranteed bandwidth and limited latency variation of I/O data streams. These requirements are only partly met by current I/O architectures.
Although databases traditionally emphasize reliability and availability over performance, the dramatic growth of commercial transactions performed over networks leads to higher performance requirements for database engines as well as to a growing market for commercial server systems. Databases, like Web and file servers, achieve high I/O throughput by overlapping I/O requests of independent transactions [9][85]. Commercial database servers usually run on shared-memory or clustered multiprocessor systems with a large number of disks. Hundreds of disks are required not only to store the database tables but also to provide sufficient parallelism and redundancy in the storage system to allow efficient overlap of requests to hide access latencies.
1.3 Operating System Performance Trends
At the same time as applications increase the demands on the I/O subsystem in terms of performance and services provided, semiconductor technology trends lead to I/O performance being more and more limited by operating system overheads. Because in most general-purpose systems the operating system is involved in all I/O operations, its performance directly affects I/O performance. Although the performance of the central processing unit (CPU) is generally improving at a faster rate [44] than I/O device performance [57], operating systems do not benefit from this trend as much as applications [74][88]. Because operating system code is executed relatively infrequently, it incurs many cache and translation lookaside buffer (TLB) misses and is hence more limited by main memory performance. OS code also does not benefit from dynamic instruction scheduling as much as application software because operating system code does not exhibit large amounts of instruction-level parallelism and frequently executes privileged instructions that serialize the superscalar processor pipeline.
I/O related operating system overhead can be categorized as either control overhead or data transfer overhead. Control overhead is a result of context switches and the crossing of protection domains. When an application issues an I/O request to the kernel, it performs a system call that saves process state and performs various protection checks as part of the user-to-kernel mode transition. Handling the I/O request may involve multiple kernel processes, in which case additional context switches and scheduling operations are performed. Finally, I/O completions are traditionally signaled to the host processor via interrupts, which incur additional context switches and disrupt the currently executing application. Data transfer related overhead is incurred when the kernel provides intermediate buffering for I/O data. In this case data are copied between application and kernel buffers, which consumes valuable processor cycles and negatively affects the application's cache and TLB behavior.
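The per-request costs just described can be observed directly from user space. The sketch below is my own illustration, not the dissertation's actual measurement harness: it times repeated read() system calls against a file that is warm in the buffer cache, so the measured time is dominated by control overhead and the kernel-to-user copy rather than by disk latency.

```python
import os
import tempfile
import time

def read_overhead(block_size, iterations=10000):
    """Average cost of one read() call on a buffer-cache-resident file."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(block_size))
        path = f.name
    fd = os.open(path, os.O_RDONLY)
    try:
        os.read(fd, block_size)              # warm the buffer cache
        start = time.perf_counter()
        for _ in range(iterations):
            os.lseek(fd, 0, os.SEEK_SET)     # rewind: every read hits the cache
            os.read(fd, block_size)          # syscall + kernel-to-user copy
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return elapsed / iterations              # seconds per call

# The fixed control cost dominates for small requests; the copy cost
# grows with the block size.
for size in (64, 4096, 65536):
    print(f"{size:6d} bytes: {read_overhead(size) * 1e6:.2f} us per read()")
```

Plotting the per-call time against the block size separates the two components: the intercept approximates the fixed control overhead, while the slope reflects the per-byte copy cost.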
Since both I/O device performance and I/O related operating system overhead improve at a slower rate than application performance, I/O overhead may become the dominating factor for many I/O intensive applications in the future. For instance, commercial database systems have until now been able to hide I/O latency almost completely. This is made evident by the fact that databases are now often memory performance limited [9]. High-performance microprocessors have so far allowed applications to offset the growing gap between I/O and processor performance through concurrency and latency hiding. Modern database systems spend between 10 and 30 percent of the total CPU cycles in operating system code [9]. Given the growing discrepancy between processor and memory performance, and hence between application and operating system performance, operating system overhead increasingly limits the ability of applications to exploit parallelism in I/O requests, which results in decreased throughput.
1.4 Low-Overhead I/O
The ability to hide any kind of latency depends on the availability of sufficient concurrency so that independent work can be performed during long-latency operations. Many server-class applications are able to exploit concurrency among requests to overlap I/O operations. However, the CPU overhead incurred by each long-latency operation limits the throughput improvements achievable by latency hiding, since overhead constitutes work that cannot be overlapped and that is executed sequentially. For example, a 10 percent overhead to schedule and complete a disk request means that a single processor is able to overlap at most 10 disk requests, at which point no CPU cycles are available for application processing. Although both the I/O latency and overhead vary widely between individual requests, different I/O devices, and request types, the overall effect is that overhead limits the scalability of I/O latency-hiding techniques.
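The arithmetic behind this ceiling can be made explicit (the notation here is mine, not the dissertation's):

```latex
% Let L be the latency of one I/O request and o the CPU overhead it incurs,
% so that f = o/L is the overhead as a fraction of the request latency.
% N overlapped requests keep the processor busy for a fraction N*o/L of the
% time; the processor saturates when this fraction reaches one:
N_{\max} \cdot \frac{o}{L} = 1
\quad\Longrightarrow\quad
N_{\max} = \frac{L}{o} = \frac{1}{f}.
% With f = 0.1 (the 10 percent overhead of the example in the text),
% N_max = 10 requests; halving the overhead doubles the achievable overlap.
```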
This work presents and evaluates an I/O architecture that minimizes I/O overhead by bypassing the operating system for performance-critical I/O operations. The user-level input/output (UIO) architecture addresses all sources of I/O overhead, from request initiation to data transfers to completion notifications, through a combination of novel hardware and software mechanisms. These mechanisms are inexpensive in terms of required chip area and do not impact the cycle time of the microprocessor or the I/O device. A stateless communication protocol is used to avoid scalability limitations in the I/O device. Extensions to the microprocessor are restricted to the bus interface where they do not impact the critical path of the execution core.
A software-controlled combining buffer in the processor bus interface enables applications to issue requests directly to the I/O subsystem, thus avoiding the cost of system calls and context switches. Data are transferred directly to and from application buffers with the help of an I/O device TLB to make optimal use of the available system bus bandwidth and to minimize CPU occupancy. A lightweight interrupt handler processes completion notifications from a notification queue in the processor bus interface and signals them to the user process, allowing applications to implement optimized low-overhead notification handling without expensive context switches. The overhead of an I/O request in this architecture is at least two orders of magnitude lower than that of a kernel-based I/O operation, and unlike for a kernel-mediated request, it is independent of the request size. As a result, applications are able to more efficiently overlap I/O operations with independent work, leading to higher and more scalable system throughput. The throughput improvement can be realized directly by the application through latency hiding techniques, or indirectly by independent processes that benefit from the increased CPU idle time during I/O operations. In many cases applications can realize performance improvements without extensive software restructuring as the user-level I/O architecture does not restrict the use or management of I/O buffers as previous work has often done [24][27][29][62][75][99].
The prototype implementation of this architecture is able to improve the aggregate bandwidth of 23 disk I/O streams of a distributed storage architecture by a factor of two while reducing CPU occupancy by almost a factor of 100. The significantly reduced overhead enables a database server to improve throughput for 15 requests by over 25 percent, without requiring changes to the program structure. The fact that the operating system is not involved in I/O operations means that the performance advantages of the UIO architecture can be retained in the presence of the widening processor/memory performance gap.
Previous work to minimize operating system overhead for improved I/O performance has often required changes of the I/O programming interface or has led to complex and nonscalable hardware implementations. The proposed user-level I/O architecture exports a flexible low-overhead nonblocking programming interface to applications, on top of which user-level libraries can implement a wide variety of application programming interfaces. These interfaces can range from a standard UNIX interface to user-level multithreading that implements conventional blocking I/O calls on a per-thread basis. The ability to implement these libraries with little overhead allows applications to realize performance improvements without modifications to the application structure. Alternatively, application programmers can directly access the low-level primitives to implement specialized synchronization schemes.
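As a concrete illustration of how a blocking call can be layered over a nonblocking request/notification interface, the following sketch emulates the idea in ordinary Python. The class and method names are invented for illustration only, not the dissertation's actual interface, and the "device" is played by a thread rather than real hardware.

```python
import queue
import threading

class EmulatedUIODevice:
    """Toy stand-in for a UIO-style device: accepts nonblocking requests and
    posts completion notifications to a user-visible queue."""
    def __init__(self, storage):
        self.storage = storage
        self.completions = queue.Queue()   # stands in for the notification queue

    def submit_read(self, request_id, offset, length):
        # A real UIO device would receive this request via a protected
        # user-level store; here a thread plays the device's role.
        def work():
            data = self.storage[offset:offset + length]
            self.completions.put((request_id, data))
        threading.Thread(target=work).start()

def blocking_read(device, offset, length):
    """User-level library layer: issue a nonblocking request, then wait for
    its completion notification -- conventional read() semantics without a
    kernel call on the request path."""
    device.submit_read(request_id=1, offset=offset, length=length)
    _rid, data = device.completions.get()  # block on the notification queue
    return data

dev = EmulatedUIODevice(b"hello, user-level I/O")
print(blocking_read(dev, 7, 14))           # b'user-level I/O'
```

A user-level threading package would instead switch to another thread while the notification is pending, preserving the calling thread's blocking semantics without idling the processor.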
A prototype of the user-level I/O architecture is implemented in an execution-driven simulator that combines detailed hardware models with a fully functional operating system. Execution-driven simulation is a commonly used tool to investigate the performance implications of novel hardware features on realistic workloads. The simulator used in this study accurately models hardware components of a workstation such as the dynamically scheduled processor with caches, a memory controller and I/O devices. On top of the simulated hardware runs a UNIX compatible multitasking operating system that includes a complete filesystem with buffer cache and device drivers. The simulation environment is extended with models of the proposed hardware features and software mechanisms to evaluate the feasibility and performance impact of the UIO architecture. Alternative implementations of individual mechanisms are evaluated using microbenchmarks and fine-grain measurements of latency and overhead. A synthetic I/O-intensive benchmark is used to demonstrate the impact of I/O overhead on system throughput, and to separate I/O overhead effects from application-specific behavior. At the same time, a database server is simulated to demonstrate how a real I/O intensive application is able to realize performance improvements without any modifications.
Implementing models of new hardware in an architectural simulator allows researchers to investigate performance implications of these mechanisms, to explore alternatives and to investigate the interaction with the rest of the system. However, this approach does not indicate the cost of such mechanisms in terms of hardware complexity, critical timing paths and required chip area. To this end, this study presents hardware implementations of the proposed mechanisms that show that even moderately optimized designs do not negatively affect the cycle time and that the area requirement is small.
1.5 Contributions
This dissertation introduces novel hardware and software mechanisms that give applications low-overhead access to I/O devices without operating system involvement, while maintaining the level of protection and most of the semantics of traditional kernel-based I/O with copy semantics. To evaluate the performance benefit of the new architecture, a detailed execution-driven system simulator is developed and validated. Various implementations of the proposed mechanisms are evaluated and compared with respect to overhead, hardware complexity and ease of integration into existing system organizations. The impact of the overhead reductions realized by these mechanisms is experimentally quantified using a synthetic benchmark and a database server.
In addition, this dissertation presents portable methodologies to quantify various aspects of I/O overhead in current systems and applies these methodologies to two different hardware platforms.
To facilitate the performance evaluation of the prototype user-level I/O architecture, this work introduces an architectural system-level simulator that combines detailed models of a microprocessor, caches, memory controller and I/O devices with a realistic UNIX compatible operating system. The simulation system is an extension of the RSIM architectural simulator. The validation of the simulator against an existing workstation shows that it is able to closely approximate the performance of real computer systems.
2. MODERN I/O ARCHITECTURE OVERVIEW
The term I/O architecture refers to the organization of I/O devices, host processors and main memory, and the methods by which these components exchange control information and data. Traditionally, most I/O architectures are based on a shared I/O bus, with memory-mapped device access mediated by the operating system [30]. Although relatively simple and cost effective, bus-based architectures are inherently limited in terms of scalability and bandwidth. In addition to these hardware limitations, the software overhead associated with I/O operations can have a large impact on I/O performance. This chapter discusses current and next-generation I/O architectures and shows how existing optimizations relate to the proposed user-level I/O architecture. The following two sections identify and measure the sources of operating system overhead in current systems, and show how overhead affects application throughput scalability. Sections 2.3 and 2.4 describe details of next-generation distributed I/O architectures that eliminate the bandwidth bottleneck of the I/O bus and replace it with a scalable network. Sections 2.5 and 2.6 discuss several approaches to reduce operating system overhead in the context of user-level communication architectures. Finally, Section 2.7 gives an overview of operating system optimizations that reduce the cost of data copy operations and control overhead.
2.1 Kernel-based I/O
Two main tasks of an operating system are managing hardware resources and providing a convenient and hardware-independent layer of abstraction. Resource management relates to fair and secure sharing of hardware resources among processes. To share hardware resources, the operating system multiplexes requests from competing processes onto the actual hardware. The operating system can assign hardware resources directly to an application and revoke them forcefully when they are needed by another process. For instance, applications execute directly on the microprocessor. Timer interrupts are used to preempt the current process and assign the CPU to another process. Main memory is logically divided into pages which are assigned to different processes. The operating system can remove a physical page from one process and assign it to another process. In both cases, applications have direct access to the resource once it has been assigned. Protection checks and resource assignment and revocation are supported by hardware features in the microprocessor.
In the case of input/output, the operating system multiplexes requests from multiple applications to the same device by performing them on behalf of the applications. This design allows the operating system to perform protection and integrity checks on requests, and to implement scheduling strategies to maximize I/O device utilization. For instance, disk drivers often reorder requests to minimize head seek times, and delay writes to give priority to read requests. Involving the operating system in every I/O transaction is necessary partly because processors do not provide hardware support for direct application access to the I/O subsystem as is the case with virtual memory. Memory accesses performed by applications are transparently checked and translated by the processor TLB. The operating system is involved in memory accesses only in exceptional cases such as access violations or page faults. I/O device accesses, on the other hand, do not enjoy similar hardware support in the processor, because such accesses consist of complex sequences of load and store operations.
Another service the operating system provides is abstraction. It bridges the semantic gap between high-level I/O requests and the capabilities of the I/O device hardware. For instance, the operating system translates an application's access to a file segment to control register reads and writes that trigger a disk access to individual blocks, and it may cache disk blocks in memory for faster access. In the case of network I/O, the operating system may implement reliable in-order delivery of packets on top of an unreliable network. A side effect of this design is that the OS is able to hide the details of many different hardware implementations under a common standard interface. Often, this standard interface also hides the variable latency of requests from applications by blocking processes until the desired I/O operation completes.
The standard I/O interface in UNIX and many other operating systems specifies copy semantics. Applications are not able to observe incorrect or partial data in an input buffer; in other words, the input routine does not return until the input operation has completed successfully. Given that in the case of network I/O, input data may arrive before an application posts a request, buffering input data in intermediate kernel buffers and copying it upon request is necessary. For output operations, it is guaranteed that the output buffer can be modified after the output routine returns without affecting the result of the output operation. If the system chooses to delay the output operation, intermediate buffering is needed. Copying data between application and kernel buffers is also necessary when the I/O device has buffer alignment restrictions which are generally not known by applications, or when the operating system performs transformations on the data, for instance by packetizing network data. Copying data between user and kernel buffers can be viewed as another service provided by the operating system, giving applications the flexibility to use arbitrary I/O buffers allocated under program control.
Although the services provided by the operating system simplify application programs, they can introduce significant overhead. The overhead can be classified into two components: 1) control overhead from context switches and long code paths inside the operating system; and 2) data transfer overhead associated with copying. Unfortunately, both components are limited to a large degree by main memory performance and hence do not scale proportionally to the processor-oriented performance of compute-intensive applications. As discussed in Section 1.3, the primary reason for this is that operating system code is executed infrequently enough to incur many cache and TLB misses, which are dominated by main memory performance. Many instructions executed during context switches exhibit short data dependencies and cannot take advantage of the out-of-order execution capabilities of modern processors. In addition, many privileged instructions access global processor state and are implemented such that they serialize the superscalar pipeline of the processor. Copying data between address spaces frequently incurs many cache misses, since either the source or destination buffer was previously invalidated in the processor cache by a direct memory access (DMA) operation. Furthermore, unlike a DMA engine, an individual processor is usually not able to exploit the full memory bandwidth available to pipelined burst bus transactions since it supports only a small number of outstanding memory operations.
2.2 File I/O Overhead Characterization
File I/O is one of the most common forms of I/O operations performed by applications. The access latency and transfer rates of hard disks have not kept pace with the performance of other I/O systems, e.g., local area networks. Disks are found in virtually any general purpose computer system. Many I/O intensive applications use the storage system as either the source or sink of their operations.
Figure 1 shows the different phases of a read system call and the different processor execution contexts encountered. Due to the long latency of a disk access, a file read transaction is performed in two phases. When the read system call detects that the requested disk blocks are not in the buffer cache, it initiates a disk read request and suspends the calling process. After the disk controller has transferred the data into the buffer cache, it interrupts the host CPU. The interrupt handler reschedules the application process, which then resumes execution in the system call. After being woken up, the system call copies the requested data from the buffer cache into the application buffer and returns to user mode.
Figure 1: Read System Call with Disk Access (execution contexts over time: application, kernel, interrupt handler; the system call latency spans disk request initiation, idle/other process execution, the disk interrupt and the status read)
Although the host processor can execute another process during the long-latency disk access, the involvement of the operating system during the transfer setup and the interrupt handling and process rescheduling is considered overhead. The following sections describe experiments that quantify this overhead.
2.2.1 File Read System Call Overhead
2.2.1.1 Methodology. The challenge in measuring the overhead incurred by a disk read is that the initiating process is suspended during the actual disk read. However, it is possible to observe the effective idle time during a read system call and subtract this time from the read system call latency observed by the calling application. The setup for this experiment consists of two closely cooperating processes, as shown in Figure 2.
An I/O process performs the read system call while an observation process measures the idle time between the disk transfer initiation and the completion, as well as the end-to-end context switch time. The two processes communicate and synchronize through a shared memory structure that provides storage for a synchronization flag and several time variables. Before entering the read system call, the I/O process increments the synchronization flag and saves the current time in a shared variable. The system call suspends the I/O process and switches to the observation process, which is spinning on the synchronization flag. It detects that the system call has suspended the I/O process and saves the current time, read via a system call, in another shared variable. It then continuously reads the current time and stores it in a third shared variable. The observation process runs until the disk transfer completes, at which point the disk interrupt handler wakes the I/O process up and causes a context switch. The value in the third shared variable represents the time of the disk interrupt. After returning from the read system call, the I/O process determines the total system call latency, as well as the difference between the various timestamps, and computes the system call overheads.
Figure 2: Measuring Disk Read Overhead (execution contexts over time for the I/O process, kernel and observation process, showing setup overhead, idle time and completion overhead)
This methodology represents a portable way of measuring disk I/O overhead. It relies only on the fork() system call and the System V shared memory interface. It also requires that the experiment is performed on a uniprocessor system and that no other processes are active during the experiment. Note that measuring individual setup and completion overheads requires that the system provide a global high-resolution timer, while the total overhead can be measured using process-local timers.
2.2.1.2 Experimental setup. The experiments using the microbenchmark are performed on an SGI Origin 200 [56] and a Sun Ultra-1 workstation, both running commercial UNIX variants. Table 1 summarizes the relevant features of these platforms. The two systems represent very different architectural approaches. The Ultra-1 is a uniprocessor workstation using a superscalar in-order microprocessor [102], while the Origin 200 is a distributed shared memory system with four MIPS R10000 processors [68] on two nodes. The operating systems, on the other hand, are both modern, internally multithreaded UNIX System V variants [37].
2.2.1.3 Results. All experiments access a local SCSI disk. Before each experiment, the operating system buffer cache is flushed by reading a file of the same size as main memory, thus ensuring that the buffer cache is completely filled with blocks from this file. The flush file must reside on the same device as the measurement files, since the buffer cache is indexed using the device number.
Table 1: Experimental Platforms

                SGI Origin 200                     Sun Ultra-1
OS              IRIX 6.5                           Solaris 2.6
CPU             4 x 225 MHz R10000                 143 MHz UltraSPARC-1
L1 D-Cache      32 Kbyte, 32-byte blocks,          16 Kbyte, 32-byte blocks,
                2-way set-associative,             direct-mapped, write-through,
                write-back, virtually indexed      virtually indexed
L1 I-Cache      32 Kbyte, 64-byte blocks,          16 Kbyte, 32-byte blocks,
                2-way set-associative,             2-way set-associative,
                virtually indexed,                 virtually indexed,
                physically tagged                  virtually tagged
L2 Cache        2 Mbyte, 128-byte blocks,          0.5 Mbyte, 64-byte blocks,
                2-way set-associative,             direct-mapped,
                physically indexed & tagged        physically indexed & tagged
TLB             64 entries, unified with           64-entry D-TLB,
                micro I-TLB                        64-entry I-TLB
Main Memory     1024 Mbyte                         256 Mbyte

Table 2 summarizes the average of 16 measurements for the two different systems. The results clearly show that both the setup overhead and the total overhead scale with the request size. Before accessing the disk, the read system call checks if the requested file blocks are in the buffer cache; as the file size grows, more blocks need to be checked. At the end of the disk transfer, the requested amount of data is transferred from the buffer cache to the user buffer. This copy operation is performed by the host processor and the overhead is dependent on the amount of data. The slightly better scaling of the total overhead on the Ultra-1 may indicate a more efficient implementation of the copy operation in Solaris.
2.2.2 Interrupt Overhead
Interrupts are an important part of the I/O subsystem of modern computer systems. During normal operation, interrupts are used to signal external events from I/O devices, such as the completion of a disk transaction, the successful transmission or the arrival of a network packet, or a periodic timer event to the kernel. Because of this frequent use, interrupts affect all aspects of OS performance, and they represent an increasingly important bottleneck in modern systems. Indeed, interrupt performance becomes crucial for gigabit networking, or highly parallel or pipelined I/O. For instance, Gallatin et al. [40] find that interrupt handling accounts for between 8 and 25 percent of receiver overhead in their measurements of TCP/IP performance on an Alpha 21164 workstation. The author has observed the MySQL database server to spend over 35 percent of its CPU cycles in the kernel, and 20 to 25 percent of the kernel time can be related to interrupt handling. The trend towards multithreaded and modular operating systems further increases the interrupt handling cost.

Table 2: Read System Call Overhead with Disk Access

System           Request  System Call     Setup          Total
                 Size     Latency (ms)    Overhead (µs)  Overhead (µs)
Sun Ultra-1      8 K      10.5 - 21.6     260 - 480      300 - 700
                 32 K     20.0 - 25.5     260 - 500      400 - 800
                 256 K    63.7 - 102.4    200 - 480      350 - 800
SGI Origin 200   8 K      11.6 - 17.7     240 - 280      300 - 400
                 32 K     11.2 - 19.8     240 - 340      400 - 600
                 256 K    23.2 - 34.2     330 - 420      1200 - 1500
This section presents a portable methodology for measuring the cache impact of interrupts, which is subsequently used to study disk interrupts on a Sun uniprocessor workstation and on an SGI shared memory multiprocessor. The methodology can be applied to both disk and network interrupts; this section contains only a summary of the results for disk interrupts [92].
2.2.2.1 Methodology. To measure the cache effects of interrupts from a user program's perspective, the methodology employs an application with perfect cache behavior that repeatedly touches all cache lines. An I/O interrupt disturbs the application by replacing some number of cache lines in each level of the cache hierarchy. In addition, the interrupt handler itself incurs cache misses. To measure these effects, the experimental application first performs an operation that will lead to an I/O interrupt at a later point in time (phase 1 in Figure 3(a)).
It then starts the event counters and fills the cache with application data (phase 2) by reading a data array with the stride equal to the cache block size and a size matching the cache size, without actually consuming the data. Filling the instruction cache can be similarly accomplished with a routine that repeatedly branches forward, touching every instruction cache block exactly once. After a fixed time period, during which the I/O interrupt has been handled (phase 3), the application touches every cache line again, and stops the event counters (phase 4). The number of cache misses measured by the event counters corresponds to the number of cache lines that have been replaced by the interrupt handler. By varying the size of this array, one can measure L1, L2, or TLB effects.
Counting cache misses in different counter modes allows one to observe a variety of effects. In user mode, the number of cache misses indicates how many cache lines have been replaced by the interrupt handler. Since the experimental application touches the entire cache, this represents the worst-case cost of interrupt handling, and can be used to estimate the cache footprint of the interrupt handler. When counting in kernel or exception mode, the experiments measure the number of cache misses incurred by the interrupt handler itself.
Initial experiments using this methodology revealed that normal periodic system activity introduced a significant number of cache misses in an otherwise idle system. For instance, waiting for a disk event takes 10 to 20 ms. If the application does not touch the cache during this time, many of the application's L1 data cache lines will be evicted by system threads before the I/O interrupt handler runs. This system activity is caused by various periodic clock interrupts and related handler threads and network broadcast messages.
Figure 3(b) illustrates a refined approach that isolates the effects of the disk or network interrupt. While waiting for the particular I/O interrupt to occur, the application repeatedly touches all cache lines, forcing them into the cache. This guarantees that despite the periodic clock interrupts, the interrupt handler being measured incurs close to the maximum number of cache misses. Omitting the interrupt-scheduling system call measures the number of cache misses in an idle system over a fixed period of time.
To generate a disk interrupt, the application issues an asynchronous read system call to a small file residing on a local disk. The asynchronous read allows the application to continue executing while the data is transferred from disk. To guarantee that the file is read from disk, the file cache is flushed by reading a large file (equal to the main memory size of the experimental machine). Note that the amount of data transferred is smaller than or equal to the smallest cache line size in the system, so that the measured interrupt handler overhead is not dominated by the data copy to or from user space.
Figure 3: Structure of Interrupt Overhead Experiments. a) Basic Experiment; b) Refined Experiment to Eliminate Effects of Other System Activity

All experiments are repeated at least 16 times, until the 95 percent confidence interval is less than 10 percent of the arithmetic mean of the samples (±5 percent). Before calculating the mean, high and low outliers are removed. The results presented here are the arithmetic mean of the remaining data points.
2.2.2.2 Experimental setup. The interrupt overhead measurements are performed on a Sun Ultra-1 and an SGI Origin 200 workstation. These are the same systems used previously to measure overall disk I/O overhead. The relevant machine characteristics are summarized in Table 1.
2.2.2.3 Results. Table 3 presents the results for both platforms for the L1 data and instruction caches and the L2 cache. As expected, when the kernel delivers a signal to the application at the end of the disk transfer, the number of cache misses increases slightly. The L1 data cache effects are very similar for both platforms. The data cache footprint of the interrupt handler is approximately 3-4 kilobytes. Note that since the Ultra-1 cache line size is 16 bytes, the number of cache misses incurred by the application is twice that of the Origin 200.
Table 3: Cache Misses

                                 L1 D-Cache       L1 I-Cache       L2 Cache
Mode       Description           O-200  Ultra-1   O-200  Ultra-1   O-200  Ultra-1
User       no signal delivered   104    215       169    402       298    253
           signal delivered      130    216       156    444       300    243
Kernel     no signal delivered   52     273       58     100       70     245
           signal delivered      63     228       61     110       65     264
Exception  no signal delivered   48     n/a       80     n/a       64     n/a
           signal delivered      53     n/a       86     n/a       74     n/a
Since the Ultra-1 performance counters do not distinguish between kernel and exception mode, the Solaris kernel mode cache misses should correspond to the sum of the kernel and exception cache misses in IRIX. However, because Solaris events are counted regardless of the process context, the results are higher than the corresponding IRIX results. In IRIX, on the other hand, the cache misses incurred by the interrupt handler thread are not included in these measurements; hence the sum of kernel and exception misses is less than the total number of replaced cache lines.
The L1 instruction cache results follow the same trend as for the data cache. Both platforms show an instruction cache footprint of about 10-13 kilobytes. The Solaris interrupt handler replaces about 15-18 percent more instruction cache lines than the IRIX handler. However, the Solaris results show a large variation (especially in kernel mode) and occasionally do not reach the 95 percent confidence interval of ±5 percent of the mean. This is due to the Origin 200 being a four-processor system, where clock interrupt and network packet processing can be moved to other processors. On the Ultra-1, all system activity is handled by the single processor. This introduces more noise into the experiments, especially when measuring over the relatively long period of 25 ms.
Note that due to the inclusion property of caches, instruction cache lines may be evicted when the corresponding L2 cache line is replaced, regardless of whether it is replaced by data or instructions. This explains why the number of kernel instruction cache misses is lower than the total number of L2 cache lines replaced by the interrupt handler.
The number of L2 cache lines replaced by the interrupt handler is approximately the same for both platforms, with the Sun results being slightly lower. Since the L2 cache line size on the UltraSPARC processor is half that of the MIPS R10000, this indicates that the interrupt handler does not exhibit sufficient spatial locality to benefit from a larger cache line size. This is confirmed by the observation that the sum of the number of L1 instruction and data cache lines replaced is almost equal to the number of replaced L2 cache lines on the SGI platform.
On the Sun, on the other hand, the sum of L1 instruction and data cache misses in user mode is higher than the number of L2 misses, possibly because in the smaller L2 cache of the UltraSPARC instructions and data overlap and conflict with each other, creating a smaller footprint.
These observations confirm that from the application's perspective, interrupts in multithreaded operating systems have a higher cost than in traditional operating systems. The additional thread scheduling activity and context switches cause many more application cache misses [78].
2.2.3 Latency Hiding and I/O Bandwidth Scaling
The varying latency of I/O requests is hidden from applications by the operating system by blocking the requesting application in the system call while the kernel performs a context switch to another process. This latency hiding technique improves overall system throughput, and it can also improve I/O throughput if requests to independent I/O devices can be overlapped. The throughput improvement is in part limited by the operating system overhead associated with each I/O request.
A synthetic I/O intensive benchmark serves to demonstrate the effect of operating system overhead on I/O throughput. The benchmark issues multiple streams of read requests to a collection of independent disks. Each request stream extracts the maximum bandwidth from a disk, while overlapping requests maximize overall I/O throughput. To eliminate bandwidth limiting effects of the SCSI bus or host adapter, each disk is attached to a separate SCSI host adapter. The purpose of the benchmark is to measure the maximum obtainable I/O throughput under ideal conditions, where streams are directed at separate disks and the application performs no operation on the data. As such, the benchmark does not represent real application behavior, but it demonstrates an upper bound of obtainable I/O performance when applications overlap I/O requests with other work.
The results shown here are obtained by running the benchmark on the L-RSIM simulation system [90]. L-RSIM is a detailed execution-driven simulator that accurately models an out-of-order processor with caches, a main memory controller and a number of I/O devices. The simulator executes a BSD-based operating system that implements a complete I/O subsystem, including filesystem and device drivers. Section 3 contains a more detailed description of the simulation system, and presents results of a validation against a real workstation. This experiment, as well as the experiments in Chapter 8, use the system configuration summarized in Table 4. This configuration is an approximation of a hypothetical 400 MHz R12000 based workstation such as an SGI Origin 200.
Table 4: Simulator Configuration

Parameter     Value
Processor     400 MHz, dynamically scheduled, 48-entry reorder buffer,
              4-way dispatch, issue & graduation
L1 caches     32 Kbyte 2-way set-associative instruction & data caches
L2 cache      2 Mbyte 2-way set-associative
System Bus    100 MHz, 64-bit multiplexed address & data
Main Memory   100 MHz SDRAM, 4 banks
I/O bus       66 MHz PCI, 320 ns read latency
Disk          9 Gbyte, 10000 rpm, 5.3 ms average seek time

Figure 4 shows aggregate I/O bandwidth for varying numbers of request streams. The top graphs show the total I/O bandwidth if each stream requests data in a pseudorandom sequence with two different request sizes. Bandwidth saturates when the number of streams approaches 10, and remains largely constant as the pressure on the I/O system increases further. The saturation is due to the fact that the host processor occupancy reaches 100 percent; adding more requests does not lead to increased overlap. It should be noted that the saturation point remains almost unchanged for different request sizes, while the aggregate bandwidth is slightly higher for larger requests.
The bottom graphs in Figure 4 show aggregate bandwidth for sequential access patterns. Issuing requests sequentially leads to higher aggregate bandwidth because both the disk controllers and the operating system are able to successfully prefetch disk blocks for subsequent requests. However, the operating system overhead remains unchanged. As a result, bandwidth saturates at an even smaller number of requests, because prefetching reduces the effective latency of I/O requests, and thus reduces the potential for overlap.

Figure 4: I/O Bandwidth Scaling (aggregate bandwidth in Mbyte/s versus number of streams, for nonsequential and sequential access with 16 Kbyte and 64 Kbyte blocks)

The benchmark used for these experiments issues streams of I/O requests from independent processes. Many modern operating systems also provide a nonblocking I/O interface that allows applications to continue executing while an I/O request is processed. The traditional UNIX kernel, however, is based on a blocking I/O model. To implement the nonblocking I/O interface with minimal changes to the kernel structure, most operating systems use kernel threads to perform the I/O-related work in the background. Upon issuing a nonblocking I/O request, a library routine spawns an I/O thread, which executes a normal blocking system call and signals the original thread when it completes. Although this interface avoids the overhead of switching between separate processes, it executes several system calls to create and remove the I/O thread. As a result, nonblocking requests are at least as costly in terms of operating system overhead as traditional blocking requests.
2.2.4 Application Throughput Scaling
Many applications with high I/O throughput requirements exploit latency hiding techniques to improve throughput. Databases and Web servers process multiple requests concurrently and use locking to synchronize conflicting accesses. Data are distributed across many disks to not only provide the necessary storage capacity but also to support overlapping disk accesses. The experiment described in this section simulates the MySQL database server [98] on the L-RSIM simulation system to demonstrate the impact of operating system overhead on I/O throughput scaling.
MySQL is a public-domain database server that supports the SQL database query language. It is internally multithreaded and can take advantage of nonblocking I/O operations if supported by the kernel. If used with a user-level thread library, I/O operations block the entire process including all threads. Databases are represented as regular directories and files, making the server extremely portable. However, performing disk accesses through the buffer cache is not representative of many commercial databases, which traditionally access the raw disk device. On the other hand, modern network-attached fileserver appliances [63] export only an NFS filesystem interface and require commercial database systems to move towards a regular file I/O interface as well [22]. In this experiment, the MySQL database system serves as a representative of a number of I/O intensive applications that are characterized by data sizes on disk that exceed main memory capacity and high I/O throughput requirements. These applications include file servers, mail and news servers, web servers and databases.
Each experiment runs a number of identical copies of the database server that execute the same queries on separate disks to approximate the behavior of a multithreaded database server. Executing multiple copies of the database server to improve I/O throughput is not necessarily representative of real I/O intensive applications. However, even though MySQL is internally multithreaded and can exploit overlapping I/O requests among threads, the simulator operating system does not provide the necessary thread system calls and kernel thread support. To more closely approximate the performance of a multithreaded database, the level one and level two cache size and associativity of the simulated architecture is scaled with the number of server processes, thus minimizing the impact of cache conflicts among processes.
Each copy of the server executes a sequential query on a 30 Mbyte database, performed mostly as a series of 16 Kbyte read requests. The buffer cache is configured with 20 Mbytes for all experiments. Since the cache is initially empty and the database is too large to fit in the cache, buffer cache effects do not influence I/O performance.
In this experiment, a single query executes in approximately 5 seconds real time on the simulated architecture. Due to the large number of disk seek operations, the CPU is busy for about 8 percent of the time. Over one third of these CPU cycles are spent executing operating system code related to I/O operations. This number is higher than the percentage of operating system cycles of commercial database engines, partly because MySQL is not as highly optimized to avoid I/O operations. On the other hand, other types of servers that perform fewer data manipulations, such as file servers or web servers, exhibit an even higher operating system component of the execution time.
Figure5 showsthespeedupof parallelqueriesoverasinglequeryaswell asuserand
kernelCPUutilization components.Thesimulatorconfigurationfor theseexperimentsis
the same as in the previous experiment and is summarized in Table 4.
The left graph shows the throughput relative to a single query. The database throughput increases almost linearly up to nine queries. Beyond this point, throughput saturates at close to a speedup of eight. The right graph plots CPU utilization in terms of busy cycles and cycles spent in the operating system. The busy cycles curve closely follows the throughput curve. As the processor utilization approaches 100 percent, throughput saturates. However, even with a large number of queries, the processor is never completely utilized. This effect is due to the fact that disk latency depends on when a request is issued with respect to the position of the disk head relative to the requested sector. This nonlinear behavior prevents the system from fine-grain multiprogramming with perfect processor utilization. The operating system component of the processor utilization remains almost constant between 34 and 36 percent, slightly increasing towards larger numbers of queries. The operating system overhead of about one third of the processor cycles can be attributed almost completely to I/O related activity. Reducing this overhead makes these cycles available for application processing and improves the throughput scalability of I/O intensive applications such as the MySQL database server.
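The saturation behavior described above can be approximated by a simple back-of-the-envelope model; this sketch is not from the dissertation, and the function name and the `overlap_efficiency` parameter are illustrative assumptions introduced here to capture the nonlinear disk-latency effects informally:

```python
def modeled_speedup(n_queries, busy_fraction, overlap_efficiency=1.0):
    """Rough upper bound on throughput speedup when n I/O-bound queries overlap.

    With one query keeping the CPU busy for fraction `busy_fraction` of the
    time, perfect overlap is limited either by the number of queries or by
    CPU saturation at 1/busy_fraction. `overlap_efficiency` < 1 crudely
    models the nonlinear disk-latency effects that keep utilization below
    100 percent.
    """
    return min(n_queries, overlap_efficiency / busy_fraction)

# Single query: ~8 percent CPU busy -> an ideal ceiling of 12.5x.
ideal = modeled_speedup(16, 0.08)
# With imperfect overlap (here assumed ~65 percent effective), the ceiling
# drops to roughly eight, in the vicinity of the saturation point observed.
observed = modeled_speedup(16, 0.08, overlap_efficiency=0.65)
```

Under this toy model, reducing the operating system share of the busy cycles lowers `busy_fraction` directly, which raises the saturation ceiling — the scalability argument made in the text.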
2.3 Network Attached Disks
Network-attached secure disks (NASD) [42] are a promising architecture for scalable file servers that addresses some of the shortcomings of current storage architectures by separating storage functionality from storage management, and by delegating lower-level device management functionality to the disk controller, as shown in Figure 6.
Separating storage from access management permits data transfers directly between disks and clients, thus bypassing the main memory bottleneck of conventional file servers. This direct data transfer is made possible by the high-level interface that the network-attached disks provide, and by the disks' ability to check access permissions for each request. NASD devices store and operate on objects, rather than individual blocks, eliminating the need for a block-management entity that is found in conventional file systems. Access to these objects is managed by a separate file manager, which grants capabilities to clients to access objects on the disks. The file manager is also responsible for mapping filenames to disks, volumes and objects, and for metadata updates such as object creation and attribute changes. Data transfers for read and write operations are performed autonomously between clients and disks over a scalable system area network.

Figure 5: MySQL Database Throughput Scaling (left: speedup versus number of queries; right: CPU utilization in percent, showing busy cycles and kernel I/O cycles)
To maintain the integrity of the disks, and to provide UNIX-level protection, the clients and disks use capabilities in combination with public/private key encryption when transmitting requests and data. When first connecting to the distributed storage system, a client requests a set of capabilities from the file manager that then allow the client to directly contact the disks to read or write data.
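The split between the control path (file manager) and the data path (disks) can be sketched as follows; the class names, the token format, and the HMAC-based capability check are illustrative inventions for this sketch, not the actual NASD protocol:

```python
import hmac, hashlib

class FileManager:
    """Grants capabilities and manages the namespace; never touches the data path."""
    def __init__(self, secret):
        self.secret = secret          # shared with the disks out of band
        self.namespace = {}           # filename -> (disk, object_id)

    def create(self, name, disk, object_id):
        self.namespace[name] = (disk, object_id)

    def grant(self, name, rights):
        disk, obj = self.namespace[name]
        token = hmac.new(self.secret, f"{obj}:{rights}".encode(),
                         hashlib.sha256).hexdigest()
        return disk, obj, rights, token

class NasdDisk:
    """Stores objects and checks each request's capability itself."""
    def __init__(self, secret):
        self.secret = secret
        self.objects = {}

    def read(self, obj, rights, token):
        expect = hmac.new(self.secret, f"{obj}:{rights}".encode(),
                          hashlib.sha256).hexdigest()
        if "r" not in rights or not hmac.compare_digest(token, expect):
            raise PermissionError("bad capability")
        return self.objects[obj]

secret = b"shared-key"
disk = NasdDisk(secret)
disk.objects[7] = b"payload"
fm = FileManager(secret)
fm.create("/data/log", disk, 7)

# The client asks the file manager once, then talks to the disk directly.
d, obj, rights, token = fm.grant("/data/log", "r")
data = d.read(obj, rights, token)
```

The point of the sketch is that after the one-time `grant`, reads and writes bypass the file manager entirely, which is what removes the server's memory system from the data path.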
A study by Gibson et al. [42] finds that network-attached secure disks indeed lead to scalable distributed file systems with performance at least comparable to existing server-based solutions. However, the communication overhead between clients and file manager is identified as a major bottleneck, potentially limiting the scalability of their experimental prototype.
Figure 6: Network-attached Secure Disk Architecture (the file manager handles create, delete, and capability requests from clients; read/write transfers go directly between clients and disks)
2.4 InfiniBand
InfiniBand [51] is a recently developed industry standard for a next-generation distributed I/O architecture. It addresses several shortcomings of current I/O architectures, including bandwidth limitations imposed by shared I/O buses and the overhead incurred by operating system mediated access to I/O devices.
InfiniBand centers around a distributed I/O architecture with a scalable system-area network consisting of switches, routers and host channel adapters (HCA) that connect client systems to autonomous I/O devices. Figure 7 shows an example of the switched network with a variety of client systems and I/O devices. Compared to shared I/O buses, a switched network is able to provide scalable bandwidth for larger numbers of devices, and is able to provide higher transfer rates through the use of point-to-point links that can operate at higher signalling frequencies. It also permits direct device-to-device data transfers and can be used to communicate between client systems.

Figure 7: InfiniBand Distributed I/O Architecture (client systems with CPUs, memory, and host channel adapters connected through switches and a router to I/O devices and other subnets)
Similar to NASD, InfiniBand requires autonomous I/O devices that are able to export a high-level interface to clients and that perform complex buffer management functions and access protection checks. However, the exact semantics of requests sent to I/O devices are not part of the InfiniBand standard, and remain to be defined.
The InfiniBand architecture recognizes the need to bypass the client operating system for I/O requests to achieve high I/O throughput. The mechanism used to communicate with the local I/O network interface, and ultimately with remote devices, is based on the U-Net user-level network architecture [32]. Each client application establishes at least one network endpoint consisting of send, receive, and completion queues located in memory. The host channel adapter (HCA) polls all work queues and processes requests deposited in these queues.
To enable the HCA to access user memory, the work queues and data buffers reside in memory that has been registered with the network interface through an operating system service. The operating system ensures that the corresponding physical pages are contiguous and nonpageable, and it communicates the virtual-to-physical address mapping to the interface. These prearranged buffers facilitate low-cost DMA operations without operating system involvement, but restrict the application's use of I/O buffers. This approach can also limit scalability, as all registered buffer pages must be present in physical memory. Applications may conservatively allocate more buffer space than is required, thus increasing the pressure on the physical memory pool further. Restricting the HCA to use pinned and prearranged memory regions simplifies the DMA engine hardware and the integration with existing system software, but places restrictions on application programmers. On the other hand, bypassing the operating system for request initiation moves the responsibility to multiplex and schedule multiple I/O requests to the HCA hardware. This task requires either complex finite state machines to control the HCA operation or a programmable on-chip controller, both of which increase the HCA's hardware complexity. In addition, in a system with many active processes, polling many queues can increase request latency significantly, even if most queues are frequently empty.
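The interaction between memory registration and user-level request posting can be illustrated with a toy model; the class and method names here are invented for the sketch and only loosely mirror InfiniBand's verbs-style interface:

```python
from collections import deque

class HostChannelAdapter:
    def __init__(self):
        self.registered = set()       # pages the OS has pinned and mapped
        self.send_queues = []         # one work queue per endpoint

    def register_memory(self, pages):
        # OS service: pins the pages and installs the address mappings.
        self.registered.update(pages)

    def create_endpoint(self):
        q = deque()
        self.send_queues.append(q)
        return q

    def poll(self):
        """The HCA scans all work queues; only registered buffers are DMA-able."""
        completions = []
        for q in self.send_queues:
            while q:
                page, length = q.popleft()
                status = "sent" if page in self.registered else "error"
                completions.append((status, page))
        return completions

hca = HostChannelAdapter()
hca.register_memory({0x1000, 0x2000})
qp = hca.create_endpoint()
qp.append((0x1000, 256))   # user-level post: no system call involved
qp.append((0x3000, 64))    # unregistered buffer -> rejected by the HCA
results = hca.poll()
```

Note that `poll` walks every queue regardless of occupancy, which is exactly the latency concern raised above for systems with many mostly-empty endpoints.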
2.5 User-level Network Architectures
Communication subsystems for tightly coupled parallel machines, and more recently for clusters of commodity workstations, regularly bypass the operating system for common operations to reduce communication overhead and improve scalability. Most of these systems provide some form of protected and atomic user-level access to the network, minimize the number of data copies for data transfers, and implement lightweight notifications.
Many user-level communication architectures use a connection-oriented approach to minimize the amount of information that needs to be transferred to the device for each message. During the connection setup, the operating system can pin memory, communicate address mappings to the device and perform protection checks. Application software is then able to initiate message transfers using the established endpoint while bypassing the operating system. Privileged arguments are inferred by the network interface from the connection endpoint. This approach is very effective in reducing the per-message overhead by moving a portion of the software overhead to the connection setup time. However, its scalability to a large number of connections can be problematic as the network device needs to manage and store the state of each connection. It also does not reduce overhead for applications that require frequent connection setup and teardown operations.
U-Net [32] and the virtual network interface in NOW [62] use existing memory management facilities to provide multiple applications with the illusion of exclusive access to the network. User-level software never directly accesses the device hardware but communicates with the device through shared data structures or network endpoints in virtual memory. These data structures allow applications to assemble the request arguments in memory without concern for atomicity before informing the device hardware of the new request. Virtualizing the device hardware through per-application endpoints in memory simplifies the software interface and eliminates the need to provide hardware support for atomicity. However, it increases the device hardware complexity significantly by putting the burden of multiplexing the endpoints on the network interface.
Other designs utilize the inherent atomicity of a bus transaction to implement atomic device access. ATOLL [17] uses the unused upper address bits of an uncached write transaction as an index into a routing table to initiate a DMA transfer. Avalanche places a request control structure in kernel memory and updates a queue counter in the NI using a single uncached write. The Medusa network adapter [7] combines a network packet start address and length into a 32-bit word that is written to a hardware transmit FIFO in a single bus transaction. Since a bus transaction is a natural unit of atomicity in most computer systems, it is tempting to use it to implement atomic message transfer setup. The small size of uncached transfers, however, limits the number of request parameters that can be transferred with each request. Cache line transactions are able to transfer a larger number of arguments atomically, but software has no control over these transactions because cache conflicts and replacements are handled entirely by the cache controller.
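A Medusa-style encoding can be sketched as a simple bit-packing exercise; the 20/12-bit split chosen here is an assumption for illustration, not the adapter's actual format:

```python
# Pack a packet start address and length into one 32-bit word, so that a
# single uncached store delivers the whole request atomically.
ADDR_BITS, LEN_BITS = 20, 12   # illustrative split, not Medusa's real layout

def pack_request(start, length):
    assert start < (1 << ADDR_BITS) and length < (1 << LEN_BITS)
    return (start << LEN_BITS) | length

def unpack_request(word):
    # The device decodes the same word on the other side of the FIFO.
    return word >> LEN_BITS, word & ((1 << LEN_BITS) - 1)

word = pack_request(0xABCDE, 1500)   # fits in one 32-bit bus transaction
assert unpack_request(word) == (0xABCDE, 1500)
```

The hard limit visible here — everything must fit in the width of one bus transaction — is precisely the parameter-count restriction noted in the text.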
Performing DMA operations that have been initiated by user-level software without operating system involvement is complicated by the fact that applications operate on virtual addresses, whereas I/O devices use physical addresses to access memory. Providing the required address translations and pinning the associated physical pages is often done during connection setup and implies that message buffers remain fixed for the duration of the connection. Overcoming the restriction of statically pinned DMA buffers requires adding a dynamic address translation capability to the DMA engine. Similar to a processor TLB, the DMA engine can cache address translations in on-chip memory and thus amortize the cost of address translations by exploiting spatial locality in applications. The operating system performs address translations and protection checks at the time a mapping is installed in the DMA TLB. Although this can be done on demand using interrupts, handling page faults at interrupt time is difficult at best because no context is available that can block waiting for the disk request. To avoid extensive modifications of the operating system kernel, the U-TLB mechanism performs address translations and pins pages in advance under application control [21]. Applications manage a table of translations located at the DMA device. To implement protected address translation, applications have no access to the physical address of a mapping but specify DMA addresses using an index into the translation table. System calls are used to install mappings in the table, at which time the physical page is pinned.
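The index-based indirection of the U-TLB scheme can be sketched as follows; the class and method names, and the flat `page_tables` dictionary standing in for the real page tables, are illustrative assumptions:

```python
class UTlb:
    """Toy model of an application-managed device translation table."""
    def __init__(self):
        self.table = []     # index -> pinned physical page

    def install(self, vaddr, page_tables):
        """'System call': translate, check protection, pin, return an index."""
        phys = page_tables[vaddr]       # translation done by the kernel
        self.table.append(phys)
        return len(self.table) - 1      # handle the application may use

    def dma(self, index, offset):
        """The DMA engine resolves an index, never a raw physical address."""
        return self.table[index] + offset

page_tables = {0x4000: 0x9F000}         # per-process virtual -> physical
tlb = UTlb()
handle = tlb.install(0x4000, page_tables)
target = tlb.dma(handle, 0x10)
```

Because the application only ever holds the opaque index, it can name DMA buffers without learning physical addresses, which is the protection property the mechanism is built around.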
The U-Net architecture has been extended with a similar demand pinning scheme and a TLB at the DMA engine [106]. Unlike the U-TLB, address mappings are requested by the network interface during a DMA transfer and are processed through kernel interrupts. Page faults are handled by a kernel thread or process that is triggered by the interrupt handler. Pages are pinned when the corresponding translation is installed in the device TLB, and remain pinned until the device evicts the TLB entry. To avoid stalling the network on a TLB miss, the network interface prefetches translations for receive buffers. Applications are not involved in managing the TLB or maintaining translations. The U-Net/MM mechanism provides transparent user-level DMA with arbitrary buffer spaces at the cost of added complexity in the operating system.
Virtual memory-mapped communication [14][39] is a user-level communication mechanism that avoids the user-level DMA problems by performing all communication between pairs of virtual memory pages. Instead of explicitly specifying data transfers, it forwards all modifications made to a local memory page to an associated remote page. This communication scheme reduces communication overhead and latency to a minimum, but since the host processor actively performs all data transfers, bandwidth for bulk transfers is limited by the uncached write bandwidth of the processor.
Many user-level communication architectures reduce message passing overhead by minimizing the number of copy operations for a message transfer. Avoiding copying data when receiving messages is complicated by the fact that messages may arrive before the receiver process has posted a receive buffer. If the communication library provides intermediate buffering for these unexpected messages, communication overhead increases due to the necessary data copy. U-Net and InfiniBand require that processes enqueue receive buffers in advance. Since the arrival order of messages is undefined, processes have no control over which message is deposited in which buffer. Active Messages [23][33] is a communication model that avoids copying at the receiver by invoking a message handler routine upon message arrival. The handler is specified in the message itself. It is responsible for removing the message from the network and integrating it into the flow of computation. Performing all communication in pairs of request/reply messages enables the sender to set up a receive buffer for the reply message as part of the message transmission process, thus avoiding intermediate buffering of incoming messages. However, only tightly coupled parallel programs can be required to follow this programming model, since the sender specifies a user-level routine in the receiver's address space when transmitting a message.
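The handler-in-the-message idea can be sketched in a few lines; the function names and the list standing in for the network are invented for this sketch:

```python
results = {}

def deposit_handler(src, payload):
    # The handler pulls the message straight into the computation's state,
    # so no intermediate receive buffer is needed.
    results[src] = payload

def am_send(network, handler, src, payload):
    # The message carries a reference to its own handler.
    network.append((handler, src, payload))

def am_poll(network):
    while network:
        handler, src, payload = network.pop(0)
        handler(src, payload)           # invoked directly on arrival

net = []
am_send(net, deposit_handler, src=3, payload=41)
am_poll(net)
```

The sketch also makes the trust assumption visible: the sender names a routine that runs in the receiver's address space, which is why the model suits only tightly coupled parallel programs.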
2.6 Virtual I/O Devices
One of the tasks of an operating system is to virtualize the machine hardware so that applications have the illusion of exclusive access to it. If the operating system is bypassed for some hardware accesses, the hardware must be virtualized by some other means. Trusting applications to synchronize among each other when accessing device hardware is only an option in restricted environments such as tightly-coupled multiprocessors. In general-purpose multiuser systems, processes must be protected from uncooperative applications.
User-level network interfaces are usually virtualized by providing multiple endpoints that are multiplexed by the device hardware. U-Net provides process-private queues that are polled by the device for requests. Memory Channel maps distinct physical pages into different processes' address spaces and multiplexes writes to the pages. The NOW virtual network interface uses virtual memory techniques to improve scalability of the network interface [62]. As in U-Net, network endpoints are located in main memory and mapped into the application address space. Endpoints can be resident in the network interface's outboard memory, or swapped out to main memory. The network interface multiplexes messages from the set of resident endpoints, and cooperates with the host operating system when swapping endpoints, thus extending the number of available endpoints beyond what is supported by the device hardware.
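The endpoint-swapping idea is structurally the same as demand paging, and can be sketched as a small cache of resident endpoints; the LRU policy and all names here are illustrative assumptions, not the NOW implementation:

```python
from collections import OrderedDict

class VirtualNI:
    """Toy model: a few hardware endpoint slots backed by host memory."""
    def __init__(self, hw_slots):
        self.hw_slots = hw_slots
        self.resident = OrderedDict()   # endpoint -> NI-resident state
        self.swapped = {}               # endpoint state kept by the OS
        self.swap_ins = 0

    def deliver(self, endpoint, msg):
        if endpoint not in self.resident:
            self.swap_ins += 1          # OS assistance required (slow path)
            if len(self.resident) >= self.hw_slots:
                victim, state = self.resident.popitem(last=False)  # LRU evict
                self.swapped[victim] = state
            self.resident[endpoint] = self.swapped.pop(endpoint, [])
        self.resident.move_to_end(endpoint)
        self.resident[endpoint].append(msg)

ni = VirtualNI(hw_slots=2)
for ep in ["a", "b", "c", "a"]:
    ni.deliver(ep, "msg")
```

As with demand paging, the scheme trades an unbounded endpoint count for occasional expensive swap-ins, which is the overhead/latency caveat the text raises.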
High-performance graphics adapters are another class of I/O devices that can benefit from bypassing the operating system. In many workstations, the X-server is the only entity with direct access to the graphics hardware. Like an operating system, it multiplexes graphics commands that it receives from applications onto the hardware while maintaining window boundaries and other application context. In high-performance applications, the X-server becomes a bottleneck because it requires multiple context switches between the application and the server to transmit a command to the graphics engine. Direct rendering bypasses the X-server and lets applications transmit commands directly to the graphics hardware, which needs to be virtualized to handle multiple client applications.
SGI’s direct rendering implementation virtualizes the graphics adapter by treating it as part of the process context [53]. To avoid the cost of loading and unloading the entire graphics engine state on every context switch, the operating system implements a lazy context switch strategy and performs the switch only if the new process attempts to access the graphics hardware. The DIGITAL PowerStorm 4DT graphics adapter takes a less aggressive approach to direct rendering and uses the X-server to multiplex graphics commands [59]. Instead of using heavyweight interprocess communication mechanisms to transfer high-level commands, applications queue hardware commands in shared memory pages and direct the X-server to these commands. Both approaches address the overhead involved in multiplexing application commands onto the graphics hardware by optimizing or eliminating the intermediate software entities.
2.7 Operating System Support for High-Performance I/O
Apart from the I/O device hardware capabilities, I/O performance is also affected by operating system overheads, consisting of control overhead such as system calls and of data transfers between protection domains. Operating system support for high-performance I/O addresses both of these sources by avoiding copying data between kernel and user buffers, by optimizing the overhead of layered and modular software design, and by providing mechanisms to customize operating system services for applications' needs.
2.7.1 Copy Avoidance
Data transfers between kernel and user space as well as within the kernel often amount to the largest component of I/O overhead and can limit the overall I/O performance of a system. The I/O interface implemented in UNIX and other operating systems implies a copy semantics for I/O operations by giving applications control over the I/O buffer allocation and deallocation. The system places no restrictions on the location or alignment of I/O buffers. In addition, the operating system maintains strong integrity of the application I/O buffers, such that applications never observe incorrect I/O data on input, and that output data can be modified without ill effects after the output operation has executed. To implement these semantics without placing restrictions on the application, most operating systems copy I/O data between user space and intermediate kernel buffers. Using kernel buffers for I/O also allows the kernel to exploit locality and sharing between applications, for instance through a file buffer cache. However, copying data between protection domains costs valuable processor cycles and displaces cache and TLB entries that later need to be reloaded by the application. Performance of copy operations is largely determined by main memory performance; increases in CPU compute performance generally do not translate into proportionally improved copy bandwidth [88]. The amount of overhead scales with the request size and limits the achievable I/O performance.
Copy-on-write [86] is an optimization that attempts to avoid copying of output data and uses the application buffer for I/O. Rather than copying the data, the kernel marks the pages that contain the output buffer as read-only for the application. If the application modifies the output buffer before the I/O operation completes, it incurs a page fault, upon which the virtual pages containing the buffer are copied to a kernel buffer. Alternatively, the application can be stalled until the I/O operation completes [8]. Copy-on-write introduces overheads that make this scheme beneficial only for larger transfers. The application buffers must be pinned, which can be more costly than pinning kernel buffers.
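The copy-on-write protocol for output buffers can be sketched as a toy state machine; this is an illustrative model with invented names, not a real virtual memory implementation:

```python
class CowBuffer:
    """Toy copy-on-write output buffer: copy only if the app writes early."""
    def __init__(self, data):
        self.app_page = bytearray(data)
        self.kernel_page = self.app_page   # shared until a write occurs
        self.protected = False
        self.copies = 0

    def start_output(self):
        self.protected = True              # kernel marks pages read-only

    def app_write(self, i, b):
        if self.protected:                 # simulated write-protection fault
            self.kernel_page = bytearray(self.app_page)  # copy the old data
            self.copies += 1
            self.protected = False         # app regains write access
        self.app_page[i] = b

buf = CowBuffer(b"abc")
buf.start_output()
buf.app_write(0, ord("X"))    # faults once: the kernel keeps the original
```

If the application never touches the buffer before the output completes, `copies` stays at zero, which is the case the optimization is betting on.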
Further optimizations are possible if the programming model is modified. Giving the kernel responsibility for I/O buffer allocation and deallocation allows it to use page remapping instead of copying for both input and output. Fbufs are a system-wide cross-domain data transfer facility that avoids data copying by remapping and sharing virtual buffer pages between protection domains [27]. To further reduce the transfer cost, fbufs can be cached in individual domains for later reuse, and buffers are mapped at the same virtual address in all domains. Fbufs are a general mechanism for efficiently transferring data between protection domains, including the I/O subsystem, but they imply significant changes to the programming model. Software explicitly allocates and deallocates I/O buffers, but it has no control over the alignment and the virtual address of input data, and cannot reuse the output buffer after an I/O operation. The required changes of the I/O programming interface are one possible reason why, apart from an experimental implementation in Solaris [99], fbufs have only slowly found their way into general purpose operating systems, despite their significant performance advantages over copying.
Container shipping is an I/O facility that, similar to fbufs, replaces data copy operations with page remappings [79][80]. It restricts the generality of fbufs by implementing move semantics when passing containers, and the actual data pages are remapped only if the receiver requires access to the data; otherwise only the container descriptor is passed. The more restricted semantics of container shipping compared to fbufs allows a more efficient implementation, since establishing shared page mappings is more costly than simply unmapping and mapping a page. It also simplifies the programming interface somewhat and lends itself better to strict producer-consumer I/O scenarios.
The virtual transfer technique employed by fbufs and container shipping can be further specialized and optimized for peer-to-peer I/O transfers. Such transfers move data from a source I/O device through the kernel to a sink I/O device without any data manipulation. Splicing is a technique that allows applications to establish such in-kernel data transfer paths [79]. By eliminating the data transfer to and from application space, splicing is able to share kernel buffers between source and sink device drivers, thus completely eliminating in-memory data copies. Peer-to-peer I/O with splicing can drastically improve I/O performance for a significant subset of I/O intensive applications, but is restricted to scenarios where applications do not inspect the data before forwarding it, and neither applications nor kernel modules modify the data.
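The contrast between the conventional copying path and a spliced path can be sketched with a toy kernel model; the classes here are invented for illustration (modern Linux exposes the same idea through the splice and sendfile system calls):

```python
class Kernel:
    """Toy model counting kernel<->user copies on two I/O paths."""
    def __init__(self):
        self.copies = 0

    def read_copying(self, src):
        kbuf = src.fill()
        self.copies += 1               # kernel -> user copy on the read path
        return bytes(kbuf)

    def splice(self, src, sink):
        kbuf = src.fill()
        sink.drain(kbuf)               # same buffer object: zero copies

class SourceDriver:
    def fill(self):
        return bytearray(b"packet")    # models DMA into a kernel buffer

class SinkDriver:
    def __init__(self):
        self.sent = None
    def drain(self, buf):
        self.sent = buf                # models DMA out of the same buffer

k, sink = Kernel(), SinkDriver()
k.splice(SourceDriver(), sink)
```

The spliced path never materializes the data in user space, which is why it only applies when the application neither inspects nor modifies the data.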
IO-lite [75] generalizes the sharing of kernel buffers into a unified buffering and caching scheme and extends the ability to share I/O buffers to applications. In addition to providing a copy-free cross-domain data transfer facility, it minimizes the amount of physical memory used for caching by integrating the file cache into the shared buffer pool. IO-lite thus extends the utility of schemes like fbufs and splicing to applications that inspect data before forwarding it to other I/O facilities, and which combine file and network I/O.
2.7.2 Control Overhead Reduction
Additional performance improvements can be achieved by minimizing the control overhead associated with I/O operations. This optimization is especially effective for small data transfers. Operating systems, like many complex pieces of software, use layering and abstraction to reduce complexity and improve modularity and flexibility. This approach is beneficial for the software development process, but it is often counterproductive for I/O performance. Passing requests through multiple layers, sometimes involving crossing protection domains, incurs overhead in the form of procedure calls, context switches and duplicate work performed in different layers.
The x-kernel operating system combines a framework for high-performance network protocol implementations with support for modular protocol composition [50]. It provides basic low-level services such as thread management, buffer management and event handling that are optimized for network protocol processing. High performance is achieved by avoiding context switches and by providing a streamlined interface both between protocols within the kernel and to applications. Threads are associated with messages rather than with protocols: a thread shepherds a message through a series of protocols without context switches. The Scout operating system takes the idea of minimizing the cost of multiple protocol layers one step further and explicitly specifies paths through the protocol stack [70]. Thus the system is able to apply software optimizations such as common subexpression and dead code elimination, inlining and constant folding at a large scope and with better results. In addition, scheduling decisions can be based on the entire path of a message, thus enabling the kernel to provide quality-of-service and other real-time guarantees. Scout's notion of a path aligns control flow and data transfer and enables optimizations that benefit both overhead components at the same time, while also introducing new optimization opportunities.
Extensible operating systems like Spin [12] or Exokernel [36] allow applications to implement specialized operating system functionality in an extensible kernel, which results in better performance for many applications for which the general-purpose interface of monolithic kernels is inappropriate. Consequently, the interface provided by such kernels offers only low-level primitives and places a greater burden on applications to implement resource management policies and other extensions. At the same time, it gives programmers the opportunity to apply domain-specific optimizations to improve performance or implement new application functionality.
Initiating an I/O request normally requires a system call. One of the reasons I/O device access is performed only by the operating system is to ensure atomicity of the access sequence. The operating system guarantees that processes competing for device access do not interfere with each other or interleave request arguments. Bypassing the operating system when initiating I/O requests avoids the cost of a system call and can lead to performance improvements, especially for low-latency I/O operations such as network communication. Software [11][71] and hardware [47][68] based synchronization mechanisms exist that implement user-level atomic sequences, but these solutions always require that software voluntarily uses the synchronization mechanism to avoid corruption of shared data. Uncooperative applications can bypass the synchronization mechanism and can affect the integrity of the entire system. Hence, these solutions are unsuitable to implement user-level I/O device access.
Synchronous or asynchronous events such as error conditions or I/O completion notifications are usually handled by the operating system, to allow the kernel to hide the low-level details of these conditions from applications. Applications can be informed of a subset of these events via signals. However, the cost of this mechanism prohibits its use for frequent events such as write protection faults in a garbage collector. Low-overhead user-level exception handling [100] reduces the cost of a general-purpose signal handler by saving only minimal state before executing the user-level handler, and by avoiding the system call normally required to resume program execution. However, the mechanism supports only synchronous exceptions that can be handled in the current process context. As such it is not directly applicable to asynchronous I/O interrupts.
2.8 Summary
The I/O interface of most contemporary operating systems specifies copy semantics, which means that applications can use I/O data as if they were copied between kernel and application buffers as part of an I/O operation. This model gives great flexibility to application programmers, but its implementation usually requires an actual copy operation. This copy operation, as well as the system calls and interrupts involved in executing an I/O request, introduces significant overhead that limits the effective bandwidth and throughput of I/O operations. Optimizing the copy operations in the general case without changing the application I/O interface is difficult. The operating system overhead associated with I/O operations not only limits the throughput of individual requests but also restricts the ability of applications to exploit latency hiding techniques to improve system throughput.
Recently proposed storage and I/O architectures address the bandwidth limitation of current I/O and memory buses by distributing I/O devices on a scalable system area network, and placing the responsibility for administrative oversight on a separate access control server. Data transfers can be performed directly between I/O devices and clients, bypassing the memory system bottleneck of traditional servers. By applying user-level communication techniques to I/O, InfiniBand is able to bypass the client operating system for many I/O requests. However, the connection-oriented design of the programming interface can lead to scalability problems, as the I/O network interface hardware needs to maintain state for every process.
User-level communication architectures minimize overhead by bypassing the operating system for message transfers. This can be accomplished by separating connection setup from message transfers. Connections are established under operating system supervision to perform protection checks and to communicate privileged information to the device hardware. Subsequent message transfers are initiated directly by the application using connection endpoint descriptors provided by the operating system. These endpoints are maintained and multiplexed by the network interface, which leads to complex and nonscalable hardware implementations.
Better scalability of such virtual I/O devices is achieved if the device context is treated as part of, or similar to, a process context. This design allows the operating system to swap an unlimited number of communication contexts onto a finite number of hardware contexts, similar to demand paging of physical memory or CPU time slices in a multiprogrammed system. However, triggering and handling device context switches introduces additional overheads and can negatively impact latency.
Operating system improvements targeted at high-performance I/O have addressed both data and control overhead while keeping the operating system involved in I/O requests. Reducing data transfer overhead may require drastic modifications to the programming interface, as applications and the kernel must cooperate closely in I/O buffer management. If the operating system is aware of the control path of I/O data, it is able to optimize these paths, significantly reducing or eliminating duplicate work performed in different modules and applying global optimization techniques. This optimization benefits mostly the latency of small data transfers since it does not necessarily eliminate copy operations.
Other research has focused on allowing applications to tailor both the I/O interface and the services that an operating system provides to their individual needs. Such extensible kernels can improve I/O performance significantly, but require a large effort from the programmer to provide the needed kernel extensions. Other optimizations targeting synchronization and exception handling are not directly applicable to I/O as they assume voluntary cooperation of applications and do not support asynchronous interrupts.
3. THE L-RSIM ARCHITECTURAL SIMULATOR
Execution-driven simulation is a valuable tool for evaluating new computer architectures, since it allows the researcher to modify virtually any part of a computer system without incurring the cost of developing new hardware. Designing a detailed execution-driven simulator is a trade-off between accuracy, simulation speed, and development effort. In many cases, assumptions are made that yield a simpler and faster implementation with sufficient accuracy, as long as these assumptions hold. For instance, it is often assumed that instruction cache misses have a negligible impact on performance and hence a perfect instruction cache is modeled. For similar reasons, many simulators ignore operating system and I/O activity. The resulting simulators are capable tools for evaluating new ideas for improving instruction-level parallelism or cache performance [19][77]. However, when it comes to workloads that exhibit a significant amount of operating system activity, multiprogramming or I/O, the assumptions made when such simulators were developed no longer hold, and a different approach must be taken.
To address these workloads, several full-system simulators such as SimOS [48] and SimICS [61] have been developed. These tools model I/O devices in such detail that an almost unmodified operating system can be booted in the simulation environment, which allows researchers to evaluate virtually any workload. However, to achieve acceptable simulation performance, the processor and cache models employed by these full-system simulators are often very simple, and do not simulate the complex interactions of a modern microarchitecture and memory hierarchy. In addition, the resulting simulation environment is a complex system consisting of the simulator and often a complete, off-the-shelf OS like Linux [20], making the learning curve for new users very steep. The L-RSIM simulator developed for this study combines detailed processor and cache models that are based on the RSIM architectural simulator [77] with the ability to simulate operating system effects and I/O device behavior [90].
3.1 Simulator Machine Model
The processor model implements a dynamically scheduled 32-bit SPARC V9 [104] processor with register renaming. The execution core uses a unified instruction issue window of configurable size, but it also allows further dispatch and issue restrictions for different instruction classes. The cache hierarchy includes L1 instruction and data caches of configurable size and associativity, a unified L2 cache, and a combining buffer for uncached stores. Cache coherency is maintained using the MESI protocol, with extensions for efficient DMA transfers such as read-current and write-purge bus transactions. The memory controller supports multiple banks of SDRAM memory.
The I/O subsystem consists of a PCI bridge that supports boot-time device detection and configuration as outlined in the PCI specification [81]. For simplicity, the PCI bus itself is not modeled. Instead, I/O devices are modeled as if they are directly attached to the system bus by configurable delay pipelines that approximate the latency of a PCI bridge. The real-time clock is modeled after the MOSTEK 48T02 clock chip [60]. It provides two independent interrupt sources with a programmable period ranging from 1 millisecond to 10 seconds. The SCSI adapter is compatible with the Adaptec AIC7770-based SCSI host adapter family [1]. It supports multiple outstanding requests, disconnects and request queueing. Each adapter controls one SCSI bus of configurable width and transfer rate. SCSI bus arbitration delay and idle times are modeled accurately, while data transfers always occur at the maximum synchronous transfer rate of the particular bus.
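The delay-pipeline attachment described above can be sketched as a fixed-depth circular buffer that makes a request visible to the device model only after a fixed number of cycles. The depth and the cycle-driven interface here are illustrative assumptions, not the L-RSIM implementation:

```c
#include <assert.h>
#include <string.h>

/* Sketch of a configurable delay pipeline: a request written by the
 * processor emerges at the device exactly PIPE_DEPTH cycles later,
 * approximating the latency a PCI bridge would add. */
#define PIPE_DEPTH 8            /* hypothetical bridge latency in cycles */

typedef struct {
    int slot[PIPE_DEPTH];       /* request IDs in flight, 0 = empty */
    int head;                   /* next slot to drain */
} delay_pipe;

void pipe_init(delay_pipe *p) { memset(p, 0, sizeof *p); }

/* Advance one cycle: accept a new request ID (0 for none) and return
 * the request that has completed its PIPE_DEPTH-cycle delay, if any. */
int pipe_step(delay_pipe *p, int incoming)
{
    int done = p->slot[p->head];     /* request reaching the device now */
    p->slot[p->head] = incoming;     /* emerges PIPE_DEPTH cycles later */
    p->head = (p->head + 1) % PIPE_DEPTH;
    return done;
}
```

This captures the one property the text relies on: the device sees each access a constant, configurable delay after the processor issues it.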
The modeled SCSI disk implements a prefetch and write cache of configurable size. The cache is divided into segments. Each of the segments can contain a stream of disk blocks starting at an arbitrary offset, up to the maximum size of the segment. After finishing a read request, the disk fetches blocks sequentially into the segment until either the segment is full or a new request arrives. If write caching is enabled, the disk buffers data in the cache and reports completion immediately. The actual write-back happens later when the disk is idle, when a new write request arrives and no write segment is available, or when a synchronize request is received. Blocks that are written to the disk are saved in a file on the simulation host, which allows the disk model to maintain its data across simulation runs. The internal operation of the disk drive as well as the methods for computing seek and transfer times are based on the very detailed descriptions by Ganger [41] and Lee [57].

Figure 8: L-RSIM Machine Architecture (CPU with split L1 instruction and data caches and a unified L2 cache on the system bus, a memory controller with DRAM banks, and a PCI bridge with PCI configuration space connecting the real-time clock and the SCSI adapter, which controls a disk on the SCSI bus)
The architecture of the modeled computer system does not represent any particular system. The processor uses the Sparc instruction set, but its microarchitecture resembles the MIPS R10000. This design choice was made in the original RSIM system, which is the basis of the current processor and cache models. Using the Sparc instruction set architecture opens the simulator to a variety of compilers and a large number of existing applications. Modeling a particular Sparc-based workstation would have required implementing the TLBs, interrupt delivery mechanisms and I/O subsystem in a way which might not always be the best choice for a dynamically scheduled processor. In addition, the goal of the design effort was to provide enough detail to simulate the effects of I/O and operating system activity, rather than to boot an existing operating system. The co-development of the simulator and operating system also allowed easier testing of new features while they were implemented. Most importantly, the simulator is intended to represent a class of machines and not necessarily one particular system. By developing a new machine architecture it was easier to avoid the idiosyncrasies of a particular system, which might affect the validity of future results.
3.2 LAMIX Kernel
The processor and device models provide enough detail to simulate a fully functional UNIX-compatible operating system with I/O subsystem, including system calls, buffer cache, device drivers and interrupts. Parts of the operating system are designed from scratch, based on Linux [20] and BSD [66] source code, while the filesystem and device driver code is taken from the NetBSD sources. The kernel is compatible with the 32-bit subset of Solaris, which means it is able to run executables compiled for Solaris without modification. Internally, the kernel is structured similarly to BSD 4.4. It implements multiprogramming, signal handling, process synchronization and shared memory, as well as a complete filesystem with vnode layer, a fixed-size file cache and the native BSD filesystems FFS and LFS. A new HostFS filesystem gives simulated applications access to files on the simulation host.
The hybrid design of the kernel simplified the simulator design drastically. Rather than having to understand a large portion of an existing OS such as Linux or BSD in order to port it to the simulator, the kernel was built up incrementally as the simulator became more detailed. For instance, early versions did not support context switches or time-related system calls. These features were only added when the real-time clock model was implemented. In addition, since the LAMIX kernel is targeted specifically at file system and disk I/O, it is less complex and easier to understand and modify than a complete off-the-shelf OS.
3.3 Simulator Validation
When designing and using a complex simulation environment with various hardware models and software components, care must be taken that the resulting system faithfully models not only the logical behavior but also the performance of existing computer systems. Although simulators are mostly used to evaluate different options in a hypothetical architecture, the simulation environment should be able to approximate the performance of a current system as closely as possible. Such validation increases confidence in the results of research that uses the simulation environment.
One way to validate a simulator is to run a wide variety of workloads on a real system and on the simulator under similar conditions and compare the execution times. In addition, further details such as the number of cache or TLB misses can be used, although this is less important if different workloads are known to stress different parts of the system.
The validation methodology used in this study consists of three steps. In the first step, basic architectural parameters are measured and compared to the system used for validation. If necessary, simulator parameters are changed to match the validation system. In the second step, basic operating system performance is measured. Assuming that the architectural parameters of the two systems match, these results show how closely the simulated operating system models the validation system. In the third step, real applications are simulated and executed and the total runtime for the two systems is compared. These applications should be sufficiently diverse to validate all performance aspects of the system.
This study uses LMBench [67] to validate basic architectural parameters and to find appropriate settings for unknown system parameters such as the memory controller configuration and the L2 cache latency. In addition, LMBench is a convenient tool to validate operating system parameters such as system call latencies, copy bandwidth and file system performance. A variety of applications from the SPEC 2000 [45] benchmark suite and the Andrew filesystem benchmark are then used to validate the simulator with realistic workloads.
An SGI Octane workstation running at 175 MHz acts as the reference platform for the validation. This system was chosen because its machine architecture is similar to the simulator architecture, although it uses a different instruction set and operating system. On both platforms, the LMBench tools were compiled using the native compilers in 32-bit mode with -O2 optimization and no other system-specific optimizations enabled. For this reason, the results for the SGI Octane may differ from other published numbers. The SGI workstation uses a 9 Gbyte IBM Ultrastar 9ZX hard disk [103] for local storage, while the simulated disk is configured with identical seek time and platter configuration parameters.
3.3.1 Memory Hierarchy Validation
Memory latency is measured using a pointer-chasing microbenchmark. The benchmark sets up an array of pointers of a specified size and stride that is walked backwards in an unrolled loop. Depending on the array and stride size, the measured latency corresponds to the L1, L2 or main memory latency. The graph in Figure 9 shows plots of the memory load latency for 256-byte strides for both platforms, and Table 5 summarizes the latency ratios for all measured strides. Note that since the stride is larger than the L2 cache line size, memory accesses exhibit no spatial locality and the test reveals the maximum latency for the memory system. The graphs show two distinct steps when the array size crosses the L1 and L2 cache size. The plateaus in between the steps correspond to the load latency for the respective level of the memory hierarchy. Results for other strides show similarly strong correlations, but are omitted here for brevity.
Figure 9: LMBench Memory Load Latency for 256-byte Stride (latency in ns versus array size from 1 Kbyte to 16 Mbytes for L-RSIM and the SGI Octane; the plateaus lie at roughly 11-12 ns, 66-78 ns and 550-565 ns)

Table 5: LMBench Average Latency Ratios

Stride    Latency Ratio lat(L-RSIM) / lat(Octane)
16        1.0957
32        0.9939
64        0.9839
128       1.0079
256       1.0026
512       0.9989
1024      1.0063
Figure 10 shows the read bandwidth for different array sizes with unit stride, and Table 6 contains the geometric means of the bandwidth ratio bw(L-RSIM) / bw(Octane) of the two systems for various read and write benchmarks. The two curves do not match as well as in the latency experiment, most likely because of load instruction issue restrictions or cache port contention in the R10000 that are not modeled in the simulator. The fact that the simulator is generally faster than the validation system means that the overhead of data transfers will likely be underestimated by the simulator.

Note that the bcopy and bzero tests use the native C-library routines, which are coded very differently for the two platforms. The Irix/MIPS versions are unrolled several times whereas the basic Solaris library performs one 32-bit memory access per loop iteration. An optimized version that is unrolled and uses floating point registers for 64-bit memory accesses is also available, but was not used for these tests, because it is even more optimized than the SGI version and would not provide a better correlation between the two systems.
Figure 10: LMBench Memory Read Bandwidth (bandwidth in Mbyte/s versus array size from 1 Kbyte to 16 Mbytes for L-RSIM and the SGI Octane)
3.3.2 Disk Validation
LMBench measures disk seek times and read bandwidth by reading from the raw device at various offsets. The resulting graphs show a large variation for seeks of similar distance due to varying rotational delay, which depends on when the read request was issued with respect to the target block position. Figure 11 shows cleaned-up versions of the seek time and bandwidth curves. The seek time curves match very well for the two systems, reflecting the accuracy of the seek time approximation used in the simulator model, which is based on work by Ganger [41] and Lee [57].

The read bandwidth of the simulation model is constant and does not follow the downward slope of the real disk because the model does not implement zones with different sector counts per cylinder. Instead, it uses a simple constant sectors-per-cylinder geometry, which results in a constant read bandwidth for all cylinders.
Table 6: LMBench Average Bandwidth Ratios

Parameter            Bandwidth Ratio bw(L-RSIM) / bw(Octane)
libc bcopy           0.9314
unrolled copy        1.0964
read                 1.0479
partial read         1.0041
write                1.2620
partial write        1.4995
partial read/write   1.4964
libc bzero           0.8179
3.3.3 Operating System Validation
Operating system performance is an important part of overall system performance. LMBench measures the latency of a number of basic system calls. Table 7 shows the individual results as well as the latency ratios for the two platforms.
The empty system call latency correlates well, despite the significant organizational differences of the two operating systems. This indicates that the LAMIX system call entry and exit code introduces overheads similar to the IRIX code. However, the remaining system calls do not correlate to the same extent, due to the different internal organization of the two operating systems. Generally, filesystem calls are significantly slower in IRIX. Only the 'stat' call and 'select' with large numbers of file descriptors produce comparable results. Since memory latencies and bandwidth show a much better correlation between the two systems, and these system calls do not move any significant amount of data, the difference can be explained only by the different code paths. Running the same tests on different Sparc systems also shows widely varying results, confirming this hypothesis. For instance, a nominally slower Ultra-1 workstation is faster than L-RSIM for the 'stat' and 'open/close' tests, and a 450 MHz Ultra-60 has higher latencies for the 'read' and 'select' tests. These results show that the LAMIX kernel cannot realistically be compared to commercial operating systems, due to the widely varying kernel organizations. Ideally, the reference system should be running a BSD variant, but at the time of this work no BSD port for MIPS R10000-based workstations existed. However, the results indicate that the simulator and its kernel are able to at least approximate the performance of commercial UNIX variants. LAMIX is frequently faster than its multithreaded commercial counterparts, leading to pessimistic results when it is used to evaluate the impact of operating system overhead on performance.

Figure 11: LMBench Disk Seek Time and Read Bandwidth (seek time in ms and read bandwidth in Mbyte/s versus seek distance in cylinders, for L-RSIM and the SGI Octane)

Table 7: System Call Latencies in Microseconds

System Call                  L-RSIM     Octane     lat(L-RSIM) / lat(Octane)
empty system call            2.5522     2.6097     0.9780
read from /dev/zero          6.6345     10.3270    0.6424
read from local disk         7.6376     26.7427    0.2856
write to /dev/null           6.3088     12.2352    0.5156
stat                         42.9370    49.4505    0.8683
fstat                        4.5793     7.9985     0.5725
open/close                   50.6481    74.0685    0.6838
select on 10 fds             7.7895     11.0776    0.7032
select on 60 fds             31.6243    38.7986    0.8151
signal handler installation  2.857      8.152      0.3505
signal handler overhead      8.622      31.856     0.2707
LMBench measures file read bandwidth by repeatedly reading files of different sizes. For small to moderate sizes, the file is resident in the buffer cache and read bandwidth is dominated by memory copy bandwidth. The correlation between the two systems is not as good as for the simple copy tests. The geometric mean of the file read bandwidth ratio bw(L-RSIM) / bw(Octane) is 1.4726, and improves slightly to 1.3946 when the open and close overhead is included. This discrepancy is most likely due to different implementations of the kernel-to-user-space copy routine. The LAMIX routine uses double-word loads and stores for large transfers, whereas the implementation details of the copy routines in IRIX are unknown.
Filesystem performance is measured as the number of file creations and deletions per second for varying file sizes, and is shown in Table 8.

Table 8: File System Performance in Creations/Deletions per Second

File Size    L-RSIM    Octane     Ultra-1/140    Ultra-60/450
0 Kbytes     68/120    528/350    38/87          56/119
1 Kbytes     37/73     210/247    37/35          63/59
4 Kbytes     35/66     207/229    28/37          52/63
10 Kbytes    27/73     218/227    27/38          48/68
XFS is a modern high-performance journaling 64-bit filesystem developed by SGI. Not surprisingly, file creation and deletion on XFS is significantly faster than on the original FFS, since this was one of the design goals. To make a more realistic comparison with other FFS implementations, Table 8 also includes results for two different Sparc platforms running Solaris. The L-RSIM/LAMIX performance is comparable to the 450 MHz Ultra-60, even though the Sun workstation is nominally faster than the simulated 200 MHz system, indicating that the LAMIX FFS implementation is closer to Sun's implementation than to SGI's.
3.3.4 SPEC 2000 Validation
A number of SPEC 2000 programs are used to validate the simulation system using realistic workloads that exercise both the processor core and memory hierarchy. The SPEC 2000 suite was chosen because it contains a wide variety of portable applications that are well understood by researchers. These applications range from compression and integrated circuit optimization to face recognition, and are written in several different programming languages. For the validation process, the training data sets provided with the benchmark suite were used, with the exception of lucas and sixtrack, for which the test data sets were used to keep the simulation time manageable. Table 9 summarizes the total runtime of the SPEC 2000 applications in seconds on the simulator and the Octane workstation.
SPEC 2000 applications are much more dependent on compiler optimization than LMBench. It was found that when using equivalent optimization levels, the MIPS compiler produces significantly faster code than the Sparc compiler. This effect is particularly pronounced for the floating point codes of SPEC 2000. To achieve qualitatively comparable executable code, the MIPS binaries are compiled with optimization level 2 and interprocedural optimization enabled, while the Sparc binaries are compiled with optimization level four.
The application runtimes generally show good correlation between the two systems, with a few exceptions. For applications with extremely large working sets such as mcf, the different page sizes of the two systems can lead to performance discrepancies due to TLB hit rate differences. The performance of the floating point application swim is generally considered to be very compiler dependent, and the widely differing runtimes in this experiment confirm this observation.

Table 9: SPEC 2000 Runtime

Benchmark         L-RSIM    Octane    t(L-RSIM) / t(Octane)
gzip              172.0     171.0     1.0058
mcf               235.0     154.0     1.5260
parser            47.0      41.0      1.1463
eon (kajiya)      49.0      50.0      0.9800
eon (cook)        9.6       9.5       1.0105
eon (rushmeier)   13.6      14.0      0.9714
gap               39.2      44.0      0.8909
vortex            87.0      72.0      1.2083
twolf             77.0      60.0      1.2833
wupwise           198.0     198.0     1.0000
swim              147.0     96.0      1.5313
mgrid             158.9     122.0     1.2951
applu             71.4      88.0      0.8114
art               95.0      80.0      1.1875
lucas (test)      26.0      41.0      0.6341
sixtrack (test)   45.2      44.0      1.0273
apsi              81.5      85.0      0.9588
3.4 Summary
The L-RSIM simulation environment used in this study combines a detailed dynamically scheduled processor model with a sufficiently detailed I/O subsystem to execute a realistic kernel. This unique combination of features allows researchers to explore I/O related performance effects in great detail, while keeping the complexity of the simulation system more manageable compared to other full-system simulators.
Validating the accuracy of the simulation system ensures that effects measured in the simulator represent real system behavior, thus increasing confidence in results obtained using the system. Microarchitectural parameters such as cache latency and bandwidth show very close correlation to an SGI workstation, indicating that the hardware models provide enough detail to accurately capture the major effects of modern microarchitectures. Operating system performance does not correlate as well between the simulator and the SGI workstation, because the internal organization of the two operating systems is too different. However, in most cases the simulator OS performs faster than IRIX. As a result, performance improvements measured with the simulator can be assumed to be pessimistic. The sometimes widely differing application performance when running SPEC 2000 points to the difficulty of comparing systems with different instruction set architectures and compilers.
The L-RSIM simulation system is used in this study both for detailed measurements using microbenchmarks and to measure overall system performance under a variety of applications. The generally good correlation found when validating the simulator gives confidence that the demonstrated performance improvements carry over to real systems, while the validation results show that the existing modeling error leads to generally pessimistic results when measuring the operating system overhead.
4. USER-LEVEL I/O ARCHITECTURE
Most I/O requests consist of three distinct phases: request initiation, data transfer and request completion. Each of these phases incurs overhead by using processing resources that are made unavailable to the application. If the kernel mediates access to the I/O device, entering into and exiting from the kernel in a protected way consumes considerable time. This context switch also replaces cache and TLB entries that need to be reloaded by the application. Data transfers between the I/O device and memory or between kernel and user space often constitute the largest component of I/O overhead. Using the host processor to copy data uses up many processing cycles and leads to cache and TLB pollution as data is moved across the cache hierarchy twice. Request completion is typically signaled via an interrupt, which is at least as disruptive for application performance as a system call, since it saves and restores comparable amounts of state.
This dissertation introduces a user-level I/O architecture that addresses all three sources of I/O overhead by eliminating the operating system almost completely from common I/O operations. The architecture is designed for a next-generation distributed I/O system with a scalable system-area network connecting client systems to autonomous I/O devices, as shown in Figure 12. The UIO device acts as a network interface to the I/O network. Novel hardware features in the host processor and the UIO device implement low-overhead mechanisms for atomic user-level device access, user-space data transfers and user-level notifications. I/O requests are issued from the application directly to the device with the support of a hardware buffer structure that combines the request arguments and transmits them atomically to the device. Data transfers occur directly to and from user space, eliminating the need for the host processor to use its valuable compute resources for this task. Completion notifications are handled almost completely by the application, enabling it to perform fine-grain user-level thread scheduling or other synchronizations with very low overhead.
Reducing I/O overhead allows applications to more efficiently overlap long-latency I/O operations with unrelated computation, thus improving overall throughput. Lower per-request cost enables application software to take full advantage of the bandwidth available from distributed I/O systems such as network-attached disks and to efficiently employ latency hiding techniques to maximize throughput. Bypassing the operating system for common I/O requests means that application performance is less sensitive to the growing gap between application and operating system performance.

Figure 12: User-level I/O Architecture (client systems, each with CPUs, memory, an I/O bridge on the system bus and a UIO device on the I/O bus alongside local I/O devices, connected through the I/O network to autonomous I/O devices and an I/O manager)
At the same time, the user-level I/O architecture maintains the level of protection found in most modern operating systems. The system provides separate address spaces to protect processes and the operating system from inadvertent or malicious accesses to invalid memory regions. Furthermore, the I/O architecture provides a level of programming flexibility usually not found in communication or I/O architectures that bypass the operating system, by allowing programmers to perform I/O operations to arbitrary memory locations without the need to pin or preallocate buffers. As a result, many applications can realize performance improvements without extensive software modifications, as user-level libraries can provide conventional programming interfaces with small additional overhead.
4.1 UIO Architecture Overview
The distributed I/O system is composed of client systems that connect to the network via a UIO device, a number of network-attached autonomous I/O devices, and an I/O manager overseeing access rights to the devices. I/O devices perform necessary low-level management tasks locally on the device controller. For instance, network-attached secure disks map byte streams to disk blocks and export an object interface to the clients. In addition, network-attached I/O devices perform protection checks for every I/O request, in coordination with a global I/O manager. To access a feature on a remote device, a client requests an encrypted capability from the I/O manager. Using the capability to identify itself and its permissions, the client issues requests directly to the remote device. Since this organization assumes that devices do not trust remote clients, removing the operating system from the client I/O path does not weaken the security model.
The network is a scalable system area network, such as Myrinet [15] or ServerNet [49]. The only assumption that the architecture makes is that the network error rate is sufficiently low that end-to-end error detection and correction is not necessary. Many system area networks operate successfully with this assumption, since the electrical characteristics of cables and connectors can be tightly controlled and wire lengths are limited. In addition, per-hop link-level error detection and correction can be implemented in hardware in the network interfaces and routers to further reduce the error probability [54].
The UIO device is at the core of the user-level I/O architecture. This device acts as a low-overhead user-level network interface to the I/O subsystem. Its design, in combination with certain processor features, provides protected and efficient user-level access to the remote I/O devices. The key components of the UIO architecture are shown in Figure 13. The conditional store buffer in the processor bus interface implements nonblocking user-level synchronization and flow control; it is used to atomically transfer the request arguments from applications to the UIO device. The design and evaluation of the conditional store buffer is described in more detail in Chapter 5. User-space data transfers are facilitated by the device TLB, which performs virtual-to-physical address translation and protection checks. The device TLB characteristics, its integration with the host operating system and different TLB miss handling mechanisms are discussed in Chapter 6. User-level notifications are delivered directly to the application by the host operating system based on information provided by the UIO device at the time the notification interrupt is triggered. Details of the notification mechanism are described in Chapter 7. The remainder of this chapter introduces the software interface of the UIO device, describes the basic operation of the architecture and gives an overview of a possible UIO device implementation.
The local UIO device determines how application programs communicate with the remote device at the lowest level. However, inherent differences in the object interfaces exported by different device classes are visible to application software and should be dealt with in system libraries. In addition, only the performance-critical subset of the I/O operations needs to be issued directly by user-level software. This includes any I/O operation that transfers a significant amount of data, such as storage read and write operations. Infrequently performed I/O operations such as network connection establishments or storage object metaoperations can involve the client operating system to hide some of the low-level details from applications. However, involving the client operating system does not change the basic security model, as remote devices generally do not trust client systems, regardless of the protection level of the initiating software entity.

Figure 13: User-level I/O Architecture Components (the processor's conditional store buffer (CSB) issues uncached stores carrying requests to the UIO device; the device contains send and receive engines connected to the I/O network, a TLB for DMA data transfers and a notification queue, and raises an interrupt to deliver notifications)
4.2 Application Interface
The basic structure used to communicate between application software, the UIO device and the remote devices is a request packet, or request structure. This structure contains all information needed to handle a request, both locally at the client and at the remote device. The request structure is passed from the application to the UIO device and then forwarded to the remote I/O device. When the I/O request completes, the structure is returned to the UIO device and eventually reaches the application in the form of a notification. Since the remote device returns the same structure when the request completes, the UIO device does not need to store any state while a request is pending. Having software provide all relevant information with every request slightly increases the amount of data transferred with every request, but it facilitates a simple and scalable UIO device design since no state information needs to be maintained by the hardware. In addition, the stateless device design simplifies failure recovery as no state needs to be reconstructed or synchronized in the event of a transient failure. The slight increase in per-request data transferred between software and the UIO device does not affect performance as long as the entire request structure does not exceed the size of a cache block, since that is the natural transfer size on the system bus. Table 10 lists the components of the UIO request structure, separated into remote and local information.
Remote information such as the capability, command and arguments is used only by the remote device; the UIO device simply forwards it. The capability is an encrypted value that identifies both the object to operate on as well as the access rights of the client for that object. It enables the remote device to check if the client is allowed to perform the requested operation. For instance, in the case of network-attached disks, a capability specifies which disk object to access and whether the client is allowed to read, write or delete it. Since the capability is encrypted when it is granted by the I/O manager, clients are not able to bypass the security checks of the remote devices by modifying the capability. The encryption mechanism is established dynamically between the I/O manager and the remote devices. Other remotely used information includes the desired operation and any other arguments and flags. For instance, when writing to a remote storage object that represents a file, additional arguments may include the amount of data and possibly a flag requesting synchronous operation. The return status field is used by the remote device to indicate if the operation was successful, similar to the return value of a system call. The format and coding of all the remotely used information must be standardized, so that different client systems can communicate with remote devices without regard to which particular device type they access. Such standard command sets are similar to existing SCSI or ATA commands used on local storage buses, but generally operate at a semantically higher level.

Table 10: UIO Request Structure

Group                Name                   Description
remote information   capability             identifies remote object and client's access rights
                     command                desired operation
                     arguments and flags    additional operation-specific arguments
                     return status          completion status of operation, returned to client
local information    context ID             identifies process to device
                     buffer & length        buffer address and length for data transfers
                     notification buffer    application-space buffer for notification
                     notification handler   user-level routine called during notification
                     request argument       application-defined value to identify request
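The logical contents of Table 10 could be expressed as a C structure along the following lines. All field names and widths here are assumptions for illustration; the text specifies only the fields' roles and the constraint that the whole structure fit in one cache block:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Illustrative layout of the UIO request structure (field widths assumed). */
typedef struct {
    /* remote information: forwarded to the remote device unmodified */
    uint64_t capability;       /* encrypted object id + client access rights */
    uint32_t command;          /* desired operation                          */
    uint32_t flags;            /* operation-specific flags                   */
    uint64_t argument;         /* operation-specific argument (e.g. length)  */
    int32_t  status;           /* completion status, filled in by the device */
    /* local information: consumed by the UIO device and the notification */
    uint32_t context_id;       /* identifies the issuing process             */
    void    *buffer;           /* user-space data buffer                     */
    uint32_t length;           /* data buffer length in bytes                */
    void    *notify_buffer;    /* where the returned structure is deposited  */
    void   (*notify_handler)(uint64_t);  /* user-level completion routine    */
    uint64_t request_arg;      /* application-defined request identifier     */
} uio_request;
```

On a 64-bit host this layout occupies well under 128 bytes, consistent with the requirement that the structure not exceed a cache block so it remains the natural transfer size on the system bus.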
Arguments in the local group in Table 10 are used only by the UIO device. However, since some of them are used when the request returns to the application, they are also forwarded to the remote device and returned upon completion. These arguments include a unique process identifier that enables the UIO device to perform local protection checks on the buffer addresses and to deliver completion notifications to the correct process. A buffer address and length is used by requests that transfer data to or from the remote device. This includes read and write requests as well as status inquiries that return data structures. The notification buffer address is a location in application address space where the request structure is deposited when it is returned from the remote device, so that the application can inspect the return value. The notification handler is a user-level routine that is executed when the request completes; it can perform almost arbitrary synchronization operations within the application. The request argument is an application-defined value that can be used to uniquely identify the request to the notification handler. For instance, in a user-level thread library this value may be the ID of the thread blocking on the I/O request, or it may be a pointer to the thread structure. The value is used neither by the UIO device nor by the remote I/O device; it is completely under application control. Theoretically, this value is redundant, since the notification buffer address should be unique and can be used to identify the request. However, allowing software to provide a more convenient identifier can simplify the notification handler.
4.3 Basic UIO Operation
To initiate a request, software assembles all required arguments into a request structure and sends it to the local UIO device. This process occurs in user space without invoking the operating system through a system call. Low-level details such as the address of the UIO device and the layout of the UIO request structure can be hidden by a library. After initiating the I/O request, the application continues executing since no kernel scheduling operation was involved. These nonblocking I/O requests enable the application to optimally overlap long-latency I/O operations with independent computation. A user-level thread library, for instance, can initiate a low-overhead thread switch and thus preserve the traditional blocking I/O programming model for individual threads, while avoiding the overhead of kernel support for nonblocking I/O.
After receiving the request structure, the UIO device performs any necessary DMA read operation (e.g., for a write request) and forwards the request and associated data to the remote I/O device. That device performs the applicable protection check, executes the request and returns the request structure with the return status, as well as any requested data, to the original UIO device. After writing any returned data into the application buffer, the UIO device deposits the request structure in the prearranged notification buffer where the return status can be inspected by the application. The same request structure is written to an externally visible host processor register to deliver the notification by executing the user-level notification handler specified with the request.
The basic programming model of the user-level I/O architecture combines nonblocking requests with a symmetric communication model. Each request receives a response, indicating that the request completed. This model applies well to passive I/O devices such as storage and to many output devices like frame buffers. These devices only operate when triggered by the host processor, making them a good match for the symmetric communication model. Other devices such as network adapters and many input devices operate when receiving external triggers and invoke software assistance through interrupts. This work focuses on storage device I/O, since the interface and functionality of network-attached disks are more clearly developed than for other high-performance I/O device classes.
4.4 UIO Device Architecture
The UIO device serves as a user-level I/O network interface. Figure 14 shows a possible implementation of a UIO device. The device acts as both a bus slave and bus master, as it accepts and responds to host processor read and write requests and performs DMA data transfers. Application software writes request structures to the device request queue via the conditional store buffer in the host processor, where they are buffered until processed by a transmit DMA engine. This simple interface reduces the complexity of the transmit state machines significantly. Requests are handled by one or more transmit DMA engines, which perform any necessary DMA read operation (e.g., for a write request) and forward the request structure and any data to the transmit buffer. DMA operations are translated and verified by the device TLB. Equivalent to a processor TLB, the device TLB translates virtual buffer addresses provided by applications into physical addresses, and performs access protection checks for every bus transaction. Replies from remote devices are handled by a set of receive DMA engines. These DMA engines write data into application buffers if applicable and deposit the returned request structure in a notification queue from where it is sent to the host CPU for processing.
The design decision to combine all request arguments in a common structure transferred between application software, the local UIO device and the remote device facilitates a simple hardware organization of the UIO device, compared to many other user-level network interfaces. The device stores the request state only for the duration of the local DMA operation, thus eliminating the need for large storage areas that have to be managed by the hardware or the operating system. Also, the device hardware does not impose any limits on the number of outstanding requests, the number of active processes or the amount of data involved in requests.

Figure 14: UIO Device Structure (request and notification queues on the I/O bus side, transmit and receive DMA engines with their transmit and receive buffers, the device TLB, and bus master/slave interfaces connecting to the network Tx/Rx logic on the I/O network side)
The depth of the request queue determines the number of requests that software can issue before new requests are rejected. Read requests do not require DMA transactions during transmission and hence occupy a DMA engine for only a short amount of time. Since in many environments I/O read requests dominate writes, even a shallow request queue should be able to provide sufficient buffering.
The optimal number of DMA engines and details of the buffer management scheme depend on the particular network architecture and flow control mechanism. If the network provides virtual channels, multiple DMA engines can take advantage of the available parallelism and process requests concurrently. Similarly, with networks that transmit data streams as a sequence of independently routed packets, multiple DMA engines can interleave requests on the network and should be able to hide main memory access latencies. The transmit buffer should provide the abstraction of one buffer per DMA engine, either statically or dynamically partitioned. Static partitioning implies a very simple management scheme where essentially the buffer is managed by the DMA engine. Dynamic partitioning eliminates inefficiencies due to fragmentation but requires higher management complexity. However, since only one DMA operation can return data from the I/O bus at a time, only one buffer port is needed, regardless of the partitioning scheme. The amount of buffering required depends on the flow control granularity. Each buffer must be able to hold enough data to continue transmitting data in the presence of a DMA stall until flow control can take effect.
The receive path of the UIO device is almost symmetric to the transmit path, with a unified or distributed buffer feeding data into a set of DMA engines. However, due to the notification scheme, the DMA engine control is slightly more complex. The notification queue holds the request structure that was returned from the remote device. It is used to decouple the DMA engines from the notification mechanism, which is flow controlled by the host processor. Considering that the amount of data returned per notification is often significant, the queue does not need to be very deep to provide the desired decoupling effect.
4.5 Summary
The user-level I/O architecture reduces I/O overhead in the context of a distributed I/O architecture by almost completely eliminating the operating system from common I/O operations. By providing user-level access to I/O devices and transferring data directly to and from application buffers, host processor occupancy for I/O requests is minimized. The overhead reductions allow applications to efficiently overlap long-latency I/O operations with other work and result in higher system throughput and I/O bandwidth. At the same time, the architecture maintains a level of process protection found in operating systems such as UNIX and provides flexibility by not restricting programmers to pinned or otherwise preallocated buffer spaces.
The local UIO device acts as a network interface to a distributed I/O system. Together with the host processor, it implements a simple low-overhead interface to initiate I/O requests and deliver notifications. User-level I/O requests are described by a request structure that contains all of the information needed to process the request both locally and at the remote I/O device. The structure passes through the local and remote I/O devices as the request is processed. This design significantly reduces the hardware complexity of the UIO device, as request state needs to be stored only for the duration of a DMA operation. The following chapters discuss the individual UIO mechanisms in greater detail and provide comparative overhead measurements for a variety of alternative mechanisms.
5. ATOMIC DEVICE ACCESS
The goal of the user-level I/O architecture to minimize overhead by bypassing the operating system for performance-critical I/O requests requires changes to all phases of an I/O transaction. Although data transfers are frequently the dominant source of I/O overhead, the control overhead involved in initiating I/O requests can often be considerable as well, especially for small transfers or low-latency requests.
When initiating an I/O request, software must ensure that the individual arguments and parameters are communicated to the device atomically with respect to other entities accessing the device. Many I/O devices, such as SCSI host adapters, export a set of control registers into which software writes the request arguments via a sequence of uncached stores. Writing to a dedicated register triggers the device to start processing the request. If multiple entities write to these registers simultaneously, arguments of different requests are interleaved, leading to invalid requests. In addition, software must ensure that the device is able to accept the request. Many devices provide status registers for this purpose that are checked by software before a request is initiated to detect flow control situations.
If I/O requests are initiated only by entities within the operating system kernel, both atomicity and flow control are achieved through software synchronization mechanisms. Disabling interrupts and acquiring a lock prevents other entities within the kernel from accessing the device. The resulting critical region allows the device driver to atomically check the device status and write the required arguments to the device control registers.
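As a concrete illustration of this conventional path, the following C sketch models the kernel-mediated sequence: check a status register, write the arguments via stores, and trigger the device with a doorbell write. The register names, the DEV_READY flag, and the plain struct standing in for memory-mapped registers are all assumptions for illustration, not an actual device interface.

```c
#include <stdint.h>

/* Plain struct standing in for memory-mapped device registers;
   register names and the DEV_READY flag are illustrative assumptions. */
struct dev_regs {
    volatile uint32_t status;   /* device flow-control status        */
    volatile uint32_t arg0;     /* request argument registers        */
    volatile uint32_t arg1;
    volatile uint32_t doorbell; /* writing here starts the request   */
};

#define DEV_READY 0x1

/* Must run inside a critical region (interrupts disabled, lock held)
   so the store sequence is atomic with respect to other kernel entities. */
int issue_request(struct dev_regs *dev, uint32_t a0, uint32_t a1)
{
    if (!(dev->status & DEV_READY))  /* flow control: device not ready */
        return -1;
    dev->arg0 = a0;                  /* sequence of uncached stores    */
    dev->arg1 = a1;
    dev->doorbell = 1;               /* dedicated register triggers processing */
    return 0;
}
```

The interleaving hazard described above is exactly what happens if two callers reach the argument stores without the surrounding lock.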
Restricting access to I/O devices to the device driver inside the kernel provides both atomicity and flow control, but such a design incurs significant overhead for every I/O transaction. In multiuser operating systems with protected address spaces, the system call entry code saves a large amount of process state on the kernel stack before executing the actual system call routine. When returning to user mode, the system call checks if a signal is pending for the current process, and if the scheduler needs to be invoked, before restoring the process state. These activities not only consume many CPU cycles but also have significant impact on caches and TLBs, which leads to degraded application performance after the system call is complete.
Eliminating system calls from the I/O request initiation helps to reduce I/O overhead. If applications are able to directly transfer the request control information to the device hardware without involving the operating system, multiple entities are competing for device access without global synchronization. Since user-level processes can be preempted at any time and cannot be trusted to perform synchronization that implements globally atomic sections, a low-overhead hardware mechanism is needed to provide atomic user-level device access. This chapter introduces the conditional store buffer, a software-controlled combining buffer that implements nonblocking synchronization and flow control.
5.1 The Conditional Store Buffer
The conditional store buffer (CSB) is a software-controlled, uncached, combining hardware buffer that allows I/O stores to be combined into a single bus transaction up to the maximum size of one cache line. It reduces I/O overhead with a minor increase in hardware complexity by implementing nonblocking atomicity and flow control at the user level. The CSB permits user-level code to explicitly control which stores will be combined and when the combined set of stores will be issued on the system bus. This guarantees that the sequence of combined store instructions is atomic and provides the necessary exactly-once semantics for the resulting bus transaction. In addition, the conditional store buffer provides a nonblocking flow control mechanism to reliably inform applications if requests are issued at a faster rate than the I/O device can process them.
Figure 15 illustrates the resulting system model. It shows a block diagram of a conventional system with a two-level cache hierarchy and a typical uncached load and store capability that has been enhanced with the addition of a CSB. Side-effect-free uncached loads and stores can be handled in the normal manner, while dedicated combining store instructions are handled by the CSB.
Figure 15: Architectural Model with Conditional Store Buffer (the processor with its L1 and L2 caches is augmented with a conditional store buffer alongside the conventional uncached buffer; the bus interface connects over the system bus to main memory and I/O devices)
5.1.1 Conditional Store Buffer Design
Figure 16 shows the structure of the conditional store buffer. The data buffer provides space for one cache line worth of data and the cache-line-aligned physical address of the most recent combining store. The hit counter implements the nonblocking conditional flush operation. It counts the number of consecutive stores that have been issued by a process without conflict.
Conflicts are detected in two different ways. When the buffer receives a combining store, it compares the destination address with the value that has been saved from the previous store instruction. On a match, it stores the data in the appropriate slot and increments the hit counter. If the comparison fails, the buffer is cleared, the hit counter is set to one, and the new data are stored.
Figure 16: Conditional Store Buffer (a data buffer with an address register and a hit counter; store and conditional flush instructions arrive from the CPU core, and the combined transaction is sent to the system interface)
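The conflict check just described can be expressed as a small behavioral model in C. This is a software approximation of the hardware, with invented names and an assumed 64-byte line size, not the actual implementation.

```c
#include <stdint.h>
#include <string.h>

#define CSB_LINE 64  /* assumed cache line size */

/* Behavioral model of the conditional store buffer state. */
struct csb {
    uint64_t addr;            /* line-aligned address of the last store */
    uint32_t hits;            /* consecutive conflict-free stores       */
    uint8_t  data[CSB_LINE];
};

/* A combining store: on an address match, buffer the data and bump the
   hit counter; on a mismatch, clear the buffer and restart at one. */
void csb_store(struct csb *b, uint64_t addr, uint8_t val)
{
    uint64_t line = addr & ~(uint64_t)(CSB_LINE - 1);
    if (b->hits == 0 || line != b->addr) {
        memset(b->data, 0, CSB_LINE);  /* conflict: clear and restart */
        b->addr = line;
        b->hits = 1;
    } else {
        b->hits++;
    }
    b->data[addr & (CSB_LINE - 1)] = val;
}

/* An interrupt or context switch clears the buffer and hit counter. */
void csb_clear(struct csb *b)
{
    memset(b->data, 0, CSB_LINE);
    b->hits = 0;
}
```

A conditional flush then succeeds only if the expected count supplied by software equals `hits` and the flushed address equals `addr`, mirroring the checks described in the text.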
To achieve interprocess atomicity, the conditional store buffer is cleared under any condition that may lead to a process context switch. This includes a number of traps and all external interrupts. Error traps or system calls may suspend or terminate the current process and switch to a different process, or perform a context switch if the time slice of the current process is exhausted. Similarly, external interrupts such as clock ticks or SCSI interrupts may invoke the scheduler and lead to context switches. When such an event occurs, the CSB hit counter is reset to zero and the data buffer is cleared.
At the end of the uncached store sequence, when all request arguments have been written, the application issues a conditional flush instruction to indicate that the sequence is complete and to check atomicity and flow control status. The conditional flush instruction communicates the expected value of the hit counter to the CSB, and returns both the atomicity and flow control status to the application. If the counter value is equal to the value provided by the instruction, and the destination address matches the value present in the CSB, then the data is sent to the system interface as a single burst transaction, and the buffer and hit counter are cleared. When receiving the burst transaction, the I/O device returns the flow control status to the processor, which becomes the return value of the conditional flush instruction. If either the addresses or counter values do not match, the data register and counter are cleared, nothing is issued to the system interface, and the conditional flush instruction returns a negative result to the application.
Software is responsible for checking the return value of the conditional flush instruction. The application may recover from a failed flush, for instance, by branching back to the beginning of the sequence of combining stores. The following pseudocode segment shows an example of how software might access the CSB.
.RETRY:
stc %r4, [%r1] ! sequence of combining stores
stc %r10, [%r1+40]
! ... 5 additional combining stores
stc %r12, [%r1+8]
flushcond [%r1], %r4 ! conditional flush
cmp %r4, SUCC ! check return value
bneq .RETRY ! retry on failure
Suppose, for example, that this process is interrupted before it executes the conditional flush instruction. The interrupt will clear the hit counter as well as all data stored so far by the current process. If the new process issues a sequence of combining stores, the hit counter is incremented for every store and the conditional flush instruction succeeds. When the original process attempts to flush the buffer, the expected counter value will not match the value stored in the buffer and the conditional flush instruction will return a 0 to signal the conflict. If, on the other hand, no interrupt occurred while the sequence of stores was executed, the conditional flush issues the buffer contents and the device returns its flow control status.
The nonblocking conflict detection policy removes the need to lock the CSB prior to access, and competing processes do not block on a conflict [27]. The policy is optimistic in its assumption that conflicts are rare and that it is more cost-effective to replace heavyweight synchronization on every sequence with a software recovery mechanism on a failed attempt. Since lock-free synchronization schemes do not prevent competing processes from accessing a resource, they do not lead to problems like priority inversion or the difficulty of deadlock avoidance.
Theoretically, it is possible for two processes to be scheduled such that each continuously conflicts with the other. There are numerous simple solutions for this livelock scenario. One can limit the number of failed conditional flushes, or use an exponential backoff algorithm to reduce the likelihood of a conflict.
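A bounded retry with exponential backoff might look like the following sketch. Here try_flush() stands in for the combining-store sequence ending in a conditional flush; the function name, the demo stand-in, and the retry bound are assumptions for illustration.

```c
/* Retry a conditional-flush sequence at most max_tries times, backing
   off exponentially between attempts; try_flush() stands in for the
   combining stores plus conditional flush and returns nonzero on success. */
int issue_with_backoff(int (*try_flush)(void), int max_tries)
{
    for (int attempt = 0; attempt < max_tries; attempt++) {
        if (try_flush())
            return 0;                        /* flush succeeded */
        for (volatile int spin = 0; spin < (1 << attempt); spin++)
            ;                                /* exponential backoff delay */
    }
    return -1;                               /* give up: report persistent conflict */
}

/* demo stand-in for testing: fails twice, then succeeds */
static int demo_calls;
static int demo_flush(void) { return ++demo_calls >= 3; }
```

Bounding the attempts and growing the delay keeps two conflicting processes from retrying in lockstep indefinitely.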
Similarly, the flow control scheme is optimistic in its assumption that in most cases the device is able to accept the request. Rather than preventing the processor from issuing requests, it notifies software in the event that the request queue of the device was full. This flow control scheme requires that both the system bus and I/O bus support a swap bus transaction, which writes data to a target device and returns a status value. Conceptually, it is a write transaction immediately followed by a read transaction. In fact, if the arbitration protocol can guarantee such back-to-back transactions, it may be implemented in the system interface as a write followed by a read without any other transactions in between.
The size of the data register is equal to the size of a cache line, since the system interface and system bus are already optimized to handle cache-line-sized burst transfers. Since most system buses do not allow arbitrary-length bursts, the CSB model in this study always issues a full cache line, regardless of the number of combining store instructions. This restriction could be relaxed in a CSB design for a bus that permits multiple burst sizes. Note that the performance improvement that is made possible by the CSB depends on the ability of the target I/O device to accept burst writes. In general, this increases the complexity of an I/O device. On the other hand, the potential performance gain more than justifies the increased cost. Note also that many modern I/O adapters already provide this capability.
It should also be noted that it is not strictly necessary to include the destination address in the conflict check. However, doing so allows detection of conflicts between competing user-level threads. It also serves as an aid to detect programming errors in which individual request arguments may be written to different target addresses.
5.1.2 Instruction Set Architecture Modifications
The CSB design requires two architectural modifications. First, software must be able to specify which store instructions should be combined. The obvious choice is to introduce a new instruction, store-combine. However, adding new instructions to an existing architecture should not be taken lightly. The CSB design in the prototype implementation therefore uses existing memory mapping hardware to indicate which addresses should be combined. Several architectures already encode cache policies and other memory attributes in page table entries. The PowerPC allows the specification of write-through or write-back caching, along with other attributes, on a per-page basis [83]. In the R10000, the accelerated uncached buffer is enabled by a bit in the page table entry [68]. Hence, the encoding of one additional attribute is a minor extension to existing TLB designs.
The second required modification involves the addition of a conditional-flush capability. The purpose of this instruction is twofold. It must trigger the flushing of the CSB (if no conflict was detected), and signal the success of the flush to the program. Rather than introducing a new instruction, the prototype implementation uses the Sparc swap instruction for the conditional flush. If the destination address is in uncached combining address space, the instruction is sent to the CSB. The conditional flush instruction returns 0 if the flush failed due to an atomicity violation; otherwise the flow control status returned from the I/O device becomes the instruction's result.
The semantics of the conditional flush instruction are very similar to the store-conditional instruction used in many architectures for interprocess synchronization. A store-conditional writes a new value to a memory location if no other CPU accessed that location since a load-linked was executed, and returns the success or failure status of the operation in a register or condition code bit. Architectures that provide such a store-conditional instruction but no swap may use this instruction as the conditional flush.
5.1.3 Process Context Identifier
When setting up a transfer at the user-level I/O device, the application process not only specifies the parameters and buffer addresses necessary for the particular request, but also needs to identify itself to the device. The context identification is needed when the device translates and verifies buffer addresses provided with the request, and when a completion notification is sent to the application.
This context identifier can take many different forms such as the process structure pointer, a page table pointer or the process ID. The best choice depends on the particular kernel organization and device structure. A common requirement is that the identifier is unique among all currently running processes. If the I/O device performs address translations without kernel assistance, the context identifier must provide enough information to find the process page table. In many cases the physical address of the page table, or a physical pointer to the process structure which in turn contains pointers to the page or segment table, can be used. If address translation is done with kernel assistance, the context identifier should enable the kernel to quickly find the associated process. This can be accomplished by providing the virtual process structure pointer with every request. Finally, low-overhead completion notifications require that the kernel be able to quickly confirm whether the notification targets the currently running process, and to find the correct process structure if this is not the case. Again, the virtual pointer to the process structure is a good choice.
However, no matter what form of context identification is used, allowing applications to identify themselves makes the entire system vulnerable to faulty or malicious applications, unless a secure and trusted way is devised in which the process identification can be communicated to the device. The most efficient way to communicate the context identifier to the I/O device is to have the CSB insert the needed information automatically on a flush. This scheme requires that the identifier is available in a privileged processor register. Several architectures store some form of process ID in a supervisor-mode register to detect aliasing of cache or TLB entries. For instance, the MIPS architecture defines an 8-bit address space identifier that helps to avoid flushing the TLB on every context switch; PA-RISC uses an 18-bit identifier to de-alias references to virtually addressed caches [55]; and the Alpha 21164 stores a 7-bit process ID in a privileged register [2]. Although these identifiers are unique among all processes, they may not be ideal for device-based virtual-to-physical address translation or low-overhead process notification. In addition, hardwiring a CSB entry to a privileged register limits flexibility.
A more flexible alternative, although associated with higher overhead, is to trap to the kernel when the CSB flush instruction is executed. The kernel can then insert the required context identifier and emulate the flush operation on behalf of the application. The overhead of this scheme depends largely on implementation details. If the flush instruction triggers a general privileged-instruction trap, the code sequence is very similar to a system call, and consequently introduces similar overheads. The only advantage of the CSB over a system call in this case is the improved bandwidth compared with a sequence of single-word stores, and the intra-CPU atomicity that eliminates the need for costly global synchronization.
A more optimized implementation is possible if a dedicated trap vector is used for the CSB flush instruction. The trap handler needs to save only minimal state, insert the needed information and execute the flush instruction. Unlike a general-purpose system call, it does not need to check if a signal needs to be delivered to the process, or if a context switch is scheduled. Both software-based solutions offer greater flexibility. The trap handler can insert any form of context identifier in any position in the CSB. The disadvantage is the increased CPU overhead, which can approach that of a system call.
A third option for communicating the context identifier to the I/O device is to decouple the context from the request. Rather than specifying a context with every request, the operating system informs the device of the new identifier during a context switch. When the device receives a request, it inserts the current context identifier in the request structure. Updating the current process context at the I/O device reduces the transfer setup overhead, but it requires modification of the context switch handler, which inevitably leads to increased context switch costs. To avoid making the context switch handler code dependent on the hardware configuration, device drivers would register context switch callback routines with the kernel that are called by the context switch routine. In a system with a small number of processes and few context switches but frequent I/O operations, the reduced request overhead may offset the increased context switch overhead. Decoupling the context from the request is particularly attractive if the CSB is located in the device and not in the processor, as discussed in a later section. In this case, updating the current process context implicitly notifies the device of a context switch and clears the CSB hit counter.
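The callback registration scheme described above could be sketched as follows. The interface names, the fixed-size table, and the demo callback are assumptions for illustration, not an actual kernel interface.

```c
#include <stdint.h>

/* Device drivers register context-switch callbacks with the kernel so the
   switch handler stays independent of the hardware configuration.
   Interface names and the fixed-size table are illustrative assumptions. */
typedef void (*ctxsw_cb)(void *dev, uint64_t new_context);

#define MAX_CTXSW_CB 8
static struct { ctxsw_cb fn; void *dev; } cb_table[MAX_CTXSW_CB];
static int n_cb;

int register_ctxsw_callback(ctxsw_cb fn, void *dev)
{
    if (n_cb >= MAX_CTXSW_CB)
        return -1;                 /* table full */
    cb_table[n_cb].fn  = fn;
    cb_table[n_cb].dev = dev;
    n_cb++;
    return 0;
}

/* Called by the context switch routine: inform every registered device
   of the new context identifier. */
void on_context_switch(uint64_t new_context)
{
    for (int i = 0; i < n_cb; i++)
        cb_table[i].fn(cb_table[i].dev, new_context);
}

/* demo callback for testing: records the identifier it was given */
static uint64_t last_seen;
static void demo_cb(void *dev, uint64_t ctx) { (void)dev; last_seen = ctx; }
```

Each callback invocation is extra work on every switch, which is the cost-benefit trade-off discussed in the text.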
5.1.4 CSB Hardware Implementation
A sample hardware implementation serves to assess the hardware complexity of the conditional store buffer when it is added to a processor system interface. Figure 17 shows a block diagram of the conditional store buffer with generic interfaces to the processor core and bus interface. Central to the design are the 64-byte data buffer, the address register and the access counter. A state machine increments the counter for every combining store that matches the current CSB address. If the data value provided by the CSB flush instruction matches the current counter value, the state machine requests a transaction from the bus interface. When acknowledged, the state machine sends the address and the data buffer contents in nine consecutive cycles to the bus interface, assuming a multiplexed 64-bit wide system bus. A data return from the bus interface is forwarded to the processor core. In case of a failed CSB flush, the value zero is immediately returned to the processor. To minimize the number of bits needed to count consecutive CSB accesses, the counter implements a sticky overflow bit. The counter provides only the number of bits needed for a sequence of unique byte stores. The sticky overflow bit indicates a counter overflow; it is reset during an interrupt or when a store with a conflicting address or a conditional flush is received. The conditional flush instruction fails if the bit is set. All data paths from and to the processor core are 32 bits wide.

Figure 17: Conditional Store Buffer Implementation (the data buffer, address register, process context register and access counter sit between registered processor-interface inputs and the system bus interface; address and counter comparators feed the control logic that issues the bus address and data)
In high-performance designs, it is critical that all inputs and outputs of a block are directly connected to registers. This design guideline simplifies global timing significantly, leaving almost the entire clock cycle for signal propagation between blocks that may be located relatively far from each other on the die. The CSB sample design follows this guideline and includes input and output registers for all data and control signals.
The design is implemented in Verilog and synthesized for a 0.25 µm technology using a commercial standard cell library with a target frequency of 400 MHz. Although this methodology is rarely used for high-performance microprocessor designs, it gives a good indication of the worst-case cycle time and area requirements. The Synopsys synthesis tool [25] reports an area of approximately 0.22 mm², including estimated routing area. The critical path of the design exceeds the target clock cycle by 0.6 ns, resulting in an operating frequency of 350 MHz. The address comparison logic leading to the main control logic comprises the critical path. An optimized custom-designed 32-bit comparator would eliminate this problem, resulting in a design that meets the timing requirements for microprocessors using this technology. The D flip-flops used for the data buffer and pipeline registers have relatively high setup time requirements. Custom designs of the data buffer and pipeline registers would further improve not only the area required for the implementation but also the cycle time.
5.2 Conditional Store Buffer at the I/O Device
The basic CSB principle of nonblocking synchronization can be applied at the I/O device level as well. In this case, the I/O device provides a data buffer and hit counter for every CPU in the system. For every store to the store-combining address space, the device stores the data in the corresponding data buffer entry and increments the hit counter. The hit counter and data buffer are cleared on every context switch, so that the device is able to detect if the sequence of stores was atomic. Similar to the CSB, a swap bus transaction communicates the expected hit counter value to the device and returns the success or failure status to the application. Alternatively, if the number of stores in a sequence is fixed, a read bus transaction is sufficient to start the I/O transaction and return the flow control status.
In a multiprocessor system, the device must provide a CSB for each processor. This requires that the device can determine which CPU issued a bus transaction, so that the transaction is associated with the correct CSB. Most modern system buses carry some form of processor identification in each transaction, either in some unused address bits or as part of the transaction ID used by split-transaction bus protocols.
The main advantage of moving the CSB to the I/O device is that it does not require any modifications of the processor system interface or memory management logic, beyond support for a swap bus transaction. However, the context switch handler of the operating system needs to be modified to clear the device CSB on every context switch. Such modification is usually not desirable because the context switch code is highly optimized for minimum code length and latency. Increasing the context switch latency penalizes all processes, not just applications that utilize the user-level I/O features.
5.3 Performance Evaluation
This section presents overhead results for the different transfer setup schemes on a set of current and near-future systems. All results are obtained from a prototype user-level I/O implementation in the L-RSIM architectural simulator. The overhead results are the average of approximately 60 requests of different types. In all experiments, the request structure is copied from an application buffer into the CSB before the CSB is flushed. To evaluate the impact of architectural advances on the various transfer setup schemes, the experiments vary the processor and cache subsystem independently from the system bus and main memory subsystem. For each subsystem, two configurations are tested, representing current and near-future architectures. Each configuration is characterized by a combination of CPU core and memory clock frequency. Table 11 summarizes the most relevant configuration parameters of the resulting four systems. Taken together, these combinations are designed to cover a wide range of processors and main memory subsystems.
Figure 18 shows both the total overhead of the different transfer setup schemes discussed before, as well as the overhead reductions relative to the baseline system call implementation. Each group of bar graphs corresponds to a set of measurements on the same system, starting from the baseline system call scheme to the CSB with hardwired process context. The bar graphs showing the absolute overheads are split into minimum values (shaded bottom portion) and average values (white box stacked on top). The maximum overhead incurred by each scheme is essentially unbounded due to cache and TLB misses and interrupts, and hence is not shown here. The second set of graphs shows minimum overheads normalized to the system call scheme.
The system call scheme represents a baseline that uses an ioctl() call handled by the UIO device driver. The system call code necessary to switch from user to kernel space and back usually includes several tens to hundreds of instructions. The code sequence saves most of the current process state, switches to a kernel stack and checks for error conditions. When returning to user mode, the kernel first checks if a signal needs to be delivered to the user process, or if a context switch is scheduled. If neither is the case, it restores the user state, switches back to the original stack and resumes executing in user mode. After saving the current process state, the UIO system call copies the request structure from user into kernel space, transfers the request to the device via a sequence of uncached stores and initiates the transaction with an uncached load that returns the device flow control status to the device driver. The entire device access sequence is protected by a global spin lock. Although this scheme may not necessarily represent an actual implementation, it approximates the cost of entering kernel mode and synchronizing the device access. The system call executes the minimum number of uncached loads and stores required for a request setup. Note that the UIO system call does not lock the physical pages in main memory, nor does it translate the virtual buffer space into a list of physical addresses, since it assumes that the I/O device is able to perform these operations independently. In systems that do not support user-level I/O, this assumption is not realistic, but it is used here to fairly compare only the cost of atomicity and flow control for I/O requests.

Table 11: System Configurations

Parameter                 400 / 100    400 / 200    2G / 100     2G / 200
CPU frequency             400 MHz      400 MHz      2 GHz        2 GHz
Superscalarity            4-way        4-way        6-way        6-way
Instruction window        48 entries   48 entries   96 entries   96 entries
L1 cache size / assoc.    32K / 2      32K / 2      256K / 4     256K / 4
L1 cache latency          1 cycle      1 cycle      2 cycles     2 cycles
L2 cache size / assoc.    2M / 2       2M / 2       8M / 2       8M / 2
L2 cache latency          14 cycles    14 cycles    26 cycles    26 cycles
System bus frequency      100 MHz      200 MHz      100 MHz      200 MHz
System bus width          8 bytes      16 bytes     8 bytes      16 bytes
Main memory latency       550 ns       225 ns       550 ns       225 ns
I/O latency               420 ns       340 ns       420 ns       340 ns

Figure 18: Request Overhead (absolute overhead in µs and overhead normalized to the system call scheme, for the system call, CSB with flush trap, CSB with fast flush trap, CSB at device, and CSB with hardwired context schemes on the 400/100, 400/200, 2G/100 and 2G/200 configurations)
The other schemes shown in Figure 18 use the conditional store buffer either in the processor or the device with a variety of process context mechanisms. In the case of the hardwired context scheme, the context is provided by the privileged tlb-context register which is normally used for fast TLB miss handling in the CPU. This register points to the physical page table root of the current process. Upon a CSB flush, the CSB inserts the current value of this register in location 0. Using the physical page table root enables very efficient address translation without kernel involvement at the device. However, for user-level notifications, the kernel needs to scan the process list for a matching process before delivering the notification. Alternatively, the physical address of the process structure may be used as a context identifier. TLB miss handlers need to perform at least one additional memory reference to find the associated page table, but notification handling may be faster than when the page table pointer is used.
The trap-based context schemes use the same context identifier as the hardwired scheme. The general privileged-instruction trap is dispatched to a C routine that handles all privileged instruction exceptions. If it determines that the trapped instruction is a CSB flush, it writes the current process context identifier to the CSB, performs the flush instruction and modifies the process state on the stack such that the return value of the conditional flush is set as expected by the application.
The fast trap scheme uses a dedicated CSB flush trap, which is handled by a short assembler routine. This routine writes the process context to the CSB, copies the trapped instruction to a kernel location and jumps to it, thus executing the instruction on the application's behalf.
Perhaps surprisingly, the system call scheme incurs significantly higher overhead than the general-purpose trap mechanism used to insert the process context, even though both save and restore the same amount of process state. The reason is that the system call passes through the file system vnode layer before entering the device-specific routine, while the privileged trap is immediately dispatched to the appropriate handler routine. In addition, the system call locks and unlocks the device and issues the request as a sequence of individual uncached stores, whereas the CSB provides atomicity in hardware and issues only one bus transaction, thus reducing overhead.
In systems with a slow CPU, the fast trap scheme performs only slightly worse than the device CSB or processor CSB with hardwired context. In these configurations, the overhead is dominated by copying the request into the CSB and waiting for the flow control status from the device, and the additional instructions of the fast trap handler do not add significantly to the overhead. However, in systems with a fast CPU, the special-purpose trap handler performs worse than the general-purpose trap. In order to minimize the number of instructions executed, the fast trap handler emulates the CSB flush by copying the instruction from user into kernel space and then jumping to it. Other methods of emulating the instruction would require more complex decoding, since the flush may take arbitrary effective addresses provided as base plus offset and arbitrary source and destination registers as arguments. However, copying the instruction into kernel space requires flushing the instruction cache, which leads to drastic latency increases for systems with fast CPUs and relatively slow main memory.
The performance advantage of the processor-side CSB is evident in the fast-processor experiments. For slow CPUs the device CSB performs comparably to the processor CSB with hardwired context, but only the latter scheme is able to reduce overhead further on fast CPUs, because it executes the fewest instructions, and because the sequence of stores into the CSB is not slowed down by the system bus.
Implementing the CSB and process context identifier in the device results in overheads slightly lower than the fast trap scheme, and comparable to the hardwired context scheme for slow processors. This makes it an attractive alternative to the other schemes because it does not involve modifications of the processor bus interface, while showing significant overhead reductions compared to a system call. The device CSB scheme, however, requires modifications of the context switch handler to update the current process context in the device and clear the CSB hit counter. In the LAMIX kernel this is implemented similarly to the shared interrupt mechanism. At boot time, the device driver registers a routine with the kernel that is called for every context switch. This routine performs device-specific operations to update the process context at the device. Before switching to the new process, the context switch handler calls all routines that have been registered by device drivers for this purpose. This method is flexible enough to fit in a general-purpose operating system structure, where a wide variety of hardware configurations and I/O devices is supported by a configurable kernel. A more optimized implementation might place the additional code directly in the context switch handler, thus making the context switch handler more efficient but device-configuration dependent.
Figure 19 shows context switch latencies for a variety of system configurations. The top set of bar graphs shows absolute latencies in microseconds, while the bottom graphs show the same results normalized to the unmodified context switch handler for each system. The microbenchmark measures context switch latency as the time it takes to switch between two processes using the yield() system call. Note that since no other operations are performed between context switches, few cache and TLB misses are incurred and the reported latencies are close to the minimum.
Figure 19: Context Switch Latency
(Bar graphs of absolute latency in µs and latency normalized to the unmodified switch, for the Unmodified Switch and the Modified handler with 0, 1, 2 and 3 devices, on the 400/100, 400/200, 2G/100 and 2G/200 configurations.)
Realistic applications may observe higher context switch latencies due to instruction and data cache misses incurred by the context switch routine. Nevertheless, a context switch in the LAMIX operating system has higher latency than on most commercial systems because the L-RSIM processor model requires flushing of the entire TLB during a context switch, whereas many commercial microprocessors tag TLB entries with a process identifier, eliminating the need to flush the TLB on a switch.
The context switch latency generally increases if the handler is modified to call device-driver-specific routines on every context switch, even if no such routine is installed (second bar graph). This additional latency is due to the additional instructions executed to check if any driver-specific routines are to be called. Some system configurations show a slight decrease in latency, probably because of changes in the instruction scheduling or layout in the instruction cache that lead to better utilization of the processor pipelines. Context switch latency increases by 5 to 20% when at least one device driver has installed a context switch routine. In this case the context switch handler calls an additional subroutine, which reads device-driver-specific data structures and issues an uncached store to inform the device of the switch. Adding more device-specific routines to the context switch handler increases latency further by about 2 to 5% per routine. The incremental cost of these routines is less than the initial latency increase because this experiment uses the same routine for every additional device, thus reducing cache misses for repeated invocations. Considering that basic context switches in many systems are significantly faster than in the simulator, a latency increase of 10 to 20% is significant and unacceptable for most general-purpose multiuser systems.
5.4 Summary
Initiating an I/O transaction involves atomically transferring multiple arguments and parameters to the I/O device. Unlike operating system code, applications cannot rely on software-based global synchronization mechanisms. The conditional store buffer is a hardware mechanism that allows software to control which I/O stores are combined, and when the resulting bus transaction is issued. To avoid overflowing the device request queue, the I/O device returns a status word to the application indicating if the request was accepted or rejected. These characteristics allow applications to use the CSB to atomically transfer all arguments of an I/O transaction to the device without involving the operating system. The CSB may be located in the processor bus interface or in the I/O device. In addition to the transaction parameters, each request packet contains a unique process context identifier which is used by the device to validate and translate virtual buffer addresses and to notify the process when the transaction completes. To securely and reliably communicate this process context to the device, either the CSB hardware or the kernel inserts the appropriate context in the request when it is issued to the device.
Microbenchmarks show that the processor-based CSB with hardwired process context incurs the least overhead of all the schemes, because it minimizes both the number of instructions executed and the number of bus transactions. However, hardwiring the process context field of the CSB to a processor register restricts the flexibility of system software and I/O devices to choose which form of context identifier to use. Although more flexible, software-based solutions increase the latency, which can approach that of a simple system call for unoptimized implementations.
As a result of these experiments, a processor-side conditional store buffer with a dedicated trap for the CSB flush instruction to insert the process context appears to be the best solution. It strikes a balance between overhead and flexibility, and does not require any modifications to the context switch handler.
6. DIRECT USER-SPACE TRANSFER
Copying data between kernel and user space is in many cases the largest component of I/O overhead. This operation not only occupies the host CPU for many cycles, it also has significant effects on the cache and TLB. The copy operation loads data into the cache twice, replacing other cache lines previously used by the application, and wastes precious system bus bandwidth for these extra data transfers. In addition, the host CPU copy performance is usually lower than that of a dedicated DMA engine in I/O devices, since microprocessors are generally optimized for single-word accesses to the level-one cache rather than large data movement operations.
In the case of file I/O operations, the data copy is necessary because the OS manages files in a cache of disk blocks in kernel memory, from where the requested data are copied into user space. This design gives the kernel better control over the physical memory allocated for the buffer cache, and it enables the kernel to perform the necessary address protection check for the application buffer during the copy operation. Network protocol code maintains kernel buffers for retransmissions in the event of transient network errors.
Next-generation I/O devices have sufficient computation resources and autonomy to perform some of the operations normally done by the operating system, thus allowing the kernel to be bypassed for data transfers. If data transfers are performed by the device independently from the host processor (DMA), the source or destination buffer must be specified as a physical address, since DMA transfers occur at the system bus level. Traditionally,
the device driver translates a contiguous virtual buffer space into a list of discontiguous physical pages that are communicated to the DMA engine in the I/O device. Before initiating the transfer, the device driver ensures that the physical pages corresponding to the buffer are present and valid, and marks them as nonpageable for the duration of the transfer. This pinning or locking of pages can either be done before every DMA transfer, or when the buffer is initially allocated.
Existing solutions for high-performance communication networks often require the application to specify the communication buffers in advance [24][32][39]. During the buffer setup, the kernel pins the pages in physical memory, and the address mapping is made available to the I/O device. In addition, the kernel can arrange the physical pages contiguously. The application is then able to initiate DMA transfers using this prearranged buffer without kernel assistance. This scheme places the burden of managing the limited buffer space on the programmer, and often forces the application to copy data in and out of the buffer to locations where the data is used. Furthermore, pinning pages takes these pages away from the general pool of available memory, possibly affecting the performance of the entire system, increasing physical memory pressure and limiting the scalability of this approach.
Enabling the I/O device to transfer data directly to and from user buffers reduces CPU occupancy and minimizes the cache and TLB effects of I/O transactions. At a minimum, this requires performing access protection checks on the entire application buffer, translating the virtual buffer region into a list of physical addresses and locking the physical pages in memory to avoid page faults. Executing a system call to perform these operations incurs extra overhead and defeats the goal of the user-level I/O architecture. In addition, specifying the source or destination buffer as a variable-length list of physical addresses requires a suitable location to store these addresses and increases the complexity of the DMA engine. The user-level I/O architecture takes an optimistic low-overhead approach to direct user-space DMA transfers. The UIO device is augmented with a translation lookaside buffer (TLB) that caches virtual-to-physical address mappings similar to a processor TLB. This approach enables the UIO device to translate virtual to physical addresses, perform protection checks and detect page faults without utilizing the host processor. TLB misses can be handled either by the host operating system via interrupts, or independently by the device with the help of a programmable page table walk engine. This section first describes the I/O device TLB design and discusses its integration with the host operating system. It then contrasts the kernel-based and hardware TLB miss handling mechanisms and evaluates the performance of these alternatives.
6.1 Device TLB Design
The user-level I/O architecture allows the UIO device to perform virtual-to-physical address translations autonomously, thus minimizing host CPU occupancy for data transfers. To achieve high DMA bandwidth, address translations are cached in a device translation lookaside buffer (TLB), as shown in Figure 20. For every bus transaction, the DMA engine presents the virtual address and a context identifier to the TLB, which returns the corresponding physical address and flags any access violations. The context identifier is needed to distinguish address mappings for different processes with the same virtual address. On a TLB miss, the device can either perform the necessary page table lookup independently, or invoke the kernel for assistance. Both options are discussed in the following sections.
Unlike a microprocessor TLB, the device TLB does not need to perform an address translation every cycle, since it is only accessed once per bus transaction. Furthermore, due to the relatively long latency of most I/O transactions, single-cycle access to the TLB is not as important as in a processor. In addition, most DMA operations access memory sequentially, thus reducing the need for a highly associative TLB design to achieve acceptable hit rates. These relaxed requirements differ from those for a high-performance microprocessor and enable TLB designs that use larger, less expensive commodity SRAM structures with low associativity.
Each TLB entry consists of a process context identifier, an address tag derived from the virtual page number, the corresponding physical page number, a valid bit and a number of protection bits. Since the user-level I/O device may be used in a variety of systems, the TLB must be compatible with different virtual memory architectures. Access protection bits are encoded in a system-independent way, and the TLB miss handler is responsible for converting the system-specific encoding appropriately. A configuration register specifies the base page size of the current system, and the TLB uses this setting to split the virtual address appropriately into tag and offset components. To keep the TLB design as simple as possible, variable page sizes or superpages are not supported; the TLB miss handler converts large page mappings to the base page size.

Figure 20: Device TLB Design
(The DMA engine presents a virtual address and context identifier to the TLB; on a miss, a TLB miss handler performs a table walk or raises an interrupt and fills the TLB; the translated address is used for the DMA transaction to the I/O bus.)
In its simplest form, the TLB uses only the virtual address to index into the array of TLB entries, and the process context and lower bits of the virtual address are compared with the tag. A small degree of set associativity helps to reduce the likelihood of conflicts due to virtual address aliases. If the number of sets is the same as the number of concurrent DMA streams supported by the device, conflicts between streams can be completely avoided. Alternatively, a hash value that is a function of both the virtual address and the process context can be used to index into the TLB.
6.2 TLB Misses and Faults
Due to its finite size, the TLB can act only as a cache of recently used address mappings. If a requested address translation is not found in the TLB (TLB miss), it must be loaded from the original page table. Device TLB misses either invoke the operating system via an interrupt, or trigger the device to perform the page table lookup on its own. For maximum flexibility, TLB misses are satisfied using the kernel's page tables. This method avoids the overhead of maintaining a separate set of page tables for the I/O device, but it can increase the cost of a TLB miss slightly compared to a special-purpose page table. In addition, sharing page tables with the host processor gives the device access to all relevant page information such as protection bits, page size and valid flags. For portability and flexibility, most modern operating systems maintain two sets of page tables. A hardware-independent structure is used for most high-level memory management operations, while a hardware-specific low-level page table is used to satisfy processor TLB misses. The processor-specific page table is usually simpler and can be traversed efficiently, while the high-level page table contains more detailed information about page usage and sharing. If device TLB misses are handled by the host operating system via interrupts, the interrupt handler can obtain the correct translation from the virtual memory subsystem and provide it to the device. If, on the other hand, the device performs page table lookups independently, it should use the low-level page table, as it keeps the algorithm executed by the device hardware simpler.
To simplify its design, the TLB stalls until the miss is handled. If no valid mapping was found, the TLB miss handler installs an invalid mapping for the current address in the TLB. When the TLB lookup is restarted, a matching entry is found, but the valid bit is not set and an access violation is signaled to the operating system. Similarly, if an illegal access, such as a write to a read-only page, is attempted, the TLB invokes the operating system.
For fatal access violations, the kernel interrupt handler terminates the process as it would for any other memory access violation. In addition, the DMA stream that caused the violation needs to be terminated. The exact mechanism to achieve this depends on the I/O network architecture. If the network transmits data as a single stream or in large packets, the interrupt handler can instruct the DMA engine to drain the current stream without any further bus transactions. Long I/O transactions might be transmitted as multiple streams to reduce network contention, in which case each additional stream causes another access violation that is handled in the same way. In this case, the per-stream interrupt approach is feasible only if packets are sufficiently large so that the number of kernel interrupts is small. If data are transferred in many small packets, the cost of invoking the kernel to handle access violations for each can negatively impact unrelated applications. In the prototype implementation with a 400 MHz CPU, a UIO device page fault takes 4.8 µs to process, while the interrupt handler occupies the host processor for over 9 µs. Assuming each 48-byte ATM cell triggers an access violation, the host processor reaches 100 percent saturation due to these interrupts at 5.2 Mbyte/s network bandwidth. At 200 Mbyte/s network bandwidth and 1 Kbyte packets, processor utilization due to access violations exceeds 50 percent. These examples demonstrate the performance impact a faulty application can have on unrelated processes. To reduce this impact in a packet-oriented network, the request state stored at the device for the duration of the DMA transfer should be used by the DMA engine to drop all packets belonging to a stream that caused an access violation.
Outgoing DMA transfers are somewhat easier to handle, since the DMA engine can consider the transfer complete as soon as the access violation is detected. However, if part of the request has been transmitted on to the network, the receiver must be notified to disregard the transfer.
Page faults occur when a process initiates an I/O transfer to a buffer with at least one of the physical pages swapped out. Since the device uses the same kernel page tables as the host processor, it is able to detect this condition and invoke the operating system in the same way as for an access violation. However, in this case the stream of data flowing to or from the network must be stalled until the page has been brought into memory by the operating system. This design has the side effect that the DMA engine that caused the page fault is blocked for the duration of the page fault, which may affect other I/O requests. It is possible to have the operating system remove the request with its current DMA state from the device and restart the transfer after the page fault is handled, but this would add significant complexity to the OS and the device hardware. Stalling an outgoing request may also block network resources such as a virtual circuit for the duration of the page fault. The system impact of these effects depends on the particular network organization. In addition to blocking network resources, stalling an incoming request requires that the sender or the network has sufficient buffering for the in-transit data, and that the network provides an appropriate flow control mechanism. Fortunately, most of these requirements are not unique to the user-level I/O architecture, and many system-area networks are prepared to handle stalls at both the data source and sink [15].
The asynchronous occurrence of I/O device page faults also requires some modifications to the operating system. Normally, page faults are detected synchronously and the process that causes the fault is blocked while the kernel handles it using mechanisms similar to synchronous system calls. Interrupt handlers, on the other hand, must never trigger a page fault since they cannot be blocked. Modern operating systems deal with these restrictions by off-loading some of the interrupt handler functionality to a regular (although high-priority) kernel thread that is subject to normal synchronization and scheduling activity. A similar approach can be used to process I/O page faults. After determining that a page fault is the cause of the interrupt, the interrupt handler unblocks a kernel thread or a separate process to handle the condition. A queue structure in kernel memory is used to communicate the virtual address and process identifier to the thread or process, which reads the page into memory on behalf of the I/O device and restarts the I/O transaction.
6.3 TLB Coherence and Consistency
As with any form of cache or TLB, coherency must be maintained between the device TLB, the processor TLB and the page tables in memory. Once the operating system removes a mapping for a physical page, it is free to use the same physical page for a different virtual address. To avoid corrupting data in the new page, the old mapping must be removed from any TLB in the system (TLB shootdown), which in effect makes the change visible to all TLBs.
Several architectures specify a bus protocol for TLB shootdown operations that invalidates and synchronizes all participating TLBs. Between processors, this is implemented with a special bus transaction that collects responses from all TLBs before graduating, but such transactions are normally not forwarded to the I/O bus, hence the I/O device cannot participate in this protocol. The same effect, however, can be achieved with an uncached read operation following the uncached write that triggers the invalidation. The device returns data for the read only after it has processed the invalidation request. Unfortunately, uncached read transactions incur latencies at least on the order of a main memory access, and using a read to synchronize the I/O device TLB increases the cost of global TLB shootdown operations even further. To keep the virtual memory management routines in the kernel modular and device independent, the device driver registers a device-specific tlb-flush routine with the kernel during initialization. The routine is called by the virtual memory subsystem whenever a page is unmapped; it performs any uncached writes and reads needed to invalidate the page table entry.
To minimize the number of I/O bus transactions when invalidating a device TLB entry, the kernel writes only the virtual address that is unmapped to a device control register. The device TLB removes all entries that match this virtual address. If the TLB uses a hash function of the virtual address and process context as an index, then searching for an entry with only a virtual address can take many cycles and may produce multiple matches. Since the TLB cannot be used by the DMA engines during the invalidation process, this scheme can introduce significant stall times for other I/O transactions. In addition, it may unnecessarily remove entries of unrelated processes because it invalidates any matching virtual address regardless of the process identifier. Alternatively, both the process context and virtual address can be communicated to the device for a TLB shootdown operation, thus eliminating the time-consuming linear search in this type of TLB, as well as the invalidation of alias entries. The disadvantage of this approach is that two bus transactions may be required, unless the two values can be written in a single, larger transaction.
If, on the other hand, the device uses only the virtual address to index into the TLB array, invalidation operations can be performed in a single TLB access, similar to normal address translations. In such a design, it is sufficient to communicate only the virtual address during the shootdown, unless invalidations of aliases are considered a problem.
To reduce the overhead of global TLB shootdown operations in the presence of large numbers of processors, modern systems use a lazy approach where it is guaranteed that a physical page is not reused within a certain time. Processors periodically invalidate TLB entries such that within the time period all entries have been invalidated at least once. Note that these invalidations are applied to physical TLB array entries, e.g., entries zero through three, entries four through seven and so on. When the TLB entry is reloaded, modifications made in the previous time period become visible to the processor. Since the process of invalidating TLB entries is local to each processor and does not require global communication, it scales to arbitrary numbers of TLBs, at the cost of some additional TLB misses due to unnecessary invalidations as well as increased physical memory pressure due to delayed reuse of physical pages. This delayed TLB shootdown protocol can easily be extended to include the I/O device TLB, with little or no additional overhead. The kernel clock interrupt handler can periodically invalidate I/O device TLB entries in addition to the CPU TLB entries. Since the device can guarantee that an invalidation completes within a few cycles after it has received the uncached write triggering the process, no explicit acknowledgment in the form of an uncached read is needed. To keep the clock interrupt handler system-configuration independent, device drivers can register an invalidation routine that is called at every clock interrupt. With little additional hardware the device can also invalidate parts or all of its TLB at programmable intervals, thus further reducing the cost of page table unmap operations. The latter scheme is preferable to the other two TLB invalidation methods, as it minimizes the cost of both the page unmap operations and the clock interrupt handler.
6.4 TLB Miss Handling with Kernel Interrupts
Since the device TLB is only a cache of recently used page table entries, a fraction of DMA transactions do not find a translation in the TLB at the time of the access. Like the other address-translation-related exceptions described earlier, TLB misses can be handled by the kernel through an interrupt. Such a design simplifies the device hardware and leaves the operating system designer free to use any page table organization without regard for the I/O device organization. This section discusses the design and performance of a kernel-based TLB miss handling scheme, while the next section discusses details of a hardware-based TLB miss handling mechanism.
In the prototype implementation, a general-purpose interrupt routine handles all device exceptions, including TLB faults, TLB misses and network errors. The particular error condition is recorded in a bit vector that is read by the interrupt routine. Exceptions may accumulate in the device, but only the first exceptional condition triggers the interrupt. After reading the interrupt cause register, the interrupt routine handles all accumulated exceptions, giving the highest priority to TLB misses. The kernel handles TLB misses by reading the virtual address and process identifier for the requested translation from I/O device registers, performing the page table lookup on the device's behalf, clearing the interrupt bit and writing the result of the translation to a device control register. When all pending interrupt conditions are handled, the routine checks the cause register again in case further exceptions have accumulated that were not reported when the register was read the first time.
Table 12 presents device TLB miss handling latencies and overheads for the four different systems presented in Table 11. The reported values are based on 64 samples from a single program issuing a variety of I/O requests. From the I/O device perspective, TLB miss latency is the most critical influence on performance. Even though a TLB miss is the first condition to be checked and handled in the kernel interrupt routine, the TLB miss handling latency observed by the device can be significant. The instruction sequence executed upon entry into the interrupt handler is essentially the same as for a system call. It involves switching to the kernel stack and saving the current context on it. Like a system call, many of the instructions operate on global processor state registers and throttle the dynamic scheduling core of the processor. In addition, a large variation of the device TLB miss handling latency can be observed if the kernel is executing critical sections during which external interrupts are disabled.
Table 12: Kernel TLB Miss Handling Performance

System               Minimum Latency   Average Latency   Average Overhead
400 MHz / 100 MHz    2.72 µs           3.89 µs           4.48 µs
400 MHz / 200 MHz    2.10 µs           2.50 µs           3.01 µs
2 GHz / 100 MHz      2.21 µs           3.07 µs           3.38 µs
2 GHz / 200 MHz      1.42 µs           1.85 µs           2.06 µs

6.5 TLB Miss Handling with a Programmable TLB Fill Engine
An alternative to handling TLB misses in the kernel is to enable the device to perform the page table lookup independently. This requires that the device has sufficient information to find and traverse the process page table. Virtual memory architectures and page table organizations vary widely between systems. A programmable table walk engine offers the flexibility needed for the device to operate in a wide variety of systems. This section introduces the design of such a table walk engine, evaluates the hardware complexity of a prototype implementation and presents the implementation and performance of various page table walk algorithms.
The proposed TLB fill engine is an accumulator-based one-address microprocessor with a very simple instruction set. Primary objectives of the architecture are a minimal size of the hardware implementation while maintaining reasonable execution efficiency and sufficient flexibility of the instruction set to implement a wide variety of address translation algorithms. These goals are accomplished with a variety of domain-specific optimizations. The instruction set supports only basic instructions that are needed for page table lookup operations, such as addition/subtraction, logical and shift operations and memory loads. Memory read accesses can be performed in a variety of sizes from one to 16 words. This is particularly useful since the data structures in many page table organizations are relatively big. The accumulator-based architecture simplifies the register file design, requiring only one read and one write port. Instructions are 16 bits, thus reducing the size of the required instruction memory. The relatively large register file of 32 registers supports page table lookup algorithms that require many temporary variables, or that manipulate large data structures.
The TLB fill engine works in conjunction with the on-chip TLB of the I/O device. Upon a miss, the TLB presents the virtual address and the process identification to the engine, which performs a table lookup and either returns a valid physical page number or a failure status. The UIO device TLB does not support variable page sizes. During startup it is configured to use the system's base page size. The TLB fill engine must be able to deal with variable-size pages and, if necessary, convert address mappings to the base page size.
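This conversion can be sketched in software (a simplified model, not the engine's actual microcode; it assumes large pages are aligned multiples of the base page size, so the base-page frame number is the large-page frame number with the low virtual page number bits merged in):

```python
# Sketch of converting a variable-size page mapping to a base-page
# mapping, as the fill engine must do for a TLB that supports only
# the base page size. A large page covering 2**extra_bits base pages
# is assumed to be aligned, so the base-page PPN is the large-page
# PPN with the low bits of the virtual page number merged in.

def to_base_mapping(vpn: int, large_ppn: int, extra_bits: int) -> int:
    """Return the base-page PPN covering virtual page `vpn`."""
    low = vpn & ((1 << extra_bits) - 1)   # position within the large page
    return large_ppn | low

# A page of 16 base pages (4 extra bits) with PPN 0x1230 maps virtual
# page 0x567 to physical page 0x1237.
assert to_base_mapping(0x567, 0x1230, 4) == 0x1237
```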
The TLB fill engine communicates with the TLB through three registers that are mapped into the engine's general-purpose register set as registers 0 through 2. Registers 0 and 1 are initialized by the TLB upon a miss and contain the virtual address and process context. Register 2 is written by the TLB fill engine at the end of the table lookup and contains the physical page number.
The host CPU has access to the instruction memory and register file. During system startup, the I/O device driver writes a system-specific instruction sequence into the instruction memory. Having access to the register file as well allows the driver to initialize the registers with constant values that are used by the page table walk algorithm.
6.5.1 Instruction Set
The TLB fill engine instruction set architecture defines a small number of 16-bit-wide instructions that include basic arithmetic and logic instructions, memory load operations for a variety of data sizes, and control transfer instructions. Table 13 summarizes the supported instruction classes. Instructions contain either an 8-bit immediate value or a 5-bit register identifier. Most instructions use a register value or the sign-extended immediate value and the accumulator as operands and update the accumulator with the result of the operation. Control transfer instructions use an absolute address specified in a register or as an immediate value as target. Note that branch targets are specified as instruction memory addresses, not byte addresses. Thus, an 8-bit immediate value is sufficient to span a 256-entry instruction memory. All instructions operate on 32-bit data items. Instructions can specify whether they update the accumulator, the condition code registers, or both, thus enabling a variety of comparison and test instructions.

Table 13: Table Walk Engine Instruction Set Summary

Class         Instructions
arithmetic    add, subtract
logic         and, or, xor, invert
shift         shift left, shift right
comparison    compare, bit test
control flow  branch on condition code
move          move to accumulator, move from accumulator, swap
memory read   load 1, 2, 4, 8 or 16 words
halt          halt with success, halt with failure
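The accumulator discipline described above can be illustrated with a small software model (a sketch for illustration only; the opcode names follow the listings later in this chapter, while this Python rendering of their semantics is an assumption):

```python
# Minimal software model of the accumulator-based TLB fill engine ISA.
# Opcode names follow the code listings in this chapter (MOV, MOVA,
# ADD, SHIFTLi, ...); the modeled semantics are a sketch, not the
# engine's specification.

class FillEngine:
    def __init__(self):
        self.acc = 0            # single accumulator
        self.regs = [0] * 32    # 32-entry register file; r0-r2 shadow the TLB
        self.cc_zero = True     # condition code set by compare instructions

    def _u32(self, v):
        return v & 0xFFFFFFFF

    # Each instruction reads a register or immediate plus the
    # accumulator, and writes the result back to the accumulator.
    def MOV(self, r):     self.acc = self.regs[r]
    def MOVA(self, r):    self.regs[r] = self.acc
    def MOVi(self, imm):  self.acc = imm
    def ADD(self, r):     self.acc = self._u32(self.acc + self.regs[r])
    def AND(self, r):     self.acc &= self.regs[r]
    def OR(self, r):      self.acc |= self.regs[r]
    def XOR(self, r):     self.acc ^= self.regs[r]
    def SHIFTLi(self, n): self.acc = self._u32(self.acc << n)
    def SHIFTRi(self, n): self.acc >>= n
    def SWAP(self, r):    self.acc, self.regs[r] = self.regs[r], self.acc
    def CMPi(self, imm):  self.cc_zero = (self.acc == imm)

# Example: extract the 16-bit virtual page number (bits 4..19) from a
# 32-bit PowerPC effective address, as the lookup code does with
# SHIFTLi 4 / SHIFTRi 16 / MOVA r8.
e = FillEngine()
e.regs[0] = 0x1234_5678   # r0: virtual address provided by the TLB
e.MOV(0)
e.SHIFTLi(4)
e.SHIFTRi(16)
e.MOVA(8)                 # store VPN (0x2345) in r8
```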
6.5.2 Hardware Implementation
Since many page table lookup algorithms exhibit a critical code path length of tens of instructions, it is important that implementations of the TLB fill engine be at least partially pipelined. The accumulator-based instruction set architecture and the relatively low complexity of the arithmetic instructions lend themselves easily to a simple three- or four-stage pipeline. Data need to be forwarded between the accumulator and the following instruction if a MOVA or SWAP instruction is followed by an instruction that reads the destination register.
Memory loads are clearly the dominant factor of most table walk algorithms. In many cases, performance improves if the TLB fill engine does not stall automatically on load instructions. This allows the instruction sequence to continue operating while the load occurs, for instance by performing additional checks on some operands or by preparing bit masks or other values that are needed when the load returns. Using this scheme, the TLB fill engine stalls only when it encounters an instruction that uses one of the destination registers of the pending load as its source or destination, or when it encounters a second load. To simplify the register conflict detection logic, data must be loaded into registers that are aligned at the request size.
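With aligned load destinations, the conflict check collapses to a single mask-and-compare, which is the point of the alignment restriction (a sketch of the logic, not the actual hardware):

```python
# Why aligned load destinations simplify conflict detection: a load of
# `size` words (a power of two) targets a register number aligned to
# `size`, so a later instruction referencing register r conflicts with
# the pending load iff masking off r's low bits yields the load's base
# register -- one AND and one compare in hardware.

def conflicts(pending_base: int, size: int, r: int) -> bool:
    assert size in (1, 2, 4, 8, 16) and pending_base % size == 0
    return (r & ~(size - 1)) == pending_base

# e.g. LDH 16 into r16..r31: any reference to those registers stalls.
```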
The sample implementation shown in Figure 21 features a three-stage pipeline. Data are forwarded from MOVA/SWAP instructions to following consumers. Loads stall only when an instruction using the load value is encountered. Not-taken branches (fall-through) incur no delay; taken branches result in a one-cycle pipeline bubble.
The design was implemented in the Verilog hardware description language and synthesized with the Synopsys Design Compiler [25] for a commercial 0.25 µm five-metal-layer process. The target frequency is 66 MHz, corresponding to the frequency of currently used high-end PCI buses.
The instruction memory is a single-ported synchronous SRAM generated by a VLSI foundry-specific tool. Similarly, the register file is a dual-ported synchronous SRAM. A special-purpose register file with dedicated read and write ports would have had a slightly smaller footprint, but this was not used because the register file generator did not function properly. Since the access times of the instruction SRAM and the register file are well below 3 ns, it is possible to perform both the instruction fetch and operand read operation in a single cycle, with the register file running off the negative clock edge. Ideally, an asynchronous register file would be used in its place. The arithmetic-logic unit and its surrounding multiplexers and pipeline registers were generated using Synopsys' module compiler.
In the first pipeline stage, the new PC (either provided by the PC increment logic or a branch target) is latched into the synchronous instruction memory and the PC register. The instruction memory provides an instruction which is used to access the register file at the falling clock edge. At the same time, the instruction's immediate bit controls the multiplexer that chooses between the sign-extended immediate value and a register value. In parallel with the register file, the set of externally visible registers is accessed. The output of these registers is chosen by the multiplexer when the instruction accesses a register in the range 0 through 3. Forwarding data has the highest priority, unless the instruction uses an immediate value. Data are forwarded if the oldest instruction writes a value to the register file and the following instruction reads from the same location.

Figure 21: TLB Refill Engine Architecture
[Block diagram: the PC logic and instruction memory feed the register file and the externally visible registers r0-r3; a multiplexer selects among a register operand, the sign-extended immediate, and the forwarded result; the ALU output drives the accumulator, condition codes, memory address, and branch target paths.]
The following pipeline stage latches the current instruction and the operand and performs the desired operation. The instruction controls whether the result will be latched in the accumulator and the condition code register. In addition, if the instruction is a memory access, the memory request signal is asserted. This is possible at this stage only because memory access instructions do not perform any arithmetic operation and the content of the accumulator remains unchanged from the previous instruction. The second pipeline stage also drives the register file write signal if the current instruction is a SWAP or MOVA. If memory happens to return data at this time, the register file write port is occupied and the entire pipeline is stalled. Finally, branch processing is performed in stage 2. The current operand is forwarded to the PC logic and the multiplexer chooses the new PC if the condition codes match the branch condition. The third pipeline stage writes the result into the accumulator and condition code register, if enabled by the current instruction.
The implementation as presented here is synthesized for a commercial 0.25 µm process, with an 11 ns clock cycle target to allow for routing delay not accounted for during synthesis. Table 14 summarizes the area requirement reported by the Synopsys synthesis tool. It is equivalent to the area of a 2 Kbyte SRAM created by an automatic memory generator for the same technology. The ALU as generated by the Synopsys Module Compiler [69] contains the critical path. A custom-designed ALU instead of the current standard-cell-based design would reduce both the delay and area of this component. Certain control paths to the register file, which is running off the inverted clock, are also critical. However, ideally one would choose an asynchronous register file, in which case the timing requirements could be relaxed.
6.5.3 Table Walk Algorithms
To investigate the feasibility of performing page table lookups with a programmable controller, this section describes three different page table organizations found in commercial systems, and presents implementations of the corresponding lookup algorithms.
Table 14: Table Walk Engine Area Requirements

Module                               Area in µm²
pc logic                             3318
instruction memory                   86093
register file                        156488
external registers (r0-r3)           39341
alu & pipeline registers             65457
control                              18079
total (including estimated routing)  388012

6.5.3.1 32-Bit PowerPC. The 32-bit PowerPC processor [83] presents a 4 Gbyte virtual address space which is split into 16 segments of 256 Mbyte each. The processor contains 16 segment descriptor registers, each consisting of a 24-bit segment ID and access protection information. The four most significant bits of the address select one of the 16 segment descriptors. The virtual page number, combined with the segment ID and a mask, forms an index into the hashed page table. Each page table bucket contains eight 64-bit page table entries (PTEs). If none of the eight primary PTEs matches the segment ID, a secondary hash function is applied to the virtual page number and the segment ID to access eight secondary PTEs. Figure 22 summarizes the page table lookup algorithm of the PowerPC architecture.
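The hashed index computation can be sketched as follows (a simplified model based on the SDR1 fields used by the lookup code in Figure 23: a 19-bit hash of the segment ID's low bits and the 16-bit virtual page number, masked by SDR1's 9-bit hash mask and combined with the table's 16-bit base; the secondary hash is taken to be the one's complement of the primary hash, and 64-byte buckets of eight PTEs account for the 6-bit shift):

```python
# Sketch of the PowerPC hashed page table bucket address computation
# performed by the lookup code in Figure 23. Simplified model; the
# SDR1 field positions follow the 32-bit PowerPC layout used there
# (upper 16 bits: table base, low 9 bits: hash mask).

MASK19 = (1 << 19) - 1

def pteg_address(sdr1: int, segment_id: int, vpn: int,
                 secondary: bool = False) -> int:
    base = sdr1 & 0xFFFF0000                # upper 16 bits: table base
    mask = ((sdr1 & 0x1FF) << 10) | 0x3FF   # 9-bit mask over a 19-bit hash
    h = (segment_id ^ vpn) & MASK19         # primary hash
    if secondary:
        h = (~h) & MASK19                   # secondary hash: one's complement
    return base | ((h & mask) << 6)         # 64-byte buckets of 8 PTEs
```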
Although the basic page table lookup algorithm is implemented in hardware, different operating systems may use the available features differently. For this discussion, it is assumed that both the segment descriptor register SDR1 and the 16 segment registers are changed upon a context switch. Consequently, both are part of the process context and can be accessed through the process structure in kernel space. The table lookup algorithm takes a virtual address and the physical pointer to the process structure as inputs and produces the corresponding physical page number upon success. The algorithm presented here does not support large pages that use the PowerPC block address translation feature.
Figure 23 presents an abbreviated version of the page table lookup algorithm implementation for the programmable TLB fill engine. Omitted is the code that checks the last six primary PTEs for a match, as well as the code segment that computes the secondary hash index, loads the secondary PTEs and searches for a match. The indented instructions are executed while a load is outstanding; they do not contribute to the total latency of the algorithm.
The latency of a page table lookup using this code depends on whether the desired PTE is located in the primary or secondary bucket, and where in that bucket it is found. The algorithm requires a minimum of three memory accesses, and a fourth access if the secondary bucket is needed. The latencies reported in Table 15 assume an average memory access latency of 30 PCI cycles at 66 MHz, composed of a 10-cycle memory controller latency (150 ns) and 10 cycles each from the PCI bus through the PCI bridge and back.
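These latency figures follow directly from the cycle counts; a quick cross-check of Table 15 under the stated 66 MHz clock and 30-cycle memory access:

```python
# Cross-check of the lookup latencies in Table 15, using the 66 MHz
# engine clock and the 30-cycle average memory access latency stated
# above (10 cycles memory controller + 10 cycles each way across the
# PCI bridge).

CYCLE_NS = 1e3 / 66   # one 66 MHz PCI cycle, ~15.15 ns

def lookup_latency_us(code_cycles: int, mem_accesses: int,
                      mem_cycles: int = 30) -> float:
    total_cycles = code_cycles + mem_accesses * mem_cycles
    return total_cycles * CYCLE_NS / 1e3

# First primary PTE match: 35 code cycles + 3 memory accesses.
print(round(lookup_latency_us(35, 3), 3))   # -> 1.894
# First secondary PTE match: 73 code cycles + 4 memory accesses.
print(round(lookup_latency_us(73, 4), 3))   # -> 2.924
```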
Figure 22: 32-bit PowerPC Page Table Lookup Overview
[The four most significant bits of the virtual address select one of the segment registers SR0-SR15, yielding a 24-bit segment ID. The low 19 bits of the segment ID are XORed with the 16-bit virtual page number to form the primary hash (the secondary hash is its negation); the hash is ANDed with the 9-bit mask and ORed with the 16-bit table base from SDR1 to produce the 32-bit PTE address. Each 64-bit PTE holds a valid bit, 24-bit segment ID, 6-bit API, protection bits, and a 20-bit physical page number, which is concatenated with the 12-bit offset to form the physical address.]
Figure 23: 32-bit PowerPC Page Table Lookup Code
MOV r1 // get context r1
ADD r15 // add SDR1 offset r15
LDD r6 // load SDR1 into r6
MOV r0 // get VA
SHIFTLi 4
SHIFTRi 16 // extract virtual page #
MOVA r8 // store VPN in r8
MOV r0 // get VA
SHIFTRi 28 // extract segment #
SHIFTLi 2 // 32 bit chunks
ADD r14 // add segment reg offset r14
ADD r1 // add context
LDD r4 // load segment reg into r4
MOV r6 // get SDR1 (r6)
SHIFTLi 23
SHIFTRi 13 // extract 9 bit mask
OR r13 // make lower 10 bits 1
MOVA r5 // store mask in r5
MOV r6 // get SDR1
SHIFTRi 16
SHIFTLi 16 // extract 16 bit base
MOVA r6 // store base in r6
MOV r4 // get segment ID
AND r11 // mask out segment #
XOR r8 // XOR segment with VPN
MOVA r7 // store hash 1 in r7
AND r5 // AND with mask
SHIFTLi 6 // shift left 6 bits
OR r6 // OR with base address
LDH 16 // load first 4 PTEs
MOVi 1
SHIFTLi 31 // create valid bit
MOVA r3 // store bit in r3
MOV r4 // read segment ID
AND r11 // mask out segment #
SHIFTLi 7 // shift left
OR r3 // set valid bit
MOVA r3 // store back in r3
MOV r8 // get VPN
SHIFTRi 10 // extract API
OR r3 // concat. with segment ID
CMP r16 // check PTE 0
BNEi next0 // skip if no match
MOV r17 // read PTE0
BAi succ // done
next0:
CMP r18 // check PTE 1
BNEi next1 // skip if no match
MOV r19 // read PTE1
BAi succ // done
next1: // continue up to PTE 7
... // if no match, load
... // secondary PTEs
... // check secondary PTEs
HALTFAIL // no match found
succ:
MOVA r5 // store PTE in r5
AND r12 // extract physical page #
SWAP r4 // store PPN, get segment ID
SHIFTRi 27 // extract Kp
OR r5 // combine with PTE
ANDi 0x07 // extract protection bits
CMPi 0x04 // privileged ?
BEi priv // jump ahead
CMPi 0x05 // read only ?
BEi sro // yes -> jump ahead
ANDi 0x03
CMPi 0x03 // read only ?
BEi sro // yes -> jump ahead
MOV r4 // read PPN
MOVA r2 // write TLB entry
HALTSUCC
sro:
MOV r4 // get PPN
ORi 0x01 // set Read-Only bit
MOVA r2 // write TLB entry
HALTSUCC
priv:
MOVi 0x02 // set privileged bit
OR r4 // merge with PPN
MOVA r2 // write TLB entry
HALTSUCC
6.5.3.2 MIPS-4 (32-Bit). Many 32-bit MIPS-based systems use a two-level address translation scheme. When a user process incurs a TLB miss, the virtual page number is used to index into a flat page table that starts at the virtual address specified in the XContext register. When the page table entry is loaded, its virtual address is in turn translated again. In case the second translation also misses in the TLB, a secondary TLB miss handler is invoked which uses a conventional two-tier page table organization to locate the required physical address. The physical address is then used to load the page table entry for the user process. Each page table entry is 64 bits in size and contains the physical page number, a valid bit, access protection information and a four-bit page size indicator. Figure 24 summarizes the page table lookup algorithm used in 32-bit MIPS processors.
In many MIPS systems, the primary user page table is located at a fixed virtual address which does not change during context switches. It is assumed that the base pointer of the root page table is part of the process structure and is provided as process context to the device. The algorithm presented in Figure 25 first calculates the virtual address of the user page table entry, which is then translated using the context provided by the TLB. Both the root page table entry and the user page table entry may map a page that is larger than the 4 Kbyte base page size. In this case, the page size specifier in the PTE indicates how many additional bits of the page offset are used, and by how many bits the physical page number shrinks. The algorithm converts all mappings to the 4 Kbyte base page size.

Table 15: PowerPC Page Table Lookup Latencies

Case                        Instructions  Code latency in cycles  Latency including memory access
first primary PTE match     31            35                      1.894 µs
second primary PTE match    35            39                      1.955 µs
eighth primary PTE match    59            63                      2.319 µs
first secondary PTE match   69            73                      2.924 µs
eighth secondary PTE match  97            101                     3.348 µs
no PTE match (fault)        98            101                     3.348 µs

Figure 24: 32-bit MIPS Page Table Lookup Overview
[The 17-bit virtual page number indexes a flat user page table at the base given by the XContext register; if that access misses in turn, the Context register and a two-level (L0/L1) page table locate the user PTE, with NULL pointer and valid bit checks at each level. Each 64-bit PTE contains a 26-bit physical page number, a valid bit, and a page size field that controls how the virtual page number is masked and merged for pages larger than the 4 Kbyte base size.]
The algorithm requires three memory accesses. With 35 instructions in the critical path, the total instruction latency is 38 cycles. Again assuming a 30-cycle memory latency, a page table lookup requires 128 cycles or 1.939 µs.
Figure 25: 32-bit MIPS Page Table Lookup Code
MOV r0 // get VA
SHIFTRi 11 // extract virtual page #
ANDi 0xF8 // align at dword
ADD r15 // add to XContext
SHIFTRi 22 // extract L0 offset
ANDi 0xF8 // align at dword
ADD r1 // add to UPT base
LDD r6 // load L0 ptr into r6/r7
MOV r0 // get VA
SHIFTRi 11 // extract VPN
ANDi 0xF8 // align at dword
ADD r15 // add XContext
MOVA r5 // store UPA in r5
SHIFTLi 7 // remove L0 offset
SHIFTRi 21 // extract L1 offset
SHIFTLi 3 // align at dword
SWAP r7 // save offset, get L1 base
CMPi 0 // check if ptr is NULL
BEi fail // fail if NULL
ADD r7 // add offset to L1 base
LDD r6 // load L1 ptr
MOV r0 // get VA
AND r11 // extract VPN (don’t shift)
MOVA r4 // store VPN in r4
MOV r12 // get large page mask
MOVA r8 // move to r8
MOVA r9 // move to r9
MOV r6 // get UPA PTE[0]
SHIFTRi 25 // get page size bits
ANDi 0x0F // mask out rest
SWAP r8 // swap with large page mask
SHIFTL r8 // shift large page mask
OR r12 // set lower bits
AND r5 // use mask on UPA
SWAP r7 // swap UPA offset with PTE[1]
CMPi 0 // check if ptr is NULL
BEi fail // fail if NULL
SHIFTLi 8 // shift out top bits
ADD r7 // add new VA offset
LDD r6 // load PTE into r6/r7
MOV r7 // read PTE[1]
BTSTi 0x02 // test valid bit
BEi fail // fail if not set
AND r10 // extract PPN
SHIFTLi 8 // shift out top bits
SWAP r6 // store in r6, get PTE[0]
SHIFTRi 25 // shift page size bits
ANDi 0x0F // mask out rest
SWAP r9 // swap with large page mask
SHIFTL r9 // shift left by N bits
AND r4 // extract lower bits of VPN
OR r6 // merge with VPN
MOVA r2 // write PPN
HALTSUCC // done
fail:
HALTFAIL
6.5.3.3 Intel IA-32. Similar to the PowerPC, the IA-32 architecture uses a combination of segmentation and page-based memory management, as presented in Figure 26. The processor provides six segment selector registers, which index into a global or local table of segment descriptors. Three of the six segments are defined by the architecture as text, data and stack segments, but instructions can override the default segment for memory accesses. The segment descriptor selected by the instruction contains a base address, size and protection bits. The base address is added to the virtual address to form a linear address. This linear address is optionally mapped through a conventional two-tier page table.
Figure 26: IA-32 Page Table Lookup Overview
[A 16-bit segment selector, chosen by the instruction, indexes the segment descriptor table at the base given by GDTR/LDTR; the segment base is added to the 32-bit virtual address to form the linear address. The linear address's 10-bit directory and 10-bit page indices walk a two-level page table rooted at the 20-bit PDBR base, and the resulting 20-bit physical page number is concatenated with the 12-bit offset to form the 32-bit physical address.]
Each page table entry contains the base address of the next-level page table, or the physical page number, as well as access protection bits. Alternatively, first-level page table entries may point to a superpage instead of the next page table level.
The IA-32 architecture allows many different combinations of segmentation and paging with different linear address sizes and page sizes. The algorithm presented in Figure 27 assumes the commonly used flat address space model, which bypasses segmentation completely. The six on-chip segment selector registers point to a segment that starts at address 0 and spans 4 Gbyte. The first-level page table base address is process specific and contains pointers to second-level page tables or 4 Mbyte superpages. The base page size is 4 Kbyte. The processor still accesses the segment table for every memory access, but the device page
Figure 27: IA-32 Page Table Lookup Code
MOV r0 // get VA
AND r12 // extract L0 offset
SHIFTRi 20 // shift right
ADD r1 // add to context
LDD r4 // load L0 ptr into r4/r5
MOV r0 // get VA
AND r13 // extract L1 offset
SHIFTRi 10 // shift right
MOVA r8 // store in r8
MOV r0 // get VA
AND r15 // get part of VPN for lrg.pg.
MOVA r9 // store in r9
MOV r4 // get L0 ptr
BTSTi 0x01 // check valid bit
BEi fail // skip to fail if not set
BTST r11 // check large page bit
AND r14 // extract L1 base/PPN
BNEi sucb // skip to success big if set
ADD r8 // add L1 offset
LDD r6 // load L1 ptr into r6/r7
MOV r6 // get L1 pointer
BTSTi 0x01 // check valid bit
BEi fail // skip to fail if not set
AND r14 // extract PPN
SWAP r6 // store in r6, get L1 ptr
AND r4 // combine with L0 pointer
succ:
INV r0 // negate bits
ANDi 0x06 // extract protection bits
SHIFTRi 0x01 // move to correct position
OR r6 // combine with PPN
MOVA r2 // write mapping
HALTSUCC // success
sucb:
SWAP r4 // save PPN in r4, get L0 ptr
INV r0 // negate bits
ANDi 0x06 // extract protection bits
SHIFTRi 0x01 // move to correct position
OR r4 // combine with PPN
OR r9 // merge with VPN offset
MOVA r2 // write mapping
HALTSUCC // success
fail:
HALTFAIL // failure
table lookup algorithm can ignore segmentation and use the virtual address as linear address. The algorithm takes the virtual address and page table base address as arguments and returns the physical page number. Mappings for 4 Mbyte superpages are converted to 4 Kbyte mappings. Each page table lookup requires two memory accesses.
Performing a two-tier lookup requires 25 instructions, or 28 cycles, plus two memory accesses. Assuming a 30-cycle main memory latency, the entire translation takes 1.333 µs. A lookup of a 4 Mbyte page requires only 19 instructions, or 23 cycles, which corresponds to a 0.803 µs latency.
6.6 Device TLB Performance Evaluation
Overall performance of the device TLB is determined by several components. TLB miss handling latency is measured from the moment a TLB miss is detected until the required entry is loaded into the TLB and the DMA operation restarts. TLB miss handling overhead is defined as the time during which the processor is not executing application or kernel code because of the TLB miss. In addition, the overall system performance impact of both measures is affected by the TLB miss ratio.
6.6.1 Miss Handling Latency Comparison
Figure 28 compares the TLB miss handling latencies of the hardware- and software-based TLB miss handling mechanisms for the set of systems described in Table 11. As expected, hardware-based TLB miss handling generally provides lower latency, especially when considering that the simulated kernel TLB miss handler uses a simple two-level page table organization comparable to the flat memory model in IA-32. Kernel interrupts not only incur at least twice the latency of the comparable device-sided TLB miss handling, they also introduce significant CPU overheads, as noted on top of the bar graphs. On the other hand, device-based TLB miss handlers can only benefit from faster buses and lower main memory latency, while kernel interrupt handlers also benefit from faster processors and are able to partially close the latency gap to the hardware-based handlers. However, this scaling effect may be less significant for many realistic applications because the microbenchmark does not utilize the cache as much as real applications would, and hence the interrupt handler does not incur as many cache and TLB misses. In addition, interrupt handlers can have non-negligible secondary effects on system performance by replacing cache and TLB entries that then need to be reloaded by the application after the interrupt is handled. The performance impact of these effects is largely dependent on the cache and TLB hit rates of the application and is not measured in these experiments. Generally, kernel-based TLB miss handling also shows a larger latency variation due to the fact that the interrupt might be delayed if the kernel is executing a critical section, while hardware-based handlers proceed independently and are only affected by bus and memory controller contention.

Figure 28: TLB Miss Handler Latency
[Bar chart comparing miss handling latency (0-4 µs) of the PowerPC (first primary and last secondary match), MIPS and IA-32 (4k page) table walk engine implementations against the L-RSIM/LAMIX kernel handler for the four simulated systems (400 MHz/100 MHz through 2 GHz/200 MHz). The kernel handler's CPU overhead, noted on top of each bar, ranges from 4.5 µs down to 2.1 µs.]
6.6.2 Host Processor Occupancy
The graphs in Figure 29 show CPU occupancy due to kernel TLB miss handling for a variety of I/O DMA bandwidths and TLB miss ratios. TLB hit ratios are given as a percentage of page accesses, i.e., a 0% hit rate means that every first access to a page incurs a TLB miss, but the remaining accesses to the same page hit. This definition of hit rate is used because it is independent of the size of a DMA transaction, and hence independent of the number of DMA transactions per page. It also implies that hit rates below 0% are possible if multiple streams interfere with each other, causing multiple TLB misses per page. CPU occupancy is determined analytically based on the effective bandwidth under a given page size and hit rate, and the resulting number of TLB misses per second. TLB miss overhead is assumed to remain constant over time. The distribution of the CPU occupancy along the y-axis for a given page size corresponds to varying TLB miss handler overheads ranging from 1.2 µs (lower bound) to 4 µs (upper bound). For instance, in a system with 400 Mbyte/s peak DMA bandwidth, a 1.2 µs TLB miss handler overhead leads to an 11 percent CPU utilization for a zero percent hit rate, while a 4 µs overhead leads to over 30 percent CPU utilization. These miss handler overheads are intended to cover the range observed in the prototype implementation for different processor and memory system configurations.
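The occupancy model can be written out directly (an analytic sketch of the model described above, not simulator output):

```python
# Analytic CPU occupancy model behind Figure 29, per the description
# above: at a given hit ratio, each page transferred costs
# (1 - hit_ratio) kernel miss handler invocations, each occupying the
# CPU for `overhead_s` seconds.

def cpu_occupancy(bandwidth_bps: float, page_bytes: int,
                  hit_ratio: float, overhead_s: float) -> float:
    misses_per_s = (bandwidth_bps / page_bytes) * (1.0 - hit_ratio)
    return misses_per_s * overhead_s   # fraction of CPU time consumed

# 400 Mbyte/s peak DMA, 4 Kbyte pages, 0% hit ratio:
low = cpu_occupancy(400e6, 4096, 0.0, 1.2e-6)   # ~0.12, roughly 11-12%
high = cpu_occupancy(400e6, 4096, 0.0, 4.0e-6)  # ~0.39, over 30%
```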
The graphs show that TLB miss overhead can be significant even for moderate transfer rates. Smaller pages lead to a higher rate of TLB misses given a fixed hit ratio, since the DMA engine crosses page boundaries more frequently. The high sensitivity to TLB hit ratios indicates that locality within a single request stream alone is not sufficient to achieve acceptably low CPU overheads. However, since I/O DMA streams exhibit almost completely sequential behavior, a simple TLB entry prefetching scheme is able to improve TLB hit ratios significantly. Upon a TLB miss, the miss handler not only resolves the outstanding miss but also loads the address translation for the subsequent virtual page into the TLB, thus effectively reducing the miss rate by a factor of two. Such a prefetch scheme is most effective for TLBs with some degree of associativity, since that reduces the likelihood that a prefetched TLB entry may replace an entry that is in use by another stream. Given the relatively high base overhead of a kernel interrupt, performing an additional translation incurs only a minor overhead increase.

Figure 29: TLB Miss Handler CPU Utilization
[Four graphs plotting host processor utilization (0-35 percent) against TLB hit ratio (0-100%) for peak DMA bandwidths of 50, 100, 200 and 400 Mbyte/s, with shaded regions for 4 Kbyte and 16 Kbyte pages.]
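The factor-of-two claim for a sequential stream can be checked with a toy simulation (illustrative only; the FIFO replacement policy and TLB size here are arbitrary assumptions):

```python
# Tiny simulation of the next-page prefetch scheme described above: on
# a miss, the handler installs translations for both the faulting page
# and the next sequential page, halving the miss count for a fully
# sequential DMA stream.

def count_misses(pages, tlb_size=8, prefetch=False):
    tlb, misses = [], 0
    for p in pages:
        if p not in tlb:
            misses += 1
            tlb.append(p)
            if prefetch:
                tlb.append(p + 1)      # prefetch the next virtual page
            while len(tlb) > tlb_size:
                tlb.pop(0)             # FIFO replacement for simplicity
    return misses

stream = list(range(100))              # fully sequential DMA stream
assert count_misses(stream) == 100
assert count_misses(stream, prefetch=True) == 50
```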
6.6.3 DMA Bandwidth Impact of TLB Miss Handling Latency
The impact of TLB miss handling latency on overall I/O performance depends on the peak DMA transfer bandwidth and the TLB miss rate. The graphs in Figure 30 plot the effective DMA bandwidth for a variety of peak bandwidths and TLB miss ratios. Each shaded area describes bandwidth for a particular page size for TLB miss handling latencies ranging from 0.8 µs (upper bound) to 3.0 µs (lower bound). For instance, in a system with 400 Mbyte/s peak DMA bandwidth, the effective bandwidth for a zero percent hit rate ranges from 305 Mbyte/s for a 3.0 µs miss latency to 370 Mbyte/s for a 0.8 µs TLB miss latency. Effective bandwidth is calculated with the assumption that in the absence of TLB misses, data is transferred between the device and main memory at peak bandwidth. During a TLB miss, the DMA engine stalls. Note that the y-axis of each graph represents only the upper 25% of the total bandwidth scale.
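The bandwidth model can be written out as follows (an analytic sketch; it assumes a Mbyte of 2**20 bytes, which best reproduces the quoted 305-370 Mbyte/s range):

```python
# Analytic model behind Figure 30, per the assumptions above: data
# moves at peak bandwidth except while the DMA engine stalls for
# (1 - hit_ratio) TLB misses per page. Assumes binary Mbyte (2**20).

MB = float(1 << 20)

def effective_bw_mb(peak_mb: float, page_bytes: int,
                    hit_ratio: float, miss_latency_s: float) -> float:
    transfer_s = page_bytes / (peak_mb * MB)        # time to move one page
    stall_s = (1.0 - hit_ratio) * miss_latency_s    # stall per page
    return page_bytes / (transfer_s + stall_s) / MB

# 400 Mbyte/s peak, 4 Kbyte pages, 0% hit ratio:
slow = effective_bw_mb(400, 4096, 0.0, 3.0e-6)  # ~306 Mbyte/s
fast = effective_bw_mb(400, 4096, 0.0, 0.8e-6)  # ~370 Mbyte/s
```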
Figure 30: Effective DMA Bandwidth
[Four graphs plotting effective DMA bandwidth against TLB hit ratio (0-100%) for peak bandwidths of 50, 100, 200 and 400 Mbyte/s, with shaded regions for 4 Kbyte and 16 Kbyte pages; each y-axis spans only the upper 25% of the bandwidth scale.]
The graphs clearly show that TLB miss latencies are more critical for higher bandwidth, smaller pages and higher TLB miss ratios. Larger pages incur relatively fewer TLB misses and are thus less sensitive to TLB miss latencies because they amortize the cost of the initial TLB miss over a larger number of subsequent hits. Similarly, a higher bandwidth increases the rate of TLB accesses, and hence the rate of TLB misses under a given miss ratio. The graphs suggest that for modest bandwidth, neither TLB miss ratio nor miss latency has a significant impact on overall performance. For high bandwidth and small pages, low TLB miss latency and low miss ratios become critical to achieve acceptable performance.
To minimize the TLB complexity, the analysis assumes that all requests are stalled while a miss is resolved. This simple design can negatively impact unrelated DMA streams since the TLB is shared among multiple send and receive DMA engines. With some additional hardware, it is possible to stall only the request stream that causes the TLB miss, thus reducing the impact of TLB misses on unrelated DMA streams.
6.7 Summary
Copying data between kernel and user space is in many cases the dominant component of I/O overhead. Eliminating the copy operation leaves the CPU available for application processing and can lead to improved system throughput and higher I/O bandwidth. Enabling the I/O device to transfer data directly to and from user space requires that either the operating system or the I/O device translate virtual addresses specified by the application into physical addresses, while performing the necessary protection checks and detecting page faults. Since invoking the operating system for every I/O request incurs significant overhead, the user-level I/O architecture lets the application provide virtual addresses directly to the I/O device. The device provides a TLB to cache recently used address translations and uses the existing kernel page tables to look up entries not present in the cache. The lookup operation can be performed with kernel assistance or independently by the device. Performing address translations on demand provides flexibility by enabling applications to use arbitrary memory regions for I/O operations, without the need to preallocate buffers.
Having the kernel perform address translations on demand via interrupts simplifies the device hardware and leaves the kernel free to use the most efficient page table organization without regard for the I/O device. However, this scheme incurs higher TLB miss latencies compared to the device-based lookup scheme, which leads to lower effective I/O bandwidth. In addition, under high TLB miss rates this scheme uses significant CPU resources, up to 30% of the available processor cycles.
A programmable page table lookup engine residing in the device combines the necessary flexibility with the ability to perform TLB refills without host processor involvement. Such a lookup engine can be implemented with a small amount of additional chip resources. It is able both to reduce the TLB miss latency observed by the DMA engine and to eliminate CPU involvement during data transfers.
Although TLB miss latencies are performance critical only for the highest DMA bandwidths, host processor utilization can become significant for low TLB hit rates. These results underscore the importance of performing TLB miss handling independently from the host processor.
7. USER-LEVEL NOTIFICATION
Interrupt or notification handling is an important part of most I/O transactions. Interrupts are asynchronous external events and are frequently used to notify the operating system that an I/O transaction has completed. In most general-purpose microprocessors, interrupts cause the processor to save its current state in special trap registers and transfer control to a predefined address. The interrupt handler routine at the predetermined location saves additional state on the kernel stack, determines the cause of the interrupt and performs the necessary operations. Before restoring the process state and resuming execution at the interrupted instruction, many interrupt handlers check whether a scheduling operation is necessary and whether a signal needs to be delivered to the current process.
When initiating an I/O request as a result of a system call performed by an application program, operating systems like UNIX transfer control to a device driver which performs the necessary uncached reads and writes to set up the I/O control information at the device. The application then blocks in the device driver waiting for an I/O interrupt from this device. When the I/O device delivers an interrupt, signalling that an I/O transfer completed, a device-specific interrupt handler unblocks all processes waiting for that transfer. Using these mechanisms, the operating system hides the nonblocking characteristics of I/O transactions from applications by context switching to another process until the I/O request completes. It also allows the operating system to consider I/O activity when making scheduling decisions and computing process priorities.
This operating system design simplifies the application interface to I/O, but it incurs significant overheads. For instance, a disk controller interrupt takes between 50 and 100 µs to execute, during which time it replaces over 10 Kbytes of instructions and between 3 and 4 Kbytes of data in the first-level caches [92]. This overhead is a result of the generality of the interrupt handler, which requires it to save and restore significant process context and which leads to a complex code structure to handle all possible interrupt causes.
Handling interrupts in dedicated device driver routines inside the operating system also limits flexibility in the way completion notifications are treated by individual processes. Application programs are blocked when an I/O operation is initiated and are not able to overlap the long-latency I/O operation with other work. To mitigate this shortcoming, most commercial UNIX variants offer an asynchronous I/O interface. This interface allows applications to continue executing after initiating an I/O request, at the cost of some additional overhead. To deal with the mismatch between blocking device drivers and the nonblocking programming interface, the application library creates an I/O thread which performs a blocking system call on behalf of the original application. When the blocking system call returns, the I/O thread sets a flag or raises a signal in the original application and exits. Creating and removing the I/O thread incurs the cost of additional system calls, but allows the library to implement asynchronous I/O with minimal kernel support. Flexibility in handling asynchronous I/O notifications is limited to a signal handler or polling on a shared flag.
The user-level I/O architecture expands on this scheme and provides a flexible, lightweight mechanism to invoke arbitrary user routines for I/O notifications with minimal kernel involvement. The user-level notification mechanism consists of a lightweight kernel interrupt handler that closely cooperates with applications to asynchronously execute user-level notification routines. A novel notification queue structure in the host processor bus interface reduces the kernel interrupt handler overhead by enabling the UIO device to write all required information to the host processor at the time of the interrupt. The interrupt handler can thus obtain all pertinent information locally from the queue without the cost of multiple uncached reads from the I/O device. The flexibility of the notification mechanism allows applications to use any number of specifically tailored routines to handle notifications, thus reducing the execution cost of each routine. Handling notifications almost exclusively in user space reduces overhead since no heavyweight context switch is necessary and cache and TLB pollution effects are minimized. The notification mechanism exploits the fact that the originating application process is likely to be currently executing, in which case heavyweight scheduling and context switching is not necessary. Note, however, that these lightweight notifications are used only for the case where an I/O request completes without exception; all error conditions and other exceptions such as DMA access violations are handled by the kernel using normal interrupts. This chapter first provides an overview of the user-level notification mechanism, discusses the host processor notification queue and some multiprocessor issues, and then presents results of a performance evaluation using microbenchmarks.
7.1 Lightweight Notification Mechanism
Each user-level I/O request structure contains three pieces of information needed to deliver notifications to the originating application. When initiating an I/O request, the application specifies a buffer location where the original request structure with the return status will be deposited by the I/O device. In addition, it specifies the address of a user-level notification routine that is to be executed when the I/O request completes. The kernel or CSB hardware inserts a process identifier into the request that is used to locate the target process for a notification.
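These three pieces of information can be pictured as fields of the request structure. The following C sketch is illustrative only; the field names and types are assumptions, not the dissertation's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout of a user-level I/O request structure. */
typedef struct uio_req {
    void *result_buf;     /* user buffer where the device deposits the
                             completed request with its return status */
    void (*notification)(unsigned ret_addr, struct uio_req *req);
                          /* user-level routine to run on completion */
    unsigned process_id;  /* inserted by the kernel or CSB hardware to
                             locate the target process */
    int status;           /* return status filled in by the remote device */
    struct uio_req *next; /* link used when notifications are queued */
} uio_req;
```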
Notifications are handled by the UIO device in two phases, as shown in Figure 31. When the UIO device receives the request structure as part of the response from a remote I/O device, it writes the request structure into the predetermined buffer in user space using the existing DMA mechanism. Storing the request in user space allows the application to examine the return value as well as any request-specific arguments as part of the notification handling. Software is responsible for ensuring that multiple outstanding notifications are deposited in distinct buffer locations; otherwise notifications overwrite each other and do not reach the application.
Figure 31: Notification Handling (diagram: a request returned from a remote device is (1) DMAed into the request buffer in the application address space, and (2) sent to the host processor, triggering an interrupt)
After writing the request structure into user space, the UIO device writes the same structure to the notification queue in the host processor bus interface to trigger an interrupt. This is similar to the way normal external interrupts are delivered in many architectures. However, the bus transaction causing the interrupt carries the complete request structure to the CPU, thus making all pertinent information available locally in the CPU. When receiving the notification, the host processor enters a low-priority interrupt handler that accesses the control registers in the CPU bus interface to process the notification. Using the process identifier in the request structure, it determines if the currently running process is the target process for the notification. If it is, the kernel interrupt handler saves a minimal amount of process state on the user stack and modifies the process state such that the application process starts executing the notification handler when returning from the interrupt. The address of the request structure in user space as well as the address of the interrupted instruction are passed to the notification routine as arguments. Upon completion, the user-space notification handler returns to the interrupted instruction by executing a jump to the address provided as one of the arguments.
If the current process is not the target process for the notification, or the current process is not executing in user mode (e.g., in a system call), the notification cannot be delivered immediately and must be queued for later delivery. Queuing ensures that each individual notification is forwarded to the application, since each notification carries a unique return status for a distinct request. To queue pending notifications, the kernel maintains a list of pointers to request structures in the process structure, as shown in Figure 32. Keeping the list in nonpageable memory allows the interrupt handler to access it without risking page faults. When receiving a notification for an inactive process, the kernel interrupt handler appends to this list the virtual address of the current request structure that was written to a user buffer. The maximum size of the list determines the maximum number of outstanding requests for a process and is a system-specific constant. On a context switch, the kernel checks if the new process has any pending notifications, similar to the check for pending signals in UNIX. In fact, one of the unused bits in the process signal mask may be used to flag this situation. If a process has pending notifications, the kernel uses the pointer array to create a linked list of request structures in user space. Unlike the interrupt handler, the kernel is able to access user memory at this time because a page fault only blocks the current process. The kernel then modifies the process state such that the application resumes execution at the first notification handler. Each notification handler is responsible for checking whether it is the last in the list, and otherwise calling the next notification routine with the correct arguments. The last notification handler returns to the instruction that the application was executing before the last context switch, using a jump instruction.

Figure 32: Queued Notification Handling (diagram: (1) the interrupt handler creates a list of pointers to pending notifications in the process structure; (2) when the process resumes with a modified application PC, the kernel builds a linked list of request structures in the application address space and resumes at the first notification handler)
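The context-switch step that turns the kernel's fixed-size pointer array into a linked list threaded through the user-space request structures can be sketched as follows; the function and field names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Abbreviated request structure: only the queue link matters here. */
typedef struct uio_req {
    struct uio_req *next;
} uio_req;

/* Link the pending request structures into a list, preserving arrival
 * order, and clear the kernel-side slots for reuse. */
static uio_req *link_pending(uio_req *pending[], int count)
{
    uio_req *head = NULL;
    for (int i = count - 1; i >= 0; i--) {
        pending[i]->next = head;  /* thread the list through user memory */
        head = pending[i];
        pending[i] = NULL;        /* slot in the process structure is free */
    }
    return head;                  /* first notification to deliver */
}
```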
The amount of state saved by the kernel upon entry into the notification handler includes only the registers that are used to pass arguments to the routine. The notification routine is responsible for saving and restoring any additional state that it modifies during its execution. Subsequently, the routine can execute arbitrary code to synchronize with the application, such as setting a flag or unblocking a user thread. Unlike a traditional UNIX signal, returning to the interrupted instruction is accomplished by executing a jump to the address provided as one of the arguments. The following pseudocode sequence illustrates a basic framework for user-level notifications.
void uio_notification(unsigned ret_addr, uio_req *req)
{
    save registers;                  /* only those this handler modifies */
    perform notification operation;  /* e.g., set a flag, mark a thread ready */
    if (req->next != NULL) {
        /* not the last pending notification: chain to the next handler */
        restore registers;
        req->next->notification(ret_addr, req->next);
    }
    /* last notification: return to the interrupted instruction */
    restore registers;
    jump(ret_addr);
}
Note that to execute correctly under all circumstances, the notification routine must be reentrant and may access only lock-free data structures [46], because it is not allowed to block on synchronization variables. However, since the routine executes completely in user space, violating these requirements affects only one application and not the entire system.
Reentry capability is needed because notifications can arrive at any time, including during the processing of a notification, and the system provides no mechanism to disable or mask notifications temporarily. Since notifications are local to a process, and notifications are lowest-priority interrupts, it is possible to provide a system call interface that raises the interrupt priority on behalf of an application. If the interrupt priority is part of the process context and is saved and restored on a context switch, applications can use the system call to disable notifications during critical sections. The kernel must also ensure that queued notifications are not delivered after a context switch if the interrupt priority prohibits this. However, the cost of a system call to mask notifications for relatively short critical sections does not seem justified, considering that in many cases lock-free data structures can be used to implement similar functionality. Alternatively, processors can implement a user-accessible control register that is used to enable or disable notifications and that becomes part of the process context. This scheme gives applications complete and lightweight control over notifications, similar to interrupt masks used by the kernel. However, adding registers to existing instruction set architectures is often difficult and costly and should not be taken lightly.
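As an illustration of the lock-free style such a routine must follow, the sketch below publishes completion through C11 atomics instead of blocking on a lock. This is a hedged example, not code from the dissertation; the flag and counter names are invented:

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared state between the notification routine and the application. */
static atomic_int io_done_flag;    /* completion flag polled by the app */
static atomic_uint completions;    /* running count of completed requests */

/* Body of a notification routine: both operations are lock-free on
 * common targets, so the routine never blocks even if it interrupts
 * arbitrary application code or another notification. */
static void on_io_complete(void)
{
    atomic_fetch_add_explicit(&completions, 1, memory_order_relaxed);
    atomic_store_explicit(&io_done_flag, 1, memory_order_release);
}
```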
7.2 Processor Notification Queue
To minimize the overhead of the user-level notification mechanism, the host processor provides a notification queue that holds request structures from the time the interrupt is triggered until it is processed by the kernel interrupt handler. The queue enables the UIO device to transfer the complete request structure to the processor at the time it signals the interrupt. The tail of the queue is exported by the processor as a set of control registers. Writing to the queue loads an entry with the data and triggers an interrupt. To determine the target process and notification handler address, the interrupt handler can access the head of the queue locally without issuing bus transactions, thus avoiding the cost of multiple uncached reads from device control registers.
The purpose of the notification queue is to provide sufficient storage for one notification from each possible UIO device. Consequently, its size limits the maximum number of UIO devices supported by a system. To avoid overflowing the notification queue, each notification needs to be acknowledged by the kernel interrupt handler via an uncached store to the device, similar to regular external interrupts. The kernel interrupt handler processes the notification at the head of the queue, acknowledges it to the device and explicitly shifts the queue before exiting. To avoid the cost of polling all UIO devices to match the current notification to a particular device, the originating UIO device inserts its physical base address in the request structure before writing it to the processor notification registers. The physical address is configured in a device control register by the kernel and is used by the interrupt handler to write the acknowledgment to the device.
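The head-processing and explicit shift can be sketched with a toy in-memory model of the queue. In the real design the queue lives behind memory-mapped registers in the bus interface; all names and the queue depth here are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define NQ_DEPTH 4   /* illustrative: one slot per possible UIO device */

/* Plain-array stand-in for the processor's notification queue. */
typedef struct {
    uint64_t entries[NQ_DEPTH];  /* request structures, abbreviated */
    int count;
} notif_queue;

/* Handler step: take the head entry and explicitly shift the queue.
 * The real handler would also issue one uncached store to the device
 * whose base address is carried inside the request structure, to
 * acknowledge the notification before exiting. */
static uint64_t nq_pop(notif_queue *q)
{
    uint64_t head = q->entries[0];
    for (int i = 1; i < q->count; i++)
        q->entries[i - 1] = q->entries[i];
    q->count--;
    return head;
}
```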
The notification queue minimizes overhead by avoiding the cost of uncached reads to UIO devices, but the basic user-level notification mechanism does not depend on it. Similar to conventional device interrupt handlers, the kernel notification interrupt handler can poll all UIO devices to determine which device triggered the interrupt, and then read the notification-specific information from control registers. However, the sequence of uncached reads required and the added complexity of invoking multiple interrupt handler routines to find the originating device lead to overhead that approaches that of a regular interrupt handler.
7.3 Multiprocessor Considerations
In multiprocessor systems, the operating system often considers processor affinity when scheduling a process for execution. By executing a process on the same processor as during the previous time slice, cache and TLB locality can be exploited across time slices to reduce the performance impact of multiprogramming. Under this scheduling policy, sending a notification to the processor that executed the application process when the request was issued increases the likelihood that the process is found to be active, thus reducing notification latency and overhead. If processes migrate between processors while UIO requests are outstanding, or if the UIO device sends a notification to the wrong processor, the user-level notification mechanism queues notifications more often than necessary, degrading performance to that of a specialized signal handling mechanism. Exploiting the performance benefits of processor-specific notifications requires that the request indicate from which processor it originated. In many system architectures, this information is available on the system bus, from where it can be inserted into the request structure by other devices [68][84]. However, current I/O buses such as PCI do not propagate this information to I/O devices. Future local I/O interconnect designs will possibly propagate more detailed information such as the processor ID to I/O devices, enabling a tighter integration of I/O devices and host processors. Alternatively, the CSB could insert the processor ID into each request at a fixed location, similar to the process context. Unfortunately, this scheme restricts the flexibility of the CSB even further.
7.4 Performance Evaluation
Figure 33 compares the latencies of the user-level notification mechanism under various conditions to a traditional interrupt-plus-signal scenario. Each group of graphs shows results averaged over 70 different I/O requests for the machine configurations described in Table 11. Graphs in the top group represent absolute values while the bottom graphs show overhead relative to the baseline. The leftmost graph of each group shows the latency of a regular interrupt handler (bottom portion) plus the cost of handling the signal in the application (top portion). The interrupt handler performs three uncached reads to determine the cause of the interrupt, to determine the target process for the notification and to load the virtual address of the request packet. It then sends a signal to the target process, acknowledges the notification via an uncached store and reads the device status register again to check if another interrupt is pending. The signal handling cost is measured using LMBench. Signal handling latencies measured on the simulator are significantly lower than on real systems, because unlike commercial operating systems, the LAMIX kernel does not support all UNIX signal handling features and because it is not multithreaded. With a 175 MHz microprocessor, the LAMIX kernel is almost four times faster when delivering a signal. Consequently, the results in Figure 33 are pessimistic as the baseline overhead is lower than it would likely be in a real system.

Figure 33: Notification Latencies (two groups of bar charts for the configurations 400/100, 400/200, 2G/100 and 2G/200; the top group shows absolute overhead in µs, the bottom group shows overhead normalized to the baseline; each group compares interrupt + signal, queued notification (worst and best case) and immediate notification (worst and best case))
Note that this baseline scheme of interrupt handler and signal delivery does not correspond to any existing I/O notification mechanism. Most I/O interrupt handlers unblock a process and invoke the scheduler, in addition to acknowledging the interrupt and sending a signal. However, this scheme is used because it corresponds best to the nonblocking I/O model of the user-level I/O architecture.
The second and third bar graphs in each group show the cost of a user-level notification if the target process is not running. The bottom portion of each graph represents the kernel interrupt handler costs, and the top portion corresponds to the cost of delivering one queued notification to the application. Note that the best and worst cases shown here are what has been observed in the microbenchmark; they do not necessarily correspond to the theoretically possible absolute best and worst case.
The last two bar graphs correspond to the cost of immediately delivering a user-level notification to the currently running process, again separated into observed best and worst case. Note that the user-level notification results do not include the cost of the actual application notification handler, which depends entirely on the amount of work performed in the handler. Similarly, the baseline signal handler result is obtained with an empty signal handler. Performing actual work in the signal handler or user-level notification handler increases the latencies shown here by a constant amount, assuming that both handlers perform a comparable amount of work.
The relative performance benefit of the user-level notification scheme is significant for all evaluated system configurations, for several reasons. The user-level notification interrupt handler is able to access all necessary information about the notification locally, whereas normal interrupt handlers issue a number of uncached loads for this purpose. The amount of state saved by the kernel when invoking the user-level notification handler consists only of the registers needed to pass the arguments to the handler. UNIX signals require that the kernel save the entire process state on the user stack. In addition, returning from a UNIX signal requires another system call to restore the original process state.
All notification schemes benefit from faster processors as well as faster buses. The regular interrupt-plus-signal scheme is able to exploit the higher compute performance of the faster processor better because of its longer code path, which contains more instruction-level parallelism compared to the user-level notifications. Due to the larger number of uncached loads, it also benefits more from faster system buses with lower latency. However, it should be noted that the presented results for the interrupt-plus-signal baseline are close to the achievable minimum, due to the measurement methodology of LMBench, whereas the user-level notification latencies are measured in a more realistic application environment.
One important performance benefit of the user-level notification scheme is not captured by the microbenchmark at all. In the user-level I/O architecture, applications can use different notification handlers for different requests. This allows programmers to write a set of notification handlers specifically optimized for one task, and to use the appropriate handler for each request. This flexibility can simplify the programming task and leaves room for optimizations in each handler. In contrast, because changing signal dispositions involves an expensive system call, signal handlers are usually more general and hence complex and slow. The performance benefit of this added flexibility of the user-level notification depends on the particular synchronization scheme used in an application and is not further investigated here.
7.5 Summary
In most current systems, the nonblocking characteristics of long-latency I/O operations are hidden from applications by the kernel. After initiating an I/O transaction, the application process is suspended until the request completes. Upon completion, the I/O device interrupts the kernel to unblock the application process and invoke the scheduler. Implementing nonblocking I/O in a UNIX-like kernel usually requires that the application spawn a separate thread which blocks on its behalf, a process which incurs additional overhead from system calls and costly inter-thread communication via signals.
The user-level I/O architecture provides a flexible, low-overhead mechanism to invoke arbitrary user routines when an I/O request completes. Application programmers are able to use different notification routines for different requests and can thus optimize each routine for its particular purpose. The notification mechanism assumes that it is likely that the initiating process is still running at the time of the notification and that a heavyweight context switch is not necessary. In this case the kernel saves only minimal state before invoking the user-level handler, which saves any additional state as necessary. Notification handlers return to the original instruction by executing a jump instruction. The queueing mechanism necessary for cases where notifications arrive for inactive processes adds only a small amount of execution overhead and requires the addition of a small fixed-sized array of pointers in nonpageable memory.
The performance advantages of the user-level notification scheme over normal interrupts with signals for synchronization stem from the fact that the UIO device pushes all necessary information to the processor, and that the kernel is not involved in returning from a notification handler. Other advantages include the flexibility of using specifically tailored notification handlers for different requests and the fact that notifications are processed almost completely in user space. As a result, user-level notifications are able to reduce notification overhead by more than a factor of five for the common case, and by more than a factor of two if the notification must be queued for later delivery.
8. PERFORMANCE EVALUATION
Previous chapters have presented performance evaluations of the individual mechanisms and have focused on comparing alternative designs. This chapter quantifies the performance of the entire user-level I/O architecture. The low request overhead of the UIO architecture directly benefits applications that use latency hiding to improve throughput. Such applications can implement user-level multithreading or other application-level synchronization schemes to overlap I/O latencies. In addition, other applications with low I/O requirements or without the ability to hide I/O latencies themselves may benefit indirectly as the reduced request overhead makes more processor cycles available. The ability of the user-level I/O architecture to perform I/O operations with arbitrary user buffers without the need to wire or preallocate those buffers means that applications can realize performance improvements without software modifications.
This chapter first describes details of the prototype implementation with respect to the remote I/O device models and the application libraries. It then presents results of experiments that use a synthetic benchmark to measure the maximum available I/O bandwidth under various request patterns, and shows how a database server can improve throughput without any modifications to the original program.
8.1 Prototype Architecture
The user-level I/O architecture presented in this work is evaluated with a distributed storage device organization similar to network-attached secure disks [42]. This storage architecture improves file server scalability by separating data storage from data management and by transferring data directly between disks and clients. At the heart of the system are autonomous storage devices connected to clients via a scalable system-area network. The storage devices export an object-based interface to clients on top of which existing file system abstractions can be built. A storage manager oversees access rights and metadata operations for the distributed disks, but is not involved in data transfers.
The network-attached disk architecture modeled follows the general outline of the proposed network-attached secure disks architecture [42]. However, because the user-level I/O architecture focuses on data transfers, a simpler approximation of the object-based interface has been implemented. Since no network-attached disk device prototype exists, this study uses workstation nodes to model these autonomous disks, similarly to other work on network-attached disks. Each disk node runs a complete operating system kernel with additional server code. The disk nodes export a standard UNIX filesystem instead of the generic object-based interface. Objects are referenced by filename instead of an encrypted capability. These approximations do not affect the performance of data transfers, which are the focus of this work. Modeling these details more accurately would have no impact on the results shown here. However, transferring filenames instead of encrypted capabilities adds a comparable amount of data to each request. Although the remote disk models do not decrypt capabilities, the higher overhead of the general-purpose kernel running on the disks compensates for the simplified control path. In addition, most file-manager-related activity affects only metadata modifications and not the actual data transfer, which is the focus of this work.
The distribution of responsibilities between the kernel and user code can vary, from performing only performance-critical data transfers at the user level to completely eliminating the operating system from any I/O operations. Since the remote disks do not trust client machines, the exact choice does not affect system security. In either case, a user-level library maintains a list of open files, similar to the process file descriptor table managed by the operating system, as shown in Figure 34. Each entry contains the appropriate user-level device address, routing information to access the corresponding remote disk, the object capability and the file offset. The first two items allow applications to issue requests directly to the UIO device, which forwards them to the correct disk. The file offset must be maintained in the library because read and write requests bypass the kernel, and for scalability reasons the remote disks do not keep information about open files. The file offset is transferred to the disk with every request, and is updated locally when the read or write operations succeed. Maintaining this information at the user level allows the library to present the familiar file interface, regardless of whether the file is stored locally or on a remote disk. To distinguish regular files from user-level files, UIO file descriptors have higher values than the highest possible kernel file descriptor.

Figure 34: File Descriptors in a User-level I/O Library (diagram: read(fd, buf, size) checks whether fd > M; if so, file-specific data from the UIO file descriptor table together with the remaining parameters forms a UIO request, otherwise the call is forwarded as a system call)
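The descriptor dispatch and local offset update might look as follows in C; the threshold constant and helper names are assumptions for illustration, not the prototype library's actual interface:

```c
#include <assert.h>

/* Illustrative "M": above any descriptor the kernel could hand out. */
#define UIO_FD_BASE 1024

typedef enum { VIA_SYSCALL, VIA_UIO } io_path;

/* Per-file state kept in the user-level descriptor table. */
typedef struct {
    long offset;   /* maintained locally since requests bypass the kernel */
} uio_file;

/* Decide which path a read() call takes, as in Figure 34. */
static io_path dispatch_read(int fd)
{
    return (fd >= UIO_FD_BASE) ? VIA_UIO : VIA_SYSCALL;
}

/* On completion, advance the local offset only if the transfer succeeded. */
static void complete_rw(uio_file *f, long bytes_done)
{
    if (bytes_done > 0)
        f->offset += bytes_done;
}
```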
If the operating system delegates all UIO file operations to user code, the library must also maintain a list of mount points or directory prefixes at which remote disk object hierarchies start. The list of prefixes is checked for every file operation that takes a filename as argument, and the operation is either forwarded to a system call or results in a UIO request. In addition, for completeness it is necessary to implement the notion of a current working directory and a root directory in the library.
Delegating the file descriptor management and possibly the UIO mount point management to user-level software does not weaken the security model, since remote storage devices do not trust clients regardless of whether requests are issued by the client operating system or user-level code. In addition, failure to correctly implement this functionality affects only the one application and has no impact on the correctness of other applications. One reason for this robustness is that remote disks do not maintain any per-client information about open files. Since clients do not consume any resources on remote I/O devices beyond what is required to handle pending requests, a client terminating with pending UIO requests has no impact on the remote disks or on other applications. However, the required functionality increases application software complexity, and should be implemented in a dynamically linked library where it is hidden from application programmers and can easily be reused. A dynamic library has the additional advantage that it can adapt different hardware implementation details of the UIO device to a common programming interface.
To completely hide the details of the user-level I/O architecture, the library must also suitably implement the blocking behavior of file operations, for instance by initiating a thread switch after issuing a UIO request. Since notification handlers must be reentrant and may not block, blocked threads cannot be unblocked directly by the notification handler as the thread run queue may be locked. Instead, the notification handler sets a flag in the thread structure marking the thread as ready, and signals the scheduler through a global flag that at least one thread can be unblocked. At the next thread scheduler invocation, the list of blocked threads is scanned and all ready threads are moved to the run queue.
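A minimal sketch of this deferred-unblock protocol, with invented names for the thread structure and flags; the scheduler-side function counts ready threads in place of actually moving them to a run queue:

```c
#include <assert.h>
#include <stddef.h>

typedef struct uthread {
    int ready;              /* set by the notification handler */
    struct uthread *next;   /* link in the blocked list */
} uthread;

static volatile int any_ready;   /* global hint checked by the scheduler */

/* Called from a notification handler: may not lock the run queue,
 * so it only marks the thread and raises the global hint. */
static void notify_ready(uthread *t)
{
    t->ready = 1;
    any_ready = 1;
}

/* At the next scheduler invocation: scan the blocked list and collect
 * every thread marked ready (stand-in for moving it to the run queue). */
static int collect_ready(uthread *blocked)
{
    int moved = 0;
    if (!any_ready)
        return 0;
    for (uthread *t = blocked; t != NULL; t = t->next) {
        if (t->ready) {
            t->ready = 0;
            moved++;
        }
    }
    any_ready = 0;
    return moved;
}
```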
A compatibility library gives single-threaded applications access to the user-level I/O interface. Although such applications cannot directly take advantage of the reduced I/O overhead, the increased idle time can be made available to other processes to improve overall system throughput. After transmitting an I/O request, the library initiates a context switch via the yield() system call. A global flag set by the notification handler indicates if the request is complete. The library repeatedly checks the flag and initiates context switches if it is not yet set. This scheme does not minimize overhead, as it wastes processor cycles polling the flag. However, if other processes are competing for the CPU, the poll interval is relatively large. Frequent polling occurs only if little or no other work is available, in which case the higher processor utilization does not affect throughput. The main disadvantage of this scheme is that the kernel scheduler is unaware of the completion notification and may delay rescheduling the process unnecessarily, causing longer I/O latencies.
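The poll-and-yield loop can be sketched as below. The yield function is passed as a parameter so a demo stand-in can take the place of the yield() system call; all names are illustrative:

```c
#include <assert.h>

static volatile int request_done;   /* set by the notification handler */

/* Poll the completion flag, yielding the CPU between checks.
 * Returns the number of yields for illustration. */
static int wait_for_completion(void (*yield_fn)(void))
{
    int yields = 0;
    while (!request_done) {
        yield_fn();     /* give the CPU to competing processes */
        yields++;
    }
    return yields;
}

/* Demo stand-in for yield(): pretend the notification handler sets the
 * flag after the third context switch. */
static int demo_yields_until_done = 3;
static void demo_yield(void)
{
    if (--demo_yields_until_done == 0)
        request_done = 1;
}
```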
8.2 UIO Bandwidth Scaling
A synthetic benchmark is used to measure the maximum I/O bandwidth attainable from a number of independent disks. The benchmark issues I/O requests as quickly as possible without consuming the data. Requests to the same disk are blocking, while requests to independent disks are overlapped to maximize throughput. This benchmark is identical to the one used in Section 2.2.3, with added library support for user-level I/O. The simulated architecture is also identical to the one described in Table 4, allowing a comparison of kernel-based and user-level I/O performance.
The results in Figure 35 compare the bandwidth of kernel-based I/O with the user-level I/O architecture for a variety of request sizes and access patterns and for different numbers of disks. The UIO model uses a single UIO network interface with a simplified model of a system-area network. The UIO interface implements six DMA engines: three send engines and three receive engines. This means that a maximum of three messages can be in transit in each direction. Note that the UIO device provides only one TLB that is shared between all DMA engines. Preliminary experiments showed that due to contention, fewer DMA engines affect the request latency negatively, decreasing overall performance. On the other hand, the kernel-based I/O measurements are performed with one DMA engine per disk. The bandwidth of the I/O network is 300 Mbyte/s, which corresponds to the bandwidth of InfiniBand over a serial copper link. The local I/O bus, on the other hand, has no bandwidth restrictions, thus approximating an architecture in which each I/O device resides on a private bus. In addition, each SCSI disk is attached to a separate SCSI bus to keep SCSI bus and host adapter contention from affecting the results.
Figure 35: I/O Bandwidth Scaling Comparison (four panels: nonsequential and sequential accesses with 16 Kbyte and 64 Kbyte blocks; aggregate bandwidth in Mbyte/s, up to 320 Mbyte/s for the sequential cases, versus the number of streams from 0 to 24, for user-level I/O and kernel-based I/O)
The graphs show that the UIO architecture achieves significantly improved bandwidth compared to the kernel-based I/O architecture, resulting in twice the bandwidth for 23 disks. For nonsequential accesses, scalability is almost linear since host processor occupancy is the only limiting factor. For sequential accesses, bandwidth starts to saturate as it approaches the 300 Mbyte/s limit of the I/O network.
Also note that kernel-based I/O never outperforms UIO, even though UIO requests experience a slightly higher latency due to the additional network traversal. These particular experiments were performed using the fast trap mechanism to insert the process context on a CSB flush. The TLB uses kernel-based interrupt handling. However, since the benchmark does not perform any other operations besides issuing I/O requests, the choice of hardware or software mechanism does not affect the overall bandwidth. In the UIO case, the CPU utilization for 23 request streams is below 2 percent, indicating that mechanisms with slightly higher overhead would not change these results. In the case of kernel-based I/O, however, CPU utilization is close to 100 percent and hence the overall bandwidth is saturated.
Figure 36 shows how a finite network bandwidth of 100 Mbyte/s affects bandwidth scaling. In the simulator prototype, bandwidth is limited by throttling the maximum rate at which the UIO device DMA engines issue bus transactions. The effect of a limited-bandwidth I/O network is similar to the operating system saturation effect for kernel-based I/O. Note that for nonsequential accesses, even small numbers of streams show a decrease in throughput. The lower network bandwidth increases overall request latency. Because the benchmark is not able to sufficiently overlap the long disk seek latencies with other requests even without a bandwidth-limited network, overall throughput decreases further.
Figure 36: I/O Bandwidth Scaling Comparison with Network Effects. [Four panels plot aggregate bandwidth in Mbyte/s against the number of streams (0 to 24) for nonsequential and sequential accesses with 16 Kbyte and 64 Kbyte blocks, comparing kernel-based I/O with user-level I/O over 100 Mbyte/s and 300 Mbyte/s networks.]
The sequential streams, on the other hand, are able to approach the 100 Mbyte/s limit quickly. Furthermore, throughput remains close to this limit for an increasing number of request streams, indicating that in the presence of sufficient concurrency, the UIO architecture is able to sustain throughputs close to the hardware limit. Note that, as described in Section 6.6, an increased TLB miss latency translates into lower effective bandwidth of the DMA engines, which leads to similar throughput reductions as if the network itself were bandwidth limited. Because the microbenchmark does not utilize the processor sufficiently, the impact of other overheads on system throughput cannot be measured.
8.3 Application Throughput Scaling
The experiment described in this section evaluates the throughput scalability of a realistic I/O-intensive application, the MySQL database server [98]. Similar to the measurements presented in Section 2.2.4, a number of database queries are issued to separate tables residing on independent disks. A single instance of the database server executes a thread for each query to hide I/O latency.
The MySQL database server remained unmodified; only the user-level thread library was extended to include support for user-level I/O. As outlined earlier, the library maintains a table of user-level file descriptors as well as a list of UIO mount points. When a file is opened, the library compares the absolute filename to the list of mount points to distinguish regular files from remote files. For remote files the library maintains the file offset as well as access and routing information for the corresponding remote disk. The thread structure is extended by a flag indicating whether the thread is blocked on a UIO request. Blocked threads are skipped by the scheduler. The flag is set when a request is issued and is reset by the notification handler. This design enables a simple lock-free implementation of thread blocking and scheduling. However, leaving blocked threads in the run queue increases the complexity of the scheduler slightly, and can lead to increased thread switch latencies if a large number of threads is blocked. In the experiments reported here the number of threads is small enough that this simple design is adequate. Alternatively, after issuing a UIO request, threads can be moved to a blocked-thread list. A global flag indicates whether at least one thread can be unblocked, in which case the scheduler scans the list of blocked threads and removes any that can be unblocked. This alternate scheme can result in lower thread switch latencies for large numbers of threads. However, for small to moderate numbers of threads, the performance impact on the scheduler due to the additional flag is likely comparable to skipping blocked threads.
Figure 37 compares the throughput scaling for varying numbers of concurrent queries for a conventional kernel-based I/O system and the user-level I/O architecture.
Figure 37: MySQL Database Throughput Scaling. [The graph plots speedup (0 to 10) against the number of queries (0 to 16) for user-level I/O and kernel-based I/O.]
In the kernel-based I/O system, throughput saturates at nine queries as the CPU utilization approaches 100 percent. Approximately 35 to 36 percent of the busy CPU time is spent executing operating system code directly or indirectly related to I/O processing. The user-level I/O architecture reduces this overhead by two orders of magnitude and as a result is able to scale to larger numbers of queries. Indeed, throughput for 15 queries is improved by close to 25 percent. The reason throughput does not improve by the theoretically possible 35 percent is that the database server uses additional internal synchronization that somewhat limits thread concurrency. In addition, slight changes in disk request arrival times can lead to significant differences in head seek times, thus affecting disk bandwidth. Finally, the simulated I/O network limits bandwidth to 300 Mbyte/s. The synthetic benchmark in the previous section demonstrates that this can lead to throughput saturation starting at 13 streams.
It should be noted that even for small numbers of queries, the kernel-based I/O system never outperforms UIO. Despite the fact that individual UIO requests incur a slightly higher latency due to the additional network traversal, the lower overhead allows better latency hiding and hence better throughput scaling.
The experiments also showed that thread concurrency is critical to achieve maximum throughput, and that application performance is very sensitive to the locking strategy. For instance, MySQL implements a global index cache that is shared by all threads, as well as a global cache of table data structures. Accessing these structures during indexed queries or when opening a table blocks all other threads and effectively eliminates thread concurrency, even for threads operating on different tables or databases. This design is acceptable for environments with a low degree of concurrency due to few requests or high I/O overhead, but it can have a significant performance impact on the highly parallel user-level I/O architecture. The queries in this experiment do not use the table index and are hence able to maximize concurrency for the data retrieval phase. As a result, these queries approximate the performance of an optimized database implementation with fine-grain locking of index and table data structures.
8.4 Summary
The effectiveness of the user-level I/O mechanisms is evaluated in the context of a distributed autonomous storage architecture similar to network-attached secure disks. Bypassing the kernel for disk I/O operations means that applications must maintain file state normally managed by the kernel. The per-file state may include access and routing information for the remote device and the file offset. Maintaining this information at user level rather than in the kernel does not affect the protection model, as remote disks do not trust clients regardless of whether requests are issued by the kernel or by applications. Dynamically linked libraries can be used to hide hardware-dependent details of the architecture, to maintain per-file state, and to present a conventional blocking I/O interface to applications.
A synthetic benchmark that issues I/O requests to a set of disks shows that due to significantly reduced overhead, the user-level I/O architecture is able to provide twice the bandwidth of a kernel-based I/O system while at the same time reducing CPU utilization to less than two percent. For the MySQL database server, these improvements translate into 25 percent increased throughput for 15 concurrent queries without any application restructuring.
9. CONCLUSIONS
The performance of the I/O subsystem is becoming increasingly important for a variety of applications. Many of these I/O-intensive applications are part of a growing market for multiprocessor servers and experience increasing performance requirements. At the same time, technological trends lead to a growing gap between application and operating system performance. As a result, latency hiding techniques employed to improve throughput in the presence of long I/O access latencies are becoming less effective as the operating system overhead of I/O operations limits their scalability. Bypassing the operating system for I/O operations while maintaining most of the characteristics of kernel-based I/O allows applications to maximize the performance benefit of overlapping I/O operations without requiring extensive software modifications.
9.1 User-level I/O Architecture
This thesis introduced an I/O architecture that reduces I/O overhead by giving applications direct and protected access to the I/O subsystem. The architecture builds on a distributed organization of autonomous I/O devices that connect to clients over a scalable system-area network. Clients communicate with remote devices through a local I/O network interface. Together with the host processor bus interface, this user-level I/O device provides mechanisms that implement protected user-level request initiation, user-space data transfer, and completion notification. The architectural cost of these mechanisms is low as they require only small amounts of chip area and do not affect the cycle time of the processor core or I/O device. A stateless communication protocol facilitates an efficient and scalable hardware implementation of the UIO device.
Building on the basic mechanisms, the user-level I/O architecture presents a lightweight nonblocking programming interface similar to many message-passing architectures. On top of this low-level interface a variety of standard I/O interfaces can be implemented that allow applications to take advantage of the low-overhead I/O mechanisms with no or minimal software restructuring. The flexibility of the underlying mechanisms keeps the overhead of such libraries low. Bypassing the operating system allows the architecture to provide scalable performance that is not limited by the growing CPU-memory performance disparity.
A prototype of the user-level I/O architecture has been implemented in an execution-driven system simulator. To this end, an existing simulation tool has been significantly extended to combine detailed processor, cache, and I/O device models with a UNIX-compatible operating system. Validation of the simulation system against an existing workstation demonstrates that the tool is able to accurately capture the microarchitectural characteristics as well as most of the operating system performance of a real computer system.
Microbenchmarks show that the user-level I/O architecture reduces I/O overhead by a factor of 100, and that the overhead is less sensitive to technological trends than that of an operating-system-based I/O architecture. For 23 independent request streams, these overhead reductions lead to bandwidth improvements of a factor of two, while at the same time reducing the processor occupancy by almost two orders of magnitude. Whereas a traditional kernel-based I/O system saturates at eight to ten streams due to software overhead, the UIO system exhibits almost linear scalability up to the bandwidth limit imposed by other factors such as the network or I/O bus.
A public-domain database server is able to improve its throughput for 15 concurrent queries by up to 25 percent. The UIO system is able to make almost 100 percent of the processor cycles available for application processing. Most importantly, these performance improvements are realized without any software modifications outside the user-level thread library.
9.2 Limitations and Future Work
Due to its limited scope, this study is not able to explore all aspects of the user-level I/O architecture. Future work includes a more thorough I/O characterization and performance analysis of a representative set of I/O-intensive applications. The throughput improvement resulting from the reduced I/O overhead depends on the I/O characteristics of individual applications. In general, applications with a high I/O-to-computation ratio will benefit more from the reduced I/O overhead. On the other hand, applications with a significant computation component benefit indirectly and less substantially from lower overhead as more CPU cycles become available to the application. A detailed analysis of common I/O-intensive workloads such as Web servers and collaborative business software is needed to quantify the realized performance improvements for a wider variety of applications.
The complexity versus overhead trade-off of the architectural mechanisms can only be explored under a wide variety of realistic workloads. This study has quantified the overhead contributions of the individual phases of an I/O transaction using fine-grain measurements and microbenchmarks. The impact of these factors on overall system performance depends on the CPU utilization of the application. In some cases it may be acceptable to incur the cost of a system call to initiate an I/O transfer, but other applications can benefit from further overhead reductions. In addition, the latency of I/O requests affects the ability of applications to overlap requests and to tolerate overhead. These factors need further investigation with a representative set of applications.
9.2.1 General-purpose User-level I/O
The prototype implementation of the user-level I/O architecture focuses on disk I/O. Expanding the architecture to include support for other high-performance I/O devices such as network adapters and graphics adapters is needed to make it a true alternative to kernel-based I/O. Fortunately, graphics device access is similar to disk I/O as requests are always initiated by the host processor. Furthermore, almost all transfers occur from memory to the device. Network I/O, on the other hand, can generate incoming messages without an explicit request by the host processor. This unexpected message arrival either requires intermediate buffering, or a modification of the programming interface similar to Active Messages. A combination of these two mechanisms would be able to provide the performance of direct user-space transfers in many cases, and fall back to intermediate buffering in exceptional situations. Furthermore, the inexpensive data transfer mechanism provided by the UIO architecture may be used to improve local data transfers between kernel and user buffers. Extending the scope of the user-level I/O architecture even further, the low-overhead communication mechanisms appear well suited to support message-passing communication between client systems. The significantly stricter latency requirements of most parallel programs amplify the need for low overhead.
9.2.2 Operating System Implications
Bypassing the operating system for I/O data transfers not only improves throughput, it also bypasses services normally provided by the operating system. In the case of disk I/O, one such service is the buffer cache. The buffer cache is a main memory cache of recently used disk blocks and is managed by the operating system. The cache exploits temporal locality both within an application and between applications, as well as spatial locality through prefetching. The user-level I/O architecture prototype bypasses the buffer cache on the client, but implements a similar cache on the remote disk. The remote disk cache exploits the fact that the autonomous disks operate on objects rather than disk blocks and employs the same prefetching and caching strategies as a client cache. This design allows a fair evaluation of the benefits of the user-level I/O architecture without cache effects biasing the results. However, the performance impact of a remote, distributed cache compared to a local cache needs to be evaluated in a realistic environment with a larger number of applications.
Bypassing the operating system for I/O requests also means that the kernel is unable to consider I/O activity in the process scheduler. Normally, a process is suspended by the kernel during an I/O operation. In the UIO architecture, if the initiating application is unable to overlap the I/O latency with other work, it needs to explicitly cause a context switch. However, since the kernel is unaware of the request completion, it may not reschedule the application immediately, resulting in longer I/O latency and poor throughput. Enabling the kernel to wake up a process waiting for a user-level notification requires that the application inform the kernel of this situation through a system call. The prototype UIO implementation already utilizes a fast kernel trap for initial notification handling. A possible solution is to treat user-level notifications as a special UNIX signal that is part of the normal signal mask but is handled via the UIO notification mechanism. This would allow applications to suspend themselves waiting for a signal, while the kernel wakes up the process as soon as the notification arrives. The implementation details of this scheme as well as possible performance penalties due to increased notification trap handler complexity require further investigation.
9.2.3 Next-generation Architectures
The increasing performance penalty of processor pipeline stalls has led to the development of various kinds of multithreaded microprocessors. These processors provide hardware support for inexpensive thread-level context switches to hide frequent but relatively short latencies. Combining the latency-hiding capability of these processors with the low-overhead user-level I/O architecture leads to new challenges as each processor supports multiple simultaneously active contexts. At the same time, the ability to concurrently execute instructions from multiple instruction streams enables novel ways to hide the remaining I/O overhead.
9.2.4 I/O Programming Paradigms
Traditionally, I/O operations have been considered expensive by programmers, both in terms of latency to complete the request and in terms of software overhead. Overhead in particular has often prevented aggressive application-level I/O request scheduling or prefetching. The low per-request overhead of the user-level I/O architecture warrants a reevaluation of the trade-off between computation and I/O. For instance, requesting data from a remote device where it has been cached can be at least as fast as copying the data from the local buffer cache, and may incur less processor overhead. Issuing nonbinding prefetch hints to an intelligent disk allows the application to communicate knowledge about access patterns to the disk with extremely low overhead. The low cost of initiating data transfers between clients and I/O devices may allow programmers to eliminate some computation in favor of I/O operations. Revisiting the current I/O programming paradigm in the face of highly parallel and distributed I/O architectures combined with low-overhead user-level device access may lead to innovative methods to improve performance of data-intensive applications.
REFERENCES
[1] AIC-7770 Data Book, Adaptec, 1992.
[2] Alpha 21164 Microprocessor: Hardware Reference Manual, Compaq Computer, Houston, Tex., 1995.
[3] J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?," Proc. 16th ACM Symp. Operating Systems Principles (SOSP-16), ACM Press, New York, N.Y., 1997, pp. 357-390.
[4] T.E. Anderson et al., "Serverless Network File Systems," Proc. 15th ACM Symp. Operating Systems Principles (SOSP-15), ACM Press, New York, N.Y., 1995, pp. 109-126.
[5] R.H. Arpaci-Dusseau et al., "The Architectural Costs of Streaming I/O: A Comparison of Workstations, Clusters, and SMPs," Proc. 4th Int'l Symp. High-Performance Computer Architecture (HPCA-4), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 90-101.
[6] G. Banga and P. Druschel, "Measuring the Capacity of a Web Server Under Realistic Loads," World Wide Web Journal Special Issue on World Wide Web Characterization and Performance Evaluation, vol. 2, no. 1, May 1999, pp. 69-83.
[7] D. Banks and M. Prudence, "A High-performance Network Architecture for a PA-RISC Workstation," IEEE Journal on Selected Areas in Communications, vol. 11, no. 2, Feb. 1993, pp. 191-202.
[8] J.S. Barrera III, "A Fast MACH Network IPC Implementation," Proc. Usenix Mach Symp., Usenix Assoc., Berkeley, Calif., 1991, pp. 1-11.
[9] L.A. Barroso, K. Gharachorloo, and E. Bugnion, "Memory System Characterization of Commercial Workloads," Proc. 25th Int'l Symp. Computer Architecture (ISCA-25), ACM Press, New York, N.Y., 1998, pp. 3-14.
[10] D. Bhandarkar and J. Ding, "Performance Characterization of the Pentium Pro Processor," Proc. 3rd Int'l Symp. High-Performance Computer Architecture (HPCA-3), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 288-297.
[11] B.N. Bershad, D.D. Redell, and J.R. Ellis, "Fast Mutual Exclusion for Uniprocessors," Proc. 5th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), ACM Press, New York, N.Y., 1992, pp. 223-233.
[12] B.N. Bershad et al., "Extensibility, Safety and Performance in the SPIN Operating System," Proc. 15th ACM Symp. Operating System Principles (SOSP-15), ACM Press, New York, N.Y., 1995, pp. 267-284.
[13] M.A. Blumrich et al., "Protected, User-level DMA for the SHRIMP Multicomputer," Proc. 2nd Int'l Symp. High Performance Computer Architecture (HPCA-2), IEEE CS Press, Los Alamitos, Calif., 1996, pp. 154-165.
[14] M.A. Blumrich et al., "Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer," Proc. 21st Int'l Symp. Computer Architecture (ISCA-21), ACM Press, New York, N.Y., 1994, pp. 143-153.
[15] N.J. Boden et al., "Myrinet: A Gigabit-per-Second Local Area Network," IEEE Micro, vol. 15, no. 1, Feb. 1995, pp. 29-36.
[16] R. Bordawekar, Quantitative Characterization and Analysis of the I/O Behavior of a Commercial Distributed-shared-memory Machine, tech. report CACR 157, Center for Advanced Computing Research, California Inst. of Technology, Pasadena, Calif., 1998.
[17] U. Brüning and L. Schaelicke, "Atoll: A High-performance Communication Device for Parallel Systems," Proc. 1997 Conf. Advances in Parallel and Distributed Computing, IEEE CS Press, Los Alamitos, Calif., 1997, pp. 228-234.
[18] J.C. Brustoloni and P. Steenkiste, "Effects of Buffering Semantics on I/O Performance," Proc. Usenix 2nd Symp. Operating Systems Design and Implementation (OSDI '96), Usenix Assoc., Berkeley, Calif., 1996, pp. 227-291.
[19] D. Burger and T.M. Austin, The SimpleScalar Tool Set Version 2.0, tech. report #1342, Computer Science Dept., Univ. of Wisconsin-Madison, Madison, Wis., 1997.
[20] R. Card, É. Dumas, and F. Mével, The LINUX Kernel Book, John Wiley & Sons, New York, N.Y., 1998.
[21] Y. Chen et al., "UTLB: A Mechanism for Address Translation on Network Interfaces," Proc. 8th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), ACM Press, New York, N.Y., 1998, pp. 193-204.
[22] B. Clarke, "Net Appliance Keeps Database Current," EE Times, Issue 1051, Mar. 1999.
[23] D. Culler et al., The Generic Active Message Interface Specification, white paper, Computer Science Division, Univ. of California, Berkeley, Calif., 1994.
[24] A. Davis, M. Swanson, and M. Parker, "Efficient Communication Mechanisms for Cluster Based Parallel Computing," Proc. 1st Int'l Workshop Communication and Architectural Support for Network-based Parallel Computing (CANPC '97), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 1-15.
[25] Design Compiler Reference Manual v2000.05, Synopsys, Mountain View, Calif., 2000.
[26] K. Diefendorff and P.K. Dubey, "How Multimedia Workloads Will Change Processor Design," IEEE Computer, vol. 30, no. 9, Sept. 1997, pp. 43-45.
[27] P. Druschel and L.L. Peterson, "Fbufs: A High-bandwidth Cross-domain Transfer Facility," Proc. 14th ACM Symp. Operating Systems Principles (SOSP-14), ACM Press, New York, N.Y., 1993, pp. 189-202.
[28] P. Druschel, L.L. Peterson, and B.S. Davie, "Experiences with a High-Speed Network Adaptor: A Software Perspective," Proc. SIGCOMM '94 Symp., ACM Press, New York, N.Y., 1994, pp. 2-13.
[29] C. Dubnicki et al., "VMMC-2: Efficient Support for Reliable, Connection-Oriented Communication," Proc. Hot Interconnects V, IEEE CS Press, Los Alamitos, Calif., 1997.
[30] S.H. Duncan, C.D. Keefer, and T.A. McLaughlin, "High Performance I/O Design in the AlphaServer 4100 Symmetric Multiprocessing System," DEC Technical Journal, vol. 8, no. 4, June 1996, pp. 61-75.
[31] J.H. Edmondson et al., "Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC Microprocessor," Digital Technical Journal, vol. 7, no. 1, July 1995.
[32] T. von Eicken et al., "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," Proc. 15th ACM Symp. Operating Systems Principles (SOSP-15), ACM Press, New York, N.Y., 1995, pp. 40-53.
[33] T. von Eicken et al., "Active Messages: A Mechanism for Integrated Communication and Computation," Proc. 19th Int'l Symp. Computer Architecture (ISCA-19), ACM Press, New York, N.Y., 1992, pp. 256-266.
[34] R. Enbody, "Perfmon User's Guide," http://www.cse.msu.edu/~enbody/perfmon/index.html
[35] Y. Endo et al., "Using Latency to Evaluate Interactive System Performance," Proc. 2nd Symp. Operating System Design and Implementation (OSDI '96), Usenix Assoc., Berkeley, Calif., 1996, pp. 185-199.
[36] D.R. Engler, M.F. Kaashoek, and J. O'Toole Jr., "Exokernel: An Operating System Architecture for Application-Level Resource Management," Proc. 15th ACM Symp. Operating Systems Principles (SOSP-15), ACM Press, New York, N.Y., 1995, pp. 251-266.
[37] J.R. Eykholt et al., "Beyond Multiprocessing … Multithreading the SunOS Kernel," Proc. Summer '92 Usenix Conf., Usenix Assoc., Berkeley, Calif., 1992, pp. 11-18.
[38] B. Falsafi and D.A. Wood, "Scheduling Communication on an SMP Node Parallel Machine," Proc. 3rd Int'l Symp. High-Performance Computer Architecture (HPCA-3), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 288-297.
[39] M. Fillo and R.B. Gillett, "Architecture and Implementation of Memory Channel 2," DEC Technical Journal, vol. 9, no. 1, July 1997.
[40] A. Gallatin, J. Chase, and K. Yocum, "Trapeze/IP: TCP/IP at Near-Gigabit Speeds," Proc. 1999 Usenix Technical Conference, Usenix Assoc., Berkeley, Calif., 1999, pp. 109-120.
[41] G.R. Ganger, B.L. Worthington, and Y.N. Patt, The DiskSim Simulation Environment Version 1.0 Reference Manual, tech. report CSE-TR-358-98, Dept. Electrical Eng. and Computer Science, Univ. of Michigan, Ann Arbor, Mich., 1998.
[42] G. Gibson et al., "A Cost-Effective, High-Bandwidth Storage Architecture," Proc. 8th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), ACM Press, New York, N.Y., 1998, pp. 92-103.
[43] R.B. Gillett, "Memory Channel Network for PCI," IEEE Micro, vol. 16, no. 1, Feb. 1996, pp. 12-18.
[44] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, Morgan Kaufmann Publishers, San Francisco, Calif., 1995.
[45] J.L. Henning, "SPEC CPU2000: Measuring CPU Performance in the New Millennium," IEEE Computer, vol. 33, no. 7, July 2000, pp. 28-35.
[46] M.P. Herlihy, "A Methodology for Implementing Highly Concurrent Data Objects," Proc. 2nd ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, ACM Press, New York, N.Y., 1990, pp. 197-206.
[47] M.P. Herlihy and J.E.B. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," Proc. 20th Int'l Symp. Computer Architecture (ISCA-20), ACM Press, New York, N.Y., 1993, pp. 289-300.
[48] S.A. Herrod, Using Complete Machine Simulation to Understand Computer System Behavior, doctoral dissertation, Computer Science Dept., Stanford University, Stanford, Calif., 1998.
[49] R. Horst, "TNet: A Reliable System Area Network," IEEE Micro, vol. 15, no. 1, Feb. 1995, pp. 37-45.
[50] N.C. Hutchinson and L.L. Peterson, "The x-Kernel: An Architecture for Implementing Network Protocols," IEEE Trans. Software Engineering, vol. 17, no. 1, Jan. 1991, pp. 64-76.
[51] InfiniBand Architecture Specification Release 1.0, InfiniBand Trade Association, Portland, Ore., 2000.
[52] B.L. Jacob and T.N. Mudge, "A Look at Several Memory Management Units, TLB-Refill Mechanisms, and Page Table Organizations," Proc. 8th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), ACM Press, New York, N.Y., 1998, pp. 295-306.
[53] M.J. Kilgard, D. Blythe, and D. Hohn, "System Support for OpenGL Direct Rendering," Proc. Graphics Interface, Canadian Human-Computer Communications Soc., Toronto, Ont., Canada, 1995, pp. 116-127.
[54] J. Kluge et al., "The ATOLL Approach for a Fast and Reliable System Area Network," Proc. 3rd Int'l Workshop Advanced Parallel Processing Technologies (APPT '99), Publishing House of Electronics Industry, Beijing, China, 1999, pp. 99-105.
[55] A. Kumar, "The HP PA-8000 RISC CPU," IEEE Micro, vol. 17, no. 2, Mar. 1997, pp. 27-32.
[56] J. Laudon and D. Lenoski, System Overview of the SGI Origin 200/2000 Product Line, white paper, SGI, Mountain View, Calif., 1997.
[57] E.K. Lee, Performance Modeling and Analysis of Disk Arrays, doctoral dissertation, Computer Science Division, Univ. of California, Berkeley, Calif., 1993.
[58] B.-H. Lim et al., "Message Proxies for Efficient, Protected Communication on SMP Clusters," Proc. 3rd Int'l Symp. High-Performance Computer Architecture (HPCA-3), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 116-127.
[59] B.N. Lipchak et al., "PowerStorm 4DT: A High-performance Graphics Software Architecture," Digital Technical Journal, vol. 9, no. 4, June 1997, pp. 49-60.
[60] M48T02 Data Sheet, ST Microelectronics, Dallas, Texas, 1998.
[61] P.S. Magnusson et al., "SimICS/sun4m: A Virtual Workstation," Proc. Usenix Conf., Usenix Assoc., Berkeley, Calif., 1998, pp. 119-130.
[62] A.M. Mainwaring and D.E. Culler, "Design Challenges of Virtual Networks: Fast, General-Purpose Communication," Proc. 7th ACM SIGPLAN Symp. Principles and Practices of Parallel Programming, ACM Press, New York, N.Y., 1999, pp. 119-130.
[63] M.J. Marchi and A. Watson, The Network Appliance Enterprise Storage Architecture: System and Data Availability, tech. report TR3065, Network Appliance, Sunnyvale, Calif., 1999.
[64] E.P. Markatos and M.G.H. Katevenis, "User-Level DMA without Operating System Kernel Modifications," Proc. 3rd Symp. High-Performance Computer Architecture (HPCA-3), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 322-331.
[65] R.P. Martin et al., "Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture," Proc. 24th Int'l Symp. Computer Architecture (ISCA-24), ACM Press, New York, N.Y., 1997, pp. 85-97.
[66] M.K. McKusick et al., The Design and Implementation of the 4.4 BSD Operating System, Addison Wesley Longman, Boston, Mass., 1996.
[67] L. McVoy and C. Staelin, "lmbench: Portable Tools for Performance Analysis," Proc. Usenix Ann. Technical Conf., Usenix Assoc., Berkeley, Calif., 1998, pp. 279-294.
[68] MIPS R10000 Microprocessor User's Manual, Version 2.0, MIPS Technologies, Mountain View, Calif., 1996.
[69] Module Compiler Reference Manual v2000.05, Synopsys, Mountain View, Calif., 2000.
[70] D. Mosberger and L.L. Peterson, "Making Paths Explicit in the Scout Operating System," Proc. Usenix 2nd Symp. OS Design and Implementation (OSDI '96), Usenix Assoc., Berkeley, Calif., 1996, pp. 153-168.
[71] D. Mosberger, P. Druschel, and L.L. Peterson, "Implementing Atomic Sequences on Uniprocessors Using Rollforward," Software: Practice and Experience, vol. 26, no. 1, Jan. 1996, pp. 1-24.
[72] M. Moudgill and S. Vassiliadis, "Precise Interrupts," IEEE Micro, vol. 16, no. 1, Feb. 1996, pp. 58-67.
[73] S.S. Mukherjee and M.D. Hill, "The Impact of Data Transfer and Buffering Alternatives on Network Interface Design," Proc. 4th Int'l Symp. High-Performance Computer Architecture (HPCA-4), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 207-218.
[74] J.K. Ousterhout, "Why Aren't Operating Systems Getting Faster as Fast as Hardware?," Proc. Usenix Summer Conference, Usenix Assoc., Berkeley, Calif., 1990, pp. 247-256.
[75] V.S. Pai, P. Druschel, and W. Zwaenepoel, "IO-Lite: A Unified I/O Buffering and Caching System," Proc. Usenix 3rd Symp. Operating Systems Design and Implementation (OSDI '99), Usenix Assoc., Berkeley, Calif., 1999, pp. 15-28.
[76] V.S. Pai, P. Druschel, and W. Zwaenepoel, "Flash: An Efficient and Portable Web Server," Proc. Usenix Ann. Technical Conf., Usenix Assoc., Berkeley, Calif., 1999, pp. 199-212.
[77] V.S. Pai, P. Ranganathan, and S.V. Adve, RSIM Reference Manual, Version 1.0, tech. report 9705, Dept. Electrical and Computer Eng., Rice Univ., Houston, Tex., 1997.
[78] M.A. Pagels, P. Druschel, and L.L. Peterson, Cache and TLB Effectiveness in Processing Network I/O, tech. report 94-08, Dept. Computer Science, Univ. of Arizona, Tucson, Ariz., 1994.
[79] J. Pasquale et al., "High-Performance I/O and Networking Software in Sequoia 2000," Digital Technical Journal, vol. 7, no. 3, Mar. 1995, pp. 84-96.
[80] J. Pasquale, E.W. Anderson, and P.K. Muller, "Container-Shipping: Operating System Support for Intensive I/O Applications," IEEE Computer, vol. 27, no. 3, Mar. 1994, pp. 84-93.
[81] PCI Local Bus Specification, Revision 2.1, PCI Special Interest Group, Portland, Ore., 1995.
[82] G. Pfister, In Search of Clusters, Second Edition, Prentice Hall, Upper Saddle River, N.J., 1998.
[83] PowerPC Microprocessor Family: The Programming Environments for 32-Bit Microprocessors, Motorola, Schaumburg, Ill., 1997.
[84] PowerPC Microprocessor Family: The Bus Interface for 32-Bit Microprocessors, Motorola, Schaumburg, Ill., 1997.
[85] P. Ranganathan et al., "Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors," Proc. 8th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), ACM Press, New York, N.Y., 1998, pp. 307-318.
[86] R.F. Rashid et al., "Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures," IEEE Transactions on Computers, vol. 37, no. 8, Aug. 1988, pp. 896-908.
[87] M. Rosenblum et al., "Using the SimOS Machine Simulator to Study Complex Computer Systems," ACM TOMACS Special Issue on Computer Simulation, 1997, pp. 79-103.
[88] M. Rosenblum et al., "The Impact of Architectural Trends on Operating System Performance," Proc. 15th ACM Symp. Operating System Principles (SOSP-15), ACM Press, New York, N.Y., 1995, pp. 285-298.
[89] L. Rzymianowicz et al., "ATOLL: A Network on a Chip," Proc. 1999 Conf. Parallel and Distributed Processing Techniques and Applications, CSREA Press, Las Vegas, Nev., 1999, pp. 2307-2313.
[90] L. Schaelicke, "L-RSIM: A Simulation Environment for I/O Intensive Workloads," Proc. 3rd Ann. IEEE Workshop Workload Characterization 2000, IEEE CS Press, Los Alamitos, Calif., 2000, pp. 83-89.
[91] L. Schaelicke and A. Davis, "Improving I/O Performance with a Conditional Store Buffer," Proc. 31st Int'l Symp. Microarchitecture (MICRO-31), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 160-169.
[92] L. Schaelicke, A. Davis, and S.A. McKee, "Profiling I/O Interrupts in Modern Architectures," Proc. 8th Int'l Symp. Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2000), IEEE CS Press, Los Alamitos, Calif., 2000, pp. 115-123.
[93] I. Schoinas and M.D. Hill, "Address Translation Mechanisms in Network Interfaces," Proc. 4th Int'l Symp. High-Performance Computer Architecture (HPCA-4), IEEE CS Press, Los Alamitos, Calif., 1998, pp. 219-230.
[94] SGI, "IRIX man pages," http://techpubs.sgi.com/library/
[95] SGI, "IRIXview User's Guide," http://techpubs.sgi.com
[96] K. Skadron and D. Clark, "Design Issues and Tradeoffs for Write Buffers," Proc. 3rd Int'l Symp. High-Performance Computer Architecture (HPCA-3), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 144-155.
[97] K. Swartz, "The Brave Little Toaster Meets Usenet," Proc. 10th Usenix Large Installation System Administration Conference (LISA X), Usenix Assoc., Berkeley, Calif., 1996, pp. 161-170.
[98] TcX AB, Detron HB, and Monty Program KB, "MySQL Reference Manual Version 3.2.1," http://www.mysql.com/documentation/mysql/
[99] M.N. Thadani and Y.A. Khalidi, An Efficient Zero-Copy I/O Framework for UNIX, tech. report SMLI TR95-39, Sun Microsystems Laboratories, Palo Alto, Calif., 1995.
[100] C.A. Thekkath and H.M. Levy, "Hardware and Software Support for Efficient Exception Handling," Proc. 6th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), ACM Press, New York, N.Y., 1994, pp. 110-119.
[101] J. Torrellas, A. Gupta, and J. Hennessy, "Characterizing the Caching and Synchronization Performance of a Multiprocessor Operating System," Proc. 5th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), ACM Press, New York, N.Y., 1992, pp. 162-174.
[102] UltraSPARC User's Manual, Sun Microsystems, Palo Alto, Calif., 1997.
[103] Ultrastar 9ZX Hardware/Functional Specification, 9.11 GB Model, 10020 RPM, Version 1.01, Document Number AS19-0217-01, IBM Storage Systems Division, San Jose, Calif., 1997.
[104] D.L. Weaver and T. Germond, The SPARC Architecture Manual, Version 9, Prentice Hall, Upper Saddle River, N.J., 1994.
[105] E.H. Welbon et al., "POWER2 Performance Monitor," IBM J. Research and Development, vol. 38, no. 5, May 1994, pp. 545-554.
[106] M. Welsh, A. Basu, and T. von Eicken, Incorporating Memory Management into User-Level Network Interfaces, tech. report TR97-1620, Dept. Computer Science, Cornell Univ., Ithaca, N.Y., 1997.
[107] M. Wittle and B.E. Keith, "LADDIS: The Next Generation of NFS File Server Benchmarking," Proc. Usenix Summer 1993 Technical Conf., Usenix Assoc., Berkeley, Calif., 1993, pp. 111-128.