Guillaume Girard Report

Automated testing using a reference instruction set simulator extracted from documentation Guillaume GIRARD September 24, 2001


Automated testing using a reference instruction set simulator extracted from documentation

Guillaume GIRARD

September 24, 2001


Abstract

Implementing a complete and correct instruction set simulator of today's complex computer architectures is a difficult task. Automated testing is a way to improve the test coverage of such a simulator. Usually these test-cases are checked against a reference machine. This paper presents the development of an automated test framework: a reference simulator is implemented, partially extracted from the documentation, and an execution comparison system is set up to validate the functioning of the real simulator. A test generator is written and used to check the instruction set, with useful results to improve the simulator correctness.


Contents

1 Introduction
    1.1 Virtutech and Simics
    1.2 Automated testing
    1.3 Aim of this thesis
    1.4 Organization of the report

2 Related work
    2.1 Hardware description languages
    2.2 Usual processor descriptions

3 IA-64 Description
    3.1 Instruction encoding
        3.1.1 Instruction flow
        3.1.2 Encoding
    3.2 Register files
    3.3 Integer computations
    3.4 Predication
    3.5 Speculation
        3.5.1 Control speculation
        3.5.2 Data speculation
    3.6 Register stack
    3.7 Register rotation
    3.8 Floating-point architecture
    3.9 Multimedia support
    3.10 Memory organization
        3.10.1 Virtual addressing and memory protection
        3.10.2 TLB
        3.10.3 VHPT
    3.11 Interruption handling
    3.12 Debugging and performance monitoring
        3.12.1 Debugging
        3.12.2 Performance monitoring

4 Extraction of information from the documentation
    4.1 PDF and its shortcomings
    4.2 Intel description format
    4.3 Pseudo-code
    4.4 Results

5 The reference simulator
    5.1 Register files
    5.2 Instruction decoding
    5.3 Exception handling
    5.4 Memory simulation
    5.5 Register Stack Engine
    5.6 Translation Look-aside Buffers
    5.7 Floating-point computation
    5.8 Pseudo-code functions
    5.9 Implementation dependent features
    5.10 State saving and loading

6 Simics module
    6.1 State files
    6.2 A test generator
    6.3 Results

7 Future work

8 Conclusion

A State file format


Chapter 1

Introduction

The instruction set of a modern processor can reach up to several hundred instructions that perform complex operations on the registers and the memory systems. Implementing this set of instructions is a difficult task: correctness is achieved when every single bit takes its right value after the execution of the operations, whatever the input state provided. On today's computers, 'input state' means megabytes of memory in a multi-level hierarchy and huge register banks with strong interactions. However, though the input state is theoretically enormous, the changes are usually observable after each instruction has been run, and the range of their effects is limited. It is thus conceivable to validate an implementation by extensive verification of the results.

The implementation of an instruction set simulator encounters the same problems: the passage from a simulator able to boot a modern operating system to a bug-free simulator whose correctness could be validated is a time-consuming task. Most of the errors happen when running unusual or strange series of instructions, or with special arguments or values. It is far from easy to understand why a simple command like ls crashed on a simulator when an operating system like Solaris just booted and correctly executed one billion instructions. Extensive testing is a way to improve the correctness of such a simulator and to avoid the time spent in debugging elusive bugs later in the development process.

1.1 Virtutech and Simics

Virtutech is a company producing simulation tools for the development and design of high-performance computer systems.

Its main product, Simics, is a simulator acting as one or more virtual workstations or servers. Simics can today model a variety of processor types and computer systems, both workstations and multiprocessor servers, including SPARC-, Alpha-, and x86-based systems. Simics fully virtualizes the target, allowing multiple processors or multi-node systems to be modeled, with arbitrary memory and device configuration, regardless of the host system.

Simics is sufficiently fast to run realistically scaled commercial workloads, and sufficiently accurate to boot and run unmodified operating systems, including Solaris, Linux, Tru64, and Windows NT. To accomplish this, Simics includes binary compatible models for several devices, including SCSI, graphics cards, console, interrupt, timers, EEPROM and Ethernet. Network simulation can connect to Simics Central, allowing a virtual server to appear on the local network with full services available (NFS, NIS, rsh, etc.). Simics Central also allows the simulation of complete networks by connecting multiple Simics processes together, all within a single virtual clock domain. Such large models can be run parallelized, either on a multiprocessor host or on a network of workstations.

Several types of processors are already simulated in Simics, and the aim is to cover all the important high-end architectures available on the computer market today. When a new architecture is introduced, the generic tools used at Virtutech make the development of a functional simulator a rather quick task, usually a matter of a few months. The Simics framework, along with some external programs, provides for example the automatic generation and optimization of a decoder and a disassembler, and efficient memory transaction handling. However, Simics is confronted with the problem described previously: going from a functional simulator to a completely correct implementation of the instruction set.

1.2 Automated testing

Automatic testing of the simulator is a way to find and quickly correct corner-case bugs and to validate its implementation, as a generated set of tests will be more thorough and accurate in its testing than a human being can be. To be implemented efficiently, this solution implies the use of a reference machine to be able to detect inconsistencies and errors in the simulator state after the execution of one or several instructions. It also implies a program with enough knowledge about the architecture to generate relevant test-cases.

The reference machine can be provided in different ways:

• by using a real system. This method obviously provides the only complete and absolute reference model. It has a major drawback: it is difficult to use a real system as a reference since we need the possibility to take non-intrusive snapshots of the real machine, which is likely to be very difficult at system level. It is in fact the very same reason for which simulation is an interesting tool in system development.

• by using a complete simulator, at the gate level. Such models are only available to the system's designers, and are well-kept secrets. Because of their precision, they're also very slow and not well-suited for high-level tests on the instruction set.

• by using another instruction set simulator. This method has the advantage of functioning even if the real system is still under development, or if the system does not even exist. It does however imply that the reference simulator is as complete and accurate as possible, and it also makes the testing phase more complex, since discrepancies can be introduced by bugs in either of the two simulators.

The test-case generator can at first be reduced to a very simple program generating random patterns as instructions. This would be a good stress test on the decoding accuracy of the simulator. To produce more complex and realistic test cases, the program needs a description of the architecture (at least the register files and the instruction set) to generate legal series of instructions. More complex information could be provided to generate tests for specific systems like the Translation Look-aside Buffers (TLB).


The results of these tests need to be analysed and transmitted to the programmers for debugging when an error is discovered. This task can be performed by the reference machine with the help of information about implementation dependent features, so that irrelevant differences can be filtered out.

1.3 Aim of this thesis

This master's thesis is a practical application of the testing method described above for the development of the IA-64 architecture in Simics. It was performed in parallel with the implementation of the Simics simulator.

As a reference machine, the choice was made to develop a reference simulator since the first IA-64 implementation, the Itanium processor, was not publicly available yet. Intel provides a pseudo-C description of the entire instruction set that can be extracted to implement the semantics of the instructions, and thus improve the correctness of the simulator.

The project was divided into several steps:

• Automatically extracting as much information as possible from the Intel documentation so that the generation of the reference simulator would be as independent of human coding as possible. This includes extracting the format of the instructions as well as the pseudo-code describing their implementation.

• Building a framework around the extracted information to get a working reference simulator.

• Coding a Simics module to perform parallel tests on Simics and on the reference simulator and analyse the differences found to signal errors.

• Developing a test-case generator by using the information extracted from the documentation.

As this represents quite a lot of work, the thesis itself was centered around the first three points: getting a documentation-generated simulator for the IA-64 and being able to compare it to the Simics implementation.

1.4 Organization of the report

Chapter 2 will describe related work in automatic generation of simulators and testing of existing implementations. Chapter 3 will provide a short description of the IA-64 architecture. Chapters 4 to 7 will present how the different steps were carried out and with what degree of success. They will also emphasize the problems encountered as well as possible improvements.


Chapter 2

Related work

How to describe a processor at the instruction set level is a question that has been discussed for many years (I found papers dating from the late sixties, like [DUE68]) and it has generated a wide range of solutions. Languages used to provide a description can be sorted in different categories, according to their formalism:

• English, a common choice as a non-formal description language.

• Non-executed, readable pseudo-code languages with vague semantics. They usually contain typos and bugs that make them non-executable.

• Pseudo-code generated from executed code, transformed to be readable. They can be close to a real programming language (Algol, C, etc.).

• Executed code in a real programming language known outside the company that designed the chip.

• Hardware description languages with completely defined semantics. They usually provide much more information on the processor structures. They also need a complete description of all the internal mechanisms where the other types of languages would refer to English descriptions.

We'll make a quick survey of some of the existing languages and their application to simulation and testing.

2.1 Hardware description languages

The research world has produced quite a number of hardware description languages focused on different levels of abstraction (gates, components, register transfers, instruction set, etc.).

Some languages present the processor at the instruction set level, including the semantics of the instructions.

ISPS was designed around 1980 [SIE82, part 1, chapter 4]. The description is done in the following way: the memory declarations describe the structure of the processor (registers and memory), which will be considered as the state; the formats and operations describe the instruction format (bit fields) and the service functions (address calculation, etc.); the interpreter decodes the instructions using the instruction format and performs the changes on the state of the processor as the instructions are run.

The operations are described in a Pascal-like language. The instruction decoding is performed explicitly with the help of a DECODE statement applied to specific bit fields (it is similar to a switch/case in C). Although the language was primarily designed for documentation, a description can be easily compiled to produce a simulator of the processor described.
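To give an idea of what explicit decoding on bit fields looks like, here is a small sketch in Python. It is not ISPS and does not reproduce its syntax: the 16-bit instruction word, its four 4-bit fields and the three opcodes belong to a made-up instruction set used only for illustration.

```python
def decode(word):
    """Split a 16-bit instruction word of a made-up ISA into bit fields."""
    return {
        "opcode": (word >> 12) & 0xF,
        "rd": (word >> 8) & 0xF,
        "ra": (word >> 4) & 0xF,
        "rb": word & 0xF,
    }


def execute(regs, word):
    """Decode one instruction explicitly and apply it to the register state."""
    f = decode(word)
    a, b = regs[f["ra"]], regs[f["rb"]]
    # Explicit dispatch on the opcode field, in the spirit of a switch/case.
    if f["opcode"] == 0x1:      # add
        regs[f["rd"]] = (a + b) & 0xFFFF
    elif f["opcode"] == 0x2:    # sub
        regs[f["rd"]] = (a - b) & 0xFFFF
    elif f["opcode"] == 0x3:    # and
        regs[f["rd"]] = a & b
    else:
        raise ValueError(f"illegal opcode {f['opcode']:#x}")
    return regs
```

The state (the register list), the format (the bit fields) and the interpreter (the dispatch) are kept separate, which mirrors the three parts of the description mentioned above.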

LISAS is an instruction set description language developed in the beginning of the nineties [COO93, COO94]. It is a functional language describing a state and operations that transform the state. In a similar way to ISPS, the state is described first (registers and memory), then the format (bit fields) of an instruction is defined. The instructions are provided by specifying the values of the encoding bit fields and setting any parameters, then the operations that will change the processor state. Decoding is done implicitly.

Some languages also include micro-architecture information, like latency, execution units or pipeline simulation.

nML was presented in [FAU95]. It describes the structure of the processor and the behavior of its instructions. Each of them is described by its binary encoding, its assembly syntax and the actions it performs. nML introduces a simple timing model based on latency specification.

nML was extended into Sim-nML [RAJ98], by introducing the concept of resource usage. This allows a more accurate simulation of pipelines and latencies. Using Sim-nML, several tools have been developed to generate simulators, assemblers, disassemblers, profiling tools, a compiler back-end, etc. [1].

LISA is a language designed from the start to produce bit and cycle/phase accurate processor models [ZIV96]. It adds to the other languages a strong pipeline model based on resource constraints. Instructions are defined by their binary encoding, their scheduling in the pipeline model and the operations that are performed on the processor structure. It has been successfully used to describe cycle-accurate models for DSP [PEE99].

Some languages are focused on the encoding/decoding task.

SLED (Specification Language for Encoding and Decoding) was presented in 1997 as part of the New Jersey Machine-Code Toolkit [RAM97a]. It is designed to describe the encoding of the instructions, and thus contains no state information. Once the bit fields are defined, patterns describe the binary representation of the instructions and constructors connect symbolic, assembly-language and binary representations of instructions. SLED does not define semantics for the instructions, and can be used in an application program either to encode or decode instructions.

SLED descriptions were used to automatically generate complete tests of instruction sets [RAM97b]. The goal was to compare SLED descriptions to an independent reference assembler to check their correctness. The tests were designed from the descriptions themselves, which provided the way instructions were built (encoding bits and arguments).

Simgen, like SLED, is oriented toward the encoding and the decoding of binary instructions [LAR97]. Bit fields are defined and instructions are described with their bit pattern (to identify the binary encoded instruction) and their assembly syntax. Simgen was written to optimize the decoding phase and produce a fast C decoder. It provides a semantic field for each instruction, to be filled with C code which is not checked but directly output when generating the C decoder. The decoder then becomes a simulator.

Another type of language used to describe hardware components is the full description language, which can model everything from the gates to the instruction set.

VHDL [SHA86] and even more Verilog [2] are the standards used for this kind of modeling. However, using them to describe processors tends to be so complex that it is impossible to work at the instruction set level and describe the operations involved in a simple manner. MIMOLA [BAS94] is an attempt to provide a tool in-between, where a processor is still simple to describe, but where one can control up to the gate level if it's needed.

Plenty of other languages have not been cited here and fit in the different categories or have specifics of their own. Among others, RADL, EXPRESSION, ISDL, HMDES, Maril, VLW, etc.

2.2 Usual processor descriptions

Interestingly, the official descriptions of the most important processor architectures use no formally defined languages like those presented above. I have found no publications related to the development and testing methods these companies use, or in what languages they describe the different levels of the chips they design. They certainly have internal simulators to perform validation and they sometimes seem to generate from them the pseudo-code included in their documentation.

As far as the encoding is concerned, all of them present tables to specify opcode bit fields and values. For the instructions' operations, Sparc simply provides an English description; IA-32 instructions are described in a non-executed pseudo-code, whereas Alpha and PPC present pseudo-code that has been executed; the IA-64 description uses an extended C language. It can probably be argued that the description presented in the manuals is targeted toward human readability rather than formalism, and thus pseudo-codes are chosen rather than new, formally correct languages. In the same way, tables are more readable for encoding than bit field definitions and matching. In all the architectures, the processor structure itself is always described in English: the floating point unit and the memory management unit, to give a few examples, are never provided in a formal language.

Comparing real hardware description languages to the documentation of the IA-64 specification highlights the lack of formal information provided by the latter, especially concerning the processor structure and internal mechanisms. They will have to be programmed in the reference simulator based on their English description, which is less than ideal for correctness purposes. In the same way, generating tests from this documentation may prove difficult without providing the test generator with additional structure information. A more in-depth study of the IA-64 code will also show that the level of abstraction chosen for the pseudo-code is variable, in that internal mechanisms are sometimes made explicit although they are hidden the rest of the time without strong reasons to do so.


Chapter 3

IA-64 Description

The IA-64 is an entirely new architecture developed during the past decade by Intel and Hewlett-Packard. It is a complex design with several advanced features like full predication, speculation and explicit instruction parallelism. The first processor implementing the IA-64 specifications is the Intel Itanium, which has been available since May 2001.

3.1 Instruction encoding

3.1.1 Instruction flow

In the IA-64 architecture, the instruction flow follows stricter rules than those usually imposed in other similar processors. The flow is divided in groups of successive instructions. All the instructions inside a given group are independent from each other, so that their order of execution does not matter: they can be executed all in parallel or reorganized to match the best use of the processor's resources. The allocation of the groups is done at compile time, and the stops that limit the end of a group and the beginning of a new one are encoded directly into the instruction flow¹. When a stop is found, all the pending results from the current group are committed to the registers and the execution can continue to the next one.

The dependencies that may not occur between instructions in the same group are rather simple in the general case but also include some subtle and complex rules:

• Read-after-write (RAW) and write-after-write (WAW) register dependencies are not allowed within an instruction group. Write-after-read (WAR) is allowed except in some special cases. This includes explicit register access (registers encoded into the instruction) and implicit register access (control registers accessed by some instructions as a side-effect).

• RAW, WAW and WAR dependencies are allowed for memory accesses. A load from a previously written address within the instruction group will return the latest written value.

• Some specific instructions have different requirements in position (first or last in a group) and in dependencies. [IA64-v1] in section 3.4 provides more detailed information.

¹ Some instructions (some branches and a return from interruption for example) can dynamically end a group even if no stop is present.
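The general register rule above (no RAW or WAW inside a group, WAR allowed) can be checked mechanically. The following Python sketch is an illustration only: the representation of an instruction as a (name, reads, writes) triple is an assumption made for this example, and the special cases and memory rules mentioned above are deliberately ignored.

```python
def group_violations(group):
    """Report RAW and WAW register dependencies inside one instruction group.

    `group` is a list of (name, reads, writes) triples, where reads and writes
    are sets of register names (a hypothetical representation). Each violation
    is returned as (kind, register, index of first writer, index of offender).
    """
    last_writer = {}
    violations = []
    for index, (name, reads, writes) in enumerate(group):
        for reg in sorted(reads):
            if reg in last_writer:
                violations.append(("RAW", reg, last_writer[reg], index))
        for reg in sorted(writes):
            if reg in last_writer:
                violations.append(("WAW", reg, last_writer[reg], index))
        # Record the writes only after checking, so an instruction that reads
        # and writes the same register does not conflict with itself.
        for reg in writes:
            last_writer[reg] = index
    return violations
```

A write that follows a read of the same register (WAR) produces no entry, since only earlier writers are tracked.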

3.1.2 Encoding

Instructions are packed three at a time into a bundle of 128 bits. Each bundle is divided into a template of 5 bits, and three instruction slots of 41 bits (see figure 3.1).

    Bits:    127-87               86-46                45-5                 4-0
    Field:   instruction slot 2   instruction slot 1   instruction slot 0   template
    Width:   41                   41                   41                   5

Figure 3.1: IA-64 Instruction Encoding Format (from [IA64-v1], Figure 3–15)
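The layout of figure 3.1 translates directly into shifts and masks. The sketch below, in Python, splits a 128-bit bundle (given as an integer) into its template and slots; the function names are chosen for this example and do not come from the Intel documentation or from Simics.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def split_bundle(bundle):
    """Return (template, (slot0, slot1, slot2)) for a 128-bit bundle."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("a bundle must fit in 128 bits")
    template = bundle & TEMPLATE_MASK
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK for i in range(3)
    )
    return template, slots


def make_bundle(template, slots):
    """Inverse of split_bundle: pack a template and three slots."""
    bundle = template & TEMPLATE_MASK
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle
```

Packing and splitting are exact inverses for in-range values, which makes the pair convenient for generating random bundles in a decoder stress test.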

The template provides two pieces of information:

• The slot mapping of the instructions, that is on which type of unit they will be executed.

• The instruction group stops.

Table 3.1 provides a part of the possible template encoding as examples. A complete reference can be found in [IA64-v1] in section 3.3. A double vertical line indicates a group stop between two instructions. For example, the template 03 has a stop after instruction slot 1 and instruction slot 2, while the template 00 does not stop the current instruction group.

Table 3.2 shows the different types of instructions and on which execution unit they are scheduled. The L+X type is a special type for some instructions that are encoded in 82 bits instead of 41. These are scheduled on an I-unit (for example 64-bit immediate loading into a register) except for the long branch which is executed on a B-unit.

It's important to mention that not all combinations of unit types are available as templates, and only the most common have been provided (24 in the current architecture). The compiler will add no-operations to complete the bundles if needed. Note as well that bundles and groups are completely independent concepts. The Itanium processor, for example, can load two bundles simultaneously and execute them in parallel if no stop occurs and enough units are available. The use of bundles however implies that jumps can only target the beginning of a bundle since there's no way to indicate which slot should be selected.

    Template   Slot 0    Slot 1       Slot 2
    00         M-unit    I-unit       I-unit
    01         M-unit    I-unit       I-unit ||
    02         M-unit    I-unit ||    I-unit
    03         M-unit    I-unit ||    I-unit ||
    ...
    1C         M-unit    F-unit       B-unit
    1D         M-unit    F-unit       B-unit ||
    ...

Table 3.1: Template Field Encoding and Instruction Slot Mapping (from [IA64-v1], Table 3-10)

    Instruction type   Description        Execution Unit Type
    A                  Integer ALU        I-unit or M-unit
    I                  Non-ALU integer    I-unit
    M                  Memory             M-unit
    F                  Floating-point     F-unit
    B                  Branch             B-unit
    L+X                Extended           I-unit/B-unit

Table 3.2: Relationship between Instruction Type and Execution Unit Type (from [IA64-v1], Table 3-9)
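The independence of bundles and groups can be illustrated by cutting a stream of bundles into instruction groups from the templates alone. The Python sketch below covers only the six templates listed in Table 3.1. The stops for templates 00 and 03 follow the text above; the stops assumed for 01, 02 and 1D are my reading of the Intel table and should be checked against [IA64-v1] before being relied upon. Dynamic group ends (taken branches, for instance) are ignored.

```python
# template -> (unit type of each slot, slots followed by a stop)
TEMPLATES = {
    0x00: (("M", "I", "I"), ()),
    0x01: (("M", "I", "I"), (2,)),      # assumed: stop at the end of the bundle
    0x02: (("M", "I", "I"), (1,)),      # assumed: stop after slot 1
    0x03: (("M", "I", "I"), (1, 2)),
    0x1C: (("M", "F", "B"), ()),
    0x1D: (("M", "F", "B"), (2,)),      # assumed: stop at the end of the bundle
}


def instruction_groups(templates):
    """Cut a sequence of bundle templates into instruction groups.

    Each group is a list of (bundle index, slot index, unit type) entries.
    A trailing group without a final stop is returned as well.
    """
    groups, current = [], []
    for bundle_index, template in enumerate(templates):
        units, stops = TEMPLATES[template]
        for slot in range(3):
            current.append((bundle_index, slot, units[slot]))
            if slot in stops:
                groups.append(current)
                current = []
    if current:
        groups.append(current)
    return groups
```

With the templates 00 and 03 in sequence, the first group spans five slots across both bundles and the second group holds the last slot alone, which shows a group crossing a bundle boundary.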

3.2 Register files

The IA-64 architecture defines the following registers (see also figure 3.2):

General registers (GR): a bank of 128 64-bit registers provided for integer and multimedia computations and available at all privilege levels. Each of these registers has an additional bit called the Not a Thing bit (NaT) for speculation purposes (see section 3.5). GR0 is always 0 and is read-only. GR0 to GR31 are called the static general registers. GR32 to GR127 are called the stacked general registers, and are available to a program by allocating a stack frame of local and output registers, in a similar way to how a procedure allocates stack space in memory for its local variables. The mechanism of the register stack engine is described later (section 3.6). A portion of the stacked general registers can be programmed to rotate for some specific optimized loops (see section 3.7). A special bank of 16 registers can replace GR16 to GR31 to quickly provide scratch registers for code like interrupt handlers, and thus avoid saving and restoring manipulations. This bank switch can only be performed by system-level code.

Floating-point registers (FR): a bank of 128 82-bit floating-point registers. The floating-point formats and operations are described in section 3.8. FR0 and FR1 are read-only and contain respectively +0.0 and +1.0. All the other registers are freely available for applications. As for the stacked general registers, FR32 to FR127 can be programmed to rotate during some specific loops.

Predicate registers (PR): a bank of 64 1-bit registers that are used in predication (see section 3.4). PR0 is always 1 (true) and ignores all writes. PR16 to PR63 can be programmed to rotate in some specific loops.

Branch registers (BR): a bank of 8 64-bit registers used to specify addresses for indirect branches. Among other things, they are used to store the return address when calling a procedure.

Instruction Pointer (IP): a 64-bit register containing the address (virtual or physical, according to the processor mode) of the current bundle being executed. IP is always 16-byte aligned (128 bits).


Figure 3.2: System register model (from [IA64-v2] figure 3–1)


Current Frame Marker (CFM): this register holds the state of the current general register frame, that is it keeps track of the last allocation that was done in the stacked general registers. It also keeps track of the rotating state of the general, floating-point and predicate registers.

Application Registers (AR): a set of 128 64-bit registers, although most of them are reserved. They contain control information available at application level, like the register stack engine configuration, the floating-point status register and the loop counters.

Advanced Load Address Table (ALAT): an internal table used for data speculation (see section 3.5).

Processor identification registers (CPUID): a set of several registers describing the processor manufacturer, the version of the architecture and the processor specific version number. It also provides information about the optional features implemented.

Performance monitor data/configuration registers (PMD/PMC): these two sets of registers are used to accumulate performance information during the processor execution. The configuration registers are only available to privileged programs and can be set to count various types of events. This information is collected into the data registers, which can possibly be accessed by application-level programs. See section 3.12 for more information.

Region Registers (RR): a set of registers used in virtual memory addressing (see section 3.10).

Protection Key registers (PKR): these registers contain keys used to protect memory areas in virtual memory addressing (see section 3.10).

Translation Look-aside Buffers (TLB): internal tables described in section 3.10. They contain the page translation lists for virtual memory management.

Processor Status Register (PSR): this register contains the main options and status bits controlling the functions of the processor, like endianness, virtual memory options, etc. The first 6 bits of the PSR form the User Mask and are available to application-level programs.

Breakpoint registers (IBR/DBR): these registers provide hardware breakpoints for instructions and memory accesses.

Control Registers (CR): a bank of 128 64-bit registers that controls system-level configuration and information for exceptions and external interrupts. Most of them are reserved. They are only accessible to system-level programs.
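A few of the invariants listed above (GR0 fixed at 0, one NaT bit per general register, PR0 fixed at 1 and ignoring writes) can be captured in a very small register file model. This Python sketch is an illustration under simplifying assumptions, not the reference simulator of this thesis: stacking, rotation and banking are left out, and raising `ValueError` on a write to GR0 is a stand-in for whatever fault the architecture defines.

```python
MASK64 = (1 << 64) - 1


class RegisterFile:
    """Minimal model of the general and predicate registers."""

    def __init__(self):
        self.gr = [0] * 128          # 128 64-bit general registers
        self.nat = [False] * 128     # one NaT bit per general register
        self.pr = [True] + [False] * 63  # PR0 is always 1

    def read_gr(self, index):
        """Return (value, NaT bit) for a general register."""
        return self.gr[index], self.nat[index]

    def write_gr(self, index, value, nat=False):
        if index == 0:
            # Stand-in for the architectural fault on writes to GR0.
            raise ValueError("GR0 is read-only")
        self.gr[index] = value & MASK64
        self.nat[index] = nat

    def write_pr(self, index, value):
        # Writes to PR0 are silently ignored.
        if index != 0:
            self.pr[index] = bool(value)
```

Keeping the NaT bits in a separate list makes it easy to compare them independently of the register values when two simulator states are diffed.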

3.3 Integer computations

The integer execution units provide several types of instructions:

• Arithmetic instructions (addition, subtraction, shift left and add).

• Logical instructions (and, or, and complement, xor).

• 32-bit integer operations (add pointer, shift left and add pointer, sign extend, zero extend).

• Bit field instructions (signed/unsigned shift right, shift left, extraction, deposit, paired shift right).

• Immediate value loading (short move (22 bits) and long move (64 bits) to register).
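Three of the operations named in this list (shift left and add, unsigned extraction, deposit) are easy to express on 64-bit values. The Python functions below are a simplified sketch: the names echo the assembly mnemonics, but the argument order and range checks are choices made for this example and are not taken from the Intel pseudo-code.

```python
MASK64 = (1 << 64) - 1


def shladd(value, count, addend):
    """Shift left and add: (value << count) + addend, modulo 2**64."""
    return ((value << count) + addend) & MASK64


def extr_u(value, pos, length):
    """Unsigned extraction of `length` bits starting at bit `pos`."""
    return (value >> pos) & ((1 << length) - 1)


def dep(field, target, pos, length):
    """Deposit the low `length` bits of `field` into `target` at bit `pos`."""
    mask = ((1 << length) - 1) << pos
    return ((target & ~mask) | ((field << pos) & mask)) & MASK64
```

For instance, `dep(0x3, 0xFFFF, 4, 4)` replaces bits 4 to 7 of 0xFFFF with 0x3 and leaves the other bits untouched.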

3.4 Predication

The IA-64 instruction set is fully predicated: every instruction is encoded with a 6-bit index value referencing a predicate register. If this register is equal to 1 (true), the instruction is executed; if not, except in rare cases, it acts as a no-operation. PR0 is always 1, so instructions predicated by PR0 are always executed.

Several types of instructions set predicate registers to 0 or 1 according to their result. They are comparisons, floating-point comparisons, bit testing, NaT bit testing and some specific floating-point instructions. The IA-64 specification defines several ways of setting the predicate registers according to the result to obtain. [IA64-v1] in section 4.3 provides more information on these cases.

As an example (from [IA64-v1], section 8.5), consider the following code:

    if (a) {
        b = c + d;
    }
    if (e) {
        h = i - j;
    }

This code can be compiled into IA-64 assembly without any branches:

         cmp.ne p1,p2=a,r0     // p1 takes the value of (a != 0)
                               // p2 takes the opposite value
         cmp.ne p3,p4=e,r0;;   // p3 takes the value of (e != 0)
                               // p4 takes the opposite value
                               // ;; ends the group
    (p1) add b=c,d             // if (p1) then add
    (p3) sub h=i,j             // if (p3) then sub
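The effect of the qualifying predicate can be mimicked with a tiny interpreter. The Python sketch below runs the four instructions of the example one after the other; the tuple encoding of the instructions and the use of variable names as register names are conventions invented for this illustration, and instruction groups are not modeled.

```python
def run(program, regs, preds):
    """Execute (qualifying predicate, operation, operands) tuples in order."""
    for qp, op, args in program:
        if not preds[qp]:
            continue  # a false qualifying predicate makes the instruction a no-op
        if op == "cmp.ne":
            p_true, p_false, src1, src2 = args
            result = regs[src1] != regs[src2]
            preds[p_true], preds[p_false] = result, not result
        elif op == "add":
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]
        elif op == "sub":
            dst, src1, src2 = args
            regs[dst] = regs[src1] - regs[src2]
        else:
            raise ValueError(f"unknown operation {op}")
    return regs, preds


# The example above; unpredicated instructions use p0, which is always true.
PROGRAM = [
    ("p0", "cmp.ne", ("p1", "p2", "a", "r0")),
    ("p0", "cmp.ne", ("p3", "p4", "e", "r0")),
    ("p1", "add", ("b", "c", "d")),
    ("p3", "sub", ("h", "i", "j")),
]
```

Running `PROGRAM` with `a` non-zero and `e` zero updates `b` and leaves `h` untouched, without any branch being involved.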

3.5 Speculation

Speculation allows the compiler to load data in advance and thus reduce the memory latencies usually introduced by these operations. Two types of speculation are provided.

3.5.1 Control speculation

In control speculation, the processor performs the memory load in advance but defers all exceptions encountered (that would prevent it from loading the value and call the system to know what to do) and instead sets the NaT bit of the target register to 1 to indicate that its value is not valid. This NaT bit propagates in all computations to ensure no wrong result is computed from an invalid load. For the floating-point registers, a special NaTVal encoding is used instead of a separate NaT bit.

When the value itself or a subsequent result is to be used, it can be checked by a chk instruction that will branch to a recovery code if the value is invalid, or simply continue if it is valid. This system allows loads and their dependent uses to be safely moved above branches, as in the following example (from [IA64-v1] section 8.4.3):

    (p1) br.cond.dptk L1       // Cycle 0
         ld8 r3 = [r5] ;;      // Cycle 1
         shr r7 = r3,r87       // Cycle 3

If we suppose the load has a latency of 2 cycles, the shift right will stall waiting for the result. Using control speculation, the code can be rewritten as:

         ld8.s r3 = [r5]       // Earlier cycle - speculative load
         // other instructions
    (p1) br.cond.dptk L1;;     // Cycle 0
         chk.s r3, recovery    // Cycle 1 - checking the load
         shr r7 = r3,r87       // Cycle 1

If an exception is raised during the load, it is deferred and the NaT bit of r3 is set to 1. If p1 is true, the conditional branch is taken and the loaded value is never used, so everything runs as if the load had never been performed. If p1 is false, a check occurs on r3. If the NaT bit is set, the processor jumps to the recovery code to get the correct value (and lets the system handle the exception). If not, the load has already been performed, so the processor does not stall waiting for the memory.

This code assumes that r5 is ready at an earlier point and that there are some instructions to insert between the speculative load and the conditional branch.
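The deferral mechanism can be illustrated with a small Python model. This is only a sketch under simplifying assumptions: the `Register` class, the dictionary used as memory and the function names `ld8_s`, `shr` and `chk_s` are hypothetical stand-ins for the instructions discussed above, not part of any simulator described in this report.

```python
class DeferredFault(Exception):
    """Raised by the toy memory model when a load cannot complete."""


class Register:
    """A general register: a value plus its NaT (Not a Thing) bit."""

    def __init__(self):
        self.value = 0
        self.nat = False


def load(memory, address):
    # Hypothetical memory model: unmapped addresses fault.
    if address not in memory:
        raise DeferredFault(address)
    return memory[address]


def ld8_s(reg, memory, address):
    """Speculative load: a fault is deferred by setting the NaT bit."""
    try:
        reg.value = load(memory, address)
        reg.nat = False
    except DeferredFault:
        reg.value = 0
        reg.nat = True


def shr(dst, src, count):
    """Shift right; the NaT bit of the source propagates to the result."""
    dst.value = src.value >> count
    dst.nat = src.nat


def chk_s(reg):
    """Return True when the recovery code has to be executed."""
    return reg.nat
```

With a memory that maps only address 0x100, a speculative load from 0x100 leaves the NaT bit clear, while a load from any other address sets it and makes `chk_s` request recovery.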

3.5.2 Data speculation

Data speculation allows the compiler to schedule loads in advance despite the fact that some conflicting memory writes may occur before the value is used. The processor keeps a table of the advanced loads (the ALAT, or Advanced Load Address Table), which is updated every time a store is performed. If the store invalidates a previous advanced load (by writing to an overlapping memory area), the load entry disappears, and when the load value is checked, the processor reissues the load as if nothing had been done. If the load entry is still present, the value is still valid and can be used directly.

This is illustrated by the following code (from [IA64-v1] section 8.4.4). The compiler generates a write to memory, and it does not know whether the memory area the processor will be writing to overlaps the memory it will read at the next load:

st8 [r55] = r45    // Cycle 0 - write
ld8 r3 = [r5];;    // Cycle 0 - read
shr r7 = r3,r87    // Cycle 2 - operation

However, it can use an advanced load, thus producing the following code:

ld8.a r3 = [r5];;  // Earlier cycle - advanced load
// Other instructions

st8 [r55] = r45    // Cycle 0
ld8.c r3 = [r5]    // Cycle 0 - checking the load
shr r7 = r3,r87    // Cycle 0

If the store has written to a memory area that is different from the load memory area, the entry for the load in the ALAT will still be present and the check will simply confirm that the value is valid. If the store has overwritten a part of the memory area read by the advanced load, the load entry will have disappeared and the processor will re-issue the load, thus stalling for some cycles. Such an optimization is of course to be used when the compiler can compute that there is a high probability that the load and the store won't overlap.

It is possible to combine advanced and speculative loads. It is also possible to check an advanced load with the chk instruction, to branch to recovery code instead of simply reloading the value. Note that, contrary to control speculation, where the NaT state is propagated, the ALAT keeps an entry that depends on the register used as target for the load, so the check can only be performed on the original target register.
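As a rough illustration of the ALAT behavior described above, the following Python sketch tracks advanced loads by target register and drops an entry when a store overlaps it. The class and method names are hypothetical, and a real ALAT has a fixed size and implementation-specific matching rules that are ignored here.

```python
class ALAT:
    """Toy Advanced Load Address Table, indexed by target register name."""

    def __init__(self):
        self.entries = {}  # register name -> (address, size in bytes)

    def advanced_load(self, register, address, size):
        """Record an advanced load (ld.a) into `register`."""
        self.entries[register] = (address, size)

    def store(self, address, size):
        """Every store removes the entries whose memory area it overlaps."""
        for register, (start, length) in list(self.entries.items()):
            if address < start + length and start < address + size:
                del self.entries[register]

    def check(self, register):
        """Check load (ld.c): True if the advanced value is still valid."""
        return register in self.entries
```

A store to a disjoint area leaves the entry for `r3` in place, so the check succeeds; a store overlapping the loaded bytes removes it, and the load would have to be reissued.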

3.6 Register stack

The IA-64 architecture provides a complex mechanism that allows procedures to call other subroutines without having to explicitly save the registers they use, and even to pass parameters and results directly to each other. A part of the general registers, called the stacked general registers (from GR32 up to GR127), is used as a stack.

When a procedure is called, it automatically inherits some stack registers from its parent (they are called input registers). It can also allocate more registers on the stack with the alloc instruction. These new registers are divided between the local registers, which are available only to the current procedure, and the output registers, which will become input registers for any child procedure it calls. Figure 3.3 illustrates this mechanism. With this system of overlapping input/output registers, parameters and results can be passed directly without extra copying. This process is invisible to the procedures themselves, since the first current register on the stack is always GR32 and the following registers are renamed accordingly.

The Register Stack Engine (RSE) is responsible for spilling and filling the stacked general registers according to the allocation demands. When a procedure asks for more registers than are physically available, the RSE spills some or all of the old registers to a program-specific memory area called the backing store. When procedures return and stack frames are restored, the RSE ensures that the allocated registers are physically present, and if not it loads them from the backing store. Note that the RSE keeps track of the NaT bits as well. Figure 3.4 illustrates this mechanism.

The RSE handles spills and fills independently of the program and the processor, so its functioning is invisible to the programmer. It can be programmed to anticipate the instruction flow by spilling or restoring more registers than necessary. In this mode, it does not raise any exception for non-mandatory loads and stores.
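The renaming of overlapping input and output registers can be sketched as follows. This Python model is an assumption-laden simplification: the `RegisterStack` class and its methods are hypothetical, the physical register file is unbounded in practice here (no spilling to the backing store, no NaT bits), and only the frame arithmetic is modeled.

```python
class RegisterStack:
    """Toy model of the stacked general registers (GR32 and up)."""

    def __init__(self, physical_size=96):
        self.physical = [0] * physical_size
        self.base = 0         # physical index the current GR32 maps to
        self.locals_end = 0   # number of input + local registers
        self.frame_size = 0   # inputs + locals + outputs
        self.saved = []       # frames of the calling procedures

    def alloc(self, inputs, local_regs, outputs):
        """Resize the current frame, as the alloc instruction does."""
        self.locals_end = inputs + local_regs
        self.frame_size = self.locals_end + outputs

    def _index(self, gr):
        offset = gr - 32
        if not 0 <= offset < self.frame_size:
            raise IndexError(f"GR{gr} is outside the current frame")
        return self.base + offset

    def read(self, gr):
        return self.physical[self._index(gr)]

    def write(self, gr, value):
        self.physical[self._index(gr)] = value

    def call(self):
        """The caller's output registers become the callee's frame."""
        self.saved.append((self.base, self.locals_end, self.frame_size))
        outputs = self.frame_size - self.locals_end
        self.base += self.locals_end
        self.locals_end = 0
        self.frame_size = outputs

    def ret(self):
        """Restore the caller's frame."""
        self.base, self.locals_end, self.frame_size = self.saved.pop()
```

In this model a caller that allocates two locals and one output writes its parameter to GR34; after the call the callee sees the same physical register as GR32, with no copy involved.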


Figure 3.3: Register stack behavior on procedure call and return (from [IA64-v1] Figure 4–1)

3.7 Register rotation

The IA-64 supports a feature called rotating registers, which improves the speed of loops by filling the processor pipeline completely.

Let us consider a simple loop (from [IA64-v1] section 11):

L1: ld4 r4 = [r5],4     // Cycle 0 - 2 cycles latency
    add r7 = r4,r9      // Cycle 2
    st4 [r6] = r7, 4    // Cycle 3
    br.cloop L1         // Cycle 3

The successive memory accesses and the register dependencies introduce latencies, thus preventing the processor from parallelizing instructions in the core of the loop and limiting the speed to four cycles per iteration. One solution consists in performing several iterations of the loop at the same time to improve the use of the memory units. This is called loop unrolling and would give the following code (with two iterations at the same time):

    // setup the address registers
    add r15 = 4,r5
    add r16 = 4,r6

L1: ld4 r4 = [r5],8       // Cycle 0
    ld4 r14 = [r15],8     // Cycle 0
    add r7 = r4,r9        // Cycle 2
    add r17 = r14,r9      // Cycle 2
    st4 [r6] = r7, 8      // Cycle 3
    st4 [r16] = r17, 8    // Cycle 3
    br.cloop L1           // Cycle 3

Figure 3.4: Relationship between physical registers and backing store (from [IA64-v2] Figure 6–1)

Note that the loop uses two different address registers to remove the dependencies introduced by the post-increment operation. This allows the processor to perform two iterations of the loop in 4 cycles. This result can be improved by unrolling four iterations of the loop, which would make use of cycle 1, unused in the two-iteration version, and would run four iterations in five cycles (see [IA64-v1] section 11.3.1).

The IA-64 architecture provides a way of doing software pipelining of loops without any code expansion, through a set of rotating registers. Let us consider the following loop:

      mov lc = 199             // LC = loop count - 1
      mov ec = 4               // EC = epilog stages + 1
      mov pr.rot = 1<<16       // PR16 = 1, rest = 0

L1:
(p16) ld4 r32 = [r5], 4        // Cycle 0
(p18) add r35 = r34, r9        // Cycle 0
(p19) st4 [r6] = r36, 4        // Cycle 0
      br.ctop L1               // Cycle 0


The goal is to let the processor enable parallel iterations with the help of the predicate registers. At each execution of the instruction br.ctop, the general and predicate registers are rotated so that their index number is increased by 1. If we look at the execution of the loop, we get the trace shown in table 3.3.

       Instructions/Units        State before br.ctop
Cycle  M    I    M    B          p16  p17  p18  p19  LC   EC
0      ld4            br.ctop    1    0    0    0    199  4
1      ld4            br.ctop    1    1    0    0    198  4
2      ld4  add       br.ctop    1    1    1    0    197  4
3      ld4  add  st4  br.ctop    1    1    1    1    196  4
...    ...  ...  ...  ...        ...  ...  ...  ...  ...  ...
100    ld4  add  st4  br.ctop    1    1    1    1    99   4
...    ...  ...  ...  ...        ...  ...  ...  ...  ...  ...
199    ld4  add  st4  br.ctop    1    1    1    1    0    4
200         add  st4  br.ctop    0    1    1    1    0    3
201              st4  br.ctop    0    0    1    1    0    2
202                   br.ctop    0    0    0    1    0    1
...                              0    0    0    0    0    0

Table 3.3: Instruction trace for the rotating register loop.

At cycle 0, only PR16 is equal to 1, so ld4 is the only instruction executed. At cycle 1, the registers are rotated, so GR32 becomes GR33 and PR16 becomes PR17. The br.ctop sets the new PR16 to 1, so the ld4 is executed with the new GR32 as target. At cycle 2, PR16, PR17 and PR18 are enabled, so the add is executed as well. It refers to GR34, which is the first GR32 after two rotations. At cycle 3, the st4 writes GR36 to memory: this is the result that the add placed in GR35 at cycle 2, rotated once. The loop is now fully pipelined and the processor performs one step of each of four iterations in every cycle, which is equivalent to one iteration per cycle. The epilogue count is used to get out of the loop: the different predicate registers are progressively switched to 0 to stop the parallel operations.

The number of registers that rotate is configurable for the general registers (a part of the stacked general registers), and fixed for the predicate and floating-point registers (PR16 to PR63 and FR32 to FR127). The current rotating values are stored in the CFM register.

The architecture also provides more features such as while loops, explicit prologue and epilogue, multiple-exit loops, etc. (see [IA64-v1] section 11).
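The trace of table 3.3 can be reproduced with a few lines of Python. The sketch below is a simplified model written for this illustration: the function name `run_ctop_loop` is hypothetical, only the predicates PR16 to PR19 and the LC and EC counters are tracked, and the general registers are left out.

```python
def run_ctop_loop(loop_count, epilog_count, stages=4):
    """Return the state seen before each br.ctop, plus the final state."""
    lc, ec = loop_count, epilog_count
    preds = [1] + [0] * (stages - 1)      # PR16 .. PR19
    trace = []
    while True:
        trace.append((tuple(preds), lc, ec))
        if lc > 0:
            # Kernel: count down LC, rotate, new PR16 = 1, branch taken.
            lc -= 1
            preds = [1] + preds[:-1]
        elif ec > 1:
            # Epilogue: count down EC, rotate, new PR16 = 0, branch taken.
            ec -= 1
            preds = [0] + preds[:-1]
        else:
            # Last epilogue stage: rotate once more and fall through.
            if ec == 1:
                ec = 0
                preds = [0] + preds[:-1]
            break
    return trace, tuple(preds), lc, ec
```

Calling `run_ctop_loop(199, 4)` yields one trace entry per cycle of the table, and counting the cycles in which PR16 or PR19 is set gives the number of loads and stores actually performed.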

3.8 Floating-point architecture

The IA-64 defines a new floating-point encoding fully compliant with the IEEE-754 standard [IEEE-std]. The format of a floating-point register is described in figure 3.5.

  81     80          64 63                       0
+------+---------------+--------------------------+
| sign |   exponent    |       significand        |
+------+---------------+--------------------------+
   1          17                    64

Figure 3.5: Floating-point register format (from [IA64-v1], Figure 5–1)


Class or subclass          Sign  Biased exponent           Significand (i.bb...bb)
NaNs                       0/1   0x1FFFF                   1.000...01 through 1.111...11
  Quiet NaNs               0/1   0x1FFFF                   1.100...00 through 1.111...11
  Signaling NaNs           0/1   0x1FFFF                   1.000...01 through 1.011...11
Infinity                   0/1   0x1FFFF                   1.000...00
Normalized numbers         0/1   0x00001 through 0x1FFFE   1.000...00 through 1.111...11
  Integer or parallel FP   0     0x1003E                   1.000...00 through 1.111...11
Unnormalized numbers       0/1   0x00000                   0.000...01 through 1.111...11
                           0/1   0x00001 through 0x1FFFE   0.000...01 through 1.111...11
                           0/1   0x00001 through 0x1FFFD   0.000...00
                           1     0x1FFFE                   0.000...00
  Integer or parallel FP   0     0x1003E                   0.000...00 through 0.111...11
  Pseudo-zeroes            0/1   0x00001 through 0x1FFFD   0.000...00
                           1     0x1FFFE                   0.000...00
NaTVal                     0     0x1FFFE                   0.000...00
Zero                       0/1   0x00000                   0.000...00
FR0                        0     0x00000                   0.000...00
FR1                        0     0x0FFFF                   1.000...00

Table 3.4: Floating-point register encodings (from [IA64-v1] Table 5–2)

The value of a finite floating-point number with a non-zero exponent field can be computed as follows:

\[
(-1)^{\mathit{sign}} \times 2^{(\mathit{exponent} - 65535)} \times \mathit{significand}\{63\}.\mathit{significand}\{62{:}0\}
\]

If the exponent is zero, the following formula applies:

\[
(-1)^{\mathit{sign}} \times 2^{(\mathit{exponent} - 16382)} \times \mathit{significand}\{63\}.\mathit{significand}\{62{:}0\}
\]

The implicit bit is always present in the register format, as is usually done for the double-extended formats defined by the IEEE standard.
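The two formulas above can be checked numerically with the following Python sketch. The function name `fr_value` is hypothetical; exact rational arithmetic is used to avoid overflowing native floats, and the special encodings of table 3.4 (NaNs, infinities, NaTVal) are deliberately not distinguished.

```python
from fractions import Fraction


def fr_value(sign, exponent, significand):
    """Value of a finite floating-point register encoding.

    sign is 0 or 1, exponent is the 17-bit biased exponent and
    significand is the 64-bit field, bit 63 being the integer bit.
    """
    if not 0 <= exponent < 0x1FFFF:
        raise ValueError("exponent 0x1FFFF encodes NaNs and infinities")
    # A zero exponent field is interpreted with the fixed power -16382.
    power = exponent - 65535 if exponent != 0 else -16382
    # significand{63}.significand{62:0} is the field divided by 2^63.
    mantissa = Fraction(significand, 1 << 63)
    return (-1) ** sign * mantissa * Fraction(2) ** power
```

For instance, the FR1 encoding of table 3.4 (exponent 0x0FFFF, significand 1.000...00) evaluates to 1, and an integer stored with exponent 0x1003E evaluates to the integer held in the significand, since 0x1003E - 65535 = 63.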

There are two special ways of using the floating-point registers:

• as an integer value, where the significand contains the 64-bit integer.

• as two single floating-point values of 32 bits each, packed into the significand. This is called parallel floating-point mode.

Table 3.4 provides a list of the main encodings that can be used. The NaTVal value is used for speculation and is propagated in the same way as the NaT bit for general registers (see section 3.5). All register encodings are allowed as inputs to arithmetic operations. However, some encodings can never be generated as a result. Section 5.1 of [IA64-v1] provides a complete description of all cases.

The floating-point execution units can provide results in all the standard IEEE 754 formats, that is with 24, 53 or 64 bits of precision for the significand, and 8, 11, 15 or 17 bits for the exponent. The different precisions can be combined freely. The floating-point status register reports all the faults and traps defined by the standard that occur during the computation. It also allows some faults or traps to be ignored and simply reported instead of generating a call to the fault handler.

The IA-64 architecture describes a special mechanism that allows the processor to generate a software assistance fault when needed. The Itanium processor uses this possibility to handle unnormalized numbers in software rather than in hardware and thus reduces the complexity of its floating-point computation units. This software handling is invisible to programs.

The following floating-point operations are defined:

• Memory access for various formats including parallel-fp (load and store with memory/register format conversion).

• Transfer between general and floating-point registers.

• Arithmetic operations (fp multiply and add, fp reciprocal approximation, fp reciprocal square root approximation, fp compare, fp min/max, conversion from/to integer). These operations are also available for parallel-fp mode.

• Non-arithmetic instructions (classify, merge, mix, sign extend, pack, swap, binary operations, select).

• Status register instructions (check, set, clear).

• Integer multiply and add instruction.

3.9 Multimedia support

The multimedia instructions treat the content of the general registers as concatenations of eight 8-bit, four 16-bit or two 32-bit elements that are manipulated independently. The instruction set covers parallel addition, subtraction, average, average of a difference, shift left/right and add, compare, multiply, multiply and shift right, sum of absolute differences and min/max. Three modes of computation exist: modulo (results are wrapped when they overflow), signed saturation and unsigned saturation. Section 4.6 of [IA64-v1] describes these instructions more precisely.
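To make the difference between the modulo and saturating modes concrete, here is a minimal Python sketch of a parallel addition on a 64-bit register. The helper names are hypothetical, only the modulo and unsigned saturation modes are shown, and the exact operand rules of the real instructions are not reproduced.

```python
def split_lanes(value, width):
    """Cut a 64-bit value into independent elements of `width` bits."""
    mask = (1 << width) - 1
    return [(value >> shift) & mask for shift in range(0, 64, width)]


def join_lanes(lanes, width):
    """Concatenate elements back into a 64-bit value (lane 0 is lowest)."""
    mask = (1 << width) - 1
    result = 0
    for index, lane in enumerate(lanes):
        result |= (lane & mask) << (index * width)
    return result


def parallel_add(a, b, width, mode="modulo"):
    """Add two registers element by element, with no carry between lanes."""
    mask = (1 << width) - 1
    sums = []
    for x, y in zip(split_lanes(a, width), split_lanes(b, width)):
        total = x + y
        if mode == "modulo":
            total &= mask                # wrap around on overflow
        elif mode == "unsigned_saturation":
            total = min(total, mask)     # clamp to the largest value
        else:
            raise ValueError(f"unsupported mode: {mode}")
        sums.append(total)
    return join_lanes(sums, width)
```

Adding 1 to an 8-bit element holding 0xFF gives 0x00 in modulo mode and stays at 0xFF with unsigned saturation; in both cases the neighboring elements are unaffected.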

3.10 Memory organization

This section is a short introduction to memory management in the IA-64 architecture. For a complete reference, see [IA64-v2] chapter 4.

3.10.1 Virtual addressing and memory protection

For IA-64 programs, the virtual addressing model is fundamentally a 64-bit flat linear virtual address space. A 64-bit pointer is divided as shown in figure 3.6. The 3 highest bits are used as an index to select a region register. This register contains a region identifier of 24 bits, as well as some other parameters. The rest of the bits represent the virtual address within the region.

 63     61 60                                  0
+---------+-------------------------------------+
| region  |           virtual address           |
+---------+-------------------------------------+
     3                      61

Figure 3.6: A 64-bit pointer

When the processor translates a virtual address to its physical equivalent, it proceeds in several steps (see figure 3.7):

• The three highest bits of the pointer are extracted to select the region register.

• The region register is read to get the region ID.

• The pair formed by the region ID and the virtual address is searched in the TLB for a matching translation.

• If a translation is found, the access rights are checked for the page found.

• If the access is allowed, the physical address is computed from the physical page number and the lowest bits of the virtual address (offset) and used by the current instruction.

Figure 3.7: Conceptual virtual address translation (from [IA64-v2] Figure 4–2)

Virtual addressing can be enabled separately for instruction, data and RSE accesses. The following sections describe the different mechanisms involved in the translation.
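The translation steps listed above can be sketched in Python as follows. Everything in this block is a simplified assumption made for illustration: the TLB is a plain dictionary keyed by region ID and virtual page number, a single fixed page size is used, and the access rights are reduced to a set of allowed operations.

```python
OFFSET_BITS = 61
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def translate(pointer, region_registers, tlb, access="read", page_size=4096):
    """Translate a 64-bit virtual pointer into a physical address."""
    # Step 1: the three highest bits select the region register.
    region_index = pointer >> OFFSET_BITS
    # Step 2: the region register provides the region ID.
    region_id = region_registers[region_index]
    # Step 3: (region ID, virtual page) is searched in the TLB.
    virtual_address = pointer & OFFSET_MASK
    virtual_page, offset = divmod(virtual_address, page_size)
    entry = tlb.get((region_id, virtual_page))
    if entry is None:
        raise LookupError("no translation in the TLB")
    physical_page, allowed = entry
    # Step 4: the access rights of the page are checked.
    if access not in allowed:
        raise PermissionError(f"{access} access denied")
    # Step 5: physical page number and offset give the physical address.
    return physical_page * page_size + offset
```

A pointer whose top bits are 5 is looked up under the region ID held in region register 5, so two processes can use the same 61-bit virtual address without clashing as long as their region IDs differ.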

3.10.2 TLB

The IA-64 architecture defines two distinct TLBs, the Instruction TLB for instruction fetches and the Data TLB for data and RSE accesses. These tables are divided into two sub-sections called translation registers and translation cache.

The translation registers are a fixed-size array entirely managed by software. They allow the operating system to keep critical memory translations in the TLBs, such as sensitive interruption code or kernel memory areas. An entry can be inserted in a specific slot by the instruction itr. No overlapping translations may be inserted, or the processor behavior may be undefined. Each TLB must provide at least 8 translation register slots.

The translation cache is a dynamic structure controlled by the processor, which is responsible for choosing which page translations to replace or remove according to implementation-dependent algorithms. It is intended to be used for the more common, non-critical virtual memory translations in a multitasking environment. Software can insert entries into the cache but cannot assume anything about how long these entries will last (although some rules exist to prevent the last inserted entry from being removed immediately). Each translation cache must provide at least 1 entry.

Several purge instructions are provided to manage the TLBs. When they are applied to the translation caches, software is guaranteed that at least the selected pages have been removed, but the purge may remove more entries than requested. Insertion may also cause additional purges by the processor.

A TLB entry contains the following information about the page:

• the memory attribute field, which describes the cacheability and other memory-related features or limitations.

• the privilege level (0-3).

• the access rights description.

• the physical address.

• the page size.

• the protection key.

• the virtual address.

• the region ID.

• several other bits to control whether the page is present, whether it is dirty, whether exceptions can be deferred, etc.

The access rights are controlled in two ways:

• The access right field, coupled with the privilege level of the page and the privilege level of the program, indicates which of the operations Read, Write and Execute are authorized on the page. Some special combinations increase the program's own privilege level (for system calls, for example).

• A second control is done through the protection keys: the protection key ID is searched for in the protection key registers. If a match is found, the access rights can be further restricted according to the bits that are set in the register.

Page sizes between 4 KB and 256 MB are allowed in the TLBs (and up to 4 GB for purges). Pages should be aligned on their natural boundaries.


Figure 3.8: Interruption classification (from [IA64-v2] Figure 5–1)

3.10.3 VHPT

The Virtual Hash Page Table (VHPT) is an extension of the TLB tables, located in memory, that is used to improve virtual address translation. An IA-64 processor can optionally implement a hardware page walker for the VHPT, so that if a search in the TLBs is not successful, the processor can look for the translation in the VHPT without invoking any operating system code.

The VHPT is located in the virtual memory space. It can be configured as the main OS page table, or as a big cache for the translations. The processor does not update the VHPT, nor does it ensure that the TLBs and the VHPT are consistent. When it finds a translation in the VHPT, a new entry is added to the corresponding translation cache and the execution continues as if the translation had been found in the TLB.

The VHPT can be configured for two formats: a short format for a per-region linear page table where each entry occupies 8 bytes, and a long format for a single large virtual page table where entries are 32 bytes in size. The VHPT functions and configuration system are described in [IA64-v2] sections 4.1.5 to 4.1.7.

3.11 Interruption handling

In an IA-64 system, interruptions can be divided into two main categories according to the way they are handled:

• IVA-based interruptions are serviced by the operating system through the interruption vector in CR2.

• PAL-based interruptions are serviced by the Processor Abstraction Layer (PAL) firmware, the system firmware or the operating system.

Interruptions can also be divided into four types: Aborts, Interrupts, Faults and Traps. Figure 3.8 presents these types and the way they are handled.

Aborts: An abort is raised when the processor has detected an internal malfunction or a processor reset.

Interrupts: An external device has requested the attention of the processor.

Faults: The current instruction is trying to perform an invalid or unauthorized operation.

Traps: The current instruction has executed correctly but needs the system's attention.

When an interruption occurs, the processor saves the minimum state information needed to allow the software to handle the event. The saved state is stored in several control registers associated with interruption management. External interrupt delivery is disabled and the bank of 16 scratch general registers is enabled automatically.

The Interruption Vector Table (IVT) is a 32 KB memory area for 68 different interruption vectors. The first 20 vectors contain 64 bundles each, so that performance-critical interruptions can be handled directly. The 48 other vectors provide 16 bundles per vector. Several vectors have more than one interruption associated with them. Chapters 5 and 8 of [IA64-v2] provide a more detailed explanation and a list of the interruptions and interruption vectors.
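The size figures quoted above are consistent with each other, which the following Python sketch verifies. It assumes 16-byte (128-bit) instruction bundles, as described in section 3.1, and the function name `ivt_layout` is hypothetical.

```python
BUNDLE_BYTES = 16  # one IA-64 bundle is 128 bits


def ivt_layout():
    """Return the byte offset of each vector and the total table size."""
    vector_sizes = [64 * BUNDLE_BYTES] * 20 + [16 * BUNDLE_BYTES] * 48
    offsets = []
    offset = 0
    for size in vector_sizes:
        offsets.append(offset)
        offset += size
    return offsets, offset
```

The 20 large vectors occupy 20 × 64 × 16 = 20480 bytes and the 48 small ones 48 × 16 × 16 = 12288 bytes, which adds up to the 32768 bytes (32 KB) of the table.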

3.12 Debugging and performance monitoring

3.12.1 Debugging

Several Data Breakpoint Registers (DBR) and Instruction Breakpoint Registers (IBR) are defined to hold address breakpoint values. Along with them, the IA-64 architecture defines the following facilities:

Break Instruction fault: The instruction break results in a break instruction fault with an immediate value.

Instruction Debug fault: When the processor loads an address that matches the parameters in the IBR, an instruction debug fault is raised.

Data Debug fault: When the processor executes a memory operation on an address matching the parameters in the DBR, a data debug fault is raised.

Single Step trap: After each successful instruction, a single step trap is raised by the processor.

Taken Branch trap: A taken branch trap is raised on every instruction taking a branch.

Lower Privilege Transfer trap: A lower privilege transfer trap is taken if a branch reduces the privilege level.

The IBR/DBR are divided between odd and even registers: odd registers contain the configuration and mask information, while even registers contain the address to match. Every even register is associated with the odd register with the next index value (registers 0 and 1, 2 and 3, etc.).

3.12.2 Performance monitoring

Two banks of registers are provided to count a large number of different events. PMC registers are control registers, while PMD registers contain the accumulated data. PMC and PMD registers are associated in pairs (PMC4 with PMD4, etc.).

The first four PMC registers are not configuration registers. A counter can report an event through an interruption. When this is the case, or when a counter overflows, the corresponding bit for the counter in the 256 bits of the first four PMC registers is set to 1 to indicate to a program which counter raised the interruption.
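As an illustration of how a handler could find out which counter raised the interruption, the Python sketch below treats the first four PMC registers as one 256-bit vector. The mapping used here (counter n corresponds to bit n mod 64 of register n div 64) is an assumption made for this example, and the function names are hypothetical.

```python
def overflow_bit_location(counter):
    """Return (register index, bit index) for a counter number."""
    if not 0 <= counter < 256:
        raise ValueError("counter number out of range")
    return counter // 64, counter % 64


def set_overflow(status_registers, counter):
    """Mark `counter` as having overflowed in the four 64-bit registers."""
    register, bit = overflow_bit_location(counter)
    status_registers[register] |= 1 << bit


def overflowed_counters(status_registers):
    """List the counters whose bit is set in the four status registers."""
    return [
        register * 64 + bit
        for register in range(4)
        for bit in range(64)
        if (status_registers[register] >> bit) & 1
    ]
```

A handler would call `overflowed_counters` on the four register values and then read the matching PMD registers.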

A large number of events can be monitored. They belong to the following categories:

• Basic events, like clock cycles and retired instructions.

• Instruction execution (decode, issue, execution, speculation, memory operation).

• Cycle accounting events (stall).

• Branch events (prediction).

• Memory hierarchy (instruction and data caches).

• System events (TLBs).

Chapter 4

Extraction of information from the documentation

The IA-64 documentation provided by Intel is encoded as Portable Document Format files [PDF99]. [IA64-v3] is the most important for us, as it contains a complete description of the instructions along with pseudo-code and encoding/decoding information. Automatically extracting this information in a usable form would be a huge step forward in generating a simulator as close to the documentation as possible.

4.1 PDF and its shortcomings

PDF is a very useful cross-platform format, very well suited to providing ready-to-print documents. A document encoded in PDF is divided into pages, each of them being built from vector-based descriptions. The basic elements of this description are fonts, characters and paths¹. This makes PDF a format that contains very little semantic information about its content: although it allows a user to search throughout the text, no distinction can be made between a drawing and the lines that compose a table, or between a title and an annotation. Moreover, PDF is a very complex format to manipulate.

Some tools exist to convert PDF to other formats:

• Postscript is a natural target, since its description possibilities are a superset of PDF's. However, it is of little interest except for printing purposes, its encoding being even more complex. Several programs can export PDF to Postscript, among others ghostscript [3] and xpdf [4].

• The text contained in the PDF files can be extracted in different ways. Some perl scripts simply remove the PDF commands from the file. The most efficient tool is once again xpdf [4], which 'draws' the PDF information on a 'text page', just as it 'draws' it on a graphics page in the normal case. It usually handles line breaks and spacing pretty well, and provides a reasonably close text version of the PDF document, at least to the human eye.

• Some tools try to convert PDF (or Postscript) to another popular cross-platform format, HTML. The inherent lack of structure information makes this transformation quite difficult and often no better than a simple text extraction².

¹ Series of curves and lines that can be drawn and filled with different styles and colors.

The goal of the conversion was to provide a format that is simple to parse with a minimum of human intervention. The text conversion has been found to be the closest to these goals as far as the pseudo-code of the instructions is concerned. An example of transformation is provided in figures 4.1 and 4.2.

Because PDF does not use the concept of tables, the decoding/encoding information could not be extracted into something usable. The encoding tables are heavily dependent on the position of the bit markers, that is, the numbers that mark the size and the position of the different fields. It is impossible to use the text conversion there because it is not precise enough. In the same way, most of the values are centered vertically and horizontally in the table cell they occupy, and cells sometimes span several rows or columns. This makes the structure of a table very difficult to grasp just from the positioning of the values it contains.

The pseudo-code of the add instruction provides a good example of a mistake caused by the conversion from the PDF format. At the end of the code, one can read:

if (plus1_form)
    GR[r1] = tmp_src + GR[r3] + 1;
elseGR[r1] = tmp_src + GR[r3];

which should obviously be read as:

if (plus1_form)
    GR[r1] = tmp_src + GR[r3] + 1;
else
    GR[r1] = tmp_src + GR[r3];

These problems are relatively few over the whole instruction set, but human intervention is needed in the process.

4.2 Intel description format

The description format used for each IA-64 instruction is the following: after the instruction's name, a series of sections provides different kinds of information. Figure 4.1 is a good, although simple, example of the format.

• The format section is composed of lines describing the different variations of the instruction. The assembly syntax is given first, then flags that make explicit the particularities of the current variation. These flags are used as boolean values in the pseudo-code to decide which operations to perform. Instead of these flags, the line can state that the current variation is a 'pseudo-op' of a more complex instruction. Last comes the reference number for the instruction in the encoding tables.

• The description section is an English explanation of the parameters and the possible completer values, as well as the behavior of the instruction in specific cases.

² Most of the tools do not even produce any output on the Intel documentation.

Figure 4.1: ADD instruction's description (from [IA64-v3] p. 2–3)

add

Add
Format: (qp) add r1 = r2, r3 register_form A1
(qp) add r1 = r2, r3, 1 plus1_form, register_form A1
(qp) add r1 = imm, r3 pseudo-op
(qp) adds r1 = imm14, r3 imm14_form A4
(qp) addl r1 = imm22, r3 imm22_form A5

Description: The two source operands (and an optional constant 1) are added and the result placed in GR r1. In
the register form the first operand is GR r2; in the imm_14 form the first operand is taken from the
sign-extended imm14 encoding field; in the imm22_form the first operand is taken from the
sign-extended imm22 encoding field. In the imm22_form, GR r3 can specify only GRs 0, 1, 2 and 3.
The plus1_form is available only in the register_form (although the equivalent effect in the
immediate forms can be achieved by adjusting the immediate).
The immediate-form pseudo-op chooses the imm14_form or imm22_form based upon the size of
the immediate operand and the value of r3.

Operation: if (PR[qp]) {
    check_target_register(r1);

    if (register_form) // register form
        tmp_src = GR[r2];
    else if (imm14_form) // 14-bit immediate form
        tmp_src = sign_ext(imm14, 14);
    else // 22-bit immediate form
        tmp_src = sign_ext(imm22, 22);

    tmp_nat = (register_form ? GR[r2].nat : 0);

    if (plus1_form)
        GR[r1] = tmp_src + GR[r3] + 1;
    elseGR[r1] = tmp_src + GR[r3];
    GR[r1].nat = tmp_nat || GR[r3].nat;
}

Interruptions: Illegal Operation fault

u

IA-64 Instruction Reference 2-3

Figure 4.2: ADD instruction's description transformed by xpdf

[Flowchart: IA-64 documentation in PDF format → pdftotext → IA-64 documentation in text format → perl script → comment files and pseudo-code files → pseudo-code parser → C++ files → indent → human-readable C++ files → concatenation (with the comment files) → final C++ files]

Figure 4.3: The documentation extraction process

• The operation section contains the pseudo-code of the instruction. The format used for the pseudo-code is described later (see section 4.3).

• The interruptions section lists all the non-trivial interruptions raised by the instruction.

• The mapping section is a complementary description present for some floating-point instructions.

• The fp exceptions section describes the faults and traps that can be raised by floating-point instructions.

• The serialization section describes if and how serialization is needed to complete the results of the instruction.

Figure 4.3 presents the whole extraction process. We will describe it step by step. To parse the text file obtained after conversion by xpdf, a perl script was written with the following functions:

• it reads through the whole text file containing all the instructions, and extracts two files for each real instruction: a code file and a comment file. Pseudo-instructions are removed. All the documentation is gathered as a C/C++ comment.

• it parses the format section to guess the arguments that will be used in the pseudo-code and to identify pseudo-instructions. It also performs some guessing work on the type of these arguments according to their names.

• it performs some simple search-and-replace transformations in the pseudo-code and removes all comments.

• for each instruction, it wraps the pseudo-code into a function, where all the arguments found in the format section are considered as function arguments.
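The format-section parsing step can be approximated by the Python sketch below. It is not the perl script used for this work, only an illustration of the idea under simplifying assumptions: the function name `parse_format_line` and the regular expressions are hypothetical, and they only cover lines shaped like those of the add instruction in figure 4.2.

```python
import re


def parse_format_line(line):
    """Split one line of a Format section into its components."""
    # The encoding reference (A1, A4, ...) is the last token, if present.
    encoding_match = re.search(r"\s([A-Z]\d+)\s*$", line)
    encoding = encoding_match.group(1) if encoding_match else None
    if encoding_match:
        line = line[:encoding_match.start()]
    pseudo_op = "pseudo-op" in line
    # Flags are the identifiers ending in _form.
    flags = re.findall(r"\b\w+_form\b", line)
    # The assembly syntax is everything before the first flag or marker.
    syntax = re.split(r"\b\w+_form\b|pseudo-op", line, maxsplit=1)[0]
    match = re.match(r"\s*(?:\((\w+)\)\s*)?([\w.]+)\s*(.*)", syntax)
    predicate, mnemonic, operands = match.groups()
    arguments = [predicate] if predicate else []
    # Guess the arguments: every identifier found among the operands.
    arguments += re.findall(r"[A-Za-z_]\w*", operands)
    return {
        "mnemonic": mnemonic,
        "arguments": arguments,
        "flags": flags,
        "pseudo_op": pseudo_op,
        "encoding": encoding,
    }
```

Because the text conversion loses the italic style, such a parser cannot tell an argument from a completer written in roman, which is the source of the wrong guesses discussed after figure 4.5.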

void ia64_add(int qp, int r1, int r2, int r3, int_64 imm14, int_64 imm22, bool imm14_form,
bool imm22_form, bool plus1_form, bool register_form)
{
if (PR[qp]) {
check_target_register(r1);
if (register_form)
tmp_src = GR[r2];
else if (imm14_form)
tmp_src = sign_ext(imm14, 14);
else
tmp_src = sign_ext(imm22, 22);
tmp_nat = (register_form ? GR[r2].nat : 0);
if (plus1_form)
GR[r1] = tmp_src + GR[r3] + 1;
elseGR[r1] = tmp_src + GR[r3];
GR[r1].nat = tmp_nat || GR[r3].nat;
}
}

/* Instruction: add
 *
 * Add
 * (qp) add r1 = r2, r3 register_form A1
 ...
 * Illegal Operation fault
 */

Figure 4.4: ADD instruction's description transformed by the perl script

#include "framework.h"
#include "pseudo-functions.h"
#include "ia64-functions.h"

/* Instruction: add
 *
 * Add
 * (qp) add r1 = r2, r3 register_form A1
 ...
 * Illegal Operation fault
 */

void
ia64_add (int qp, int r1, int r2, int r3, int_64 imm14, int_64 imm22,
          bool imm14_form, bool plus1_form, bool register_form)
{
  notype tmp_src;
  notype tmp_nat;
  notype elseGR;

  if (PR[qp])
    {
      check_target_register (r1);
      if (register_form)
        tmp_src = GR[r2];
      else if (imm14_form)
        tmp_src = sign_ext (imm14, 14);
      else
        tmp_src = sign_ext (imm22, 22);
      tmp_nat = (register_form ? GR[r2].nat : 0);
      if (plus1_form)
        GR[r1] = tmp_src + GR[r3] + 1;
      elseGR[r1] = tmp_src + GR[r3];
      GR[r1].nat = tmp_nat || GR[r3].nat;
    }
}

Figure 4.5: ADD instruction's description at the end of the extraction process

Figure 4.4 illustrates the results of parsing the add instruction.

This parsing method is rather efficient but fails in some cases. When the text information is extracted, the italic style is lost, and this leads to confusing errors. In the format section, the arguments are written in italic whereas the name of the instruction is written in roman. This sometimes makes the parser find an argument that is not valid. For example, in parsing mov.ret.mwh.ih b1 = r2, tag13, ret is incorrectly understood as an argument. Some more complex arguments are difficult to parse as well, like itr.d dtr[r3] = r2. The parser tries to guess as many arguments as it can, and lets the pseudo-code parser handle them and remove them if necessary.

4.3 Pseudo-code

The Intel pseudo-code describing the behavior of the IA-64 instructions is a slightly enhanced C. The following language constructs have been added:

• '{msb:lsb}' and '{bit}' are postfix operators used for selecting a range of bits. The '{msb:lsb}' form selects the bits between bit number lsb (where bit 0 is the least significant bit of the number) and msb. The '{bit}' variation selects only one bit.

• u>, u>=, u<, u<=, u>>, u>>=, u+ and u* are like the usual operators but consider their operands as unsigned.

• enumerations for arguments are written as 'values', whereas other enumerated values are classical uppercase symbols.
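The first two constructs map naturally onto small helper functions, which is also how a translation to plain C or C++ can handle them. The Python sketch below gives one possible reading of their semantics on 64-bit values; the helper names `bits`, `as_unsigned`, `u_lt` and `u_shr` are hypothetical and are not the names used in the generated code.

```python
MASK64 = (1 << 64) - 1


def bits(value, msb, lsb=None):
    """value{msb:lsb}, or value{bit} when lsb is omitted."""
    if lsb is None:
        lsb = msb
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)


def as_unsigned(value):
    """Reinterpret a (possibly negative) integer as unsigned 64-bit."""
    return value & MASK64


def u_lt(a, b):
    """The u< operator: compare both operands as unsigned values."""
    return as_unsigned(a) < as_unsigned(b)


def u_shr(value, count):
    """The u>> operator: logical (zero-filling) shift right."""
    return as_unsigned(value) >> count
```

The difference with the signed operators shows on negative values: -1 is smaller than 1 as a signed number, but as an unsigned 64-bit value it is the largest possible number.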

The Intel pseudo-code does not describe the complete operation of the instructions. When these operations can be coded in C in a simple way, the code is directly integrated. Otherwise the operations are delegated to pseudo-code functions, which are described only in English in the manual. They often provide access to some part of the IA-64 processor structure (like the Register Stack Engine), perform low-level operations (floating-point computation) or contain the logic for exception checking (memory, floating-point, etc.). The pseudo-code is very dependent on them, and thus most of them have to be implemented before instructions can be run on a simulator. They are described in more detail in chapter 5.

In the add instruction example, check_target_register() and sign_ext() are pseudo-code functions that respectively check if the target register is valid (i.e. if it is inside the stack frame and not zero) and sign-extend a value from a given highest bit.
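As an illustration of what such a pseudo-code function does, here is a minimal Python 3 sketch of sign extension. It assumes that the second argument is the number of significant low bits (so the sign bit is bit nbits - 1) and that the result is a 64-bit two's-complement pattern; the exact convention of the manual's sign_ext() should be checked against the documentation.

```python
MASK64 = (1 << 64) - 1


def sign_ext(value, nbits):
    """Sign-extend the low nbits of value to a 64-bit two's-complement pattern."""
    value &= (1 << nbits) - 1        # keep only the significant bits
    sign = 1 << (nbits - 1)          # position of the sign bit
    # XOR/subtract trick: leaves positive values unchanged and makes
    # values with the sign bit set negative, then wraps to 64 bits.
    return ((value ^ sign) - sign) & MASK64
```

For a 14-bit immediate, as in the immediate form of add, 0x3FFF extends to all ones (that is, -1) while 0x1FFF is left unchanged.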

A parser for this extended C dialect was built in C (with the help of lex and yacc). The parser transforms the code with the following goals:

• it looks for the arguments that are actually used in the pseudo-code and removes from the function's definition those that are not useful.

• it looks for unknown symbols and defines them as local variables. Some help is provided through a list of the symbols that will be defined externally (pseudo-code functions, register banks' names, etc.).

• it transforms the bit operations into function calls.

• it translates the unsigned operators by casting the arguments to unsigned integers of 64 bits.

• it transforms the quoted enumeration values into real constants that are defined next to the function.

The code of the functions is then passed through indent [5] and the documentation is added at the beginning of the file. The specified include directives are also added so that the code will be ready to compile.

Figure 4.5 shows the contents of the file add.cc after the whole extraction process has completed. One can notice that the else GR error has been parsed successfully and considered as a local variable. Human intervention is necessary to correct the problem.

Only two instructions generate errors during the process: frcpa and frsqrta. They define a helper function inside their pseudo-code, which ends up wrapped inside the function created by the Perl script and thus generates a syntax error. It is enough to cut out this function to solve the problem.

The local variables are not typed automatically. Although it would be possible in most cases to guess the type from the context or at least from the variable name, the guess can be misleading and is not always accurate. Floating-point functions provide good examples of where complex typing is needed. Some functions use arrays of values whose type is not easily guessed from context. In some rare cases, some variables are in fact global variables (like slot in br, which refers to the current slot in the instruction bundle). Moreover, some pseudo-code functions are not described and are therefore not present in the list of 'known' symbols. A human must therefore check each generated file to catch possible errors and perform the more complex typing.

The parser also generates a header file that contains the definition of all the functions as well as the enumerations needed to make them compile. These enumeration values may need to be adapted, since some of them are common to many functions (like memory access flags) while others are used in one instruction only.

4.4 Results

At the end of the extraction process, each real IA-64 instruction is encoded in a file containing the documentation and a function implementing the pseudo-code of the manual. The code is C/C++ compliant and can be compiled with the proper external definitions and the correct typing for the local variables and the functions' arguments. A header file has been created with the list of the functions' and enumerations' definitions.

The entire pseudo-code of the Intel documentation is thus available as usable C code. However, the encoding information could not be processed and needs to be provided by human coding.


Chapter 5

The reference simulator

The structure of the reference simulator was designed to use the pseudo-code extracted from the Intel documentation with as few changes as possible. Since the pseudo-code is mainly C, a C-like language was the most viable alternative. C++ was chosen for its ability to redefine operators (and thus hide the complex operations that are not taken into account in the Intel pseudo-code) and for its inheritance system, which makes typing easier and allows more generic functions.

The reference simulator is divided into several parts:

• The framework of the simulator, which contains the state of the processor and the definitions needed so that the pseudo-code can run. It also implements the mechanisms (TLB, RSE, etc.) that are invisible to the programmer.

• The pseudo-code extracted from the documentation, which is ready to be used by a decoder.

• The pseudo-code functions that are needed for the pseudo-code to run, which provide a number of services and access points to the framework of the processor itself.

• The binary instruction decoder, which calls the right functions with the relevant arguments.

This chapter describes how the simulation of the different aspects of an IA-64 processor was implemented. It also covers the framework needed to use the reference simulator as a reference for Simics. It provides, however, only an overview of the implementation choices and possibilities. Figure 5.3 presents a summary of the structure of the reference simulator with a size evaluation for each module in terms of lines of code.

5.1 Register files

A class hierarchy was defined to handle the different types of registers. Most of them are 64-bit wide registers, sometimes divided into bit fields. The Intel pseudo-code accesses registers either as a whole 64-bit value or as a structure of bit fields. It uses most of the assignment, comparison and computation operators on the registers or the fields as well.

Register
    - value: int_64
Ignored_reg
    - value: int_64 (reading returns 0)
Reserved_reg
    - value: int_64 (reading is not allowed)
BitControlledRegister
    - BitRegStatus[64]: enum {normal, reserved, ignored}
OneBitRegister
    - value: boolean
FloatRegister
    + sign: int
    + exponent: int
    + significand: int_64
GeneralRegister
    + nat: OneBitRegister
RegisterPart
    - value: &int_64 (from a register)
    - startBit: int
    - endBit: int
Other specific registers
RSC_reg
    + mode: RegisterPart
    + pl: RegisterPart
    + be: RegisterPart
    + loadrs: RegisterPart
RegisterVector<T>
    + operator[](index: int): T&
GeneralRegisterVector
    + operator[](index: int): GeneralRegister& (as visible for the software)
    + getRSEReg(index: int): GeneralRegister& (without rotation)
    + getPhysReg(index: int): GeneralRegister& (physical register)
FloatRegisterVector
    + operator[](index: int): FloatRegister& (as visible for the software)
    + getPhysReg(index: int): FloatRegister& (physical register)
PredicateRegisterVector
    + operator[](index: int): PredicateRegister& (as visible for the software)
    + getPhysReg(index: int): PredicateRegister& (physical register)

Figure 5.1: Class hierarchy for register types

instruction add_register_register(qp, r1, r2, r3)
pattern
    aslot_type == 1 && opcode == 8 && a_x2a == 0 && a_ve == 0 && a_x4 == 0 && a_x2b == 0
syntax
    "(p{d:qp})\t add r{d:r1} = r{d:r2}, r{d:r3}"
semantics
    #{ ia64_add(qp, r1, r2, r3, 0, 0, false, false, true); #}

Figure 5.2: Example of simgen encoding for the instruction add reg, reg

To handle these requirements, the following class hierarchy was defined (see also figure 5.1):

• Register defines a 64-bit register with all the operators that can be applied to it (including a cast to a 64-bit integer).

• Ignored_reg and Reserved_reg define two registers that respect Intel's definition of ignored and reserved registers for reading and writing values (an ignored register can be written to but always returns 0, whereas a reserved register cannot be accessed).

• BitControlledRegister is a register where every bit has a status among the list: normal, ignored, reserved. An ignored bit is always 0; it can be written to, but does not change. A reserved bit is always 0 and raises an exception if a write tries to set it to 1. Reads and writes in BitControlledRegisters are done according to their bit tables.

• RegisterPart contains a reference to a register value, as well as a bit field definition (start bit, end bit). When a RegisterPart is read, it extracts its bit field from the register value. When it is written to, it sets its bit field in the register to the given value.

• By combining BitControlledRegister and RegisterPart, all specific registers can be described and accessed either as a whole or as a structure.

• OneBitRegister defines a 1-bit register for predicates and NaT bits.

• FloatRegister defines a floating-point register of 82 bits.

• GeneralRegister defines a 64-bit register with a NaT value.
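The RegisterPart mechanism described above can be sketched as follows. The reference simulator implements it in C++ with redefined operators; this Python 3 version only illustrates the read and write logic, with hypothetical method names. The RSC bit positions in the usage note (mode in bits 1:0, be in bit 4, loadrs in bits 29:16) follow our reading of the IA-64 layout and should be checked against the manual.

```python
MASK64 = (1 << 64) - 1


class Register:
    """A plain 64-bit register holding its value as an unsigned integer."""

    def __init__(self, value=0):
        self.value = value & MASK64


class RegisterPart:
    """A view on bits start_bit..end_bit (inclusive) of an existing register."""

    def __init__(self, register, start_bit, end_bit):
        self.register = register
        self.start_bit = start_bit
        self._mask = (1 << (end_bit - start_bit + 1)) - 1

    def read(self):
        # Extract the bit field from the underlying register value.
        return (self.register.value >> self.start_bit) & self._mask

    def write(self, field_value):
        # Clear the field, then insert the (truncated) new value.
        cleared = self.register.value & ~(self._mask << self.start_bit)
        inserted = (field_value & self._mask) << self.start_bit
        self.register.value = (cleared | inserted) & MASK64
```

For example, with rsc = Register() and be = RegisterPart(rsc, 4, 4), calling be.write(1) sets rsc.value to 0x10, so the register can be read either as a whole or through its fields.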

The register banks are defined as a fixed-size template vector called RegisterVector. Three specializations handle rotating registers and the Register Stack Engine operations: GeneralRegisterVector, FloatRegisterVector and PredicateRegisterVector. All of these classes redefine operator[] so that the pseudo-code can transparently access the right registers while ignoring the rotating and renaming operations happening in the background.

In the Intel pseudo-code, some registers are accessed as structures despite the fact that they are referenced from a vector (like AR[RSC].be). To handle this properly, the pseudo-code has been transformed into a reference to a register (RSC_real.be). The names of the registers have been defined as enumerations to be used directly as a vector index (AR[RSC]), and the vector contains a reference to the XXX_real registers at the appropriate index.

The entire set of registers of the IA-64 is described in one file. These registers are defined globally to be consistent with the pseudo-code usage. Since they are global, they have been defined so that initialization is performed separately from their constructor. This allows the setup code to set some implementation-dependent variables before creating the IA-64 framework.

5.2 Instruction decoding

The instruction decoding was written with simgen, an internal tool developed at Virtutech. It was kept as simple and straightforward as possible and thus uses only a few of the features available. The list of the bit fields used is completely defined, then each encoding is described as simply as possible. Instructions are grouped only when it is obviously clearer to do so. The only operation performed by a decoded instruction is to call the corresponding pseudo-code function with the appropriate arguments. Some instructions perform some checks on the arguments and raise exceptions if needed.

IA-64 Documentation -> Extraction process -> Reference Simulator, composed of:
    Pseudo-code (10070 lines)
    Pseudo-code functions (4750 lines)
    Decoder (3120 lines)
    Internal processor system: TLB, interrupts, execution (1670 lines)
    Register framework (1500 lines)
    State file manipulation (800 lines)
    Memory device (200 lines)
    Softfloat

Figure 5.3: The reference simulator. The line counts refer to the number of code lines used in each module. A straight arrow shows that the respective module has been extracted from the documentation, whereas a dashed arrow emphasizes that the module was based on an English description.

An example of encoding is given in figure 5.2. The instruction add_register_register is defined with the parameters qp, r1, r2 and r3. Those are previously defined bit fields that will be extracted from the binary instruction. The bit pattern to recognize the instruction is defined, then a human-readable syntax is provided, so that the decoder can print out the instruction it is decoding. Finally, the semantics part contains the code that will be executed if this instruction is decoded, which is a simple call to the generated function ia64_add with the right parameters set. The last arguments of the function are the boolean flags that control the execution (here register_form is true while imm14_form and plus1_form are false).

5.3 Exception handling

Exceptions are raised by the raiseInterruption() function. If the exception is to be deferred, it is simply stored as a reference, so that it can be checked whether other exceptions will be precluded¹. If the exception is raised, all registers that can be set at this point are given their value (some registers that give more information about the current state, the registers that should be saved, etc.). Some specific registers are considered as already set by the function that may raise this exception (this is mostly the case for memory accesses that set registers with the faulting address or the VHPT information). To handle deferred exceptions correctly, these functions set shadow registers that are only copied to the real registers if an exception is raised. When an interruption has been raised and the registers are correctly set, a C++ exception is thrown to prevent any further execution of the current instruction. This exception is to be caught by the execution unit.

¹ In the IA-64 model, an exception can be deferred and simply ignored. However, some subsequent exceptions can be deferred automatically even if the system does not ask the processor to do so, because a previous, related and more important exception has been deferred. They are said to be precluded by the first deferred exception.

Traps are handled in a similar way, except that no C++ exception is thrown since the traps are raised only by the execution unit.

The setting of the registers after an exception is not completely defined in the IA-64 architecture; that is, some register settings are implementation-dependent and thus can have any value. This is handled with a boolean value added to every register that indicates whether the current value is implementation-dependent or not, and thus whether it should be included in the comparisons when using the reference simulator.
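The effect of this boolean on the comparison can be sketched in a few lines of Python 3. The names RegState, impl_dependent and compare_states are hypothetical, as is the dictionary representation of the processor state; the reference simulator itself stores the flag inside its C++ register classes.

```python
from dataclasses import dataclass


@dataclass
class RegState:
    value: int
    impl_dependent: bool = False  # hypothetical name for the per-register flag


def compare_states(reference, observed):
    """Return the names of registers whose values differ.

    reference maps register names to RegState, observed maps names to plain
    integers. Registers flagged as implementation-dependent in the reference
    state are skipped, since any value is acceptable for them.
    """
    mismatches = []
    for name, ref in sorted(reference.items()):
        if ref.impl_dependent:
            continue
        if observed.get(name) != ref.value:
            mismatches.append(name)
    return mismatches
```

A register whose value is left to the implementation after an interruption would be flagged in the reference state and would therefore never appear in the list of mismatches.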

5.4 Memory simulation

Memory is simulated in a very simple manner: it is allocated when needed, in blocks of fixed size (4 KB by default); physical addresses can be 64 bits wide; read and write operations are independent of the host's endianness. Non-allocated addresses always return 0. Values from 1 to 16 bytes can be read or written in one call.
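A minimal Python 3 sketch of such a memory model is given below. The class and method names are hypothetical, and working on byte strings is our own way of staying independent of the host's endianness; the sketch only reproduces the properties stated above (lazy allocation in fixed-size blocks, zero for unallocated addresses, 1 to 16 bytes per access).

```python
class SparseMemory:
    """Physical memory allocated lazily in fixed-size blocks."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # block index -> bytearray of block_size bytes

    @staticmethod
    def _check(length):
        if not 1 <= length <= 16:
            raise ValueError("accesses must be 1 to 16 bytes long")

    def read(self, address, length):
        """Return length bytes starting at address; unallocated bytes read as 0."""
        self._check(length)
        out = bytearray()
        for addr in range(address, address + length):
            block = self.blocks.get(addr // self.block_size)
            out.append(block[addr % self.block_size] if block is not None else 0)
        return bytes(out)

    def write(self, address, data):
        """Store data at address, allocating blocks on first use."""
        self._check(len(data))
        for offset, byte in enumerate(data):
            addr = address + offset
            index = addr // self.block_size
            if index not in self.blocks:
                self.blocks[index] = bytearray(self.block_size)
            self.blocks[index][addr % self.block_size] = byte
```

An access that straddles a block boundary simply touches two blocks, and a caller can convert the returned bytes with int.from_bytes() using whichever byte order the simulated processor is in.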

5.5 Register Stack Engine

The Register Stack Engine is implemented exactly as it is described in [IA64-v2] in chapter 6 (and briefly explained in section 3.6). The same state values are kept and updated. The RSE implements only the lazy mode, that is, it never performs loads or stores in advance but only when needed. The code that updates the RSE is in the pseudo-code functions.

5.6 Translation Look-aside Buffers

The TLBs are defined from a versatile vector structure that can work as a fixed-size array or as an infinite-size list. Fixed-size array mode is used for the TLB registers, which are accessed by index. Infinite-size list mode is used for the TLB caches: the reference simulator needs to keep complete track of the TLB operations, so that it can, if needed, cope with an implementation that is slightly different from its own. Indeed, all the algorithms to update the TLB caches are left to the implementation, provided that they follow some basic rules to keep the instruction flow running. The reference simulator needs to be able to handle a TLB miss that would have been a hit in its own system, or to handle purges more extensive than what was required by software. At the same time it must be able to ensure the validity of a TLB hit and not blindly follow what the other implementation does. This should be done by keeping all page translations until a miss shows that they have been purged by the other implementation².

² The reference simulator works this way. However, since no TLB-related events are collected yet, it never removes pages from its cache except during an explicit purge.

The TLB arrays also contain the basic code to match and purge pages according to the rules defined by the IA-64 specifications.

5.7 Floating-point computation

Floating-point computation is done with an extension of the library softfloat [6]. This library performs IEEE 754 [IEEE-std] floating-point computations with numbers of size 32, 64, 80 and 128 bits. It was extended to handle 82-bit numbers as defined in the IA-64 specifications:

• the type floatx82 was defined.

• the addition, subtraction, multiplication and rounding functions were adapted to 82-bit wide floating-point numbers. The rounding was extended so that it could be performed with a precision of 24, 53 or 64 bits in the significand, and 7, 11, 15 or 17 bits in the exponent.

• the fused multiply-add operation³ was introduced along with the special type floatx82_infp, which is designed to keep enough precision during the multiplication and the addition so that they can be considered as infinite precision operations. The rounding is done after they have been performed.

• the fpa floating-point information has been added to the rounding code. It signals the fact that the magnitude of the delivered result is greater than the magnitude of the infinite precision result before rounding. Since it is not part of the IEEE standard, it does not deliver a trap by itself.

• the standard floating-point comparisons have been ported as well.

The pseudo-code functions that actually perform the computation are based on these extensions of softfloat.

5.8 Pseudo-code functions

The pseudo-code functions perform a large number of tasks so that the pseudo-code itself is not cluttered with complex computations and implements only the top-level logic of the instruction. They can be classified in categories:

• Generic helper functions (concatenation, sign/zero extension, shift, etc.).

• Processor generic functions (template information, register check, ignored and reserved check, etc.).

• TLB-related functions (TLB searching, exception check, etc.).

• Physical memory functions (alignment, read, write, semaphore, cache, etc.).

• RSE-related functions.

• Exception raising.

• Floating-point exception checking.

• Floating-point computations.

• Floating-point memory operations.

• Implementation-dependent algorithms (TLB insert, VHPT tags, etc.).

³ This instruction performs the operation a × b + c in infinite precision and rounds the final result. Variants are a × b − c, −(a × b) + c, etc.

5.9 Implementation-dependent features

In the framework and the pseudo-code functions, functions and values that are not described in the IA-64 manual (and are thus implementation-dependent) have been separated, so it is relatively easy to modify the reference simulator to model something other than an Itanium processor. This includes the numbers of registers in certain banks, the size of some fields, the size of the virtual and physical address spaces, TLB algorithms, exception checking, etc.

As a note, it is important to consider that the Itanium internal features are not completely available to the public, and thus the reference simulator is a generic IA-64 simulator that can be customized to match a specific implementation (currently the Itanium) as closely as possible. When some features are not described precisely enough, the reference simulator should try to match a superset of the possibilities and adapt its behavior according to the other processor (the TLBs are a good example). This is not completely implemented in the current reference simulator, but the framework has been designed with this goal in mind.

5.10 State saving and loading

To act as a comparison reference machine, the reference simulator needs to be able to load and save the state of the processor and gather information about what happens during the execution. A state file format was defined: it contains a start state with the register and memory values that are needed, information about how many instructions to run, whether interruptions happened at specific instructions, whether memory accesses were done, and so on, and finally the end state. The whole format is described in appendix A.

The IA-64 framework contains code to save and restore its state, as well as a parser to read the state files and execute them. It keeps a list of the 'events' happening during execution. It also contains a simple comparator that checks if the final state of the reference simulator is identical to the end state in the state file.

At this time, the state files and the comparison are not complete. General and predicate registers are handled correctly, as well as some other specific registers (IP, CFM, etc.). Floating-point registers are ignored, memory accesses are ignored as well, and TLB-related events are not collected. This is due to two major reasons. First, the IA-64 Simics simulator is not complete, so the IA-64 structure is not completely available to the user, and most of the function hooks to gather events are not handled yet. Second, the way the state file and the events are handled depends on the tests that are performed, and no test generator has been designed yet, except some simple examples that are described in the next chapter. However, the framework to perform a complete comparison and handle more events is in place and should be easy to extend.


Chapter 6

Simics module

To be able to compare Simics execution to the reference simulator, we need to collect state information from Simics, which means we need a module able to save and restore state files. Since Simics can be entirely scripted in Python, Python was an obvious choice to quickly develop this module and to create a simple test generator to validate the comparison system.

6.1 State files

Two classes were written in Python to handle the state files: one is generic (stateFile) and the other, specialized for the IA-64, is derived from the first (ia64StateFile). They provide functions to write a complete state file (beginning and end states, events) and to read the start state from a state file. They also provide functions to print a generic state file on the screen.

With these two classes, two Python scripts are provided:

print-state-file.py is a stand-alone script to print a generic state file. It is far from perfect since it does not interpret any of the implementation-specific fields, but it should be able to print any file following the standard.

rerun-test.py provides a loadTest() function to be used inside Simics to load the start state from a state file. Simics is then ready to repeat the test.

6.2 A test generator

A simple test generator was written in Python. Its goal is to generate random instruction patterns to test whether the decoding and execution of these instructions are done correctly. If we consider the binary instruction as composed of encoding bits (they define which instruction will be decoded) and parameter bits (they define which registers or immediate values will be used), it is possible to loop over the whole encoding space to test all the instructions, and to set the parameters to interesting values. This is what the test generator does.

For each test, the script sets up a bundle where instructions 0 and 2 are no-operations, and instruction 1 is the instruction to test. The second position was chosen because it is the only slot that accepts all the instruction types defined by the standard (I, M, B, F; X being rather special). It is possible to specify 'encoding bits' in the binary encoded instruction for which all bit combinations will be tried, that is, the test generator is going to test all the possible ways to fill these bit fields. To avoid a combinatorial explosion, these bit fields should not exceed a certain size. It is also possible to specify bit fields that will take some specific values (i.e. parameters).

Simics, Simics module:
    - generating instruction to run
    - generating a random state
    - running the instruction
    - writing the state file
-> State file ->
Reference simulator:
    - reading the state file
    - executing the instruction
    - comparing the results
    - writing an eventual error report
-> Error report

Figure 6.1: The test process.

The test generator will then generate a test for each encoding it was asked to test, produce a random state for the processor, run the test in Simics and create a state file containing as much information as possible. This state file is directly read by the reference simulator, which performs the test and prints out all the differences found. This is actually done through a named pipe to which the Python script writes and from which the reference simulator reads. If a difference is found during or at the end of the execution, the corresponding state file is then saved for further analysis.

Including the file test-instruction-range.py provides a function called testInstructionRange(), which takes the following arguments:

• The file or named pipe used to write output.

• The instruction type for slot 1 (I=0, M=1, F=2, B=3).

• The list of bits that will be entirely tested [(start bit, length), ...].

• The start value for these bits. This value is used to progressively fill all the bit fields defined above (it must be a Python long integer).

• The end value. The start and end values control the length of the test, since all the values in between will be used.

• The list of bits that take specific values (arguments) [(start bit, length), ...].

• A list of values to put into the bit areas defined above [v1, v2, ...] (they must be Python long integers). The current version of the test generator tries all the possible combinations of these values with the defined arguments, which can lead to huge tests.

• A list of three values that are used as seeds for the random generator. These values are provided so that it is possible to run the test several times in exactly the same way.
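How the encoding fields and argument values listed above could combine into instruction words is sketched below in Python 3. This is not the code of test-instruction-range.py: the function names are hypothetical, the choice to put the lowest bits of the counter into the first field is an assumption, and so is the inclusive end value.

```python
from itertools import product


def spread(counter, fields):
    """Distribute the low bits of counter over fields, a list of (start_bit, length).

    Assumption: the first field receives the least significant bits.
    """
    word = 0
    for start, length in fields:
        word |= (counter & ((1 << length) - 1)) << start
        counter >>= length
    return word


def instruction_patterns(encoding_fields, start, end, arg_fields, arg_values):
    """Yield one instruction word per encoding value and argument combination."""
    for counter in range(start, end + 1):  # end value assumed inclusive
        base = spread(counter, encoding_fields)
        # Every argument field takes every listed value (cartesian product).
        for combo in product(arg_values, repeat=len(arg_fields)):
            word = base
            for (field_start, length), value in zip(arg_fields, combo):
                word |= (value & ((1 << length) - 1)) << field_start
            yield word
```

The cartesian product over the argument values is what makes the test size grow quickly: n values over k argument fields multiply the number of encodings by n to the power k.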


6.3 Results

The test generator has been run on the entire I-instruction encoding with a random state and valid parameters. Here is a transcript of the whole test.

First we create a named pipe in a temporary directory, then we run Simics and load the configuration for an IA-64 machine with a memory device.

    bash> mkfifo /tmp/npipe
    bash> simics-ia64-itanium
    simics> read-configuration gurra.conf
    simics> source test-instruction-range.py

We run the reference simulator on the named pipe and let it wait for Simics to begin generating tests. The -sf flag asks the reference simulator to save state files on error.

    bash> simics-compare /tmp/npipe -sf

The test is then launched on Simics.

    simics> @testInstructionRange("/tmp/npipe", 0, [(13,1), (27,14)], 0L, 0x7FFFL, [(6,7), (13,7), (20,7)], [4L, 5L], [1,1,1])

The arguments are the following: the output is written to /tmp/npipe, and the bit encoding areas are 1 bit at position 13 and all the last 14 bits of the instruction. Since all of the bits are covered, the test goes from 0 to 0x7FFF, or 2^15 − 1. The arguments are the three register indices that can be encoded; they take either the value 4 or 5, and all possibilities are tested (2^3 = 8). We will thus generate 8 × 2^15 ≈ 262 000 tests. From this the reference simulator produces 35 000 error files, which are reduced to 100 after a small Perl script has parsed the files to remove redundant instructions.
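The test count can be checked with a few lines of Python 3. The variable names are ours; the numbers come from the call shown in the transcript.

```python
encoding_bits = 1 + 14      # one bit at position 13 plus the last 14 bits
argument_fields = 3         # the three register index fields
argument_values = 2         # each field takes the value 4 or 5

encodings = 2 ** encoding_bits                      # counter values 0 .. 0x7FFF
combinations = argument_values ** argument_fields   # 2^3 register combinations
total_tests = encodings * combinations

print(hex(encodings - 1), combinations, total_tests)
```

The exact product is 262144 (that is, 2^18), which the text rounds to 262 000.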

Among the errors detected, here are some representative examples:

• mov.i ar4 = 4 raised an exception on the reference simulator but not on Simics. Such application registers are not available when the instruction is in an I-slot, and Simics did not check it.

• zxt1 r4 = r4 was not decoded by Simics because it is not implemented yet.

• mov pr.rot = immediate did not set the predicate registers correctly in Simics.

It is important to note that some errors are generated by the reference simulator itself, as it contains bugs of its own. Hopefully these bugs do not extend to the Intel pseudo-code, which remains the reference for the behavior of the instructions.

Several problems arise with this type of test:

• The number of errors is quite impressive, since the test can become highly redundant. A script to sort things out is definitely very helpful. However, it cannot be too smart without knowing a lot about the instruction set. The solution used here is to reduce errors to one per instruction. It is possible that the script will skip some errors, but they will hopefully be caught in the next test, when the previous errors are corrected.

• This test is pretty straightforward for arithmetic instructions, but can become difficult to set up and check when memory accesses are involved: randomly generated addresses tend to fall outside the address space. Moreover, it is not very smart to transfer the whole memory to compare its state before and after the instruction. A new scheme is needed where memory accesses are trapped and directed by the test generator, and memory values are generated and stored on the fly.
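The reduction to one error per instruction mentioned in the first point can be sketched as follows in Python 3. The actual filtering was done by a small Perl script working on the error files; the report structure used here (a dictionary with an "asm" entry whose first word is the mnemonic) is purely hypothetical.

```python
def one_error_per_instruction(reports):
    """Keep the first report seen for each mnemonic, preserving order.

    Each report is assumed to be a dict whose "asm" entry holds the
    disassembled instruction, starting with the mnemonic.
    """
    kept = {}
    for report in reports:
        mnemonic = report["asm"].split()[0]
        kept.setdefault(mnemonic, report)  # later duplicates are dropped
    return list(kept.values())
```

As noted above, such a filter can hide a second, distinct bug in the same instruction until the first one is fixed and the test is run again.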

Despite some problems, this test shows quite nicely what can be done with the reference simulator in terms of regression testing. A series of similar tests running all the time would probably catch most of the new errors that might be introduced in the Simics simulator. The test itself took 7 hours to run on a Pentium III 900 MHz, and most of this time was used by the very slow and unoptimized Python test generator code. The reference simulator itself was running only 10% of the time.

As an estimate of the quality of the results, 15 Simics-related bugs (not counting unimplemented instructions) were discovered during the test, while 2 bugs in the reference simulator produced false error reports. A second run some days later on a more recent version of Simics discovered bugs in newly implemented instructions and validated some of the previous bug corrections.


Chapter 7

Future work

There are several areas that deserve attention if the IA-64 reference simulator is to be further developed.

• The simulator was developed in about four months. It is not complete (parallel floating-point instructions are not implemented), nor is it a fully featured IA-64 simulator (some functionalities were kept to the minimum level imposed by the documentation). A better VHPT simulation is definitely needed to complete the memory management unit. To provide realistic prediction handling, the Advanced Load Address Table should be implemented. These two systems increase the need for the reference simulator to handle implementations different from its own in the comparison process, since they introduce a lot of implementation-dependent features.

• More testing is needed, above all for the virtual memory system, the RSE and the floating-point error settings.

• In the extraction process, typing variables by guessing could be extended and refined to simplify the work of the programmer. It seems difficult at this time to implement a better typing process due to the lack of information available to the parser.

• The simple test generator could be extended to all types of instructions by including memory transactions in the state file. This is not as simple as it may seem, since memory handling includes checking whether the written or read values are correct, but also whether no other values have been changed. Obviously one does not want to transmit whole memory snapshots in the state file, so some intermediate scheme is needed.

• The whole testing process needs to be automated even more. The error extraction has to be more user-friendly and provide more options. The error reports should be codified properly so that they can be parsed automatically, sorted or searched.

• The test generation could be done more intelligently by providing more information on the processor structure and testing more specific points than random instructions (for example by parsing the simgen file to get instruction encoding information).


Chapter 8

Conclusion

The goal of this thesis was to provide a new method to test Simics at the instruction-set level. A reference simulator for the IA-64 Itanium was implemented with the pseudo-code extracted from the Intel documentation. It includes a test framework to check the correctness of Simics when running instructions. A simple test generator was developed to validate the method and gave good results.

The quality of the extraction process for the pseudo-code is good, and the process is nearly completely automated. It would only take a few days to process a new version of the documentation and generate an up-to-date reference simulator, provided that the format of the pseudo-code does not change too much. In the same way, implementing an IA-64 processor different from the Itanium requires only a few changes in reasonably well delimited areas.

The extracted code represents 11000 lines of code, that is around 50% of the total code of the reference simulator. Measured against the few weeks spent on the development of the extraction process and the greater correctness achieved through it, this confirms the choice of a documentation-based simulator as a good system for a reference machine.

The comparison method has been shown to be useful and to produce interesting and exploitable results. In fact, the reference simulator is currently used in two different ways: as a reference decoder when Simics does not decode a specific instruction or decodes it wrongly, and as a reference simulator during the tests. It is also intended to be a regression testing tool once Simics IA-64 has reached a sufficient maturity level.

This master's thesis has validated the testing model that was proposed, which can be extended and used with better test generators during Simics IA-64 development to reduce the implementation time and improve the quality of the simulator.


Appendix A

State file format

The following text describes the format of the file saving the state of a processor before and after the execution of a sequence of instructions, as well as the events happening during the sequence, and other information that may be relevant.

The file is encoded in binary format. It contains several sections identified by a specific 32-bit section number (whose MSB is 1). The generic format of the file is the following:

• 4 bytes for the section number.

• For each field in the corresponding section:

– 4 bytes for the identifier (whose MSB is 0).

– 4 bytes for the binary length of the field in bytes (excluding the 8 mandatory bytes).

– the rest of the information is field dependent.

Since the field encodes its length, an unknown field can be skipped easily. Since a section begins with a magic number greater than 0x80000000, and a field begins with a magic number smaller than 0x80000000, an unknown section can be skipped by skipping all the fields it contains up to the next section. Sections should be provided in order if possible. Table A.1 gives the section numbers attributed.

Section Magic numberHeader 0xFF0000F0 or 0xF00000FFStartregisters 0x80000001Startmemory 0x80000003Info 0x80000010ASM 0x80000011Code 0x80000012Endregisters 0x80000002Endmemory 0x80000004Endof file 0xFFFFFFFF

Table A.1: Sections and magic numbers
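The skipping rule described above can be illustrated with a small reader. This is only a sketch under stated assumptions: the function names `walk_fields` and `fields_in` are hypothetical, the input is assumed to be the bytes that follow the header section, and the byte order is assumed to have already been determined from the header magic.

```python
import struct

END_OF_FILE = 0xFFFFFFFF
SECTION_FLAG = 0x80000000  # MSB set marks a section number


def walk_fields(body, order=">"):
    """Yield (section, identifier, payload) for every field in `body`.

    `body` is assumed to hold the bytes following the header section;
    `order` is the struct byte-order prefix (">" big, "<" little endian).
    """
    pos = 0
    section = None
    while pos + 4 <= len(body):
        (word,) = struct.unpack_from(order + "I", body, pos)
        if word & SECTION_FLAG:
            # MSB is 1: this word starts a new section.
            if word == END_OF_FILE:
                return
            section = word
            pos += 4
            continue
        # MSB is 0: a field identifier, followed by the field length.
        if pos + 8 > len(body):
            raise ValueError("truncated field header")
        (length,) = struct.unpack_from(order + "I", body, pos + 4)
        end = pos + 8 + length
        if end > len(body):
            raise ValueError("truncated field payload")
        yield section, word, body[pos + 8:end]
        pos = end


def fields_in(body, wanted, order=">"):
    """Keep only fields of the sections in `wanted`; others are skipped."""
    return [(s, i, p) for s, i, p in walk_fields(body, order) if s in wanted]
```

Because every field carries its own length, a reader that does not know a section number still advances correctly through it; `fields_in` simply discards what it does not recognize.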


Header: This section is composed of three 32-bit values:

• 0xFF 0x00 0x00 0xF0 for big endian, 0xF0 0x00 0x00 0xFF for little endian. This is both the magic number and the endianness detector. All subsequent values are encoded in the selected byte order.

• Architecture: the two highest bytes are the architecture, the two lowest are sub-versions (IA64 - Itanium = 0x1A64 0x0000).

• Software version (0x10000 is the first draft version).
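As an illustration of the endianness detection, the header could be read as follows. This is a minimal sketch: the function name `parse_header` and the returned tuple layout are hypothetical choices, and the header is assumed to occupy exactly 12 bytes (the magic number plus two 32-bit values).

```python
import struct

BIG_ENDIAN_MAGIC = bytes([0xFF, 0x00, 0x00, 0xF0])
LITTLE_ENDIAN_MAGIC = bytes([0xF0, 0x00, 0x00, 0xFF])


def parse_header(data):
    """Return (order, architecture, sub_version, software_version).

    `order` is a struct byte-order prefix: ">" for big endian,
    "<" for little endian, chosen from the first four bytes.
    """
    if len(data) < 12:
        raise ValueError("header needs three 32-bit values (12 bytes)")
    magic = data[:4]
    if magic == BIG_ENDIAN_MAGIC:
        order = ">"
    elif magic == LITTLE_ENDIAN_MAGIC:
        order = "<"
    else:
        raise ValueError("unrecognized header magic")
    # The two remaining values use the byte order selected by the magic.
    arch_word, software_version = struct.unpack_from(order + "II", data, 4)
    architecture = arch_word >> 16      # two highest bytes
    sub_version = arch_word & 0xFFFF    # two lowest bytes
    return order, architecture, sub_version, software_version
```

Comparing the raw bytes rather than an unpacked integer avoids having to guess a byte order before it is known.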

Start/End registers: This is a dump of register values, respectively at the beginning and at the end of the instruction run. A field is encoded as follows:

• 4 bytes for the register number, which is architecture dependent (and whose MSB is 0). The two highest bytes define the register bank, while the two lowest define the register number in the bank.

• 4 bytes for the length of the data.

• length bytes for the content.
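The split of the register identifier into bank and number can be sketched as below. The function name `decode_register_field` is hypothetical, the bank value used in the usage note is made up for illustration, and interpreting the content as an unsigned integer in the file's byte order is an assumption, since the text does not specify how register contents are laid out.

```python
def decode_register_field(identifier, payload, order=">"):
    """Return (bank, number, value) for one register field.

    `identifier` is the 32-bit register number of the field, `payload`
    its content bytes, and `order` the struct-style byte-order prefix.
    """
    if identifier & 0x80000000:
        raise ValueError("a field identifier must have its MSB clear")
    bank = identifier >> 16        # two highest bytes: register bank
    number = identifier & 0xFFFF   # two lowest bytes: index in the bank
    # Assumption: the content is an unsigned integer in the file byte order.
    byteorder = "big" if order == ">" else "little"
    value = int.from_bytes(payload, byteorder)
    return bank, number, value
```

For example, an identifier of 0x0001000C with an 8-byte payload would be read as register 12 of bank 1; which bank number corresponds to which register file is architecture dependent and not defined here.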

Start/End memory: This is a dump of several memory areas of various sizes. A field is encoded as follows:

• 4 bytes to indicate a memory page number (if needed).

• 4 bytes for the size of the data (including information like physical address, size of the area, encoding mode, etc.). The data encoding is implementation specific.

Info: This is per-executed-instruction information. Each info field is identified by the number of the instruction that generated it during the run. There may be more than one info field per instruction. The fields are sorted by increasing instruction number. The format of a field is the following:

• 4 bytes to indicate the instruction number in the running sequence.

• 4 bytes for the size of the data.

ASM: This is the per-executed-instruction disassembled code. The field encoding is defined as:

• 4 bytes to indicate the instruction number in the running sequence.

• 4 bytes for the size of the string. The string should end with a 0x00 byte for C-like languages.
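A possible encoding and decoding of one ASM field is sketched below. The function names are hypothetical, the assembly text is assumed to be ASCII, and the stored size is assumed to include the terminating 0x00 byte, which the description above leaves open.

```python
import struct


def encode_asm_field(instruction_number, text, order=">"):
    """Build one ASM field: instruction number, size, NUL-terminated text."""
    if instruction_number & 0x80000000:
        raise ValueError("instruction number must have its MSB clear")
    raw = text.encode("ascii") + b"\x00"  # terminator for C-like readers
    return struct.pack(order + "II", instruction_number, len(raw)) + raw


def decode_asm_field(data, offset=0, order=">"):
    """Return (instruction_number, text, next_offset) for the field at offset."""
    number, size = struct.unpack_from(order + "II", data, offset)
    start = offset + 8
    raw = data[start:start + size]
    if len(raw) != size:
        raise ValueError("truncated ASM field")
    # Keep only the bytes before the first 0x00 terminator.
    text = raw.split(b"\x00", 1)[0].decode("ascii")
    return number, text, start + size
```

Returning the next offset lets a caller decode consecutive ASM fields in a loop without recomputing field boundaries.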

Code: This is the per-executed-instruction binary code. This section can be used to provide the code to execute instead of using a memory area. The format follows the usual field layout.


Bibliography

[BAS94] S. Bashford, The MIMOLA Language, version 4.1, September 1994

[COO93] Todd A. Cook, Paul D. Franzon, Ed A. Harcourt, Thomas K. Miller III, System-Level Specification of Instruction Sets, 1993

[COO94] Todd A. Cook, Ed Harcourt, A Functional Specification Language for Instruction Set Architectures, January 1994

[DUE68] Duley, J. R., and Dietmeyer, D. L., A Digital System Design Language (DDL), IEEE Transactions on Computers, C-17(9), 1968, p. 850

[FAU95] A. Fauth, J. Van Praet, M. Freericks, Describing Instruction Set Processors Using nML, March 1995

[IA64-v1] Intel Corp., IA-64 Architecture Software Developer’s Manual, Volume 1: IA-64 Application Architecture, Revision 1.1, July 2000

[IA64-v2] Intel Corp., IA-64 Architecture Software Developer’s Manual, Volume 2: IA-64 System Architecture, Revision 1.1, July 2000

[IA64-v3] Intel Corp., IA-64 Architecture Software Developer’s Manual, Volume 3: IA-64 Instruction Set Reference, Revision 1.1, July 2000

[IA64-v4] Intel Corp., IA-64 Architecture Software Developer’s Manual, Volume 4: Itanium Processor Programmer’s Guide, Revision 1.1, July 2000

[IA64-Errata] Intel Corp., IA-64 Architecture Software Developer’s Manual, Specification Update, Revision 3.0, December 2000

[IA64-Asm] Intel Corp., Itanium Architecture Assembly Language Reference Guide, October 2000

[IA64-FP] Intel Corp., Itanium Processor Floating-point Software Assistance and Floating-point Exception Handling, January 2000

[IEEE-std] The Institute of Electrical and Electronics Engineers, Inc., IEEE Standard for Binary Floating-Point Arithmetic, IEEE Std 754-1985, 1985, 1990

[IEEE-tut] Sun Microsystems, Inc., Numerical Computation Guide, http://docs.sun.com/htmlcoll/coll.648.2/iso-8859-1/NUMCOMPGD/ncgTOC.html

[LAR97] Fredrik Larsson, Generating Efficient Simulators from a Specification Language, January 1997


[PDF99] Adobe Systems Incorporated, Portable Document Format Reference Manual, version 1.3, March 11, 1999

[PEE99] Stefan Pees, Andreas Hoffmann, Vojin Zivojnovic, Heinrich Meyr, LISA – Machine Description Language for Cycle-Accurate Models of Programmable DSP Architectures, 1999

[RAJ98] V. Rajesh, A Generic Approach to Performance Modeling and Its Application to Simulator Generator, August 1998

[RAM97a] Norman Ramsey, Mary F. Fernandez, Specifying Representations of Machine Instructions, May 1997

[RAM97b] Norman Ramsey, Mary Fernandez, Automatic Checking of Instruction Specifications, 1997

[SHA86] Moe Shahdad, An overview of VHDL language and technology

[SIE82] Daniel P. Siewiorek, C. Gordon Bell, Allen Newell, Computer Structures: Principles and Examples, 1982, ISBN 0-07-057302-6

[SimicsUG] Virtutech AB, Simics User Guide, April 2001

[ZIV96] Vojin Zivojnovic, Stefan Pees, Heinrich Meyr, LISA – Machine Description Language and Generic Machine Model for HW/SW Co-Design, October 1996

[1] Sim-nML homepage, http://www.cse.iitk.ac.in/sim-nml/index.cgi

[2] Verilog FAQ, http://www.angelfire.com/in/verilogfaq/

[3] ghostscript homepage, http://www.cs.wisc.edu/~ghost/

[4] xpdf homepage, http://www.foolabs.com/xpdf/

[5] indent homepage, http://www.gnu.org/software/indent/indent.html

[6] softfloat homepage, http://www.cs.berkeley.edu/~jhauser/arithmetic/softfloat.html
