Linux Insides
Table of Contents

1. Introduction
2. Booting
    i. From bootloader to kernel
    ii. First steps in the kernel setup code
    iii. Video mode initialization and transition to protected mode
    iv. Transition to 64-bit mode
    v. Kernel decompression
3. Initialization
    i. First steps in the kernel
    ii. Early interrupts handler
    iii. Last preparations before the kernel entry point
    iv. Kernel entry point
    v. Continue architecture-specific boot-time initializations
    vi. Architecture-specific initializations, again...
    vii. End of the architecture-specific initializations, almost...
    viii. Scheduler initialization
4. Memory management
    i. Memblock
    ii. Fixmaps and ioremap
5. Interrupts
6. vsyscalls and vdso
7. SMP
8. Concepts
    i. Per-CPU variables
    ii. Cpumasks
9. Data Structures in the Linux Kernel
    i. Doubly linked list
10. Theory
    i. Paging
    ii. Elf64
    iii. CPUID
    iv. MSR
11. Initial ram disk
    i. initrd
12. Misc
    i. Kernel building and installation
    ii. Write and Submit your first Linux kernel Patch
    iii. Data types in the kernel
13. Useful links
14. Contributors
LinuxInside
linux-internals

A series of posts about the Linux kernel and its insides.

The goal is simple: to share my modest knowledge about the internals of the Linux kernel and help people who are interested in Linux kernel internals and other low-level subject matter.

Questions/Suggestions: Feel free to ask questions or make suggestions by pinging me on twitter @0xAX, adding an issue or just dropping me an email.

Support

If you like linux-insides you can support me.

Contributions

Feel free to create issues or pull requests if you find any issues or my English is poor.

Please read CONTRIBUTING.md before pushing any changes.

Author

@0xAX
Kernel boot process

This chapter describes the Linux kernel boot process. You will see here a couple of posts which describe the full cycle of the kernel loading process:

* From the bootloader to kernel - describes all stages from turning on the computer to before the first instruction of the kernel;
* First steps in the kernel setup code - describes the first steps in the kernel setup code. You will see heap initialization, querying of different parameters like EDD, IST, etc.;
* Video mode initialization and transition to protected mode - describes video mode initialization in the kernel setup code and the transition to protected mode;
* Transition to 64-bit mode - describes preparation for the transition into 64-bit mode and the transition into it;
* Kernel Decompression - describes preparation before kernel decompression and the decompression itself.
Kernel booting process. Part 1.

From the bootloader to kernel

If you have read my previous blog posts, you can see that some time ago I started to get involved with low-level programming. I wrote some posts about x86_64 assembly programming for Linux. At the same time, I started to dive into the Linux source code. It is very interesting for me to understand how low-level things work, how programs run on my computer, how they are located in memory, how the kernel manages processes and memory, how the network stack works at a low level and many, many other things. I decided to write yet another series of posts about the Linux kernel for x86_64.

Note that I'm not a professional kernel hacker, and I don't write code for the kernel at work. It's just a hobby. I just like low-level stuff, and it is interesting for me to see how these things work. So if you notice anything confusing, or if you have any questions/remarks, ping me on twitter 0xAX, drop me an email or just create an issue. I appreciate it. All posts will also be accessible at linux-insides and if you find something wrong with my English or post content, feel free to send a pull request.

Note that this isn't official documentation, just learning and sharing knowledge.

Required knowledge

* Understanding C code
* Understanding assembly code (AT&T syntax)

Anyway, if you have just started to learn some tools, I will try to explain some parts during this and the following posts. Ok, the little introduction is finished and now we can start to dive into the kernel and low-level stuff.

All code is actual for kernel 3.18; if there are changes, I will update the posts.

Magic power button, what's next?

Despite the fact that this is a series of posts about the Linux kernel, we will not start from kernel code (at least in this paragraph). Ok, you pressed the magic power button on your laptop or desktop computer and it started to work. After the motherboard sends a signal to the power supply, the power supply provides the computer with the proper amount of electricity. Once the motherboard receives the power good signal, it tries to start the CPU. The CPU resets all leftover data in its registers and sets up predefined values for every register.

80386 and later CPUs define the following predefined data in CPU registers after the computer resets:

    IP          0xfff0
    CS selector 0xf000
    CS base     0xffff0000

The processor starts working in real mode now and we need to make a little retreat to understand memory segmentation in this mode. Real mode is supported in all x86-compatible processors, from the 8086 to modern Intel 64-bit CPUs. The 8086 processor had a 20-bit address bus, which means that it could work with a 0-2^20 byte address space (1 megabyte). But it only had 16-bit registers, and with 16-bit registers the maximum address is 2^16 - 1 or 0xffff (64 kilobytes). Memory segmentation was used to make use of all of the address space. All memory was divided into small, fixed-size segments of 65536 bytes, or 64 KB. Since we cannot address memory beyond 64 KB with 16-bit registers, another method to do it was devised. An address consists of two parts: the beginning address of the segment and the offset from the beginning of this segment. To get a physical address in memory, we need to multiply the segment part by 16 and add the offset part:
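    PhysicalAddress = Segment * 16 + Offset

For example, if CS:IP is 0x2000:0x0010, the corresponding physical address will be:

    >>> hex((0x2000 << 4) + 0x0010)
    '0x20010'

With the largest segment and offset values, 0xffff:0xffff, it will be:

    >>> hex((0xffff << 4) + 0xffff)
    '0x10ffef'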
-
Now the BIOS has started to work. After initializing and checking the hardware, it needs to find a bootable device. A boot order is stored in the BIOS configuration, controlling which devices the BIOS attempts to boot from. In the case of attempting to boot from a hard drive, the BIOS tries to find a boot sector. On hard drives partitioned with an MBR partition layout, the boot sector is stored in the first 446 bytes of the first sector (512 bytes). The final two bytes of the first sector are 0x55 and 0xaa, which signal to the BIOS that the device is bootable. For example:
    ;
    ; Note: this example is written in Intel assembly syntax
    ;
    [BITS 16]
    [ORG 0x7c00]

    boot:
        mov al, '!'
        mov ah, 0x0e
        mov bh, 0x00
        mov bl, 0x07

        int 0x10
        jmp $

    times 510-($-$$) db 0

    db 0x55
    db 0xaa
Build and run it with:

    nasm -f bin boot.nasm && qemu-system-x86_64 boot

This will instruct QEMU to use the boot binary we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (the origin is set to 0x7c00, and we end with the magic sequence), QEMU will treat the binary as the master boot record of a disk image.

When QEMU starts, we will see the ! symbol printed on the screen.
In this example we can see that this code will be executed in 16-bit real mode and will start at 0x7c00 in memory. After the start it calls the 0x10 interrupt, which just prints the ! symbol. It fills the rest of the first 510 bytes with zeros and finishes with the two magic bytes 0x55 and 0xaa.
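The layout described above (code, zero padding up to byte 510, then the two signature bytes) can be sketched in Python. This is only an illustration of the sector format: the function names are made up for this example, and the two-byte payload is a placeholder, not the assembled boot code from the post.

```python
SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xaa"  # last two bytes of a bootable first sector


def make_boot_sector(code: bytes) -> bytes:
    """Pad `code` with zeros to 510 bytes and append the 0x55 0xaa signature."""
    room = SECTOR_SIZE - len(BOOT_SIGNATURE)
    if len(code) > room:
        raise ValueError("code does not fit in a boot sector")
    return code + bytes(room - len(code)) + BOOT_SIGNATURE


def is_bootable(sector: bytes) -> bool:
    """Check the sector size and the signature bytes at offsets 510 and 511."""
    return len(sector) == SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


# Placeholder payload: 0xeb 0xfe encodes "jmp $" (an endless loop).
sector = make_boot_sector(b"\xeb\xfe")
print(len(sector), is_bootable(sector))
```

The same check is what makes a sector of plain zeros non-bootable: its last two bytes are 0x00 0x00 rather than 0x55 0xaa.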
You can also see a binary dump of it with the objdump utility:

    nasm -f bin boot.nasm
    objdump -D -b binary -mi386 -Maddr16,data16,intel boot

A real-world boot sector has code for continuing the boot process and the partition table... instead of a bunch of 0's and an exclamation point :) Ok, so, from this moment the BIOS hands control to the bootloader and we can go ahead.
NOTE: as you can read above, the CPU is in real mode. In real mode, the physical address in memory is calculated as follows:

    PhysicalAddress = Segment * 16 + Offset

as I wrote above. But we have only 16-bit general purpose registers. The maximum value of a 16-bit register is 0xffff, so if we take the biggest values, it will be:

    >>> hex((0xffff * 16) + 0xffff)
    '0x10ffef'

where 0x10ffef is equal to 1 MB + 64 KB - 16 bytes. But the 8086 processor, which was the first processor with real mode, had a 20-bit address line, and 2^20 = 1048576 bytes is 1 MB, so the amount of memory actually available is 1 MB.

The general real mode memory map is:

    0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
    0x00000400 - 0x000004FF - BIOS Data Area
    0x00000500 - 0x00007BFF - Unused
    0x00007C00 - 0x00007DFF - Our Bootloader
    0x00007E00 - 0x0009FFFF - Unused
    0x000A0000 - 0x000BFFFF - Video RAM (VRAM) Memory
    0x000B0000 - 0x000B7FFF - Monochrome Video Memory
    0x000B8000 - 0x000BFFFF - Color Video Memory
    0x000C0000 - 0x000C7FFF - Video ROM BIOS
    0x000C8000 - 0x000EFFFF - BIOS Shadow Area
    0x000F0000 - 0x000FFFFF - System BIOS
But stop, at the beginning of the post I wrote that the first instruction executed by the CPU is located at address 0xfffffff0, which is much bigger than 0xfffff (1 MB). How can the CPU access it in real mode? The answer can be found in the coreboot documentation:

    0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space

At the start of execution the BIOS is not in RAM, it is located in ROM.
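As a small worked check, the address 0xfffffff0 can be derived from the reset values listed earlier (CS base 0xffff0000, IP 0xfff0). The sketch below assumes the first fetch address is simply the hidden CS base plus IP, and it uses the ROM window quoted from the coreboot documentation; the constant and function names are chosen for this example.

```python
CS_BASE = 0xFFFF0000    # hidden CS base after reset
IP = 0xFFF0             # instruction pointer after reset

ROM_START = 0xFFFE0000  # 128 KB ROM window from the coreboot documentation
ROM_END = 0xFFFFFFFF


def reset_vector(cs_base: int = CS_BASE, ip: int = IP) -> int:
    """Assumption: the first instruction is fetched from CS base + IP."""
    return cs_base + ip


address = reset_vector()
print(hex(address))                     # address of the first instruction
print(ROM_START <= address <= ROM_END)  # whether it falls inside the ROM window
print((1 << 32) - address)              # bytes left below the 4 GB boundary
```

So the first instruction sits 16 bytes below the top of the 32-bit address space, inside the region where the ROM is mapped.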
Bootloader

There are a number of bootloaders which can boot Linux, such as GRUB 2 and syslinux. The Linux kernel has a Boot protocol which specifies the requirements for bootloaders to implement Linux support. This example will describe GRUB 2.
Now that the BIOS has chosen a boot device and transferred control to the boot sector code, execution starts from boot.img. This code is very simple due to the limited amount of space available, and contains a pointer that it uses to jump to the location of GRUB 2's core image. The core image begins with diskboot.img, which is usually stored immediately after the first sector in the unused space before the first partition. The above code loads the rest of the core image into memory, which contains GRUB 2's kernel and drivers for handling filesystems. After loading the rest of the core image, it executes grub_main.

grub_main initializes the console, gets the base address for modules, sets the root device, loads/parses the grub configuration file, loads modules, etc. At the end of execution, grub_main moves grub to normal mode. grub_normal_execute (from grub-core/normal/main.c) completes the last preparations and shows a menu for selecting an operating system. When we select one of the grub menu entries, grub_menu_execute_entry runs, which executes the grub boot command. It starts to boot the operating system.

As we can read in the kernel boot protocol, the bootloader must read and fill some fields of the kernel setup header, which starts at offset 0x01f1 from the kernel setup code. The kernel header in arch/x86/boot/header.S starts from:

        .globl hdr
    hdr:
        setup_sects: .byte 0
        root_flags:  .word ROOT_RDONLY
        syssize:     .long 0
        ram_size:    .word 0
        vid_mode:    .word SVGA_MODE
        root_dev:    .word 0
        boot_flag:   .word 0xAA55

The bootloader must fill this and the rest of the headers (only those marked as write in the Linux boot protocol) with values which it either got from the command line or calculated. We will not go through the description of all fields of the kernel setup header here; we will get back to them when the kernel uses them. Anyway, you can find the description of any field in the boot protocol.
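To make the header layout concrete, here is a hedged Python sketch that reads the seven fields shown above from a byte buffer at offset 0x1f1. The field sizes follow the .byte/.word/.long directives in the listing; the buffer is synthetic (not a real kernel image), and the values written into it are arbitrary placeholders apart from boot_flag.

```python
import struct

HEADER_OFFSET = 0x1F1
# setup_sects: u8, root_flags: u16, syssize: u32,
# ram_size, vid_mode, root_dev, boot_flag: u16 each (little-endian, no padding)
HEADER_FORMAT = "<BHIHHHH"
FIELDS = ("setup_sects", "root_flags", "syssize", "ram_size",
          "vid_mode", "root_dev", "boot_flag")


def parse_setup_header(image: bytes) -> dict:
    """Unpack the first fields of the setup header from `image`."""
    values = struct.unpack_from(HEADER_FORMAT, image, HEADER_OFFSET)
    return dict(zip(FIELDS, values))


# Synthetic image: placeholder values, only boot_flag is the documented 0xAA55.
image = bytearray(1024)
struct.pack_into(HEADER_FORMAT, image, HEADER_OFFSET, 4, 1, 0, 0, 0xFFFF, 0, 0xAA55)

header = parse_setup_header(bytes(image))
print(header)
print(hex(HEADER_OFFSET + struct.calcsize(HEADER_FORMAT)))
```

These seven fields occupy 15 bytes, so they end exactly at offset 0x200, and boot_flag lands in the last two bytes of the first 512-byte sector.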
As we can see in the kernel boot protocol, the memory map will be the following after kernel loading:

             | Protected-mode kernel  |
    100000   +------------------------+
             | I/O memory hole        |
    0A0000   +------------------------+
             | Reserved for BIOS      | Leave as much as possible unused
             ~                        ~
             | Command line           | (Can also be below the X+10000 mark)
    X+10000  +------------------------+
             | Stack/heap             | For use by the kernel real-mode code.
    X+08000  +------------------------+
             | Kernel setup           | The kernel real-mode code.
             | Kernel boot sector     | The kernel legacy boot sector.
           X +------------------------+
             | Boot loader            |

So after the bootloader has transferred control to the kernel, it starts somewhere at:

    0x1000 + X + sizeof(KernelBootSector) + 1

where X is the address at which the kernel boot sector is loaded. In my case X is 0x10000, as can be seen in a memory dump.
Ok, the bootloader has loaded the Linux kernel into memory, filled the header fields and jumped to it. Now we can move directly to the kernel setup code.

Start of kernel setup

Finally we are in the kernel. Technically the kernel hasn't run yet; first of all we need to set up the kernel, memory manager, process manager, etc. Kernel setup execution starts from arch/x86/boot/header.S at _start. It is a little strange at first look, as there are many instructions before it.

A long time ago Linux had its own bootloader, but now if you run for example:

    qemu-system-x86_64 vmlinuz-3.18-generic

you will see only an error message.

Actually header.S starts with MZ, error message printing and the following PE header:

    #ifdef CONFIG_EFI_STUB
    # "MZ", MS-DOS header
    .byte 0x4d
    .byte 0x5a
    #endif
    ...
    ...
    ...
    pe_header:
        .ascii "PE"
        .word 0
It needs this for loading the operating system with UEFI. Here we will not see how it works (we will look into it in the next parts).

So the actual kernel setup entry point is:

    // header.S line 292
    .globl _start
    _start:

The bootloader (grub2 and others) knows about this point (offset 0x200 from MZ) and makes a jump directly to it, despite the fact that header.S starts from the .bstext section which prints an error message:

    //
    // arch/x86/boot/setup.ld
    //
    . = 0;                    // current position
    .bstext : { *(.bstext) }  // put .bstext section to position 0
    .bsdata : { *(.bsdata) }

So the kernel setup entry point is:

    .globl _start
    _start:
        .byte 0xeb
        .byte start_of_setup-1f
    1:
        //
        // rest of the header
        //

Here we can see a jmp instruction opcode, 0xeb, to the start_of_setup-1f point. The Nf notation means the following: 2f refers to the next local 2: label. In our case it is label 1, which goes right after the jump. It contains the rest of the setup header, and right after the setup header we can see the .entrytext section which starts at the start_of_setup label.
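The two bytes at _start form an x86 short jump: the opcode 0xeb followed by a signed 8-bit displacement, counted from the address of the next instruction (which is exactly where label 1 sits). A minimal Python sketch of that decoding follows; the displacement 0x10 used in the example is a made-up value, not the one found in a real kernel image.

```python
def short_jump_target(code: bytes, address: int) -> int:
    """Decode a two-byte short jump (0xeb, rel8) located at `address`."""
    if len(code) < 2 or code[0] != 0xEB:
        raise ValueError("not a short jmp")
    rel8 = int.from_bytes(code[1:2], "little", signed=True)
    # The displacement is relative to the instruction that follows the jump.
    return address + 2 + rel8


# Hypothetical displacement 0x10 for a jump placed at offset 0x200.
print(hex(short_jump_target(b"\xeb\x10", 0x200)))

# "jmp $" from the boot sector example: displacement -2 jumps back to itself.
print(hex(short_jump_target(b"\xeb\xfe", 0x7C00)))
```

This is why the header can write the displacement as start_of_setup-1f: subtracting the address of label 1 gives the distance from the end of the jump instruction to its target.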
Actually this is the first code which runs (aside from the previous jump instruction, of course). After the kernel setup got control from the bootloader, the first jmp instruction is located at offset 0x200 (after the first 512 bytes) from the start of the kernel real mode code. This we can read in the Linux kernel boot protocol and also see in the grub2 source code:

    state.gs = state.fs = state.es = state.ds = state.ss = segment;
    state.cs = segment + 0x20;

It means that segment registers will have the following values after kernel setup starts to work:

    gs = fs = es = ds = ss = 0x1000
    cs = 0x1020
in my case, where the kernel is loaded at 0x10000.
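The relationship between the load address, the segment values and the 0x200 entry offset can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name is invented here, and starting execution at offset 0 within cs is an assumption made for this example.

```python
def setup_segments(load_address: int) -> dict:
    """Mirror the grub2 assignment: data segments = segment, cs = segment + 0x20."""
    segment = load_address >> 4  # real-mode segment value for the load address
    regs = {name: segment for name in ("ds", "es", "fs", "gs", "ss")}
    regs["cs"] = segment + 0x20
    return regs


regs = setup_segments(0x10000)
entry = (regs["cs"] << 4) + 0  # assumption: execution starts at cs:0
print(hex(regs["ds"]), hex(regs["cs"]), hex(entry))
print(hex(entry - 0x10000))    # offset of the entry point from the load address
```

Adding 0x20 to the segment moves the physical address by 0x20 * 16 = 0x200 bytes, which is exactly the offset of _start.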
After the jump to start_of_setup, the kernel needs to do the following things:

* Be sure that all values of all segment registers are equal
* Set up a correct stack if needed
* Set up bss
* Jump to C code at main.c

Let's look at the implementation.

Segment registers align

First of all it ensures that the ds and es segment registers point to the same address and clears the direction flag with the cld instruction:

    movw %ds, %ax
    movw %ax, %es
    cld

As I wrote above, grub2 loads the kernel setup code at address 0x10000 and sets cs to 0x1020, because execution doesn't start from the start of the file, but from the jump at:

    _start:
        .byte 0xeb
        .byte start_of_setup-1f

which is at a 512-byte offset from the 4d 5a bytes. So cs also needs to be aligned from 0x10200 to 0x10000, like all other segment registers:

    pushw %ds
    pushw $6f
    lretw

This pushes the ds value and the address of label 6 onto the stack and executes the lretw instruction. When lretw is executed, it loads the address of label 6 into the instruction pointer register and cs with the value of ds. After this, ds and cs have the same values.

Stack setup

Actually, almost all of the setup code is preparation for the C language environment in real mode. The next step is checking the ss register value and making a correct stack if ss is wrong:

    movw %ss, %dx
    cmpw %ax, %dx
    movw %sp, %dx
    je 2f

Generally, there can be 3 different cases:

* ss has the valid value 0x1000 (like all other segment registers besides cs)
* ss is invalid and the CAN_USE_HEAP flag is set (see below)
* ss is invalid and the CAN_USE_HEAP flag is not set (see below)
Let's look at all of these cases:

1. ss has a correct value (0x1000). In this case we go to label 2:

    2:  andw $~3, %dx
        jnz 3f
        movw $0xfffc, %dx
    3:  movw %ax, %ss
        movzwl %dx, %esp
        sti

Here we can see dx (which contains the sp given by the bootloader) being aligned to 4 bytes and checked for zero. If it is zero, we put 0xfffc (the 4-byte aligned address before the maximum segment size of 64 KB) into dx. If it is not zero, we continue to use the sp given by the bootloader (0xf7f4 in my case). After this we put the ax value into ss, which stores the correct segment value 0x1000, and set up a correct sp. After this we have a correct stack.
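The alignment and zero check on dx can be modelled in a few lines of Python. This is a simplified sketch of just the andw/jnz/movw sequence, working on a plain 16-bit integer; the function name is invented for the example.

```python
def normalize_sp(sp: int) -> int:
    """Model of: andw $~3, %dx ; jnz 3f ; movw $0xfffc, %dx."""
    sp &= ~3 & 0xFFFF   # clear the two low bits: align down to 4 bytes
    if sp == 0:         # a zero stack pointer is replaced by 0xfffc
        sp = 0xFFFC
    return sp


for value in (0xF7F4, 0xF7F7, 0x0003, 0x0000):
    print(hex(value), "->", hex(normalize_sp(value)))
```

An already aligned value such as 0xf7f4 passes through unchanged, while anything that becomes zero after masking falls back to 0xfffc, the last 4-byte aligned address in a 64 KB segment.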
2. In the second case (ss != ds), first of all the value of _end (the address of the end of the setup code) is put into dx. Then the loadflags header field is checked with the testb instruction to see whether we can use the heap or not. loadflags is a bitmask header field which is defined as:

    #define LOADED_HIGH (1
-
3. In the last case, when CAN_USE_HEAP is not set, we just use a minimal stack from _end to _end + STACK_SIZE.
Bss setup

The last two steps that need to happen before we can jump to the main C code are setting up the bss area and checking the "magic" signature. First, the signature check:

    cmpl $0x5a5aaa55, setup_sig
    jne setup_bad

This simply consists of comparing setup_sig against the magic number 0x5a5aaa55; if they are not equal, a fatal error is reported.

If the magic number matches, then knowing we have a set of correct segment registers and a stack, we only need to set up the bss section before jumping into the C code.

The bss section is used for storing statically allocated, uninitialized data. Linux carefully ensures this area of memory is first blanked, using the following code:

    movw $__bss_start, %di
    movw $_end+3, %cx
    xorl %eax, %eax
    subw %di, %cx
    shrw $2, %cx
    rep; stosl
First of all the __bss_start address is moved into di, and the _end + 3 address (the +3 rounds up to a multiple of 4 bytes) is moved into cx. The eax register is cleared (using an xor instruction), and the bss section size (cx - di) is calculated and put into cx. Then cx is divided by four (the size of a 32-bit long word), and the stosl instruction is repeated, storing the value of eax (zero) at the address pointed to by di and automatically increasing di by four (this continues until cx reaches zero). The net effect of this code is that zeros are written through all words in memory from __bss_start to _end.
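The same steps can be simulated in Python on a bytearray standing in for memory. This is a sketch of the arithmetic only: the addresses in the example are arbitrary, and the function name is invented here.

```python
def clear_bss(memory: bytearray, bss_start: int, end: int) -> int:
    """Zero memory in 4-byte steps, as the rep; stosl loop does."""
    count = ((end + 3) - bss_start) >> 2  # cx = (_end + 3 - __bss_start) / 4
    di = bss_start
    for _ in range(count):                # rep; stosl with eax = 0
        memory[di:di + 4] = b"\x00\x00\x00\x00"
        di += 4                           # stosl advances di by four
    return count


# Arbitrary example: a 10-byte bss region at offsets 8..17 in 32 bytes of 0xff.
memory = bytearray(b"\xff" * 32)
stores = clear_bss(memory, bss_start=8, end=18)
print(stores, memory.hex())
```

Because of the +3, a region whose size is not a multiple of four is still fully covered: here three 4-byte stores clear offsets 8 to 19, slightly past the 10-byte region.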
Jump to main

That's all, we have the stack and bss, and now we can jump to the main C function:

    calll main

which is in arch/x86/boot/main.c. What will be there? We will see it in the next part.

Conclusion

This is the end of the first part about Linux kernel internals. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. In the next part we will see the first C code which executes in the Linux kernel setup, the implementation of memory routines such as memset and memcpy, the earlyprintk implementation, early console initialization and much more.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

* Intel 80386 programmer's reference manual 1986
* Minimal Boot Loader for Intel Architecture
* 8086
* 80386
* Reset vector
* Real mode
* Linux kernel boot protocol
* CoreBoot developer manual
* Ralf Brown's Interrupt List
* Power supply
* Power good signal
Kernel booting process. Part 2.

First steps in the kernel setup

We started to dive into Linux kernel internals in the previous part and saw the initial part of the kernel setup code. We stopped at the first call of the main function (which is the first function written in C) from arch/x86/boot/main.c. Here we will continue to research the kernel setup code and see what protected mode is, some preparation for the transition into it, the heap and console initialization, memory detection and much, much more. So... Let's go ahead.

Protected mode

Before we can move to the native Intel64 Long mode, the kernel must switch the CPU into protected mode. What is protected mode? Protected mode was first added to the x86 architecture in 1982 and was the main mode of Intel processors from the 80286 processor until Intel 64 and long mode. The main reason to move away from real mode is that there is very limited access to RAM. As you may remember from the previous part, there are only 2^20 bytes or 1 megabyte, sometimes even only 640 kilobytes.

Protected mode brought many changes, but the main one is different memory management. The 20-bit address bus was replaced with a 32-bit address bus. It allows access to 4 gigabytes of physical address space. Paging support was also added, which we will see in the next parts.

Memory management in protected mode is divided into two almost independent parts:

* Segmentation
* Paging

Here we will only look at segmentation. As you can read in the previous part, addresses consist of two parts in real mode:

* Base address of the segment
* Offset from the segment base

And we can get the physical address if we know these two parts by:

    PhysicalAddress = Segment * 16 + Offset

Memory segmentation was completely redone in protected mode. There are no 64 kilobyte fixed-size segments. All memory segments are described by the Global Descriptor Table (GDT) instead of segment registers. The GDT is a structure which resides in memory. There is no fixed place for it in memory, but its address is stored in the special GDTR register. Later we will see the GDT loading in the Linux kernel code. There will be an operation for loading it, something like:

    lgdt gdt

where the lgdt instruction loads the base address and limit of the global descriptor table into the GDTR register. GDTR is a 48-bit register and consists of two parts:

* size (16 bits) of the global descriptor table;
* address (32 bits) of the global descriptor table.
The global descriptor table contains descriptors which describe memory segments. Every descriptor is 64 bits. The general scheme of a descriptor is:

    31          24        19      16              7            0
    ------------------------------------------------------------
    |             | |B| |A|       | |   | |0|E|W|A|            |
    | BASE 31..24 |G|/|L|V| LIMIT |P|DPL|S|  TYPE | BASE 23..16| 4
    |             | |D| |L| 19..16| |   | |1|C|R|A|            |
    ------------------------------------------------------------
    |                             |                            |
    |        BASE 15..0           |       LIMIT 15..0          | 0
    |                             |                            |
    ------------------------------------------------------------

Don't worry, I know that it looks a little scary after real mode, but it's easy. Let's look at it more closely:

1. Limit (20 bits: bits 0-15 and 48-51) defines length_of_segment - 1. It depends on the G bit.

    * if G (bit 55) is 0 and the segment limit is 0, the size of the segment is 1 byte
    * if G is 1 and the segment limit is 0, the size of the segment is 4096 bytes
    * if G is 0 and the segment limit is 0xfffff, the size of the segment is 1 megabyte
    * if G is 1 and the segment limit is 0xfffff, the size of the segment is 4 gigabytes

2. Base (bits 16-31, 32-39 and 56-63) defines the physical address of the segment's start.

3. Type/Attribute (bits 40-47) defines the type of segment and the kinds of access to it. The S flag specifies the descriptor type: if S is 0 then this segment is a system segment, whereas if S is 1 then this is a code or data segment (stack segments are data segments which must be read/write segments). If the segment is a code or data segment, it can be one of the following access types:

    |           Type Field        | Descriptor Type | Description
    |-----------------------------|-----------------|------------------
    | Decimal                     |                 |
    |             0    E    W   A |                 |
    | 0           0    0    0   0 | Data            | Read-Only
    | 1           0    0    0   1 | Data            | Read-Only, accessed
    | 2           0    0    1   0 | Data            | Read/Write
    | 3           0    0    1   1 | Data            | Read/Write, accessed
    | 4           0    1    0   0 | Data            | Read-Only, expand-down
    | 5           0    1    0   1 | Data            | Read-Only, expand-down, accessed
    | 6           0    1    1   0 | Data            | Read/Write, expand-down
    | 7           0    1    1   1 | Data            | Read/Write, expand-down, accessed
    |                  C    R   A |                 |
    | 8           1    0    0   0 | Code            | Execute-Only
    | 9           1    0    0   1 | Code            | Execute-Only, accessed
    | 10          1    0    1   0 | Code            | Execute/Read
    | 11          1    0    1   1 | Code            | Execute/Read, accessed
    | 12          1    1    0   0 | Code            | Execute-Only, conforming
    | 13          1    1    0   1 | Code            | Execute-Only, conforming, accessed
    | 14          1    1    1   0 | Code            | Execute/Read, conforming
    | 15          1    1    1   1 | Code            | Execute/Read, conforming, accessed

As we can see, the first bit is 0 for a data segment and 1 for a code segment. For data segments the next three bits, EWA, are expansion direction (an expand-down segment grows down), write enable and accessed. For code segments the CRA bits are conforming (a transfer of execution into a more-privileged conforming segment allows execution to continue at the current privilege level), read enable and accessed.

1. DPL (descriptor privilege level) defines the privilege level of the segment. It can be 0-3, where 0 is the most privileged.

2. P flag - indicates whether the segment is present in memory or not.

3. AVL flag - available and reserved bits.
4. L flag - indicates whether a code segment contains native 64-bit code. If it is 1 then the code segment executes in 64-bit mode.

5. B/D flag - default operation size/default stack pointer size and/or upper bound.
Segment registers don't contain the base address of the segment as in real mode. Instead they contain a special structure, the segment selector. A selector is a 16-bit structure:

    -----------------------------
    |       Index    | TI | RPL |
    -----------------------------

where Index is the index number of the descriptor in the descriptor table, TI shows where to search for the descriptor (in the global descriptor table or the local one), and RPL is the requested privilege level.

Every segment register has a visible and a hidden part. When a selector is loaded into one of the segment registers, it is stored in the visible part. The hidden part contains the base address, limit and access information of the descriptor that the selector points to. The following steps are needed to get the physical address in protected mode:

* The segment selector must be loaded into one of the segment registers;
* The CPU tries to find (by GDT address + Index from the selector) and load the descriptor into the hidden part of the segment register;
* The base address (from the segment descriptor) + offset will be the linear address of the segment, which is the physical address (if paging is disabled).

Schematically it will look like this:
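The translation steps listed above can be sketched in Python. This is a deliberately simplified model: the GDT is represented as a plain list of segment base addresses (a hypothetical stand-in for real descriptors), limits and privilege checks are ignored, and the 13/1/2-bit split of the selector into Index, TI and RPL is the standard x86 layout rather than something spelled out in the text.

```python
def decode_selector(selector: int) -> dict:
    """Split a 16-bit selector into Index (13 bits), TI (1 bit) and RPL (2 bits)."""
    return {
        "index": selector >> 3,
        "ti": (selector >> 2) & 1,
        "rpl": selector & 0x3,
    }


def linear_address(selector: int, offset: int, gdt_bases: list) -> int:
    """Look up the segment base via the selector and add the offset."""
    sel = decode_selector(selector)
    if sel["ti"] != 0:
        raise NotImplementedError("this sketch only models the GDT (TI = 0)")
    base = gdt_bases[sel["index"]]  # the CPU would cache this in the hidden part
    return base + offset


# Hypothetical table: entry 0 is unused, entries 1 and 2 have made-up bases.
gdt_bases = [0x0, 0x0, 0x100000]
print(decode_selector(0x10))
print(hex(linear_address(0x10, 0x1234, gdt_bases)))
```

With paging disabled, the value returned by linear_address would also be the physical address.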
The algorithm for the transition from real mode into protected mode is:

* Disable interrupts;
* Describe and load the GDT with the lgdt instruction;
* Set the PE (Protection Enable) bit in CR0 (Control Register 0);
* Jump to protected mode code.

We will see the transition to protected mode in the Linux kernel in the next part, but before we can move to protected mode, we need to do some preparations.

Let's look at arch/x86/boot/main.c. We can see some routines there which perform keyboard initialization, heap initialization, etc. Let's look into it.

Copying boot parameters into the "zeropage"

We will start from the main routine in "main.c". The first function which is called in main is copy_boot_params. It copies the kernel setup header into a field of the boot_params structure, which is defined in arch/x86/include/uapi/asm/bootparam.h.

The boot_params structure contains the struct setup_header hdr field. This structure contains the same fields as defined in the Linux boot protocol and is filled by the bootloader and also at kernel compile/build time. copy_boot_params does two things: it copies hdr from header.S to the setup_header field of the boot_params structure, and it updates the pointer to the kernel command line if the kernel was loaded with the old command line protocol.
Note that it copies hdr with the memcpy function, which is defined in the copy.S source file. Let's have a look inside:

    GLOBAL(memcpy)
        pushw   %si
        pushw   %di
        movw    %ax, %di
        movw    %dx, %si
        pushw   %cx
        shrw    $2, %cx
        rep; movsl
        popw    %cx
        andw    $3, %cx
        rep; movsb
        popw    %di
        popw    %si
        retl
    ENDPROC(memcpy)

Yeah, we just moved to C code and now assembly again :) First of all we can see that memcpy and the other routines which are defined here start and end with two macros: GLOBAL and ENDPROC. GLOBAL is described in arch/x86/include/asm/linkage.h; it defines the globl directive and the label for it. ENDPROC is described in include/linux/linkage.h; it marks the name symbol as a function name and ends with the size of the name symbol.

The implementation of memcpy is easy. At first, it pushes the values of the si and di registers onto the stack, because their values will change in memcpy and need to be preserved. memcpy (and other functions in copy.S) use fastcall calling conventions, so it gets its incoming parameters from the ax, dx and cx registers. Calling memcpy looks like this:

    memcpy(&boot_params.hdr, &hdr, sizeof hdr);

So ax will contain the address of boot_params.hdr, dx will contain the address of hdr and cx will contain the size of hdr in bytes. memcpy puts the address of boot_params.hdr into the di register and the address of hdr into si, and saves the size on the stack. After this it shifts the size right by 2 bits (that is, divides it by 4) and copies from si to di 4 bytes at a time. After that we restore the size of hdr again, mask it with 3 to get the remainder and copy the remaining bytes from si to di byte by byte (if there are any). At the end it restores the si and di values from the stack, and the copying is finished.
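The two-phase copy (whole 4-byte words first, then the leftover bytes) can be modelled in Python on a bytearray. This is only a sketch of the algorithm in the listing above, with memory represented as a flat bytearray and addresses as indexes; register saving and the calling convention are not modelled.

```python
def memcpy(memory: bytearray, dst: int, src: int, count: int) -> None:
    """Copy `count` bytes from src to dst: 4 bytes at a time, then the rest."""
    di, si = dst, src
    for _ in range(count >> 2):   # shrw $2, %cx ; rep; movsl
        memory[di:di + 4] = memory[si:si + 4]
        di += 4
        si += 4
    for _ in range(count & 3):    # andw $3, %cx ; rep; movsb
        memory[di] = memory[si]
        di += 1
        si += 1


# Copy 11 bytes: two 4-byte moves followed by three single-byte moves.
memory = bytearray(range(16)) + bytearray(16)
memcpy(memory, dst=16, src=0, count=11)
print(memory[16:32].hex())
```

For a count of 11, count >> 2 is 2 and count & 3 is 3, so exactly 2 * 4 + 3 = 11 bytes are written.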
Console initialization

After hdr is copied into boot_params.hdr, the next step is console initialization by calling the console_init function, which is defined in arch/x86/boot/early_serial_console.c.

It tries to find the earlyprintk option in the command line and, if the search is successful, it parses the port address and baud rate of the serial port and initializes the serial port. The value of the earlyprintk command line option can be one of:

* serial,0x3f8,115200
* serial,ttyS0,115200
* ttyS0,115200

After serial port initialization we can see the first output:

    if (cmdline_find_option_bool("debug"))
        puts("early console in setup code\n");

The puts definition is in tty.c. As we can see, it prints character by character in a loop by calling the putchar function. Let's look into the putchar implementation:
    void __attribute__((section(".inittext"))) putchar(int ch)
    {
        if (ch == '\n')
            putchar('\r');

        bios_putchar(ch);

        if (early_serial_base != 0)
            serial_putchar(ch);
    }

__attribute__((section(".inittext"))) means that this code will be in the .inittext section. We can find it in the linker file setup.ld.

First of all, putchar checks for the \n symbol and, if it is found, prints \r before it. After that it outputs the character on the VGA screen by calling the BIOS with the 0x10 interrupt call:

    static void __attribute__((section(".inittext"))) bios_putchar(int ch)
    {
        struct biosregs ireg;

        initregs(&ireg);
        ireg.bx = 0x0007;
        ireg.cx = 0x0001;
        ireg.ah = 0x0e;
        ireg.al = ch;
        intcall(0x10, &ireg, NULL);
    }

Here initregs takes the biosregs structure and first fills biosregs with zeros using the memset function and then fills it with register values.

    memset(reg, 0, sizeof *reg);
    reg->eflags |= X86_EFLAGS_CF;
    reg->ds = ds();
    reg->es = ds();
    reg->fs = fs();
    reg->gs = gs();

Let's look at the memset implementation:

    GLOBAL(memset)
        pushw   %di
        movw    %ax, %di
        movzbl  %dl, %eax
        imull   $0x01010101, %eax
        pushw   %cx
        shrw    $2, %cx
        rep; stosl
        popw    %cx
        andw    $3, %cx
        rep; stosb
        popw    %di
        retl
    ENDPROC(memset)

As you can read above, it uses fastcall calling conventions like the memcpy function, which means that the function gets its parameters from the ax, dx and cx registers.
Generally memset is like the memcpy implementation. It saves the value of the di register on the stack and puts the ax value, which is the address of the biosregs structure, into di. Next is the movzbl instruction, which copies the dl value to the low byte of the eax register. The remaining 3 high bytes of eax will be filled with zeros.

The next instruction multiplies eax by 0x01010101. This is needed because memset will copy 4 bytes at the same time. For example, suppose we need to fill a structure with 0x7 using memset. eax will contain the value 0x00000007 in this case. So if we multiply eax by 0x01010101, we will get 0x07070707 and now we can copy these 4 bytes into the structure. memset uses the rep; stosl instructions for copying eax into es:di.

The rest of the memset function does almost the same as memcpy.
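The byte replication trick and the two fill phases can be sketched in Python. As with the memcpy model, this is an illustration on a bytearray rather than the kernel code; the helper names are invented for the example.

```python
def replicate_byte(value: int) -> int:
    """Model of movzbl + imull $0x01010101: copy one byte into all four bytes."""
    return ((value & 0xFF) * 0x01010101) & 0xFFFFFFFF


def memset(memory: bytearray, dst: int, value: int, count: int) -> None:
    """Fill `count` bytes at dst: 4 bytes at a time, then the remaining bytes."""
    pattern = replicate_byte(value).to_bytes(4, "little")
    di = dst
    for _ in range(count >> 2):   # shrw $2, %cx ; rep; stosl
        memory[di:di + 4] = pattern
        di += 4
    for _ in range(count & 3):    # andw $3, %cx ; rep; stosb
        memory[di] = pattern[0]
        di += 1


print(hex(replicate_byte(0x07)))

memory = bytearray(16)
memset(memory, dst=2, value=0x07, count=10)
print(memory.hex())
```

Multiplying by 0x01010101 works because each of its four set bytes contributes one shifted copy of the original byte, and the copies do not overlap.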
After the biosregs structure is filled with memset, bios_putchar calls the 0x10 interrupt which prints a character. Afterwards it checks whether the serial port was initialized and, if so, writes the character there with serial_putchar and the inb/outb instructions.

Heap initialization

After the stack and bss section were prepared in header.S (see the previous part), the kernel needs to initialize the heap with the init_heap function.

First of all init_heap checks the CAN_USE_HEAP flag from the loadflags field of the kernel setup header and calculates the end of the stack if this flag was set:

    char *stack_end;

    if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
        asm("leal %P1(%%esp),%0"
            : "=r" (stack_end) : "i" (-STACK_SIZE));

or in other words stack_end = esp - STACK_SIZE.

Then there is the heap_end calculation, which is heap_end_ptr or _end + 512, followed by a check: if heap_end is greater than stack_end, heap_end is set equal to stack_end.

From this moment we can use the heap in the kernel setup code. We will see how to use it and how the API for it is implemented in the next posts.

CPU validation

The next step, as we can see, is CPU validation by validate_cpu from arch/x86/boot/cpu.c.

It calls the check_cpu function, passes the CPU level and the required CPU level to it, and checks that the kernel was launched on the right CPU. It checks the CPU's flags and the presence of long mode (which we will see in more detail in the next parts) for x86_64, checks the processor's vendor and makes preparations for certain vendors, like turning on SSE+SSE2 for AMD if they are missing, etc.

Memory detection

The next step is memory detection by the detect_memory function. It uses different programming interfaces for memory detection like 0xe820, 0xe801 and 0x88. We will see only the implementation of 0xE820 here. Let's look into the detect_memory_e820 implementation from the arch/x86/boot/memory.c source file. First of all, the detect_memory_e820 function initializes the biosregs structure as we saw above and fills registers with special values for the 0xe820 call:
    initregs(&ireg);
    ireg.ax  = 0xe820;
    ireg.cx  = sizeof buf;
    ireg.edx = SMAP;
    ireg.di  = (size_t)&buf;

The ax register must contain the number of the function (0xe820 in our case), the cx register contains the size of the buffer which will contain data about memory, edx must contain the SMAP magic number, es:di must contain the address of the buffer which will contain memory data and ebx has to be zero.

Next is a loop where data about the memory will be collected. It starts from the call of the 0x15 BIOS interrupt, which writes one line from the address allocation table. To get the next line we need to call this interrupt again (which we do in the loop). Before the next call ebx must contain the value returned previously:

    intcall(0x15, &ireg, &oreg);
    ireg.ebx = oreg.ebx;

Ultimately, it iterates in the loop to collect data from the address allocation table and writes this data into the e820entry array:

* start of memory segment
* size of memory segment
* type of memory segment (which can be reserved, usable, etc.)

You can see the result of this in the dmesg output, something like:

    [    0.000000] e820: BIOS-provided physical RAM map:
    [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
    [    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
    [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
    [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffdffff] usable
    [    0.000000] BIOS-e820: [mem 0x000000003ffe0000-0x000000003fffffff] reserved
    [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
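The shape of this loop can be sketched in Python with a fake BIOS call. Everything here is a model for illustration: fake_bios_e820 is a hypothetical stand-in for the 0x15 interrupt, the rule that the walk ends when the returned ebx is zero is the usual E820 convention rather than something stated above, and the limit of 128 entries is an assumption. The sample table reuses the first three ranges from the dmesg output.

```python
def fake_bios_e820(table: list, ebx: int) -> tuple:
    """Stand-in for int 0x15, ax=0xe820: return one entry and the next ebx."""
    entry = table[ebx]
    next_ebx = ebx + 1
    if next_ebx == len(table):
        next_ebx = 0              # assumption: 0 signals the last entry
    return entry, next_ebx


def detect_memory_e820(table: list, max_entries: int = 128) -> list:
    """Collect entries by calling the BIOS until the continuation value is 0."""
    entries = []
    ebx = 0                       # ebx has to be zero for the first call
    while True:
        entry, ebx = fake_bios_e820(table, ebx)  # ebx is fed back into the next call
        entries.append(entry)
        if ebx == 0 or len(entries) >= max_entries:
            break
    return entries


# (start, size, type) tuples derived from the first three dmesg lines above.
bios_table = [
    (0x00000, 0x9FC00, "usable"),
    (0x9FC00, 0x00400, "reserved"),
    (0xF0000, 0x10000, "reserved"),
]

e820_map = detect_memory_e820(bios_table)
for start, size, kind in e820_map:
    print(f"[mem {start:#018x}-{start + size - 1:#018x}] {kind}")
```

The sketch assumes a non-empty table; the real code also has to handle a failing BIOS call, which is not modelled here.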
Keyboard initialization

The next step is the initialization of the keyboard with a call to the keyboard_init function. At first keyboard_init initializes registers using the initregs function and calls the 0x16 interrupt to get the keyboard status. After this it calls 0x16 again to set the repeat rate and delay.

Querying

The next couple of steps are queries for different parameters. We will not dive into details about these queries, but will get back to all of them in the next parts. Let's take a short look at these functions:

The query_mca routine calls the 0x15 BIOS interrupt to get the machine model number, sub-model number, BIOS revision level, and other hardware-specific attributes:

    int query_mca(void)
    {
        struct biosregs ireg, oreg;
        u16 len;

        initregs(&ireg);
        ireg.ah = 0xc0;
        intcall(0x15, &ireg, &oreg);

        if (oreg.eflags & X86_EFLAGS_CF)
            return -1;  /* No MCA present */

        set_fs(oreg.es);
        len = rdfs16(oreg.bx);

        if (len > sizeof(boot_params.sys_desc_table))
            len = sizeof(boot_params.sys_desc_table);

        copy_from_fs(&boot_params.sys_desc_table, oreg.bx, len);
        return 0;
    }
It fills the ah register with 0xc0 and calls the 0x15 BIOS interrupt. After the interrupt execution it checks the carry flag, and if it is set to 1, the BIOS doesn't support MCA. If the carry flag is set to 0, ES:BX will contain a pointer to the system information table, which looks like this:

    Offset  Size      Description
     00h    WORD      number of bytes following
     02h    BYTE      model (see #00515)
     03h    BYTE      submodel (see #00515)
     04h    BYTE      BIOS revision: 0 for first release, 1 for 2nd, etc.
     05h    BYTE      feature byte 1 (see #00510)
     06h    BYTE      feature byte 2 (see #00511)
     07h    BYTE      feature byte 3 (see #00512)
     08h    BYTE      feature byte 4 (see #00513)
     09h    BYTE      feature byte 5 (see #00514)
    ---AWARD BIOS---
     0Ah    N BYTEs   AWARD copyright notice
    ---Phoenix BIOS---
     0Ah    BYTE      ??? (00h)
     0Bh    BYTE      major version
     0Ch    BYTE      minor version (BCD)
     0Dh    4 BYTEs   ASCIZ string "PTL" (Phoenix Technologies Ltd)
    ---Quadram Quad386---
     0Ah    17 BYTEs  ASCII signature string "Quadram Quad386XT"
    ---Toshiba (Satellite Pro 435CDS at least)---
     0Ah    7 BYTEs   signature "TOSHIBA"
     11h    BYTE      ??? (8h)
     12h    BYTE      ??? (E7h) product ID??? (guess)
     13h    3 BYTEs   "JPN"
Next we call the set_fs routine and pass the value of the es register to it. The implementation of set_fs is pretty simple:

    static inline void set_fs(u16 seg)
    {
        asm volatile("movw %0,%%fs" : : "rm" (seg));
    }

This inline assembly takes the value of the seg parameter and puts it into the fs register. boot.h contains many functions like set_fs, for example set_gs, as well as fs and gs for reading the value back.

At the end of query_mca it just copies the table pointed to by es:bx to boot_params.sys_desc_table.

Next comes getting Intel SpeedStep information with a call to the query_ist function. First it checks the CPU level and, if it is correct, calls 0x15 to get the info and saves the result to boot_params.
The following query_apm_bios function gets Advanced Power Management information from the BIOS. query_apm_bios calls the 0x15 BIOS interrupt too, but with ah = 0x53 to check the APM installation. After the 0x15 execution, query_apm_bios checks the PM signature (it must be 0x504d), the carry flag (it must be 0 if APM is supported) and the value of the cx register (if it's 0x02, the protected mode interface is supported).

Next it calls 0x15 again, but with ax = 0x5304 to disconnect the APM interface and connect the 32-bit protected mode interface. In the end it fills boot_params.apm_bios_info with the values obtained from the BIOS.
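The three checks described above can be sketched as a single predicate. This is only an illustration: apm_interface_ok and its parameter names are hypothetical, the real code reads these values from the BIOS register structure, and the cx test is written here as a bit test, which is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_CF       0x0001u  /* carry flag is bit 0 of EFLAGS */
#define APM_SIGNATURE   0x504du  /* ASCII "PM" */
#define APM_32BIT_PMODE 0x02u    /* cx value named in the text */

/*
 * Hypothetical helper: returns true when the BIOS answer passes
 * the signature, carry flag and protected mode interface checks.
 */
static bool apm_interface_ok(uint16_t signature, uint32_t eflags, uint16_t cx)
{
    if (signature != APM_SIGNATURE)
        return false;           /* no "PM" signature */
    if (eflags & EFLAGS_CF)
        return false;           /* carry set: APM not supported */
    if (!(cx & APM_32BIT_PMODE))
        return false;           /* no 32-bit protected mode interface */
    return true;
}
```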
Note that query_apm_bios will be executed only if CONFIG_APM or CONFIG_APM_MODULE was set in the configuration file:

    #if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
        query_apm_bios();
    #endif

The last is the query_edd function, which queries Enhanced Disk Drive information from the BIOS. Let's look into the query_edd implementation.

First of all it reads the edd option from the kernel's command line and, if it was set to off, query_edd just returns.

If EDD is enabled, query_edd goes over the BIOS-supported hard disks and queries EDD information in the following loop:
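    for (devno = 0x80; devno ...

    initrd_start  = (u64)real_mode->ext_ramdisk_image << 32;
    initrd_start |= real_mode->hdr.ramdisk_image;
    initrd_size   = (u64)real_mode->ext_ramdisk_size << 32;
    initrd_size  |= real_mode->hdr.ramdisk_size;
    mem_avoid[1].start = initrd_start;
    mem_avoid[1].size  = initrd_size;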
Here we can see the calculation of the initrd start address and size. ext_ramdisk_image is the high 32 bits of the ramdisk_image field from the boot header and ext_ramdisk_size is the high 32 bits of the ramdisk_size field from the boot protocol:

    Offset  Proto  Name           Meaning
    /Size
    ...
    ...
    ...
    0218/4  2.00+  ramdisk_image  initrd load address (set by boot loader)
    021C/4  2.00+  ramdisk_size   initrd size (set by boot loader)
    ...

And ext_ramdisk_image and ext_ramdisk_size can be found in Documentation/x86/zero-page.txt:

    Offset   Proto  Name               Meaning
    /Size
    ...
    ...
    ...
    0C0/004  ALL    ext_ramdisk_image  ramdisk_image high 32bits
    0C4/004  ALL    ext_ramdisk_size   ramdisk_size high 32bits
    ...

So we take ext_ramdisk_image and ext_ramdisk_size, shift them left by 32 (so that they occupy the high 32 bits of a 64-bit value), combine them with the low 32 bits from the boot header and get the start address of the initrd and its size. After this we store these values in the mem_avoid array, which is defined as:
    #define MEM_AVOID_MAX 5
    static struct mem_vector mem_avoid[MEM_AVOID_MAX];

where the mem_vector structure is:

    struct mem_vector {
        unsigned long start;
        unsigned long size;
    };
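The initrd address calculation described above, where a high and a low 32-bit half are joined into one 64-bit value, can be sketched in isolation like this. join_hi_lo is a hypothetical helper name, not something from the kernel sources.

```c
#include <stdint.h>

/*
 * Hypothetical helper: build a 64-bit value from its two halves,
 * e.g. ext_ramdisk_image (high) and hdr.ramdisk_image (low).
 */
static uint64_t join_hi_lo(uint32_t hi, uint32_t lo)
{
    /* The cast must happen before the shift, otherwise the
     * high bits would be shifted out of a 32-bit value. */
    return ((uint64_t)hi << 32) | lo;
}
```

For an initrd loaded below 4G the high half is simply zero and the result equals the low half.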
The next step, after we have collected all unsafe memory regions in the mem_avoid array, is to search for a random address which does not overlap the unsafe regions, using the find_random_addr function.

First of all we can see the alignment of the output address in the find_random_addr function:

    minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);

You may remember the CONFIG_PHYSICAL_ALIGN configuration option from the previous part. This option provides the value to which the kernel should be aligned, and it is 0x200000 by default. Once we have the aligned output address, we go through the memory and collect regions which are suitable for the decompressed kernel image:

    for (i = 0; i < real_mode->e820_entries; i++) {
        process_e820_entry(&real_mode->e820_map[i], minimum, size);
    }

You may remember that we collected e820_entries in the second part of the kernel booting process.
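The alignment step used in find_random_addr above rounds an address up to the next multiple of the alignment. A minimal sketch, assuming the alignment is a power of two (which 0x200000 is); align_up is a hypothetical name standing in for the kernel's ALIGN macro:

```c
/*
 * Round x up to the next multiple of a.
 * Only valid when a is a power of two.
 */
static unsigned long align_up(unsigned long x, unsigned long a)
{
    return (x + (a - 1)) & ~(a - 1);
}
```

An address that is already aligned is returned unchanged; anything else moves up to the next 2 MB boundary when a is 0x200000.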
First of all the process_e820_entry function checks that the e820 memory region is RAM, that the start address of the memory region is not bigger than the maximum allowed aslr offset and that the memory region does not end below the minimum address:

    struct mem_vector region, img;

    if (entry->type != E820_RAM)
        return;

    if (entry->addr >= CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
        return;

    if (entry->addr + entry->size < minimum)
        return;

    region.start = entry->addr;
    region.size = entry->size;

Having stored these values, we align region.start as we did in the find_random_addr function and check that we did not get an address beyond the original memory region:

    region.start = ALIGN(region.start, CONFIG_PHYSICAL_ALIGN);

    if (region.start > entry->addr + entry->size)
        return;

Next we take the difference between the aligned address and the original one and subtract it from the region size. Then, if the last address in the memory region is bigger than CONFIG_RANDOMIZE_BASE_MAX_OFFSET, we reduce the memory region size so that the end of the kernel image stays below the maximum aslr offset:

    region.size -= region.start - entry->addr;

    if (region.start + region.size > CONFIG_RANDOMIZE_BASE_MAX_OFFSET)
        region.size = CONFIG_RANDOMIZE_BASE_MAX_OFFSET - region.start;

In the end we go through all unsafe memory regions and check that the candidate does not overlap unsafe areas such as the kernel command line, the initrd, etc.:

    for (img.start = region.start, img.size = image_size;
         mem_contains(&region, &img);
         img.start += CONFIG_PHYSICAL_ALIGN) {
        if (mem_avoid_overlap(&img))
            continue;
        slots_append(img.start);
    }

If the candidate does not overlap unsafe regions, we call the slots_append function with its start address. The slots_append function just collects start addresses into the slots array:

    slots[slot_max++] = addr;

which is defined as:

    static unsigned long slots[CONFIG_RANDOMIZE_BASE_MAX_OFFSET /
                               CONFIG_PHYSICAL_ALIGN];
    static unsigned long slot_max;

After process_e820_entry has been executed for every entry, we have an array of addresses which are safe for the decompressed kernel. Next we call the slots_fetch_random function to get a random item from this array:

    if (slot_max == 0)
        return 0;

    return slots[get_random_long() % slot_max];
where the get_random_long function checks different CPU flags such as X86_FEATURE_RDRAND or X86_FEATURE_TSC and chooses a method for getting a random number (it can be obtained with the RDRAND instruction, the time stamp counter, the programmable interval timer, etc.). Once we have the random address, the execution of choose_kernel_location is finished.
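The whole slot selection described above can be played through in user space. The following is only a sketch under simplifying assumptions: MAX_SLOTS is an arbitrary cap, contains and overlaps are hypothetical stand-ins for mem_contains and mem_avoid_overlap, and the random value is passed in by the caller instead of coming from get_random_long.

```c
#include <stddef.h>

#define PHYSICAL_ALIGN 0x200000UL  /* default CONFIG_PHYSICAL_ALIGN from the text */
#define MAX_SLOTS      64          /* illustrative cap, not the kernel's array size */

struct mem_vector {
    unsigned long start;
    unsigned long size;
};

static unsigned long slots[MAX_SLOTS];
static size_t slot_max;

/* True if img lies completely inside region. */
static int contains(const struct mem_vector *region, const struct mem_vector *img)
{
    return img->start >= region->start &&
           img->start + img->size <= region->start + region->size;
}

/* True if the two half-open ranges share at least one byte. */
static int overlaps(const struct mem_vector *a, const struct mem_vector *b)
{
    return a->start < b->start + b->size && b->start < a->start + a->size;
}

/* Step through region in PHYSICAL_ALIGN steps and keep every safe start address. */
static size_t collect_slots(struct mem_vector region, unsigned long image_size,
                            const struct mem_vector *avoid, size_t n_avoid)
{
    struct mem_vector img = { region.start, image_size };

    slot_max = 0;
    for (; contains(&region, &img) && slot_max < MAX_SLOTS;
         img.start += PHYSICAL_ALIGN) {
        int unsafe = 0;

        for (size_t i = 0; i < n_avoid; i++)
            if (overlaps(&img, &avoid[i]))
                unsafe = 1;
        if (!unsafe)
            slots[slot_max++] = img.start;
    }
    return slot_max;
}

/* Pick one slot from a caller-supplied random value; 0 means no slot was found. */
static unsigned long pick_slot(unsigned long random_value)
{
    if (slot_max == 0)
        return 0;
    return slots[random_value % slot_max];
}
```

With a 16 MB region, a 4 MB image and one small unsafe area in the middle, the two candidates that touch the unsafe area are skipped and the rest become slots.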
Now let's go back to misc.c. After we got the address for the kernel image, some checks are needed to be sure that the random address is correctly aligned and is not wrong.

After all these checks we will see the familiar message:

    Decompressing Linux...

and call the decompress function, which will decompress the kernel. The decompress function depends on which decompression algorithm was chosen during kernel compilation:

    #ifdef CONFIG_KERNEL_GZIP
    #include "../../../../lib/decompress_inflate.c"
    #endif

    #ifdef CONFIG_KERNEL_BZIP2
    #include "../../../../lib/decompress_bunzip2.c"
    #endif

    #ifdef CONFIG_KERNEL_LZMA
    #include "../../../../lib/decompress_unlzma.c"
    #endif

    #ifdef CONFIG_KERNEL_XZ
    #include "../../../../lib/decompress_unxz.c"
    #endif

    #ifdef CONFIG_KERNEL_LZO
    #include "../../../../lib/decompress_unlzo.c"
    #endif

    #ifdef CONFIG_KERNEL_LZ4
    #include "../../../../lib/decompress_unlz4.c"
    #endif

After the kernel is decompressed, the last function, handle_relocations, relocates the kernel to the address that we got from choose_kernel_location. After the kernel is relocated we return from decompress_kernel to head_64.S. The address of the kernel will be in the rax register and we jump to it:

    jmp *%rax

That's all. Now we are in the kernel!
Conclusion

This is the end of the fifth and the last part about the linux kernel booting process. We will not see posts about kernel booting anymore (maybe only updates to this and previous posts), but there will be many posts about other kernel internals.

The next chapter will be about kernel initialization and we will see the first steps in the linux kernel initialization code.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

* address space layout randomization
* initrd
* long mode
* bzip2
* RDRAND instruction
* Time Stamp Counter
* Programmable Interval Timers
* Previous part
Kernel initialization process

You will find here a couple of posts which describe the full cycle of kernel initialization from its first steps after the kernel has been decompressed to the start of the first process run by the kernel itself.

* First steps after kernel decompression - describes first steps in the kernel.
* Early interrupt and exception handling - describes early interrupts initialization and early page fault handler.
* Last preparations before the kernel entry point - describes the last preparations before the call of the start_kernel.
* Kernel entry point - describes first steps in the kernel generic code.
* Continue of architecture-specific initializations - describes architecture-specific initialization.
* Architecture-specific initializations, again... - describes continue of the architecture-specific initialization process.
* The End of the architecture-specific initializations, almost... - describes the end of the setup_arch related stuff.
* Scheduler initialization - describes preparation before scheduler initialization and initialization of it.
Kernel initialization. Part 1.

First steps in the kernel code

In the previous post (Kernel booting process. Part 5.) - Kernel decompression we stopped at the jump to the decompressed kernel:

    jmp *%rax

and now we are in the kernel. There are many things to do before the kernel starts the first init process. Hopefully we will see all of the preparations before the kernel starts in this big chapter. We will start from the kernel entry point, which is in arch/x86/kernel/head_64.S. We will see the first preparations like early page tables initialization, the switch to a new descriptor in kernel space and many more, before the start_kernel function from init/main.c is called.

So let's start.

First steps in the kernel

Okay, we got the address of the kernel from the decompress_kernel function in the rax register and just jumped there. The decompressed kernel code starts in arch/x86/kernel/head_64.S:

        __HEAD
        .code64
        .globl startup_64
    startup_64:
        ...
        ...
        ...

We can see the definition of the startup_64 routine, and it is defined in the __HEAD section, which is just:

    #define __HEAD .section ".head.text","ax"

We can see the definition of this section in the arch/x86/kernel/vmlinux.lds.S linker script:

    .text : AT(ADDR(.text) - LOAD_OFFSET) {
        _text = .;
        ...
        ...
        ...
    } :text = 0x9090

We can understand the default virtual and physical addresses from the linker script. Note that the address of _text is the location counter, which is defined as:

    . = __START_KERNEL;

for x86_64. We can find the definition of the __START_KERNEL macro in arch/x86/include/asm/page_types.h:
    #define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)

    #define __PHYSICAL_START ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)

Here we can see that __START_KERNEL is the sum of __START_KERNEL_map (which is 0xffffffff80000000, see the post about paging) and __PHYSICAL_START, where __PHYSICAL_START is the aligned value of CONFIG_PHYSICAL_START. So if you do not use kASLR and do not change CONFIG_PHYSICAL_START in the configuration, the addresses will be the following:

* Physical address - 0x1000000;
* Virtual address - 0xffffffff81000000.

Now we know the default physical and virtual addresses of the startup_64 routine, but to know the actual addresses we must calculate them with the following code:

    leaq _text(%rip), %rbp
    subq $_text - __START_KERNEL_map, %rbp

Here we put the rip-relative address of _text into the rbp register and then subtract $_text - __START_KERNEL_map from it. We know that the compiled address of _text is 0xffffffff81000000 and __START_KERNEL_map is 0xffffffff80000000, so the subtracted value is 0x1000000 and rbp will contain the physical address of _text minus 0x1000000 after this calculation. That is the difference between the address the kernel actually runs at and the default one; it is zero when the kernel is loaded at the default address. We need to calculate it because the kernel may be run at an address other than the default one, and now we know the actual offset.
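The arithmetic of these two instructions can be replayed in plain C. This is a sketch with the default addresses from the text hard-coded as constants; load_delta and runtime_text are hypothetical names, where runtime_text stands for the value that leaq _text(%rip) would produce.

```c
#include <stdint.h>

#define START_KERNEL_MAP 0xffffffff80000000ULL  /* __START_KERNEL_map */
#define COMPILED_TEXT    0xffffffff81000000ULL  /* default virtual address of _text */

/*
 * Mirrors: leaq _text(%rip), %rbp
 *          subq $_text - __START_KERNEL_map, %rbp
 */
static uint64_t load_delta(uint64_t runtime_text)
{
    return runtime_text - (COMPILED_TEXT - START_KERNEL_MAP);
}
```

A kernel running at the default 0x1000000 yields zero, while one that was moved to 0x3000000 yields 0x2000000.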
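In the next step we check that this address is aligned with:

    movq %rbp, %rax
    andl $~PMD_PAGE_MASK, %eax
    testl %eax, %eax
    jnz bad_address

Here we put the address into rax and test its low bits: if any bit below the page middle directory boundary is set, the address is not aligned and we jump to bad_address. PMD_PAGE_MASK is the mask for the page middle directory (read the paging post about it) and is defined as:

    #define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1))

    #define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT)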
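The alignment test on the page middle directory boundary can be sketched as follows, assuming PMD_SHIFT is 21 (a 2 MB boundary). pmd_aligned is a hypothetical helper name.

```c
#include <stdint.h>

#define PMD_SHIFT     21
#define PMD_PAGE_SIZE (1ULL << PMD_SHIFT)        /* 2 MB */
#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE - 1))

/* Non-zero when no bit below the 2 MB boundary is set. */
static int pmd_aligned(uint64_t addr)
{
    return (addr & ~PMD_PAGE_MASK) == 0;
}
```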
Fix base addresses of page tables

The first step before we start to set up identity paging is to correct the following addresses:

    addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
    addq %rbp, level3_kernel_pgt + (510*8)(%rip)
    addq %rbp, level3_kernel_pgt + (511*8)(%rip)
    addq %rbp, level2_fixmap_pgt + (506*8)(%rip)

Here we need to correct early_level4_pgt and the other page table directory addresses because, as I wrote above, the kernel may be run at an address other than the default 0x1000000. The rbp register was computed from the actual address, so we add it to the entries of early_level4_pgt, level3_kernel_pgt and level2_fixmap_pgt. Let's try to understand what these labels mean. First of all let's look at their definitions:

    NEXT_PAGE(early_level4_pgt)
        .fill 511,8,0
        .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

    NEXT_PAGE(level3_kernel_pgt)
        .fill L3_START_KERNEL,8,0
        .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
        .quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE

    NEXT_PAGE(level2_kernel_pgt)
        PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
             KERNEL_IMAGE_SIZE/PMD_SIZE)

    NEXT_PAGE(level2_fixmap_pgt)
        .fill 506,8,0
        .quad level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
        .fill 5,8,0

    NEXT_PAGE(level1_fixmap_pgt)
        .fill 512,8,0

It looks hard, but it is not.

First of all let's look at early_level4_pgt. It starts with (4096 - 8) bytes of zeros, which means that we don't use the first 511 early_level4_pgt entries. After this we can see the level3_kernel_pgt entry. Note that we subtract __START_KERNEL_map from it and add _PAGE_TABLE. As we know, __START_KERNEL_map is the base virtual address of the kernel text, so if we subtract __START_KERNEL_map, we get the physical address of level3_kernel_pgt. Now let's look at _PAGE_TABLE; it is just the page entry access rights:

    #define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
                         _PAGE_ACCESSED | _PAGE_DIRTY)

You can read more about it in the paging post.

level3_kernel_pgt stores entries which map kernel space. At the start of its definition, we can see that it is filled with zeros L3_START_KERNEL times. Here L3_START_KERNEL is the index in the page upper directory which contains the __START_KERNEL_map address, and it equals 510. After it we can see the definition of two level3_kernel_pgt entries: level2_kernel_pgt and level2_fixmap_pgt. The first is simple: it is a page table entry which contains a pointer to the page middle directory that maps kernel space, and it has the

    #define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
                           _PAGE_DIRTY)

access rights. The second, level2_fixmap_pgt, holds virtual addresses which can refer to any physical addresses, even under kernel space.

Next, level2_kernel_pgt calls the PMDS macro, which creates 512 megabytes from __START_KERNEL_map for the kernel text (after these 512 megabytes comes the module memory space).
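To see what a directory entry such as "level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE" evaluates to, the access-right macros above can be written out with the architectural x86 bit positions. In this sketch the names are shortened, make_entry is a hypothetical helper, and the table address used in the note below is an invented example value, not a real symbol address.

```c
#include <stdint.h>

#define START_KERNEL_MAP 0xffffffff80000000ULL

/* x86 page table entry flag bits (architectural bit positions). */
#define PAGE_PRESENT  (1ULL << 0)
#define PAGE_RW       (1ULL << 1)
#define PAGE_USER     (1ULL << 2)
#define PAGE_ACCESSED (1ULL << 5)
#define PAGE_DIRTY    (1ULL << 6)

#define PAGE_TABLE   (PAGE_PRESENT | PAGE_RW | PAGE_USER | \
                      PAGE_ACCESSED | PAGE_DIRTY)
#define KERNPG_TABLE (PAGE_PRESENT | PAGE_RW | PAGE_ACCESSED | PAGE_DIRTY)

/* Physical address of the next-level table plus its access rights. */
static uint64_t make_entry(uint64_t table_virt, uint64_t flags)
{
    return (table_virt - START_KERNEL_MAP) + flags;
}
```

For a table assumed to sit at virtual address 0xffffffff81002000, make_entry with KERNPG_TABLE gives 0x1002063: the physical address in the upper bits and the rights in the low bits. The only difference between the two flag sets is the user bit.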
Now that we know this, let's go back to our code from the beginning of the section. Remember that rbp was computed from the actual physical address of the _text section. We just add it to the base addresses in the page tables so that they hold correct addresses:

    addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
    addq %rbp, level3_kernel_pgt + (510*8)(%rip)
    addq %rbp, level3_kernel_pgt + (511*8)(%rip)
    addq %rbp, level2_fixmap_pgt + (506*8)(%rip)

The first line adds rbp to the early_level4_pgt entry, the second line to the level2_kernel_pgt address, the third line to the level2_fixmap_pgt address and the fourth line to the level1_fixmap_pgt address.

After all of this we will have:

    early_level4_pgt[511]  -> level3_kernel_pgt[0]
    level3_kernel_pgt[510] -> level2_kernel_pgt[0]
    level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
    level2_kernel_pgt[0]   -> 512 MB kernel mapping
    level2_fixmap_pgt[506] -> level1_fixmap_pgt

As we have corrected the base addresses of the page tables, we can start to build them.
Identity mapping setup

Now we can see the setup of the identity mapping early page tables. In identity mapped paging, virtual addresses are mapped to physical addresses that have the same value, 1:1. Let's look at it in detail. First of all we get the rip-relative addresses of _text and early_level4_pgt and put them into the rdi and rbx registers:

    leaq _text(%rip), %rdi
    leaq early_level4_pgt(%rip), %rbx

After this we store the physical address of _text in rax and get the index of the page global directory entry which stores the _text address, by shifting the _text address right by PGDIR_SHIFT:

    movq %rdi, %rax
    shrq $PGDIR_SHIFT, %rax

    leaq (4096 + _KERNPG_TABLE)(%rbx), %rdx
    movq %rdx, 0(%rbx,%rax,8)
    movq %rdx, 8(%rbx,%rax,8)

where PGDIR_SHIFT is 39. PGDIR_SHIFT indicates the position of the page global directory bits in a virtual address. There are macros for all types of page directories:

    #define PGDIR_SHIFT 39
    #define PUD_SHIFT   30
    #define PMD_SHIFT   21
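The way these shift values pick a table index out of an address can be sketched like this. It assumes 512 entries per table (9 index bits per level); the function names are illustrative stand-ins and not the kernel's own helpers.

```c
#include <stdint.h>

#define PGDIR_SHIFT    39
#define PUD_SHIFT      30
#define PMD_SHIFT      21
#define PTRS_PER_TABLE 512   /* 9 index bits per level */

/* Index into the page global directory. */
static unsigned pgd_index_of(uint64_t addr)
{
    return (addr >> PGDIR_SHIFT) & (PTRS_PER_TABLE - 1);
}

/* Index into the page upper directory. */
static unsigned pud_index_of(uint64_t addr)
{
    return (addr >> PUD_SHIFT) & (PTRS_PER_TABLE - 1);
}

/* Index into the page middle directory. */
static unsigned pmd_index_of(uint64_t addr)
{
    return (addr >> PMD_SHIFT) & (PTRS_PER_TABLE - 1);
}
```

For __START_KERNEL_map (0xffffffff80000000) this gives index 511 in the global directory and 510 in the upper directory, which matches the 511 and 510 seen in the page table definitions.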
After this we put the address of the first level3_kernel_pgt into rdx with the _KERNPG_TABLE access rights (see above) and fill early_level4_pgt with two level3_kernel_pgt entries.

After this we add 4096 (the size of early_level4_pgt) to rdx (it now contains the address of the first entry of level3_kernel_pgt) and put rdi (it contains the physical address of _text) into rax. After this we write the addresses of the two page upper directory entries to level3_kernel_pgt:

    addq $4096, %rdx
    movq %rdi, %rax
    shrq $PUD_SHIFT, %rax
    andl $(PTRS_PER_PUD-1), %eax
    movq %rdx, 4096(%rbx,%rax,8)
    incl %eax
    andl $(PTRS_PER_PUD-1), %eax
    movq %rdx, 4096(%rbx,%rax,8)

In the next step we write the addresses of the page middle directory entries to level2_kernel_pgt, and the last step is correcting the kernel text+data virtual addresses:

        leaq level2_kernel_pgt(%rip), %rdi
        leaq 4096(%rdi), %r8
    1:  testq $1, 0(%rdi)
        jz 2f
        addq %rbp, 0(%rdi)
    2:  addq $8, %rdi
        cmp %r8, %rdi
        jne 1b

Here we put the address of level2_kernel_pgt into rdi and the address 4096 bytes later, the end of the table, into the r8 register. Next we check the present bit of the current level2_kernel_pgt entry: if it is set, we add rbp to the entry; if it is zero, we skip the entry. Either way we move to the next entry by adding 8 bytes to rdi. After this we compare rdi with r8 and go back to label 1 or move forward.
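The loop above translates almost line by line into C. This is a sketch only: fixup_present is a hypothetical name, the table is an ordinary array instead of level2_kernel_pgt, and delta plays the role of rbp.

```c
#include <stddef.h>
#include <stdint.h>

#define ENTRIES_PER_TABLE 512    /* 4096 bytes / 8 bytes per entry */
#define ENTRY_PRESENT     1ULL   /* bit 0, tested by "testq $1" */

/*
 * Add delta to every present entry of a 512-entry table.
 * Returns how many entries were adjusted.
 */
static size_t fixup_present(uint64_t *table, uint64_t delta)
{
    size_t fixed = 0;

    for (size_t i = 0; i < ENTRIES_PER_TABLE; i++) {
        if (table[i] & ENTRY_PRESENT) {   /* skip empty entries */
            table[i] += delta;
            fixed++;
        }
    }
    return fixed;
}
```

Entries without the present bit stay zero, so unused slots are never turned into bogus mappings.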
In the next step we adjust phys_base with rbp (computed above from the physical address of _text), put the physical address of early_level4_pgt into rax and jump to label 1:

    addq %rbp, phys_base(%rip)
    movq $(early_level4_pgt - __START_KERNEL_map), %rax
    jmp 1f

where phys_base matches the first entry of level2_kernel_pgt, which is the 512 MB kernel mapping.

Last preparations

After we jumped to label 1 we enable PAE and PGE (Paging Global Extension), add phys_base (see above) to the rax register and fill the cr3 register with it:

    1:
        movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
        movq %rcx, %cr4

        addq phys_base(%rip), %rax
        movq %rax, %cr3
In the next step we check that the CPU supports the NX bit with:

    movl $0x80000001, %eax
    cpuid
    movl %edx, %edi

We put the 0x80000001 value into eax and execute the cpuid instruction to get the extended processor info and feature bits. The result will be in the edx register, which we copy into edi.

Now we put 0xc0000080, or MSR_EFER, into ecx and call the rdmsr instruction to read the model specific register.

    movl $MSR_EFER, %ecx
    rdmsr

The result will be in edx:eax. The general view of EFER is the following:

    63                                                                              32
     --------------------------------------------------------------------------------
    |                                                                                |
    |                                Reserved MBZ                                    |
    |                                                                                |
     --------------------------------------------------------------------------------
    31                            16  15      14      13   12  11   10  9  8 7  1   0
     --------------------------------------------------------------------------------
    |                              | T |       |       |    |   |   |   |   |   |   |
    | Reserved MBZ                 | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE|
    |                              | E |       |       |    |   |   |   |   |   |   |
     --------------------------------------------------------------------------------

We will not look at all fields in detail here, but we will learn about this and other MSRs in a special part. Having read EFER into edx:eax, we set the _EFER_SCE bit (bit zero), which is System Call Extensions, to one with the btsl instruction. By setting the SCE bit we enable the SYSCALL and SYSRET instructions. In the next step we check the 20th bit in edi; remember that this register stores the result of cpuid (see above). If bit 20 (the NX bit) is not set, we jump straight to label 1 and write EFER with only SCE added to the model specific register.

        btsl $_EFER_SCE, %eax
        btl $20, %edi
        jnc 1f
        btsl $_EFER_NX, %eax
        btsq $_PAGE_BIT_NX, early_pmd_flags(%rip)
    1:  wrmsr

If the NX bit is supported we also set _EFER_NX and write it too, with the wrmsr instruction.
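The bit manipulation on the low half of EFER can be modelled in C. This sketch leaves out the btsq on early_pmd_flags and the actual rdmsr/wrmsr; efer_low_after_setup is a hypothetical name, and the bit numbers follow the EFER layout shown above (SCE is bit 0, NXE is bit 11).

```c
#include <stdint.h>

#define EFER_SCE_BIT     0    /* System Call Extensions */
#define EFER_NX_BIT      11   /* No-Execute Enable */
#define CPUID_EDX_NX_BIT 20   /* NX support in cpuid 0x80000001, edx */

/*
 * Returns the value that would be written back to the low 32 bits
 * of EFER, given the value read and the cpuid feature bits.
 */
static uint32_t efer_low_after_setup(uint32_t efer_low, uint32_t cpuid_edx)
{
    efer_low |= 1u << EFER_SCE_BIT;             /* btsl $_EFER_SCE, %eax */
    if (cpuid_edx & (1u << CPUID_EDX_NX_BIT))   /* btl $20, %edi ; jnc 1f */
        efer_low |= 1u << EFER_NX_BIT;          /* btsl $_EFER_NX, %eax */
    return efer_low;
}
```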
In the next step we need to update the Global Descriptor Table with the lgdt instruction:

    lgdt early_gdt_descr(%rip)

where the Global Descriptor Table is defined as:

    early_gdt_descr:
        .word GDT_ENTRIES*8-1
    early_gdt_descr_base:
        .quad INIT_PER_CPU_VAR(gdt_page)

We need to reload the Global Descriptor Table because now the kernel works in the userspace addresses, but soon the kernel will work in its own space. Now let's look at the early_gdt_descr definition. The Global Descriptor Table contains 32 entries:

    #define GDT_ENTRIES 32

for kernel code, data, thread local storage segments, etc. It's simple. Now let's look at early_gdt_descr_base. First of all, gdt_page is defined as:

    struct gdt_page {
        struct desc_struct gdt[GDT_ENTRIES];
    } __attribute__((aligned(PAGE_SIZE)));

in arch/x86/include/asm/desc.h. It contains one field, gdt, which is an array of desc_struct structures, defined as:

    struct desc_struct {
        union {
            struct {
                unsigned int a;
                unsigned int b;
            };
            struct {
                u16 limit0;
                u16 base0;
                unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
                unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
            };
        };
    } __attribute__((packed));

and represents the GDT descriptor familiar to us. Also we can note that the gdt_page structure is aligned to PAGE_SIZE, which is 4096 bytes. It means that gdt will occupy one page. Now let's try to understand what INIT_PER_CPU_VAR is. INIT_PER_CPU_VAR is a macro which is defined in arch/x86/include/asm/percpu.h and just concatenates init_per_cpu__ with the given parameter:

    #define INIT_PER_CPU_VAR(var) init_per_cpu__##var

After this we have init_per_cpu__gdt_page. We can see in the linker script:

    #define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
    INIT_PER_CPU(gdt_page);

As we got init_per_cpu__gdt_page from INIT_PER_CPU_VAR, and the INIT_PER_CPU macro from the linker script is expanded, we get the offset from __per_cpu_load. After these calculations, we have the correct base address of the new GDT.
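The token pasting done by INIT_PER_CPU_VAR and the limit value in early_gdt_descr are easy to try out in user space. In this sketch the symbol init_per_cpu__gdt_page is a hypothetical stand-in (an ordinary int rather than something produced by the linker script), and read_through_macro and gdt_limit are illustrative helper names.

```c
#define GDT_ENTRIES 32
#define INIT_PER_CPU_VAR(var) init_per_cpu__##var

/* Hypothetical stand-in for the symbol the linker script would define. */
static const int init_per_cpu__gdt_page = 42;

static int read_through_macro(void)
{
    /* The preprocessor turns this into: return init_per_cpu__gdt_page; */
    return INIT_PER_CPU_VAR(gdt_page);
}

/* Limit field of early_gdt_descr: table size in bytes minus one. */
static unsigned gdt_limit(void)
{
    return GDT_ENTRIES * 8 - 1;
}
```

With 32 descriptors of 8 bytes each, the limit comes out as 255.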
Generally per-CPU variables are a 2.6 kernel feature. You can understand what they are from the name. When we create a per-CPU variable, each CPU will have its own copy of this variable. Here we are creating the gdt_page per-CPU variable. There are many advantages to variables of this type, for example no locks are needed because each CPU works with its own copy of the variable. So every core on a multiprocessor will have its own GDT table, and every entry in the table will represent a memory segment which can be accessed from the thread running on that core. You can read about per-CPU variables in detail in the Theory/per-cpu post.

As we loaded the new Global Descriptor Table, we reload the segments as we did every time:

    xorl %eax,%eax
    movl %eax,%ds
    movl %eax,%ss
    movl %eax,%es
    movl %eax,%fs
    movl %eax,%gs

After all of these steps we set up the gs register so that it points to the irqstack (we will see information about it in the next parts):

    movl $MSR_GS_BASE,%ecx
    movl initial_gs(%rip),%eax
    movl initial_gs+4(%rip),%edx
    wrmsr

where MSR_GS_BASE is:

    #define MSR_GS_BASE 0xc0000101

We need to put MSR_GS_BASE into the ecx register and load the data from eax and edx (which hold the two halves of initial_gs) with the wrmsr instruction. We don't use the cs, ds, es and ss segment registers for addressing in 64-bit mode, but the fs and gs registers can be used. fs and gs have a hidden part (as we saw in real mode for cs) and this part contains a descriptor which is mapped to model specific registers. So we can see above that 0xc0000101 is the gs.base MSR address.
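wrmsr takes the 64-bit value split across edx:eax, which is why initial_gs is loaded with two 32-bit moves. A minimal sketch of that split and its inverse; struct msr_halves and both function names are hypothetical, and the sample value in the note below is arbitrary.

```c
#include <stdint.h>

/* The two registers wrmsr reads the value from. */
struct msr_halves {
    uint32_t eax;   /* low 32 bits,  like "movl initial_gs(%rip),%eax" */
    uint32_t edx;   /* high 32 bits, like "movl initial_gs+4(%rip),%edx" */
};

static struct msr_halves split_msr_value(uint64_t value)
{
    struct msr_halves h;

    h.eax = (uint32_t)value;
    h.edx = (uint32_t)(value >> 32);
    return h;
}

static uint64_t join_msr_value(struct msr_halves h)
{
    return ((uint64_t)h.edx << 32) | h.eax;
}
```

Splitting 0xffff880012345678, for example, puts 0x12345678 into eax and 0xffff8800 into edx.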
In the next step we put the address of the real mode bootparam structure into rdi (remember rsi holds a pointer to this structure from the start) and jump to the C code with:

    movq initial_code(%rip),%rax
    pushq $0
    pushq $__KERNEL_CS
    pushq %rax
    lretq

Here we put the address of initial_code into rax and push a fake address, __KERNEL_CS and the address of initial_code onto the stack. After this we can see the lretq instruction, which means that the return address will be extracted from the stack (now it is the address of initial_code) and we jump there. initial_code is defined in the same source code file and looks like this:

        __REFDATA
        .balign 8
        GLOBAL(initial_code)
        .quad x86_64_start_kernel
        ...
        ...
        ...

As we can see, initial_code contains the address of x86_64_start_kernel, which is defined in arch/x86/kernel/head64.c and looks like this:

    asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
        ...
        ...
        ...
    }

It has one argument, real_mode_data (remember that we passed the address of the real mode data in the rdi register previously).

This is the first C code in the kernel!

We need to see the last preparations before we can see the "kernel entry point" - the start_kernel function from init/main.c.

First of all we can see some checks in the x86_64_start_kernel function:

    BUILD_BUG_ON(MODULES_VADDR ... __START_KERNEL));
    BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)));
    BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) ...
After this we clear the bss, from __bss_start to __bss_stop, and the next step will be the setup of the early IDT handlers, but it's a big topic so we will see it in the next part.

Conclusion

This is the end of the first part about linux kernel initialization.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

In the next part we will see the initialization of the early interrupt handlers, kernel space memory mapping and much more.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

* Model Specific Register
* Paging
* Previous part - Kernel decompression
* NX
* ASLR
In the previous part we stopped before the setting of early interrupt handlers. We continue in this part and will learn more about interrupt and exception handling.

Remember that we stopped before the following loop:

    for (i = 0; i ...

        ... 0xFF);                                               \
        _set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0,         \
                  __KERNEL_CS);                                  \
        _trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr, \
                  0, 0, __KERNEL_CS);                            \
    } while (0)

First of all it checks with the BUG_ON macro that the passed interrupt number is not greater than 255. We need this check because we can have only 256 interrupts. After this it calls _set_gate, which writes the address of an interrupt gate to the IDT:

    static inline void _set_gate(int gate, unsigned type, void *addr,
                                 unsigned dpl, unsigned ist, unsigned seg)
    {
        gate_desc s;
        pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
        write_idt_entry(idt_table, gate, &s);
        write_trace_idt_entry(gate, &s);
    }

At the start of the _set_gate function we can see the call of the pack_gate function, which fills the gate_desc structure with the given values:

    static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
                                 unsigned dpl, unsigned ist, unsigned seg)
    {
        gate->offset_low    = PTR_LOW(func);
        gate->segment       = __KERNEL_CS;
        gate->ist           = ist;
        gate->p             = 1;
        gate->dpl           = dpl;
        gate->zero0         = 0;
        gate->zero1         = 0;
        gate->type          = type;
        gate->offset_middle = PTR_MIDDLE(func);
        gate->offset_high   = PTR_HIGH(func);
    }

As mentioned above, we fill the gate descriptor in this function. We fill the three parts of the address of the interrupt handler with the address which we got in the main loop (the address of the interrupt handler entry point). We use the three following macros to split the address into three parts:

    #define PTR_LOW(x)    ((unsigned long long)(x) & 0xFFFF)
    #define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
    #define PTR_HIGH(x)   ((unsigned long long)(x) >> 32)

With the first PTR_LOW macro we get the first 2 bytes of the address, with the second PTR_MIDDLE we get the second 2 bytes of the address and with the third PTR_HIGH macro we get the last 4 bytes of the address. Next we set up the segment selector for the interrupt handler; it will be our kernel code segment, __KERNEL_CS. In the next step we fill the Interrupt Stack Table and Descriptor Privilege Level (highest privilege level) fields with zeros. And we set the GATE_INTERRUPT type in the end.
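To check how the three macros cut a handler address into pieces, they can be applied to a sample value and the pieces glued back together. struct gate_offsets and the two helper functions are hypothetical; only the three macros come from the text, and the address in the note below is an arbitrary kernel-looking value.

```c
#include <stdint.h>

#define PTR_LOW(x)    ((unsigned long long)(x) & 0xFFFF)
#define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
#define PTR_HIGH(x)   ((unsigned long long)(x) >> 32)

/* Only the three offset fields of a gate descriptor. */
struct gate_offsets {
    uint16_t low;
    uint16_t middle;
    uint32_t high;
};

static struct gate_offsets split_handler(uint64_t addr)
{
    struct gate_offsets g;

    g.low    = (uint16_t)PTR_LOW(addr);
    g.middle = (uint16_t)PTR_MIDDLE(addr);
    g.high   = (uint32_t)PTR_HIGH(addr);
    return g;
}

/* What the CPU effectively does when it reads the gate. */
static uint64_t join_handler(struct gate_offsets g)
{
    return ((uint64_t)g.high << 32) | ((uint64_t)g.middle << 16) | g.low;
}
```

For the address 0xffffffff81234567 the parts are 0x4567, 0x8123 and 0xffffffff, and joining them gives the original address again.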
Now we have filled the IDT entry and we can call the native_write_idt_entry function, which just copies the filled IDT entry to the IDT:

    static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
    {
        memcpy(&idt[entry], gate, sizeof(*gate));
    }

After the main loop has finished, we have a filled idt_table array of gate_desc structures and we can load the IDT with:

    load_idt((const struct desc_ptr *)&idt_descr);

where idt_descr is:

    struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };

and load_idt just executes the lidt instruction:

    asm volatile("lidt %0"::"m" (*dtr));

You can note that there are calls of the _trace_* functions in _set_gate and other functions. These functions fill IDT gates in the same manner as _set_gate, with one difference: they use the trace_idt_table Interrupt Descriptor Table instead of idt_table for tracepoints (we will cover this topic in another part).

Okay, now we have filled and loaded the Interrupt Descriptor Table, and we know how the CPU acts during an interrupt. So now it is time to deal with interrupt handlers.

As you can read above, we filled the IDT with the addresses of early_idt_handlers. We can find it in arch/x86/kernel/head_64.S:

        .globl early_idt_handlers
    early_idt_handlers:
        i = 0
        .rept NUM_EXCEPTION_VECTORS
        .if (EXCEPTION_ERRCODE_MASK >> i) & 1
        ASM_NOP2
        .else
        pushq $0
        .endif
        pushq $i
        jmp early_idt_handler
        i = i + 1
        .endr

We can see here the generation of interrupt handlers for the first 32 exceptions. We check whether the exception has an error code: if it does, we do nothing; if the exception does not return an error code, we push zero onto the stack. We do it so that the stack is uniform. After that we push the exception number onto the stack and jump to early_idt_handler, which is the generic interrupt handler for now. As I wrote above, the CPU pushes the flags register, CS and RIP onto the stack. So before early_idt_handler is executed, the stack will contain the following data:

    |--------------------|
    | %rflags            |
    | %cs                |
    | %rip               |
    | rsp --> error code |
    |--------------------|
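The ".if (EXCEPTION_ERRCODE_MASK >> i) & 1" test in the stub generator is a plain bitmask lookup. The sketch below assumes the mask value 0x00027d00, which marks vectors 8, 10, 11, 12, 13, 14 and 17; that value is not given in the text, so treat it as an assumption. cpu_pushes_error_code is a hypothetical helper name.

```c
/* Assumed mask value: bits 8, 10, 11, 12, 13, 14 and 17 are set. */
#define EXCEPTION_ERRCODE_MASK 0x00027d00u
#define NUM_EXCEPTION_VECTORS  32

/*
 * Non-zero when the CPU itself pushes an error code for this vector,
 * so the stub must not push a dummy zero.
 */
static int cpu_pushes_error_code(unsigned vector)
{
    if (vector >= NUM_EXCEPTION_VECTORS)
        return 0;
    return (EXCEPTION_ERRCODE_MASK >> vector) & 1u;
}
```

Page fault (vector 14) and general protection (vector 13) come with an error code, while NMI (vector 2) and breakpoint (vector 3) get the dummy zero.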
Early interrupts handlers

Now let's look at the early_idt_handler implementation. It is located in the same arch/x86/kernel/head_64.S. First of all we can see a check for NMI; we do not need to handle it, so we just ignore it in early_idt_handler:
    cmpl $2,(%rsp)
    je is_nmi

where is_nmi:

    is_nmi:
        addq $16,%rsp
        INTERRUPT_RETURN

we drop the error code and vector number from the stack and call INTERRUPT_RETURN, which is just iretq. Once we have checked the vector number and it is not NMI, we check early_recursion_flag to prevent recursion in early_idt_handler, and if everything is fine we save the general registers on the stack:

    pushq %rax
    pushq %rcx
    pushq %rdx
    pushq %rsi
    pushq %rdi
    pushq %r8
    pushq %r9
    pushq %r10
    pushq %r11

We need to do this to prevent wrong values in them when we return from the interrupt handler. After this we check the segment selector on the stack:

    cmpl $__KERNEL_CS,96(%rsp)
    jne 11f

It must be equal to the kernel code segment, and if it is not we jump to label 11, which prints a PANIC message and makes a stack dump.

After the code segment was checked, we check the vector number, and if it is #PF, we put the value from cr2 into the rdi register and call early_make_pgtable (we'll see it soon):

    cmpl $14,72(%rsp)
    jnz 10f
    GET_CR2_INTO(%rdi)
    call early_make_pgtable
    andl %eax,%eax
    jz 20f
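The offsets 72 and 96 in these comparisons follow from the push order. One way to convince yourself is to write the stack contents down as a C struct, lowest address first, and ask the compiler for the field offsets. struct early_frame is a hypothetical layout built only from the pushes shown above: nine saved registers, then the vector number pushed by the stub, the error code, and the frame pushed by the CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Stack as seen from %rsp after the nine pushq instructions. */
struct early_frame {
    /* saved by early_idt_handler; r11 was pushed last, so it is at 0(%rsp) */
    uint64_t r11, r10, r9, r8, rdi, rsi, rdx, rcx, rax;
    /* pushed by the per-vector stub */
    uint64_t vector;
    /* pushed by the CPU, or the dummy zero from the stub */
    uint64_t error_code;
    /* pushed by the CPU */
    uint64_t rip, cs, rflags;
};

static size_t vector_offset(void)
{
    return offsetof(struct early_frame, vector);
}

static size_t cs_offset(void)
{
    return offsetof(struct early_frame, cs);
}
```

Nine registers of 8 bytes each put the vector number at offset 72, and the saved code segment follows at offset 96, matching the two cmpl instructions.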
If the vector number is not #PF, we restore the general purpose registers from the stack:

    popq %r11
    popq %r10
    popq %r9
    popq %r8
    popq %rdi
    popq %rsi
    popq %rdx
    popq %rcx
    popq %rax

and exit from the handler with iret.

This is the end of the first interrupt handler. Note that it is a very early interrupt handler, so it handles only Page Fault for now. We will see handlers for the other interrupts later, but now let's look at the page fault handler.
Page fault handling

In the previous paragraph we saw the first early interrupt handler, which checks the interrupt number for page fault and calls early_make_pgtable to build new page tables if it is. We need to have the #PF handler at this step because there are plans to add the ability to load the kernel above 4G and to access the boot_params structure above 4G.

You can find the implementation of early_make_pgtable in arch/x86/kernel/head64.c; it takes one parameter - the address from the cr2 register which caused the Page Fault. Let's look at it:

    int __init early_make_pgtable(unsigned long address)
    {
        unsigned long physaddr = address - __PAGE_OFFSET;
        unsigned long i;
        pgdval_t pgd, *pgd_p;
        pudval_t pud, *pud_p;
        pmdval_t pmd, *pmd_p;
        ...
        ...
        ...
    }

It starts with the definition of some variables which have *val_t types. All of these types are just:

    typedef unsigned long pgdval_t;

Also we will operate with the *_t (not val) types, for example pgd_t, etc. All of these types are defined in arch/x86/include/asm/pgtable_types.h and represent structures like this:

    typedef struct { pgdval_t pgd; } pgd_t;

For example,

    extern pgd_t early_level4_pgt[PTRS_PER_PGD];

Here early_level4_pgt represents the early top-level page table directory, which consists of an array of pgd_t types, and pgd points to low-level page entries.

After we have checked that the address is not invalid, we get the address of the Page Global Directory entry which contains the #PF address and put its value into the pgd variable:

    pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
    pgd = *pgd_p;

In the next step we check pgd: if it contains a correct page global directory entry, we take the physical address of the page global directory entry and put it into pud_p with:

    pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);

where PTE_PFN_MASK is a macro:

    #define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK)

which expands to:

    (~(PAGE_SIZE-1)) & ((1 ...
This is the end of the second part about linux kernel internals. If you have questions or suggestions, ping me on twitter 0xAX, drop me an email or just create an issue. In the next part we will see all steps before the kernel entry point - the start_kernel function.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-internals.

Links

* GNU assembly .rept
* APIC
* NMI
* Previous part
Kernel initialization. Part 3.
Last preparations before the kernel entry point
This is the third part of the Linux kernel initialization process series. In the previous part we saw early interrupt and exception handling, and in this part we will continue to dive into the Linux kernel initialization process. Our next point is the 'kernel entry point' - the start_kernel function from the init/main.c source code file. Yes, technically it is not the kernel's entry point but the start of the generic kernel code which does not depend on a certain architecture. But before we see the call of the start_kernel function, we must do some preparations. So let's continue.
boot_params again
In the previous part we stopped at setting up the Interrupt Descriptor Table and loading it into the IDTR register. The next step after this is a call of the copy_bootdata function:
copy_bootdata(__va(real_mode_data));
This function takes one argument - the virtual address of real_mode_data. Remember that we passed the address of the boot_params structure from arch/x86/include/uapi/asm/bootparam.h to the x86_64_start_kernel function as its first argument in arch/x86/kernel/head_64.S:
/* rsi is pointer to real mode structure with interesting info.
   pass it to C */
movq %rsi, %rdi
Now let's look at the __va macro. This macro is defined in arch/x86/include/asm/page.h:
#define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET))
where PAGE_OFFSET is __PAGE_OFFSET, which is 0xffff880000000000, the base virtual address of the direct mapping of all physical memory. So we get the virtual address of the boot_params structure and pass it to the copy_bootdata function, where we copy real_mode_data to boot_params, which is declared in arch/x86/include/asm/setup.h:
extern struct boot_params boot_params;
Let's look at the copy_bootdata implementation:
static void __init copy_bootdata(char *real_mode_data)
{
	char *command_line;
	unsigned long cmd_line_ptr;

	memcpy(&boot_params, real_mode_data, sizeof boot_params);
	sanitize_boot_params(&boot_params);
	cmd_line_ptr = get_cmd_line_ptr();
	if (cmd_line_ptr) {
		command_line = __va(cmd_line_ptr);
		memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
	}
}
First of all, note that this function is declared with the __init prefix. It means that this function is used only during initialization and the memory it occupies will be freed afterwards.
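To make the __va conversion used in the copy_bootdata call more concrete, here is a small userspace sketch of the arithmetic. It assumes the fixed direct-mapping base quoted above. The sketch_ names are hypothetical, and the inverse helper is added only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Base of the direct mapping quoted in the text (assumed fixed for this sketch). */
#define SKETCH_PAGE_OFFSET 0xffff880000000000ULL

/* Physical address -> virtual address inside the direct mapping. */
static uint64_t sketch_va(uint64_t phys)
{
    return phys + SKETCH_PAGE_OFFSET;
}

/* Inverse conversion: direct-mapping virtual address -> physical address. */
static uint64_t sketch_pa(uint64_t virt)
{
    /* Only meaningful for addresses inside the direct mapping. */
    assert(virt >= SKETCH_PAGE_OFFSET);
    return virt - SKETCH_PAGE_OFFSET;
}
```

So a physical address like 0x10000 would be reachable at 0xffff880000010000 under this assumption.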
We can see the declaration of two variables for the kernel command line and the copying of real_mode_data to boot_params with the memcpy function. Next comes the call of the sanitize_boot_params function, which clears some fields of the boot_params structure, like ext_ramdisk_image and so on, to cope with bootloaders which fail to initialize unknown fields in boot_params to zero. After this we get the address of the command line with the call of the get_cmd_line_ptr function:
unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr
-
	.p2align 4
.Lloop:
	decl %ecx
#define PUT(x) movq %rax,x*8(%rdi)
	movq %rax,(%rdi)
	PUT(1)
	PUT(2)
	PUT(3)
	PUT(4)
	PUT(5)
	PUT(6)
	PUT(7)
	leaq 64(%rdi),%rdi
	jnz .Lloop
	nop
	ret
	CFI_ENDPROC
.Lclear_page_end:
ENDPROC(clear_page)
As you can understand from the function name, it clears a page table, i.e. fills it with zeros. First of all note that this function starts with CFI_STARTPROC and ends with CFI_ENDPROC, which expand to GNU assembly directives:
#define CFI_STARTPROC .cfi_startproc
#define CFI_ENDPROC .cfi_endproc
and are used for debugging. After the CFI_STARTPROC macro we zero out the eax register and put 64 into ecx (it will be the counter). Next we can see the loop which starts with the .Lloop label, and it begins with the ecx decrement. After that we put zero from the rax register to the address in rdi, which now contains the base address of init_level4_pgt, and do the same procedure seven more times, each time increasing the offset from rdi by 8. After this we will have the first 64 bytes of init_level4_pgt filled with zeros. In the next step we put the address of init_level4_pgt with a 64-byte offset into rdi again and repeat all operations until ecx is zero. In the end we will have init_level4_pgt filled with zeros.
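The loop described above can be sketched in plain C. This is only an illustration of the same counting scheme (64 iterations, 8 stores of 8 bytes each), not the kernel's implementation. The sketch_ names are hypothetical and the 4096-byte page size is an assumption:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096

/* 64 iterations x 8 stores x 8 bytes must cover exactly one page. */
static_assert(64 * 8 * sizeof(uint64_t) == SKETCH_PAGE_SIZE,
              "loop bounds must cover one page");

/* Zero one page the way the assembly loop does. */
static void sketch_clear_page(uint64_t *page)
{
    assert(page != NULL);

    for (int count = 64; count != 0; count--) {
        for (int slot = 0; slot < 8; slot++)
            page[slot] = 0;   /* movq %rax, slot*8(%rdi) */
        page += 8;            /* leaq 64(%rdi), %rdi */
    }
}
```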
As we have init_level4_pgt filled with zeros, we set the last init_level4_pgt entry to the kernel high mapping with:
init_level4_pgt[511] = early_level4_pgt[511];
Remember that we dropped all early_level4_pgt entries in the reset_early_page_table function and kept only the kernel high mapping there.
The last step in the x86_64_start_kernel function is the call of:
x86_64_start_reservations(real_mode_data);
with real_mode_data as the argument. The x86_64_start_reservations function is defined in the same source code file as the x86_64_start_kernel function and looks like this:
void __init x86_64_start_reservations(char *real_mode_data)
{
	if (!boot_params.hdr.version)
		copy_bootdata(__va(real_mode_data));

	reserve_ebda_region();

	start_kernel();
}
You can see that it is the last function before we reach the kernel entry point - the start_kernel function. Let's look at what it does and how it works.
First of all we can see in the x86_64_start_reservations function the check of boot_params.hdr.version:
if (!boot_params.hdr.version)
	copy_bootdata(__va(real_mode_data));
and if it is zero, we call the copy_bootdata function again with the virtual address of real_mode_data (see its implementation above).
In the next step we can see the call of the reserve_ebda_region function, which is defined in arch/x86/kernel/head.c. This function reserves a memory block for the EBDA, or Extended BIOS Data Area. The Extended BIOS Data Area is located at the top of conventional memory and contains data about ports, disk parameters and so on.
Let's look at the reserve_ebda_region function. It starts by checking whether paravirtualization is enabled or not:
if (paravirt_enabled())
	return;
We exit from the reserve_ebda_region function if paravirtualization is enabled, because in that case the extended BIOS data area is absent. In the next step we need to get the end of the low memory:
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem
-
}
only with one difference: we pass an argument of the phys_addr_t type, which depends on CONFIG_PHYS_ADDR_T_64BIT:
#ifdef CONFIG_PHYS_ADDR_T_64BIT
typedef u64 phys_addr_t;
#else
typedef u32 phys_addr_t;
#endif
This configuration option is enabled on x86_64, so phys_addr_t is u64 there. After we have got the virtual address of the location which stores the segment of the extended BIOS data area, we read the segment, shift it left by 4 and return. After this the ebda_addr variable contains the base address of the extended BIOS data area.
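The shift by 4 is the usual real-mode conversion from a segment value to a physical address, because segments are counted in 16-byte units. A minimal sketch of just that step, where the function name and the sample segment value used below are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode segment -> physical address: one segment unit is 16 bytes. */
static uint32_t sketch_ebda_base(uint16_t ebda_segment)
{
    uint32_t address = ebda_segment;

    address <<= 4;
    /* A 16-bit segment can only point below 1 megabyte. */
    assert(address < 0x100000);
    return address;
}
```

For example, a segment value of 0x9fc0 would give the base address 0x9fc00.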
In the next step we check that the address of the extended BIOS data area and of the low memory is not less than the INSANE_CUTOFF macro:
if(ebda_addrregions[0].size==0){WARN_ON(type->cnt!=1||type->total_size);type->regions[0].base=base;type->regions[0].size=size;type->regions[0].flags=flags;memblock_set_region_node(&type->regions[0],nid);type->total_size=size;return0;}
After we have filled our region, we can see the call of the memblock_set_region_node function with two parameters:
address of the filled memory region;
NUMA node id.
where our regions are represented by the memblock_region structure:
struct memblock_region {
	phys_addr_t base;
	phys_addr_t size;
	unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
	int nid;
#endif
};
The NUMA node id depends on the MAX_NUMNODES macro, which is defined in include/linux/numa.h:
#defineMAX_NUMNODES(1
-
The memblock_set_region_node function just fills the nid field of memblock_region with the given value:
static inline void memblock_set_region_node(struct memblock_region *r, int nid)
{
	r->nid = nid;
}
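Putting the pieces above together, filling the first (still empty) region could be sketched as follows. This is a simplified userspace model with hypothetical sketch_ types. The fixed array of four regions, the return values and the sample addresses in the usage are my assumptions, not the kernel's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sketch_phys_addr_t;

struct sketch_region {
    sketch_phys_addr_t base;
    sketch_phys_addr_t size;
    unsigned long flags;
    int nid;                        /* NUMA node id */
};

struct sketch_type {
    unsigned long cnt;              /* 1 while only the empty placeholder exists */
    sketch_phys_addr_t total_size;
    struct sketch_region regions[4];
};

static void sketch_set_region_node(struct sketch_region *r, int nid)
{
    r->nid = nid;
}

/* Returns 0 if the empty placeholder was filled, -1 if a region already exists. */
static int sketch_add_first_region(struct sketch_type *type,
                                   sketch_phys_addr_t base,
                                   sketch_phys_addr_t size,
                                   int nid, unsigned long flags)
{
    assert(type != NULL);

    if (type->regions[0].size != 0)
        return -1;

    type->regions[0].base = base;
    type->regions[0].size = size;
    type->regions[0].flags = flags;
    sketch_set_region_node(&type->regions[0], nid);
    type->total_size = size;
    return 0;
}
```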
After this we will have the first reserved memblock for the extended BIOS data area in the .meminit.data section. The reserve_ebda_region function has finished its work at this step and we can go back to arch/x86/kernel/head64.c.
We have finished all preparations before the kernel entry point! The last step in the x86_64_start_reservations function is the call of the:
start_kernel()
function from the init/main.c file.
Conclusion
That's all for this part.
It is the end of the third part about Linux kernel internals. In the next part we will see the first initialization steps in the kernel entry point - the start_kernel function. It will be the first step before we see the launch of the first init process.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-internals.
Links
BIOS data area
What is in the extended BIOS data area on a PC?
Previous part
Kernel initialization. Part 4.
Kernel entry point
If you have read the previous part - Last preparations before the kernel entry point, you may remember that we finished all pre-initialization stuff and stopped right before the call of the start_kernel function from init/main.c. The start_kernel function is the entry of the generic and architecture-independent kernel code, although we will return to the arch/ folder many times. If you look inside the start_kernel function, you will see that this function is very big. At the moment it contains about 86 function calls. Yes, it's very big, and of course this part will not cover all the processes which occur in this function. In the current part we will only start to do it. This part and all the following parts of the Kernel initialization process chapter will cover it.
The main purpose of start_kernel is to finish the kernel initialization process and launch the first init process. Before the first process is started, start_kernel must do many things, such as: enable the lock validator, initialize the processor id, enable the early cgroups subsystem, set up per-cpu areas, initialize different caches in vfs, initialize the memory manager, rcu, vmalloc, scheduler, IRQs, ACPI and many many more. Only after these steps will we see the launch of the first init process in the last part of this chapter. So much kernel code awaits us, let's start.
NOTE: All parts of this big chapter, Linux Kernel initialization process, will not cover anything about debugging. There will be a separate chapter about kernel debugging tips.
A little about function attributes
As I wrote above, the start_kernel function is defined in init/main.c. This function is defined with the __init attribute and, as you may already know from other parts, all functions which are defined with this attribute are necessary only during kernel initialization.
#define __init __section(.init.text) __cold notrace
After the initialization process is finished, the kernel will release these sections with the call of the free_initmem function. Note also that __init is defined with two attributes: __cold and notrace. The purpose of the first, cold, attribute is to mark the function as rarely used, so that the compiler will optimize it for size. The second, notrace, is defined as:
#define notrace __attribute__((no_instrument_function))
where no_instrument_function tells the compiler not to generate profiling function calls.
In the definition of the start_kernel function, you can also see the __visible attribute which expands to:
#define __visible __attribute__((externally_visible))
where externally_visible tells the compiler that something uses this function or variable, to prevent marking this function/variable as unusable. You can find the definition of this and other macro attributes in include/linux/init.h.
First steps in the start_kernel
At the beginning of start_kernel you can see the definition of two variables:
char *command_line;
char *after_dashes;
The first represents a pointer to the kernel command line and the second will contain the result of the parse_args function, which parses an input string with parameters in the form name=value, looking for specific keywords and invoking the right handlers. We will not go into details related to these two variables at this time, but will see them in the next parts. In the next step we can see the call of the:
lockdep_init();
function. lockdep_init initializes the lock validator. Its implementation is pretty easy: it just initializes two list_head hashes and sets the global variable lockdep_initialized to 1. The lock validator detects circular lock dependencies and is called when any spinlock or mutex is acquired.
The next function is set_task_stack_end_magic, which takes the address of init_task and sets STACK_END_MAGIC (0x57AC6E9D) as a canary for it. init_task represents the initial task structure:
struct task_struct init_task = INIT_TASK(init_task);
where the task_struct structure stores all the information about a process. I will not give the definition of this structure in this book, because it's very big. You can find its definition in include/linux/sched.h. At the moment task_struct contains more than 100 fields! Although you will not see the definition of task_struct in this book, we will use it very often, since it is the fundamental structure which describes a process in the Linux kernel. I will describe the meaning of the fields of this structure as we meet them in practice.
You can see the definition of init_task and that it is initialized by the INIT_TASK macro. This macro is from include/linux/init_task.h and it just fills init_task with the values for the first process. For example it sets:
init process state to zero, or runnable. A runnable process is one which is waiting only for a CPU to run on;
init process flags - PF_KTHREAD, which means kernel thread;
a list of runnable tasks;
process address space;
init process stack to &init_thread_info, which is init_thread_union.thread_info; init_thread_union has the type thread_union, which contains thread_info and the process stack:
union thread_union {
	struct thread_info thread_info;
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};
Every process has its own stack, and it is 16 kilobytes, or 4 page frames, on x86_64. We can note that it is defined as an array of unsigned long. The next field of thread_union is thread_info, defined as:
struct thread_info {
	struct task_struct *task;
	struct exec_domain *exec_domain;
	__u32 flags;
	__u32 status;
	__u32 cpu;
	int saved_preempt_count;
	mm_segment_t addr_limit;
	struct restart_block restart_block;
	void __user *sysenter_return;
	unsigned int sig_on_uaccess_error:1;
	unsigned int uaccess_err:1;
};
and occupies 52 bytes. The thread_info structure contains architecture-specific information about the thread. We know that on x86_64 the stack grows down, and thread_union.thread_info is stored at the bottom of the stack in our case. So the process stack is 16 kilobytes and thread_info is at the bottom. The remaining stack size will be 16 kilobytes - 52 bytes = 16332 bytes. Note that thread_union is represented as a union and not a structure, which means that thread_info and the stack share the memory space.
Schematically it can be represented as follows:
+-----------------------+
|                       |
|                       |
|        stack          |
|                       |
|_______________________|
|          |            |
|          |            |
|          |            |
|__________v____________|             +--------------------+
|                       |             |                    |
|      thread_info      |<----------->|     task_struct    |
|                       |             |                    |
+-----------------------+             +--------------------+
http://www.quora.com/In-Linux-kernel-Why-thread_info-structure-and-the-kernel-stack-of-a-process-binds-in-union-construct
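The sharing of memory between thread_info and the stack can be checked with a small userspace sketch. The sketch_thread_info structure here is a hypothetical, much smaller stand-in for the real thread_info, so the number of remaining bytes differs from the 16332 above. Only the layout idea is the same:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_THREAD_SIZE (16 * 1024)   /* 16 kilobytes, as in the text */

/* Hypothetical, much smaller stand-in for the real thread_info. */
struct sketch_thread_info {
    void *task;
    unsigned int flags;
    unsigned int cpu;
};

union sketch_thread_union {
    struct sketch_thread_info thread_info;
    unsigned long stack[SKETCH_THREAD_SIZE / sizeof(long)];
};

/* The union is exactly as big as the stack array: both members overlap. */
static_assert(sizeof(union sketch_thread_union) == SKETCH_THREAD_SIZE,
              "thread_info must not enlarge the union");

/* Bytes of the stack area which are not overlapped by thread_info. */
static size_t sketch_usable_stack_bytes(void)
{
    return sizeof(union sketch_thread_union) - sizeof(struct sketch_thread_info);
}
```

Because it is a union, the address of thread_info is the same as the address of the lowest stack word.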
So the INIT_TASK macro fills these task_struct fields and many many more. As I already wrote above, I will not describe all the fields and their values in the INIT_TASK macro, but we will see them soon.
Now let's go back to the set_task_stack_end_magic function. This function is defined in kernel/fork.c and sets a canary on the init process stack to detect stack overflow.
void set_task_stack_end_magic(struct task_struct *tsk)
{
	unsigned long *stackend;
	stackend = end_of_stack(tsk);
	*stackend = STACK_END_MAGIC; /* for overflow detection */
}
Its implementation is easy. set_task_stack_end_magic gets the end of the stack for the given task_struct with the end_of_stack function. The end of a process stack depends on the CONFIG_STACK_GROWSUP configuration option. As we are learning the x86_64 architecture, where the stack grows down, the end of the process stack will be:
(unsigned long *)(task_thread_info(p) + 1);
where task_thread_info just returns the stack which we filled with the INIT_TASK macro:
#define task_thread_info(task) ((struct thread_info *)(task)->stack)
As we have got the end of the init process stack, we write STACK_END_MAGIC there. After the canary is set, we can check it like this:
if (*end_of_stack(task) != STACK_END_MAGIC) {
	//
	// handle stack overflow here
	//
}
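The whole canary idea (find the word just above thread_info, store the magic value there, compare it later) can be tried out in userspace. This is only a sketch with hypothetical sketch_ names and a simplified thread_info stand-in. It models a stack that grows down, as on x86_64:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_THREAD_SIZE     (16 * 1024)
#define SKETCH_STACK_END_MAGIC 0x57AC6E9DUL

/* Hypothetical stand-in for the real thread_info. */
struct sketch_thread_info {
    void *task;
    unsigned int flags;
    unsigned int cpu;
};

union sketch_thread_union {
    struct sketch_thread_info thread_info;
    unsigned long stack[SKETCH_THREAD_SIZE / sizeof(long)];
};

/* The stack grows down, so its last usable word sits right above thread_info. */
static unsigned long *sketch_end_of_stack(union sketch_thread_union *u)
{
    assert(u != NULL);
    return (unsigned long *)(&u->thread_info + 1);
}

static void sketch_set_stack_end_magic(union sketch_thread_union *u)
{
    *sketch_end_of_stack(u) = SKETCH_STACK_END_MAGIC;
}

/* True if something has overwritten the canary word. */
static bool sketch_stack_overflowed(union sketch_thread_union *u)
{
    return *sketch_end_of_stack(u) != SKETCH_STACK_END_MAGIC;
}
```

If the stack ever grows so deep that it reaches this word, the magic value is overwritten and the check reports an overflow.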
The next function after set_task_stack_end_magic is smp_setup_processor_id. This function has an empty body for x86_64:
void __init __weak smp_setup_processor_id(void)
{
}
as it is implemented only for some architectures, such as s390 and arm64.
The next function in start_kernel is debug_objects_early_init. The implementation of this function is almost the same as lockdep_init, but it fills hashes for object debugging. As I wrote above, we will not see the description of this and other functions which are for debugging purposes in this chapter.
After the debug_objects_early_init function we can see the call of the boot_init_stack_canary function, which fills task_struct->canary with the canary value for the -fstack-protector gcc feature. This function depends on the CONFIG_CC_STACKPROTECTOR configuration option, and if this option is disabled, boot_init_stack_canary does nothing; otherwise it generates a random number based on the random pool and the TSC:
get_random_bytes(&canary,sizeof(canary));tsc=__native_read_tsc();canary+=tsc+(tscstack_canary=canary;
and writes this value to the top of the IRQ stack with:
this_cpu_write(irq_stack_union.stack_canary, canary); // read below about this_cpu_write
Again, we will not dive into details here; we will cover it in the part about IRQs. As the canary is set, we disable local and early boot IRQs and register the bootstrap cpu in the cpu maps. We disable local irqs (interrupts for the current CPU) with the local_irq_disable macro, which expands to the call of the arch_local_irq_disable function from arch/x86/include/asm/irqflags.h:
static inline notrace void arch_local_irq_disable(void)
{
	native_irq_disable();
}
where native_irq_disable is the cli instruction for x86_64. As interrupts are disabled, we can register the current cpu with the given ID in the cpu bitmap.
The first processor activation
The current function from start_kernel is boot_cpu_init. This function initializes various cpu masks for the bootstrap processor. First of all it gets the bootstrap processor id with the call of:
int cpu = smp_processor_id();
For now it is just zero. If the CONFIG_DEBUG_PREEMPT configuration option is disabled, smp_processor_id just expands to the call of raw_smp_processor_id, which expands to:
#define raw_smp_processor_id() (this_cpu_read(cpu_number))
this_cpu_read, like many other functions of this kind (this_cpu_write, this_cpu_add and so on), is defined in include/linux/percpu-defs.h and represents a this_cpu operation. These operations provide a way of optimizing access to the per-cpu variables which are associated with the current processor. In our case it is this_cpu_read, which expands to the call of:
__pcpu_size_call_return(this_cpu_read_, pcp)
Remember that we have passed cpu_number as pcp to this_cpu_read from raw_smp_processor_id. Now let's look at the __pcpu_size_call_return implementation:
#define __pcpu_size_call_return(stem, variable) \
({ \
	typeof(variable) pscr_ret__; \
	__verify_pcpu_ptr(&(variable)); \
	switch(sizeof(variable)) { \
	case 1: pscr_ret__ = stem##1(variable); break; \
	case 2: pscr_ret__ = stem##2(variable); break; \
	case 4: pscr_ret__ = stem##4(variable); break; \
	case 8: pscr_ret__ = stem##8(variable); break; \
	default: \
		__bad_size_call_parameter(); break; \
	} \
	pscr_ret__; \
})
Yes, it looks a little strange, but it's easy. First of all we can see the definition of the pscr_ret__ variable with the int type. Why int? Because variable is cpu_number, and it was declared as a per-cpu int variable:
DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
In the next step we call __verify_pcpu_ptr with the address of cpu_number. __verify_pcpu_ptr is used to verify that the given parameter is a per-cpu pointer. After that we set the pscr_ret__ value depending on the size of the variable. Our cpu_number variable is int, so it is 4 bytes in size. It means that we will get this_cpu_read_4(cpu_number) in pscr_ret__. At the end of __pcpu_size_call_return the macro just evaluates to pscr_ret__. this_cpu_read_4 is a macro:
#define this_cpu_read_4(pcp) percpu_from_op("mov", pcp)
which calls percpu_from_op and passes the mov instruction and the per-cpu variable to it. percpu_from_op will expand to the inline assembly call:
asm("movl %%gs:%1,%0" : "=r" (pfo_ret__) : "m" (cpu_number))
Let's try to understand how it works and what it does. The gs segment register contains the base of the per-cpu area. Here we just copy cpu_number, which is in memory, to pfo_ret__ with the movl instruction. Or in other words:
this_cpu_read(cpu_number)
is the same as:
movl %gs:$cpu_number, $pfo_ret__
As we haven't set up the per-cpu areas yet, we have only one - for the currently running CPU - and we will get zero as the result of smp_processor_id.
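The size-based dispatch in __pcpu_size_call_return can be imitated in userspace. The sketch below only models the switch on sizeof. It reads ordinary variables, not per-cpu data through gs. All sketch_ names are hypothetical, the byte copy assumes a little-endian machine such as x86_64, and the statement expression is a GNU C extension just like in the kernel macro:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Width used by the most recent read, recorded so the dispatch can be observed. */
static int sketch_last_width;

/* Copy `width` bytes into the low bytes of the result (little-endian assumption). */
static uint64_t sketch_read_n(const void *p, size_t width)
{
    uint64_t value = 0;

    assert(p != NULL && width <= sizeof(value));
    memcpy(&value, p, width);
    sketch_last_width = (int)width;
    return value;
}

static uint64_t sketch_read_1(const void *p) { return sketch_read_n(p, 1); }
static uint64_t sketch_read_2(const void *p) { return sketch_read_n(p, 2); }
static uint64_t sketch_read_4(const void *p) { return sketch_read_n(p, 4); }
static uint64_t sketch_read_8(const void *p) { return sketch_read_n(p, 8); }

/* Pick the helper whose suffix matches sizeof(variable), like the kernel macro does. */
#define sketch_size_call_return(stem, variable)                              \
({                                                                           \
    __typeof__(variable) ret__ = 0;                                          \
    switch (sizeof(variable)) {                                              \
    case 1: ret__ = (__typeof__(variable))stem##1(&(variable)); break;       \
    case 2: ret__ = (__typeof__(variable))stem##2(&(variable)); break;       \
    case 4: ret__ = (__typeof__(variable))stem##4(&(variable)); break;       \
    case 8: ret__ = (__typeof__(variable))stem##8(&(variable)); break;       \
    default: sketch_last_width = -1; break;                                  \
    }                                                                        \
    ret__;                                                                   \
})
```

With an int variable, and assuming a 4-byte int, sketch_size_call_return(sketch_read_, variable) goes through sketch_read_4, which mirrors how cpu_number ends up in this_cpu_read_4.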
As we have got the current processor id, boot_cpu_init sets the given cpu online, active, present and possible with:
set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);
All of these functions use the concept of a cpumask. cpu_possible is the set of cpu IDs which can be plugged in at any time during the life of that system boot. cpu_present represents which CPUs are currently plugged in. cpu_online represents a subset of cpu_present and indicates CPUs which are available for scheduling. These masks depend on the CONFIG_HOTPLUG_CPU configuration option, and if this option is disabled, possible == present and active == online. The implementations of all of these functions are very similar. Every function checks the second parameter: if it is true, it calls cpumask_set_cpu, otherwise cpumask_clear_cpu.
For example, let's look at set_cpu_possible. As we passed true as the second parameter, the:
cpumask_set_cpu(cpu, to_cpumask(cpu_possible_bits));
will be called. First of all let's try to understand the to_cpumask macro. This macro casts a bitmap to a struct cpumask *. Cpumasks provide a bitmap suitable for representing the set of CPUs in a system, one bit position per CPU number. A CPU mask is represented by the cpumask structure:
typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
which is just a bitmap declared with the DECLARE_BITMAP macro:
#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]
As we can see from its definition, the DECLARE_BITMAP macro expands to an array of unsigned long. Now let's look at how the to_cpumask macro is implemented:
#define to_cpumask(bitmap) \
	((struct cpumask *)(1 ? (bitmap) \
		: (void *)sizeof(__check_is_bitmap(bitmap))))
I don't know about you, but it looked really weird to me the first time. We can see a ternary operator here which is true every time, but why is __check_is_bitmap here? It's simple, let's look at it:
static inline int __check_is_bitmap(const unsigned long *bitmap)
{
	return 1;
}
Yeah, it just returns 1 every time. Actually we need it here for only one purpose: at compile time it checks that the given bitmap is a bitmap, or in other words it checks that the given bitmap has the type unsigned long *. So we just pass cpu_possible_bits to the to_cpumask macro to convert an array of unsigned long to a struct cpumask *. Now we can call the cpumask_set_cpu function with the cpu - 0 and struct cpumask *cpu_possible_bits. This function makes only one call, of the set_bit function, which sets the given cpu in the cpumask. All of these set_cpu_* functions work on the same principle.
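To see the bitmap, the type-checking cast and the set/clear logic working together, here is a userspace sketch. Everything with the sketch_ prefix is a hypothetical re-creation for illustration. The number of CPUs is an arbitrary assumption, and plain bit operations stand in for the kernel's set_bit:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 8   /* arbitrary small value for the sketch */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_BITS_TO_LONGS(n) \
    (((n) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)
#define SKETCH_DECLARE_BITMAP(name, bits) \
    unsigned long name[SKETCH_BITS_TO_LONGS(bits)]

struct sketch_cpumask { SKETCH_DECLARE_BITMAP(bits, SKETCH_NR_CPUS); };

/* Never called: it only makes the compiler check the argument type. */
static inline int sketch_check_is_bitmap(const unsigned long *bitmap)
{
    (void)bitmap;
    return 1;
}

/* The condition is always true; the other branch exists only for the type check. */
#define sketch_to_cpumask(bitmap) \
    ((struct sketch_cpumask *)(1 ? (bitmap) \
        : (void *)sizeof(sketch_check_is_bitmap(bitmap))))

static SKETCH_DECLARE_BITMAP(sketch_cpu_possible_bits, SKETCH_NR_CPUS);

/* Set the cpu's bit if `possible` is true, clear it otherwise. */
static void sketch_set_cpu_possible(unsigned int cpu, bool possible)
{
    struct sketch_cpumask *mask = sketch_to_cpumask(sketch_cpu_possible_bits);
    unsigned long bit = 1UL << (cpu % SKETCH_BITS_PER_LONG);

    assert(cpu < SKETCH_NR_CPUS);
    if (possible)
        mask->bits[cpu / SKETCH_BITS_PER_LONG] |= bit;
    else
        mask->bits[cpu / SKETCH_BITS_PER_LONG] &= ~bit;
}

static bool sketch_cpu_possible(unsigned int cpu)
{
    struct sketch_cpumask *mask = sketch_to_cpumask(sketch_cpu_possible_bits);

    assert(cpu < SKETCH_NR_CPUS);
    return (mask->bits[cpu / SKETCH_BITS_PER_LONG] >> (cpu % SKETCH_BITS_PER_LONG)) & 1UL;
}
```

Passing something that is not an unsigned long array to sketch_to_cpumask should make the compiler complain about an incompatible pointer type, which is the whole point of the never-evaluated branch.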
If these set_cpu_* operations and cpumasks are not clear to you, don't worry about it. You can get more info by reading the special part about it - cpumask - or the documentation.
As we have activated the bootstrap processor, it's time to go to the next function in the start_