Post on 26-Apr-2018
Computer Systems and Networks
ECPE 170 – Jeff Shafer – University of the Pacific
Memory Hierarchy (Performance Optimization)
Lab Schedule
Activities
- This Week
  - Lab 6 – Perf Optimization
  - Lab 7 – Memory Hierarchy
- Next Tuesday
  - Intro to Python
- Next Thursday
  - **Midterm Exam**
Assignments Due
- Lab 6
  - Due by Mar 6th 5:00am
- Lab 7
  - Due by Mar 20th 5:00am
Spring 2017 Computer Systems and Networks
Your Personal Repository
2017_spring_ecpe170\
    lab02
    lab03
    lab04
    lab05
    lab06
    lab07
    lab08
    lab09
    lab10
    lab11
    lab12
    .hg
.hg is a hidden folder! (name starts with period)
Used by Mercurial to track all repository history (files, changelogs, …)
Mercurial .hg Folder
- The existence of a .hg hidden folder is what turns a regular directory (and its subfolders) into a special Mercurial repository
- When you add/commit files, Mercurial looks for this .hg folder in the current directory or its parents
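The upward search for the .hg folder can be sketched in a few lines of Python. This is only an illustration of the idea, not Mercurial's actual code; the function name `find_repo_root` is made up for this example.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains .hg.

    Hypothetical helper: mimics the "current directory or its parents"
    search described above. Returns None if no .hg folder is found.
    """
    current = start.resolve()
    # Check the starting directory first, then each parent up to the root.
    for candidate in [current, *current.parents]:
        if (candidate / ".hg").is_dir():
            return candidate
    return None
```

Calling `find_repo_root(Path.cwd())` from inside a lab folder would, under these assumptions, return the top-level repository directory.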
Memory Hierarchy
Memory Hierarchy
Goal as system designers: Fast Performance and Low Cost
Tradeoff: Faster memory is more expensive than slower memory
Memory Hierarchy
- To provide the best performance at the lowest cost, memory is organized in a hierarchical fashion
  - Small, fast storage elements are kept in the CPU
  - Larger, slower main memory is outside the CPU (and accessed by a data bus)
  - Largest, slowest, permanent storage (disks, etc…) is even further from the CPU
To date, you’ve only cared about two levels: Main memory and Disks
Memory Hierarchy – Registers and Cache
Let’s examine the fastest memory available
Memory Hierarchy – Registers
- Storage locations available on the processor itself
- Manually managed by the assembly programmer or compiler
- You’ll become intimately familiar with registers when we do assembly programming
Memory Hierarchy – Caches
- What is a cache?
  - Speed up memory accesses by storing recently used data closer to the CPU
  - Closer than main memory – on the CPU itself!
  - Although cache is much smaller than main memory, its access time is much faster!
  - Cache is automatically managed by the hardware memory system
  - Clever programmers can help the hardware use the cache more effectively
Memory Hierarchy – Caches
- How does the cache work?
  - Not going to discuss how caches work internally
    - If you want to learn that, take ECPE 173!
  - This class is focused on what the programmer needs to know about the underlying system
Memory Hierarchy – Access
- CPU wishes to read data (needed for an instruction)
  1. Does the instruction say it is in a register or memory?
     - If register, go get it!
  2. If in memory, send request to nearest memory (the cache)
  3. If not in cache, send request to main memory
  4. If not in main memory, send request to the disk
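Steps 2 to 4 above can be sketched as a search through an ordered list of levels. This is a toy model, not how hardware is built: the `levels` structure, the addresses in it, and the function name `find_level` are assumptions made for illustration, and the register case (step 1) is left out because the instruction itself names the register.

```python
def find_level(address, levels):
    """Return the name of the nearest memory level that holds `address`.

    `levels` is a list of (name, set_of_addresses) pairs ordered from
    nearest to farthest from the CPU. Illustrative model only.
    """
    for name, contents in levels:
        if address in contents:
            return name  # hit at this level
        # miss at this level: fall through to the next, slower level
    raise LookupError(f"address {address:#x} not found at any level")


# Hypothetical contents: each slower level holds everything above it and more.
levels = [
    ("cache", {0x1000}),
    ("main memory", {0x1000, 0x2000}),
    ("disk", {0x1000, 0x2000, 0x3000}),
]

for addr in (0x1000, 0x2000, 0x3000):
    print(hex(addr), "found in", find_level(addr, levels))
```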
(Cache) Hits versus Misses
Hit
- When data is found at a given memory level (e.g. a cache)
Miss
- When data is not found at a given memory level (e.g. a cache)
You want to write programs that produce a lot of hits, not misses!
Memory Hierarchy – Cache
- Once the data is located and delivered to the CPU, it will also be saved into cache memory for future access
  - We often save more than just the specific byte(s) requested
  - Typical: Neighboring 64 bytes (called the cache line size)
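To see which neighbors come along with a requested byte, here is a small sketch that computes the aligned 64-byte range containing an address. It assumes lines are aligned to multiples of the line size; the function name `line_range` is made up for this example.

```python
LINE_SIZE = 64  # bytes; the typical cache line size quoted above


def line_range(address, line_size=LINE_SIZE):
    """Return (first, last) byte addresses of the cache line holding `address`.

    Assumes lines are aligned to multiples of `line_size`.
    """
    start = (address // line_size) * line_size  # round down to a line boundary
    return start, start + line_size - 1


print(line_range(100))  # byte 100 lives in the line covering bytes 64..127
```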
Cache Locality
Principle of Locality
Once a data element is accessed, it is likely that a nearby data element (or even the same element) will be needed soon
Cache Locality
- Temporal locality – Recently-accessed data elements tend to be accessed again
  - Imagine a loop counter…
- Spatial locality – Accesses tend to cluster in memory
  - Imagine scanning through all elements in an array, or running several sequential instructions in a program
Programs with good locality run faster than programs with poor locality
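A tiny cache simulator makes the effect of locality visible by counting misses instead of measuring time. Everything here is an assumption chosen for illustration: a fully associative cache of 8 lines with LRU replacement, 64-byte lines, and an array of 1024 elements of 8 bytes each. Real caches are organized differently, but the trend is the same.

```python
from collections import OrderedDict


def count_misses(addresses, line_size=64, num_lines=8):
    """Count misses in a toy fully associative cache with LRU replacement."""
    cache = OrderedDict()  # line number -> None, least recently used first
    misses = 0
    for address in addresses:
        line = address // line_size
        if line in cache:
            cache.move_to_end(line)        # hit: mark line as most recently used
        else:
            misses += 1
            cache[line] = None             # miss: load the whole line
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


ELEMENT = 8  # bytes per array element

# Good spatial locality: visit elements 0, 1, 2, ... in order.
sequential = [i * ELEMENT for i in range(1024)]

# Poor locality: visit the same elements with a stride of 8 elements,
# so consecutive accesses always land in different cache lines.
strided = [(start + k * 8) * ELEMENT for start in range(8) for k in range(128)]

print("sequential misses:", count_misses(sequential))
print("strided misses:   ", count_misses(strided))
```

Both loops touch exactly the same 1024 elements. In this model the sequential order misses once per line, while the strided order misses on every access because each line is evicted before its neighbors are used.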
A program that randomly accesses memory addresses (but never repeats) will gain no benefit from a cache
Recap – Cache
- Which is bigger – a cache or main memory?
  - Main memory
- Which is faster to access – the cache or main memory?
  - Cache – It is smaller (which is faster to search) and closer to the processor (signals take less time to propagate to/from the cache)
- Why do we add a cache between the processor and main memory?
  - Performance – hopefully frequently-accessed data will be in the faster cache (so we don’t have to access slower main memory)
Recap – Cache
- Which is manually controlled – a cache or a register?
  - Registers are manually controlled by the assembly language program (or the compiler)
  - Cache is automatically controlled by hardware
- Suppose a program wishes to read from a particular memory address. Which is searched first – the cache or main memory?
  - Search the cache first – otherwise, there’s no performance gain
Recap – Cache
- Suppose there is a cache miss (data not found) during a 1 byte memory read operation. How much data is loaded into the cache?
  - Trick question – we always load data into the cache 1 “line” at a time.
  - Cache line size varies – 64 bytes on a Core i7 processor
Cache Q&A
- Imagine a computer system only has main memory (no cache is present). Is temporal or spatial locality important for performance when repeatedly accessing an array with 8-byte elements?
  - No. Locality is not important in a system without caching, because every memory access will take the same length of time.
Cache Q&A
- Imagine a memory system has main memory and a 1-level cache, but each cache line is only 8 bytes in size. Assume the cache is much smaller than main memory. Is temporal or spatial locality important for performance here when repeatedly accessing an array with 8-byte elements?
  - Only 1 array element is loaded at a time in this cache
  - Temporal locality is important (access will be faster if the same element is accessed again)
  - Spatial locality is not important (neighboring elements are not loaded into the cache when an earlier element is accessed)
Cache Q&A
- Imagine a memory system has main memory and a 1-level cache, and the cache line size is 64 bytes. Assume the cache is much smaller than main memory. Is temporal or spatial locality important for performance here when repeatedly accessing an array with 8-byte elements?
  - 8 elements (64 B) are loaded into the cache at a time
  - Both forms of locality are useful here!
Cache Q&A
- Imagine your program accesses a 100,000 element array (of 8 byte elements) once from beginning to end with stride 1. The memory system has a 1-level cache with a line size of 64 bytes. No pre-fetching is implemented. How many cache misses would be expected in this system?
  - 12,500 cache misses. The array has 100,000 elements. Upon a cache miss, 8 adjacent and aligned elements (one of which is the miss) are moved into the cache. Future accesses to the remaining 7 elements should hit in the cache. Thus, only 1/8 of the 100,000 element accesses result in a miss
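The arithmetic behind this answer can be checked with a short helper. It assumes a cold cache, no prefetching, and an array that starts on a line boundary; the function name `expected_misses` is made up for this example.

```python
def expected_misses(num_elements, element_size, line_size):
    """Misses for one stride-1 pass over an array, starting with a cold cache.

    Assumes the array starts on a cache line boundary and no prefetching.
    """
    elements_per_line = line_size // element_size
    # One miss per line touched; round up for a partially filled last line.
    return -(-num_elements // elements_per_line)


print(expected_misses(100_000, 8, 64))  # 8 elements per 64-byte line
```

With an 8-byte line (the earlier Q&A case) the same helper gives one miss per element, which is why spatial locality does not help there.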
Cache Q&A
- Imagine your program accesses a 100,000 element array (of 8 byte elements) once from beginning to end with stride 1. The memory system has a 1-level cache with a line size of 64 bytes. A hardware prefetcher is implemented. In the best-possible case, how many cache misses would be expected in this system?
  - 1 cache miss – This program has a trivial access pattern with stride 1. In the perfect world, the hardware prefetcher would begin guessing future memory accesses after the initial cache miss and loading them into the cache. Assuming the prefetcher can stay ahead of the program, then all future memory accesses with the trivial +1 pattern should result in cache hits
Cache Example – Intel Core i7 980x
- 6 core processor with a sophisticated multi-level cache hierarchy
- 3.5 GHz, 1.17 billion transistors
Cache Example – Intel Core i7 980x
- Each processor core has its own L1 and L2 cache
  - 32 kB Level 1 (L1) data cache
  - 32 kB Level 1 (L1) instruction cache
  - 256 kB Level 2 (L2) cache (both instruction and data)
- The entire chip (all 6 cores) shares a single 12 MB Level 3 (L3) cache
Cache Example – Intel Core i7 980x
- Access time? (Measured in 3.5 GHz clock cycles)
  - 4 cycles to access L1 cache
  - 9-10 cycles to access L2 cache
  - 30-40 cycles to access L3 cache
- Smaller caches are faster to search
  - And can also fit closer to the processor core
- Larger caches are slower to search
  - Plus we have to place them further away
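To get a feel for these numbers, cycle counts can be converted to nanoseconds using the 3.5 GHz clock quoted above. This sketch uses the upper end of each range from the slide; the function name `cycles_to_ns` is made up for this example.

```python
CLOCK_HZ = 3.5e9  # clock rate quoted above


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a number of clock cycles into nanoseconds."""
    return cycles / clock_hz * 1e9


# Upper end of each range listed above.
for name, cycles in (("L1", 4), ("L2", 10), ("L3", 40)):
    print(f"{name}: {cycles} cycles is about {cycles_to_ns(cycles):.1f} ns")
```

At this clock rate one cycle is under a third of a nanosecond, so an L1 hit takes roughly 1 ns and an L3 hit roughly 11 ns.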
Caching is Ubiquitous!
Type          | What Cached                                                | Where Cached    | Managed By
TLB           | Address Translation (Virtual -> Physical Memory Address)   | On-chip TLB     | Hardware MMU (Memory Management Unit)
Buffer cache  | Parts of files on disk                                     | Main memory     | Operating System
Disk cache    | Disk sectors                                               | Disk controller | Controller firmware
Browser cache | Web pages                                                  | Local disk      | Web browser
Many types of “cache” in computer science, with different meanings
Memory Hierarchy – Virtual Memory
Virtual Memory
Virtual Memory is a BIG LIE!
- We lie to your application and tell it that the system is simple:
  - Physical memory is infinite! (or at least huge)
  - You can access all of physical memory
  - Your program starts at memory address zero
  - Your memory addresses are contiguous and in-order
  - Your memory is only RAM (main memory)
[Figure: What the System Really Does]
Why use Virtual Memory?
- We want to run multiple programs on the computer concurrently (multitasking)
  - Each program needs its own separate memory region, so physical resources must be divided
  - The amount of memory each program takes could vary dynamically over time (and the user could run a different mix of apps at once)
- We want to use multiple types of storage (main memory, disk) to increase performance and capacity
- We don’t want the programmer to worry about this
  - Make the processor architect handle these details
Pages and Virtual Memory
- Main memory is divided into pages for virtual memory
  - Page size = 4 kB
  - Data is moved between main memory and disk at a page granularity
    - i.e. like the cache, we don’t move single bytes around, but rather big groups of bytes
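With 4 kB pages, any address splits into a page number and an offset within that page. A minimal sketch, assuming the 4 kB page size above; the function name `split_address` is made up for this example.

```python
PAGE_SIZE = 4096  # 4 kB, the page size quoted above


def split_address(address, page_size=PAGE_SIZE):
    """Split an address into (page number, offset within the page)."""
    return divmod(address, page_size)


page, offset = split_address(0x12345)
print(f"page {page:#x}, offset {offset:#x}")
```

Because 4096 is 2^12, the offset is simply the low 12 bits of the address and the page number is the remaining high bits.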
Pages and Virtual Memory
- Main memory and virtual memory are divided into equal sized pages
- The entire address space required by a process need not be in memory at once
  - Some pages can be on disk
    - Push the unneeded parts out to slow disk
  - Other pages can be in main memory
    - Keep the frequently accessed pages in faster main memory
- The pages allocated to a process do not need to be stored contiguously, either on disk or in memory
Virtual Memory Terms
- Physical address – the actual memory address in the real main memory
- Virtual address – the memory address that is seen in your program
  - Special hardware/software translates virtual addresses into physical addresses!
- Page fault – a program accesses a virtual address that is not currently resident in main memory (at a physical address)
  - The data must be loaded from disk!
- Page file – the file on disk that holds memory pages
  - Usually twice the size of main memory
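These terms fit together in a toy address translator. This is a deliberately simplified sketch: the page table is a plain dictionary, the mapping values are invented, and a page marked `None` stands for "currently on disk". Real systems use multi-level tables and hardware support, and they distinguish an unmapped page from one that is merely paged out.

```python
PAGE_SIZE = 4096  # assuming 4 kB pages


class PageFault(Exception):
    """Raised when the requested page is not resident in main memory."""


def translate(virtual_address, page_table, page_size=PAGE_SIZE):
    """Translate a virtual address to a physical address (toy model)."""
    page, offset = divmod(virtual_address, page_size)
    frame = page_table.get(page)
    if frame is None:
        # Simplification: a missing or paged-out page both count as a fault.
        raise PageFault(f"virtual page {page} must be loaded from disk")
    return frame * page_size + offset


# Hypothetical mapping: virtual page -> physical frame (None = on disk).
page_table = {0: 5, 1: None, 2: 9}

print(translate(0x0010, page_table))  # page 0 is resident in frame 5
try:
    translate(PAGE_SIZE, page_table)  # page 1 is on disk
except PageFault as fault:
    print("page fault:", fault)
```

Note that virtual pages 0 and 2 map to frames 5 and 9: pages that look contiguous to the program need not be contiguous in physical memory.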
Cache Memory vs Virtual Memory
- Goal of cache memory
  - Faster memory access speed (performance)
- Goal of virtual memory
  - Increase memory capacity without actually adding more main memory
    - Data is written to disk
    - If done carefully, this can improve performance
    - If overused, performance suffers greatly!
  - Increase system flexibility when running multiple user programs (as previously discussed)
Memory Hierarchy – Magnetic Disks
Magnetic Disk Technology
- Hard disk platters are mounted on spindles
- Read/write heads are mounted on a comb that swings radially to read the disk
  - All heads move together!
Magnetic Disk Technology
- There are a number of electromechanical properties of a hard disk drive that determine how fast its data can be accessed
- Seek time – time that it takes for a disk arm to move into position over the desired cylinder
- Rotational delay – time that it takes for the desired sector to move into position beneath the read/write head
- Seek time + rotational delay = access time
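The access time formula can be tried with sample numbers. The 7200 RPM spindle speed and 9 ms seek time below are illustrative assumptions, not figures from the slides. On average the desired sector is half a revolution away, so the average rotational delay is half the time of one full rotation.

```python
def average_rotational_delay_ms(rpm):
    """Average wait for the sector: half of one full revolution."""
    ms_per_revolution = 60_000 / rpm  # 60,000 ms per minute
    return ms_per_revolution / 2


def access_time_ms(seek_ms, rpm):
    """Access time = seek time + rotational delay (average case)."""
    return seek_ms + average_rotational_delay_ms(rpm)


# Hypothetical drive: 7200 RPM with a 9 ms average seek time.
print(f"rotational delay: {average_rotational_delay_ms(7200):.2f} ms")
print(f"access time:      {access_time_ms(9.0, 7200):.2f} ms")
```

Access times of this size are in milliseconds, while the cache access times seen earlier are in nanoseconds.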
How Big Will Hard Drives Get?
- Advances in technology have defied all efforts to define the ultimate upper limit for magnetic disk storage
  - In the 1970s, the upper limit was thought to be around 2 Mb/in²
- As data densities increase, bit cells consist of proportionately fewer magnetic grains
  - There is a point at which there are too few grains to hold a value, and a 1 might spontaneously change to a 0, or vice versa
  - This point is called the superparamagnetic limit
How Big Will Hard Drives Get?
- When will the limit be reached?
- In 2006, the limit was thought to lie between 150 Gb/in² and 200 Gb/in² (with longitudinal recording technology)
- 2010: Commercial drives have densities up to 667 Gb/in²
- 2012: Seagate demos drive with 1 Tbit/in² density
  - With heat-assisted magnetic recording – they use a laser to heat bits before writing
  - Each bit is ~12.7 nm in length (a dozen atoms)
Memory Hierarchy – SSDs
Emergence of Solid State Disks (SSD)
- Hard drive advantages?
  - Low cost per bit
- Hard drive disadvantages?
  - Very slow compared to main memory
  - Fragile (ever dropped one?)
  - Moving parts wear out
- Reductions in flash memory cost have created another possibility: solid state drives (SSDs)
  - SSDs appear like hard drives to the computer, but they store data in non-volatile flash memory circuits
  - Flash is quirky! Physical limitations pose engineering challenges…
Flash Memory
- Typical flash chips are built from dense arrays of NAND gates
- Different from hard drives – we can’t read/write a single bit (or byte)
  - Reading or writing? Data must be read from or written to an entire flash page (2 kB-8 kB)
    - Reading a page is much faster than writing one
    - It takes some time before the cell charge reaches a stable state
  - Erasing? An entire erasure block (32-128 pages) must be erased (set to all 1’s) first before individual bits can be written (set to 0)
    - Erasing takes two orders of magnitude more time than reading
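The erase-before-write rule can be modeled with a toy erasure block. This is a sketch under loud assumptions: the class name `FlashBlock` is invented, and pages are shrunk to 8 bits so the bit patterns are easy to read (real pages are 2 kB-8 kB). The only behavior modeled is that programming can turn 1 bits into 0 bits, while getting a 1 back requires erasing the whole block.

```python
class FlashBlock:
    """Toy erasure block: pages of bits that start erased (all 1s)."""

    def __init__(self, pages=32, bits_per_page=8):
        self.mask = (1 << bits_per_page) - 1   # all-ones pattern for one page
        self.pages = [self.mask] * pages

    def program(self, page, value):
        """Write a page. Programming can only change 1 bits to 0 bits."""
        # Bits that are 1 in `value` but already 0 in the page cannot be set.
        if value & ~self.pages[page] & self.mask:
            raise ValueError("erase required: cannot change a 0 bit back to 1")
        self.pages[page] = value

    def erase(self):
        """Erase the entire block: every page goes back to all 1s."""
        self.pages = [self.mask] * len(self.pages)


block = FlashBlock()
block.program(0, 0b10101010)      # fine: only clears bits of an erased page
try:
    block.program(0, 0b11111111)  # would need to turn 0 bits back into 1
except ValueError as err:
    print("write rejected:", err)
block.erase()                     # wipes page 0 and every other page too
block.program(0, 0b11111111)      # now allowed
```

Rewriting a single page in place therefore costs an erase of the whole block, which is the motivation for writing updated data somewhere else instead.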
Flash-based Solid State Drives (SSDs)
Advantages
- Same block-addressable I/O interface as hard drives
- No mechanical latency
  - Access latency is independent of the access pattern
  - Compare this to hard drives
- Energy efficient (no disk to spin)
- Resistant to extreme shock, vibration, temperature, altitude
- Near-instant start-up time
Challenges
- Limited endurance and the need for wear leveling
- Very slow to erase blocks (needed before reprogramming)
  - Erase-before-write
- Read/write asymmetry
  - Reads are faster than writes
Flash Translation Layer
- Flash Translation Layer (FTL)
  - Necessary for flash reliability and performance
  - “Virtual” addresses seen by the OS and computer
  - “Physical” addresses used by the flash memory
- Perform writes out-of-place
  - Amortize block erasures over many write operations
- Wear-leveling
  - Writing the same “virtual” address repeatedly won’t write to the same physical flash location repeatedly!
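Out-of-place writes can be sketched with a mapping table from logical pages to physical pages. This is a toy model, not how any particular SSD works: the class name `SimpleFTL` is invented, and garbage collection, block erasure, and real wear-leveling policies are left out.

```python
class SimpleFTL:
    """Toy flash translation layer: every write goes to a fresh physical page."""

    def __init__(self, num_physical_pages):
        self.flash = [None] * num_physical_pages  # physical page contents
        self.mapping = {}                          # logical page -> physical page
        self.next_free = 0                         # next never-written physical page

    def write(self, logical_page, data):
        if self.next_free >= len(self.flash):
            # A real FTL would reclaim stale pages by erasing blocks here.
            raise RuntimeError("no free pages left in this toy model")
        self.flash[self.next_free] = data
        self.mapping[logical_page] = self.next_free  # the old copy becomes stale
        self.next_free += 1

    def read(self, logical_page):
        return self.flash[self.mapping[logical_page]]


ftl = SimpleFTL(num_physical_pages=4)
ftl.write(7, "version 1")
first_location = ftl.mapping[7]
ftl.write(7, "version 2")          # same "virtual" address, written again
print(ftl.read(7))                 # latest data is returned
print(first_location, ftl.mapping[7])  # but it lives in a different physical page
```

The stale copy left behind is what a real FTL later cleans up in bulk, which is how block erasures get amortized over many writes.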
[Figure: the Flash Translation Layer sits between “virtual” addresses at the device level (logical pages) and “physical” addresses at the flash chip level (flash pages, flash blocks, spare capacity)]