Lecture 13: Memory Consistency

Carnegie Mellon


Parallel Computer Architecture and Programming, CMU 15-418/15-618, Fall 2016

CMU 15-418/618, Fall 2017

What is Correct Behavior for a Parallel Memory Hierarchy?

•  Note: side-effects of writes are only observable when reads occur
   –  so we will focus on the values returned by reads

•  Intuitive answer:
   –  reading a location should return the latest value written (by any thread)

•  Hmm… what does “latest” mean exactly?
   –  within a thread, it can be defined by program order
   –  but what about across threads?
      •  the most recent write in physical time?
         –  hopefully not, because there is no way that the hardware can pull that off
            »  e.g., if it takes >10 cycles to communicate between processors, there is no way that processor 0 can know what processor 1 did 2 clock ticks ago
      •  most recent based upon something else?
         –  Hmm…

Refining Our Intuition

Thread 0:
    // write evens to X
    for (i=0; i<N; i+=2) {
        X = i;
        …
    }

Thread 1:
    // write odds to X
    for (j=1; j<N; j+=2) {
        X = j;
        …
    }

Thread 2:
    …
    A = X;
    …
    B = X;
    …
    C = X;
    …

(Assume: X=0 initially, and these are the only writes to X.)

•  What would be some clearly illegal combinations of (A,B,C)?
•  How about: (4,8,1)? (9,12,3)? (7,19,31)?

•  What can we generalize from this?
   –  writes from any particular thread must be consistent with program order
      •  in this example, observed even numbers must be increasing (ditto for odds)
   –  across threads: writes must be consistent with a valid interleaving of threads
      •  not physical time! (programmer cannot rely upon that)
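The three candidate triples can be checked mechanically. The sketch below is a hypothetical helper (the name `observation_is_legal` and its interface are assumptions, not part of the lecture). It tracks the largest even and odd value each writer is known to have reached; a read that differs from the previous read must come from a write that has not happened yet. It assumes every observed value is non-negative and below N.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the values seen by Thread 2 (in program order) fit some
 * interleaving in which Thread 0 writes 0, 2, 4, ... and Thread 1 writes
 * 1, 3, 5, ... to X, with X initially 0. */
bool observation_is_legal(const int *seen, size_t count)
{
    int prev = 0;        /* value of X at the previous read; X starts at 0 */
    int max_even = -2;   /* largest even value Thread 0 is known to have written */
    int max_odd = -1;    /* largest odd value Thread 1 is known to have written */

    for (size_t k = 0; k < count; k++) {
        int v = seen[k];
        if (v != prev) {
            /* X changed, so a write of v must have happened after the last read. */
            int *max_written = (v % 2 == 0) ? &max_even : &max_odd;
            if (v <= *max_written)
                return false;   /* that writer has already moved past v */
            *max_written = v;
        }
        prev = v;
    }
    return true;
}
```

Under these assumptions the helper accepts (4,8,1) and (7,19,31) and rejects (9,12,3), because Thread 1 writes 3 before 9 in its program order.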

Visualizing Our Intuition

•  Each thread proceeds in program order
•  Memory accesses interleaved (one at a time) to a single-ported memory
   –  rate of progress of each thread is unpredictable

[Figure: CPU 0, CPU 1, and CPU 2, each running one of the threads (e.g., Thread 2: “… A = X; … B = X; … C = X; …”), share a single port to memory.]

Correctness Revisited

Recall: “reading a location should return the latest value written (by any thread)”
->  “latest” means consistent with some interleaving that matches this model
   –  this is a hypothetical interleaving; the machine didn’t necessarily do this!

[Figure: CPU 0, CPU 1, and CPU 2 sharing a single port to memory.]

Part 2 of Memory Correctness: Memory Consistency Model

1.  “Cache Coherence”
    –  do all loads and stores to a given cache block behave correctly?

2.  “Memory Consistency Model” (sometimes called “Memory Ordering”)
    –  do all loads and stores, even to separate cache blocks, behave correctly?

Recall: our intuition

[Figure: CPU 0, CPU 1, and CPU 2 sharing a single port to memory.]

Why is this so complicated?

•  Fundamental issue:
   –  loads and stores are very expensive, even on a uniprocessor
      •  can easily take tens to hundreds of cycles

•  What programmers intuitively expect:
   –  processor atomically performs one instruction at a time, in program order

•  In reality:
   –  if the processor actually operated this way, it would be painfully slow
   –  instead, the processor aggressively reorders instructions to hide memory latency

•  Upshot:
   –  within a given thread, the processor preserves the program order illusion
   –  but this illusion has nothing to do with what happens in physical time!
   –  from the perspective of other threads, all bets are off!

Hiding Memory Latency is Important for Performance

•  Idea: overlap memory accesses with other accesses and computation

[Figure: “write A” followed by “read B” one after the other, versus “write A” and “read B” overlapped.]

•  Hiding write latency is simple in uniprocessors:
   –  add a write buffer

[Figure: processor with separate READS and WRITES paths; writes pass through a write buffer.]

•  (But this affects correctness in multiprocessors)

How Can We Hide the Latency of Memory Reads?

“Out of order” pipelining:
   –  when an instruction is stuck, perhaps there are subsequent instructions that can be executed

    x = *p;        // suffers expensive cache miss
    y = x + 1;     // stuck waiting on true dependence
    z = a + 2;     // does not need to wait
    b = c / 3;     // does not need to wait

•  Implication: memory accesses may be performed out-of-order!!!

What About Conditional Branches?

•  Do we need to wait for a conditional branch to be resolved before proceeding?
   –  No! Just predict the branch outcome and continue executing speculatively.
      •  if prediction is wrong, squash any side-effects and restart down correct path

    x = *p;
    y = x + 1;
    z = a + 2;
    b = c / 3;
    if (x != z)        // if hardware guesses that this is true,
        d = e - 7;     // then execute the “then” part (speculatively)
    else               // (without waiting for x or z)
        d = e + 5;
    …

How Out-of-Order Pipelining Works in Modern Processors

•  Fetch and graduate instructions in-order, but issue out-of-order

[Figure: starting at PC 0x10, the instruction cache and branch predictor fetch 0x10, 0x14, 0x18, 0x1c in order into the reorder buffer:

    0x10: x = *p;        issue (cache miss)
    0x14: y = x + 1;     can’t issue
    0x18: z = a + 2;     issue (out-of-order)
    0x1c: b = c / 3;     issue (out-of-order)
]

•  Intra-thread dependences are preserved, but memory accesses get reordered!

Analogy: Gas Particles in Balloons

•  Imagine that each instruction within a thread is a gas particle inside a twisty balloon
•  They were numbered originally, but then they start to move and bounce around
•  When a given thread observes memory accesses from a different thread:
   –  those memory accesses can be (almost) arbitrarily jumbled around
      •  like trying to locate the position of a particular gas particle in a balloon
•  As we’ll see later, the only thing that we can do is to put twists in the balloon

[Figure: four twisty balloons, one each for Thread 0, Thread 1, Thread 2, and Thread 3. Image credit: wikiHow.]

Uniprocessor Memory Model

•  Memory model specifies ordering constraints among accesses

•  Uniprocessor model: memory accesses atomic and in program order

    write A
    write B
    read A
    read B

•  Not necessary to maintain sequential order for correctness
   –  hardware: buffering, pipelining
   –  compiler: register allocation, code motion

•  Simple for programmers

•  Allows for high performance

[Figure: processor with separate READS and WRITES paths; writes go through a write buffer. Reads check for matching addresses in the write buffer.]
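The write-buffer rule above (reads check for matching addresses in the write buffer) can be sketched as a toy model. Everything here is an illustrative assumption: the `toy_cpu` structure, the buffer capacity, and the function names are not from the lecture, and real hardware is far more involved. Addresses are assumed to be valid indices into the toy memory.

```c
#include <stddef.h>

#define MEM_WORDS 8     /* size of the toy memory (assumption) */
#define BUF_SLOTS 4     /* capacity of the toy write buffer (assumption) */

struct toy_cpu {
    int memory[MEM_WORDS];
    size_t buf_addr[BUF_SLOTS];   /* pending writes, oldest first */
    int buf_val[BUF_SLOTS];
    size_t pending;
};

/* Retire the oldest buffered write to memory. */
void drain_one(struct toy_cpu *cpu)
{
    if (cpu->pending == 0)
        return;
    cpu->memory[cpu->buf_addr[0]] = cpu->buf_val[0];
    for (size_t i = 1; i < cpu->pending; i++) {
        cpu->buf_addr[i - 1] = cpu->buf_addr[i];
        cpu->buf_val[i - 1] = cpu->buf_val[i];
    }
    cpu->pending--;
}

/* A write only enters the buffer, so the processor does not wait for memory. */
void cpu_write(struct toy_cpu *cpu, size_t addr, int value)
{
    if (cpu->pending == BUF_SLOTS)
        drain_one(cpu);            /* buffer full: make room first */
    cpu->buf_addr[cpu->pending] = addr;
    cpu->buf_val[cpu->pending] = value;
    cpu->pending++;
}

/* A read checks the buffer for a matching address (newest first), then memory. */
int cpu_read(const struct toy_cpu *cpu, size_t addr)
{
    for (size_t i = cpu->pending; i > 0; i--)
        if (cpu->buf_addr[i - 1] == addr)
            return cpu->buf_val[i - 1];
    return cpu->memory[addr];
}
```

After `cpu_write(&cpu, 0, 1)`, a `cpu_read(&cpu, 0)` on the same processor returns 1 even though `cpu.memory[0]` is still 0. That is why the thread itself keeps its program-order illusion, while anything that looks only at memory can see a stale value.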

In Parallel Machines (with a Shared Address Space)

•  Order between accesses to different locations becomes important

(Initially A and Ready = 0)

    (one processor)         (another processor)
    A = 1;
    Ready = 1;              while (Ready != 1);
                            … = A;

How Unsafe Reordering Can Happen

•  Distribution of memory resources
   –  accesses issued in order may be observed out of order

[Figure: three processor/memory nodes joined by an interconnection network. One processor executes “… A = 1; Ready = 1;” and another executes “wait (Ready == 1); … = A;”. A and Ready are both initially 0, and the writes “A = 1” and “Ready = 1” travel through the network as separate messages.]

Caches Complicate Things More

•  Multiple copies of the same location

[Figure: three processors, each with a cache and a memory, joined by an interconnection network. The first executes “A = 1;”, the second “wait (A == 1); B = 1;”, and the third “wait (B == 1); … = A;”. The caches start with A: 0 and B: 0; the updates “A = 1” and “B = 1” reach the individual copies (0 -> 1) as separate messages.]

Our Intuitive Model: “Sequential Consistency” (SC)

•  Formalized by Lamport (1979)
   –  accesses of each processor in program order
   –  all accesses appear in sequential order

•  Any order implicitly assumed by programmer is maintained

[Figure: processors P0, P1, …, Pn connected to a single memory.]

Example with Sequential Consistency

Simple Synchronization:

    P0                      P1
    A = 1        (a)        x = Ready    (c)
    Ready = 1    (b)        y = A        (d)

•  all locations are initialized to 0
•  possible outcomes for (x,y):
   –  (0,0), (0,1), (1,1)
•  (x,y) = (1,0) is not a possible outcome (i.e. Ready = 1, A = 0):
   –  we know a->b and c->d by program order
   –  x == 1 implies b->c, which (with a->b and c->d) implies that a->d
   –  y == 0 implies d->a, which leads to a contradiction
   –  but real hardware will do this!
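The outcome list for this example can be confirmed by brute force: under sequential consistency, every execution is some interleaving of (a), (b) with (c), (d) that keeps each processor's program order. The sketch below enumerates all of them. The function name `enumerate_sc_outcomes` and the bitmask encoding of schedules are assumptions made for illustration.

```c
#include <stdbool.h>

/* outcome_seen[x][y] is set when some SC interleaving produces that (x, y). */
void enumerate_sc_outcomes(bool outcome_seen[2][2])
{
    for (int x = 0; x < 2; x++)
        for (int y = 0; y < 2; y++)
            outcome_seen[x][y] = false;

    /* Each 4-bit schedule says which processor takes each step (bit set: P1).
     * Only schedules with exactly two steps per processor are complete runs. */
    for (unsigned schedule = 0; schedule < 16; schedule++) {
        int A = 0, Ready = 0, x = 0, y = 0;
        int p0_step = 0, p1_step = 0;

        for (int slot = 0; slot < 4; slot++) {
            if ((schedule >> slot) & 1u) {
                if (p1_step == 0)
                    x = Ready;          /* (c) */
                else if (p1_step == 1)
                    y = A;              /* (d) */
                p1_step++;
            } else {
                if (p0_step == 0)
                    A = 1;              /* (a) */
                else if (p0_step == 1)
                    Ready = 1;          /* (b) */
                p0_step++;
            }
        }
        if (p0_step == 2 && p1_step == 2)
            outcome_seen[x][y] = true;
    }
}
```

Running it marks (0,0), (0,1), and (1,1) as reachable and leaves (1,0) unmarked, matching the argument above.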

Another Example with Sequential Consistency

Stripped-down version of a 2-process mutex (minus the turn-taking):

    P0                        P1
    want[0] = 1    (a)        want[1] = 1    (c)
    x = want[1]    (b)        y = want[0]    (d)

•  all locations are initialized to 0
•  possible outcomes for (x,y):
   –  (0,1), (1,0), (1,1)
•  (x,y) = (0,0) is not a possible outcome (i.e. want[0] = 0, want[1] = 0):
   –  a->b and c->d implied by program order
   –  x = 0 implies b->c which implies a->d
   –  a->d says y = 1 which leads to a contradiction
   –  similarly, y = 0 implies x = 1 which is also a contradiction
   –  but real hardware will do this!
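To see how hardware can produce the “impossible” (0,0), here is a toy model in which each processor's store first sits in a private write buffer and only later reaches memory. It is a deliberately simplified sketch, not a description of any real processor: the event names, the `ORDERS` table, and the function names are all assumptions.

```c
#include <stdbool.h>

enum event { STORE, LOAD, FLUSH };

/* Two per-processor event orders for "want[p] = 1; result = want[other]". */
static const enum event ORDERS[2][3] = {
    {STORE, FLUSH, LOAD},   /* store reaches memory before the load runs */
    {STORE, LOAD, FLUSH},   /* load runs while the store is still buffered */
};

/* Bit k of schedule picks which processor performs global step k.
 * Returns false unless each processor gets exactly three steps. */
static bool run_schedule(unsigned schedule, int c0, int c1, int result[2])
{
    const enum event *order[2] = {ORDERS[c0], ORDERS[c1]};
    int want[2] = {0, 0};   /* shared memory */
    int step[2] = {0, 0};

    for (int slot = 0; slot < 6; slot++) {
        int p = (int)((schedule >> slot) & 1u);
        if (step[p] == 3)
            return false;
        switch (order[p][step[p]++]) {
        case STORE: break;                              /* value waits in p's buffer */
        case FLUSH: want[p] = 1; break;                 /* buffered store reaches memory */
        case LOAD:  result[p] = want[1 - p]; break;     /* reads the other flag from memory */
        }
    }
    return true;
}

/* Can (x, y) = (0, 0) occur? Without a write buffer only the first order is used. */
bool zero_zero_reachable(bool write_buffer)
{
    int choices = write_buffer ? 2 : 1;

    for (int c0 = 0; c0 < choices; c0++)
        for (int c1 = 0; c1 < choices; c1++)
            for (unsigned schedule = 0; schedule < 64; schedule++) {
                int result[2] = {-1, -1};
                if (run_schedule(schedule, c0, c1, result) &&
                    result[0] == 0 && result[1] == 0)
                    return true;
            }
    return false;
}
```

In this model `zero_zero_reachable(false)` is false, in line with the SC argument, while `zero_zero_reachable(true)` is true: both loads can run before either buffered store is flushed.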

One Approach to Implementing Sequential Consistency

1.  Implement cache coherence
    ->  writes to the same location are observed in same order by all processors

2.  For each processor, delay start of memory access until previous one completes
    ->  each processor has only one outstanding memory access at a time

•  What does it mean for a memory access to complete?

When Do Memory Accesses Complete?

•  Memory Reads:
   –  a read completes when its return value is bound

    load r1 <- X        X = ???  (find X in memory system)  X = 17

•  Memory Writes:
   –  a write completes when the new value is “visible” to other processors

    store 23 -> X       X = 23  (commit to memory order, aka “serialize”)

•  What does “visible” mean?
   –  it does NOT mean that other processors have necessarily seen the value yet
   –  it means the new value is committed to the hypothetical serializable order (HSO)
      •  a later read of X in the HSO will see either this value or a later one
   –  (for simplicity, assume that writes occur atomically)

Summary for Sequential Consistency

•  Maintain order between shared accesses in each processor

[Figure: for each ordered pair of accesses (READ then READ, READ then WRITE, WRITE then READ, WRITE then WRITE): don’t start until previous access completes.]

•  Balloon analogy:
   –  like putting a twist between each individual (ordered) gas particle

•  Severely restricts common hardware and compiler optimizations

Performance of Sequential Consistency

•  Processor issues accesses one-at-a-time and stalls for completion

•  Low processor utilization (17% to 42%) even with caching

From Gupta et al., “Comparative evaluation of latency reducing and tolerating techniques.” In Proceedings of the 18th Annual International Symposium on Computer Architecture (ISCA ’91)

Alternatives to Sequential Consistency

•  Relax constraints on memory order

[Figure: ordering constraints under Total Store Ordering (TSO) (similar to Intel) and Partial Store Ordering (PSO).]

See Section 8.2 of “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1”, http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

Performance Impact of TSO vs. SC

•  Can use a write buffer
•  Write latency is effectively hidden

[Figure: performance chart in which “Base” = SC and “WR” = TSO, shown next to a processor with separate READS and WRITES paths and a write buffer.]

But Can Programs Live with Weaker Memory Orders?

•  “Correctness”: same results as sequential consistency
•  Most programs don’t require strict ordering (all of the time) for “correctness”

[Figure: the same code shown twice, once labeled “Program Order” and once labeled “Sufficient Order”:

    A = 1;
    B = 1;
    unlock L;           lock L;
                        … = A;
                        … = B;
]

•  But how do we know when a program will behave correctly?

Identifying Data Races and Synchronization

•  Two accesses conflict if:
   –  (i) access same location, and (ii) at least one is a write

•  Order accesses by:
   –  program order (po)
   –  dependence order (do): op1 --> op2 if op2 reads op1

•  Data Race:
   –  two conflicting accesses on different processors
   –  not ordered by intervening accesses

•  Properly Synchronized Programs:
   –  all synchronizations are explicitly identified
   –  all data accesses are ordered through synchronization

[Figure: P1 performs Write A, then Write Flag; P2 performs Read Flag.]
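The conflict and data-race definitions above translate directly into two small predicates. This is a minimal sketch: the `mem_access` record and both function names are hypothetical, and whether two accesses are ordered by intervening synchronization is simply passed in by the caller rather than computed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one memory access (field names are illustrative). */
struct mem_access {
    int processor;        /* which processor issued it */
    uintptr_t location;   /* address accessed */
    bool is_write;
};

/* (i) same location, and (ii) at least one is a write */
bool accesses_conflict(struct mem_access a, struct mem_access b)
{
    return a.location == b.location && (a.is_write || b.is_write);
}

/* Two conflicting accesses on different processors form a data race
 * unless intervening accesses (synchronization) order them. */
bool is_data_race(struct mem_access a, struct mem_access b, bool ordered_by_sync)
{
    return accesses_conflict(a, b) &&
           a.processor != b.processor &&
           !ordered_by_sync;
}
```

For example, a write of A on P1 and a read of A on P2 conflict; they race only if nothing (such as the Flag write and read) orders them.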

Optimizations for Synchronized Programs

•  Intuition: many parallel programs have mixtures of “private” and “public” parts*
   –  the “private” parts must be protected by synchronization (e.g., locks)
   –  can we take advantage of synchronization to improve performance?

Example:

    READ/WRITE
    …
    READ/WRITE

    Grab a lock

    READ/WRITE          Insert node into data structure
    …                   •  Essentially a “private” activity; reordering is ok
    READ/WRITE

    Release the lock    •  Now we make it “public” to the other nodes

    READ/WRITE
    …
    READ/WRITE

*Caveat: shared data is in fact always visible to other threads.

Optimizations for Synchronized Programs

•  Exploit information about synchronization

•  properly synchronized programs should yield the same result as on an SC machine

“Weak Ordering” (WO)

[Figure: three groups of READ/WRITE operations separated by synchronization operations.]

Between synchronization operations:
•  we can allow reordering of memory operations
•  (as long as intra-thread dependences are preserved)

Just before and just after synchronization operations:
•  thread must wait for all prior operations to complete

Intel’s MFENCE (Memory Fence) Operation

•  An MFENCE operation enforces the ordering seen on the previous slide:
   –  does not begin until all prior reads & writes from that thread have completed
   –  no subsequent read or write from that thread can start until after it finishes

[Figure: groups of READ/WRITE operations separated by MFENCE operations.]

Balloon analogy: it is a twist in the balloon
•  no gas particles can pass through it
(Image credit: wikiHow)

Good news: xchg does this implicitly!
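In portable C11, the closest counterpart to an MFENCE is `atomic_thread_fence(memory_order_seq_cst)`; on x86, compilers typically lower it to MFENCE or a locked instruction. The sketch below applies it to the stripped-down mutex example: with a full fence between each thread's store and load, the C11 memory model rules out both threads reading 0. The names `worker` and `run_fenced_test` are assumptions, and POSIX threads are assumed to be available.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int want[2];
static int observed[2];   /* each thread records what it read */

static void *worker(void *arg)
{
    int me = *(int *)arg;
    int other = 1 - me;

    atomic_store_explicit(&want[me], 1, memory_order_relaxed);
    /* Full fence: the store above completes before the load below starts. */
    atomic_thread_fence(memory_order_seq_cst);
    observed[me] = atomic_load_explicit(&want[other], memory_order_relaxed);
    return NULL;
}

/* Runs the test once. Returns 1 if (0,0) was observed, 0 if not,
 * and -1 if a thread could not be created. */
int run_fenced_test(void)
{
    pthread_t threads[2];
    int ids[2] = {0, 1};

    atomic_store(&want[0], 0);
    atomic_store(&want[1], 0);
    if (pthread_create(&threads[0], NULL, worker, &ids[0]) != 0)
        return -1;
    if (pthread_create(&threads[1], NULL, worker, &ids[1]) != 0) {
        pthread_join(threads[0], NULL);
        return -1;
    }
    pthread_join(threads[0], NULL);
    pthread_join(threads[1], NULL);
    return observed[0] == 0 && observed[1] == 0;
}
```

Note that the fence does not push anything to the other thread; it only stalls the thread that executes it until its earlier accesses are complete.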

ARM Processors

•  ARM processors have a very relaxed consistency model

•  ARM has some great examples in their programmer’s reference:
   –  http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf

•  A great list regarding relaxed memory consistency in general:
   –  http://www.cl.cam.ac.uk/~pes20/weakmemory/

Common Misconception about MFENCE

•  MFENCE operations do NOT push values out to other threads
   –  it is not a magic “make every thread up-to-date” operation

•  Instead, they simply stall the thread that performs the MFENCE

[Figure: Thread 0, Thread 1, Thread 2, and Thread 3, each with its own READ/WRITE operations divided by its own MFENCE operations.]

MFENCE operations create partial orderings
•  that are observable across threads

Earlier (Broken) Example Revisited

Where exactly should we insert MFENCE operations to fix this?

    P0                      P1
    [1: Here?]              [4: Here?]
    A = 1                   x = Ready
    [2: Here?]              [5: Here?]
    Ready = 1               y = A
    [3: Here?]              [6: Here?]
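One placement that works is a fence between the two accesses on each side, i.e. positions 2 and 5: P0's write to A must complete before Ready is written, and P1's read of Ready must complete before A is read. The sketch below expresses that placement with C11 fences standing in for MFENCE. The function names are assumptions, POSIX threads are assumed, and weaker fences may suffice on some architectures.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int A, Ready;

static void *p0(void *unused)
{
    (void)unused;
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);                      /* position 2 */
    atomic_store_explicit(&Ready, 1, memory_order_relaxed);
    return NULL;
}

static void *p1(void *result)
{
    int *xy = result;
    xy[0] = atomic_load_explicit(&Ready, memory_order_relaxed);     /* x = Ready */
    atomic_thread_fence(memory_order_seq_cst);                      /* position 5 */
    xy[1] = atomic_load_explicit(&A, memory_order_relaxed);         /* y = A */
    return NULL;
}

/* Runs both processors once and stores (x, y) in xy.
 * Returns 0 on success, -1 if a thread could not be created. */
int run_message_passing(int xy[2])
{
    pthread_t t0, t1;

    atomic_store(&A, 0);
    atomic_store(&Ready, 0);
    if (pthread_create(&t0, NULL, p0, NULL) != 0)
        return -1;
    if (pthread_create(&t1, NULL, p1, xy) != 0) {
        pthread_join(t0, NULL);
        return -1;
    }
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```

With these two fences, any run that reads x == 1 must also read y == 1; the outcomes (0,0), (0,1), and (1,1) all remain possible.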

Exploiting Asymmetry in Synchronization: “Release Consistency”

•  Lock operation: only gains (“acquires”) permission to access data
•  Unlock operation: only gives away (“releases”) permission to access data

[Figure: Weak Ordering (WO) versus Release Consistency (RC). Under WO, the numbered blocks of READ/WRITE operations (1, 2, 3) are fully separated by the lock and UNLOCK operations, which is overly conservative. Under RC, the same blocks are ordered only by ACQUIRE and RELEASE.]
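C11 exposes this asymmetry directly through `memory_order_acquire` and `memory_order_release`. In the sketch below, a producer writes ordinary data and then performs a release store; a consumer spins on an acquire load and only then reads the data. The variable and function names are assumptions, and POSIX threads are assumed to be available.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;         /* ordinary data, written before the release */
static atomic_int ready;

static void *producer(void *unused)
{
    (void)unused;
    payload = 42;                                               /* READ/WRITE block */
    atomic_store_explicit(&ready, 1, memory_order_release);     /* RELEASE */
    return NULL;
}

static void *consumer(void *out)
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)   /* ACQUIRE */
        continue;           /* spin until the producer releases */
    *(int *)out = payload;  /* ordered after the acquire */
    return NULL;
}

/* Returns the value the consumer received, or -1 if a thread could not be created. */
int transfer_once(void)
{
    pthread_t prod, cons;
    int received = -1;

    payload = 0;
    atomic_store(&ready, 0);
    if (pthread_create(&prod, NULL, producer, NULL) != 0)
        return -1;
    if (pthread_create(&cons, NULL, consumer, &received) != 0) {
        pthread_join(prod, NULL);
        return -1;
    }
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return received;
}
```

The release only keeps earlier accesses from moving after it, and the acquire only keeps later accesses from moving before it, so accesses outside the protected region stay free to be reordered.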

Intel’s Full Set of Fence Operations

•  In addition to MFENCE, Intel also supports two other fence operations:
   –  LFENCE: serializes only with respect to load operations (not stores!)
   –  SFENCE: serializes only with respect to store operations (not loads!)
      •  Note: It does slightly more than this; see the spec for details:
         –  Section 8.2.5 of “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1”

•  In practice, you are most likely to use:
   –  MFENCE
   –  xchg

Take-Away Messages on Memory Consistency Models

•  DON’T use only normal memory operations for synchronization
   –  e.g., Peterson’s solution (from Synchronization #1 lecture)

    boolean want[2] = {false, false};
    int turn = 0;

    want[i] = true;
    turn = j;
    while (want[j] && turn == j)
        continue;
    … critical section …
    want[i] = false;

    Exercise for the reader: Where should we add fences (and which type) to fix this?

•  DO use either explicit synchronization operations (e.g., xchg) or fences

    while (!xchg(&lock_available, 0))
        continue;
    … critical section …
    xchg(&lock_available, 1);
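The xchg-based lock above maps onto C11's `atomic_exchange`, which x86 compilers typically implement with an xchg instruction. The sketch below is one way to write it and exercise it with two threads incrementing a shared counter. The helper names, the counter, and the use of POSIX threads are assumptions made for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int lock_available = 1;
static long counter;                  /* protected by the lock */

static void acquire_lock(void)
{
    /* Swap in 0; we own the lock only if the old value was 1. */
    while (!atomic_exchange(&lock_available, 0))
        continue;
}

static void release_lock(void)
{
    atomic_exchange(&lock_available, 1);
}

static void *add_many(void *arg)
{
    long increments = *(long *)arg;
    for (long i = 0; i < increments; i++) {
        acquire_lock();
        counter++;                    /* critical section */
        release_lock();
    }
    return NULL;
}

/* Two threads each add `increments` to the counter.
 * Returns the final count, or -1 if a thread could not be created. */
long count_with_lock(long increments)
{
    pthread_t t0, t1;

    counter = 0;
    if (pthread_create(&t0, NULL, add_many, &increments) != 0)
        return -1;
    if (pthread_create(&t1, NULL, add_many, &increments) != 0) {
        pthread_join(t0, NULL);
        return -1;
    }
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return counter;
}
```

Because `atomic_exchange` is sequentially consistent by default, no separate fence is needed around the critical section, which is the point of the “xchg does this implicitly” remark.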

Summary: Relaxed Consistency

•  Motivation:
   –  obtain higher performance by allowing reordering of memory operations
      •  (reordering is not allowed by sequential consistency)

•  One cost is software complexity:
   –  the programmer or compiler must insert synchronization
      •  to ensure certain specific orderings when needed

•  In practice:
   –  complexities often encapsulated in libraries that provide intuitive primitives
      •  e.g., lock/unlock, barriers (or lower-level primitives like fence)

•  Relaxed models differ in which memory ordering constraints they ignore
