Lecture 13: Memory Consistency - 15-418/618 Fall...

Carnegie Mellon Lecture 13: Memory Consistency Parallel Computer Architecture and Programming CMU 15-418/15-618, Fall 2016 CMU 15-418/618, Fall 2017 1


Page 1: Lecture 13: Memory Consistency - 15-418/618 Fall …15418.courses.cs.cmu.edu/.../13_consistency_slides.pdfCarnegie Mellon Lecture 13: Memory Consistency Parallel Computer Architecture


What is Correct Behavior for a Parallel Memory Hierarchy?

•  Note: side-effects of writes are only observable when reads occur
   –  so we will focus on the values returned by reads

•  Intuitive answer:
   –  reading a location should return the latest value written (by any thread)

•  Hmm… what does “latest” mean exactly?
   –  within a thread, it can be defined by program order
   –  but what about across threads?
      •  the most recent write in physical time?
         –  hopefully not, because there is no way that the hardware can pull that off
            »  e.g., if it takes >10 cycles to communicate between processors, there is no way that processor 0 can know what processor 1 did 2 clock ticks ago
      •  most recent based upon something else?
         –  Hmm…

Refining Our Intuition

Thread 0:
    // write evens to X
    for (i=0; i<N; i+=2) { X = i; … }

Thread 1:
    // write odds to X
    for (j=1; j<N; j+=2) { X = j; … }

Thread 2:
    … A = X; … B = X; … C = X; …

(Assume: X=0 initially, and these are the only writes to X.)

•  What would be some clearly illegal combinations of (A,B,C)?
•  How about: (4,8,1)? (9,12,3)? (7,19,31)?
•  What can we generalize from this?
   –  writes from any particular thread must be consistent with program order
      •  in this example, observed even numbers must be increasing (ditto for odds)
   –  across threads: writes must be consistent with a valid interleaving of threads
      •  not physical time! (programmer cannot rely upon that)
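The interleaving rule can be checked mechanically. The sketch below is not from the lecture: it assumes the three threads shown on this slide, a hypothetical bound MAX_N on N, and a hypothetical function name observable(). It searches all interleavings of the even writes, the odd writes, and Thread 2's three reads, and reports whether a given (A,B,C) can be produced by at least one of them.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_N 64   /* assumed upper bound on N for this sketch */

/* visited[e][o][w][k] marks a state that has already been explored:
   e, o = next even / odd value still to be written,
   w    = who wrote X last (0 = initial value, 1 = Thread 0, 2 = Thread 1),
   k    = how many of Thread 2's reads have been matched so far. */
static bool visited[MAX_N + 2][MAX_N + 3][3][4];

static bool search(int n, const int reads[3], int e, int o, int w, int k) {
    if (k == 3) return true;                    /* A, B and C all matched */
    if (visited[e][o][w][k]) return false;      /* explored before, no luck */
    visited[e][o][w][k] = true;

    int cur = (w == 0) ? 0 : (w == 1) ? e - 2 : o - 2;   /* current X */

    /* Option 1: Thread 2 performs its next read now. */
    if (cur == reads[k] && search(n, reads, e, o, w, k + 1)) return true;
    /* Option 2: Thread 0 writes its next even value. */
    if (e < n && search(n, reads, e + 2, o, 1, k)) return true;
    /* Option 3: Thread 1 writes its next odd value. */
    if (o < n && search(n, reads, e, o + 2, 2, k)) return true;
    return false;
}

/* True if some interleaving lets Thread 2 read (a, b, c), in that order. */
bool observable(int n, int a, int b, int c) {
    assert(n >= 0 && n <= MAX_N);
    int reads[3] = {a, b, c};
    memset(visited, 0, sizeof visited);
    return search(n, reads, 0, 1, 0, 0);
}
```

With N = 32, this model accepts (4,8,1) and (7,19,31) and rejects (9,12,3), because the odd values 9 and 3 would have to be seen against Thread 1's program order. It also rejects (4,1,4): once the write of 1 has overwritten the single write of 4, that 4 cannot reappear.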

Visualizing Our Intuition

[Figure: Thread 0, Thread 1, and Thread 2 (same code as on the previous slide) run on CPU 0, CPU 1, and CPU 2, which share a single port to Memory.]

•  Each thread proceeds in program order
•  Memory accesses interleaved (one at a time) to a single-ported memory
   –  rate of progress of each thread is unpredictable

Correctness Revisited

[Figure: the same three threads on CPU 0, CPU 1, and CPU 2, sharing a single port to Memory.]

Recall: “reading a location should return the latest value written (by any thread)”
→  “latest” means consistent with some interleaving that matches this model
   –  this is a hypothetical interleaving; the machine didn’t necessarily do this!

Part 2 of Memory Correctness: Memory Consistency Model

1.  “Cache Coherence”
    –  do all loads and stores to a given cache block behave correctly?

2.  “Memory Consistency Model” (sometimes called “Memory Ordering”)
    –  do all loads and stores, even to separate cache blocks, behave correctly?

Recall: our intuition

[Figure: CPU 0, CPU 1, and CPU 2 sharing a single port to Memory.]

Why is this so complicated?

•  Fundamental issue:
   –  loads and stores are very expensive, even on a uniprocessor
      •  can easily take 10’s to 100’s of cycles

•  What programmers intuitively expect:
   –  processor atomically performs one instruction at a time, in program order

•  In reality:
   –  if the processor actually operated this way, it would be painfully slow
   –  instead, the processor aggressively reorders instructions to hide memory latency

•  Upshot:
   –  within a given thread, the processor preserves the program order illusion
   –  but this illusion has nothing to do with what happens in physical time!
   –  from the perspective of other threads, all bets are off!

Hiding Memory Latency is Important for Performance

•  Idea: overlap memory accesses with other accesses and computation

[Figure: timelines for “write A” followed by “read B”, and for “write A” overlapped with “read B”.]

•  Hiding write latency is simple in uniprocessors:
   –  add a write buffer

•  (But this affects correctness in multiprocessors)

[Figure: Processor, write buffer, and Cache. READS go from the processor to the cache; WRITES pass through the write buffer.]

How Can We Hide the Latency of Memory Reads?

“Out of order” pipelining:
   –  when an instruction is stuck, perhaps there are subsequent instructions that can be executed

    x = *p;        ← suffers expensive cache miss
    y = x + 1;     ← stuck waiting on true dependence
    z = a + 2;     }
    b = c / 3;     } these do not need to wait

•  Implication: memory accesses may be performed out-of-order!!!

What About Conditional Branches?

•  Do we need to wait for a conditional branch to be resolved before proceeding?
   –  No! Just predict the branch outcome and continue executing speculatively.
      •  if prediction is wrong, squash any side-effects and restart down correct path

    x = *p;
    y = x + 1;
    z = a + 2;
    b = c / 3;
    if (x != z)      ← if hardware guesses that this is true
      d = e - 7;     ← then execute “then” part (speculatively) (without waiting for x or z)
    else
      d = e + 5;
    …

How Out-of-Order Pipelining Works in Modern Processors

•  Fetch and graduate instructions in-order, but issue out-of-order

•  Intra-thread dependences are preserved, but memory accesses get reordered!

[Figure: the PC (0x10, then 0x14, 0x18, 0x1c), the Inst. Cache, and the Branch Predictor feed a Reorder Buffer holding:
    0x10: x = *p;        issue (cache miss)
    0x14: y = x + 1;     can’t issue
    0x18: z = a + 2;     issue (out-of-order)
    0x1c: b = c / 3;     issue (out-of-order)]

Analogy: Gas Particles in Balloons

•  Imagine that each instruction within a thread is a gas particle inside a twisty balloon
•  They were numbered originally, but then they start to move and bounce around
•  When a given thread observes memory accesses from a different thread:
   –  those memory accesses can be (almost) arbitrarily jumbled around
      •  like trying to locate the position of a particular gas particle in a balloon
•  As we’ll see later, the only thing that we can do is to put twists in the balloon

[Figure (balloon image: wikiHow): Thread 0, Thread 1, Thread 2, and Thread 3 drawn as balloons along a Time axis, each holding particles numbered 11 to 15 in a different order.]

Uniprocessor Memory Model

•  Memory model specifies ordering constraints among accesses

•  Uniprocessor model: memory accesses atomic and in program order

    write A
    write B
    read A
    read B

•  Not necessary to maintain sequential order for correctness
   –  hardware: buffering, pipelining
   –  compiler: register allocation, code motion

•  Simple for programmers

•  Allows for high performance

[Figure: Processor, write buffer, and Cache. READS go to the cache; WRITES go through the write buffer. Reads check for matching addresses in write buffer.]
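A toy model makes the write-buffer rule concrete. The sketch below is an illustration, not the lecture's design: the Core type, the buffer size, and the function names are all assumptions. Writes are queued instead of going to memory, and a read first looks for the newest buffered write to the same address, which is what keeps the single thread's view in program order.

```c
#include <assert.h>

#define MEM_WORDS 16   /* assumed size of the toy memory */
#define WB_SLOTS  4    /* assumed write buffer capacity  */

typedef struct { int addr; int value; } PendingWrite;

typedef struct {
    int memory[MEM_WORDS];       /* stands in for the cache/memory */
    PendingWrite wb[WB_SLOTS];   /* FIFO write buffer, oldest entry first */
    int wb_count;
} Core;

/* Retire the oldest buffered write to memory. */
void drain_one(Core *c) {
    if (c->wb_count == 0) return;
    c->memory[c->wb[0].addr] = c->wb[0].value;
    for (int i = 1; i < c->wb_count; i++) c->wb[i - 1] = c->wb[i];
    c->wb_count--;
}

/* A write only enters the buffer, so the processor need not wait for memory. */
void core_write(Core *c, int addr, int value) {
    assert(addr >= 0 && addr < MEM_WORDS);
    if (c->wb_count == WB_SLOTS) drain_one(c);   /* buffer full: make room */
    c->wb[c->wb_count++] = (PendingWrite){addr, value};
}

/* A read checks the buffer, newest entry first, for a matching address. */
int core_read(const Core *c, int addr) {
    assert(addr >= 0 && addr < MEM_WORDS);
    for (int i = c->wb_count - 1; i >= 0; i--)
        if (c->wb[i].addr == addr) return c->wb[i].value;
    return c->memory[addr];
}
```

In this model the thread that wrote a value always reads it back, even while memory still holds the old value. Another processor reading memory directly would see the stale value until the buffer drains.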

In Parallel Machines (with a Shared Address Space)

•  Order between accesses to different locations becomes important

    P1                   P2
    A = 1;
    Ready = 1;           while (Ready != 1);
                         … = A;

    (Initially A and Ready = 0)

How Unsafe Reordering Can Happen

•  Distribution of memory resources
   –  accesses issued in order may be observed out of order

[Figure: three Processor/Memory nodes on an Interconnection Network. One processor runs “… A = 1; Ready = 1;” and another runs “wait (Ready == 1); … = A;”. Memory initially holds A: 0 and Ready: 0. The updates A = 1 and Ready = 1 travel through the network as separate messages, and one location is shown changing to 1.]

Caches Complicate Things More

•  Multiple copies of the same location

[Figure: three Processor/Cache/Memory nodes on an Interconnection Network. One processor runs “A = 1;”, a second runs “wait (A == 1); B = 1;”, and a third runs “wait (B == 1); … = A;”. The caches initially hold A: 0 and B: 0. The updates A = 1 and B = 1 propagate as separate messages, changing cached entries from 0 to 1. Oops!]

Our Intuitive Model: “Sequential Consistency” (SC)

•  Formalized by Lamport (1979)
   –  accesses of each processor in program order
   –  all accesses appear in sequential order

•  Any order implicitly assumed by programmer is maintained

[Figure: processors P0, P1, …, Pn connected to a single Memory.]

Example with Sequential Consistency

Simple Synchronization:

    P0                     P1
    A = 1        (a)       x = Ready    (c)
    Ready = 1    (b)       y = A        (d)

•  all locations are initialized to 0
•  possible outcomes for (x,y):
   –  (0,0), (0,1), (1,1)
•  (x,y) = (1,0) is not a possible outcome (i.e. Ready = 1, A = 0):
   –  we know a->b and c->d by program order
   –  b->c implies that a->d
   –  y==0 implies d->a which leads to a contradiction
   –  but real hardware will do this!
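The outcome list can be confirmed by brute force. The sketch below (the function names are hypothetical) enumerates all six ways to merge P0's operations (a), (b) with P1's operations (c), (d) while keeping each processor in program order, and records which (x,y) pairs occur. It models sequential consistency only, not real hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* seen[x][y] is set when some sequentially consistent interleaving yields (x, y). */
static bool seen[2][2];

/* Run one interleaving. order[i] says which processor (0 or 1) takes step i. */
static void run(const int order[4]) {
    int A = 0, Ready = 0, x = 0, y = 0;
    int pc0 = 0, pc1 = 0;                 /* next operation of P0 / P1 */
    for (int i = 0; i < 4; i++) {
        if (order[i] == 0) {
            if (pc0++ == 0) A = 1;        /* (a) */
            else            Ready = 1;    /* (b) */
        } else {
            if (pc1++ == 0) x = Ready;    /* (c) */
            else            y = A;        /* (d) */
        }
    }
    seen[x][y] = true;
}

/* Enumerate every way to merge P0's two operations with P1's two. */
void enumerate_sc(void) {
    for (int mask = 0; mask < 16; mask++) {
        int order[4], ones = 0;
        for (int i = 0; i < 4; i++) {
            order[i] = (mask >> i) & 1;
            ones += order[i];
        }
        if (ones == 2) run(order);        /* each processor takes exactly 2 steps */
    }
}

/* Query the table filled in by enumerate_sc(). */
bool sc_allows(int x, int y) {
    assert((x == 0 || x == 1) && (y == 0 || y == 1));
    return seen[x][y];
}
```

After enumerate_sc() runs, sc_allows() is true for (0,0), (0,1), and (1,1) and false for (1,0), matching the slide.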

Another Example with Sequential Consistency

Stripped-down version of a 2-process mutex (minus the turn-taking):

    P0                       P1
    want[0] = 1    (a)       want[1] = 1    (c)
    x = want[1]    (b)       y = want[0]    (d)

•  all locations are initialized to 0
•  possible outcomes for (x,y):
   –  (0,1), (1,0), (1,1)
•  (x,y) = (0,0) is not a possible outcome (i.e. want[0] = 0, want[1] = 0):
   –  a->b and c->d implied by program order
   –  x = 0 implies b->c which implies a->d
   –  a->d says y = 1 which leads to a contradiction
   –  similarly, y = 0 implies x = 1 which is also a contradiction
   –  but real hardware will do this!
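One way real hardware can produce (0,0) is the write buffer from the earlier slides. The sketch below is a toy simulation, not a description of any particular processor: each processor gets a hypothetical single-entry write buffer, loads read memory unless the processor's own buffer holds a matching write, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool pending;   /* is a write sitting in this processor's write buffer? */
    int  index;     /* which want[] entry it targets */
    int  value;
} WriteBuffer;

static int want[2];            /* shared memory */
static WriteBuffer wb[2];      /* one single-entry write buffer per processor */

static void store(int p, int index, int value) {
    wb[p] = (WriteBuffer){true, index, value};   /* buffered, not yet in memory */
}

static int load(int p, int index) {
    if (wb[p].pending && wb[p].index == index)   /* forward own pending write */
        return wb[p].value;
    return want[index];                          /* otherwise read memory */
}

static void drain(int p) {
    if (wb[p].pending) {
        want[wb[p].index] = wb[p].value;
        wb[p].pending = false;
    }
}

/* If drain_before_load is true, each processor empties its buffer between
   its store and its load, so the write completes before the read starts. */
void run_mutex_litmus(bool drain_before_load, int *x, int *y) {
    want[0] = want[1] = 0;
    wb[0].pending = wb[1].pending = false;
    store(0, 0, 1);                       /* (a) want[0] = 1 */
    store(1, 1, 1);                       /* (c) want[1] = 1 */
    if (drain_before_load) { drain(0); drain(1); }
    *x = load(0, 1);                      /* (b) x = want[1] */
    *y = load(1, 0);                      /* (d) y = want[0] */
    drain(0);                             /* the writes reach memory eventually */
    drain(1);
}
```

With drain_before_load set to false, both loads run while both stores are still buffered, so the result is (0,0) and both processes would enter the critical section. Forcing each write to complete before the following read rules that outcome out in this model.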

One Approach to Implementing Sequential Consistency

1.  Implement cache coherence
    →  writes to the same location are observed in same order by all processors

2.  For each processor, delay start of memory access until previous one completes
    →  each processor has only one outstanding memory access at a time

•  What does it mean for a memory access to complete?

When Do Memory Accesses Complete?

•  Memory Reads:
   –  a read completes when its return value is bound

    load r1 ← X                     X = ???
    (Find X in memory system)       X = 17
    r1 = 17

•  Memory Writes:
   –  a write completes when the new value is “visible” to other processors

    store 23 → X                                  X = 23
    (Commit to memory order) (aka “serialize”)

•  What does “visible” mean?
   –  it does NOT mean that other processors have necessarily seen the value yet
   –  it means the new value is committed to the hypothetical serializable order (HSO)
      •  a later read of X in the HSO will see either this value or a later one
   –  (for simplicity, assume that writes occur atomically)

Summary for Sequential Consistency

•  Maintain order between shared accesses in each processor

[Figure: the four pairings of an earlier and a later access: READ → READ, READ → WRITE, WRITE → READ, WRITE → WRITE. Don’t start until previous access completes.]

•  Balloon analogy:
   –  like putting a twist between each individual (ordered) gas particle

•  Severely restricts common hardware and compiler optimizations

Performance of Sequential Consistency

•  Processor issues accesses one-at-a-time and stalls for completion

•  Low processor utilization (17% - 42%) even with caching

From Gupta et al, “Comparative evaluation of latency reducing and tolerating techniques.” In Proceedings of the 18th annual International Symposium on Computer Architecture (ISCA '91)

Alternatives to Sequential Consistency

•  Relax constraints on memory order

[Figure: the four pairings (READ → READ, READ → WRITE, WRITE → READ, WRITE → WRITE), drawn once for each model.]

•  Total Store Ordering (TSO) (Similar to Intel)
   –  relaxes WRITE → READ: a read may complete before an earlier write to a different location
•  Partial Store Ordering (PSO)
   –  relaxes WRITE → READ and also WRITE → WRITE: writes to different locations may complete out of order

See Section 8.2 of “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1”, http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
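The difference between the models fits in a small predicate. The sketch below uses the usual textbook characterization (SC keeps all four orderings, TSO lets a read pass an earlier write, PSO additionally lets writes pass each other) and assumes the two accesses come from the same thread and go to different locations. The enum and function names are hypothetical.

```c
#include <stdbool.h>

typedef enum { MODEL_SC, MODEL_TSO, MODEL_PSO } Model;
typedef enum { ACCESS_READ, ACCESS_WRITE } Access;

/* May `later` complete before `earlier`, given that both are issued by the
   same thread, in that program order, to different locations? */
bool may_reorder(Model m, Access earlier, Access later) {
    if (earlier == ACCESS_WRITE && later == ACCESS_READ)
        return m == MODEL_TSO || m == MODEL_PSO;   /* WRITE → READ relaxed  */
    if (earlier == ACCESS_WRITE && later == ACCESS_WRITE)
        return m == MODEL_PSO;                     /* WRITE → WRITE relaxed */
    return false;          /* READ → READ and READ → WRITE are kept by all three */
}
```

For example, may_reorder(MODEL_TSO, ACCESS_WRITE, ACCESS_READ) is true, which is what allows a write buffer, while may_reorder(MODEL_SC, ...) is false for every pairing.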

Performance Impact of TSO vs. SC

•  Can use a write buffer
•  Write latency is effectively hidden

[Figure: performance chart in which “Base” = SC and “WR” = TSO, next to a diagram of Processor, write buffer, and Cache (READS go to the cache; WRITES go through the write buffer).]

But Can Programs Live with Weaker Memory Orders?

•  “Correctness”: same results as sequential consistency
•  Most programs don’t require strict ordering (all of the time) for “correctness”

    Program Order                  Sufficient Order

    A = 1;                         A = 1;
    B = 1;                         B = 1;
    unlock L;    lock L;           unlock L;    lock L;
                 … = A;                         … = A;
                 … = B;                         … = B;

•  But how do we know when a program will behave correctly?

Identifying Data Races and Synchronization

•  Two accesses conflict if:
   –  (i) access same location, and (ii) at least one is a write

•  Order accesses by:
   –  program order (po)
   –  dependence order (do): op1 --> op2 if op2 reads op1

    P1                      P2
    Write A
       | po
    Write Flag   --do-->    Read Flag
                               | po
                            Read A

•  Data Race:
   –  two conflicting accesses on different processors
   –  not ordered by intervening accesses

•  Properly Synchronized Programs:
   –  all synchronizations are explicitly identified
   –  all data accesses are ordered through synchronization

Optimizations for Synchronized Programs

•  Intuition: many parallel programs have mixtures of “private” and “public” parts*
   –  the “private” parts must be protected by synchronization (e.g., locks)
   –  can we take advantage of synchronization to improve performance?

    READ/WRITE … READ/WRITE
    SYNCH
    READ/WRITE … READ/WRITE
    SYNCH
    READ/WRITE … READ/WRITE

Example:
   Grab a lock
   Insert node into data structure
   •  Essentially a “private” activity; reordering is ok
   Release the lock
   •  Now we make it “public” to the other nodes

*Caveat: shared data is in fact always visible to other threads.

Optimizations for Synchronized Programs

•  Exploit information about synchronization

•  properly synchronized programs should yield the same result as on an SC machine

“Weak Ordering” (WO)

    READ/WRITE … READ/WRITE
    SYNCH
    READ/WRITE … READ/WRITE
    SYNCH
    READ/WRITE … READ/WRITE

Between synchronization operations:
•  we can allow reordering of memory operations
•  (as long as intra-thread dependences are preserved)

Just before and just after synchronization operations:
•  thread must wait for all prior operations to complete

Intel’s MFENCE (Memory Fence) Operation

•  An MFENCE operation enforces the ordering seen on the previous slide:
   –  does not begin until all prior reads & writes from that thread have completed
   –  no subsequent read or write from that thread can start until after it finishes

    READ/WRITE … READ/WRITE
    MFENCE
    READ/WRITE … READ/WRITE
    MFENCE
    READ/WRITE … READ/WRITE

Balloon analogy: it is a twist in the balloon
•  no gas particles can pass through it
(balloon image: wikiHow)

Good news: xchg does this implicitly!

ARM Processors

•  ARM processors have a very relaxed consistency model

•  ARM has some great examples in their programmer’s reference:
   –  http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf

•  A great list regarding relaxed memory consistency in general:
   –  http://www.cl.cam.ac.uk/~pes20/weakmemory/

Common Misconception about MFENCE

•  MFENCE operations do NOT push values out to other threads
   –  it is not a magic “make every thread up-to-date” operation

•  Instead, they simply stall the thread that performs the MFENCE

    READ/WRITE … READ/WRITE
    MFENCE
    READ/WRITE … READ/WRITE
    MFENCE
    READ/WRITE … READ/WRITE

[Figure: the balloons for Thread 0, Thread 1, Thread 2, and Thread 3 along a Time axis, each holding particles numbered 11 to 15; in one balloon a twist separates particles 11 to 13 from particles 14 and 15.]

MFENCE operations create partial orderings
•  that are observable across threads

Earlier (Broken) Example Revisited

Where exactly should we insert MFENCE operations to fix this?

    P0                  P1
    [1: Here?]          [4: Here?]
    A = 1               x = Ready
    [2: Here?]          [5: Here?]
    Ready = 1           y = A
    [3: Here?]          [6: Here?]
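One plausible answer, offered as a sketch and not as the lecture's stated solution, is positions [2] and [5]: the fence at [2] keeps A = 1 ahead of Ready = 1, and the fence at [5] keeps the read of Ready ahead of the read of A. The version below expresses that in portable C11 with atomic_thread_fence(memory_order_seq_cst), which compilers commonly lower to MFENCE or an equivalent locked instruction on x86. The function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int A;       /* both start at 0, as on the slide */
static atomic_int Ready;

/* P0: write A, then raise Ready, with a full fence at position [2]. */
void p0_publish(int value) {
    atomic_store_explicit(&A, value, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);          /* [2] */
    atomic_store_explicit(&Ready, 1, memory_order_relaxed);
}

/* P1: sample Ready, then read A, with a full fence at position [5].
   Returns x and stores y through the pointer. */
int p1_observe(int *y) {
    assert(y != NULL);
    int x = atomic_load_explicit(&Ready, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);          /* [5] */
    *y = atomic_load_explicit(&A, memory_order_relaxed);
    return x;
}
```

With both fences in place, a P1 that observes x == 1 is guaranteed by the C11 fence rules to also observe the value P0 stored to A, so (x,y) = (1,0) is excluded. Fences at the other positions do not order the two accesses within either thread.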

Exploiting Asymmetry in Synchronization: “Release Consistency”

•  Lock operation: only gains (“acquires”) permission to access data
•  Unlock operation: only gives away (“releases”) permission to access data

Weak Ordering (WO): Overly Conservative

    1:  READ/WRITE … READ/WRITE
    LOCK
    2:  READ/WRITE … READ/WRITE
    UNLOCK
    3:  READ/WRITE … READ/WRITE

Release Consistency (RC)

[Figure: the same blocks 1, 2, and 3 with LOCK replaced by ACQUIRE and UNLOCK by RELEASE. An ACQUIRE only has to complete before the accesses that follow it, and a RELEASE only has to wait for the accesses that precede it, so the blocks may overlap more than under WO.]
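C11 atomics expose this asymmetry directly through memory_order_release and memory_order_acquire. The sketch below is an illustration, with a hypothetical Mailbox type and function names. The producer's ordinary write may not sink below the release store, and the consumer's ordinary read may not rise above the acquire load. Unrelated accesses on the other side of each operation remain free to move.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int payload;          /* ordinary data, published through the flag */
    atomic_bool ready;    /* synchronization variable */
} Mailbox;

/* Producer: ordinary write, then RELEASE.
   Accesses before the release may not be reordered after it. */
void mailbox_post(Mailbox *m, int value) {
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Consumer: ACQUIRE, then ordinary read.
   Accesses after the acquire may not be reordered before it. */
bool mailbox_try_take(Mailbox *m, int *out) {
    assert(out != NULL);
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;                 /* nothing published yet */
    *out = m->payload;
    return true;
}
```

A consumer thread would call mailbox_try_take() in a loop until it returns true. Once it does, the acquire/release pairing guarantees that the payload read sees the value written before the matching mailbox_post().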

Intel’s Full Set of Fence Operations

•  In addition to MFENCE, Intel also supports two other fence operations:
   –  LFENCE: serializes only with respect to load operations (not stores!)
   –  SFENCE: serializes only with respect to store operations (not loads!)

•  Note: It does slightly more than this; see the spec for details:
   –  Section 8.2.5 of “Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1”

•  In practice, you are most likely to use:
   –  MFENCE
   –  xchg

Take-Away Messages on Memory Consistency Models

•  DON’T use only normal memory operations for synchronization
   –  e.g., Peterson’s solution (from Synchronization #1 lecture)

    boolean want[2] = {false, false};
    int turn = 0;

    want[i] = true;
    turn = j;
    while (want[j] && turn == j)
      continue;
    … critical section …
    want[i] = false;

   Exercise for the reader: Where should we add fences (and which type) to fix this?

•  DO use either explicit synchronization operations (e.g., xchg) or fences

    while (!xchg(&lock_available, 0))
      continue;
    … critical section …
    xchg(&lock_available, 1);
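The xchg-based lock can be written portably with C11’s atomic_exchange, which compilers typically implement with xchg on x86. This is a minimal sketch under that assumption. It keeps the slide’s lock_available variable, and the function names try_acquire, acquire, and release are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int lock_available = 1;   /* 1 = free, 0 = held */

/* One attempt: swap in 0 and report whether the old value was 1 (free). */
bool try_acquire(void) {
    return atomic_exchange(&lock_available, 0) == 1;
}

/* Spin until an exchange observes the lock as free. */
void acquire(void) {
    while (!try_acquire())
        continue;
}

/* Hand the lock back; the previous value must have been 0 (held). */
void release(void) {
    int previous = atomic_exchange(&lock_available, 1);
    assert(previous == 0);
    (void)previous;                     /* silence unused warning under NDEBUG */
}
```

atomic_exchange defaults to sequentially consistent ordering, so accesses inside the critical section cannot move outside the acquire/release pair, and no separate fence is needed.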

Summary: Relaxed Consistency

•  Motivation:
   –  obtain higher performance by allowing reordering of memory operations
      •  (reordering is not allowed by sequential consistency)

•  One cost is software complexity:
   –  the programmer or compiler must insert synchronization
      •  to ensure certain specific orderings when needed

•  In practice:
   –  complexities often encapsulated in libraries that provide intuitive primitives
      •  e.g., lock/unlock, barriers (or lower-level primitives like fence)

•  Relaxed models differ in which memory ordering constraints they ignore