Abstractions for Relaxed Memory Models

Andrei Dan, Yuri Meshman, Martin Vechev, Eran Yahav

2

Verification under a relaxed memory model

P ⊨_M S ?

3

Really about

P ⊭_M S  →  change the program  →  P’ ⊨_M S

…

Logozzo, Ball. Modular and Verified Automatic Program Repair, OOPSLA'12

Automatic Inference of Memory Fences, FMCAD’10

Abstraction-Guided Synthesis of Synchronization, POPL’10

4

Example: Correct and Efficient Synchronization with Barriers

• Boids simulation
• Craig Reynolds, "Flocks, herds and schools: A distributed behavioral model," SIGGRAPH '87
• Sequential implementation by Paul Richmond (U. of Sheffield)

…

Global state contains shared array of Boid locations

N Boids

http://www.paulrichmond.staff.shef.ac.uk/boids.php

5

Boid Task

…Read locations of other boids

…Update my location

Specification: guarantee conflict-freedom

Conflict: two threads are enabled to access the same memory location and (at least) one of these accesses is a write
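The conflict definition above can be written as a small check. This is a minimal sketch: the `Access` record and its field names are illustrative assumptions, not something taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    thread: int     # id of the thread that is enabled to perform the access
    location: int   # index into the shared array of boid locations
    is_write: bool  # True for a write, False for a read


def conflicts(a: Access, b: Access) -> bool:
    """Two enabled accesses conflict if they come from different threads,
    touch the same memory location, and at least one of them is a write."""
    return (a.thread != b.thread
            and a.location == b.location
            and (a.is_write or b.is_write))
```

For example, a write by the boid task with pid=2 to slot 2 conflicts with a read of slot 2 by the task with pid=3, while two reads of the same slot never conflict.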

Boid task pid=2

pid=2

…Read locations of other boids

…Update my location

Boid task pid=3

6

Boids Simulation

…

Main task:
  Generate initial locations
  Spawn N Boid Tasks
  While (true) {
    Read locations of all boids
    Render boids
  }

Boid task:
  While (true) {
    Read locations of other boids
    Compute new location
    Update my location
  }

Shared Memory (Global State)

Locations of N Boids

Where should I put synchronization barriers?

7

Textbook Example: Dekker’s Algorithm

Thread 0:
  flag0 := true
  while flag1 = true {
    if turn ≠ 0 {
      flag0 := false
      while turn ≠ 0 { }
      flag0 := true
    }
  }
  // critical section
  turn := 1
  flag0 := false

Thread 1:
  flag1 := true
  while flag0 = true {
    if turn ≠ 1 {
      flag1 := false
      while turn ≠ 1 { }
      flag1 := true
    }
  }
  // critical section
  turn := 0
  flag1 := false

initial: flag0 = false, flag1 = false, turn = 0

spec: mutual exclusion over critical section

sequential consistency: Yes

relaxed model x86 TSO: No

8

Concrete PSO semantics using store buffers

[Figure: processes P0, … each with per-variable store buffers (flag0, flag1, turn) between the process and main memory; arrows labeled store, flush, load, and fence.]
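The store-buffer semantics pictured above can be sketched as a toy machine. This is a simplified illustration under stated assumptions: the class name `PSOMachine`, its method names, and the choice to let a fence drain all buffers of a thread are assumptions made for the example, not definitions from the slides.

```python
from collections import defaultdict, deque


class PSOMachine:
    """Toy PSO model: one FIFO store buffer per (thread, variable) pair."""

    def __init__(self, initial):
        self.memory = dict(initial)
        self.buffers = defaultdict(deque)  # (thread, var) -> pending values

    def store(self, thread, var, value):
        # A store enters the thread's buffer for that variable, not memory.
        self.buffers[(thread, var)].append(value)

    def load(self, thread, var):
        # A thread sees its own most recent buffered store, else main memory.
        buf = self.buffers[(thread, var)]
        return buf[-1] if buf else self.memory[var]

    def flush(self, thread, var):
        # Move the oldest buffered store to main memory (the real model
        # performs this step nondeterministically).
        buf = self.buffers[(thread, var)]
        if buf:
            self.memory[var] = buf.popleft()

    def fence(self, thread):
        # Assumption: a fence drains every buffer that belongs to the thread.
        for (t, var), buf in list(self.buffers.items()):
            if t == thread:
                while buf:
                    self.memory[var] = buf.popleft()


m = PSOMachine({"flag0": 0, "flag1": 0})
m.store(0, "flag0", 1)
m.store(1, "flag1", 1)
r0 = m.load(0, "flag1")  # thread 0 does not see thread 1's buffered store
r1 = m.load(1, "flag0")  # thread 1 does not see thread 0's buffered store
```

With both stores still buffered, each thread reads the other's flag as 0, which is the behavior that lets both threads of Dekker's algorithm enter the critical section.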

9

Where should I put fences?

On the one hand, memory barriers are expensive (100s of cycles, maybe more), and should be used only when necessary.

On the other, synchronization bugs can be very difficult to track down, so memory barriers should be used liberally, rather than relying on complex platform-specific guarantees about limits to memory instruction reordering.

– Herlihy and Shavit

10

May seem easy…

Thread 0:
  flag0 := true
  fence
  while flag1 = true {
    if turn ≠ 0 {
      flag0 := false
      while turn ≠ 0 { }
      flag0 := true
      fence
    }
  }
  // critical section
  turn := 1
  flag0 := false

Thread 1:
  flag1 := true
  fence
  while flag0 = true {
    if turn ≠ 1 {
      flag1 := false
      while turn ≠ 1 { }
      flag1 := true
      fence
    }
  }
  // critical section
  turn := 0
  flag1 := false

relaxed model x86 TSO: Yes

spec: mutual exclusion over critical section

11

int take() {
  long b = bottom - 1;
  item_t * q = wsq;
  bottom = b;
  fence
  long t = top;
  if (b < t) {
    bottom = t;
    return EMPTY;
  }
  task = q->ap[b % q->size];
  if (b > t)
    return task;
  if (!CAS(&top, t, t+1))
    return EMPTY;
  bottom = t + 1;
  return task;
}

void push(int task) {
  long b = bottom;
  long t = top;
  item_t * q = wsq;
  if (b - t >= q->size - 1) {
    wsq = expand();
    q = wsq;
  }
  q->ap[b % q->size] = task;
  fence
  bottom = b + 1;
}

int steal() {
  long t = top;
  fence
  long b = bottom;
  fence
  item_t * q = wsq;
  if (t >= b)
    return EMPTY;
  task = q->ap[t % q->size];
  fence
  if (!CAS(&top, t, t+1))
    return ABORT;
  return task;
}

Chase-Lev Work-Stealing Queue

12

Goal

• Help the programmer place fences
  – Find optimal fence placement
• Principle
  – Restrict non-determinism s.t. program stays within set of safe executions

13

Our Approach: Overview

• P’ satisfies the specification S under M

[Figure: FENDER takes a program P, a specification S, and a memory model M, and produces a program P’ with fences.]

14

1. Compute reachable states for the program

2. Compute weakest constraints on execution that guarantee all “bad states” are avoided

3. Implement the constraints with fences

Our Approach: Recipe

15

• Compute reachable states of the program

• Compute constraints on execution that guarantee that all “bad states” are avoided

• Implement the constraints with fences

Bad News [Atig et al. POPL’10]

Even for programs that are finite-state under SC

Reachability undecidable for RMO

Non-primitive recursive complexity for TSO/PSO

Our Approach: Recipe

16

Challenges

• Automatic verification and fence inference that works for realistic programs

• Handle two sources of unboundedness
  – Unbounded SC state
  – Unbounded store buffers

17

Bound state and buffer (under-approx) [Automatic Inference of Memory Fences, FMCAD’10]

18

Demonic scheduling for dynamic exploration of executions [Dynamic Synthesis for Relaxed Memory Models, PLDI’12]

Can infer fences for large tricky programs such as a lock-free memory allocator

DFENCE: support for concurrency and RMM

[Figure: DFENCE architecture. Components: LLVM-GCC, LLVM interpreter, threading, demonic scheduler, memory model specification, trace analysis, SAT solver, fence enforcement. Data: concurrent C/C++ code and a client as input; .bc and modified .bc bytecode; trace; order formula; satisfying assignment. Output: fixed bytecode and fence location report. The legend distinguishes "our extension" from "existing work".]

Open-source, available at: http://practicalsynthesis.org/fender/

20

Idea 1: Abstraction-Guided Synthesis

• Synthesis of synchronization via abstract interpretation
  – Compute over-approximation of all possible program executions
  – Add minimal synchronization to avoid (over-approximation of) bad schedules
• Interplay between abstraction and synchronization
  – Finer abstraction may enable finer synchronization
  – Coarse synchronization may allow coarser abstraction

21

Change the abstraction to match the program

A Standard Approach: Abstraction Refinement

[Figure: abstraction-refinement loop. Boxes: Program P, Specification S, Abstraction, Verify, Abstraction Refinement. Verify outputs either Valid or an abstract counterexample.]

22

Abstraction-Guided Synthesis [VYY-POPL’10]

Change the program to match the abstraction

[Figure: abstraction-guided synthesis loop. Boxes: Program P, Specification S, Abstraction, Verify, Abstraction Refinement, Program Restriction, Implement P’.]

23

1. Compute over-approximation of reachable states for the program using sound abstractions

2. Compute weakest constraints on abstract execution that guarantee all “bad abstract states” are avoided

3. Implement the constraints with fences

Our Approach Revisited: Recipe


24

Conservative Abstractions

25

Different Approaches

Infinite-State

Unbounded-buffer

Automatic Inference of Memory Fences [FMCAD’10]

Dynamic Synthesis for Relaxed Memory Models [PLDI’12]

Partial-Coherence Abstractions for Relaxed Memory Models [PLDI’11]

Predicate Abstraction for Relaxed Memory Models [SAS’13]

Synthesis of Memory Fences via Refinement Propagation [SAS’14]

Effective Program Transformation for Verification under Relaxed Models [in progress]

26

Partial Coherence Abstractions [PLDI’11]

[Figure: concrete store buffers of P0 and P1 (per-variable buffers for flag0, flag1, and turn in front of main memory), shown next to their abstraction.]

Recent value

Bounded length k

Unordered elements

flag1

Allows precise fence semantics

Allows precise loads from buffer

Keeps the analysis precise for “well behaved” programs

Record what values appeared (without order or number)

Sound abstractions of store buffers

27

Abstract Memory Models - Requirements

• Intra-process coherence
  – A process should see the most recent value it wrote
• Preserve fence semantics
  – The value written to main memory when flushed by a fence is the most recent value stored before the fence
• Preserve buffer emptiness
  – Values do not appear out of nowhere
• Partial inter-process coherence
  – Preserve as much order information as feasible (bounded)
• Simple construction!

28

State Abstraction Techniques

• Predicate abstraction
  – Simple
  – Requires initial set of predicates
• Numerical domains
  – Octagon and Polyhedra abstractions
  – Automatically handle programs with (linear) numerical invariants

29

Predicate Abstraction

• Successful for sequential program analysis
  – Graf and Saidi (CAV '97)
  – Microsoft's SLAM (PLDI’01)
  – …
• Some work for SC concurrent programs
  – Symmetry-Aware Predicate Abstraction for Shared-Variable Concurrent Programs. Kroening et al. (CAV '11)
  – Threader: A constraint-based verifier for multi-threaded programs. Gupta et al. (CAV '11)
  – …

How can we apply standard predicate abstraction to verification under relaxed memory models?

30

Classical predicate abstraction

Program P:

Thread 0:
  X = Y+1
  fence(X)

Thread 1:
  Y = X+1
  fence(Y)

initial: X=Y=0

assert(X ≠ Y)

Predicates V:

B0: X=Y, B1: X=1, B2: Y=1, B3: X=0, B4: Y=0

Boolean program BP(P,V):

/* Statement X = 0 */
21: store B1 = false; /* update predicate - B1: (X = 1) */
22: store B3 = true; /* update predicate - B3: (X = 0) */
23: store B0 = false
…
/* Statement Y = X + 1 */
54: store B0 = false; /* update predicate - B0: (X = Y) */
55: store B2 = choose(t3(t0t4), t1t3(t0t2)(t0 t4)(t0t4)); /* B2: (Y = 1) */
56: store B4 = choose(false, (t1)(t3)(t0t2)(t0t4)); /* B4: (Y = 0) */
…

BP(P,V) ⊨_SC S entails P ⊨_SC S

31

Direct application is not sound

Thread 0:
  X = Y+1
  fence(X)

Thread 1:
  Y = X+1
  fence(Y)

initial: X=Y=0

assert(X ≠ Y)

B0: X=Y, B1: X=1, B2: Y=1, B3: X=0, B4: Y=0

P ⊭_PSO S but BP(P,V) ⊨_PSO S

Concrete execution, each view given as (X,Y):

Step | T0 | T1 | Global
initial | (0,0) | (0,0) | (0,0)
T0: X = Y+1 | (1,0) | (0,0) | (0,0)
T1: Y = X+1 | (1,0) | (0,1) | (0,0)
T0: flush(X) | (1,0) | (1,1) | (1,0)
T1: flush(Y) | (1,1) | (1,1) | (1,1)

Predicate abstraction of the same execution (predicates with false value other than (X=Y) have been omitted):

Step | T0 | T1 | Global
initial | X=Y, X=0, Y=0 | X=Y, X=0, Y=0 | X=Y, X=0, Y=0
T0: X = Y+1 | ¬(X=Y), X=1, Y=0 | X=Y, X=0, Y=0 | X=Y, X=0, Y=0
T1: Y = X+1 | ¬(X=Y), X=1, Y=0 | ¬(X=Y), Y=1, X=0 | X=Y, X=0, Y=0
T0: flush(X) | ¬(X=Y), X=1, Y=0 | ¬(X=Y), Y=1, X=0 | ¬(X=Y), X=1, Y=0
T1: flush(Y) | ¬(X=Y), X=1, Y=0 | ¬(X=Y), Y=1, X=0 | ¬(X=Y), X=1, Y=1

32

How do we restore soundness?

• Option 0: restrict programs/properties
• Option 1: BP(P,V) specialized
  – Capture dependencies between updates
  – Invalidation of predicates
  – Synchronized updates of multiple predicates
• Option 2: BP(P_M,V) under SC
  – Capture all relaxed memory model effects in the program itself
  – Boolean program construction as usual
  – Verification as usual (using SC tools)

33

Encode memory model effects in the Program

P ⊨_M S ?

P_M ⊨_SC S ?

The behavior of P_M under sequential consistency is an over-approximation of the behavior of P running under model M

34

Encode RMM effects into the program

• Pick a bound k for store buffers (sound)
• Encode store buffers as program variables
• Shared variable X encoded as
  – Xcnt: a counter for the buffer position
  – X1, …, Xk: buffer contents

[Figure: buffer slots X1, X2, …, Xk in front of the shared variable X (PSO).]

35

Encode Program: Example for k=1

load t = X:
  if (Xcnt == 0) t = X
  if (Xcnt == 1) t = X1

store X = t:
  if (Xcnt == k) “overflow”
  Xcnt++
  if (Xcnt == 1) X1 = t
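The load and store encodings above generalize to any bound k. The following is a minimal sketch from one thread's point of view; the class name `EncodedVar`, its fields, and the choice to record overflow in a boolean flag are assumptions made for illustration.

```python
class EncodedVar:
    """Shared variable X encoded as X, Xcnt, and buffer slots X1..Xk."""

    def __init__(self, value, k):
        self.X = value           # value in main memory
        self.k = k               # bound on the store buffer
        self.cnt = 0             # Xcnt: number of buffered stores
        self.slots = [None] * k  # slots[i - 1] plays the role of Xi
        self.overflow = False    # assumption: overflow is recorded as a flag

    def store(self, t):
        if self.cnt == self.k:
            self.overflow = True  # the bound k was exceeded
            return
        self.cnt += 1
        self.slots[self.cnt - 1] = t  # write slot X_cnt

    def load(self):
        # Read main memory if the buffer is empty, else the newest slot.
        return self.X if self.cnt == 0 else self.slots[self.cnt - 1]


x = EncodedVar(0, k=1)
before = x.load()  # buffer empty: reads X from main memory
x.store(5)
after = x.load()   # reads the buffered value X1
```

Main memory still holds the old value after the store, so another thread reading X directly would not yet observe the update.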

36

Where do predicates come from?


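[Figure: verification pipeline. Program P and memory model M go through Reduction to give program P_M. Predicates V and a program give Boolean program B, which the model checker either reports as Verified or answers with a Counterexample. The predicates for P_M are marked "?".]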

37

Idea 2: Proof Extrapolation

• Leverage the similarity between behaviors in P_M and those in P_SC

• Verify program under SC using a given vocabulary V

• Extrapolate predicates V_M for P_M from the SC proof

38

Step 1: Verify program under SC

• Find a set of predicates V
• Construct the Boolean program B(P,V)
• Verify B(P,V) ⊨_SC S

39

Step 2: Predicate Extrapolation

• Discover new predicates for RMM based on the predicates used in the SC proof

• Generic predicates
  – Buffer size, overflow
• SC-based extrapolated predicates
  – from SC relationships as captured in V

40

Predicate Extrapolation Example

• ∀ X ∈ shared variables, 0 ≤ i ≤ k
  – (Xcnt == i) tracks buffer size
  – (Xi == Xi-1), i > 0, for flush actions
• ∀ p ∈ V where p is of the form “(X<Y)”, 0 ≤ i ≤ k
  – (Xi < Y)
  – (X < Yi)

Dekker with extrapolated predicates

Thread 0:
  flag0 := true
  while flag1 = true {
    if turn ≠ 0 {
      flag0 := false
      while turn ≠ 0 { }
      flag0 := true
    }
  }
  // critical section
  turn := 1
  flag0 := false

Thread 1:
  flag1 := true
  while flag0 = true {
    if turn ≠ 1 {
      flag1 := false
      while turn ≠ 1 { }
      flag1 := true
    }
  }
  // critical section
  turn := 0
  flag1 := false


SC:
(t2 = 0), (t1 = 0), (f1 = 0), (f2 = 0), (flag0 = 0), (flag1 = 0), (turn = 0)

PSO:
(overflow = 0), (t2 = 0), (t1 = 0), (f1 = 0), (f2 = 0), (flag0 = 0), (flag1 = 0), (turn = 0),
(turn_cnt_T0 = 0), (turn_cnt_T0 = 1), (turn_cnt_T1 = 0), (turn_cnt_T1 = 1),
(turn_1_T0 = 0), (turn_1_T1 = 0),
(flag0_cnt_T0 = 0), (flag0_cnt_T0 = 1), (flag0_1_T0 = 0),
(flag1_cnt_T1 = 0), (flag1_cnt_T1 = 1), (flag1_1_T1 = 0)

TSO:
(overflow = 0), (t2 = 0), (t1 = 0), (f1 = 0), (f2 = 0), (flag0 = 0), (flag1 = 0), (turn = 0),
(T0_cnt = 0), (T0_cnt = 1), (lhs_1_T0 = 0), (lhs_1_T0 = 1),
(T1_cnt = 0), (T1_cnt = 1), (lhs_1_T1 = 0), (lhs_1_T1 = 2), (rhs_1_T0 = 0), (rhs_1_T1 = 0)

42

Our approach so far


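[Figure: verification pipeline with extrapolation. Program P and memory model M go through Reduction to give program P_M; predicates V go through Extrapolation to give predicates V_M. The resulting Boolean program B is passed to the model checker, which reports Verified or a Counterexample.]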

43

Unfortunately…

• Building the Boolean program is exponential in the number of predicates

• Infeasible for some benchmarks
  – For example: Bakery runs for more than 10 hours

Number of predicates (for k = 2):

Benchmark    |V_SC|   |V_PSO|   |V_TSO|
Dekker       7        28        26
Szymanski    20       47        51
Bakery       15       38        36
Ticket       11       56        48

44

Core problem: abstract transformers

Literals qi = pi or qi = ¬pi, pi ∊ V_M

Cubes(V_M) = {qi1 ∧ … ∧ qij, j ≤ |V_M|}

|Cubes(V_M)| = 3^|V_M|

For st ∈ Statements
  for pi ∈ V
    f = wp(pi, st)
    for c ∈ Cubes(V_M)
      if c ⇒ f   // SMT call
        add c to the transformer
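The cube count above follows because every predicate either appears positively, appears negated, or is left out of a cube. The enumeration below is a small sketch of that counting argument; the function name `cubes` and the representation of a cube as a tuple of (predicate, sign) pairs are assumptions made for the example.

```python
from itertools import product


def cubes(predicates):
    """Enumerate all cubes over the given predicates. In each cube a
    predicate is positive (True), negated (False), or absent (None)."""
    result = []
    for choice in product((None, True, False), repeat=len(predicates)):
        cube = tuple((p, sign) for p, sign in zip(predicates, choice)
                     if sign is not None)
        result.append(cube)
    return result


all_cubes = cubes(["p1", "p2", "p3"])
count_for_28_predicates = 3 ** 28  # search space for a 28-predicate vocabulary
```

Three predicates already give 27 cubes (including the empty one). With the 28 predicates reported for Dekker under PSO at k = 2, the unrestricted space has more than 2 × 10^13 cubes, each a candidate for an SMT call per statement and predicate.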

45

Cube Extrapolation

• Reuse more information from the SC proof

• In addition to input predicates, extrapolate from the cubes used in the Boolean program

• Cube search space restricted only to extrapolated cubes

46

Cube Extrapolation Example

Cube in the SC Boolean Program B:
(X 0 X < Y)

Potential Cubes for the RMM Boolean Program:
(X1 0 X1 < Y)
…
(Xk 0 Xk < Y)
(X 0 X < Y1)
…
(X 0 X < Yk)

47

Abstract transformers with extrapolated Cubes

Literals qi = pi or qi = ¬pi, pi ∊ V_M

Cubes(V_M) = {qi1 ∧ … ∧ qij, j ≤ |V_M|}

ExtCubes(B,V_M) = CubeExtrapolation(B)

|ExtCubes(B,V_M)| << |Cubes(V_M)|

For st ∈ Statements
  for pi ∈ V
    f = wp(pi, st)
    for c ∈ ExtCubes(B,V_M)
      if c ⇒ f   // SMT call
        add c to the transformer

48

Complete Approach


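[Figure: complete pipeline. Program P and memory model M go through Reduction to give program P_M; predicates V go through Extrapolation to give predicates V_M. The SC Boolean program B_SC goes through Cube Extraction, and the cubes from B_SC feed the construction of Boolean program B, which the model checker reports as Verified or answers with a Counterexample.]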

49

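Results: Predicate Extrapolation

Columns 3 to 7 describe building the Boolean program; columns 8 to 10 describe model checking.

algorithm | memory model | # input preds | # SMT calls (K) | time (sec) | # cubes used | cube size | # states (K) | memory (MB) | time (sec)
Dekker | SC | 7 | 0.7 | 0.1 | 0 | 1 | 14 | 6 | 1
Dekker | PSO | 20 | 26 | 6 | 0 | 1 | 80 | 31 | 5
Dekker | TSO | 18 | 22 | 5 | 0 | 1 | 45 | 20 | 3
Peterson | SC | 7 | 0.6 | 0.1 | 2 | 2 | 7 | 3 | 1
Peterson | PSO | 20 | 15 | 3 | 2 | 2 | 31 | 13 | 3
Peterson | TSO | 18 | 13 | 3 | 2 | 2 | 25 | 11 | 2
ABP | SC | 8 | 2 | 0.5 | 5 | 2 | 0.6 | 1 | 0.6
ABP | PSO | 15 | 20 | 4 | 5 | 2 | 2 | 3 | 1
ABP | TSO | 17 | 23 | 5 | 5 | 2 | 2 | 3 | 1
Szymanski | SC | 20 | 16 | 3.3 | 1 | 2 | 12 | 6 | 2
Szymanski | PSO | 35 | 152 | 33 | 1 | 2 | 61 | 30 | 4
Szymanski | TSO | 37 | 165 | 35 | 1 | 2 | 61 | 31 | 5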

50

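Results: Cube Extrapolation

Columns 4 to 9 describe building the Boolean program; columns 10 to 12 describe model checking. Methods: Trad (traditional), PE (predicate extrapolation), CE (cube extrapolation). T/O marks a timeout.

algorithm | memory model | method | # input preds | # input cubes | # SMT calls (K) | time (sec) | # cubes used | cube size | # states (K) | memory (MB) | time (sec)
Queue | SC | Trad | 7 | - | 20 | 5 | 50 | 4 | 1 | 2 | 1
Queue | PSO | PE | 15 | - | 5,747 | 1,475 | 412 | 4 | 1 | 4 | 1
Queue | PSO | CE | 15 | 99 | 98 | 17 | 99 | 4 | 11 | 6 | 2
Queue | TSO | PE | 16 | - | 11,133 | 2,778 | 412 | 4 | 12 | 4 | 1
Queue | TSO | CE | 16 | 99 | 163 | 31 | 99 | 4 | 12 | 7 | 2
Bakery | SC | Trad | 15 | - | 1,552 | 355 | 161 | 4 | 20 | 8 | 2
Bakery | PSO | PE | 38 | - | - | T/O | - | 4 | - | - | -
Bakery | PSO | CE | 38 | 422 | 9,018 | 1,773 | 381 | 4 | 979 | 375 | 104
Bakery | TSO | PE | 36 | - | - | T/O | - | 4 | - | - | -
Bakery | TSO | CE | 36 | 422 | 7,048 | 1,386 | 383 | 4 | 730 | 285 | 121
Ticket | SC | Trad | 11 | - | 218 | 51 | 134 | 4 | 2 | 2 | 1
Ticket | PSO | PE | 56 | - | - | T/O | - | 4 | - | - | -
Ticket | PSO | CE | 56 | 622 | 15,644 | 2,163 | 380 | 4 | 193 | 123 | 40
Ticket | TSO | PE | 48 | - | - | T/O | - | 4 | - | - | -
Ticket | TSO | CE | 48 | 622 | 6,941 | 1,518 | 582 | 4 | 71 | 67 | 545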

51

Numerical analysis under SC

Thread 0:
0:
1: flag0 := true
2: while flag1 = true {
3:   if turn ≠ 0 {
4:     flag0 := false
5:     while turn ≠ 0 { }
6:     flag0 := true
7:   }
8: }
9: // critical section
A: turn := 1
B: flag0 := false

Thread 1:
0:
1: flag1 := true
2: while flag0 = true {
3:   if turn ≠ 1 {
4:     flag1 := false
5:     while turn ≠ 1 { }
6:     flag1 := true
7:   }
8: }
9: // critical section
A: turn := 0
B: flag1 := false

(0,0) {turn=0; flag1=0; flag0=0}
(9,9) { }
(2,2) {flag1-1=0; flag0-1=0; -turn+1>=0; turn>=0}
(2,9) {flag1-1=0; flag0-1=0;}

// line numbers indicate the state at the end of the line (i.e., after executing it)

52

Use same encoding for PM

load t = X:
  if (Xcnt == 0) t = X
  if (Xcnt == 1) t = X1
  if (Xcnt == 2) t = X2

store X = t:
  if (Xcnt == 2) “overflow”
  Xcnt++
  if (Xcnt == 1) X1 = t
  if (Xcnt == 2) X2 = t

(shown for k=2)

while random do
  if flag0_cnt_0 > 0 then
    flag0 = flag0_1_0;
    if flag0_cnt_0 > 1 then
      flag0_1_0 = flag0_2_0;
    flag0_cnt_0 = flag0_cnt_0 - 1;
  yield;

Flush operation

// At this point we cannot know whether a flush happened, because the loop is non-deterministic.
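The flush operation for k = 2 can be mirrored step by step in executable form. This is a sketch under assumptions: the program state is modeled as a dictionary keyed by the encoded variable names from the slide, the `yield` scheduling point is omitted, and the nondeterministic `while random` loop is approximated with a coin flip.

```python
import random


def flush_step(state):
    """One flush of thread 0's buffer for flag0 (buffer bound k = 2)."""
    if state["flag0_cnt_0"] > 0:
        state["flag0"] = state["flag0_1_0"]          # oldest entry reaches memory
        if state["flag0_cnt_0"] > 1:
            state["flag0_1_0"] = state["flag0_2_0"]  # shift the remaining entry
        state["flag0_cnt_0"] -= 1


def nondeterministic_flush(state, rng=random):
    """Flush zero or more entries; a coin flip stands in for 'while random'."""
    while rng.random() < 0.5:
        flush_step(state)


state = {"flag0": 0, "flag0_cnt_0": 2, "flag0_1_0": 1, "flag0_2_0": 0}
flush_step(state)  # first buffered store (value 1) becomes visible in memory
```

After the call, any number of entries between zero and the buffer size may have been flushed, which is exactly the uncertainty the analysis has to represent.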

flush is a problem for convex domains

● The non-deterministic flush captures two possible buffer states
  1. value is flushed: buffer content shifted one slot
  2. value is not flushed: buffer does not change

● To avoid losing precision, we have to track disjunctions in a convex numerical domain

[Figure: the flushed state (cnt_t1=1) and the non-flushed state (cnt_t1=2) are joined into one abstract state with cnt_t1=[1,2] and a buffer slot in [1,3].]
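The precision loss at the join can be reproduced with a tiny interval domain, used here as the simplest convex domain. This is a sketch with hypothetical values: the variable names `cnt` and `slot1` and the concrete numbers are chosen for illustration and are not read off the slide.

```python
def join(a, b):
    """Join two interval environments: per-variable convex hull."""
    return {v: (min(a[v][0], b[v][0]), max(a[v][1], b[v][1])) for v in a}


def admits(env, point):
    """True if the concrete point lies inside every interval of env."""
    return all(lo <= point[v] <= hi for v, (lo, hi) in env.items())


# Hypothetical abstract states after the nondeterministic flush.
flushed = {"cnt": (1, 1), "slot1": (3, 3)}
not_flushed = {"cnt": (2, 2), "slot1": (1, 1)}

joined = join(flushed, not_flushed)

# A combination that neither input state allows.
spurious = {"cnt": 1, "slot1": 1}
```

The joined state admits `spurious` even though neither branch does: the correlation between the counter and the slot content is lost. Relational domains lose less, but a convex domain still cannot represent the disjunction of the two branches exactly in general.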

55

Refine the abstraction

• Leverage boolean-numerical domains
  – Add boolean flags
  – Similar to trace partitioning domain
  – Supported by our SC verifier (ConcurInterproc)

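[Figure: with a boolean flag b, the flushed state (cnt_t1=1) and the non-flushed state (cnt_t1=2) are kept as two separate partitions, one under b and one under ¬b, instead of being joined.]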

Refined flush operation

b_f1_flag0_0_t0 = false;
b_f1_flag0_1_t0 = false;
yield;
while random do
  if flag0_cnt_0 > 0 then
    flag0 = flag0_1_0;
    if flag0_cnt_0 > 1 then
      b_f1_flag0_1_t0 = true;
      flag0_1_0 = flag0_2_0;
    else
      b_f1_flag0_0_t0 = true;
    flag0_cnt_0 = flag0_cnt_0 - 1;
  yield;

57

Challenge: state explosion

• Using refined flush operations everywhere is not feasible – state explosion

• We would like to find a minimal refinement that enables verification with a minimal fence placement

• Search space exponential in number of fence placements and in number of refinement placements

58

Idea 3: Refinement propagation

propagation of: program correctness + abstraction refinements

[Figure: lattice of programs labeled with pairs (f1,r1), (f2,r2), (f3,r3). Legend: program has been explored; program to be explored; program need not be explored; a successful abstraction refinement used to verify a program; an attempt to verify with a combined abstraction refinement.]

59

Two dimensional search

• Start from full fence placement
• A verification attempt produces new options to explore
  – If verified: smaller placements should be explored
    • Either fewer fences, or coarser abstraction
  – If failed: larger placements should be explored
    • Either additional fences, or finer abstraction
  – If “unknown”
    • Try both directions
• We keep a worklist from which we choose the next placement to explore
  – Do not try subset of failed or superset of verified
  – A small verified placement or a large failed placement reduces the search space substantially, so we guide the search
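The worklist search can be sketched for the fence dimension alone. This is a simplified illustration under assumptions: `verify` is a hypothetical oracle that stands in for the abstract-interpretation-based verifier, it never answers "unknown", abstraction refinements are left out, and the fence names are made up.

```python
from collections import deque


def search(all_fences, verify):
    """Explore fence placements top-down from the full placement and
    return the minimal verified ones. Placements that are subsets of a
    failed placement are pruned without calling the verifier."""
    full = frozenset(all_fences)
    worklist = deque([full])
    seen = {full}
    verified, failed = [], []
    while worklist:
        p = worklist.popleft()
        if any(p <= f for f in failed):
            continue  # fewer fences than a failed placement: must fail too
        if verify(p):
            verified.append(p)
            for fence in p:  # try dropping one fence at a time
                q = p - {fence}
                if q not in seen:
                    seen.add(q)
                    worklist.append(q)
        else:
            failed.append(p)
    # Keep only verified placements with no verified proper subset.
    return [p for p in verified if not any(q < p for q in verified)]


def toy_verify(placement):
    # Hypothetical program: safe with f1 alone, or with f2 and f3 together.
    return {"f1"} <= placement or {"f2", "f3"} <= placement


minimal = search({"f1", "f2", "f3"}, toy_verify)
```

Because the sketch only moves from larger to smaller placements, the "superset of verified" rule never triggers here; it matters once failed attempts also push the search toward additional fences or finer abstractions.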

60

Benchmark

● 15 concurrent algorithms
● 8 infinite state
● Safety specifications: either mutual exclusion or reachability invariants involving labels of different threads

61

Example: PC1

• 9 possible fences
• 27 possible locations for flush refinements
• BFS
  – explores various boolean placements for the fully fenced placement for 3:30 hours
• DFS
  – Verifies 5 fence placements in under 5 mins
  – State explosion leads to exploring failing placements for the rest of the time
• Propagation
  – finds that a single fence is needed

Results

[Figures: four plots comparing prop, bfs, and dfs over time. Panels: ABP TSO (y-axis: num fences), Loop2_TLM TSO (y-axis: num fences), WSQ-Chase TSO (y-axis: #r locations), Queue TSO (y-axis: #r locations).]

63

Summary

• Abstraction-guided synthesis
  – Compute over-approximation of all possible program executions
  – Add minimal synchronization to avoid (over-approximation of) bad schedules
• Proof extrapolation
  – Use information from the SC proof to help the proof under RMM
  – Extrapolate predicates and cubes
• Refinement propagation
  – Implied correctness/incorrectness in the space of fence/refinement placements
  – Combining information from different fence/refinement placements

64

Back to Boids

• With synchronization barriers
• Numerical abstractions for tracking array indices
• Establishing conflict-freedom of array accesses that may happen in parallel

• Computing forces can be done in parallel
• Different Boids write to disjoint parts of the array

65

Boids Simulation

…

Main task:
  Generate initial locations
  Spawn N Boid Tasks
  While (true) {
    Wait on display-barrier
    Read locations of all boids
    Render boids
  }

Boid task:
  While (true) {
    Read locations of other boids
    Wait on message-barrier
    Compute new location
    Update my location
    Wait on display-barrier
  }

Shared Memory (Global State)

Locations of N Boids

66

http://practicalsynthesis.org/fender/
