Abstractions for Relaxed Memory Models
1
Abstractions for Relaxed Memory Models
Andrei Dan, Yuri Meshman, Martin Vechev, Eran Yahav
2
Verification under a relaxed memory model
P ⊨_M S ?
3
Really about
P ⊨_M S  ⟶  P' ⊨_M S   (change the program)
P ⊨_M S
…
Logozzo, Ball. Modular and Verified Automatic Program Repair, OOPSLA'12
Automatic Inference of Memory Fences, FMCAD’10
Abstraction-Guided Synthesis of Synchronization, POPL’10
4
Example: Correct and Efficient Synchronization with Barriers
• Boids simulation
• Craig Reynolds, "Flocks, herds and schools: A distributed behavioral model", SIGGRAPH '87
• Sequential implementation by Paul Richmond (U. of Sheffield)
…
Global state contains shared array of Boid locations
N Boids
5
Boid Task
…Read locations of other boids
…Update my location
Specification: guarantee conflict-freedom
Conflict: two threads are enabled to access the same memory location and (at least) one of these accesses is a write
Boid task pid=2
pid=2
…Read locations of other boids
…Update my location
Boid task pid=3
6
Boids Simulation
…
Main task:
  Generate initial locations
  Spawn N Boid Tasks
  While (true) {
    Read locations of all boids
    Render boids
  }

Boid task:
  While (true) {
    Read locations of other boids
    Compute new location
    Update my location
  }
Shared Memory (Global State)
Locations of N Boids
Where should I put synchronization barriers?
7
Textbook Example: Dekker’s Algorithm
Thread 0:
  flag0 := true
  while flag1 = true {
    if turn ≠ 0 {
      flag0 := false
      while turn ≠ 0 { }
      flag0 := true
    }
  }
  // critical section
  turn := 1
  flag0 := false

Thread 1:
  flag1 := true
  while flag0 = true {
    if turn ≠ 1 {
      flag1 := false
      while turn ≠ 1 { }
      flag1 := true
    }
  }
  // critical section
  turn := 0
  flag1 := false

initial: flag0 = false, flag1 = false, turn = 0
spec: mutual exclusion over critical section
sequential consistency: Yes
relaxed model x86 TSO: No
8
Concrete PSO semantics using store buffers
[Figure: each thread (P0, …) has a separate store buffer for each shared variable (flag0, flag1, turn) in front of main memory. Operations shown: store, flush, load, fence.]
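The store-buffer semantics can be sketched in a few lines of Python. This is a minimal illustration, not the formal semantics from the talk: the class name `PSOThread` and its methods are hypothetical, and it assumes one FIFO buffer per thread and per shared variable, loads that prefer the thread's own newest buffered value, and a fence that drains all of the thread's buffers.

```python
from collections import deque


class PSOThread:
    """One thread's view under PSO: a FIFO store buffer per shared variable."""

    def __init__(self, memory):
        self.memory = memory      # shared dict: variable -> value in main memory
        self.buffers = {}         # variable -> deque of pending (unflushed) values

    def store(self, var, value):
        # A store only enters the local buffer; other threads cannot see it yet.
        self.buffers.setdefault(var, deque()).append(value)

    def load(self, var):
        # A thread sees its own most recent store, otherwise main memory.
        buf = self.buffers.get(var)
        if buf:
            return buf[-1]
        return self.memory[var]

    def flush(self, var):
        # Move the oldest pending store of `var` to main memory (may happen at any time).
        buf = self.buffers.get(var)
        if buf:
            self.memory[var] = buf.popleft()
            return True
        return False

    def fence(self):
        # Drain every buffer of this thread, oldest store first.
        for var, buf in self.buffers.items():
            while buf:
                self.memory[var] = buf.popleft()


memory = {"flag0": 0, "flag1": 0, "turn": 0}
t0, t1 = PSOThread(memory), PSOThread(memory)
t0.store("flag0", 1)
t1.store("flag1", 1)
seen_by_t0 = t0.load("flag1")   # T1's store is still buffered
seen_by_t1 = t1.load("flag0")   # T0's store is still buffered
own_view = t0.load("flag0")     # T0 reads its own buffered store
t0.fence()
after_fence = t1.load("flag0")  # now visible through main memory
```

Both threads read a stale flag before any flush, which is the behavior that breaks Dekker's entry protocol without fences.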
9
Where should I put fences?
On the one hand, memory barriers are expensive (100s of cycles, maybe more), and should be used only when necessary.
On the other, synchronization bugs can be very difficult to track down, so memory barriers should be used liberally, rather than relying on complex platform-specific guarantees about limits to memory instruction reordering.
– Herlihy and Shavit
10
May seem easy…
Thread 0:
  flag0 := true
  fence
  while flag1 = true {
    if turn ≠ 0 {
      flag0 := false
      while turn ≠ 0 { }
      flag0 := true
      fence
    }
  }
  // critical section
  turn := 1
  flag0 := false

Thread 1:
  flag1 := true
  fence
  while flag0 = true {
    if turn ≠ 1 {
      flag1 := false
      while turn ≠ 1 { }
      flag1 := true
      fence
    }
  }
  // critical section
  turn := 0
  flag1 := false

initial: flag0 = false, flag1 = false, turn = 0
spec: mutual exclusion over critical section
relaxed model x86 TSO: Yes
11
Chase-Lev Work-Stealing Queue

int take() {
  long b = bottom - 1;
  item_t *q = wsq;
  bottom = b;
  fence
  long t = top;
  if (b < t) {
    bottom = t;
    return EMPTY;
  }
  int task = q->ap[b % q->size];
  if (b > t)
    return task;
  if (!CAS(&top, t, t+1))
    return EMPTY;
  bottom = t + 1;
  return task;
}

void push(int task) {
  long b = bottom;
  long t = top;
  item_t *q = wsq;
  if (b - t >= q->size - 1) {
    wsq = expand();
    q = wsq;
  }
  q->ap[b % q->size] = task;
  fence
  bottom = b + 1;
}

int steal() {
  long t = top;
  fence
  long b = bottom;
  fence
  item_t *q = wsq;
  if (t >= b)
    return EMPTY;
  int task = q->ap[t % q->size];
  fence
  if (!CAS(&top, t, t+1))
    return ABORT;
  return task;
}
12
Goal
• Help the programmer place fences
  – Find optimal fence placement
• Principle
  – Restrict non-determinism s.t. program stays within set of safe executions
13
Our Approach: Overview
• P’ satisfies the specification S under M
[Diagram: FENDER takes a program P, a specification S, and a memory model M, and produces a program P' with fences.]
14
1. Compute reachable states for the program
2. Compute weakest constraints on execution that guarantee all “bad states” are avoided
3. Implement the constraints with fences
Our Approach: Recipe
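To make the recipe concrete, here is a small brute-force sketch on a two-thread store-buffering test (each thread sets its own flag and then reads the other's, as in the entry of Dekker's algorithm). It is not the FENDER algorithm: instead of computing weakest constraints it simply enumerates reachable states under a TSO-like model (one FIFO buffer per thread, an assumption of this sketch) and tries fence placements from smallest to largest. All names (`PROGRAM`, `FENCE_SITES`, `reachable_results`, `minimal_fences`) are hypothetical.

```python
from itertools import combinations

# Each thread stores 1 to its own flag, then loads the other thread's flag.
PROGRAM = {
    0: [("store", "flag0"), ("load", "flag1")],
    1: [("store", "flag1"), ("load", "flag0")],
}
# Candidate fences: (thread, index of the instruction the fence precedes).
FENCE_SITES = [(0, 1), (1, 1)]


def reachable_results(fences):
    """Explore all interleavings and flushes; return the reachable final (r0, r1) pairs."""
    init = ((0, 0), ((), ()), (("flag0", 0), ("flag1", 0)), (None, None))
    seen, stack, results = {init}, [init], set()
    while stack:
        pcs, bufs, mem, regs = stack.pop()
        if all(pcs[t] == len(PROGRAM[t]) for t in (0, 1)):
            results.add(regs)
        memd = dict(mem)
        succs = []
        for t in (0, 1):
            if bufs[t]:  # flush the oldest buffered store of thread t
                var, val = bufs[t][0]
                new_mem = dict(memd)
                new_mem[var] = val
                new_bufs = list(bufs)
                new_bufs[t] = bufs[t][1:]
                succs.append((pcs, tuple(new_bufs),
                              tuple(sorted(new_mem.items())), regs))
            if pcs[t] == len(PROGRAM[t]):
                continue
            if (t, pcs[t]) in fences and bufs[t]:
                continue  # a fence blocks the thread until its buffer is empty
            op, var = PROGRAM[t][pcs[t]]
            new_pcs = list(pcs)
            new_pcs[t] += 1
            if op == "store":
                new_bufs = list(bufs)
                new_bufs[t] = bufs[t] + ((var, 1),)
                succs.append((tuple(new_pcs), tuple(new_bufs), mem, regs))
            else:  # load: newest own buffered value, else main memory
                own = [v for (x, v) in bufs[t] if x == var]
                val = own[-1] if own else memd[var]
                new_regs = list(regs)
                new_regs[t] = val
                succs.append((tuple(new_pcs), bufs, mem, tuple(new_regs)))
        for s in succs:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return results


def minimal_fences():
    """Smallest fence set under which the bad outcome (0, 0) is unreachable."""
    for size in range(len(FENCE_SITES) + 1):
        for fences in combinations(FENCE_SITES, size):
            if (0, 0) not in reachable_results(set(fences)):
                return set(fences)
    return None
```

Here the "bad state" is both threads reading 0. In this sketch one fence is not enough, and `minimal_fences()` returns both sites.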
15
• Compute reachable states of the program
• Compute constraints on execution that guarantee that all “bad states” are avoided
• Implement the constraints with fences
Bad News [Atig et al. POPL’10]
Even for programs that are finite-state under SC
Reachability undecidable for RMO
Non-primitive recursive complexity for TSO/PSO
Our Approach: Recipe
16
Challenges
• Automatic verification and fence inference that works for realistic programs
• Handle two sources of unboundedness
  – Unbounded SC state
  – Unbounded store buffers
17
Bound state and buffer (under-approx) [Automatic Inference of Memory Fences, FMCAD’10]
18
Demonic scheduling for dynamic exploration of executions [Dynamic Synthesis for Relaxed Memory Models, PLDI’12]
Can infer fences for large tricky programs such as a lock-free memory allocator
DFENCE: support for concurrency and RMM
[Diagram: concurrent C/C++ code and a client are compiled by LLVM-GCC to bytecode (.bc) and run in the LLVM interpreter. Components shown: threading, demonic scheduler, memory model specification, trace analysis (trace to order formula), SAT solver (satisfying assignment), fence enforcement (modified .bc). Output: fixed bytecode and fence location report. The legend distinguishes existing work from our extension.]
Open-source, available at: http://practicalsynthesis.org/fender/
20
Idea 1: Abstraction-Guided Synthesis
• Synthesis of synchronization via abstract interpretation
  – Compute over-approximation of all possible program executions
  – Add minimal synchronization to avoid (over-approximation of) bad schedules
• Interplay between abstraction and synchronization
  – Finer abstraction may enable finer synchronization
  – Coarse synchronization may allow coarser abstraction
21
Change the abstraction to match the program
A Standard Approach: Abstraction Refinement
[Diagram: program P and specification S are verified under an abstraction. The outcome is either "valid" or an abstract counterexample, which drives abstraction refinement before verifying again.]
22
Abstraction-Guided Synthesis [VYY-POPL’10]
Change the program to match the abstraction
[Diagram: program P and specification S are verified under an abstraction. An abstract counterexample leads either to abstraction refinement or to a program restriction; the restrictions are then implemented to obtain P'.]
23
1. Compute over-approximation of reachable states for the program using sound abstractions
2. Compute weakest constraints on abstract execution that guarantee all “bad abstract states” are avoided
3. Implement the constraints with fences
Our Approach Revisited: Recipe
24
Conservative Abstractions
25
Different Approaches
Infinite-State
Unbounded-buffer
Automatic Inference of Memory Fences [FMCAD’10]
Dynamic Synthesis for Relaxed Memory Models [PLDI’12]
Partial-Coherence Abstractions for Relaxed Memory Models [PLDI’11]
Predicate Abstraction for Relaxed Memory Models [SAS’13]
Synthesis of Memory Fences via Refinement Propagation [SAS’14]
Effective Program Transformation for Verification under Relaxed Models [in progress]
26
Partial Coherence Abstractions [PLDI’11]
Sound abstractions of store buffers
[Figure: the concrete store buffers of P0 and P1 (one buffer per shared variable flag0, flag1, turn, in front of main memory) next to their abstraction. Each abstract buffer keeps a recent value, a part of bounded length k, and a set of unordered elements.]
Annotations:
• Allows precise fence semantics
• Allows precise loads from buffer
• Keeps the analysis precise for “well behaved” programs
• Record what values appeared (without order or number)
27
Abstract Memory Models - Requirements
• Intra-process coherence
  – A process should see the most recent value it wrote
• Preserve fence semantics
  – The value written to main memory when flushed by a fence is the most recent value stored before the fence
• Preserve buffer emptiness
  – Values do not appear out of nowhere
• Partial inter-process coherence
  – Preserve as much order information as feasible (bounded)
• Simple construction!
28
State Abstraction Techniques
• Predicate abstraction
  – Simple
  – Requires initial set of predicates
• Numerical domains
  – Octagon and Polyhedra abstractions
  – Automatically handle programs with (linear) numerical invariants
29
Predicate Abstraction
• Successful for sequential program analysis
  – Graf and Saidi (CAV '97)
  – Microsoft's SLAM (PLDI’01)
  – …
• Some work for SC concurrent programs
  – Symmetry-Aware Predicate Abstraction for Shared-Variable Concurrent Programs. Kroening et al. (CAV '11)
  – Threader: A constraint-based verifier for multi-threaded programs. Gupta et al. (CAV '11)
  – …
How can we apply standard predicate abstraction to verification under relaxed memory models?
30
Classical predicate abstraction

P:
Thread 0:
  1 X = Y+1
  2 fence(X)
Thread 1:
  1 Y = X+1
  2 fence(Y)
initial: X=Y=0
assert(X ≠ Y)

V:
B0: X=Y, B1: X=1, B2: Y=1, B3: X=0, B4: Y=0

BP(P,V):
/* Statement X = 0 */
21: store B1 = false; /* update predicate - B1: (X = 1) */
22: store B3 = true; /* update predicate - B3: (X = 0) */
23: store B0 = false
…
/* Statement Y = X + 1 */
54: store B0 = false; /* update predicate - B0: (X = Y) */
55: store B2 = choose(t3(t0t4), t1t3(t0t2)(t0 t4)(t0t4)); /* B2: (Y = 1) */
56: store B4 = choose(false, (t1)(t3)(t0t2)(t0t4)); /* B4: (Y = 0) */
…

BP(P,V) ⊨_SC S entails P ⊨_SC S
31
Direct application is not sound
Thread 0:
  1 X = Y+1
  2 fence(X)
Thread 1:
  1 Y = X+1
  2 fence(Y)
initial: X=Y=0
assert(X ≠ Y)
B0: X=Y, B1: X=1, B2: Y=1, B3: X=0, B4: Y=0

P ⊭_PSO S but BP(P,V) ⊨_PSO S

Concrete
step            T0 (X,Y)   T1 (X,Y)   Global (X,Y)
initial         (0,0)      (0,0)      (0,0)
T0: X = Y+1     (1,0)      (0,0)      (0,0)
T1: Y = X+1     (1,0)      (0,1)      (0,0)
T0: flush(X)    (1,0)      (1,1)      (1,0)
T1: flush(Y)    (1,1)      (1,1)      (1,1)

Predicate Abstraction (predicates with false value other than (X=Y) have been omitted)
step            T0                   T1                   Global
initial         X=Y, X=0, Y=0        X=Y, X=0, Y=0        X=Y, X=0, Y=0
T0: X = Y+1     ¬(X=Y), X=1, Y=0     X=Y, X=0, Y=0        X=Y, X=0, Y=0
T1: Y = X+1     ¬(X=Y), X=1, Y=0     ¬(X=Y), Y=1, X=0     X=Y, X=0, Y=0
T0: flush(X)    ¬(X=Y), X=1, Y=0     ¬(X=Y), Y=1, X=0     ¬(X=Y), X=1, Y=0
T1: flush(Y)    ¬(X=Y), X=1, Y=0     ¬(X=Y), Y=1, X=0     ¬(X=Y), X=1, Y=1
32
How do we restore soundness?
• Option 0: restrict programs/properties
• Option 1: BP(P,V)_specialized
  – Capture dependencies between updates
  – Invalidation of predicates
  – Synchronized updates of multiple predicates
• Option 2: BP(P_M,V)_SC
  – Capture all relaxed memory model effects in the program itself
  – Boolean program construction as usual
  – Verification as usual (using SC tools)
33
Encode memory model effects in the Program
P ⊨_M S ?
P_M ⊨_SC S ?
The behavior of P_M under sequential consistency is an over-approximation of the behavior of P running under model M
34
Encode RMM effects into the program
• Pick a bound k for store buffers (sound)
• Encode store buffers as program variables
• Shared variable X encoded as
  – Xcnt: a counter for the buffer position
  – X1, …, Xk: buffer contents
[Figure: buffer slots X1, X2, …, Xk in front of X (PSO)]
35
Encode Program: Example for k=1
load t = X:
  if (Xcnt == 0) t = X
  if (Xcnt == 1) t = X1

store X = t:
  if (Xcnt == k) “overflow”
  Xcnt++
  if (Xcnt == 1) X1 = t
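The encoding generalizes to any bound k. The following is a minimal sketch for a single shared variable X and a single thread's buffer, with the variables X, Xcnt, X1..Xk held in a dictionary. The function names and the `BufferOverflow` exception are hypothetical; the flush step (oldest slot goes to X, remaining slots shift down) follows the usual FIFO reading of the buffer and is an assumption here.

```python
class BufferOverflow(Exception):
    """Raised when a store would exceed the chosen buffer bound k."""


def new_state(k):
    # X is the main-memory copy, Xcnt the number of buffered stores, X1..Xk the slots.
    state = {"X": 0, "Xcnt": 0}
    for i in range(1, k + 1):
        state[f"X{i}"] = 0
    return state


def encoded_store(state, k, t):
    # store X = t: report overflow at the bound, else append to the next slot.
    if state["Xcnt"] == k:
        raise BufferOverflow("store buffer bound k exceeded")
    state["Xcnt"] += 1
    state[f"X{state['Xcnt']}"] = t


def encoded_load(state):
    # load t = X: read the newest buffered value, or X when the buffer is empty.
    cnt = state["Xcnt"]
    return state["X"] if cnt == 0 else state[f"X{cnt}"]


def encoded_flush(state):
    # flush: the oldest slot X1 reaches main memory, the remaining slots shift down.
    cnt = state["Xcnt"]
    if cnt > 0:
        state["X"] = state["X1"]
        for i in range(1, cnt):
            state[f"X{i}"] = state[f"X{i + 1}"]
        state["Xcnt"] = cnt - 1


s = new_state(1)
encoded_store(s, 1, 5)
visible_to_writer = encoded_load(s)   # the writer sees its buffered value
in_memory = s["X"]                    # main memory is still unchanged
try:
    encoded_store(s, 1, 6)
    overflowed = False
except BufferOverflow:
    overflowed = True
encoded_flush(s)
```

With k=1 the second store hits the bound, which is why "overflow" has to be tracked as part of the encoded program.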
36
Where do predicates come from?
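[Diagram: memory model M and program P are reduced to program P_M. Predicate abstraction takes a program and predicates V and builds a Boolean program B; the model checker reports "verified" or a counterexample. The predicates for P_M are marked with "?".]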
37
Idea 2: Proof Extrapolation
• Leverage the similarity between behaviors in PM and those in PSC
• Verify program under SC using a given vocabulary V
• Extrapolate predicates VM for PM from the SC proof
38
Step 1: Verify program under SC
• Find a set of predicates V
• Construct the Boolean program B(P,V)
• Verify B(P,V) ⊨_SC S
39
Step 2: Predicate Extrapolation
• Discover new predicates for RMM based on the predicates used in the SC proof
• Generic predicates
  – Buffer size, overflow
• SC-Based extrapolated predicates
  – from SC relationships as captured in V
40
Predicate Extrapolation Example
• ∀X ∈ shared variables, 0 ≤ i ≤ k
  – (Xcnt == i) tracks buffer size
  – (Xi == Xi-1), i > 0, for flush actions
• ∀p ∈ V where p is of the form “(X < Y)”, 0 ≤ i ≤ k
  – (Xi < Y)
  – (X < Yi)
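The extrapolation rules can be sketched as a small generator over predicate strings. This is an illustration only: the function `extrapolate_predicates`, the `X_i` naming of buffer slots, and the convention that slot 0 denotes the main-memory copy of a variable are assumptions of the sketch, not the tool's actual representation.

```python
def slot(var, i):
    """Name of buffer slot i of a shared variable; slot 0 is the variable itself."""
    return var if i == 0 else f"{var}_{i}"


def extrapolate_predicates(shared_vars, sc_predicates, k):
    """Build predicates for the encoded program from SC predicates (lhs, op, rhs)."""
    extrapolated = []
    for x in shared_vars:
        # Generic predicates: buffer size ...
        for i in range(k + 1):
            extrapolated.append(f"{x}_cnt == {i}")
        # ... and equality of adjacent slots, useful across flush actions.
        for i in range(1, k + 1):
            extrapolated.append(f"{slot(x, i)} == {slot(x, i - 1)}")
    for lhs, op, rhs in sc_predicates:
        # SC-based predicates: replace one side by each of its buffer slots.
        for i in range(1, k + 1):
            if lhs in shared_vars:
                extrapolated.append(f"{slot(lhs, i)} {op} {rhs}")
            if rhs in shared_vars:
                extrapolated.append(f"{lhs} {op} {slot(rhs, i)}")
    return extrapolated


preds = extrapolate_predicates(["X", "Y"], [("X", "<", "Y")], k=1)
```

For k=1 and the single SC predicate (X < Y) this yields eight predicates, including `X_1 < Y` and `X < Y_1`.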
41
Dekker with extrapolated predicates
Thread 0:
  flag0 := true
  while flag1 = true {
    if turn ≠ 0 {
      flag0 := false
      while turn ≠ 0 { }
      flag0 := true
    }
  }
  // critical section
  turn := 1
  flag0 := false

Thread 1:
  flag1 := true
  while flag0 = true {
    if turn ≠ 1 {
      flag1 := false
      while turn ≠ 1 { }
      flag1 := true
    }
  }
  // critical section
  turn := 0
  flag1 := false
initial: flag0 = false, flag1 = false, turn = 0
SC (t2 = 0),(t1 = 0), (f1 = 0), (f2 = 0), (flag0 = 0), (flag1 = 0), (turn = 0)
PSO (overflow = 0), (t2 = 0), (t1 = 0), (f1 = 0), (f2 = 0), (flag0 = 0), (flag1 = 0), (turn = 0),(turn_cnt_T0 = 0), (turn_cnt_T0 = 1), (turn_cnt_T1 = 0), (turn_cnt_T1 = 1),(turn_1_T0 = 0), (turn_1_T1 = 0)(flag0_cnt_T0 = 0), (flag0_cnt_T0 = 1), (flag0_1_T0 = 0)(flag1_cnt_T1 = 0), (flag1_cnt_T1 = 1), (flag1_1_T1 = 0)
TSO (overflow = 0), (t2 = 0), (t1 = 0), (f1 = 0), (f2 = 0), (flag0 = 0), (flag1 = 0), (turn = 0)(T0_cnt = 0), (T0_cnt = 1), (lhs_1_T0 = 0), (lhs_1_T0 = 1)(T1_cnt = 0), (T1_cnt = 1), (lhs_1_T1 = 0), (lhs_1_T1 = 2), (rhs_1_T0 = 0), (rhs_1_T1 = 0)
42
Our approach so far
[Diagram: memory model M and program P go through reduction to program P_M; predicates V go through extrapolation to predicates V_M. Predicate abstraction builds Boolean program B; the model checker reports "verified" or a counterexample.]
43
Unfortunately…
• Building the Boolean program is exponential in the number of predicates
• Infeasible for some benchmarks
  – For example: Bakery runs for more than 10 hours

Number of predicates for k = 2:
            |V_SC|   |V_PSO|   |V_TSO|
Dekker        7        28        26
Szymanski    20        47        51
Bakery       15        38        36
Ticket       11        56        48
44
Core problem: abstract transformers
Literals q_i = p_i or q_i = ¬p_i, p_i ∊ V_M
Cubes(V_M) = {q_i1 ∧ … ∧ q_ij, j ≤ |V_M|}
|Cubes(V_M)| = 3^|V_M|

for st ∈ Statements
  for p_i ∈ V
    f = wp(p_i, st)
    for c ∈ Cubes(V_M)
      if c ⇒ f   // SMT call
        add c to the transformer
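The cube count is easy to check by enumeration: in a cube each predicate is either positive, negated, or absent, which gives three choices per predicate. The sketch below (the function name `cubes` and the tuple representation are hypothetical, and the empty cube is counted as well) only enumerates cubes; it does not perform the weakest-precondition or SMT steps.

```python
from itertools import product


def cubes(predicates):
    """All conjunctions in which each predicate is positive, negated, or absent."""
    result = []
    for choice in product((None, True, False), repeat=len(predicates)):
        # Keep only the predicates that actually occur in this cube.
        cube = tuple((p, polarity)
                     for p, polarity in zip(predicates, choice)
                     if polarity is not None)
        result.append(cube)
    return result


all_cubes = cubes(["p1", "p2", "p3"])
count = len(all_cubes)   # 3 ** 3 candidate cubes, each a potential SMT call
```

With the 28 predicates listed for Dekker under PSO, the same formula gives 3^28, roughly 2.3 × 10^13 candidate cubes per weakest precondition, which is why exhaustive enumeration does not scale.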
45
Cube Extrapolation
• Reuse more information from the SC proof
• In addition to input predicates, extrapolate from the cubes used in the Boolean program
• Cube search space restricted only to extrapolated cubes
46
Cube Extrapolation Example
Cube in the SC Boolean Program B Potential Cubes for the RMM Boolean Program
(X 0 X < Y) (X1 0 X1 < Y)…(Xk 0 Xk < Y)(X 0 X < Y1) …(X 0 X < Yk)
47
Abstract transformers with extrapolated Cubes
Literals q_i = p_i or q_i = ¬p_i, p_i ∊ V_M
Cubes(V_M) = {q_i1 ∧ … ∧ q_ij, j ≤ |V_M|}
ExtCubes(B,V_M) = CubeExtrapolation(B)
|ExtCubes(B,V_M)| ≪ |Cubes(V_M)|

for st ∈ Statements
  for p_i ∈ V
    f = wp(p_i, st)
    for c ∈ ExtCubes(B,V_M)
      if c ⇒ f   // SMT call
        add c to the transformer
48
Complete Approach
[Diagram: memory model M and program P go through reduction to program P_M; predicates V go through extrapolation to predicates V_M; cube extraction takes the cubes from the SC Boolean program B_SC. Predicate abstraction builds Boolean program B; the model checker reports "verified" or a counterexample.]
49
Results: Predicate Extrapolation

                             Build Boolean Program                      Model Check
algorithm  memory  # input  # SMT      time   # cubes  cube    # states  memory  time
           model   preds    calls (K)  (sec)  used     size    (K)       (MB)    (sec)
Dekker     SC      7        0.7        0.1    0        1       14        6       1
           PSO     20       26         6      0        1       80        31      5
           TSO     18       22         5      0        1       45        20      3
Peterson   SC      7        0.6        0.1    2        2       7         3       1
           PSO     20       15         3      2        2       31        13      3
           TSO     18       13         3      2        2       25        11      2
ABP        SC      8        2          0.5    5        2       0.6       1       0.6
           PSO     15       20         4      5        2       2         3       1
           TSO     17       23         5      5        2       2         3       1
Szymanski  SC      20       16         3.3    1        2       12        6       2
           PSO     35       152        33     1        2       61        30      4
           TSO     37       165        35     1        2       61        31      5
50
Results: Cube Extrapolation

                                          Build Boolean Program                        Model Check
algorithm  memory  method  # input  # input  # SMT      time   # cubes  cube    # states  memory  time
           model           preds    cubes    calls (K)  (sec)  used     size    (K)       (MB)    (sec)
Queue      SC      Trad    7        -        20         5      50       4       1         2       1
           PSO     PE      15       -        5,747      1,475  412      4       1         4       1
                   CE      15       99       98         17     99       4       11        6       2
           TSO     PE      16       -        11,133     2,778  412      4       12        4       1
                   CE      16       99       163        31     99       4       12        7       2
Bakery     SC      Trad    15       -        1,552      355    161      4       20        8       2
           PSO     PE      38       -        -          T/O    -        4       -         -       -
                   CE      38       422      9,018      1,773  381      4       979       375     104
           TSO     PE      36       -        -          T/O    -        4       -         -       -
                   CE      36       422      7,048      1,386  383      4       730       285     121
Ticket     SC      Trad    11       -        218        51     134      4       2         2       1
           PSO     PE      56       -        -          T/O    -        4       -         -       -
                   CE      56       622      15,644     2,163  380      4       193       123     40
           TSO     PE      48       -        -          T/O    -        4       -         -       -
                   CE      48       622      6,941      1,518  582      4       71        67      545
51
Numerical analysis under SC
Thread 0:
0:
1: flag0 := true
2: while flag1 = true {
3:   if turn ≠ 0 {
4:     flag0 := false
5:     while turn ≠ 0 { }
6:     flag0 := true
7:   }
8: }
9: // critical section
A: turn := 1
B: flag0 := false

Thread 1:
0:
1: flag1 := true
2: while flag0 = true {
3:   if turn ≠ 1 {
4:     flag1 := false
5:     while turn ≠ 1 { }
6:     flag1 := true
7:   }
8: }
9: // critical section
A: turn := 0
B: flag1 := false

initial: flag0 = false, flag1 = false, turn = 0

(0,0) {turn=0; flag1=0; flag0=0}
(9,9) { }
(2,2) {flag1-1=0; flag0-1=0; -turn+1>=0; turn>=0}
(2,9) {flag1-1=0; flag0-1=0;}
// line numbers indicate the state at the end of the line (i.e. after executing)
52
Use same encoding for PM
load t = X:
  if (Xcnt == 0) t = X
  if (Xcnt == 1) t = X1
  if (Xcnt == 2) t = X2

store X = t:
  if (Xcnt == 2) “overflow”
  Xcnt++
  if (Xcnt == 1) X1 = t
  if (Xcnt == 2) X2 = t

(shown for k=2)

Flush operation
while random do
  if flag0_cnt_0 > 0 then
    flag0 = flag0_1_0;
    if flag0_cnt_0 > 1 then
      flag0_1_0 = flag0_2_0;
    flag0_cnt_0 = flag0_cnt_0 - 1;
  yield;
// At this point we can’t know whether there was a flush, due to the nondeterministic loop.

flush is a problem for convex domains
● The nondeterministic flush captures two possible buffer states
  1. value is flushed: buffer content shifted one slot
  2. value is not flushed: buffer does not change
● To avoid losing precision, we have to track disjunctions, which a convex numerical domain cannot express
[Figure: the flushed state (3, 3) with cnt_t1=1 and the non-flushed state (1, 3) with cnt_t1=2 are joined into ([1,3], 3) with cnt_t1=[1,2].]
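The precision loss at the join can be reproduced with the simplest convex domain, intervals. This is a stand-in for the octagon and polyhedra domains named in the talk, chosen only because it fits in a few lines; the state layout (a value `X` and a counter `cnt`) and all function names are hypothetical.

```python
def join(a, b):
    """Interval join: per-variable convex hull of two abstract states."""
    return {v: (min(a[v][0], b[v][0]), max(a[v][1], b[v][1])) for v in a}


def concretize(state):
    """All concrete (X, cnt) pairs described by an interval state."""
    (xlo, xhi), (clo, chi) = state["X"], state["cnt"]
    return {(x, c) for x in range(xlo, xhi + 1) for c in range(clo, chi + 1)}


# After the nondeterministic flush: either the value 3 reached memory and one
# entry remains, or memory still holds 1 and two entries remain.
flushed = {"X": (3, 3), "cnt": (1, 1)}
not_flushed = {"X": (1, 1), "cnt": (2, 2)}

joined = join(flushed, not_flushed)
spurious = concretize(joined) - concretize(flushed) - concretize(not_flushed)
```

The join yields X in [1,3] and cnt in [1,2], which also admits combinations such as X=1 with cnt=1 that neither branch can produce. Those spurious states are what make verification fail after a flush.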
55
Refine the abstraction
• Leverage boolean-numerical domains
  – Add boolean flags
  – Similar to trace partitioning domain
  – Supported by our SC verifier: ConcurInterproc
[Figure: with a boolean flag b the join keeps the two states apart: the flushed state (3, 3) with cnt_t1=1 under b, and the non-flushed state (1, 3) with cnt_t1=2 under ¬b.]
Refined flush operation
b_f1_flag0_0_t0 = false;
b_f1_flag0_1_t0 = false;
yield;
while random do
  if flag0_cnt_0 > 0 then
    flag0 = flag0_1_0;
    if flag0_cnt_0 > 1 then
      b_f1_flag0_1_t0 = true;
      flag0_1_0 = flag0_2_0;
    else
      b_f1_flag0_0_t0 = true;
    flag0_cnt_0 = flag0_cnt_0 - 1;
  yield;
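The effect of the added boolean flags can be sketched as a partitioned abstract state: one numerical state per flag value, joined only within the same partition. The sketch uses intervals as the numerical part and a single flag that is true exactly when a flush happened; both are simplifying assumptions, and the names are hypothetical rather than ConcurInterproc's API.

```python
def join(a, b):
    """Interval join: per-variable convex hull of two abstract states."""
    return {v: (min(a[v][0], b[v][0]), max(a[v][1], b[v][1])) for v in a}


def partitioned_join(p, q):
    """Join two flag-indexed states; numerical parts merge only under the same flag."""
    out = dict(p)
    for flag, state in q.items():
        out[flag] = join(out[flag], state) if flag in out else state
    return out


# The refined flush records in a boolean flag whether a value was flushed.
flushed = {True: {"X": (3, 3), "cnt": (1, 1)}}
not_flushed = {False: {"X": (1, 1), "cnt": (2, 2)}}

merged = partitioned_join(flushed, not_flushed)
```

Because the two branches carry different flag values, `merged` keeps both numerical states exactly; the price is one extra partition per flag, which is the source of the state explosion when refined flushes are used everywhere.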
57
Challenge: state explosion
• Using refined flush operations everywhere is not feasible – state explosion
• We would like to find a minimal refinement that enables verification with a minimal fence placement
• Search space exponential in number of fence placements and in number of refinement placements
58
Idea 3: Refinement propagation
propagation of: program correctness + abstraction refinements
[Figure: placements f1,r1, f2,r2 and f3,r3, each a fence placement f with an abstraction refinement r. Legend: program has been explored; program to be explored; program need not be explored; a marker for a successful abstraction refinement used to verify a program; a marker for an attempt to verify with a combined abstraction refinement.]
59
Two dimensional search
• Start from full fence placement
• A verification attempt produces new options to explore
  – If verified: smaller placements should be explored
    • Either fewer fences, or coarser abstraction
  – If failed: larger placements should be explored
    • Either additional fences, or finer abstraction
  – If “unknown”
    • Try both directions
• We keep a worklist from which we choose the next placement to explore
  – Do not try subset of failed or superset of verified
  – A small verified placement or a large failed placement reduces the search space substantially, so we guide the search
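A one-dimensional version of this search (fences only, no refinement placements, no "unknown" answers) can be sketched with a worklist and the two pruning rules. The function `search` and the `verify` oracle are hypothetical; the sketch assumes verification is monotone in the fence set, so that every subset of a failed placement fails and every superset of a verified placement verifies.

```python
def search(all_fences, verify):
    """Top-down search for minimal verified fence placements with pruning."""
    worklist = [frozenset(all_fences)]      # start from the full fence placement
    verified, failed, seen = set(), set(), set()
    calls = 0
    while worklist:
        placement = worklist.pop()
        if placement in seen:
            continue
        seen.add(placement)
        if any(placement <= f for f in failed):
            continue                        # subset of a failed placement: skip
        if any(v <= placement for v in verified):
            ok = True                       # superset of a verified placement: implied
        else:
            calls += 1
            ok = verify(placement)
        if ok:
            verified.add(placement)
            # Explore smaller placements: drop one fence at a time.
            worklist.extend(placement - {f} for f in sorted(placement))
        else:
            failed.add(placement)
    minimal = {v for v in verified if not any(w < v for w in verified)}
    return minimal, calls


# Hypothetical oracle: the program verifies exactly when fence "a" is present.
minimal, calls = search(["a", "b", "c"], lambda placement: "a" in placement)
```

In this toy setting the search reports `{"a"}` as the only minimal placement while calling the verifier on fewer than all eight subsets. The search described on the slide additionally moves upward after failures and also varies the abstraction refinement.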
60
Benchmark
● 15 concurrent algorithms
● 8 infinite state
● Safety specifications: either mutual exclusion or reachability invariants involving labels of different threads
61
Example: PC1
• 9 possible fences
• 27 possible locations for flush refinements
• BFS
  – explores various boolean placements for the fully fenced placement for 3:30 hours
• DFS
  – Verifies 5 fence placements in under 5 mins
  – State explosion leads to exploring failing placements for the rest of the time
• Propagation
  – finds that a single fence is needed
Results
[Plots: prop, bfs, and dfs compared over time (h:mm:ss). Panels: ABP TSO and Loop2_TLM TSO (y-axis: num fences); WSQ-Chase TSO and Queue TSO (y-axis: #r locations).]
63
Summary
• Abstraction-guided synthesis
  – Compute over-approximation of all possible program executions
  – Add minimal synchronization to avoid (over-approximation of) bad schedules
• Proof extrapolation
  – Use information from the SC proof to help the proof under RMM
  – Extrapolate predicates and cubes
• Refinement propagation
  – Implied correctness/incorrectness in the space of fence/refinement placements
  – Combining information from different fence/refinement placements
64
Back to Boids
• With synchronization barriers
• Numerical abstractions for tracking array indices
• Establishing conflict-freedom of array accesses that may happen in parallel
• Computing forces can be done in parallel
• Different Boids write to disjoint parts of the array
65
Boids Simulation
…
Main task:
  Generate initial locations
  Spawn N Boid Tasks
  While (true) {
    Wait on display-barrier
    Read locations of all boids
    Render boids
  }

Boid task:
  While (true) {
    Read locations of other boids
    Wait on message-barrier
    Compute new location
    Update my location
    Wait on display-barrier
  }
Shared Memory (Global State)
Locations of N Boids
66
http://practicalsynthesis.org/fender/