Symbolic Program Consistency Checking of OpenMP Parallel
Programs with Relaxed Memory Models
Fang Yu
National Cheng Chi University
Shun-Ching Yang
Guan-Cheng Chen
Che-Chang Chan
National Taiwan University
Based on an LCTES 2012 paper.
Farn Wang
National Taiwan University
& Academia Sinica
Outline
• Introduction
  – Motivation
  – Parallel program correctness
  – Related work
• 2-step program consistency checking
  – Step 1: Static race constraint solution
  – Step 2: Guided simulation
• Extended finite-state machine (EFSM), relaxed memory models
• Implementation
• Experiments
• Conclusion
Motivation (1/4)
• Parallel programming
  – Multi-cores
  – General-purpose computation on GPUs (GPGPU)
  – Distributed computing, cloud computing
• Challenges:
  – Parallel loops, chunk sizes, # threads, schedules
  – Arrays, pointer aliases
  – Relaxed memory models
Motivation (2/4)
A running example in C with OpenMP:

for (k = 0; k < size-1; k++) {
    #pragma omp parallel for default(none) shared(M,L,size,k) \
            private(i,j) schedule(static,1) num_threads(4)
    for (i = k+1; i < size; i++) {
        L[i][k] = M[i][k] / M[k][k];
        for (j = k+1; j < size; j++) {
            M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
        }
    }
}
Motivation (3/4)
Thread 1: k+1, …, k+1+c-1
Thread 2: k+1+c, …, k+1+2c-1
Thread 3: k+1+2c, …, k+1+3c-1
Thread 4: k+1+3c, …, k+1+4c-1
Thread 1: k+1+4c, …, k+1+5c-1
…

for (k = 0; k < size-1; k++) {
    #pragma omp parallel for default(none) shared(M,L,size,k) \
            private(i,j) schedule(static,c) num_threads(4)
    for (i = k+1; i < size; i++) {
        L[i][k] = M[i][k] / M[k][k];
        for (j = k+1; j < size; j++) {
            M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
        }
    }
}
Motivation (4/4)
Many programming supports:
• forks & joins
• POSIX threads (Pthreads)
• Open Multi-Processing (OpenMP)
• Intel Threading Building Blocks
• Microsoft …
Parallel Program Correctness (1/4)
Program level: what users care about
• Determinism: for all inputs, all executions yield the same output.
• Consistency: all executions yield the same output as the sequential execution.
• Race-freedom: parallel executions do not yield different results.
All seemingly equivalent at the program level, unless the sequential execution is not a parallel execution.
Parallel Program Correctness (2/4)
• Checking the correctness property of each parallel region (PR)
• Correctness at PRs ⇒ correctness of the program

[Figure: a program fragment containing four parallel regions: parallel for, parallel while, parallel for, parallel for]
Parallel Program Correctness (3/4)
In practice:
• It may be unclear what the program result is.
• Instead, properties for correctness at the PR level are usually checked:
  – determinism
  – consistency
  – race-freedom
• At the RW-schedule level, values do not count.
• Linearizability (transaction level).
Parallel Program Correctness (4/4)
linearizability (transaction level)
race-freedom (PR RW level)
determinism (PR level) = consistency (PR level)
race-freedom (program level) = determinism (program level) = consistency (program level)
program correctness
Related Work (1/4)
• Thread Analyzer of Sun Studio [Lin 2008]
  – Static race detection; no arrays
• Intel Thread Checker [Petersen & Shah 2003]
  – Dynamic approach
• Instrumentation approach on client-server for race detection [Kang et al. 2009]
  – Run-time monitoring of OpenMP programs
• OmpVerify [Basupalli et al. 2011]
  – Polyhedral analysis for affine control loops
Related Work in PLDI 2012 (2/4): no simulation as the 2nd step
• Detecting races via liquid effects [Kawaguchi, Rondon, Bakst, Jhala]
  – Type inference for precise race detection
  – No arrays
• Speculative Linearizability [Guerraoui, Kuncak, Losa]
• Reasoning about Relaxed Programs [Carbin, Kim, Misailovic, Rinard]
• Parallelizing Top-Down Interprocedural Analysis [Albarghouthi, Kumar, Nori, Rajamani]
Related Work in PLDI 2012 (3/4): no simulation as the 2nd step
• Sound and Precise Analysis of Parallel Programs through Schedule Specialization [Wu, Tang, Hu, et al.]
• Race Detection for Web Applications [Petrov, Vechev, Sridharan, Dolby]
• Concurrent Data Representation Synthesis [Hawkins, Aiken, Fisher, et al.]
• Dynamic Synthesis for Relaxed Memory Models [Liu, Nedev, Prisadnikov, et al.]
Related Work in PLDI 2012 (4/4): no simulation as the 2nd step
Tools:
• Parcae [Raman, Zaks, Lee, et al.]
• Chimera [Lee, Chen, Flinn, Narayanasamy]
• Janus [Tripp, Manevich, Field, Sagiv]
• Reagents [Turon]
Methodology (1/2)
Assumptions:
• Arrays do not overlap.
• No pointers other than arrays.
• Fixed # threads, chunk size, and scheduling policy.
  – We analyze the consistency of a program implementation.
• Focusing on OpenMP.
  – The techniques should be applicable to other frameworks.
• Output result prescribed by users.
Why OpenMP?
• Complicated enough
• Practical enough
  – Parallelizes programs automatically
  – Is an industry-standard application programming interface (API)
  – Is supported by Sun Studio, Intel Parallel Studio, Visual C++, and the GNU Compiler Collection (GCC)
Methodology (2/2)
2-step program consistency checking:
Step 1: Potential race analysis at the PR level → potential race report
Step 2: Guided simulation for program consistency violations → end
Step 1: Potential Races at PR level
Necessary constraints as Presburger formulas:
• A race constraint between each pair of memory references to the same location by different threads.
• Solution of the pairwise constraints via Presburger formula solving.

Flow: C program with OpenMP → Pairwise Constraints Generator → pairwise race constraints → Constraint Solver → Sat? No: race-freedom; Yes: potential races (truth assignment).
Potential Race Constraint
A potential race constraint = thread path condition ∧ race condition
• Thread path condition
  – Necessary for a thread to access a memory location in a statement
  – Obtained by symbolic postcondition analysis
• Race condition
  – The necessary condition of an access by two threads in a parallel region
Running example
Thread 1: k+1, …, k+1+c-1
Thread 2: k+1+c, …, k+1+2c-1
Thread 3: k+1+2c, …, k+1+3c-1
Thread 4: k+1+3c, …, k+1+4c-1
Thread 1: k+1+4c, …, k+1+5c-1
…

for (k = 0; k < size-1; k++) {
    #pragma omp parallel for default(none) shared(M,L,size,k) \
            private(i,j) schedule(static,c) num_threads(4)
    for (i = k+1; i < size; i++) {
        L[i][k] = M[i][k] / M[k][k];
        for (j = k+1; j < size; j++) {
            M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
        }
    }
}
Thread Path Condition of L[i][k]
for (k = 0; k < size-1; k++) {
    #pragma omp parallel for default(none) shared(M,L,size,k) \
            private(i,j) schedule(static,c) num_threads(4)
    for (i = k+1; i < size; i++) {
        L[i][k] = M[i][k] / M[k][k];
        for (j = k+1; j < size; j++) {
            M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
        }
    }
}

Thread 1: (i_t1 - (k+1)) mod 4 = 0 ∧ k+1 ≤ i_t1 < size
Thread Path Conditions of L[i-1][k]
for (k = 0; k < size-1; k++) {
    #pragma omp parallel for default(none) shared(M,L,size,k) \
            private(i,j) schedule(static,c) num_threads(4)
    for (i = k+1; i < size; i++) {
        L[i][k] = M[i][k] / M[k][k];
        for (j = k+1; j < size; j++) {
            M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
        }
    }
}

Thread 2: (i_t2 - (k+1) - 1) mod 4 = 0 ∧ k+1 ≤ i_t2 < size ∧ k+1 ≤ j_t2 < size
Race Condition of L[i][k] & L[i-1][k]
for (k = 0; k < size-1; k++) {
    #pragma omp parallel for default(none) shared(M,L,size,k) \
            private(i,j) schedule(static,c) num_threads(4)
    for (i = k+1; i < size; i++) {
        L[i][k] = M[i][k] / M[k][k];
        for (j = k+1; j < size; j++) {
            M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
        }
    }
}

(i_t1 - (k+1)) mod 4 = 0 ∧ k+1 ≤ i_t1 < size
∧ (i_t2 - (k+1) - 1) mod 4 = 0 ∧ k+1 ≤ i_t2 < size ∧ k+1 ≤ j_t2 < size
∧ k = k ∧ i_t1 = i_t2 - 1
Potential Race Constraint Solving
(i_t1 - (k+1)) mod 4 = 0 ∧ k+1 ≤ i_t1 < size
∧ (i_t2 - (k+1) - 1) mod 4 = 0 ∧ k+1 ≤ i_t2 < size ∧ k+1 ≤ j_t2 < size
∧ k = k ∧ i_t1 = i_t2 - 1

All Presburger.

Potential races (Omega library):
  i_1 = k+1+4alpha
  i_2 = k+2+4alpha
  i_2 = i_1+1
  i_1 < size
  i_2 < size
  k+1 <= i_1
  k+1 <= i_2
  k+1 <= j_2
  j_2 < size
  i_1: [0,∞), not tight
  i_2: [0,∞), not tight
Step 2: Guided symbolic simulation
• Program models:
  – Extended finite-state machine (EFSM)
  – Relaxed memory model
• Simulator of EFSMs
  – Stepwise, backtrack, fixed point
• Witness of program consistency violations
  – Comparison with the sequential execution result
Flow: C program with OpenMP + potential races (from step 1) → Model Generator → model (EFSM) → Simulation → consistency? No: consistency violations; Yes: fixed point? No: continue simulating; Yes: consistency (with benign races).
C Program Model Construction (1/2)
Example:
  #pragma omp for schedule(static,c) num_threads(m)
  for (x = i; x <= j; x++) S

EFSM modes: start, S, stop; transitions:
  start → S:  (true) x = (t-1)*c + i; y = 0;
  S → S:      (x < j ∧ y < c-1) x++; y++;
  S → S:      (x-y+m*c ≤ j ∧ y = c-1) x = x-y+m*c; y = 0;
  S → stop:   (x-y+m*c > j ∧ y = c-1)
  S → stop:   (x > j)

y is an auxiliary local variable for the chunk; t is the serial number of the thread.
C Program Model Construction (2/2)
To model races in a C statement y = f(x1, x2, …, xn):
• Assume it reads x1, x2, …, xn in order.
  – Other orders can also be modeled.
• Translate it into the following n+1 EFSM transitions:
  a1 = x1; a2 = x2; …; an = xn; y = f(a1, …, an);
• a1, a2, …, an are auxiliary variables in the EFSM.
Relaxed Memory Models
• Out-of-order execution of accesses to the memory for hardware efficiency
  – Local caches, multiprocessors
  – For customized synchronizations, controlled races
• May lead to unexpected results.
A classical example:
  initially x = 0, y = 0
  thread 1: x = 1; z = y;        thread 2: y = 1; w = x;
  assert z = 1 ∨ w = 1
Relaxed Memory Models
A classical example: initially x = 0, y = 0
[Figure: thread 1 (x=1; z=y;) works through cache 1 and thread 2 (y=1; w=x;) through cache 2; loads and stores travel between the caches and memory, and the loads load(z.c1, y) and load(w.c2, x) can complete before the stores store(x.c1) and store(y.c2) reach memory, so the assert z = 1 ∨ w = 1 can fail.]
Relaxed Memory Models
Total store order (TSO)
• From SPARC
• Adopted by the Intel 80x86 series
• Description:
  – Local reads can use pending writes in the local store buffer.
  – Local stores must be FIFO.
• Problem: peer reads are not aware of the local pending writes.
Modeling TSO w. m threads (1/4)
• An array x[0..m] for each shared variable x
  – x[0] is the memory copy.
  – x[i] is the cache copy of x of thread i ∈ [1,m].
  – x now becomes an address variable instead of the value variable for x.
Modeling TSO w. m threads (2/4)
• An array ls[0..n] of objects for the load-store (LS) buffer of size n+1.
  – ls_st[k]: status of load-store buffer cell k (0: not used, 1: load, 2: store)
  – ls_th[k]: thread that uses load-store buffer cell k
  – ls_dst[k], ls_src[k]: destination and source addresses
  – ls_value[k]: value to store
• A single shared LS buffer is purely for convenience; it can be changed to m load-store buffers, one per thread, but then the mapping from threads to cores must be known.
Modeling TSO w. m threads (3/4)
Load x into a by thread j ('a' is private).

With a pending write (PW) to x:
  Step 1: Thread J: !load@Q; ls_src@(Q) = &x; ls_dst = &a;
          LS Q (must be the largest PW LS object): ?load@J; ls_th = J; ls_status = 1;
  Step 2: Thread J: ?load_finish
          LS Q: !load_finish@(ls_th); ls_dst[0] = ls_value; ls_th = 0; ls_status = 0; compact LS array;

With no pending write:
  Step 1: Thread J: !load@Q; ls_src@(Q) = &x; ls_dst = &a;
          LS Q (must be the smallest idle LS object): ?load@J; ls_th = J; ls_status = 1;
  Step 2: Thread J: ?load_finish
          LS Q: !load_finish@(ls_th); ls_dst[0] = ls_src[0]; ls_th = 0; ls_status = 0; compact LS array;
Modeling TSO w. m threads (4/4)
Store a to x by thread j ('a' is private).

  Step 1: Thread J: !store@Q; ls_dst@(Q) = &x; ls_value = a;
          LS Q (must be the smallest idle LS object): ?store@J; ls_th = J; ls_status = 2;
  Step 2: LS Q: ls_dst[0] = ls_value; ls_th = 0; ls_status = 0; compact LS array;
Guided Simulation
• For each pairwise race condition truth assignment, perform a simulation session.
• Use a stack to explore the simulation paths.
• Explore all paths compatible with the truth assignment.
• Check consistency at the end of each path.
• Mark benign races.
Implementation
Pathg, the path generator:
• Potential race constraint solving
  – Presburger Omega library
• Model construction
  – REDLIB for EFSMs with synchronizations, arrays, variable declarations, address arithmetic
• Guided EFSM simulation
  – REDLIB semi-symbolic simulator
  – Step, backtrack, check fixpoint/consistency
Implementation: Guided Symbolic Simulation
[Figure: the sequential execution (golden model, master thread) and the guided multi-threaded simulation (master thread plus parallel tasks 1-3) each produce a memory accessing sequence (e.g. Read:L[2][1], Write:L[2][1], …); the outputs of the two runs are then compared.]
Implementation: Potential Race Report

  ===tg: i_4, i_1===
  ==tw: i_4
  Race:: L[5][1]
  ===tg: i_3, i_4===
  ==tw: i_3
  Race:: L[4][1]
  ===tg: i_2, i_3===
  ==tw: i_2
  Race:: L[3][1]
  ===tg: i_1, i_2===
  ==tw: i_1
  Race:: L[2][1]

• tg indicates the threads involved in the race.
• tw indicates the thread that WRITEs the memory address.
• Race indicates where the race condition is.
• We enumerate variables to limit the solutions.
Experiments
• Environment
  – Ubuntu 9.10, 64-bit
  – Intel i5-760 2.8 GHz, 2 GB RAM
• Benchmarks
  – OpenMP Source Code Repository (OmpSCR)
  – NAS Parallel Benchmarks (NPB)
Constraint Solving of OmpSCR
Bug v1: races manually introduced (between any two threads dealing with consecutive iterations)
Bug v2: rare races introduced (only between two specific threads on a particular shared memory)
Fixed: a barrier statement manually inserted (removes the race in Bug v2)

Benchmark  | Original           | Bug v1             | Bug v2             | Fixed
           | #Const #Sat Time   | #Const #Sat Time   | #Const #Sat Time   | #Const #Sat Time
c_lu.c     | 71   0   0.18s     | 629  29  1.810s    | 935  30  4.110s    | 935  0   5.15s
c_ja01.c   | 95   0   0.39s     | 95   8   0.42s     | 155  1   0.75s     | 95   0   0.77s
c_ja02.c   | 95   0   0.03s     | 95   8   0.35s     | 155  1   0.67s     | 95   0   1.03s
c_loopA.c  | 17   0   0.04s     | 47   4   0.07s     | 95   1   0.32s     | 17   0   0.84s
c_loopB.c  | 17   0   0.03s     | 29   4   0.08s     | 95   1   0.15s     | 17   0   1.13s
c_md.c     | 65   0   0.25s     | 77   4   0.30s     | 131  1   0.53s     | 65   0   1.25s
Symbolic Simulation of OmpSCR
• Blind simulation needs to explore (much) more traces to hit a consistency violation.
• Standard OpenMP tools fail to report races of these benchmarks.

Benchmark     | Guided simulation  | Random simulation  | Sun Studio | Intel Thread Checker
              | #Traces  Time      | #Traces  Time      | race       | Race/total
c_lu_bug1     | 1   23.35s         | 25.3    52.11s     | N          | 4/10
c_lu_bug2     | 1   23.22s         | 178.9   110.58s    | N          | 1/10
c_ja01_bug1   | 1   6.65s          | 10.6    26.60s     | N          | 4/10
c_ja01_bug2   | 1   13.91s         | 42.1    58.16s     | N          | 3/10
c_ja02_bug1   | 1   14.86s         | 25      28.83s     | N          | 2/10
c_ja02_bug2   | 1   15.19s         | 41.3    52.25s     | N          | 2/10
c_loopA_bug1  | 1   10.76s         | 11.7    36.82s     | N          | 3/10
c_loopA_bug2  | 1   56.86s         | 27.6    98.40s     | N          | 2/10
c_loopB_bug1  | 1   14.54s         | 9.4     29.58s     | N          | 2/10
c_loopB_bug2  | 1   41.50s         | 38.6    66.48s     | N          | 2/10
c_md_bug1     | 1   12.19s         | 10.4    26.21s     | N          | 3/10
c_md_bug2     | 1   19.38s         | 44.3    83.52s     | N          | 2/10
NAS Parallel Benchmarks
• Middle-size benchmarks (1200+ to 3500+ LOC)
• Efficient race constraint solving
  – e.g., 150000+ race constraints solved in 38 minutes by the Omega library
• Rare satisfiable constraints
  – 8 of 85067 constraints of nas_lu.c

Benchmark | #loc | #Access | #Const. | #Sat | Time
nas_lu.c  | 3481 | 13736   | 85067   | 8    | 27m30.37s
bt.c      | 3616 | 15916   | 157047  | 0    | 37m33.32s
mg.c      | 1250 | 4636    | 2269    | 0    | 0m17.19s
sp.c      | 2983 | 13604   | 45209   | 0    | 4m0.32s
nas_lu.c
• Slice the program to the segment of the parallel region with satisfiable race conditions.
• Construct the symbolic model of the sliced segment:
  – 35 modes (EFSM)
  – Reaches the fixed point without a consistency violation after 205 steps and 16.93 s
• Benign races
  – All of them are used as mutual-exclusion semaphores.
  – nas_lu.c is consistent.
Conclusion
• Static analysis of program consistency
  – For real C/C++ programs with OpenMP directives
• Highly automated solution
  – Constraint solving
  – Symbolic simulation
• High precision: relaxed memory models
• High efficiency
• Extension to TBB, other memory models?
• Partial order reduction?