Continued from last time: Race Detection in Cilk Computations (transcript of the lecture slides)
Canonical Series-Parallel (SP) Parse Tree
[Figure: the canonical SP parse tree for the procedure F below. The S-nodes running down the left edge form the spine; within each sync block, each spawn introduces a P-node whose left child is the spawned procedure's subtree (F1, F2, ...) and whose right child contains the strands that follow it.]
F: e1; spawn F1; e2; spawn F2; e3; sync; e4; spawn F4; e5; spawn F5; e6; sync; e7; spawn F7; e8; spawn F8; e9; sync; e10;
Lemma 1 [LCA in SP tree]: The least common ancestor (LCA) of two strands determines whether the strands are logically in series or in parallel:
• e ≺ e' (e precedes e') if LCA(e, e') is an S-node and e is in its left subtree;
• e ‖ e' if LCA(e, e') is a P-node.
Canonical Series-Parallel (SP) Parse Tree
Lemma 1 [LCA in SP tree]: The LCA of two strands determines whether the strands are logically in series or in parallel: e ‖ e' if and only if LCA(e, e') is a P-node.
(The other case, where e ≺ e', is just a corollary of this lemma.)
Canonical Series-Parallel (SP) Parse Tree
Case 1 (⟹): Assume for the purpose of contradiction that e1 ‖ e2, but their LCA is an S-node.
Since the LCA is an S-node, the dag G1 containing e1 must be connected in series with the dag G2 containing e2 (the sink of G1 is joined to the source of G2).
Then there must be a path from e1 to e2, so e1 ≺ e2. Contradiction!
Canonical Series-Parallel (SP) Parse Tree
Case 2 (⟸): Assume for the purpose of contradiction that e1 ≺ e2, but their LCA is a P-node. That means G1 (containing e1) and G2 (containing e2) are connected with parallel composition.
Since there is a path from e1 to e2, there is a path from the sink of G1 back to the source of G2, which would make the dag cyclic. Contradiction!
Overview of the Nondeterminator
• A serial tool: it executes a Cilk computation serially, but analyzes the parallel constructs for a given input.
• The program is compiled so that every load and store in the user program is instrumented.
• As the program executes, the Nondeterminator maintains:
  • a shadow space that keeps track of the memory accesses seen by the execution thus far;
  • an SP-bags data structure that keeps track of the series-parallel relationships among strands (so implicitly it keeps track of the shape of the SP parse tree).
• A race is reported when two logically parallel strands access the same memory location in a conflicting way.
• Guarantee: it reports a race if and only if the computation (program + input) contains a race.
The SP-Bags Data Structure
For each active procedure on the call stack, the Nondeterminator maintains an S-bag and a P-bag:
• S-bag SF: contains the IDs of F's completed descendants (including F itself) that logically precede the currently executing strand.
• P-bag PF: contains the IDs of F's completed descendants that operate logically in parallel with the currently executing strand.
[Figure: the slowdown of eight benchmark Cilk programs checked with the Nondeterminator. The slowdown, shown as a dark bar, is the ratio of the Nondeterminator runtime to the original optimized runtime of the benchmark. For comparison, the slowdown of an ordinary debugging version of each benchmark is shown as a light bar.]
[Table: comparison of determinacy-race detection algorithms (English-Hebrew labeling [16], task recycling [7], offset-span labeling [12], and the SP-bags algorithm) by time per thread creation/termination, time per access, and space. The SP-bags algorithm takes amortized O(α(v, v)) time per operation. The function α is the very slowly growing inverse of Ackermann's function introduced by Tarjan in his analysis of an efficient disjoint-set data structure; for all conceivably practical inputs, the value of this function is at most 4. Key: p = maximum depth of nested parallelism; t = maximum number of logical concurrent threads; v = number of shared locations being monitored; n = number of threads in an execution.]
We prove that the running time of the algorithm is O(T α(v, v)) when run on a Cilk program that takes time T on one processor and uses v shared-memory locations, where α is Tarjan's functional inverse of Ackermann's function [21].
The SP-bags algorithm is a serial algorithm. It uses the fact that any Cilk program can be executed on one processor in a depth-first (C-like) fashion and conforms to the semantics of the C program that results when all spawn and sync keywords are removed. As the SP-bags algorithm executes, it employs several data structures to determine which procedure instances have the potential to execute "in parallel" with each other, and is thereby able to check for determinacy races.
The SP-bags algorithm maintains two shadow spaces of shared memory called writer and reader. For each location of shared memory, each shadow space has a corresponding location. Every spawned procedure (technically, every "procedure instance," that is, the runtime state of the procedure) is given a unique ID at runtime. For each location l in shared memory, the ID of the procedure that wrote the location is stored in location l of the writer shadow space. Similarly, location l of the reader shadow space stores the ID of a procedure which previously read location l, although in this case the ID is not necessarily that of the most recent reader. The SP-bags algorithm updates the shadow spaces as it executes.
[Figure 7: A snapshot of the SP-bags data structures during the execution of a Cilk program. The ovals in the figure represent procedures that are currently on the runtime stack: F1 spawns F6, which spawns F9, which spawns F13. Each procedure contains an S-bag and a P-bag. Each descendant of a completed child of a procedure F belongs either to F's S-bag or to F's P-bag. For example, F2, F3, F4, and F5 are descendants of F1 that complete before F1 spawns F6, and so these procedures belong to either F1's S-bag or its P-bag. In addition, every procedure F belongs to its own S-bag.]
The SP-bags algorithm uses the fast disjoint-set data structure [6, Chapter 22] analyzed by Tarjan [21]. The data structure maintains a dynamic collection Σ of disjoint sets and provides three elementary operations:
MAKE-SET(x): Σ ← Σ ∪ {{x}}.
UNION(X, Y): Σ ← (Σ − {X, Y}) ∪ {X ∪ Y}. The sets X and Y are destroyed.
FIND-SET(x): Returns the set X ∈ Σ such that x ∈ X.
Tarjan shows that any m of these operations on n sets take a total of O(m α(m, n)) time.
During the execution of the SP-bags algorithm, two "bags" of procedure ID's are maintained for every Cilk procedure on the call stack, as illustrated in Figure 7. These bags have the following contents:
The S-bag SF of a procedure F contains the ID's of those descendants of F's completed children that logically "precede" the currently executing thread, as well as the ID for F itself.
The P-bag PF of a procedure F contains the ID's of those descendants of F's completed children that operate logically "in parallel" with the currently executing thread.
The S-bags and P-bags are represented as sets using a disjoint-set data structure. The SP-bags algorithm itself is given in Figure 8. As the Cilk program executes in a serial, depth-first fashion, the SP-bags algorithm performs additional operations whenever one of the five following actions occurs: spawn, sync, return, write, and read. The correctness of the SP-bags algorithm is presented in Section 4, but we give an informal explanation of its operation here.
As the SP-bags algorithm executes, it updates the contents of the S-bags and P-bags whenever one of the actions spawn, sync, or return occurs. Whenever a procedure F' is spawned, SF' is initially made to contain F', because F''s subsequent instructions are in series with its earlier instructions. Whenever a subprocedure F' returns to its parent F, the contents of SF' are emptied into PF, since the procedures in SF' can execute in parallel with any subprocedures that F might spawn in the future before performing a sync. When a sync occurs in F, PF is emptied into SF, since all of F's previously spawned subprocedures now logically precede the current thread.
The SP-Bags Data Structure
For each active procedure on the call stack, the Nondeterminator maintains an S-bag and a P-bag:
• S-bag SF: contains the IDs of F's completed descendants (including F itself) that logically precede the currently executing strand.
• P-bag PF: contains the IDs of F's completed descendants that operate logically in parallel with the currently executing strand.
[Figure: the parse tree of a procedure F with two sync blocks. The first sync block spawns F1, F2, F3 and the second spawns F4, F5, F6, where each Fi is the parse tree of a spawned procedure. Strand e1 executes after the first sync block; strand e2 executes after F4 and F5 have returned but before the second sync.]
When e1 executes, SF = {F1, F2, F3}. When e2 executes, SF = {F1, F2, F3} and PF = {F4, F5}.
The SP-Bags Data Structure
The Nondeterminator uses a disjoint-set data structure to maintain the S- and P-bags of the procedures on the call stack.
Definition [Disjoint-Set Data Structure (Union-Find)]: Union-Find maintains a collection Σ of disjoint sets. For two sets X and Y, X, Y ∈ Σ ⟹ X ∩ Y = ∅. Each set X ∈ Σ typically has a designated "leader" element x ∈ X which is used to "name" the set. The data structure maintains the collection Σ and answers the following queries:
• Make-set(e): Σ ⟵ Σ ∪ {{e}}. Adds a new set {e} to the collection Σ.
• Union(X, Y): Σ ⟵ (Σ − {X, Y}) ∪ {X ∪ Y}. Removes the individual sets X and Y and replaces them with the union of X and Y.
• Find-set(e): Returns the set X ∈ Σ such that e ∈ X. Since sets in Σ are named by their leaders, this returns the leader representing the set.
Theorem [Operations on a disjoint-set data structure] (Tarjan 1975): Any sequence of m operations on n sets can be performed in O(m α(m, n)) time, where α is the inverse Ackermann function (a really, really slowly growing function).
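As a concrete illustration (a textbook union-find sketch in C, not the Nondeterminator's actual code), union by rank plus path compression is what yields the O(m α(m, n)) bound above; the array representation and init_sets are choices made only for this sketch.

#include <stdlib.h>

static int *parent, *rnk;               /* one entry per element (procedure ID) */

void init_sets(int n) {                 /* allocate room for n elements */
    parent = malloc(n * sizeof(int));
    rnk    = malloc(n * sizeof(int));
}

void make_set(int x) {                  /* Make-set(x): new singleton set {x} */
    parent[x] = x;
    rnk[x] = 0;
}

int find_set(int x) {                   /* Find-set(x): leader of x's set, with path compression */
    if (parent[x] != x)
        parent[x] = find_set(parent[x]);
    return parent[x];
}

void union_sets(int x, int y) {         /* Union: merge the sets containing x and y */
    x = find_set(x);
    y = find_set(y);
    if (x == y) return;
    if (rnk[x] < rnk[y]) { int t = x; x = y; y = t; }   /* union by rank */
    parent[y] = x;
    if (rnk[x] == rnk[y]) rnk[x]++;
}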
The SP-Bags Algorithm
The SP-Bags algorithm is the algorithm used by the Nondeterminator, and it performs two types of operations. The first type updates the S- and P-bags of the procedures on the call stack; it is triggered during the depth-first traversal of the SP parse tree.
• spawn procedure F: SF ⟵ Make-set(F) (F is the leader); PF ⟵ ∅
• sync in procedure F: SF ⟵ Union(SF, PF); PF ⟵ ∅
• return from F' to F (F' is spawned): PF ⟵ Union(PF, SF') (note that PF' must be empty at this point)
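To make the three maintenance operations concrete, here is a C sketch layered on the union-find sketch above. The Frame struct, the kind[] array, and MAX_PROCS are inventions for this sketch (the Nondeterminator's real bookkeeping differs in detail); a bag is represented by the leader of its disjoint set, tagged with whether it is currently an S-bag or a P-bag.

#define MAX_PROCS 1000000               /* hypothetical upper bound on procedure IDs */

enum BagKind { S_BAG, P_BAG };
static enum BagKind kind[MAX_PROCS];    /* kind of a bag, indexed by its set leader */

/* Per-procedure frame: leaders of this procedure's S-bag and P-bag (-1 = empty). */
typedef struct { int sbag; int pbag; } Frame;

/* From the union-find sketch above. */
void make_set(int x);
int  find_set(int x);
void union_sets(int x, int y);

/* spawn procedure F (fresh ID f):  S_F <- {F},  P_F <- empty */
void on_spawn(Frame *F, int f) {
    make_set(f);
    kind[f] = S_BAG;
    F->sbag = f;
    F->pbag = -1;
}

/* sync in procedure F:  S_F <- Union(S_F, P_F),  P_F <- empty */
void on_sync(Frame *F) {
    if (F->pbag != -1) {
        union_sets(F->sbag, F->pbag);
        F->sbag = find_set(F->sbag);    /* leader of the merged set */
        kind[F->sbag] = S_BAG;
        F->pbag = -1;
    }
}

/* spawned child Fc returns to parent F:  P_F <- Union(P_F, S_Fc)  (P_Fc is empty) */
void on_return(Frame *F, Frame *Fc) {
    if (F->pbag == -1)
        F->pbag = Fc->sbag;
    else {
        union_sets(F->pbag, Fc->sbag);
        F->pbag = find_set(F->pbag);
    }
    kind[F->pbag] = P_BAG;
}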
The SP-Bags Algorithm
The second type of operation uses the SP-bags data structure to detect determinacy races when the user program accesses a memory location. It consults two shadow spaces:
• writer[v]: the ID of the last procedure that wrote to v.
• reader[v]: the ID of a procedure that read v (not necessarily the last one).
The access rules are:
• write location v by procedure F: if Find-set(reader[v]) is a P-bag or Find-set(writer[v]) is a P-bag, then report a race; writer[v] ⟵ F.
• read location v by procedure F: if Find-set(writer[v]) is a P-bag, then report a race; if Find-set(reader[v]) is an S-bag, then reader[v] ⟵ F (replace the reader only when it is in an S-bag).
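Continuing the same sketch, the access checks can be written as follows. Here writer_of[], reader_of[], MAX_LOCS, report_race, and the -1 "no procedure yet" sentinel are all inventions of this sketch, and v is a dense index for a monitored location rather than an address; a real tool's instrumented load/store hooks would map an address to such an index and call on_read or on_write before performing the access.

#define MAX_LOCS 1000000                /* hypothetical number of monitored locations */
static int writer_of[MAX_LOCS];         /* shadow space: last writer's procedure ID */
static int reader_of[MAX_LOCS];         /* shadow space: some previous reader's ID  */

void report_race(int v);                /* assumed to exist: records or prints the race */

void init_shadow(void) {                /* mark every location as not yet accessed */
    for (int i = 0; i < MAX_LOCS; i++)
        writer_of[i] = reader_of[i] = -1;
}

/* write location v by procedure with ID f */
void on_write(int v, int f) {
    if ((reader_of[v] != -1 && kind[find_set(reader_of[v])] == P_BAG) ||
        (writer_of[v] != -1 && kind[find_set(writer_of[v])] == P_BAG))
        report_race(v);
    writer_of[v] = f;
}

/* read location v by procedure with ID f */
void on_read(int v, int f) {
    if (writer_of[v] != -1 && kind[find_set(writer_of[v])] == P_BAG)
        report_race(v);
    if (reader_of[v] == -1 || kind[find_set(reader_of[v])] == S_BAG)
        reader_of[v] = f;               /* replace the reader only when it is in an S-bag */
}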
Justification of the SP-Bags Algorithm
• spawn procedure F: SF ⟵ Make-set(F) (F is the leader); PF ⟵ ∅
Recall:
• S-bag SF: contains the IDs of F's completed descendants (including F itself) that logically precede the currently executing strand.
• P-bag PF: contains the IDs of F's completed descendants that operate logically in parallel with the currently executing strand.
This operation is valid since the S-bag of F by definition contains F itself, and F has no completed descendants yet.
Justification of the SP-Bags Algorithm
• sync in procedure F: SF ⟵ Union(SF, PF); PF ⟵ ∅
After a sync, we switch from some strand e' right before the sync to a strand e right after it. Before the sync, PF contains the IDs of F's completed descendants that operate logically in parallel with e'. These procedures must now operate in series with e (and with anything else that F spawns afterwards). Thus it is valid to move the IDs in PF into SF.
Justification of the SP-Bags Algorithm
• return from F' to F (F' is spawned): PF ⟵ Union(PF, SF') (note that PF' must be empty at this point)
Before a procedure F' returns, PF' is empty, since there is always an implicit sync at the end of F'. Also, SF' contains F' and all of its completed descendants, which are also descendants of F, and these can now execute in parallel with any procedures that F might spawn in the future (before the next sync).
Q: What if F' is called but not spawned?
Justification of the SP-Bags Algorithm
To understand the second type of operations, we need some lemmas first.
Recall Lemma 1 [LCA in SP tree]: e ‖ e' if and only if LCA(e, e') is a P-node.
Lemma 2: Let strands e1, e2, and e3 execute serially in that order. If e1 ≺ e2 and e1 ‖ e3, then e2 ‖ e3.
Proof: Since e2 executes before e3 in the serial order, either e2 ≺ e3 or e2 ‖ e3. Suppose for the sake of contradiction that e2 ≺ e3. Then, by transitivity, e1 ≺ e2 ≺ e3 gives e1 ≺ e3, contradicting e1 ‖ e3. Hence e2 ‖ e3.
Note that the parallel relation ‖, unlike the precedes relation ≺, is not transitive.
[Figure: a parse tree whose root is a P-node with children e2 and an S-node; the S-node's children are e1 and e3.]
In this tree, e1 ‖ e2 and e2 ‖ e3, but e1 ≺ e3.
Justification of the SP-Bags Algorithm
Lemma 3 [Pseudotransitivity of ‖]: Let strands e1, e2, and e3 execute serially in that order. If e1 ‖ e2 and e2 ‖ e3, then e1 ‖ e3.
Proof: Since we do a depth-first traversal of the tree, the only possible tree shapes that put the strands in this serial order are:
[Figure: two tree shapes. In the first, the root a1 has a2 inside its left subtree, with e1 and e2 under a2 and e3 in a1's right subtree, so a2 = LCA(e1, e2) and a1 = LCA(e2, e3) = LCA(e1, e3). In the second, the root a1 has e1 in its left subtree and a2 in its right subtree, with e2 and e3 under a2, so a1 = LCA(e1, e2) = LCA(e1, e3) and a2 = LCA(e2, e3).]
In both cases, LCA(e1, e2) and LCA(e2, e3) are P-nodes. So LCA(e1, e3), which is a1 in both cases, must also be a P-node.
Recall Lemma 1 [LCA in SP tree]: e ‖ e' if and only if LCA(e, e') is a P-node.
Justification of the SP-Bags Algorithm
Define h(a) to be the procedure that immediately encloses strand a.
Lemma 4 [SP-bags maintenance]: Let e1 be executed before e2, and let a = LCA(e1, e2) in the SP parse tree.
• If e1 ≺ e2, then h(e1) is in h(a)'s S-bag when e2 executes.
• If e1 ‖ e2, then h(e1) is in h(a)'s P-bag when e2 executes.
Proof sketch: Case 1: Here a is an S-node, so a must belong either to the spine or to the interior of a sync block.
[Figure: the canonical SP parse tree, with a an S-node on the spine; e1 lies in a's left subtree and e2 in a's right subtree.]
If a belongs to the spine, then e1 belongs to a's left subtree and e2 to a's right subtree. Either h(e1) = h(a) or h(e1) is a descendant of h(a). If h(e1) = h(a), then h(e1) is already in h(a)'s S-bag.
If h(e1) is not h(a), then h(e1) moves up into some bag on the call stack each time one of its ancestors returns. Once the sync corresponding to a's left subtree executes, h(e1) moves into h(a)'s S-bag (and stays there).
Case 2: Here a is a P-node.
[Figure: the canonical SP parse tree, with a a P-node inside a sync block; e1 lies in a's left subtree (the spawned procedure) and e2 in a's right subtree.]
In this case, a must be within a sync block, with e1 in a's left subtree and e2 in its right subtree. The left child of a P-node is always a spawned procedure F', which gets placed into h(a)'s P-bag when F' returns. Since no sync has occurred yet when e2 executes, F' (and with it h(e1)) must still be in a P-bag.
Proof of the SP-Bags Race Detection
Theorem [SP-Bags correctness]: The SP-Bags algorithm reports a race in a Cilk computation if and only if a determinacy race exists.
Proof sketch: The (⟹) direction is straightforward. If SP-Bags reports a race, it has detected two logically parallel strands that access the same memory location in a conflicting way. Thus, if it reports a race, a determinacy race exists.
The (⟸) direction is trickier. We want to show that if a determinacy race exists, the SP-Bags algorithm reports one. Let e1 ‖ e2 have a race on v, and assume e1 executes before e2 in the serial execution. If there are several races, choose the race whose second strand e2 executes earliest in the serial order.
Case 1: Say e1 writes to v and e2 reads it. Suppose that when e2 executes, writer[v] = h(e) for some strand e. If e = e1, then we are done, since h(e1) is in a P-bag of LCA(e1, e2) (by Lemma 4, SP-bags maintenance), so e2's read reports a race. If e ≠ e1, then e must have executed after e1 and before e2. Either e1 ≺ e, in which case e ‖ e2 by Lemma 2* and a race is still reported; or e1 ‖ e, in which case there is already a race between e1 and e, contradicting our assumption that e2 is the earliest race.
* Lemma 2: Let strands e1, e2, and e3 execute serially in order. If e1 ≺ e2 and e1 ‖ e3, then e2 ‖ e3.
Case 2: Say e1 writes to v and e2 also writes to it. This is similar to Case 1.
Case 3: Say e1 reads v and e2 writes it. Again, suppose that when e2 executes, reader[v] = h(e) for some strand e. If e = e1, then we are done, since h(e1) is in a P-bag of LCA(e1, e2) (by Lemma 4, SP-bags maintenance). So assume e ≠ e1. There are two possibilities.
Case 3.1: reader[v] was e1 at some point, but eventually got overwritten by e (there may have been some e' in reader[v] between e1 and e). This can occur only if e1 ≺ e. Since e1 ≺ e and e1 ‖ e2, we have e ‖ e2 (again by Lemma 2*), so a race is reported.
Case 3.2: reader[v] was never updated to be e1. Suppose that when e1 executes, reader[v] = h(e') for some strand e'. Then it must be that e' ‖ e1, or else reader[v] would have been updated to e1. Then, since e', e1, and e2 execute serially in that order, pseudotransitivity of ‖ gives e' ‖ e2, and a race is reported.
* Pseudotransitivity of ‖: Let e1, e2, and e3 execute serially in that order. If e1 ‖ e2 and e2 ‖ e3, then e1 ‖ e3.
What We Need in a Determinacy-Race Detector
• An SP-bags data structure, maintaining the series-parallel ordering of strands.
• A shadow space that contains:
  – the last writer of a location v; and
  – the last serial reader of a location v (we drop the parallel readers entirely).
Question: Can we extend the SP-Bags algorithm to race-detect a Cilk computation executing in parallel?
Where Things Break
• The SP-bags data structure maintenance is inherently serial: it keeps track of procedure IDs that are in series or in parallel with respect to the "currently executing strand."
• The shadow memory only keeps track of the last serial reader (that the execution encounters), which is insufficient.
On-the-Fly Maintenance of Series-Parallel Relationships
The English-Hebrew orderings:
Figure 1: A dag representing a multithreaded computation. The edges represent threads, labeled u0, u1, ..., u8. The diamonds represent forks, and the squares indicate joins.
Figure 2: The parse tree for the computation dag shown in Figure 1. The leaves are the threads in the dag. The S-nodes indicate series relationships, and the P-nodes indicate parallel relationships.
[...] present unless the corresponding left subtree has been fully elaborated. Both subtrees of a P-node, however, can be partially elaborated. In a language like Cilk, a serial execution unfolds the parse tree in the manner of a left-to-right walk. For example, in Figure 2, a serial execution executes the threads in the order of their indices.
A typical serial, on-the-fly data-race detector simulates the execution of the program as a left-to-right walk of the parse tree while maintaining various data structures for determining the existence of races. The core data structure maintains the series-parallel relationships between the currently executing thread and previously executed threads. Specifically, the race detector must determine whether the current thread is operating logically in series or in parallel with certain previously executed threads. We call a dynamic data structure that maintains the series-parallel relationship between threads an SP-maintenance data structure. The data structure supports insertion, deletion, and SP queries: queries as to whether two nodes are logically in series or in parallel.
The Nondeterminator [13, 20] race detectors use a variant of Tarjan's [30] least-common-ancestor algorithm as the basis of their SP-maintenance data structure. To determine whether a thread ui logically precedes a thread uj, denoted ui ≺ uj, their SP-bags algorithm can be viewed intuitively as inspecting their least common ancestor lca(ui, uj) in the parse tree to see whether it is an S-node with ui in its left subtree. Similarly, to determine whether a thread ui operates logically in parallel with a thread uj, denoted ui ‖ uj, the SP-bags algorithm checks whether lca(ui, uj) is a P-node. Observe that an SP relationship exists between any two nodes in the parse tree, not just between threads (leaves).
For example, in Figure 2, we have u1 ≺ u4, because S1 = lca(u1, u4) is an S-node and u1 appears in S1's left subtree. We also have u1 ‖ u6, because P1 = lca(u1, u6) is a P-node. The (serially executing) Nondeterminator race detectors perform SP-maintenance operations whenever the program being tested forks, joins, or accesses a shared-memory location. The amortized cost for each of these operations is O(α(v, v)), where α is Tarjan's functional inverse of Ackermann's function and v is the number of shared-memory locations used by the program. As a consequence, the asymptotic running time of the Nondeterminator is O(T1 α(v, v)), where T1 is the running time of the original program on 1 processor.

Algorithm              Space per node   Time per thread creation   Time per query
English-Hebrew [27]    Θ(f)             Θ(1)                       Θ(f)
Offset-Span [26]       Θ(d)             Θ(1)                       Θ(d)
SP-Bags [20]           Θ(1)             Θ(α(v, v))                 Θ(α(v, v))
SP-Order               Θ(1)             Θ(1)                       Θ(1)

f = number of forks in the program
d = maximum depth of nested parallelism
v = number of shared locations being monitored

Figure 3: Comparison of serial SP-maintenance algorithms. The running times of the English-Hebrew and offset-span algorithms are worst-case bounds, and the SP-bags and SP-order algorithms are amortized. The function α is Tarjan's functional inverse of Ackermann's function.

The SP-bags data structure has two shortcomings. The first is that it slows the asymptotic running time by a factor of α(v, v). This factor is nonconstant in theory but is nevertheless close enough to constant in practice that this deficiency is minor. The second, more important shortcoming is that the SP-bags algorithm relies heavily on the serial nature of its execution, and hence it appears difficult to parallelize.
Some early SP-maintenance algorithms use labeling schemes without centralized data structures. These labeling schemes are easy to parallelize but unfortunately are much less efficient than the SP-bags algorithm. Examples of such labeling schemes include the English-Hebrew scheme [27] and the offset-span scheme [26]. These algorithms generate labels for each thread on the fly, but once generated, the labels remain static. By comparing labels, these SP-maintenance algorithms can determine whether two threads operate logically in series or in parallel. One of the reasons for the inefficiency of these algorithms is that label lengths increase linearly with the number of forks (English-Hebrew) or with the depth of fork nesting (offset-span).

Results

In this paper we introduce a new SP-maintenance algorithm, called the SP-order algorithm, which is more efficient than the SP-bags algorithm. This algorithm is inspired by the English-Hebrew scheme, but rather than using static labels, the labels are maintained by an order-maintenance data structure [10, 15, 17, 33]. Figure 3 compares the serial space and running times of SP-order with the other algorithms. As can be seen from the table, SP-order attains asymptotic optimality.
We also present a parallel SP-maintenance algorithm which is designed to run with a Cilk-like work-stealing scheduler [12, 21]. Our SP-hybrid algorithm consists of two tiers: a global tier based on our SP-order algorithm, and a local tier based on the Nondeterminator's SP-bags algorithm. Suppose that a fork-join program has n threads, T1 work, and a critical-path length of T∞. Whereas the Cilk scheduler executes a computation with work T1 and critical-path length T∞ in asymptotically optimal TP = O(T1/P + T∞) expected time on P processors, SP-hybrid executes the computation in O((T1/P + PT∞) lg n) time on P processors while maintaining SP relationships. Thus, whereas the underlying computation achieves linear speedup when P = O(T1/T∞), SP-hybrid achieves linear speed-up when P = O(√(T1/T∞)), but the work is increased by a factor of O(lg n).
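A quick check of the last claim, using nothing beyond the bound just stated: the O((T1/P + P T∞) lg n) bound is dominated by its T1/P term, and hence gives linear speedup, exactly while
  T1/P ≥ P T∞   ⟺   P² ≤ T1/T∞   ⟺   P ≤ √(T1/T∞).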
The nodes in the left subtree of an S-node always precede those in the right subtree.
English order: the nodes in the left subtree of a P-node precede those in the right subtree.
Hebrew order: the nodes in the right subtree of a P-node precede those in the left subtree.
[Figure: the threads of Figure 2 labeled with their (English, Hebrew) label pairs.]
Under an S-node: E[uleft] < E[uright] and H[uleft] < H[uright].
Under a P-node: E[uleft] < E[uright] but H[uleft] > H[uright].
Observation #1: ui ≺ uj if and only if E[ui] < E[uj] and H[ui] < H[uj]; equivalently, ui ‖ uj if and only if the English and Hebrew orders disagree about ui and uj.
Question: Can we maintain the two labelings on the fly as the computation executes?
Observation #2: One doesn't need to assign specific labels to each strand; a relative ordering suffices.
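A minimal sketch of how a detector would use such labels, assuming each strand carries an (English, Hebrew) pair of integer ranks; the EHLabel struct and the function names are made up for this sketch.

typedef struct { int eng; int heb; } EHLabel;   /* (English rank, Hebrew rank) of a strand */

/* e1 precedes e2 iff both orders agree that e1 comes first. */
int eh_precedes(EHLabel a, EHLabel b) { return a.eng < b.eng && a.heb < b.heb; }

/* e1 and e2 are logically in parallel iff the two orders disagree. */
int eh_parallel(EHLabel a, EHLabel b) { return (a.eng < b.eng) != (a.heb < b.heb); }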
The SP-Order Algorithm

SP-Order(X):                       // X is a node in the SP tree
  if IsLeaf(X)
    execute strand X
    return
  // otherwise X is an internal node
  OM-Insert(Eng, X, left[X], right[X])
  if IsSNode(X)
    OM-Insert(Heb, X, left[X], right[X])
  else
    OM-Insert(Heb, X, right[X], left[X])
  SP-Order(left[X])
  SP-Order(right[X])

To detect a race between two strands, check whether they appear in the same relative order in both Eng and Heb: they are logically in parallel exactly when the two orders disagree.
OM-Insert(L, X, Y1, Y2): in the ordering L, insert the new elements Y1 and Y2 immediately after X (Y1 first, then Y2).
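The order-maintenance structure is what makes these inserts and queries cheap. As a stand-in, here is a deliberately naive doubly-linked-list version in C; the names om_node, om_insert_after, om_precedes, and sp_parallel are made up for this sketch, and a real order-maintenance structure answers om_precedes in O(1) amortized time instead of by walking the list. Each SP-tree node would carry two om_nodes, one for Eng and one for Heb.

#include <stddef.h>

typedef struct om_node { struct om_node *prev, *next; } om_node;

/* OM-Insert(L, x, y1, y2): splice y1 and then y2 immediately after x. */
void om_insert_after(om_node *x, om_node *y1, om_node *y2) {
    y2->prev = y1;  y2->next = x->next;
    y1->prev = x;   y1->next = y2;
    if (x->next) x->next->prev = y2;
    x->next = y1;
}

/* Does a come before b in this ordering? (Naive O(n) walk.) */
int om_precedes(const om_node *a, const om_node *b) {
    for (const om_node *p = a->next; p != NULL; p = p->next)
        if (p == b) return 1;
    return 0;
}

/* SP query: X and Y are logically in parallel iff Eng and Heb disagree. */
int sp_parallel(const om_node *x_eng, const om_node *y_eng,
                const om_node *x_heb, const om_node *y_heb) {
    return om_precedes(x_eng, y_eng) != om_precedes(x_heb, y_heb);
}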
Figure 6: An illustration of how SP-order operates at an S-node. (a) A simple parse tree with an S-node S and two children L and R. (b) The order structures before traversing to S. The clouds represent the rest of the order structure, which does not change when traversing to S. (c) The result of the inserts after traversing to S. The left child L and then the right child R are inserted after S in both lists.
Figure 7: An illustration of how SP-order operates at a P-node. (a) A simple parse tree with a P-node P and two children L and R. (b) The order structures before traversing to P. The clouds are the rest of the order structure, which does not change when traversing to P. (c) The result of the inserts after traversing to P. The left child L then the right child R are inserted after P in the English order, and R then L are inserted after P in the Hebrew order.
Corollary 6. Consider a fork-join multithreaded program with running time T1 on a single processor. Then, a determinacy-race detector using SP-order runs in O(T1) time.
To conclude this section, we observe that SP-order can be made to work on the fly no matter how the input SP parse tree unfolds. Not only can lines 8–9 of Figure 5 be executed in either order, the basic recursive call could be executed on nodes in any order that respects the parent-child and SP relationships. For example, one could unfold the parse tree in essentially breadth-first fashion at P-nodes as long as the left subtree of an S-node is fully expanded before its right subtree is processed. An examination of the proof of Lemma 3 shows why we have this flexibility. The invariant in the proof considers only a node and its children. If we expand any single node, its children are inserted into the order-maintenance data structures in the proper place independent of what other nodes have been expanded.
3 The SP-hybrid algorithm
This section describes the structure of the SP-hybrid parallel SP-maintenance algorithm. We begin by discussing how an SP parse tree is provided as input to SP-hybrid and explaining some of the properties of Cilk that SP-hybrid exploits. We then describe the two-tier structure of the algorithm, which combines elements of SP-order from Section 2 and SP-bags from [20]. We investigate the synchronization issues that must be faced in order to parallelize SP-order and why a naive parallelization does not yield good bounds. We then overview SP-hybrid itself and present pseudocode for its implementation.
SP-hybrid’s input and Cilk
Like the SP-order algorithm, the SP-hybrid algorithm accepts as input a fork-join multithreaded program expressed as an SP parse tree. The algorithm SP-hybrid provides weaker query semantics than the serial SP-order algorithm; these semantics are exactly what is required for on-the-fly determinacy-race detection. Whereas SP-order allows queries of any two threads that have been unfolded in the parse tree, SP-hybrid requires that one of the threads be a currently executing thread. For a fork-join program with n threads, T1 work, and a critical path of length T∞, the parallel SP-hybrid algorithm can be made to run (in Cilk) in O((T1/P + PT∞) lg n) expected time.
Although SP-hybrid provides these performance bounds for any fork-join program, it can only operate "on the fly" for programs whose parse trees unfold in a Cilk-like manner. Specifically, SP-hybrid is described and analyzed as a Cilk program, and as such, it takes advantage of two properties of the Cilk scheduler to ensure efficient execution. First, any single processor unfolds the parse tree left-to-right. Second, it exploits the properties of Cilk's "work-stealing" scheduler, both for correctness and efficiency. Although SP-hybrid operates correctly and efficiently on the a posteriori SP parse tree for any fork-join program, it only operates "on the fly" when the parse tree unfolds similar to a Cilk computation.
Cilk employs a "work-stealing" scheduler, which executes any multithreaded computation having work T1 and critical-path length T∞ in O(T1/P + T∞) expected time on P processors, which is asymptotically optimal. The idea behind work stealing is that when a processor runs out of its own work to do, it "steals" work from another processor. Thus, the steals that occur during a Cilk computation break the computation, and hence the computation's SP parse tree, into a set of "traces," where each trace consists of a set of threads all executed by the same processor. These traces have additional structure imposed by Cilk's scheduler. Specifically, whenever a thief processor steals work from a victim processor, the work stolen corresponds to the right subtree of the P-node that is highest in the SP parse tree walked by the victim. Cilk's scheduler provides an upper bound of O(PT∞) steals with high probability.
A naive parallelization of SP-order
A straightforward way to parallelize the SP-order algorithm is to share the SP-order data structure among the processors that are executing the input fork-join program. The problem that arises, however, is that processors may interfere with each other as they modify the data structure, and thus some method of synchronization must be employed to provide mutual exclusion.
A common way to handle mutual exclusion is through the use of locks. For example, suppose that each processor obtains a global lock prior to every OM-INSERT or OM-PRECEDES operation on the shared SP-order data structure, releasing the lock when the operation is complete. Although this parallel version of SP-order is correct, the locking can introduce significant performance penalties.
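A minimal sketch of what this naive locking looks like, assuming a pthread mutex; om_node, om_insert_after, and om_precedes are the hypothetical names from the order-maintenance sketch above.

#include <pthread.h>

/* One global lock serializes every operation on the shared order structures,
   so up to P-1 processors can stall behind a single insert or query. */
static pthread_mutex_t om_lock = PTHREAD_MUTEX_INITIALIZER;

void locked_om_insert(om_node *x, om_node *y1, om_node *y2) {
    pthread_mutex_lock(&om_lock);
    om_insert_after(x, y1, y2);          /* from the earlier sketch */
    pthread_mutex_unlock(&om_lock);
}

int locked_om_precedes(const om_node *a, const om_node *b) {
    pthread_mutex_lock(&om_lock);
    int r = om_precedes(a, b);
    pthread_mutex_unlock(&om_lock);
    return r;
}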
Consider a parallel execution of this naive parallel SP-order algorithm on P processors. During a single operation by a processor on the shared SP-order data structure, all P − 1 other processors may stall while waiting for the lock required to perform their own operations. Let us assume, as is reasonable, that no processor waits on a lock unless another processor owns the lock. Thus, if we attribute the cost of waiting for a lock to the processor that owns the [...]
Figure 6: An illustration of how SP-order operates at an S-node. (a) Asimple parse tree with an S-node S and two children L and R. (b) Theorder structures before traversing to S. The clouds represent the rest of theorder structure, which does not change when traversing to S. (c) The resultof the inserts after traversing to S. The left child L and then the right childR are inserted after S in both lists.
Figure 7: An illustration of how SP-order operates at a P-node. (a) Asimple parse tree with an P-node P and two children L and R. (b) Theorder structures before traversing to P . The clouds are the rest of the orderstructure, which does not change when traversing to P . (c) The result ofthe inserts after traversing to P . The left child L then the right child R areinserted after P in the English order, and R then L are inserted after P inthe Hebrew order.
Corollary 6. Consider a fork-join multithreaded program withrunning time T1 on a single processor. Then, a determinacy-racedetector using SP-order runs in O(T1) time.
To conclude this section, we observe that SP-order can be madeto work on the fly no matter how the input SP parse tree unfolds.Not only can lines 8–9 of Figure 5 be executed in either order, thebasic recursive call could be executed on nodes in any order thatrespects the parent-child and SP relationships. For example, onecould unfold the parse tree in essentially breadth-first fashion at P-nodes as long as the left subtree of an S-node is fully expandedbefore its right subtree is processed. An examination of the proofof Lemma 3 shows why we have this flexibility. The invariant inthe proof considers only a node and its children. If we expand anysingle node, its children are inserted into the order-maintenancedata structures in the proper place independent of what other nodeshave been expanded.
3 The SP-hybrid algorithm
This section describes the structure of the SP-hybrid parallel SP-maintenance algorithm. We begin by discussing how an SP parsetree is provided as input to SP-hybrid and explaining some of theproperties of Cilk that SP-hybrid exploits. We then describe thetwo-tier structure of the algorithm, which combines elements of
SP-order from Section 2 and SP-bags from [20]. We investigate thesynchronization issues that must be faced in order to parallelize SP-order and why a naive parallelization does not yield good bounds.We then overview SP-hybrid itself and present pseudocode for itsimplementation.
SP-hybrid’s input and Cilk
Like the SP-order algorithm, the SP-hybrid algorithm accepts as input a fork-join multithreaded program expressed as an SP parse tree. The algorithm SP-hybrid provides weaker query semantics than the serial SP-order algorithm; these semantics are exactly what is required for on-the-fly determinacy-race detection. Whereas SP-order allows queries of any two threads that have been unfolded in the parse tree, SP-hybrid requires that one of the threads be a currently executing thread. For a fork-join program with n threads, T1 work, and a critical path of length T∞, the parallel SP-hybrid algorithm can be made to run (in Cilk) in O((T1/P + PT∞) lg n) expected time.
Although SP-hybrid provides these performance bounds for any fork-join program, it can only operate "on the fly" for programs whose parse trees unfold in a Cilk-like manner. Specifically, SP-hybrid is described and analyzed as a Cilk program, and as such, it takes advantage of two properties of the Cilk scheduler to ensure efficient execution. First, any single processor unfolds the parse tree left to right. Second, it exploits the properties of Cilk's "work-stealing" scheduler, both for correctness and efficiency. Although SP-hybrid operates correctly and efficiently on the a posteriori SP parse tree for any fork-join program, it only operates on the fly when the parse tree unfolds similarly to a Cilk computation.
Cilk employs a "work-stealing" scheduler, which executes any multithreaded computation having work T1 and critical-path length T∞ in O(T1/P + T∞) expected time on P processors, which is asymptotically optimal. The idea behind work stealing is that when a processor runs out of its own work to do, it "steals" work from another processor. Thus, the steals that occur during a Cilk computation break the computation, and hence the computation's SP parse tree, into a set of "traces," where each trace consists of a set of threads all executed by the same processor. These traces have additional structure imposed by Cilk's scheduler. Specifically, whenever a thief processor steals work from a victim processor, the work stolen corresponds to the right subtree of the P-node that is highest in the SP parse tree walked by the victim. Cilk's scheduler provides an upper bound of O(PT∞) steals with high probability.
A naive parallelization of SP-order
A straightforward way to parallelize the SP-order algorithm is to share the SP-order data structure among the processors that are executing the input fork-join program. The problem that arises, however, is that processors may interfere with each other as they modify the data structure, and thus some method of synchronization must be employed to provide mutual exclusion.
A common way to handle mutual exclusion is through the use of locks. For example, suppose that each processor obtains a global lock prior to every OM-INSERT or OM-PRECEDES operation on the shared SP-order data structure, releasing the lock when the operation is complete. Although this parallel version of SP-order is correct, the locking can introduce significant performance penalties.
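As a concrete illustration of this naive scheme (not the paper's implementation), the sketch below serializes the OrderList operations from the earlier sketch through one global lock; the function names are illustrative.

import threading

_global_lock = threading.Lock()

def locked_om_insert(order_list, x, y):
    # Every processor must acquire the same lock for every insert.
    with _global_lock:
        order_list.insert_after(x, y)

def locked_om_precedes(order_list, x, y):
    # Even read-only precedence queries serialize through the lock.
    with _global_lock:
        return order_list.precedes(x, y)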
Consider a parallel execution of this naive parallel SP-order algorithm on P processors. During a single operation by a processor on the shared SP-order data structure, all P − 1 other processors may stall while waiting for the lock required to perform their own operations. Let us assume, as is reasonable, that no processor waits on a lock unless another processor owns the lock. Thus, if we attribute the cost of waiting for a lock to the processor that owns the
[Slide: the English and Hebrew order lists before and after executing a P node, where L and R are the left and right children of the P-node in the SP parse tree.]
Naïve parallelization: the Order Maintenance data structure becomes a scalability bottleneck.
SP-Hybrid
• Recall: between successful steals, each worker's behavior mirrors the serial execution. A trace is the execution done by a worker between steals.
• A two-tier scheme (see the sketch below):
  – global tier: use a global Order Maintenance data structure to maintain the ordering between traces. A clever design of a concurrent data structure allows one to query the data structure without locking.
  – local tier: within a trace, query SP relationships using the SP-bags data structure.
Challenge: traces are defined dynamically as steals occur, so how do we keep track of that?
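A minimal sketch of the two-tier query, assuming helper functions with these hypothetical names: find_trace(e) returns the trace containing strand e (the local-tier disjoint-set structure), sp_bags_parallel answers the query within a single trace, and trace_parallel consults the global English/Hebrew order-maintenance structures. This is an illustration of the idea, not SP-hybrid's actual code.

def logically_parallel(e_done, e_current):
    t1, t2 = find_trace(e_done), find_trace(e_current)
    if t1 is t2:
        # Local tier: both strands were executed by the same worker between
        # steals, so that trace's serial SP-bags structure answers the query.
        return sp_bags_parallel(e_done, e_current)
    # Global tier: different traces; their relationship is determined by the
    # ordering of the traces in the global OM lists (queried without locking).
    return trace_parallel(t1, t2)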
Splitting Traces On-the-Fly
Figure 11: The split of a trace U around a P-node X in terms of a canonical Cilk parse tree. The tree walk of U is executing in left[X] when the subtree rooted at right[X] is stolen by a thief processor. The shaded regions contain the nodes belonging to each of the subtraces produced by the split. The two circles not enclosing any text indicate portions of the parse tree that have not yet been visited by the tree walk of U.
Figure 12: An ordering of the new traces resulting from a steal as shown in Figure 11. Each circle represents a trace.
single processor working on a trace. The FIND-TRACE operation, however, may be executed by any processor, and thus the implementation must operate correctly in the face of multiple FIND-TRACE operations.
The implementation of SP-bags proposed in [20] uses the classical disjoint-set data structure with "union by rank" and "path compression" heuristics [14, 29, 31]. On a single processor, this data structure allows all local-tier operations to be supported in amortized O(α(m, n)) time, where α is Tarjan's functional inverse of Ackermann's function, m is the number of local-tier operations, and n is the number of threads. Moreover, the worst-case time for any operation is O(lg n).
The classical disjoint-set data structure does not work "out of the box" when multiple FIND-TRACE operations execute concurrently. The reason is that although these operations are queries, the path-compression heuristic modifies the data structure, potentially causing concurrent operations to interfere.⁸ Consequently, our implementation of the local tier uses the disjoint-set data structure with union by rank only, which supports each operation in O(lg n) worst-case time.
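The following is a minimal sketch of such a disjoint-set structure with union by rank and no path compression, so that the find operation (FIND-TRACE) is read-only and can safely run concurrently with other finds; the class and field names are illustrative, not the paper's.

class Node:
    def __init__(self, trace):
        self.parent = None      # root nodes have parent None
        self.rank = 0
        self.trace = trace      # only meaningful at a root

def find(x):
    # FIND-TRACE: never modifies the structure, so concurrent finds are safe.
    while x.parent is not None:
        x = x.parent
    return x

def union(x, y):
    # Executed only by the single worker that owns the trace.
    rx, ry = find(x), find(y)
    if rx is ry:
        return rx
    if rx.rank < ry.rank:
        rx, ry = ry, rx
    ry.parent = rx              # hang the shallower tree under the deeper one
    if rx.rank == ry.rank:
        rx.rank += 1
    return rx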
The SP-bags implementation used by SP-hybrid follows that of [20], except that we must additionally support the SPLIT operation. At the time of a split, the subtraces U(1), U(2), and U(3) may all contain many threads. Thus, splitting them off from the trace U may take substantial work. Fortunately, SP-bags overcomes this difficulty by allowing a split to be performed in O(1) time.
⁸ In fact, concurrent path compression does not affect the correctness of the algorithm, assuming that reads and writes execute atomically. The performance analysis becomes more complicated. We conjecture that a better running time can be obtained using the classical data structure.
Consider the S- and P-bags at the time a thread in the procedure F is stolen and the five subtraces U(1), U(2), U(3), U(4), and U(5) are created. The S-bag of F contains exactly the threads in the subtrace U(1). Similarly, the P-bag of F contains exactly the threads in the subtrace U(2). The SP-bags data structure is such that moving these two bags to the appropriate subtraces requires only O(1) pointer updates. The subtrace U(3) owns all the other S- and P-bags that belonged to the original trace U, and thus nothing more need be done, since U(3) directly inherits U's threads. The subtraces U(4) and U(5) are created with empty S- and P-bags. Thus, the split can be performed in O(1) time, since only O(1) bookkeeping needs to be done, including updating pointers.
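A minimal sketch of that constant-time split, assuming each bag carries an owner pointer to its trace; the Trace and Bag classes, the s_bag/p_bag fields, and the owner pointer are illustrative bookkeeping, not the paper's code.

class Trace:
    pass

class Bag:
    def __init__(self, owner):
        self.owner = owner      # the trace this bag's threads belong to

def split(U, F):
    # Split trace U around the P-node in procedure F whose right subtree
    # was just stolen; only O(1) pointers change.
    U1, U2, U4, U5 = Trace(), Trace(), Trace(), Trace()
    U3 = U                      # U(3) aliases U and keeps every other bag
    F.s_bag.owner = U1          # the S-bag of F holds exactly U(1)'s threads
    F.p_bag.owner = U2          # the P-bag of F holds exactly U(2)'s threads
    return U1, U2, U3, U4, U5   # U(4) and U(5) start with empty bags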
6 Correctness of SP-hybrid
This section proves the correctness of the SP-hybrid algorithm. We begin by showing that the traces maintained by SP-hybrid are consistent with the subtrace properties defined in Section 5. We then prove that the traces are ordered correctly to determine SP relationships. Finally, we conclude that SP-hybrid works.
Due to the way the splits work, we can no longer prove a theorem as general as Lemma 1. That is to say, we can only accurately derive the relationship between two threads if one of them is a currently executing thread.⁹ Although this result is weaker than for the serial algorithm, we do not need anything stronger for a race detector. Furthermore, these are exactly the semantics provided by the lower-tier SP-bags algorithm.
The following lemma shows that when a split occurs, the subtraces are consistent with the subtrace properties given in Section 5.
Lemma 7. Let Ui be a trace that is split around a P-node X. Then the subtrace properties of Ui are maintained as invariants by SP-HYBRID.
Proof. The subtrace properties of Ui hold at the time of the split around the P-node X, when the subtraces were created, by definition. If a subtrace is destroyed by splitting, the property holds for that subtrace vacuously.
Consider any thread u at the time it is inserted into some trace U. Either U is a subtrace of Ui or not. If not, then the properties hold for the subtrace Ui vacuously. Otherwise, we have five cases.
Case 1: U = Ui(1). This case cannot occur. Since Ui(1) is mentioned only in lines 19–27 of Figure 8, it follows that Ui(1) is never passed to any call of SP-HYBRID. Thus, no threads are ever inserted into Ui(1).
Case 2: U = Ui(2). Like Case 1, this case cannot occur.
Case 3: U = Ui(3). We must show that Ui(3) = {u : u ∈ descendants(left[X])}. The difficulty in this case is that when the trace Ui is split, we have Ui = Ui(3), that is, Ui and Ui(3) are aliases for the same set. Thus, we must show that the invariant holds for all the already spawned instances of SP-HYBRID that took Ui as a parameter, as well as those new instances that take Ui(3) as a parameter. As it turns out, however, no new instances take Ui(3) as a parameter, because (like Cases 1 and 2) Ui(3) is neither passed to SP-HYBRID nor returned.
Thus, we are left to consider the already spawned instances of SP-HYBRID that took Ui as a parameter. One such instance is the outstanding SP-HYBRID(left[X], Ui) in line 13. If u ∈ descendants(left[X]), then we are done, and thus, we only need
⁹ Specifically, we cannot determine the relationship between threads in U(1) and U(2), but we can determine the relationship between any other two traces.
X: the P-node at which a thief steals.
• U(1): the strands that precede X.
• U(2): the strands that are in parallel with X.
• U(3): the strands in X's left subtree (currently being executed by the victim).
• U(4): the strands in X's right subtree (initially empty; will be populated by the thief).
• U(5): the strands that follow X (initially empty).
Each U(i) is a trace containing a set of strands.
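Restating the slide's subtrace definitions as code, the sketch below classifies a thread u relative to the stolen P-node X; precedes, parallel, in_subtree, left_child, and right_child are assumed helpers, and the function itself is an illustration rather than part of SP-hybrid.

def subtrace_of(u, X):
    if in_subtree(u, left_child(X)):   return "U(3)"  # being executed by the victim
    if in_subtree(u, right_child(X)):  return "U(4)"  # will be populated by the thief
    if precedes(u, X):                 return "U(1)"  # strands that precede X
    if parallel(u, X):                 return "U(2)"  # strands in parallel with X
    return "U(5)"                                     # strands that follow X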
[Slide: upon a steal, the new subtraces are inserted into the global-tier English and Hebrew order-maintenance structures.]
What We Need in a Det. Race Detector
• SP-Hybrid:
  – global tier: an Order-Maintenance data structure maintains the series-parallel ordering of traces.
  – local tier: within a single trace, query the SP-Bags data structure.
• Shadow space that contains:
  – the last writer to a location v; and
  – the last serial reader to a location v. (But we are totally dropping the parallel readers.)
Where Things Break
• The shadow memory only keeps track of the last serial reader (that the execution encounters), which is insufficient.
Recall the lemmas we need to show that the SP-Bags algorithm works correctly:
Lemma 2: Let strands e1, e2, and e3 execute serially in order. If e1 ≺ e2 and e1 ‖ e3, then e2 ‖ e3.
Lemma 3 [Pseudotransitivity of ‖]: Let strands e1, e2, and e3 execute serially in order. If e1 ‖ e2 and e2 ‖ e3, then e1 ‖ e3.
Question: When executing in parallel, what do we need to maintain in the shadow space?
Ex: Keeping One Reader Is Not Enough
Recall how to update shadow memory:
• write location v by procedure F:
  writer[v] ⟵ F; (always update the writer)
• read location v by procedure F:
  if Find-set(reader[v]) is an S-bag then reader[v] ⟵ F; (replace only a serial reader)
[Diagram: e1 and e2 are two logically parallel strands that read v; e3 is a strand that writes v.]
Say e1, e2, and e3 executed in that order in the parallel execution, where e1 and e2 read v and e3 wrote to v.
When e3 executes, reader[v] contains e1 (since e1 ‖ e2, the read by e2 did not replace it). We miss a race!
What if we always update the reader[v] with the last reader?
Then, when e3 executes, reader[v] contains e2, and we can still miss a race!
Keeping Two Readers
It turns out that it suffices to keep two readers: we just need to keep track of the "left-most" reader Rl and the "right-most" reader Rr for each memory location v.
When e reads v:
• if e comes before Rl[v] in the serial order (e ≺ Rl[v]), then Rl[v] ⟵ e;
• if e comes after Rr[v] in the serial order (Rr[v] ≺ e), then Rr[v] ⟵ e.
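A minimal sketch of the resulting shadow-space updates and race checks, under these assumptions: parallel(a, b) is an SP query (answered by SP-Bags or SP-Hybrid), serially_before(a, b) compares positions in the serial left-to-right order, and checking each access against the stored writer and both stored readers follows the two-reader scheme of [4]; all names are illustrative.

writer, Rl, Rr = {}, {}, {}     # shadow space, indexed by memory location v

def on_write(e, v):
    # A write races with any logically parallel previous writer or reader.
    for prev in (writer.get(v), Rl.get(v), Rr.get(v)):
        if prev is not None and parallel(prev, e):
            report_race(prev, e, v)
    writer[v] = e               # always record the last writer

def on_read(e, v):
    w = writer.get(v)
    if w is not None and parallel(w, e):
        report_race(w, e, v)
    # Keep the left-most and right-most readers in the serial order.
    if v not in Rl or serially_before(e, Rl[v]):
        Rl[v] = e
    if v not in Rr or serially_before(Rr[v], e):
        Rr[v] = e

def report_race(a, b, v):       # placeholder for the detector's reporting
    print(f"race on {v!r} between {a} and {b}")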
References
[1] Nondeterminator and SP-Bags algorithms: Efficient Detection of Determinacy Races in Cilk Programs by Mingdong Feng and Charles E. Leiserson. In Proceedings of the Ninth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), 1997.
[2] On parallel race detection in Cilk computations: On-the-Fly Maintenance of Series-Parallel Relationships in Fork-Join Multithreaded Programs by Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, and Charles E. Leiserson. In Proceedings of the Sixteenth ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2004.
[3] Race detection using the English-Hebrew labeling scheme: Tools for the Efficient Development of Efficient Parallel Programs by Itzhak Nudler and Larry Rudolph. In Proceedings of the First Israeli Conference on Computer Systems Engineering, 1986.
[4] Keeping two readers for parallel race detection in fork-join multithreaded programs: On-the-Fly Detection of Data Races for Programs with Nested Fork-Join Parallelism by John Mellor-Crummey. In Proceedings of Supercomputing, 1991.