Robust memory-aware mappings for parallel multifrontal factorizations
Joint work with P. R. Amestoy, E. Agullo, A. Buttari, A. Guermouche, J.-Y. L'Excellent
François-Henry Rouet, Lawrence Berkeley National Laboratory
MUMPS Users Group Meeting, May 29-30, 2013
Introduction
Motivation
• Memory is often a bottleneck for direct solvers.
• Physical memory per core tends to decrease.
• Estimating memory consumption in a dynamic scheduling context is difficult (cf. the infamous error code "-9" in MUMPS).
Context
• We want to design mapping algorithms that enforce some memory constraints and provide better memory estimates.
• We focus on multifrontal methods (e.g., MUMPS).
The multifrontal method [Duff & Reid ’83]
[Figure: 5 × 5 example matrix, A = L + U − I, with its elimination tree and the layout of the active storage: factors, active frontal matrix, and stack of contribution blocks]

Storage is divided into two parts:
• Factors
• Active memory
• Factors are incompressible and usually scale fairly; they can optionally be written on disk.
• In sequential, the traversal that minimizes active memory is known [Liu '86].
• In parallel, active memory becomes dominant. Example: share of active storage on the AUDI matrix using MUMPS 4.10.0: 11% on 1 processor, 59% on 256 processors.
The problem
Metric: memory efficiency
e(p) = S_seq / (p × S_max(p))

We would like e(p) ≈ 1, i.e., a peak of S_seq/p on each processor.
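As an illustrative sketch (the function name and the numeric values are ours, not from MUMPS), the metric can be computed directly:

```python
def memory_efficiency(s_seq, p, s_max):
    """e(p) = S_seq / (p * S_max(p)); e(p) = 1 means every one of the
    p processors peaks at exactly S_seq / p."""
    return s_seq / (p * s_max)

# Example: sequential peak 5 GB, 64 processors, worst per-processor
# peak 0.16 GB (the values used later in the proportional-mapping example).
print(memory_efficiency(5.0, 64, 0.16))  # ≈ 0.49
```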
Common mappings/schedulings provide a poor memory efficiency:
• Proportional mapping: lim_{p→∞} e(p) = 0 on regular problems.
• MUMPS relies primarily on a relaxed proportional mapping; typical efficiency: between 0.10 and 0.40. Memory estimates are unreliable.
Proportional mapping
Proportional mapping: a top-down traversal of the tree, where the processors assigned to a node are distributed among its children proportionally to the weight of their respective subtrees.
[Figure: 64 processors at the root, split among three subtrees of relative weight 40%, 10% and 50%, which receive 26, 6 and 32 processors respectively, then recursively down the tree]
• Targets run time, but poor memory efficiency.
• Usually a relaxed version is used: more memory-friendly, but unreliable estimates.
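The distribution rule above can be sketched as follows (a toy largest-remainder rounding; the real mapping handles many more constraints):

```python
def proportional_mapping(p, weights):
    """Split p processors among children proportionally to the weights
    of their subtrees. Sketch only: floor each share (at least one
    processor per child), then hand the leftover processors to the
    children with the largest fractional remainders."""
    total = sum(weights)
    shares = [p * w / total for w in weights]
    procs = [max(1, int(s)) for s in shares]
    leftover = p - sum(procs)
    by_remainder = sorted(range(len(weights)),
                          key=lambda i: shares[i] - int(shares[i]),
                          reverse=True)
    for i in by_remainder:
        if leftover <= 0:
            break
        procs[i] += 1
        leftover -= 1
    return procs

# 64 processors over subtrees of relative weight 40%, 10% and 50%:
print(proportional_mapping(64, [0.4, 0.1, 0.5]))  # [26, 6, 32]
```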
Proportional mapping
Proportional mapping: assuming that the sequential peak is 5 GB,
S_max(p) ≥ max{ 4 GB / 26, 1 GB / 6, 5 GB / 32 } ≈ 0.16 GB ⇒ e(p) ≤ 5 / (64 × 0.16) ≈ 0.5
[Figure: 64 processors split 26 / 6 / 32 over subtrees with sequential peaks 4 GB, 1 GB and 5 GB; root peak 5 GB]
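The bound can be checked numerically (plain arithmetic on the slide's values):

```python
# 64 processors split 26 / 6 / 32 over subtrees whose sequential peaks
# are 4, 1 and 5 GB: each processor must hold its share of one subtree.
s_max_bound = max(4 / 26, 1 / 6, 5 / 32)  # GB per processor
# Upper bound on e(p) = S_seq / (p * S_max(p)) with S_seq = 5 GB:
e_bound = 5.0 / (64 * s_max_bound)
print(s_max_bound, e_bound)  # the efficiency stays below 0.5
```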
A more memory-friendly strategy. . .
All-to-all mapping: a postorder traversal of the tree, where all the processors work at every node:
[Figure: every node of the tree mapped on all 64 processors]
Optimal memory scalability (S_max(p) = S_seq/p), but no tree parallelism and a prohibitive amount of communication.
A class of “memory-aware” mappings
"Memory-aware" mapping [Agullo '08]: aims at enforcing a given memory constraint (M0, the maximum memory per processor):
[Figure: 64 processors over subtrees with sequential peaks 4 GB, 1 GB and 5 GB; root peak 5 GB]
1. Try to apply proportional mapping.
2. Enough memory for each subtree? If not, serialize them, and update M0: processors stack equal shares of the CBs from previous nodes.
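A minimal sketch of the idea on a toy tree (the dict layout, the name/peak fields and the rounding are illustrative assumptions; CB stacking and the M0 update are ignored):

```python
def memory_aware(node, p, m0):
    """Map a toy subtree on p processors with a per-processor budget
    m0 (in GB). Try a proportional split of the children; if some
    child would exceed m0, serialize the children instead: each one
    runs on all p processors, one after the other."""
    mapping = {node['name']: p}
    children = node.get('children', [])
    if not children:
        return mapping
    total = sum(c['peak'] for c in children)
    shares = [max(1, round(p * c['peak'] / total)) for c in children]
    fits = all(c['peak'] / s <= m0 for c, s in zip(children, shares))
    for c, s in zip(children, shares):
        # serialization = every child keeps all p processors
        mapping.update(memory_aware(c, s if fits else p, m0))
    return mapping

tree = {'name': 'root', 'peak': 5.0, 'children': [
    {'name': 'a', 'peak': 4.0},
    {'name': 'b', 'peak': 1.0},
    {'name': 'c', 'peak': 5.0}]}
print(memory_aware(tree, 64, 0.25))  # proportional split fits
print(memory_aware(tree, 64, 0.10))  # too tight: children serialized
```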
• Ensures the given memory constraint and provides reliable estimates.
• Tends to assign many processors to nodes at the top of the tree ⇒ performance issues on parallel nodes.
A class of “memory-aware” mappings
A finer "memory-aware" mapping? Serializing all the children at once is very constraining: more tree parallelism can be found.
[Figure: the 64 processors split between groups of children (e.g., 51 and 13) rather than fully serializing every child]
Find groups on which proportional mapping works, and serialize these groups. Heuristic: follow a given order (e.g., the serial postorder) and form groups as large as possible.
Memory-aware with groups
The algorithm maps each child i on p_i processors so that, on every processor working on i, two conditions are enforced:
(Cstk) there is enough memory to hold P_seq(i)/p_i.
(Casm) the assembly of the CB of i into its parent is feasible.
while not all children collected do
    i = current node
    if i cannot be alone then
        Reset previous children to an all-to-all mapping (this should improve balance)
    end if
    Starting from i: collect as many nodes as possible as long as (Cstk) and (Casm) are ensured
    Use proportional mapping on the group, serialize with previous ones
end while
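The grouping heuristic can be sketched like this (only the (Cstk) stack condition is modeled; (Casm), the all-to-all fallback and the M0 update are left out):

```python
def form_groups(child_peaks, p, m0):
    """Walk the children in a fixed order (e.g., the serial postorder)
    and grow each group as long as a proportional split of the p
    processors keeps every member within the budget m0 (in GB)."""
    groups, current = [], []
    for peak in child_peaks:
        trial = current + [peak]
        total = sum(trial)
        # per-processor peak of each member under a proportional split
        ok = all(q / max(1, round(p * q / total)) <= m0 for q in trial)
        if ok:
            current = trial
        else:
            if current:
                groups.append(current)
            current = [peak]
    if current:
        groups.append(current)
    return groups

# Three children with peaks 45, 32 and 30 GB, 256 processors, and a
# 0.38 GB budget: the first two fit together, the third goes alone.
print(form_groups([45.0, 32.0, 30.0], 256, 0.38))  # [[45.0, 32.0], [30.0]]
```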
Experiments
• Matrix: finite-difference model of acoustic wave propagation, 27-point stencil, 192 × 192 × 192 grid; Seiscope consortium. N = 7 M, nnz = 189 M, factors = 152 GB (METIS). Sequential peak of active memory: 46 GB.
• Machine: 64 nodes with two quad-core Xeon X5560 per node. We use 256 MPI processes.
• Perfect memory scalability: 46 GB / 256 ≈ 180 MB.
|                     | Prop. Map | MA, M0 = 225 MB | MA, M0 = 380 MB w/o groups | MA, M0 = 380 MB w/ groups |
|---------------------|-----------|-----------------|----------------------------|---------------------------|
| Max stack peak (MB) | 1932      | 227             | 330                        | 381                       |
| Avg stack peak (MB) | 626       | 122             | 177                        | 228                       |
| Time (s)            | 1323      | 2690            | 2079                       | 1779                      |
The root node has 3 children that cannot be mapped using a proportional mapping:

46 GB + 32 GB + 30 GB > 380 MB × 256 ≈ 97 GB

• The "basic" memory-aware mapping serializes the three children.
• The variant with groups puts two children together and one alone.
[Figure: root with peak 46 GB; children subtrees with peaks 45 GB, 32 GB and 30 GB]
We assess how the different strategies map nodes, compared to the proportional mapping.
[Figure: average ratio (#procs)/(#procs with proportional mapping) per node, for each layer of the tree; curves: MA 225 MB, MA 380 MB w/o groups, MA 380 MB w/ groups]
• Relaxing M0 enhances tree parallelism.
• The variant with groups enhances tree parallelism.
Performance issues
The new mapping tends to assign many processes to nodes at the top of the tree. One of the communication patterns became a bottleneck.
[Figure: master process M and slaves S1, S2, S3 (unsymmetric case)]

Non-blocking broadcast of a block of factors from the master process; a kind of ibcast.
Baseline implementation: a loop of non-blocking sends from the master to the processors. Communication speed is constant. Hyperion (CICT): 1.7 GB/s; Hopper (NERSC): 4.5 GB/s.

Tree-based implementation: communication speed is almost proportional to the number of processors. Hyperion, 128 cores: 28 GB/s; Hopper, 768 cores: 122 GB/s.
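The tree-based scheme can be sketched as a binomial-tree broadcast schedule (an illustration of why aggregate speed grows with the number of processes, not the MUMPS code):

```python
def binomial_bcast_schedule(p):
    """Rounds of a binomial-tree broadcast rooted at rank 0: in round
    k, every rank that already holds the block sends it to rank + 2^k.
    Everyone is reached in ceil(log2(p)) rounds instead of p - 1
    sequential sends from the master."""
    rounds, have, k = [], {0}, 0
    while len(have) < p:
        step = 1 << k
        sends = [(src, src + step) for src in sorted(have)
                 if src + step < p]
        have |= {dst for _, dst in sends}
        rounds.append(sends)
        k += 1
    return rounds

for k, sends in enumerate(binomial_bcast_schedule(8)):
    print(k, sends)  # 3 rounds cover ranks 1..7
```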
Non-blocking broadcast – implementation
MUMPS follows an (almost) fully dynamic scheduling scheme, where a process can be asked on the fly to work on virtually any task.
Main loop:
while ... do
    Probe for message
    if message received then
        Read tag, execute associated task (e.g., Facto panel)
    end if
end while

Facto panel:
Do BLAS stuff on panel
Try to send panel to some processes:
if Send buffer is full then
    Call Main loop
end if
Free stuff
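The "send buffer full ⇒ re-enter the main loop" pattern can be mimicked with a toy simulation (deques stand in for the MPI send buffer and the incoming message queue; all names are invented):

```python
from collections import deque

def facto_panel(rank, outbox, inbox, capacity, dests):
    """Send one panel to each destination. When the bounded send
    buffer is full, fall back to the main loop: serve an incoming
    message (or let the network drain a send) instead of blocking,
    so two processes sending to each other cannot deadlock."""
    for dst in dests:
        while len(outbox) >= capacity:
            if inbox:
                inbox.popleft()   # "read tag, execute associated task"
            else:
                outbox.popleft()  # model the network completing a send
        outbox.append((rank, dst, 'panel'))

outbox = deque()
inbox = deque([('peer', 0, 'contribution block')])
facto_panel(0, outbox, inbox, capacity=2, dests=[1, 2, 3])
print(len(outbox), len(inbox))  # the buffer never exceeds its capacity
```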
Call Main loopend ifFree stuff
FREE
BLASSEND
FREE
BLASSEND
BLASSEND
Task 1 Task 2
15/21 MUMPS Users Group Meeting, May 29-30, 2013
![Page 50: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations ...](https://reader031.fdocuments.in/reader031/viewer/2022022704/5bccd5b609d3f2426c8c81e5/html5/thumbnails/50.jpg)
Non-blocking broadcast – implementation
MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.
Main loop:while . . . do
Probe for messageif message received then
Read tag, execute associated task(e.g., Facto panel)
end ifend while
Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then
Call Main loopend ifFree stuff
FREE
BLASSEND
FREE
BLASSEND
FREE
BLASSEND
Task 1
15/21 MUMPS Users Group Meeting, May 29-30, 2013
![Page 51: Robust memory-aware mappings for parallel multifrontal ...mumps.enseeiht.fr/doc/ud_2013/slides_rouet.pdf · Robust memory-aware mappings for parallel multifrontal factorizations ...](https://reader031.fdocuments.in/reader031/viewer/2022022704/5bccd5b609d3f2426c8c81e5/html5/thumbnails/51.jpg)
Non-blocking broadcast – implementation
MUMPS follows an (almost) fully dynamic scheduling scheme where aprocess can be asked on the fly to work on virtually any task.
Main loop:while . . . do
Probe for messageif message received then
Read tag, execute associated task(e.g., Facto panel)
end ifend while
Facto panel:Do BLAS stuff on panelTry to send panel to some processes:if Send buffer is full then
Call Main loopend ifFree stuff
FREE
BLASSEND
FREE
BLASSEND
FREE
BLASSEND
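The control flow above can be sketched as a tiny, single-process Python toy (not MUMPS's actual Fortran/MPI code; the inbox stands in for MPI probe/receive, and `TAG_FACTO_PANEL`, `CAPACITY` and the log are illustrative names):

```python
# Toy sketch of the tag-dispatched main loop with re-entry on a full
# send buffer. Single process; no real MPI involved.
from collections import deque

TAG_FACTO_PANEL = 1
CAPACITY = 1                 # pending sends the buffer can hold

inbox = deque()              # messages waiting to be probed
send_buffer = deque()        # sends posted but not yet completed
log = []

def facto_panel(panel):
    log.append(f"BLAS on {panel}")
    if len(send_buffer) >= CAPACITY:
        # Send buffer full: re-enter the main loop to make progress
        # instead of blocking (the "Call Main loop" branch above).
        send_buffer.popleft()        # pretend an earlier send completed
        main_loop()
    send_buffer.append(panel)        # post the send
    log.append(f"free {panel}")      # "Free stuff" once the send is posted

HANDLERS = {TAG_FACTO_PANEL: facto_panel}

def main_loop():
    while inbox:                     # probe for a message
        tag, payload = inbox.popleft()
        HANDLERS[tag](payload)       # read tag, execute associated task

for p in ("panel 1", "panel 2", "panel 3"):
    inbox.append((TAG_FACTO_PANEL, p))
main_loop()
print(log)
```

Note how panel 3's whole BLAS phase runs inside panel 2's handler, and panel 2 is freed last: this is the nesting produced by the recursive call, and the interleaving the animation illustrates.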
15/21 MUMPS Users Group Meeting, May 29-30, 2013
Non-blocking broadcast – implementation
Difficulties:
• Messages representing different blocks of factors can overtake one another.
• Deadlocks...
Example of messages overtaking (broadcast tree is a chain):
[Diagram: timelines for P0, P1, P2 in a chain broadcast. P0 sends two panels; a recursive call into the main loop delays the forwarding of the first panel (time points t_a, t_b, t_c).]
P2 receives the second panel before the first one!
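The scenario above can be reproduced with a toy simulation (not actual MUMPS code): a chain broadcast P0 → P1 → P2 in which P1's full send buffer triggers a recursive call into the main loop, letting the second panel slip ahead of the first.

```python
# Toy model of message overtaking in a chain broadcast.
from collections import deque

class Proc:
    def __init__(self, name):
        self.name = name
        self.inbox = deque()
        self.next_proc = None          # successor in the broadcast chain
        self.buffer_full_once = False  # force one "send buffer full" event
        self.received = []

    def recv(self, panel):
        self.inbox.append(panel)

    def forward(self, panel):
        if self.next_proc is None:
            return
        if self.buffer_full_once:
            # Send buffer full: re-enter the main loop before this send
            # completes -- any newer panel gets forwarded first.
            self.buffer_full_once = False
            self.main_loop()
        self.next_proc.recv(panel)

    def main_loop(self):
        while self.inbox:              # probe for messages
            panel = self.inbox.popleft()
            self.received.append(panel)
            self.forward(panel)

p0, p1, p2 = Proc("P0"), Proc("P1"), Proc("P2")
p0.next_proc, p1.next_proc = p1, p2
p1.buffer_full_once = True             # P1's first send blocks once

p0.recv("panel 1"); p0.recv("panel 2") # P0 broadcasts two panels
p0.main_loop(); p1.main_loop(); p2.main_loop()

print(p1.received)                     # ['panel 1', 'panel 2']
print(p2.received)                     # ['panel 2', 'panel 1'] -- overtaken!
```

P1 handles the panels in order, but P2 receives them reordered, which is why the real implementation must cope with out-of-order panels (or prevent them).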
Non-blocking broadcast – experiments
Experiments on a single frontal matrix (tree node) of order 64000 with 64 processors:
| Strategy   | Time (s), 1-way threaded BLAS | Time (s), 4-way threaded BLAS |
|------------|-------------------------------|-------------------------------|
| Baseline   | 596                           | 468                           |
| Tree-based | 401                           | 167                           |
Better overlap between communications and computations ⇒ more potential for exploiting multithreaded BLAS / faster cores.
Quite critical in a memory-aware context, since the algorithm tends to increase the number of processes per subtree.
Memory-aware mapping – more results
Back to experiments with the memory-aware mapping; matrices:
| Matrix name | Order (millions) | Entries (millions) | Factors (GB) | Sseq (GB) | Description; origin |
|-------------|------------------|--------------------|--------------|-----------|---------------------|
| cage13      | 0.4              | 7.5                | 30.7         | 17.9      | Directed weighted graph; Utrecht U. |
| as-Skitter  | 1.7              | 23.9               | 17.7         | 6.2       | Internet topology graph; SNAP |
| HV15R       | 2.0              | 283.1              | 366.4        | 87.8      | CFD, 3D engine fan; FLUOREM |
| MORANSYS1   | 2.7              | 81.3               | 63.5         | 19.1      | Model order reduction; CADFEM |
| meca_raff6  | 3.3              | 130.2              | 63.5         | 8.8       | Thermo-mechanical coupling; EDF |
Memory-aware mapping – more results
Memory-aware mapping vs. default strategy (MUMPS 4.10.0):
| Matrix     | Mapping          | Smax (MB) | Savg (MB) | Time (s) |
|------------|------------------|-----------|-----------|----------|
| cage13     | Default          | 1069      | 727       | 285      |
| cage13     | MA, M0 = 380 MB  | 305       | 302       | 498      |
| as-Skitter | Default          | 1484      | 584       | 578      |
| as-Skitter | MA, M0 = 130 MB  | 135       | 70        | 285      |
| HV15R      | Default          | 9063*     | 8773*     | N/A      |
| HV15R      | MA, M0 = 2600 MB | 2225      | 1803      | 7169     |
| MORANSYS1  | Default          | 1246      | 834       | 373      |
| MORANSYS1  | MA, M0 = 380 MB  | 379       | 305       | 651      |
| meca_raff6 | Default          | 1078      | 796       | 324      |
| meca_raff6 | MA, M0 = 320 MB  | 329       | 209       | 367      |
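For a quick read of these results, the peak-memory reduction obtained by the memory-aware (MA) mapping can be computed from the two Smax columns (a throwaway script; numbers in MB, copied from the slide):

```python
# Peak-memory reduction: Smax(default) / Smax(memory-aware).
smax = {
    "cage13":     (1069, 305),
    "as-Skitter": (1484, 135),
    "HV15R":      (9063, 2225),   # default run is starred in the table
    "MORANSYS1":  (1246, 379),
    "meca_raff6": (1078, 329),
}
reduction = {m: round(default / ma, 1) for m, (default, ma) in smax.items()}
print(reduction)
```

So the MA mapping cuts the per-process peak by roughly 3x to 11x, at the cost of longer run times on most matrices.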
Conclusion – on-going work
Conclusion
• The memory-aware mapping enforces a given memory constraint...
• ...and its variant with groups tries to add more parallelism.
• Increasing the number of processes per node led to performance problems ⇒ new (dense!) algorithms.
Future work
• Memory-aware mapping with groups: assess some heuristics.
• We have performance models that provide the optimal number of processors to use on a node, and we are working on how to inject these models into the mapping.
• More dynamic scheduling.
• Compressing contribution blocks (CBs) with low-rank approximation techniques (cf. work by Clément Weisbecker and the MUMPS team).
End
Thank you for your attention!
Any questions?
Acknowledgements
• Matrices: Stéphane Operto (SEISCOPE), Olivier Boiteau (EDF).
• Computational resources: Nicolas Renon, CICT, Toulouse.