Introduction into parallel computations
Miroslav Tuma
Institute of Computer Science
Academy of Sciences of the Czech Republic
and Technical University in Liberec
Presentation supported by the project
“Information Society” of the Academy of Sciences of the Czech Republic
under No. 1ET400300415
MFF UK, February, 2006
M. Tuma 2
Pre-introduction
Preliminaries
general knowledge of the basic algorithms of numerical linear algebra (NLA)
simple ideas from direct and iterative solvers for large sparse linear systems
complexities of algorithms
Not covered
vectorization of basic linear algebra algorithms
parallelization of combinatorial algorithms
FFT, parallel FFT, vectorized FFT
multigrid, multilevel algorithms
tools like PETSc etc.
eigenvalue problems
Outline
Part I. A basic sketch on parallel processing
1. Why to use parallel computers
2. Classification (a very brief sketch)
3. Some terminology; basic relations
4. Parallelism for us
5. Uniprocessor model
6. Vector processor model
7. Multiprocessor model

Part II. Parallel processing and numerical computations
8. Basic parallel operations
9. Parallel solvers of linear algebraic systems
10. Approximate inverse preconditioners
11. Polynomial preconditioners
12. Element-by-element preconditioners
13. Vector / parallel preconditioners
14. Solving nonlinear systems
1. Why to use parallel computers?
It might seem that
there are always better technologies
computers are still getting faster: Moore's law
"The number of transistors per square inch on integrated circuits doubles every year since the integrated circuit was invented." The observation was made in 1965 by Gordon Moore, co-founder of Intel (G.E. Moore, Electronics, April 1965).
really:
1971: chip 4004 : 2.3k transistors
1978: chip 8086 : 31k transistors (2 micron technology)
1982: chip 80286: 110k transistors (HMOS technology)
1985: chip 80386: 280k transistors (0.8 micron CMOS)
1. Why to use parallel computers? II.
Further on
1989: chip 80486: 1.2M transistors
1993: Pentium: 3.1M transistors (0.8 micron biCMOS)
1995: Pentium Pro: 5.5M (0.6 micron)
1997: Pentium II: 7.5M transistors
1999: Pentium III: 24M transistors
2000: Pentium 4: 42M transistors
2002: Itanium: 220M transistors
2003: Itanium 2: 410M transistors
1. Why to use parallel computers? III.
But: Physical limitations
finite signal speed (speed of light: 300 000 km/s)
this implies limits on cycle time (clock rate), in MHz or ns: 100 MHz ↔ 10 ns
a cycle time of 1 ns ⇒ a signal travels only 30 cm per cycle
Cray-1 (1976): 80 MHz
in any case, the size of atoms and quantum effects seem to be the ultimate limits
1. Why to use parallel computers? IV.
Further motivation: important and very time-consuming problems to be solved
reentry into the terrestrial atmosphere ⇒ Boltzmann equations
combustion ⇒ large ODE systems
deformations, crash-tests ⇒ large systems of nonlinear equations
turbulent flows ⇒ large systems of PDEs in 3D
⇓
acceleration of computations is still needed
⇓
parallel processing
1. Why to use parallel computers? V.
High-speed computing seems to be cost efficient
"The power of computer systems increases as the square of their cost."
(Grosch's law; H.A. Grosch, High speed arithmetic: The digital computer as a research tool, J. Opt. Soc. Amer. 43 (1953); H.A. Grosch, Grosch's law revisited, Computerworld 8 (1975), p. 24)
2. Classification: a very brief sketch
a) How deep can we go: levels of parallelism
running jobs in parallel for reliability: IBM AN/FSQ-31 (1958), a purely duplex machine (time for operations 2.5 µs – 63.5 µs; a computer connected with the history of the word "byte")
running parts of jobs on independent specialized units: UNIVAC LARC (1960), the first I/O processor
running jobs in parallel for speed: Burroughs D-825 (1962), more modules, a job scheduler
running parts of programs in parallel: Bendix G-21 (1963), CDC 6600 (1964), a nonsymmetric multiprocessor
2. Classification: a very brief sketch II.
a) How deep can we go: levels of parallelism (continued)
running matrix-intensive work separately: development of the IBM 704x/709x (1963), ASC TI (1965)
parallelizing instructions: IBM 709 (1957), IBM 7094 (1963)
– data synchronizer units (DSU → channels): enable simultaneous read/write/compute
– overlap of computational instructions with loads and stores
– IBR (instruction backup registers), instruction pipeline
2. Classification: a very brief sketch III.
a) How deep can we go: levels of parallelism (continued, 3rd part)
parallelizing arithmetic (bit level): fewer clocks per instruction
superscalar in RISCs (CDC 6600), static superscalar (VLIW):
– check dependencies
– schedule operations
2. Classification: a very brief sketch III.
b) Macro view based on Flynn classification
[figure: processor/memory organization taxonomy: SISD, SIMD, MISD, MIMD; simple processor, vector processor, array processor; MIMD split into shared memory (cache coherent / non cache coherent) and distributed memory]
SISD: single instruction stream – single data stream
SIMD: single instruction stream – multiple data streams
MISD: multiple instruction streams – single data stream
MIMD: multiple instruction streams – multiple data streams
2. Classification: a very brief sketch IV.
b) Macro view based on Flynn classification – MIMD message passing examples
Caltech Cosmic Cube (1980s): maximum of 64 processors, hypercube organization
[picture of Caltech Cosmic Cube]
commercial microprocessors + MPP support; examples: transputers, ncube-1, ncube-2
[picture of transputer A100]
standard microprocessors + network support; examples: Intel Paragon (i860), Meiko CS-2 (Sun SPARC), TMC CM-5 (Sun SPARC), IBM SP2-4 (RS6000)
some vector supercomputers: Fujitsu VPP machines
loosely coupled cluster systems
2. Classification: a very brief sketch IV.
b) Macro view based on Flynn classification – shared memory machines examples
no hardware cache coherence (hardware maintaining synchronization between caches and the rest of the memory); examples: BBN Butterfly (end of the 70s), Cray T3D (1993) / T3E (1996), vector supercomputers: Cray X-MP (1983), Cray Y-MP (1988), Cray C-90 (1990)
hardware cache coherence; examples: SGI Origin (1996), Sun Fire (2001)
2. Classification: a very brief sketch V.
of course, there are other possible classifications
by memory access: local/global caches; shared memory cases (UMA, NUMA, cache-only memory); distributed memory; distributed shared memory
MIMD by topology: master/slave, pipe, ring, array, torus, tree, hypercube, ...
features at various levels
2. Classification: a very brief sketch VI.
c) Miscellaneous: features making the execution faster
FPU and ALU work in parallel
mixing index evaluations and floating-point operations is natural now; it was not always like that: the Cray-1 had rather weak integer arithmetic
multiple functional units (for different operations, or for the same operation); first on the CDC 6600 (1964): 10 independent units
pipeline for instructions; IBM 7094 (1969): IBR (instruction backup registers)
a generic example of pipelined floating-point addition (five stages):
1. check exponents
2. possibly swap operands
3. shift one of the mantissas by the number of bits determined by the difference in exponents
4. compute the new mantissa
5. normalize the result
2. Classification: a very brief sketch VII.
c) Miscellaneous: features making the execution faster (continued)
pipeline for operations (example later); CDC 7600 (1969): first vector processor
overlapping operations generalizes pipelining:
– possible dependencies between evaluations
– possibly a different number of stages
– time per stage may differ
processor arrays; ILLIAC IV (1972): 64 elementary processors
memory interleaving; first on the CDC 6600 (1964): 64 memory banks; Cray-2 efficiency relies on it
3. Some terminology; basic relations
Definitions describing "new" features of the computers:
time model
speedup: how fast we are
efficiency: how fast we are with respect to our resources
granularity (of algorithm, implementation): how large the considered blocks of code are
3. Some terminology; basic relations: II.
Simplified time models
sequential time: t_seq = t_sequential_startup_latency + t_operation_time
vector pipeline time: t_vec = t_vector_startup_latency + n * t_operation_time
communication time: t_transfer_n_words = t_startup_latency + n * t_transfer_word
startup latency: the delay before the transfer starts
the real relations among data and computer are more complicated; they may be invisible from our standpoint, but we should be aware of them
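The time models above can be sketched numerically; the constants below are illustrative assumptions, not measured values:

```python
# Simplified time models from the slides, with illustrative (assumed)
# hardware constants.

def t_vec(n, t_startup, t_op):
    """Vector pipeline time: startup latency + n * per-element time."""
    return t_startup + n * t_op

def t_comm(n, t_startup, t_word):
    """Time to transfer n words: startup latency + n * per-word time."""
    return t_startup + n * t_word

# Example: 1000-element vector operation, 100 ns startup, 1 ns per element.
print(t_vec(1000, 100e-9, 1e-9))   # ≈ 1.1e-06 s

# Long messages amortize the startup term of the communication model:
print(t_comm(10, 10e-6, 0.1e-6) / 10)        # cost per word, short message
print(t_comm(10000, 10e-6, 0.1e-6) / 10000)  # cost per word, long message
```

The per-word comparison is the usual argument for sending few large messages rather than many small ones.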
3. Some terminology; basic relations: III.
Speedup S
The ratio T_s/T_p, where T_s is the time of a non-enhanced run and T_p the time of the enhanced run. Typically:
– T_s: sequential time
– T_p: time of the parallel or vectorized run
for a multiprocessor run with p processors: 0 < S ≤ p
vector pipeline: next slide
3. Some terminology; basic relations: IV.
Speedup S (continued)
time | op-1  op-2  op-3  op-4  op-5
  1  |  a1
  2  |  a2    a1
  3  |  a3    a2    a1
  4  |  a4    a3    a2    a1
  5  |  a5    a4    a3    a2    a1
 ... |  ...   ...   ...   ...   ...

for processing p entries one needs about length + p clock cycles (here length = 5)
Speedup: S = length * p / (length + p) ≤ length
speedup (better): S = length * p * t_seq / (t_vec_latency + (length + p) * t_vec_op)
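The simple speedup formula can be checked numerically for the five-stage pipeline above (a sketch using the slides' approximation length + p for the cycle count, ignoring startup latency):

```python
def pipeline_speedup(length, p):
    """Speedup of a `length`-stage pipeline processing p entries:
    sequential cost is length * p stage times, pipelined cost is
    about length + p stage times (the slides' approximation)."""
    return length * p / (length + p)

# length = 5 as in the table above; the speedup grows with p but
# approaches the pipeline length, never exceeding it.
for p in (5, 50, 5000):
    print(p, pipeline_speedup(5, p))
```

This is why the pipeline length plays the role of p in the efficiency definition on the next slide.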
3. Some terminology; basic relations: V.
Efficiency E
The ratio S/p, where S is the speedup and p characterizes the enhancement:
if p is the number of processors: 0 < E ≤ 1
if p is the pipeline length: 0 < E ≤ 1
Relative speedup and efficiency for multiprocessors, S_p and E_p:
S_p = T_1/T_p,
where T_1 is the time of the parallel code run on one processor; typically T_1 ≥ T_s
other similar definitions of E and S (e.g., taking into account the relation of the parallel code to the best sequential code)
memory hierarchy effects (e.g., the SGI2000 2-processor effect; the large memory of parallel machines)
3. Some terminology; basic relations: VI.
Amdahl's law expresses natural surprise at the following fact:
if a process performs part of the work quickly and part of the work slowly, then the overall speedup (efficiency) is strongly limited by the part performed slowly
Notation:
f: fraction of the slow (sequential) part
(1 − f): the rest (parallelized, vectorized)
t: overall time
Then: S = (f * t + (1 − f) * t) / (f * t + (1 − f) * (t/p)) ≤ 1/f
E.g.: f = 1/10 ⇒ S ≤ 10
[figure: bar diagram of the run split into the sequential fraction f and the parallelized fraction 1 − f]
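The bound can be verified numerically; the formula above simplifies to S = 1/(f + (1 − f)/p), and the processor counts below are illustrative:

```python
def amdahl_speedup(f, p):
    """Amdahl's law: fraction f runs sequentially, the rest on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

# f = 1/10: the speedup is bounded by 1/f = 10 no matter how many processors.
print(amdahl_speedup(0.1, 10))    # ≈ 5.26
print(amdahl_speedup(0.1, 1000))  # ≈ 9.91
```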
3. Some terminology; basic relations: VII.
Amdahl's law (continued)
Described in: (Gene Amdahl: Interpretation of Amdahl's theorem, advertisement of IBM, 1967)
Gene Myron Amdahl (1922 –) worked on the IBM 704/709, the IBM/360 series, the Amdahl V470 (1975)
Amdahl's law relevancy
only a simple approximation of computer processing: the dependence f(n) is not considered
fully applies when there are absolute constraints on solution time (weather prediction, financial transactions)
an algorithm is effectively parallel if f → 0 for n → ∞
Speedup / efficiency anomalies:
more processors may have more memory/cache
increased chances of finding a lucky solution in parallel combinatorial algorithms
3. Some terminology; basic relations: VIII.
Scalability: a program is scalable if
larger efficiency comes with a larger number of processors or a longer pipeline
multiprocessors: linear, sublinear, superlinear S/E
different specialized definitions for a growing number of processors / pipeline length / growing time
Isoefficiency
overhead function: T_o(size, p) = p * T_p(size, p) − T_s
efficiency: E = 1/(1 + T_o(size, p)/size)
isoefficiency function: size = K * T_o(size, p) such that E is constant, K = E/(1 − E)
adding n numbers on p processors: size = Θ(p log p)
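The last claim can be sketched for the addition example, under an assumed textbook cost model (T_p = n/p + 2 log2 p, with unit time per addition and per transfer; the constants are illustrative):

```python
import math

def parallel_time(n, p):
    """Assumed cost of adding n numbers on p processors:
    n/p local additions plus a reduction tree of 2*log2(p) steps."""
    return n / p + 2 * math.log2(p)

def efficiency(n, p):
    """E = T_s / (p * T_p), with T_s = n - 1 sequential additions."""
    return (n - 1) / (p * parallel_time(n, p))

# Growing the problem as n = Theta(p log p) keeps E roughly constant:
for p in (8, 64, 512):
    n = 8 * p * math.log2(p)
    print(p, round(efficiency(n, p), 2))
```

With a fixed n the efficiency would drop as p grows; the isoefficiency function says how fast n must grow to prevent that.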
3. Some terminology; basic relations: IX.
Load balancing
Techniques to minimize T_p on multiprocessors by approximately equalizing the tasks of individual processors.
static load balancing
– array distribution schemes (block, cyclic, block-cyclic, randomized block)
– graph partitioning
– hierarchical mappings
dynamic load balancing
– centralized schemes
– distributed schemes
Will be discussed later.
3. Some terminology; basic relations: IX.
Semaphores
signals operated by individual processes, not by a central control
a feature of shared memory computers
introduced by Dijkstra
Message passing
a mechanism to transfer data from one process to another
a feature of distributed memory computers
blocking vs. non-blocking communication
4. Parallelism for us
Mathematician’s point of view
We need to convert algorithms into state-of-the-art codes:
algorithms → codes → computers
[diagram: Algorithm, Idealized computer, Implementation/Code, Computer]
What is the idealized computer?
4. Parallelism for us
Idealized computer
idealized vector processor
idealized uniprocessor
idealized computers with more processors
5. Uniprocessor model
[figure: uniprocessor model: CPU connected to memory and I/O]
5. Uniprocessor model: II.
Example: model and reality
Even a simple Pentium III has on-chip:
a pipeline (at least 11 stages for each instruction)
data parallelism (SIMD type) like MMX (64-bit) and SSE (128-bit)
instruction-level parallelism (up to 3 instructions)
more threads at the system level, based on bus communication
5. Uniprocessor model: III.
How to ...?: a pipelined superscalar CPU: not for us
(pipelines; the ability to issue several instructions at the same time)
detecting true data dependencies: dependencies in the processing order
detecting resource dependencies: competition of data for computational resources
– reordering instructions; most microprocessors enable out-of-order scheduling
solving branch dependencies
– speculative scheduling across branches; typically every 5th–6th instruction is a branch
VLIW: compile-time scheduling
5. Uniprocessor model: IV.
How to ...?: memory and its connection to CPU
(should be considered by us)
1. memory latency: the delay between a memory request and the data retrieval
2. memory bandwidth: the rate at which data can be transferred from/to memory
5. Uniprocessor model: V.
Memory latency and performance
Example: a 2 GHz processor, DRAM with 0.1 µs latency; two FMA units on the processor and 4-way superscalar issue (4 instructions per cycle, e.g., two adds and two multiplies)
cycle time: 0.5 ns
maximum processor rate: 8 GFLOPs
for every memory request: 0.1 µs of waiting
that is, 200 cycles wasted for each access
dot product: two data fetches for each multiply-add (2 ops)
consequently: one op per fetch
resulting rate: 10 MFLOPs
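The arithmetic of this example can be reproduced as a quick check (numbers taken from the slide):

```python
# A 2 GHz processor (0.5 ns cycle), two FMA units with 4-way superscalar
# issue, DRAM latency 0.1 us, and one memory fetch per floating-point
# operation in a dot product.
cycle = 0.5e-9            # s per cycle
latency = 100e-9          # s per memory access
peak = 4 / cycle          # 4 flops per cycle -> peak rate
wasted = latency / cycle  # cycles lost per memory access
rate = 1 / latency        # one op completes per access

print(peak / 1e9)   # ≈ 8 GFLOPs peak
print(wasted)       # ≈ 200 cycles wasted per access
print(rate / 1e6)   # ≈ 10 MFLOPs achieved
```

The gap of almost three orders of magnitude between peak and achieved rate is the whole point of the latency-hiding techniques that follow.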
5. Uniprocessor model: VI.
Hiding / improving memory latency (I.)
a) Using cache
The same example with a cache of size 64 kB:
it can store the matrices A, B and C of dimension 50
matrix multiplication A * B = C
matrix fetch: 5000 words: 500 µs
ops: 2n³; time for the ops: 2 * 64³ * 0.5 ns ≈ 262 µs
total: 762 µs
resulting rate: 688 MFLOPs
5. Uniprocessor model: VII.
Hiding / improving memory latency (II.)
b) Using multithreading
(Thread: a sequence of instructions in a program which runs a certain procedure.)
dot products of the rows of A with b:

do i=1,n
   r(i) = A(i,:)'*b
end do
5. Uniprocessor model: VIII.
Hiding / improving memory latency (II.) (continued)
multithreaded version of the dot products:

do i=1,n
   r(i) = new_thread(dot_product, double, A(i,:), b)
end do

processing more threads makes it possible to hide memory latency
important condition: fast switching of threads
HEP or Tera can switch in each cycle
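The programming pattern can be sketched with a thread pool; this is only an illustration of the idea (Python threads do not hide memory latency the way hardware threads on the HEP or Tera do, and the matrix values are made up):

```python
# Sketch of the per-row dot-product threads from the slide, using a
# thread pool; the data below is illustrative.
from concurrent.futures import ThreadPoolExecutor

def dot(row, b):
    """Dot product of one row of A with the vector b."""
    return sum(x * y for x, y in zip(row, b))

A = [[1, 2], [3, 4], [5, 6]]
b = [10, 20]

# One task per row; the pool overlaps their execution.
with ThreadPoolExecutor() as pool:
    r = list(pool.map(lambda row: dot(row, b), A))

print(r)  # [50, 110, 170]
```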
5. Uniprocessor model: IX.
Hiding / improving memory latency (III.)
c) Prefetching
advancing data loads
like some other techniques, it can bring the rate in our example up to one operation per clock cycle
5. Uniprocessor model: X.
Memory bandwidth
data transfer rate: peak versus average
improvement of memory bandwidth: increase the size of communicated memory blocks
sending consecutive words from memory requires spatial locality of data
column versus row major data access: the physical access should be compatible with the logical access from the programming language
5. Uniprocessor model: XI.
Memory bandwidth (continued)
summing the columns of A: sum(i) = Σ_j A(i,j)

do i = 1, n
   sum(i) = 0.0d0
end do
do j = 1, n
   do i = 1, n
      sum(i) = sum(i) + A(i,j)
   end do
end do

matrix stored columnwise (the inner loop runs down a column): good spatial locality
matrix stored rowwise: bad spatial locality
of course, the code can be rewritten (loops interchanged) for row major data access
C, Pascal (rowwise), Fortran (columnwise)
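The two loop orders can be compared in Python (the slides use Fortran; the function names are mine). Both compute the same column sums — only the memory access pattern differs:

```python
def sum_columns_fortran_order(A, n):
    # inner loop runs down a column: consecutive addresses if A is
    # stored columnwise (Fortran), so good spatial locality
    s = [0.0] * n
    for j in range(n):
        for i in range(n):
            s[i] += A[i][j]
    return s

def sum_columns_c_order(A, n):
    # loops interchanged: the inner loop walks along a row, matching
    # rowwise (C, Pascal) storage
    s = [0.0] * n
    for i in range(n):
        for j in range(n):
            s[i] += A[i][j]
    return s
```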
5. Uniprocessor model: XII.
Memory bandwidth and latency: conclusions
the other side of hiding memory latency: increased demand on memory bandwidth
memory bandwidth improvements if the vectors are long; breaking the iteration space into blocks: tiling
exploit any possible spatial and temporal locality to amortize memory latency and increase effective memory bandwidth
the ratio q = ops / number of memory accesses: a good indicator of tolerance to limited memory bandwidth
memory layout and organization of the computation are a significant challenge for users
5. Uniprocessor model: XIII.
How to improve the ratio q = ops / number of memory accesses? How to standardize the improvement?
⇓ more levels of Basic Linear Algebra Subroutines (BLAS)
basic linear algebraic operations with vectors
basic linear algebraic operations with matrices
closer to “matlab elegance”
in fact, the first MATLAB was built on LINPACK (1979) kernels with a clever implementation of vector and matrix operations
5. Uniprocessor model: XIV.
BLAS
operation    ops         comms          q = ops/comms
αx + y       2n          3n + 1         ≈ 2/3
αAx + y      2n² + n     n² + 3n + 1    ≈ 2
αAB + C      2n³ + n²    4n² + 1        ≈ n/2
BLAS1 (1979): SAXPY (αx + y), dot_product (xᵀy), vector norms, plane rotations, ...
BLAS2 (1988): matvecs (αAx + βy), rank-1 updates, rank-2 updates, triangular solves, ...
BLAS3 (1990): matmats et al.: SGEMM (C = AB)
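The table's ratios can be checked directly; a small Python sketch (function names are mine, the counts are the ones in the table):

```python
def q_axpy(n):
    # BLAS1 alpha*x + y: 2n ops, 3n + 1 memory accesses
    return (2 * n) / (3 * n + 1)

def q_gemv(n):
    # BLAS2 alpha*A*x + y: 2n^2 + n ops, n^2 + 3n + 1 accesses
    return (2 * n * n + n) / (n * n + 3 * n + 1)

def q_gemm(n):
    # BLAS3 alpha*A*B + C: 2n^3 + n^2 ops, 4n^2 + 1 accesses
    return (2 * n ** 3 + n * n) / (4 * n * n + 1)
```

Only BLAS3 has q growing with n, which is why blocked (matrix-matrix) kernels tolerate limited memory bandwidth best.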
5. Uniprocessor model: XIV.
BLAS pros and cons
BLAS (pros): available for most computers
– increase effective memory bandwidth
– portability
– modularity
– clarity
– much simpler software maintenance
BLAS (cons): time-consuming interface for simple ops
– further possible improvements based on problem knowledge (distinguishing cases with specific treatment like loop unrolling)
5. Uniprocessor model: XV.
Standardization at the higher level: LAPACK
covers solvers for dense and banded
– systems of linear equations
– eigenvalue problems
– least-squares solutions of overdetermined systems
associated factorizations: LU, Cholesky, QR, SVD, Schur, generalized Schur
additional routines: estimates of condition numbers, factorization reorderings by pivoting
based on LINPACK (1979) and EISPACK (1976) projects
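The core LAPACK driver for dense systems (dgesv) is LU factorization with partial pivoting plus triangular solves. A toy pure-Python sketch of that combination, for illustration only (the function name is mine; real codes call the blocked LAPACK routine):

```python
def lu_solve(A, b):
    # Gaussian elimination with partial pivoting: factor PA = LU
    # in place, then back-substitute (a toy-scale analogue of dgesv)
    n = len(A)
    A = [row[:] for row in A]   # work on copies
    b = b[:]
    for k in range(n):
        # pivot: row with largest entry in column k
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x
```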
6. Vector processor model
Founding father
Seymour Cray: chief designer of the last CDC models (computers that already had some parallel features)
Cray computers: one of the most successful chapters in the history of parallel computer development
first CRAYs: vector computers
6. Vector processor model: II.
Vector processing principles
1. Vector computers’ basics
pipelined instructions
pipelined data: vector registers
typically different vector processing units for different operations
[Figure: vector registers V1, V2 and scalar register S1 feeding pipelined multiply (*) and add (+) units]
6. Vector processor model: III.
Vector processing principles
1. Vector computers’ basics (continued)
important breakthrough: efficient vectorization of sparse data ⇒ enormous influence on scientific computing
instructions: compress, expand, scatter, gather
scatter b:

do i = 1, n
   a(index(i)) = b(i)
end do

[Figure: compress — a mask selects the entries x1, x3, x6, x10 from the vector x]
Cyber-205 (late seventies): efficient software implementation (in microcode); since the Cray X-MP: performed by hardware
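The gather/scatter pair can be written out in Python for illustration (the slides use Fortran; the function names are mine):

```python
def gather(a, index):
    # v(i) = a(index(i)): collect scattered entries into a dense vector
    return [a[k] for k in index]

def scatter(b, index, a):
    # a(index(i)) = b(i): write a dense vector back to scattered positions
    for i, k in enumerate(index):
        a[k] = b[i]
    return a
```

Sparse-matrix kernels gather the needed entries into a dense vector, run a vector operation, and scatter the results back.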
6. Vector processor model: IV.
Vector processing principles
2. Chaining: overlapping of vector instructions; introduced in Cray-1 (1976)
results in c + length clock cycles, for a small constant c, to process a vector operation on vectors of length length
the longer the chain of vector operations, the better the speedup
the effect is called supervector performance
[Figure: chained pipelines — vector registers V1, V2 and scalar register S1 feed the multiply (*) unit, whose output streams directly into the add (+) unit]
6. Vector processor model: V.
Vector processing principles
Stripmining: splitting long vectors into register-sized pieces; still a saw-like curve of speedup relative to vector length
[Figure: speedup S versus vector length — a saw-tooth curve]
Stride: distance in memory between consecutively accessed vector elements
Fortran matrices: column major
C, Pascal matrices: row major
6. Vector processor model: VII.
Vector processing and us
Prepare data to be easily vectorized: II.
loop unrolling: prepare new possibilities for vectorization by a more detailed description
in some cases predictable block sizes allow efficient processing of loops of fixed size
subroutine dscal(n,da,dx,incx)
      do 50 i = mp1,n,5
        dx(i) = da*dx(i)
        dx(i + 1) = da*dx(i + 1)
        dx(i + 2) = da*dx(i + 2)
        dx(i + 3) = da*dx(i + 3)
        dx(i + 4) = da*dx(i + 4)
   50 continue
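In the reference BLAS dscal, mp1 = mod(n,5) + 1: a short cleanup loop handles the first mod(n,5) entries, then the main loop scales five elements per trip. A Python sketch of the same pattern (the function name is mine):

```python
def dscal_unrolled(da, dx):
    # mirrors the unrolled BLAS dscal: handle the n mod 5 leftover
    # entries first, then scale five elements per loop iteration
    n = len(dx)
    m = n % 5
    for i in range(m):          # cleanup loop
        dx[i] *= da
    for i in range(m, n, 5):    # unrolled main loop
        dx[i] *= da
        dx[i + 1] *= da
        dx[i + 2] *= da
        dx[i + 3] *= da
        dx[i + 4] *= da
    return dx
```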
![Page 61: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/61.jpg)
M. Tuma 52
6. Vector processor model: VIII.
Vector processing and us
Prepare data to be easily vectorized: III.
![Page 62: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/62.jpg)
M. Tuma 52
6. Vector processor model: VIII.
Vector processing and us
Prepare data to be easily vectorized: III. loop interchanges: 1. recursive doubling for polynomial evaluation
![Page 63: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/63.jpg)
M. Tuma 52
6. Vector processor model: VIII.
Vector processing and us
Prepare data to be easily vectorized: III. loop interchanges: 1. recursive doubling for polynomial evaluation
Horner’s rule: p(k) = an−k + p(k−1)x for getting p(n).strictly recursive and non-vectorizable
![Page 64: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/64.jpg)
M. Tuma 52
6. Vector processor model: VIII.
Vector processing and us
Prepare data to be easily vectorized: III.
loop interchanges: 1. recursive doubling for polynomial evaluation
Horner’s rule: p(k) = a_{n−k} + p(k−1)·x for getting p(n); strictly recursive and non-vectorizable
(v1, v2) ← (x, x²);   (v3, v4) ← v2 · (v1, v2) = (x³, x⁴);   and so on
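Recursive doubling computes all powers of x in O(log n) vector multiplies; the polynomial is then one vectorizable dot product with the coefficients. A Python sketch (function names are mine):

```python
def powers_by_doubling(x, n):
    # v[k] = x**(k+1); each round doubles the number of available
    # powers with one vector multiply: (v3, v4) = v2 * (v1, v2), ...
    v = [x]
    while len(v) < n:
        vk = v[-1]
        v.extend(vk * vi for vi in v[:n - len(v)])
    return v

def poly_eval(coeffs, x):
    # p(x) = a0 + a1*x + a2*x**2 + ... via the precomputed powers
    powers = powers_by_doubling(x, len(coeffs) - 1)
    return coeffs[0] + sum(a * p for a, p in zip(coeffs[1:], powers))
```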
![Page 65: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/65.jpg)
M. Tuma 53
6. Vector processor model: IX.
Vector processing and us
Prepare data to be easily vectorized: IV.
![Page 66: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/66.jpg)
M. Tuma 53
6. Vector processor model: IX.
Vector processing and us
Prepare data to be easily vectorized: IV. loop interchanges: 2. cyclic reduction demonstrated for solving tridiagonal systems: other “parallel” TD
solvers: later (twisted factorization)
![Page 67: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/67.jpg)
M. Tuma 53
6. Vector processor model: IX.
Vector processing and us
Prepare data to be easily vectorized: IV. loop interchanges: 2. cyclic reduction demonstrated for solving tridiagonal systems: other “parallel” TD
solvers: later (twisted factorization)
even-odd rearrangement of rows
![Page 68: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/68.jpg)
M. Tuma 53
6. Vector processor model: IX.
Vector processing and us
Prepare data to be easily vectorized: IV. loop interchanges: 2. cyclic reduction demonstrated for solving tridiagonal systems: other “parallel” TD
solvers: later (twisted factorization)
even-odd rearrangement of rows
d0 f0
e1 d1 f1
e2 d2 f2
e3 d3 f3
e4 d4 f4
e5 d5 f5
e6 d1
![Page 69: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/69.jpg)
M. Tuma 53
6. Vector processor model: IX.
Vector processing and us
Prepare data to be easily vectorized: IV. loop interchanges: 2. cyclic reduction demonstrated for solving tridiagonal systems: other “parallel” TD
solvers: later (twisted factorization)
even-odd rearrangement of rows
d0 f0
d2 e2 f2
d4 e4 f4
d6 e6
e1 f1 d1
e3 f3 d3
e5 f5 d5
![Page 70: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/70.jpg)
M. Tuma 53
6. Vector processor model: IX.
Vector processing and us
Prepare data to be easily vectorized: IV. loop interchanges: 2. cyclic reduction demonstrated for solving tridiagonal systems: other “parallel” TD
solvers: later (twisted factorization)
even-odd rearrangement of rows
d0 f0
d2 e2 f2
d4 e4 f4
d6 e6
e1 f1 d1
e3 f3 d3
e5 f5 d5
more vectorizable than GE, more ops, worse cache treatment
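The block structure after the reordering can be checked mechanically. A Python sketch (the helper name is mine, not from the lecture): build the dense tridiagonal matrix and apply the even-first permutation; the even-even block then comes out diagonal, which is what makes the elimination step vectorizable.

```python
def even_odd_permuted(d, e, f):
    # dense tridiagonal T with diagonal d, subdiagonal e (e[i] sits in
    # row i, column i-1) and superdiagonal f (f[i] in row i, column i+1),
    # with rows and columns reordered even-numbered first
    n = len(d)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        T[i][i] = d[i]
        if i > 0:
            T[i][i - 1] = e[i]
        if i < n - 1:
            T[i][i + 1] = f[i]
    perm = list(range(0, n, 2)) + list(range(1, n, 2))
    return [[T[i][j] for j in perm] for i in perm]
```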
7. Multiprocessor model
Basic items (some of them emphasized once more)
communication
– in addition to memory latency and memory bandwidth, we consider latencies and bandwidths connected to mutual communication
granularity
– how large the independent computational tasks should be
load balancing
– balancing work in the whole system
resulting measure: parallel efficiency / scalability
7. Multiprocessor model: II.
Communication
Additional communication: besides the uniprocessor processor–memory (P–M) traffic we now also have processor–processor (P–P) traffic
store-and-forward routing via l links between two processors
– tcomm = ts + l·(m·tw + th)
– ts: transfer startup time (includes startups for both nodes)
– m: message size
– th: node latency (header latency)
– tw: time to transfer a word
– simplification: tcomm ≈ ts + l·m·tw
typically: poor efficiency of communication
7. Multiprocessor model: III.
Communication (continued)
[Figures: a single message traversing the links, versus the same message broken into two parts whose transfers overlap]
7. Multiprocessor model: IV.
Communication (continued 2)
packet routing: routing r packets via l links between two processors
subsequent packets are sent after a part of the message (a packet) has been received
– tcomm = ts + th·l + tw1·m + (m/r)·tw2·(r + s)
– ts: transfer startup time (includes startups for both nodes)
– tw1: time for packetizing the message; tw2: time to transfer a word; s: size of the packetizing info per packet
– finally: tcomm = ts + th·l + m·tw
– stores are overlapped by transfers
cut-through routing: message broken into flow control digits (flits, fixed-size units)
– tcomm = ts + th·l + m·tw
supported by most current parallel machines and local networks
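The two cost models above can be compared directly; a Python sketch with the slide's symbols as parameters (the function names are mine):

```python
def t_store_and_forward(ts, th, tw, l, m):
    # the whole m-word message is received and stored at each of the
    # l hops before being forwarded: ts + l*(m*tw + th)
    return ts + l * (m * tw + th)

def t_cut_through(ts, th, tw, l, m):
    # flits stream through the links, so each hop pays only the
    # header latency: ts + th*l + m*tw
    return ts + th * l + m * tw
```

For l = 1 the two coincide; for longer routes cut-through removes the factor l multiplying the m·tw term.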
7. Multiprocessor model: V.
Communication (shared memory issues)
avoid cache thrashing (performance degradation due to insufficient cache capacity); much more important on multiprocessor architectures ⇒ a typical source of performance deterioration when a code is moved to a parallel computer
more difficult to model prefetching
difficult to get and model spatial locality because of cache issues
cache sharing (sharing data for different processors in the same cachelines)
remote access latencies (data for a processor updated in a cache ofanother processor)
7. Multiprocessor model: VI.
Optimizing communication
minimize amount of transferred data: better algorithms
message aggregation, communication granularity, communicationregularity: implementation
minimize distance of data transfer: efficient routing, physical platform organization (not treated here, but tacitly used in some very general and realistic assumptions)
7. Multiprocessor model: VII.
Granularity of algorithms, implementation, computation
Rough classification by the size of program sections executed without additional communication
fine grain
medium grain
coarse grain
7. Multiprocessor model: VIII.
Fine grain example 1: pointwise Jacobi iteration
x⁺ = (I − D⁻¹A) x + D⁻¹ b

A = tridiag(−I, B, −I),   B = tridiag(−1, 4, −1),   D = diag(4, . . ., 4) = 4·I
7. Multiprocessor model: IX.
Fine grain example 1: pointwise Jacobi iteration (continued)
x⁺_ij = x_ij + (b_ij + x_{i−1,j} + x_{i,j−1} + x_{i+1,j} + x_{i,j+1} − 4·x_ij)/4

[Figure: 5×5 grid in coordinates (i, j); every point is labeled 1 — all points can be updated in the same step]
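One Jacobi sweep on the grid can be sketched in Python (the slides use mathematical notation; the function name is mine, zero Dirichlet values are assumed outside the grid). Every point uses only old values, so all points update independently — fine grain parallelism:

```python
def jacobi_sweep(x, b):
    # one Jacobi sweep for the 5-point stencil: the update
    # x+ = x + (b + neighbors - 4x)/4 simplifies to (b + neighbors)/4;
    # values outside the grid are taken as zero (Dirichlet boundary)
    n = len(x)
    xn = [row[:] for row in x]
    for i in range(n):
        for j in range(n):
            up = x[i - 1][j] if i > 0 else 0.0
            down = x[i + 1][j] if i < n - 1 else 0.0
            left = x[i][j - 1] if j > 0 else 0.0
            right = x[i][j + 1] if j < n - 1 else 0.0
            xn[i][j] = (b[i][j] + up + down + left + right) / 4.0
    return xn
```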
7. Multiprocessor model: X.
Fine grain example 2: pointwise Gauss-Seidel iteration
x⁺ = (I − (D + L)⁻¹A) x + (D + L)⁻¹ b

x⁺_ij = x_ij + (b_ij + x⁺_{i−1,j} + x⁺_{i,j−1} + x_{i+1,j} + x_{i,j+1} − 4·x_ij)/4

[Figure: 5×5 grid in coordinates (i, j); each point is labeled 1–9 by the wavefront step in which it can be updated — points on the same antidiagonal are independent]
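The wavefront structure follows from the update formula: point (i, j) needs the already updated (i−1, j) and (i, j−1), so all points with the same i + j can be processed together. A Python sketch (the function name is mine):

```python
def wavefronts(n):
    # group grid points by i + j: each group is one Gauss-Seidel
    # wavefront whose points can all be updated in parallel
    waves = {}
    for i in range(n):
        for j in range(n):
            waves.setdefault(i + j, []).append((i, j))
    return [waves[k] for k in sorted(waves)]
```

For a 5×5 grid this gives 9 wavefronts of sizes 1, 2, 3, 4, 5, 4, 3, 2, 1 — exactly the labels 1–9 in the figure.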
7. Multiprocessor model: XI.
Granularity
The concept of granularity can be generalized to: decomposition of the computation
Problem decomposition (I/IV)
recursive decomposition: divide and conquer strategy
– example: sorting algorithm quicksort
—- select an entry (the pivot) in the sequence to be sorted
—- partition the sequence into two subsequences
7. Multiprocessor model: XII.
Problem decomposition (II/IV)
One step of quicksort – basic scheme
3 1 7 2 5 8 6 4      (pivot: 3)
1 2   3   7 5 8 6 4
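The partition step can be written out in Python for illustration (the function name is mine); the two halves can then be sorted recursively and independently, which is what makes the decomposition parallel:

```python
def partition(seq, pivot):
    # one quicksort step: entries smaller than the pivot go left,
    # larger ones go right; both halves recurse independently
    left = [x for x in seq if x < pivot]
    mid = [x for x in seq if x == pivot]
    right = [x for x in seq if x > pivot]
    return left, mid, right
```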
7. Multiprocessor model: XIII.
Problem decomposition (III/IV)
data decomposition: split the problem data
– example: matrix multiplication
( A11 A12 ) ( B11 B12 )      ( C11 C12 )
( A21 A22 ) ( B21 B22 )  →   ( C21 C22 )
7. Multiprocessor model: XIV.
Problem decomposition (IV/IV)
exploratory decomposition: split the search space
– used, e.g., in approximately solving NP-hard combinatorial optimization problems
speculative and random decompositions
– example: evaluating branch instructions before the branch condition is evaluated
hybrid decomposition: first recursive decomposition into large chunks, later data decomposition
7. Multiprocessor model: XV.
Load balancing
static mappings
– 1. data block distribution schemes
– example: matrix multiplication
– n: matrix dimension; p: number of processors
– 1D block distribution: processors own blocks of rows; each one has n/p of the rows
– 2D block distribution: processors own blocks of size n/√p × n/√p, partitioned by both rows and columns
– input, intermediate, output block data distributions
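The two mappings can be made concrete in Python (the function names are mine; for simplicity n is assumed divisible by p, and p is a perfect square in the 2D case):

```python
import math

def owner_1d(i, n, p):
    # 1-D block distribution: processor k owns rows
    # k*(n/p) .. (k+1)*(n/p) - 1
    return i // (n // p)

def owner_2d(i, j, n, p):
    # 2-D block distribution: a sqrt(p) x sqrt(p) processor grid,
    # each processor owning an (n/sqrt(p)) x (n/sqrt(p)) block
    q = math.isqrt(p)
    blk = n // q
    return (i // blk) * q + (j // blk)
```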
7. Multiprocessor model: XVI.
Load balancing: 1D versus 2D matrix distribution for a matmat (matrix-matrix multiplication)
[Figures: 1D partitioning versus 2D partitioning of the matrices]
shared data: 1D: n²/p + n², 2D: O(n²/√p)
7. Multiprocessor model: XVII.
Load balancing: data block distribution schemes for matrix algorithms with nonuniform work with respect to the ordering of indices
example: LU decomposition
cyclic and block-cyclic distributions
1D and 2D block cyclic distribution
7. Multiprocessor model: XVIII.
Load balancing: other static mappings
randomized block distributions
– useful, e.g. for sparse or banded matrices
graph partitioning
– an application based input block data distribution
hierarchical static mappings
task-based partitionings
7. Multiprocessor model: XIX.
Load balancing: dynamic mappings
centralized schemes
– master: a special process managing pool of available tasks
– slave: processors performing tasks from the pool
—- self-scheduling (choosing tasks in independent demands)
—- controlled-scheduling (master involved in providing tasks)
—- chunk-scheduling (slaves take a block of tasks)
distributed schemes
– more freedom, more duties
– synchronization between sender and receiver
– initiation of tasks
7. Multiprocessor model: XX.
User point of view: tools
the most widespread message passing model: MPI paradigm
– supports execution of different programs on each of the processors
– enables easy description using the SPMD approach: a way to keep the job of program writing efficient
– simple parallelization with calls to a library
other message passing model: PVM
– some enhancements but less efficient
Posix Thread API
Shared-memory OpenMP API
7. Multiprocessor model: XXI.
Example: basic MPI routines
MPI_init(ierr)
MPI_finalize(ierr)
MPI_comm_rank(comm,rank,ierr)
MPI_comm_size(comm,size,ierr)
MPI_send(buf,n,type,dest,tag,comm,ierr)
MPI_recv(buf,n,type,srce,tag,comm,status,ierr)
MPI_bcast(buf,n,type,root,comm,ierr)
MPI_REDUCE(sndbuf,rcvbuf,1,type,op,0,comm,ierr)
7. Multiprocessor model: XXII.
c******************************************************************
c pi.f - compute pi by integrating f(x) = 4/(1 + x**2)
c (rewritten from the example program from MPICH, ANL)
c
c Each node:
c 1) receives the number of rectangles used in the approximation.
c 2) calculates the areas of its rectangles.
c 3) Synchronizes for a global summation.
c Node 0 prints the result.
c
program main
include ’mpif.h’
double precision PI25DT
parameter (PI25DT = 3.141592653589793238462643d0)
double precision mypi, pi, h, sum, x, f, a
integer n, myid, numprocs, i, rc
7. Multiprocessor model: XXIII.
c function
f(a) = 4.d0 / (1.d0 + a*a)
c init
call MPI_INIT( ierr )
c who am I?
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
c how many of us?
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
print *, "Process ", myid, " of ", numprocs, " is alive"
c
10 if ( myid .eq. 0 ) then
write(*,*) ’Enter the number of intervals: (0 quits)’
read(*,*) n
endif
7. Multiprocessor model: XXIV.
c distribute dimension
call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
c calculate the interval size
h = 1.0d0/n
c
sum = 0.0d0
do i = myid+1, n, numprocs
x = h * (dble(i) - 0.5d0)
sum = sum + f(x)
end do
mypi = h * sum
c collect all the partial sums
call MPI_REDUCE(mypi,pi,1,MPI_DOUBLE_PRECISION,MPI_SUM,0,
$ MPI_COMM_WORLD,ierr)
7. Multiprocessor model: XXV.
c node 0 prints the answer.
if (myid .eq. 0) then
write(6, 97) pi, abs(pi - PI25DT)
97 format(’ pi is approximately: ’, F18.16,
+ ’ Error is: ’, F18.16)
endif
30 call MPI_FINALIZE(rc)
stop
end
7. Multiprocessor model: XXVI.
Linear algebra standardization and multiprocessor model
BLACS: Basic Linear Algebra Communication Subroutines (low level of concurrent programming)
PBLAS: Parallel BLAS: the “parallel” info is transferred via a descriptor array
ScaLAPACK: library of high-performance linear algebra for message-passing architectures
All of these based on the message-passing primitives
7. Multiprocessor model: XXVII.
Dependency tree for high-performance linear algebra software
ScaLAPACK
PBLAS
BLACS
MPI, PVM
BLAS
LAPACK
8. Basic parallel operations
Dense matrix-vector multiplication
Algorithm 1: sequential matrix-vector multiplication y = Ax

for i = 1, ..., n
   y_i = 0
   for j = 1, ..., n
      y_i = y_i + a_ij * x_j
   end j
end i

a) rowwise 1-D partitioning
b) 2-D partitioning
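As a point of reference for the parallel variants below, the sequential loop can be written as a minimal Python sketch (plain lists, no libraries; names are illustrative):

```python
def matvec(A, x):
    """Sequential y = A x, following the i-j loop order above."""
    n = len(A)
    y = [0.0] * n
    for i in range(n):
        for j in range(n):
            y[i] += A[i][j] * x[j]
    return y

A = [[1.0, 2.0],
     [3.0, 4.0]]
print(matvec(A, [1.0, 1.0]))  # [3.0, 7.0]
```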
8. Basic parallel operations: II.
Dense matrix-vector multiplication: rowwise 1-D partitioning
[Figure: rowwise 1-D partitioning, one matrix row per process P0, ..., P5; the vector entries x0, ..., x5 are distributed accordingly.]
Communication: all-to-all communication among n processes (P0, ..., Pn−1); Θ(n) time for the communication

Multiplication: Θ(n)

Altogether: Θ(n) parallel time, Θ(n²) process time: cost optimal (asymptotically the same number of operations when sequentialized)
8. Basic parallel operations: III.
Dense matrix-vector multiplication: block-rowwise 1-D partitioning

Blocks of size n/p; the matrix is striped block-rowwise, and the vectors x and y are split into subvectors of length n/p.

Communication: all-to-all communication among p processes (P0, ..., Pp−1): time ts log(p) + tw (n/p)(p − 1) ≈ ts log(p) + tw n (using a rather general assumption on the implementation of collective communications).

Multiplication: n²/p

Altogether: n²/p + ts log(p) + tw n parallel time; cost optimal for p = O(n) (asymptotically the same number of operations as in the sequential case).
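The estimate above is easy to evaluate numerically. The sketch below plugs hypothetical machine parameters ts and tw into the formula — an illustration of the cost model, not a measurement:

```python
import math

def t_block_1d(n, p, ts=10.0, tw=1.0):
    """Block-rowwise 1-D model: n^2/p local computation plus
    ts*log2(p) + tw*n for the all-to-all step (ts, tw are
    hypothetical startup and per-word costs)."""
    if p == 1:
        return float(n * n)          # no communication on one process
    return n * n / p + ts * math.log2(p) + tw * n

for p in (1, 4, 16, 64):
    print(p, t_block_1d(1024, p))
```

The model reproduces the qualitative behaviour: the computation term shrinks with p while the tw·n term does not, which is why cost-optimality caps p at O(n).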
8. Basic parallel operations: IV.
Dense matrix-vector multiplication: 2-D partitioning
[Figure: 2-D partitioning of a 6 × 6 matrix, one entry per process (P0, P1, P2, ..., numbered row by row); the vector x0, ..., x5 is aligned along one column of the process grid.]
Communication I: align the vector x

Communication II: one-to-all broadcast among the n processes of each column: Θ(log n)

Communication III: all-to-one reduction in rows: Θ(log n)

Multiplication: Θ(1)

Altogether: Θ(log n) parallel time; Θ(n² log n) process time. The algorithm is not cost optimal.
8. Basic parallel operations: V.
Dense matrix-vector multiplication: block 2-D partitioning
[Figure: block 2-D partitioning, an (n/√p) × (n/√p) block per process P0, P1, ..., with the vector split among one column of the process grid.]
Multiplication: n²/p

Aligning the vector: ts + tw n/√p

Columnwise one-to-all broadcast: (ts + tw n/√p) log(√p)

All-to-one reduction: (ts + tw n/√p) log(√p)

Altogether: n²/p + ts log p + (tw n/√p) log p parallel time

The algorithm is cost optimal for p = O(n).
8. Basic parallel operations: VI.
Dense matrix-matrix multiplication: 2-D partitioning
[Figure: 2-D process grid P0, P1, P2, ..., numbered row by row, holding square blocks of A and B.]
Communication: two all-to-all broadcast steps, each with √p concurrent broadcasts among groups of √p processes

Total communication time: 2(ts log(√p) + tw (n²/p)(√p − 1)) ≈ ts log p + 2 tw n²/√p

Multiplications of matrices of dimension n/√p, performed √p times: n³/p computation time

Altogether: n³/p + ts log p + 2 tw n²/√p parallel time. The algorithm is cost optimal for p = O(n²)

Large memory consumption: each process holds √p blocks of size Θ(n²/p), i.e. Θ(n²/√p) memory.
8. Basic parallel operations: VII.
Dense matrix-matrix multiplication: Cannon's algorithm

Initial block layout (4 × 4 example):

A00 A01 A02 A03     B00 B01 B02 B03
A10 A11 A12 A13     B10 B11 B12 B13
A20 A21 A22 A23     B20 B21 B22 B23
A30 A31 A32 A33     B30 B31 B32 B33
After the initial alignment (row i of A shifted left by i, column j of B shifted up by j):

A00 A01 A02 A03     B00 B11 B22 B33
A11 A12 A13 A10     B10 B21 B32 B03
A22 A23 A20 A21     B20 B31 B02 B13
A33 A30 A31 A32     B30 B01 B12 B23
A memory-efficient version of matrix-matrix multiplication: each process holds only one block of A and one block of B at a time.

Parallel time and cost-optimality are asymptotically the same as for the 2-D algorithm.

The algorithm of Dekel, Nassimi, and Sahni uses n³ processes to get Θ(log n) parallel time (not cost optimal); with n³/log n processes there exists a fast cost-optimal variant.
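Cannon's shifting scheme can be simulated serially. The sketch below runs it on a q × q grid with one scalar standing in for each block — a toy model of the process grid, not a parallel implementation:

```python
def cannon_matmul(A, B):
    """Cannon's algorithm on a q x q grid, simulated serially;
    each grid position holds one scalar in place of a block."""
    q = len(A)
    # initial alignment: shift row i of A left by i, column j of B up by j
    A = [row[i:] + row[:i] for i, row in enumerate(A)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0.0] * q for _ in range(q)]
    for _ in range(q):
        # each "process" (i, j) multiplies its resident pair of blocks
        for i in range(q):
            for j in range(q):
                C[i][j] += A[i][j] * B[i][j]
        # single-step shifts: A one position left, B one position up
        A = [row[1:] + row[:1] for row in A]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C

print(cannon_matmul([[1.0, 2.0], [3.0, 4.0]],
                    [[5.0, 6.0], [7.0, 8.0]]))  # [[19.0, 22.0], [43.0, 50.0]]
```

After the alignment, position (i, j) holds A[i][(i+j) mod q] and B[(i+j) mod q][j], so every step multiplies a matching pair, and q steps cover the whole inner sum.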
8. Basic parallel operations: VIII.
Gaussian elimination (here the kij variant of LU factorization)

[Figure: active part of the matrix at step k, with the entries (k,k), (k,j), (i,k), (i,j) marked.]

a(k,j) = a(k,j) / a(k,k)
a(i,j) = a(i,j) - a(i,k) * a(k,j)
Sequential time complexity: (2/3)n³ + O(n²)
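The kij loop above, as a minimal in-place Python sketch (no pivoting; assumes nonzero pivots). After it runs, the lower triangle including the diagonal holds L and the strict upper triangle holds the scaled (unit-diagonal) U:

```python
def lu_kij(A):
    """In-place kij Gaussian elimination with the update
    formulas from the slide; no pivoting."""
    n = len(A)
    for k in range(n):
        for j in range(k + 1, n):
            A[k][j] = A[k][j] / A[k][k]       # a(k,j) = a(k,j)/a(k,k)
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] = A[i][j] - A[i][k] * A[k][j]
    return A

print(lu_kij([[2.0, 1.0], [4.0, 5.0]]))  # [[2.0, 0.5], [4.0, 3.0]]
```

Here L = [[2, 0], [4, 3]] and unit-upper U = [[1, 0.5], [0, 1]] multiply back to the original matrix.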
8. Basic parallel operations: IX.
Standard Gaussian elimination: 1-D partitioning

[Figure: one matrix row per process; at each step the pivot row is broadcast and the rows below it are updated, zeroing the entries under the diagonal.]
Computation: 3 Σ_{k=0}^{n−1} (n − k − 1) = 3n(n − 1)/2

Parallel time: 3n(n − 1)/2 + ts n log n + (1/2) tw n(n − 1) log n

This is not cost-optimal, since the total time is Θ(n³ log n).
8. Basic parallel operations: X.
Pipelined Gaussian elimination: 1-D partitioning
[Figure: pipelined elimination; successive pivot rows propagate through the processes so that several elimination steps are in progress at once.]
8. Basic parallel operations: XI.
Pipelined Gaussian elimination: 1-D partitioning (continued)
Total number of steps: Θ(n)

Operations, each of O(n) time complexity:
   Communication of O(n) entries
   Division of O(n) entries by a scalar
   Elimination step on O(n) entries

Parallel time: O(n²); total time: O(n³).

Not the same constant in the asymptotic complexity as in the sequential case: some processors are idle.
8. Basic parallel operations: XII.
Gaussian elimination: further issues

2-D partitioning: Θ(n³) total time for n² processes.
Block 2-D partitioning: Θ(n³/p) total time for p processes.
2-D partitionings are generally more scalable (they allow efficient use of more processors).

Pivoting changes the layout of the elimination:
   Partial pivoting: no problem with 1-D rowwise partitioning: O(n) search in each row.
   It might seem better with 1-D columnwise partitioning (O(log p) search), but this imposes strong restrictions on pipelining.
   Weaker variants of pivoting (e.g., pairwise pivoting) may result in strong degradation of the numerical quality of the algorithm.
8. Basic parallel operations: XIII.
Solving triangular systems: back-substitution

Sequential back-substitution for U x = y (one possible order of operations):

do k = n, 1, -1
   x(k) = y(k) / U(k,k)
   do i = k-1, 1, -1
      y(i) = y(i) - x(k)*U(i,k)
   end do
end do

Sequential complexity: n²/2 + O(n)

Rowwise block 1-D partitioning: constant communication per step, O(n/p) computation per step, Θ(n) steps: Θ(n²/p) parallel time.

Block 2-D partitioning: Θ(n²√p) total time.
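The column-oriented loop above, as a short Python sketch (the right-hand side is copied so the caller's y is untouched):

```python
def back_substitute(U, y):
    """Column-oriented back-substitution for U x = y, following
    the k-outer, i-inner loop order of the code above."""
    n = len(U)
    y = y[:]                     # work on a copy of the right-hand side
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = y[k] / U[k][k]
        for i in range(k - 1, -1, -1):
            y[i] -= x[k] * U[i][k]   # eliminate column k from rows above
    return x

print(back_substitute([[2.0, 1.0], [0.0, 4.0]], [4.0, 8.0]))  # [1.0, 2.0]
```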
8. Basic parallel operations: XIV.
Solving linear recurrences: a case of parallel prefix operation
Parallel prefix operation:
Get y0 = x0, y1 = x0 ♥ x1, ..., yi = x0 ♥ x1 ♥ ... ♥ xi for an associative operation ♥.
[Figure: binary tree computing the prefixes over inputs 0, ..., 7: pairwise results 0:1, 2:3, 4:5, 6:7; then 0:3 and 4:7; then the remaining prefixes 0:2, 0:4, 0:5, 0:6, 0:7.]
8. Basic parallel operations: XIV.
Parallel prefix operation (continued)
Application to z_{i+1} = a_i z_i + b_i (with z_0 = 0):

Get p_i = a_0 ... a_i using the parallel prefix operation.
Compute β_i = b_i / p_i in parallel.
Compute s_i = β_0 + ... + β_{i−1} using the parallel prefix operation.
Compute z_i = s_i p_{i−1} in parallel.
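The four steps can be checked serially (the two prefix scans, done here with `itertools.accumulate`, are exactly the parts that would run in O(log n) parallel time; z_0 = 0 is assumed):

```python
from itertools import accumulate
import operator

def solve_recurrence(a, b):
    """z_{i+1} = a_i z_i + b_i with z_0 = 0, via the prefix steps above;
    assumes all a_i are nonzero."""
    n = len(a)
    p = list(accumulate(a, operator.mul))     # step 1: p_i = a_0 * ... * a_i
    beta = [b[i] / p[i] for i in range(n)]    # step 2: beta_i = b_i / p_i
    s = list(accumulate(beta))                # step 3: prefix sums of beta
    return [s[i] * p[i] for i in range(n)]    # step 4: z_{i+1} = s_{i+1} p_i

# direct check against the recurrence itself
a, b = [2.0, 3.0, 0.5], [1.0, 1.0, 2.0]
z, direct = 0.0, []
for ai, bi in zip(a, b):
    z = ai * z + bi
    direct.append(z)
print(solve_recurrence(a, b), direct)
```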
8. Basic parallel operations: XV.
Conclusion for basic parallel operations

Still far from contemporary scientific computing.
There are large dense matrices in practical problems, but a lot can be performed by ready-made scientific software like ScaLAPACK.

Problems are:
   Sparse: O(n) sequential steps may be too many. But contemporary sparse matrix software strongly relies on dense blocks connected by a general sparse structure.
   Very often unstructured: operations with general graphs and specialized combinatorial routines should be implemented efficiently on a wide spectrum of computer architectures.
   Not homogeneous, in the sense that completely different parallelization techniques must be used in implementations.
9. Parallel solvers of linear algebraic equations
Basic classification of (sequential) solvers

Ax = b

Our case of interest:
   A is large
   A is, fortunately, most often sparse

Different classes of methods for solving the system, with various advantages and disadvantages:
   Gaussian elimination → direct methods
   CG method → Krylov subspace iterative methods
   (+) multilevel information transfer
9. Parallel solvers of linear algebraic equations: II.
Hunt for extreme parallelism: Algorithm by Csanky
Compute the powers of A: A², A³, ..., A^{n−1} (O(log² n) complexity)

Compute the traces s_k = tr(A^k) of the powers (O(log n) complexity)

Solve the Newton identities for the coefficients p_k of the characteristic polynomial (O(log² n)):

   k p_k = s_k − p_1 s_{k−1} − ... − p_{k−1} s_1,   k = 1, ..., n

Compute the inverse using the Cayley-Hamilton theorem (O(log² n))

Horribly unstable.
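The steps can be played through serially on a tiny matrix. In the sketch below the powers, traces, and the triangular Newton solve stand in for the parallel stages; as the slide says, the method is horribly unstable, so this is exposition only:

```python
def matmul(A, B):
    """Plain dense matrix product."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def csanky_inverse(A):
    """Inverse via Newton identities and Cayley-Hamilton,
    following Csanky's steps serially."""
    n = len(A)
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    powers = [I]                              # powers[k] = A^k
    for _ in range(n):
        powers.append(matmul(powers[-1], A))  # up to A^n for the traces
    s = [sum(powers[k][i][i] for i in range(n)) for k in range(n + 1)]
    p = [0.0] * (n + 1)
    for k in range(1, n + 1):                 # k p_k = s_k - sum p_j s_{k-j}
        p[k] = (s[k] - sum(p[j] * s[k - j] for j in range(1, k))) / k
    # Cayley-Hamilton: A^n = p_1 A^{n-1} + ... + p_n I, hence
    # A^{-1} = (A^{n-1} - p_1 A^{n-2} - ... - p_{n-1} I) / p_n
    return [[(powers[n - 1][i][j]
              - sum(p[m] * powers[n - 1 - m][i][j] for m in range(1, n)))
             / p[n] for j in range(n)] for i in range(n)]

print(csanky_inverse([[2.0, 0.0], [0.0, 4.0]]))
```

For the 2 × 2 example, p_1 is the trace and p_2 = −det, and the result is diag(0.5, 0.25).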
9. Parallel solvers of linear algebraic equations: III.
Typical key operations in the skeleton of Krylov subspace methods

1. Matrix-vector multiplication (one right-hand side) with a sparse matrix.

2. Matrix-matrix multiplications (more right-hand sides); the first matrix is sparse.

3. Sparse matrix-matrix multiplications.

4. Preconditioning operation (we will explain preconditioning later).

5. Orthogonalization in some algorithms (GMRES).

6. Some standard dense stuff (saxpys, dot products, norm computations).

7. Overlapping communication and computation. It sometimes changes the numerical properties of the implementation.
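To see where these operations occur, here is a bare-bones conjugate gradient skeleton (unpreconditioned; a sketch, not a production solver). The matrix enters only through a user-supplied matvec, and each iteration costs one matvec, two dot products, and three saxpys:

```python
def cg(matvec, b, x0, tol=1e-10, maxit=100):
    """Conjugate gradients for a symmetric positive definite A;
    matvec is the only access to A."""
    x = x0[:]
    r = [bi - Axi for bi, Axi in zip(b, matvec(x))]   # r = b - A x
    p = r[:]
    rho = sum(ri * ri for ri in r)                    # dot product
    for _ in range(maxit):
        if rho ** 0.5 < tol:
            break
        q = matvec(p)                                 # the matvec
        alpha = rho / sum(pi * qi for pi, qi in zip(p, q))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]  # saxpy
        r = [ri - alpha * qi for ri, qi in zip(r, q)]  # saxpy
        rho_new = sum(ri * ri for ri in r)
        p = [ri + (rho_new / rho) * pi for ri, pi in zip(r, p)]
        rho = rho_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
mv = lambda v: [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]
print(cg(mv, [1.0, 2.0], [0.0, 0.0]))  # approx [1/11, 7/11]
```

In a parallel setting the matvec is the (block-)partitioned operation discussed above, and the dot products are the all-to-one reductions.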
9. Parallel solvers of linear algebraic equations: IV.
System matrix goes to sparse: more possible data structures
[Figure: the same sparsity pattern under three storage paradigms: a band, a profile (envelope), and the frontal method — a dynamic band with a moving window.]
9. Parallel solvers of linear algebraic equations: V.
General sparsity structure can be reasonably treated.

Banded and envelope paradigms often lead to slower algorithms, e.g., when matrices have to be decomposed.

Machines often support gather-scatter, useful with the indirect addressing connected to sparse matrices.

Generally sparse data structures are typically preferred.
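A minimal example of such a general sparse structure is compressed sparse row (CSR) storage. The sketch below (illustrative names) shows the indirect addressing `x[cols[k]]` that gather-scatter hardware helps with:

```python
def csr_matvec(vals, cols, rowptr, x):
    """y = A x for A in compressed sparse row (CSR) form:
    vals/cols hold the nonzeros row by row, rowptr marks
    where each row starts."""
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[cols[k]]     # gather from x
    return y

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
cols = [0, 2, 1, 0, 2]
rowptr = [0, 2, 3, 5]
print(csr_matvec(vals, cols, rowptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```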
[Figure: sparsity pattern of a matrix and of its factor; the fill-in entries created during the decomposition are marked f.]
9. Parallel solvers of linear algebraic equations: VI.
General sparsity structure can be reasonably treated.
Scheduling for parallel computation is not straightforward.
(Some) issues useful for linear algebraic solvers:
1. Sparse fill-in minimizing reorderings
2. Graph partitioning
3. Reordering matrix for matvecs for 1-D / 2-D partitioning
4. Sparse matrix-matrix multiplication
5. Some ideas from preconditioning
9. Parallel solvers of linear algebraic equations: VII.
Sparse fill-in minimizing reorderings.
static, which distinguishes them from dynamic reordering strategies (pivoting)

two basic types:
   local reorderings: based on a local greedy criterion
   global reorderings: taking into account the whole graph / matrix
9. Parallel solvers of linear algebraic equations: VIII.
Local fill-in minimizing reorderings: MD: the basic algorithm.
G := G(A)
for i = 1 to n do
  find v such that deg^G(v) = min_{u ∈ V(G)} deg^G(u)
  G := G_v
end for
The order in which the vertices are found induces their new numbering.
Here deg(v) = |Adj(v)|; the graph G as a superscript refers to the current elimination graph, and G_v denotes the graph obtained from G by eliminating v.
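The loop above can be sketched in a few lines of Python; this is a minimal illustration with the graph stored as a dict of adjacency sets (real MD codes use quotient graphs, mass elimination, and other refinements not shown here):

```python
def minimum_degree_order(adj):
    """Basic minimum degree ordering (no quotient-graph refinements).

    adj: dict vertex -> set of neighbours of an undirected graph.
    Returns the elimination order, i.e. the new numbering."""
    g = {v: set(nbrs) for v, nbrs in adj.items()}   # working copy of G
    order = []
    while g:
        # local greedy criterion: pick a vertex of minimum current degree
        v = min(g, key=lambda u: len(g[u]))
        nbrs = g.pop(v)
        for a in nbrs:                  # form G_v: Adj(v) becomes a clique
            g[a].discard(v)
            g[a] |= nbrs - {a}          # fill edges
        order.append(v)
    return order
```

For the 4-node path 1-2-3-4 the endpoints have degree 1, so an endpoint is eliminated first and no fill is created.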
M. Tuma 105
9. Parallel solvers of linear algebraic equations: IX.
MD: the basic algorithm: example.
(Figure: the graph G with a vertex v, and the elimination graph G_v in which the neighbors of v form a clique.)
M. Tuma 106
9. Parallel solvers of linear algebraic equations: X.
global reorderings: ND algorithm (George, 1973)
Find separator
Reorder the matrix numbering nodes in the separator last
Do it recursively
(Figure: a vertex separator S splitting the graph into components C_1 and C_2.)
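The three steps can be sketched as a recursive routine; `split` stands in for any separator-finding procedure, and `path_split` below is a hypothetical splitter used only for illustration (in a path graph the middle vertex is a minimal separator):

```python
def nested_dissection(vertices, split):
    """Recursive nested dissection ordering (sketch).

    split(vertices) must return (C1, C2, S): two pieces plus a
    separator S whose removal disconnects C1 from C2."""
    if len(vertices) <= 2:               # small piece: stop the recursion
        return list(vertices)
    c1, c2, s = split(vertices)
    return (nested_dissection(c1, split)
            + nested_dissection(c2, split)
            + list(s))                   # number the separator nodes last

# hypothetical splitter for a path graph v0 - v1 - ... - v(n-1)
def path_split(vs):
    m = len(vs) // 2
    return vs[:m], vs[m + 1:], [vs[m]]
```

At every level the separator ends up behind both of its components, which is exactly the ordering pattern of the grid example on the following slides.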
M. Tuma 107
9. Parallel solvers of linear algebraic equations: XI.
ND algorithm after one level of recursion
(Figure: after one level of recursion the permuted matrix is block-bordered: diagonal blocks for C_1 and C_2 with a border corresponding to S.)
M. Tuma 108
9. Parallel solvers of linear algebraic equations: XII.
ND algorithm after more levels of recursion
 1  7  4  43  22 28 25
 3  8  6  44  24 29 27
 2  9  5  45  23 30 26
19 20 21  46  40 41 42
10 16 13  47  31 37 34
12 17 15  48  33 38 36
11 18 14  49  32 39 35
M. Tuma 109
9. Parallel solvers of linear algebraic equations: XIII.
static reorderings: summary
the most useful strategy: combining local and global reorderings
modern nested dissections are based on graph partitioners: partition a graph such that the components have very similar sizes
separator is small
can be correctly formulated and solved for a general graph
theoretical estimates for fill-in and number of operations
modern local reorderings: used after a few steps of an incomplete nested dissection
M. Tuma 110
9. Parallel solvers of linear algebraic equations: XIV
Graph partitioning
The goal: separate a given graph into pieces of similar sizes having small separators.
TH: Let G = (V, E) be a planar graph. Then we can find a vertex separator S = (V_S, E_S) which divides V into two disjoint sets V_1 and V_2 such that max(|V_1|, |V_2|) ≤ (2/3)|V| and |V_S| ≤ 2√(2|V|).
Many different strategies for general cases
Recursive bisections or k-sections
Sometimes for weighted graphs
M. Tuma 111
9. Parallel solvers of linear algebraic equations: XV
Graph partitioning: classification of a few basic approaches
1. Kernighan-Lin algorithm
2. Level-structure partitioning
3. Inertial partitioning
4. Spectral partitioning
5. Multilevel partitioning
M. Tuma 112
9. Parallel solvers of linear algebraic equations: XVI.
Graph partitioning: Kernighan-Lin (1970)
Partitioning by local searches.
Often used for improving partitions provided by other algorithms.
More efficient implementation by Fiduccia and Mattheyses, 1982.
The intention
Start with a graph G = (V, E) with edge weights w : E → IR+ and a partitioning V = V_A ∪ V_B.
Find X ⊂ V_A and Y ⊂ V_B such that the new partition V = (V_A ∪ Y \ X) ∪ (V_B ∪ X \ Y) reduces the total cost of edges between V_A and V_B, given by

COST = ∑_{a ∈ V_A, b ∈ V_B} w(a, b).
M. Tuma 113
9. Parallel solvers of linear algebraic equations: XVII.
Graph partitioning: Kernighan-Lin: II.
Monitoring gains in COST when exchanging a pair of vertices: gain(a, b) for a ∈ V_A and b ∈ V_B is given by

gain(a, b) = E(a) − I(a) + E(b) − I(b) − 2w(a, b),

where E(x) and I(x) denote the external and internal cost of x ∈ V, respectively.
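The gain formula can be evaluated directly from an edge-weight map; a small sketch, assuming the partition is stored as a vertex-to-side dict and `w` holds both orientations of every weighted edge (missing pairs are non-edges):

```python
def kl_gain(a, b, part, w):
    """gain(a, b) = E(a) - I(a) + E(b) - I(b) - 2*w(a, b).

    part: dict vertex -> 'A' or 'B'; w: dict (u, v) -> weight, stored
    in both orientations; pairs absent from w have weight 0."""
    def ext(x):    # E(x): weight of edges from x to the other partition
        return sum(wt for (u, v), wt in w.items()
                   if u == x and part[v] != part[x])
    def intr(x):   # I(x): weight of edges from x inside its partition
        return sum(wt for (u, v), wt in w.items()
                   if u == x and part[v] == part[x])
    return ext(a) - intr(a) + ext(b) - intr(b) - 2 * w.get((a, b), 0)
```

For the 4-vertex graph with V_A = {1, 2}, V_B = {3, 4} and unit edges (1,4) and (2,3), exchanging 1 and 3 removes both cut edges, so gain(1, 3) = 2.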
(Figure: a ∈ V_A and b ∈ V_B; I(a) is the weight of edges from a staying inside V_A, E(a) the weight of edges from a crossing to V_B.)
M. Tuma 114
9. Parallel solvers of linear algebraic equations: XVIII.
Graph partitioning: Kernighan-Lin: III. The algorithm
Algorithm 2: Kernighan-Lin
compute COST of the initial partition
until GAIN ≤ 0
  for all nodes x compute E(x) and I(x)
  unmark all nodes
  while there are unmarked nodes do
    find a suitable pair a, b of vertices from different partitions maximizing gain(a, b)
    mark a, b
  end while
  find GAIN maximizing partial sums of gains computed in the loop
  if GAIN > 0 then update the partition
end until
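One pass of the loop body can be sketched as follows; for brevity this sketch recomputes the cut cost to obtain gains instead of maintaining E(x) and I(x) incrementally, so it is far from the efficient Fiduccia-Mattheyses variant and purely illustrative:

```python
def cut_cost(part, w):
    # total weight of edges crossing the partition (count each edge once)
    return sum(wt for (u, v), wt in w.items() if u < v and part[u] != part[v])

def kl_pass(part, w):
    """One Kernighan-Lin pass (sketch).

    part: dict vertex -> 'A'/'B'; w: dict (u, v) -> weight, both
    orientations present.  Returns (new partition, GAIN achieved)."""
    work = dict(part)
    A = [v for v in work if work[v] == 'A']
    B = [v for v in work if work[v] == 'B']
    seq = []                            # tentative exchanges with gains
    while A and B:
        base = cut_cost(work, w)
        def gain(a, b):                 # cut reduction if a and b swap
            work[a], work[b] = 'B', 'A'
            g = base - cut_cost(work, w)
            work[a], work[b] = 'A', 'B'
            return g
        a, b = max(((x, y) for x in A for y in B), key=lambda p: gain(*p))
        seq.append((a, b, gain(a, b)))
        work[a], work[b] = 'B', 'A'     # exchange tentatively, mark a, b
        A.remove(a)
        B.remove(b)
    # GAIN = the best partial sum of gains; keep that prefix of swaps
    best_k, best_gain, partial = 0, 0, 0
    for k, (_, _, g) in enumerate(seq, 1):
        partial += g
        if partial > best_gain:
            best_k, best_gain = k, partial
    new = dict(part)
    for a, b, _ in seq[:best_k]:
        new[a], new[b] = 'B', 'A'
    return new, best_gain
```

Accepting the best prefix of tentative swaps (rather than stopping at the first non-positive gain) is what lets the method climb out of some local minima.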
M. Tuma 115
9. Parallel solvers of linear algebraic equations: XIX.
Graph partitioning: Level structure algorithms
based on the breadth-first search
simple, but often not very good
can be improved, for example, by the KL algorithm
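A breadth-first level-structure bisection might look like this (the cut simply halves the BFS ordering; real codes cut between whole levels and balance more carefully):

```python
from collections import deque

def level_bisection(adj, root):
    """Level-structure bisection sketch: BFS levels from root, then
    cut the level structure so the two halves have similar sizes."""
    levels, frontier = {root: 0}, deque([root])
    while frontier:                         # breadth-first search
        u = frontier.popleft()
        for v in adj[u]:
            if v not in levels:
                levels[v] = levels[u] + 1
                frontier.append(v)
    order = sorted(levels, key=levels.get)  # vertices by BFS level
    half = len(order) // 2
    return order[:half], order[half:]
```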
M. Tuma 116
9. Parallel solvers of linear algebraic equations: XVI.
Graph partitioning: Inertial algorithm
deals with graphs and their coordinates
divides the set of graph nodes by a line (2D) or a plane (3D)
The strategy in 2D
Choose a line a(x − x0) + b(y − y0) = 0, a² + b² = 1. It has the slope −a/b and goes through (x0, y0).
Compute the distances c_i of the nodes (x_i, y_i) from the line.
Compute the distances d_i = a(y_i − y0) − b(x_i − x0) of the projections of the nodes (x_i, y_i) from (x0, y0).
Find the median of these distances d_i.
Divide the nodes according to this median into two groups.
How to choose the line?
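The 2D strategy can be sketched directly; this sketch obtains the principal-axis angle of the 2×2 moment matrix M in closed form via atan2 (an assumption of the sketch — the slides pose the choice of the line as a small eigenvalue problem):

```python
import math

def inertial_partition_2d(coords):
    """Inertial partitioning sketch (2D): project the nodes onto the
    principal axis of their moment matrix and split at the median."""
    n = len(coords)
    x0 = sum(x for x, _ in coords) / n          # centroid (x0, y0)
    y0 = sum(y for _, y in coords) / n
    sxx = sum((x - x0) ** 2 for x, _ in coords)
    syy = sum((y - y0) ** 2 for _, y in coords)
    sxy = sum((x - x0) * (y - y0) for x, y in coords)
    # principal-axis angle of M = [[sxx, sxy], [sxy, syy]] in closed form
    phi = 0.5 * math.atan2(2 * sxy, sxx - syy)
    a, b = -math.sin(phi), math.cos(phi)        # normal (a, b) of the line
    # d_i = a*(y_i - y0) - b*(x_i - x0): position of node i along the line
    d = [a * (y - y0) - b * (x - x0) for x, y in coords]
    med = sorted(d)[n // 2]                     # median distance
    left = [i for i in range(n) if d[i] < med]
    right = [i for i in range(n) if d[i] >= med]
    return left, right
```

For collinear points the chosen line is the one they lie on, and the split is simply by position along it.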
M. Tuma 118
9. Parallel solvers of linear algebraic equations: XVIII.
Graph partitioning: Inertial algorithm: III.
Some more explanation for the 2D case
Finding a line such that the sum of squares of the projections to it is minimized.
This is a total least squares problem.
Considering the nodes as mass units, the line taken as the axis should minimize the moment of inertia among all possible lines.
Mathematically:
∑_{i=1}^{n} c_i² = ∑_{i=1}^{n} ((x_i − x_0)² + (y_i − y_0)² − (a(y_i − y_0) − b(x_i − x_0))²) = (a, b) M (a, b)^T    (1)-(2)
That is, a small eigenvalue problem
M. Tuma 119
9. Parallel solvers of linear algebraic equations: XIX.
Spectral partitioning
DF: The Laplacian matrix of an undirected unweighted graph G = (V, E) is given by

L(G) = A^T A,

where A is its incidence (edge-by-vertex) matrix. Namely,

L(G)_ij = degree of node i for i = j,
          −1 for (i, j) ∈ E, i ≠ j,
          0 otherwise.

Then

x^T L x = x^T A^T A x = ∑_{(i,j)∈E} (x_i − x_j)².    (3)

L is positive semidefinite.
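The definition is easy to check numerically; a small sketch building L(G) densely and evaluating the quadratic form:

```python
def laplacian(n, edges):
    """Dense Laplacian of an undirected unweighted graph on nodes 0..n-1."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1               # degree of node i on the diagonal
        L[j][j] += 1
        L[i][j] -= 1               # -1 for each edge (i, j), i != j
        L[j][i] -= 1
    return L

def quad_form(L, x):
    """x^T L x, which equals the sum over edges of (x_i - x_j)^2."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))
```

For the path 0-1-2 and x = (1, 2, 4), both sides give (1-2)² + (2-4)² = 5.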
M. Tuma 120
9. Parallel solvers of linear algebraic equations: XX.
Spectral partitioning: examples of Laplacians
Example: the 5-node graph with edges (1,2), (1,3), (2,4), (3,4), (3,5), (4,5). With an incidence matrix A holding one ±1 pair per edge, L(G) = A^T A:

  2 −1 −1  0  0
 −1  2  0 −1  0
 −1  0  3 −1 −1
  0 −1 −1  3 −1
  0  0 −1 −1  2
M. Tuma 121
9. Parallel solvers of linear algebraic equations: XX.
Spectral partitioning
The Laplacian corresponding to the graph of a connected mesh has eigenvalue 0.
The eigenvector corresponding to this eigenvalue is (1, …, 1)^T/√n.
Denote by µ the second smallest eigenvalue of L(G). Then, from the Courant-Fischer theorem:

µ = min { x^T L x | x ∈ IR^n ∧ x^T x = 1 ∧ x^T (1, …, 1)^T = 0 }.    (4)

Let V be partitioned into V+ and V−, and let v be the vector with v(x) = 1 for x ∈ V+ and v(x) = −1 otherwise. TH: Then the number of edges connecting V+ and V− is (1/4) v^T L(G) v:

v^T L(G) v = ∑_{(i,j)∈E} (v_i − v_j)² = ∑_{(i,j)∈E, i∈V+, j∈V−} (v_i − v_j)² = 4 · (number of edges between V+ and V−).    (5)
M. Tuma 122
9. Parallel solvers of linear algebraic equations: XXI.
Spectral partitioning
Find the second eigenvector of the Laplacian
Dissect by its values
This is an approximation to the discrete optimization problem
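A self-contained sketch of spectral bisection; the Fiedler vector is computed by power iteration on cI − L with the all-ones direction projected out (a deliberately simple, slow substitute for a proper eigensolver such as Lanczos):

```python
import math

def fiedler_vector(n, edges, iters=3000):
    """Approximate eigenvector of the second smallest Laplacian
    eigenvalue via power iteration on c*I - L, keeping the iterates
    orthogonal to the all-ones vector (slow but dependency-free)."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    c = 2.0 * max(L[i][i] for i in range(n))    # c >= largest eigenvalue of L
    x = [math.sin(i + 1.0) for i in range(n)]   # arbitrary start vector
    for _ in range(iters):
        m = sum(x) / n
        x = [xi - m for xi in x]                # project out (1, ..., 1)
        y = [c * x[i] - sum(L[i][j] * x[j] for j in range(n))
             for i in range(n)]
        nrm = math.sqrt(sum(v * v for v in y))
        x = [v / nrm for v in y]
    return x

def spectral_bisection(n, edges):
    f = fiedler_vector(n, edges)
    med = sorted(f)[n // 2]                     # dissect by the values of f
    return ([i for i in range(n) if f[i] < med],
            [i for i in range(n) if f[i] >= med])
```

On a path graph the Fiedler vector is monotone along the path, so the bisection cuts the path in the middle.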
Multilevel partitioning: acceleration of basic procedures

Multilevel nested dissection
Multilevel spectral partitioning

Approximate the initial graph G by a (simpler, smaller, cheaper) graph G'
Partition G'
Refine the partition from G' to G
Perform these steps recursively.
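The coarsening step (approximate G by a smaller graph) is commonly done by edge matching; a sketch with a plain greedy matching on an unweighted graph (real multilevel codes such as METIS use heavy-edge matching and carry edge weights to the coarse graph):

```python
def coarsen(adj):
    """One level of coarsening by a greedy matching: each matched
    pair of vertices collapses into a single coarse vertex."""
    coarse_of = {}                  # fine vertex -> coarse vertex id
    cid = 0
    for u in adj:
        if u in coarse_of:
            continue
        # try to match u with a still-unmatched neighbour
        partner = next((v for v in adj[u] if v not in coarse_of), None)
        coarse_of[u] = cid
        if partner is not None:
            coarse_of[partner] = cid
        cid += 1
    # build the coarse adjacency, collapsing parallel edges
    cadj = {c: set() for c in range(cid)}
    for u, nbrs in adj.items():
        for v in nbrs:
            if coarse_of[u] != coarse_of[v]:
                cadj[coarse_of[u]].add(coarse_of[v])
    return cadj, coarse_of
```

Applied recursively this yields the hierarchy of graphs that is partitioned at the coarsest level and then refined (for example by KL passes) back to G.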
M. Tuma 123
9. Parallel solvers of linear algebraic equations: XXII.
Graph partitioning: problems with our model
Edge cuts are not proportional to the total communication volume
Latencies of messages typically more important than the volume
In many cases, a minmax problem should be considered (minimizing the maximum communication cost)
nonsymmetric partitions might be considered (bipartite graph model;hypergraph model)
general rectangular problem should be considered
partitioning in parallel (there are papers and codes)
M. Tuma 124
9. Parallel solvers of linear algebraic equations:
Iterative methods
Stationary iterative methods Used in some previous examples Typically not methods of choice Useful as auxiliary methods
![Page 203: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/203.jpg)
M. Tuma 124
9. Parallel solvers of linear algebraic equations:
Iterative methods
Stationary iterative methods Used in some previous examples Typically not methods of choice Useful as auxiliary methods
Krylov space methods see the course by Zdenek Strakos
![Page 204: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/204.jpg)
M. Tuma 124
9. Parallel solvers of linear algebraic equations:
Iterative methods
Stationary iterative methods Used in some previous examples Typically not methods of choice Useful as auxiliary methods
Krylov space methods see the course by Zdenek Strakos
Simple iterative schemes driven by data decomposition Schwarz methods
![Page 205: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/205.jpg)
M. Tuma 124
9. Parallel solvers of linear algebraic equations:
Iterative methods
Stationary iterative methods Used in some previous examples Typically not methods of choice Useful as auxiliary methods
Krylov space methods see the course by Zdenek Strakos
Simple iterative schemes driven by data decomposition Schwarz methods
Added hierarchical principle not treated here
![Page 206: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/206.jpg)
M. Tuma 125
9. Parallel solvers of linear algebraic equations:
Iterative Schwarz methods
Ω = ⋃i Ωi, d domains
Ωi ∩ Ωj ≠ ∅ : overlap
M. Tuma 126
9. Parallel solvers of linear algebraic equations:
Iterative Schwarz methods: II.
z(0), z(1), . . . : the sequence of iterates
A|Ωj = Aj = Rj^T A Rj
(Rj extracts columns from I corresponding to nodes in Ωj)
rj ≡ (b − Az(k))|Ωj = Rj^T (b − Az(k))
z(k+j/d) = z(k+(j−1)/d) + Rj Aj^−1 Rj^T (b − Az(k+(j−1)/d)) ≡ z(k+(j−1)/d) + Bj r(k+(j−1)/d),  j = 1, . . . , d
This is the Multiplicative Schwarz procedure. Less parallel, more powerful.
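As an illustration (Python/NumPy, not from the slides), a dense sketch of the multiplicative sweep above; the local solves with A restricted to each index set stand in for Bj = Rj Aj^−1 Rj^T:

```python
import numpy as np

def multiplicative_schwarz(A, b, subdomains, sweeps=50):
    """One-level multiplicative Schwarz: sweep sequentially over
    overlapping index sets, solving the local restriction
    A_j = R_j^T A R_j exactly at each step."""
    z = np.zeros_like(b)
    for _ in range(sweeps):
        for dom in subdomains:          # sequential over subdomains
            r = b - A @ z               # residual of the current iterate
            # B_j r = R_j A_j^{-1} R_j^T r; dom plays the role of R_j
            z[dom] += np.linalg.solve(A[np.ix_(dom, dom)], r[dom])
    return z

# 1D Poisson matrix, two overlapping subdomains (overlap on indices 3, 4)
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
doms = [np.arange(0, 5), np.arange(3, 8)]
z = multiplicative_schwarz(A, b, doms)
print(np.linalg.norm(b - A @ z))
```

Each subdomain correction uses the freshly updated residual, which is exactly what makes the method powerful but hard to parallelize.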
M. Tuma 127
9. Parallel solvers of linear algebraic equations:
Iterative Schwarz methods: III.
z(0), z(1), . . .
g groups of domains that do not overlap
x(k+1/g) = x(k) + Σ_{j∈group 1} Bj r(k)
x(k+2/g) = x(k+1/g) + Σ_{j∈group 2} Bj r(k+1/g)
. . .
x(k+1) = x(k+(g−1)/g) + Σ_{j∈group g} Bj r(k+(g−1)/g)
With a single group, all corrections use the same residual:
z(k+1) = z(k) + Σj Bj r(k)
This is the Additive Schwarz procedure. More parallel, less powerful.
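A matching sketch of the additive update (Python/NumPy, not from the slides). One assumption is added here that is not on the slide: a damping factor theta, since the plain sum of corrections need not contract on the overlap.

```python
import numpy as np

def additive_schwarz(A, b, subdomains, sweeps=200, theta=0.5):
    """One-level additive Schwarz as a stationary iteration,
    z <- z + theta * sum_j B_j r, with B_j = R_j A_j^{-1} R_j^T.
    All local solves use the SAME residual, so they can run in
    parallel; theta damps the doubled correction in the overlap."""
    z = np.zeros_like(b)
    for _ in range(sweeps):
        r = b - A @ z
        dz = np.zeros_like(b)
        for dom in subdomains:      # independent -> parallel
            dz[dom] += np.linalg.solve(A[np.ix_(dom, dom)], r[dom])
        z += theta * dz
    return z

# same 1D Poisson test problem as for the multiplicative variant
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
doms = [np.arange(0, 5), np.arange(3, 8)]
z = additive_schwarz(A, b, doms)
print(np.linalg.norm(b - A @ z))
```

The loop over subdomains has no data dependence, which is the whole point: the corrections can be computed concurrently and summed afterwards.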
M. Tuma 128
9. Parallel solvers of linear algebraic equations:
FETI (Finite Element Tearing and Interconnecting)
non-overlapping domain decomposition scheme
numerically scalable for a wide class of PDE problems (e.g., some 2nd-order elasticity, plate and shell problems)
successful parallel implementations
problem: the domain matrices need not be regular (nonsingular)
here, an example for two subdomains
M. Tuma 129
9. Parallel solvers of linear algebraic equations:
FETI (Finite Element Tearing and Interconnecting): II.
(Figure: two subdomains Ω(1), Ω(2) joined along the interface ΓI)

K(1) u(1) = f(1) + B(1)^T λ
K(2) u(2) = f(2) + B(2)^T λ
B(1) u(1) = B(2) u(2)
M. Tuma 130
9. Parallel solvers of linear algebraic equations:
FETI (Finite Element Tearing and Interconnecting): III.
K(1) u(1) = f(1) + B(1)^T λ
K(2) u(2) = f(2) + B(2)^T λ
B(1) u(1) = B(2) u(2)

If we can substitute (i.e., if K(1), K(2) are nonsingular), we get

u(1) = K(1)^−1 (f(1) + B(1)^T λ)
u(2) = K(2)^−1 (f(2) + B(2)^T λ)

(B(1) K(1)^−1 B(1)^T + B(2) K(2)^−1 B(2)^T) λ = B(1) K(1)^−1 f(1) + B(2) K(2)^−1 f(2)

In general (K(i)^+ a pseudoinverse, R(i) spanning the null space of K(i)) we have

u(1) = K(1)^+ (f(1) + B(1)^T λ) + R(1) α
u(2) = K(2)^+ (f(2) + B(2)^T λ) + R(2) α

( B(1)K(1)^+B(1)^T + B(2)K(2)^+B(2)^T   −B(2)R(2) ) ( λ )   ( B(1)K(1)^+f(1) + B(2)K(2)^+f(2) )
( −R(2)^T B(2)^T                         0        ) ( α ) = ( −R(2)^T f(2)                    )

the interface system
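A numerical sketch of the two-subdomain dual (interface) system in the nonsingular case (Python/NumPy, not from the slides; all matrices and data are made up). A sign convention has to be fixed; here K(i) u(i) = f(i) − B(i)^T λ with the constraint B(1)u(1) + B(2)u(2) = 0, which yields the same dual operator F = Σ B(i) K(i)^−1 B(i)^T:

```python
import numpy as np

# two small SPD "subdomain stiffness" matrices (nonsingular case)
K1 = np.array([[4.0, -1.0], [-1.0, 3.0]])
K2 = np.array([[5.0, -2.0], [-2.0, 4.0]])
f1 = np.array([1.0, 2.0])
f2 = np.array([3.0, 1.0])
# one interface dof: last dof of subdomain 1 glued to first dof of
# subdomain 2; signed Boolean matrices so that B1 u1 + B2 u2 = 0
B1 = np.array([[0.0, 1.0]])
B2 = np.array([[-1.0, 0.0]])

# dual (interface) system for the Lagrange multiplier lambda
F = B1 @ np.linalg.solve(K1, B1.T) + B2 @ np.linalg.solve(K2, B2.T)
d = B1 @ np.linalg.solve(K1, f1) + B2 @ np.linalg.solve(K2, f2)
lam = np.linalg.solve(F, d)

# recover the subdomain solutions; the glued dofs must coincide
u1 = np.linalg.solve(K1, f1 - B1.T @ lam)
u2 = np.linalg.solve(K2, f2 - B2.T @ lam)
print(u1[1], u2[0])
```

Only actions of K(i)^−1 are needed, so each subdomain solve stays local to its process; the dual system couples them only through the interface multiplier.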
![Page 228: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/228.jpg)
M. Tuma 131
9. Parallel solvers of linear algebraic equations:
FETI (Finite Element Tearing and Interconnecting): IV.
( FI      −GI ) ( λ )   (  d )
( −GI^T     0 ) ( α ) = ( −e )

Solution: a general augmented system (constrained minimization)
conjugate gradients projected to the null-space of GI^T
initial λ satisfying the constraint, e.g., λ(0) = GI (GI^T GI)^−1 e
explicit projector P = I − GI (GI^T GI)^−1 GI^T
reorthogonalizations
closely related method: balanced DD (balancing residuals by adding a coarse problem from equilibrium conditions for possibly singular problems; Mandel, 1993)
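A small check (Python/NumPy, not from the slides) of the two ingredients above, the projector onto the null-space of GI^T and an initial λ satisfying GI^T λ = e, with a random full-column-rank matrix standing in for GI:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((10, 3))   # stand-in for G_I (full column rank)
e = np.ones(3)

# explicit projector onto the null-space of G^T
P = np.eye(10) - G @ np.linalg.solve(G.T @ G, G.T)
# initial lambda satisfying the constraint G^T lam0 = e
lam0 = G @ np.linalg.solve(G.T @ G, e)

x = rng.standard_normal(10)
print(np.linalg.norm(G.T @ (P @ x)))   # ~0: projected vectors satisfy G^T v = 0
print(np.linalg.norm(G.T @ lam0 - e))  # ~0: the constraint holds
```

In practice GI^T GI is small (one row/column per floating subdomain) and is factorized once, so applying P inside each CG iteration is cheap.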
M. Tuma 132
9. Parallel solvers of linear algebraic equations: XXII.
”Universal” matrix operation – parallel aspects: 1.
Matrix, of course, sparse
Matrix should be distributed, based, e.g., on a distributed read of the row lengths.
first step: what are my rows?

      do i=1,n
        find start of the row i
        find end of the row i
        compute the length of the row
      end do

parallel gather / parallel sort / parallel merge
finally, the processes know which rows are theirs
at least at the beginning: static load balancing
for example: cyclic distribution of matrix rows into groups of approximately nnz(A)/p nonzeros.
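A sketch of this last step (Python, not from the slides): splitting rows into p groups of roughly nnz(A)/p nonzeros each, given the row lengths gathered above. This variant assigns contiguous blocks; the cyclic distribution mentioned on the slide hands out the same shares round-robin.

```python
def distribute_rows(row_lengths, p):
    """Greedy contiguous split of rows into p groups of roughly
    nnz(A)/p nonzeros each (static load balancing)."""
    total = sum(row_lengths)
    target = total / p
    groups, cur = [], []
    acc = 0
    for i, ln in enumerate(row_lengths):
        cur.append(i)
        acc += ln
        # close a group once the cumulative share is reached,
        # keeping at least one group for the remaining rows
        if acc >= target * (len(groups) + 1) and len(groups) < p - 1:
            groups.append(cur)
            cur = []
    groups.append(cur)
    return groups

print(distribute_rows([4, 1, 1, 4, 2, 2, 4, 2], 2))
# -> [[0, 1, 2, 3], [4, 5, 6, 7]]  (10 nonzeros per group)
```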
M. Tuma 133
9. Parallel solvers of linear algebraic equations: XXII.
”Universal” matrix operation – parallel aspects: 2.
Natural assumption: the matrix is processed only as distributed.
second step: distributed read
in MPI: all processors check for their rows concurrently

      do i=1,n
        if this is my row then
          find start of the row i
          find end of the row i
          read / process the row: if (myid.eq.xxx) then read
        end if
      end do
M. Tuma 134
9. Parallel solvers of linear algebraic equations: XXII.
”Universal” matrix operation – parallel aspects: 3.
How efficiently merge sets of sparse vectors?
entries stored with local indices

(Figure: two rows with global indices 1 3 11 13 and 1 4 9 11 from the global range 1–16, each renumbered by consecutive local indices)
M. Tuma 135
9. Parallel solvers of linear algebraic equations: XXII.
”Universal” matrix operation – parallel aspects: 4.
(Figure: the global indices 1 3 11 13 and 1 4 9 11 with their local numberings, as on the previous slide)

local to global mapping: direct indexing
global to local mapping: e.g., hash tables
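The two mappings can be sketched directly (Python, not from the slides; 0-based local indices, global indices from the example above):

```python
# local-to-global: a plain array, indexed directly
loc2glob = [1, 3, 11, 13]            # global indices owned locally

# global-to-local: a hash table (Python dict)
glob2loc = {g: l for l, g in enumerate(loc2glob)}

print(loc2glob[2])        # local 2 -> global 11
print(glob2loc[11])       # global 11 -> local 2
print(glob2loc.get(7))    # global index not owned locally -> None
```

Direct indexing is O(1) and cache-friendly; the hash table answers the reverse question "do I own this global index, and where?" without storing an array of the full global dimension.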
M. Tuma 136
9. Parallel solvers of linear algebraic equations: XXIII.
Sparse matrix-matrix multiplications
A natural routine when dealing with more blocks
Useful even for forming Schur complements in the sequential case
case 1: C = AB, all stored by rows
case 2: C − AB, A stored by columns, B stored by rows

c
c -- clear wn01; set links
c
      do i=1,max(p,n)
        wn01(i)=0
        link(i)=0
        head(i)=0
        first(i)=ia(i)
      end do
M. Tuma 137
9. Parallel solvers of linear algebraic equations: XXIV
Sparse matrix-matrix multiplications: II.
c
c -- initialize pointers first
c
      do i=1,p
        j=first(i)
        if(j.lt.ia(i+1)) then
          k=ja(j)-shift
          if(head(k).eq.0) then
            link(i)=0
          else
            link(i)=head(k)
          end if
          head(k)=i
        end if
      end do
      indc=1
      ic(1)=indc
M. Tuma 138
9. Parallel solvers of linear algebraic equations: XXV
Sparse matrix-matrix multiplications: III.
c
c -- loop over the rows of a
c
      do i=1,m
        newj=head(i)
        ind2=0
 200    continue
        j=newj
        if(j.eq.0) go to 400
        newj=link(j)
        jfirst=first(j)
        first(j)=jfirst+1
M. Tuma 139
9. Parallel solvers of linear algebraic equations: XXVI.
Sparse matrix-matrix multiplications: IV.
c
c -- if indices of the j-th column are not processed
c
        if(jfirst+1.lt.ia(j+1)) then
          l=ja(jfirst+1)-shift
          if(head(l).eq.0) then
            link(j)=0
          else
            link(j)=head(l)
          end if
          head(l)=j
        end if
M. Tuma 140
9. Parallel solvers of linear algebraic equations: XXVII.
Sparse matrix-matrix multiplications: V.
c
c -- coded loop: search through the row of b
c
        temp=aa(jfirst)
        kstrt=ib(j)
        kstop=ib(j+1)-1
c
c -- search the row of b
c
        do k=kstrt,kstop
          k1=jb(k)
          if(wn01(k1).eq.0) then
            ind2=ind2+1
            wn02(ind2)=k1
            wr02(ind2)=temp*ab(k)
            wn01(k1)=ind2
          else
            wr02(wn01(k1))=wr02(wn01(k1))+temp*ab(k)
          end if
        end do
M. Tuma 141
9. Parallel solvers of linear algebraic equations: XXVIII.
Sparse matrix-matrix multiplications: VI.
c
c -- end of the coded loop in j
c
        go to 200
 400    continue
c
c -- rewrite indices and elements to ic/jc/ac
c
        do j=1,ind2
          k=wn02(j)
          jc(indc)=k
          wn01(k)=0
          ac(indc)=wr02(j)
          indc=indc+1
        end do
        ic(i+1)=indc
      end do
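For reference, a compact Python sketch of the same row-merge idea for case 1 (all matrices in 0-based CSR, not from the slides). The dict plays the role of the work arrays wn01/wn02/wr02; the Fortran listing additionally maintains the head/link lists needed for column-wise access to A.

```python
def spmm_csr(n, ia, ja, aa, ib, jb, ab):
    """Row-by-row sparse product C = A*B in CSR-like lists (0-based).
    For each row i of A, the rows B[j] with a_ij != 0 are merged
    through a column -> position table, as in the Fortran code above."""
    ic, jc, ac = [0], [], []
    pos = {}                      # column -> position in current row of C
    for i in range(n):
        pos.clear()
        for jp in range(ia[i], ia[i + 1]):
            j, a_ij = ja[jp], aa[jp]
            for kp in range(ib[j], ib[j + 1]):
                k, b_jk = jb[kp], ab[kp]
                if k in pos:                  # column seen: accumulate
                    ac[pos[k]] += a_ij * b_jk
                else:                         # new column in this row of C
                    pos[k] = len(jc)
                    jc.append(k)
                    ac.append(a_ij * b_jk)
        ic.append(len(jc))
    return ic, jc, ac

# A = [[1, 2], [0, 3]], B = [[0, 1], [4, 0]]  ->  C = A*B = [[8, 1], [12, 0]]
ic, jc, ac = spmm_csr(2, [0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0],
                      [0, 1, 2], [1, 0], [1.0, 4.0])
print(ic, jc, ac)
```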
M. Tuma 142
9. Parallel solvers of linear algebraic equations: XXIX.
Preconditioners: approximations M to A: M ≈ A
Within a stationary (linear, consistent) method: need to solve a system with M

x+ = x − M^−1 (Ax − b)   (6)

Desired properties of M:
good approximation to A
  in the sense of a norm of (M − A)
  in the sense of a norm of (I − M^−1 A)
  if factorized, then with stable factors
systems with M should be easy to solve
applicable to a wide spectrum of computer architectures
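A minimal sketch of iteration (6) (Python/NumPy, not from the slides), with the Jacobi choice M = diag(A) as an illustrative, easy-to-solve M:

```python
import numpy as np

def stationary(A, b, M_inv, iters=100):
    """Stationary iteration x+ = x - M^{-1}(Ax - b) from eq. (6);
    M_inv applies the action of M^{-1} to a vector."""
    x = np.zeros_like(b)
    for _ in range(iters):
        x = x - M_inv(A @ x - b)
    return x

# small diagonally dominant SPD test matrix
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])

d = np.diag(A)                      # Jacobi: M = diag(A)
x = stationary(A, b, lambda r: r / d)
print(np.linalg.norm(A @ x - b))
```

Only the action of M^−1 on a vector is needed, which is exactly the second "plug-in" approach below: the same routine works unchanged inside non-stationary (Krylov) methods.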
M. Tuma 143
9. Parallel solvers of linear algebraic equations: XXX.
Having M as a preconditioner in (1) is equivalent to transforming the linear system to

M^−1 A x = M^−1 b   (preconditioner applied from left)

Other transformations are possible, obtained by a change of variables and/or by supporting matrix symmetry:

A M^−1 y = b   (preconditioner applied from right)
M1^−1 A M2^−1 x = M1^−1 b   (split preconditioner)

In all these cases, the corresponding stationary iterations can be written down.
Two basic approaches how to plug in preconditioning:
Write the recursions directly for the transformed system; mostly in case of stationary iterative methods.
Use it only inside a procedure that computes M^−1 z (or similar operations) for a given z. This is more flexible and useful also for non-stationary iterative methods.
M. Tuma 144
10. Approximate inverse preconditioners: I.
M ≈ A−1
![Page 258: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/258.jpg)
M. Tuma 144
10. Approximate inverse preconditioners: I.
M ≈ A^{-1}
(Figure: an example grid in which a vertex separator S splits the mesh into components C_1 and C_2, illustrating the local character of fill-in.)
M. Tuma 145
10. Approximate inverse preconditioners: II.
Some properties of approximate inverses:
Fill-in is not only local
Even stronger pressure to stay sparse
Provide reasonably precise information on the exact matrix inverse
Explicit – potential for parallelism

Why the fill-in may be non-local:
A −→ nonzeros determined by the edges of the adjacency graph G(A)
A^{-1} −→ nonzeros determined by the transitive closure of G(A) (paths in G(A) ↔ edges in the transitive closure)
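The non-locality can be seen on a toy case (an illustration assumed here, not the slides' grid example): the adjacency graph of a tridiagonal matrix is a path, every pair of vertices is joined by a path, so the transitive closure is complete and A^{-1} is structurally dense.

```python
import numpy as np

# Tridiagonal SPD matrix: its adjacency graph G(A) is a path graph
n = 8
A = (np.diag(2.0 * np.ones(n))
     + np.diag(-1.0 * np.ones(n - 1), 1)
     + np.diag(-1.0 * np.ones(n - 1), -1))

nnz_A = np.count_nonzero(A)                       # 3n - 2 = 22 nonzeros
Ainv = np.linalg.inv(A)
nnz_inv = np.count_nonzero(np.abs(Ainv) > 1e-12)  # all n^2 = 64 entries
print(nnz_A, nnz_inv)
```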
M. Tuma 146
10. Approximate inverse preconditioners: III.
Summarizing motivation:
Approximate inverses have specific features not shared by other preconditioners.
AI are sometimes quite efficient as preconditioners and can help to solve some hard problems.
AI can lead to the development of other algorithms.
They are especially helpful on parallel computer architectures.
Many of their features still have to be developed.

In short (in PDE terms): the hope is that approximate inverses capture also some basic non-local features of discrete Green's functions.
M. Tuma 147
10. Approximate inverse preconditioners: IV.
Some basic techniques:

Frobenius norm minimization (Benson, 1973):
minimize F_W(X, A) = ‖I − XA‖²_W = tr[(I − XA) W (I − XA)^T]

Global matrix iterations (Schulz, 1933):
iterate G_{i+1} = G_i (2I − A G_i)

A-orthogonalization (Benzi, T., 1996):
get W, Z, D from Z^T A W = D, i.e. A^{-1} = W D^{-1} Z^T

Approximate inverses as auxiliary procedures, e.g. in block algorithms (Axelsson, Brinkkemper, Il'in, 1984; Concus, Golub, Meurant, 1985)
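The Schulz iteration can be tried directly; a small self-contained sketch (the starting guess G_0 = A^T / (‖A‖_1 ‖A‖_∞) is a standard safe choice assumed here, and the sparsification used in actual preconditioning practice is omitted):

```python
import numpy as np

def schulz_inverse(A, iters=20):
    """Newton-Schulz iteration G_{i+1} = G_i (2I - A G_i); converges
    quadratically to A^{-1} once ||I - A G_0|| < 1."""
    I = np.eye(A.shape[0])
    G = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))  # safe G_0
    for _ in range(iters):
        G = G @ (2.0 * I - A @ G)
    return G

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
G = schulz_inverse(A)
```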
M. Tuma 148
10. Approximate inverse preconditioners: V.
Other approaches
Approximate inverse smoothers in geometric and algebraic multigrid (Chow, 2000; Tang, Wan, 2000; Bröker, Grote, Mayer, Reusken, 2002; Bröker, Grote, 2002)

Inverted direct incomplete decompositions (Alvarado, Dag, 1992)

Approximate inverses by bordering (Saad, 1996):

⎡  Z^T  0 ⎤ ⎡ A    v ⎤ ⎡ Z  −y ⎤   ⎡ D  0 ⎤
⎣ −y^T  1 ⎦ ⎣ v^T  α ⎦ ⎣ 0   1 ⎦ = ⎣ 0  δ ⎦

Sherman-Morrison formula based preconditioners (Bru, Cerdán, Marín, Mas, 2002)
M. Tuma 149
10. Approximate inverse preconditioners: VI.
Frobenius norm minimization: special cases I.
Least-squares approximate inverse (LS): W = I (Benson, 1973)

Minimize F_I(X, A) = ‖I − XA‖²_F = Σ_{i=1}^n ‖e_i^T − x_i A‖²_2,

where the x_i are the rows of X. This leads to n simple independent least-squares problems.

Direct block method (DB): W = A^{-1} (Benson, 1973):

Solve [GA]_{ij} = δ_{ij} for (i, j) ∈ S,

where S is the prescribed sparsity pattern of the inverse.

In both LS and DB a sparsity pattern must be assumed in advance.
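A small dense sketch of the LS case (the choice of the pattern of A itself as the sparsity pattern, and the test matrix, are assumptions of this example): each row of X comes from an independent small least-squares problem.

```python
import numpy as np

def ls_approximate_inverse(A, pattern):
    """Row i of X minimizes ||e_i^T - x_i A||_2 with x_i supported on pattern[i]."""
    n = A.shape[0]
    X = np.zeros((n, n))
    for i in range(n):
        J = pattern[i]
        # (x_i A)^T = A[J, :]^T x_i[J]^T, so solve a small LS problem
        y, *_ = np.linalg.lstsq(A[J, :].T, np.eye(n)[i], rcond=None)
        X[i, J] = y
    return X

A = (np.diag(3.0 * np.ones(6))
     + np.diag(-1.0 * np.ones(5), 1)
     + np.diag(-1.0 * np.ones(5), -1))
pattern = [np.nonzero(A[i])[0] for i in range(6)]  # pattern of A as a heuristic
X = ls_approximate_inverse(A, pattern)
```

Since the n row problems are independent, they can be solved fully in parallel, which is one source of the parallel appeal of this family.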
M. Tuma 150
10. Approximate inverse preconditioners: VII.
Frobenius norm minimization: special cases II.
Changing sparsity patterns in outer iterations: SPAI (Cosgrove, Díaz, Griewank, 1992; Grote, Huckle, 1997)
Evaluates a new pattern by estimating the norms of possible new residuals
More exact evaluations of the residuals (Gould, Scott, 1995)
Procedurally parallel, but data parallelism is difficult
Needs high-quality pattern predictions (Huckle, 1999, 2001; Chow, 2000)

Simple stationary iterative method for the individual columns c_i, approximately solving A c_i = e_i (Chow, Saad, 1994)
Simple, but not very efficient
"Gauss-Seidel" variant: sometimes much better, sometimes much worse
M. Tuma 151
10. Approximate inverse preconditioners: VIII.
Frobenius norm minimization: special cases III.
Factorized inverse preconditioners based on approximate Frobenius norm minimization for SPD matrices (Kolotilina, Yeremin, 1993)

Z = argmin_{X∈S} F_I(X^T, L) = argmin_{X∈S} ‖I − X^T L‖²_F, where A = LL^T.

The procedure: first get Z from the problem

‖I − X^T L‖²_F = Σ_{i=1}^n ‖e_i^T − x_i^T L‖²_2,

which can be solved without forming L explicitly; then set D = (diag(Z))^{-1} and Z ← Z D^{1/2}.
Then A^{-1} ≈ Z Z^T.

Extended to the nonsymmetric case
Rather robust, often underestimated
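A dense sketch of this factorized construction for SPD A (the pattern, taken as the lower triangle of A, and the test matrix are assumptions of this example). Each column of Z needs only a solve with a small submatrix of A, so the Cholesky factor L is never formed; after the diagonal scaling above, Z^T A Z has an exactly unit diagonal.

```python
import numpy as np

def factorized_approx_inverse(A, patterns):
    """patterns[i]: indices j <= i allowed in column i of Z (must contain i)."""
    n = A.shape[0]
    Z = np.zeros((n, n))
    for i in range(n):
        J = list(patterns[i])
        m = J.index(i)
        # small SPD solve with the submatrix A[J, J]; L never appears
        Z[J, i] = np.linalg.solve(A[np.ix_(J, J)], np.eye(len(J))[m])
    d = np.diag(Z).copy()           # D = (diag Z)^{-1}
    return Z / np.sqrt(d)[None, :]  # Z <- Z D^{1/2}; then A^{-1} ~ Z Z^T

A = (np.diag(2.0 * np.ones(6))
     + np.diag(-1.0 * np.ones(5), 1)
     + np.diag(-1.0 * np.ones(5), -1))
patterns = [[j for j in range(i + 1) if A[i, j] != 0.0] for i in range(6)]
Z = factorized_approx_inverse(A, patterns)
```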
M. Tuma 152
10. Approximate inverse preconditioners: IX.
A-orthogonalization: AINV
For an SPD matrix A: find an upper triangular matrix Z and a diagonal matrix D such that

Z^T A Z = D → A^{-1} = Z D^{-1} Z^T (7)

The algorithm: conjugate Gram-Schmidt, i.e. Gram-Schmidt with a different inner product (x, y)_A
Origins of A-orthogonalization for solving linear systems in several papers from the 1940s
A more detailed treatment of A-orthogonalization in the first Wilkinson paper (with Fox and Huskey, 1948)
Extended to the nonsymmetric case
Breakdown-free modification for SPD A (Benzi, Cullum, T., 2001)
M. Tuma 153
10. Approximate inverse preconditioners: X.
A-orthogonalization: AINV

Algorithm H-S I.

z_i = e_i − Σ_{k=1}^{i−1} [(e_i^T A z_k) / (z_k^T A z_k)] z_k,  i = 1, ..., n;  Z = [z_1, ..., z_n]

left-looking
stabilized diagonal entries (in exact arithmetic e_i^T A z_k ≡ z_i^T A z_k for i ≤ k)
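The recurrence can be coded directly; a dense left-looking sketch without dropping (dropping small entries of the z_i during the process is what turns this into the incomplete AINV preconditioner; the small test matrix is an assumption of this example):

```python
import numpy as np

def a_orthogonalize(A):
    """Exact H-S I recurrence: the columns z_i of Z are made A-orthogonal,
    so that Z^T A Z = D is diagonal and hence A^{-1} = Z D^{-1} Z^T."""
    n = A.shape[0]
    Z = np.eye(n)                  # z_i initialized to e_i
    for i in range(1, n):
        for k in range(i):         # left-looking sweep over finished columns
            # coefficient e_i^T A z_k / z_k^T A z_k from the recurrence
            coeff = (A[i, :] @ Z[:, k]) / (Z[:, k] @ A @ Z[:, k])
            Z[:, i] -= coeff * Z[:, k]
    d = np.array([Z[:, i] @ A @ Z[:, i] for i in range(n)])
    return Z, d

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
Z, d = a_orthogonalize(A)
```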
M. Tuma 154
10. Approximate inverse preconditioners: XI.
Possibility of breakdowns AINV: modified incomplete A-orthogonalization; exists for M-matrices
and H-matrices (Benzi, Meyer, T., 1996)
Possibility of breakdown in A-orthogonalization for non-H matrices
Possibly poor approximate inverses for these matrices
A-orthogonalization in historical perspective
Fox, Huskey, Wilkinson, 1948: H-S I.
![Page 314: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/314.jpg)
M. Tuma 154
10. Approximate inverse preconditioners: XI.
Possibility of breakdowns AINV: modified incomplete A-orthogonalization; exists for M-matrices
and H-matrices (Benzi, Meyer, T., 1996)
Possibility of breakdown in A-orthogonalization for non-H matrices
Possibly poor approximate inverses for these matrices
A-orthogonalization in historical perspective
Fox, Huskey, Wilkinson, 1948: H-S I.
Escalator method by Morris, 1946: a variation of H-S I. (non-stabilizedcomputation of D)
![Page 315: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/315.jpg)
M. Tuma 154
10. Approximate inverse preconditioners: XI.
Possibility of breakdowns AINV: modified incomplete A-orthogonalization; exists for M-matrices
and H-matrices (Benzi, Meyer, T., 1996)
Possibility of breakdown in A-orthogonalization for non-H matrices
Possibly poor approximate inverses for these matrices
A-orthogonalization in historical perspective
Fox, Huskey, Wilkinson, 1948: H-S I.
Escalator method by Morris, 1946: a variation of H-S I. (non-stabilizedcomputation of D)
Vector method by Purcell, 1952: basically H-S II.
![Page 316: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/316.jpg)
M. Tuma 154
10. Approximate inverse preconditioners: XI.
Possibility of breakdowns AINV: modified incomplete A-orthogonalization; exists for M-matrices
and H-matrices (Benzi, Meyer, T., 1996)
Possibility of breakdown in A-orthogonalization for non-H matrices
Possibly poor approximate inverses for these matrices
A-orthogonalization in historical perspective
Fox, Huskey, Wilkinson, 1948: H-S I.
Escalator method by Morris, 1946: a variation of H-S I. (non-stabilizedcomputation of D)
Vector method by Purcell, 1952: basically H-S II.
Approximate inverse by bordering (Saad, 1996) is equivalent to H-S I.(Benzi, T. (2002))
![Page 317: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/317.jpg)
M. Tuma 154
10. Approximate inverse preconditioners: XI.
Possibility of breakdowns AINV: modified incomplete A-orthogonalization; exists for M-matrices
and H-matrices (Benzi, Meyer, T., 1996)
Possibility of breakdown in A-orthogonalization for non-H matrices
Possibly poor approximate inverses for these matrices
A-orthogonalization in historical perspective
Fox, Huskey, Wilkinson, 1948: H-S I.
Escalator method by Morris, 1946: a variation of H-S I. (non-stabilizedcomputation of D)
Vector method by Purcell, 1952: basically H-S II.
Approximate inverse by bordering (Saad, 1996) is equivalent to H-S I.(Benzi, T. (2002))
Bridson, Tang, 1998 – (nonsymmetric) algorithms equivalent to H-S I.
![Page 318: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/318.jpg)
M. Tuma 155
10. Approximate inverse preconditioners: XII.
Other possible stabilization attempts:
Pivoting
Look-ahead
DCR
Block algorithms
M. Tuma 156
11. Polynomial preconditioners: I.
The problem
Find a preconditioner M such that M^{-1} is a polynomial in A of a given degree k, that is,
M^{-1} = P_k(A) = Σ_{j=0}^{k} α_j A^j.
First proposed by Cesari, 1937 (for Richardson iteration)
Naturally motivated, since by the Cayley-Hamilton theorem we have
Q_k(A) ≡ Σ_{j=0}^{k} β_j A^j = 0
for the characteristic polynomial of A, k ≤ n.
Therefore, we have
A^{-1} = −(1/β_0) Σ_{j=1}^{k} β_j A^{j−1}
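Such a preconditioner is never formed as an explicit matrix: applying M^{-1} to a vector needs only k matrix-vector products with A, e.g. via Horner's scheme. A minimal numpy sketch (the coefficient values below are arbitrary illustration values, not a recommended choice):

```python
import numpy as np

def apply_poly_prec(matvec, alpha, v):
    """Evaluate M^{-1} v = sum_{j=0}^k alpha_j A^j v by Horner's scheme,
    using k matrix-vector products and never forming P_k(A) explicitly."""
    y = alpha[-1] * v
    for a_j in reversed(alpha[:-1]):
        y = matvec(y) + a_j * v    # y <- A y + alpha_j v
    return y

# Toy check with A = 2I and P_2(t) = 1 + 0.5 t + 0.25 t^2, so P_2(2) = 3.
A = 2.0 * np.eye(3)
v = np.ones(3)
y = apply_poly_prec(lambda x: A @ x, [1.0, 0.5, 0.25], v)
```

Only `matvec` is needed, which is why polynomial preconditioning combines naturally with matrix-free computations.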
M. Tuma 157
11. Polynomial preconditioners: II.
Polynomial preconditioners and Krylov space methods
Recall: CG forms an approximation to the solution vector for k ≥ 1,
x_{k+1} = x_0 + P_k(A) r_0
The polynomial P_k(A) is optimal in minimizing
(x − x*)^T A (x − x*)
among all polynomials of degree at most k with P_k(0) = 1.
Therefore, why polynomial preconditioners?
The number of CG iterations can still be decreased
Can be useful when the bottleneck is in scalar products, message passing, or the memory hierarchy; can strongly enhance vector processing
Simplicity, matrix-free computations
M. Tuma 158
11. Polynomial preconditioners: III.
Basic classes of polynomial preconditioners: I.
Neumann series preconditioners for SPD systems (Dubois, Greenbaum, Rodrigue, 1979)
Let A = M_1 − N such that M_1 is nonsingular and G = M_1^{-1} N satisfies ρ(G) < 1. Then
A^{-1} = (I − G)^{-1} M_1^{-1} = ( Σ_{j=0}^{∞} G^j ) M_1^{-1}
The preconditioner: truncating the series,
M^{-1} = ( Σ_{j=0}^{k} G^j ) M_1^{-1}, k > 0
Preconditioners P_k of odd degree are sufficient (P_{k+1} is not more efficient than P_k)
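The truncation above can be sketched in a few lines of numpy; taking M_1 = diag(A) (a Jacobi splitting) is an assumption made only for this illustration:

```python
import numpy as np

def neumann_prec(A, k):
    """Truncated Neumann series preconditioner M^{-1} = (sum_{j=0}^k G^j) M1^{-1}
    for the splitting A = M1 - N with M1 = diag(A) (a Jacobi splitting, chosen
    here only for illustration), where G = M1^{-1} N = I - M1^{-1} A."""
    d = np.diag(A)
    def apply(r):
        z = r / d                   # z = M1^{-1} r
        y = z.copy()                # accumulates sum_{j=0}^k G^j z
        v = z
        for _ in range(k):
            v = v - (A @ v) / d     # v <- G v
            y = y + v
        return y
    return apply

# Toy check: as k grows, M^{-1} b approaches A^{-1} b (here rho(G) < 1).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_true = np.linalg.solve(A, b)
err0 = np.linalg.norm(neumann_prec(A, 0)(b) - x_true)
err6 = np.linalg.norm(neumann_prec(A, 6)(b) - x_true)
```

Each extra term of the series costs one more product with A, which is the whole cost of raising the degree.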
M. Tuma 159
11. Polynomial preconditioners: IV.
Basic classes of polynomial preconditioners: II.
Generalized Neumann series preconditioners for SPD systems (Johnson, Micchelli, Paul, 1983)
Parametrizing the approximate inverse (I − G)^{-1} as
I + γ_1 G + γ_2 G^2 + . . . + γ_k G^k,
the added degrees of freedom may be used to optimize the approximation to A^{-1}
Let
R_k = { R_k | R_k(0) = 0; R_k(λ) > 0 ∀ λ from an inclusion set IS }
Find the R_k in this set with the minimal maximum value on the inclusion set: the minmax polynomial.
Apply this to the residual polynomials 1 − Q_k(λ) = 1 − λ P_k(λ) to obtain the polynomial preconditioner P_k.
This polynomial can be expressed in terms of Chebyshev polynomials of the first kind
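A minmax (Chebyshev) polynomial preconditioner is usually applied through the classical three-term Chebyshev recurrence rather than through explicit coefficients. A sketch of one standard form of that recurrence, assuming eigenvalue bounds 0 < a ≤ λ(A) ≤ b are available (the bounds and the test matrix are illustration values):

```python
import numpy as np

def chebyshev_prec(matvec, a, b, k, r):
    """Approximate A^{-1} r by k steps of Chebyshev acceleration, assuming the
    spectrum of the SPD matrix A lies in [a, b], 0 < a.  Only matrix-vector
    products appear; no inner products are needed, which is the parallel
    attraction of polynomial preconditioning."""
    theta = (b + a) / 2.0           # center of the inclusion interval
    delta = (b - a) / 2.0           # half-width of the interval
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    x = np.zeros_like(r)
    res = r.copy()                  # residual of the zero initial guess
    d = res / theta
    for _ in range(k):
        x = x + d
        res = res - matvec(d)
        rho_next = 1.0 / (2.0 * sigma1 - rho)
        d = rho_next * rho * d + (2.0 * rho_next / delta) * res
        rho = rho_next
    return x

# Toy check on a diagonal SPD matrix with spectrum in [1, 10].
A = np.diag([1.0, 2.0, 10.0])
r = np.ones(3)
x_true = np.linalg.solve(A, r)
err1 = np.linalg.norm(chebyshev_prec(lambda v: A @ v, 1.0, 10.0, 1, r) - x_true)
err8 = np.linalg.norm(chebyshev_prec(lambda v: A @ v, 1.0, 10.0, 8, r) - x_true)
```

The error after k steps is governed by 1/|T_k(σ_1)|, so the approximation to A^{-1} r improves rapidly with the polynomial degree when the bounds a, b are tight.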
M. Tuma 160
11. Polynomial preconditioners: V.
Basic classes of polynomial preconditioners: III.
Least-squares preconditioners for SPD systems (Johnson, Micchelli, Paul, 1983)
The min-max polynomial may map small eigenvalues of A to large eigenvalues of M^{-1}A, which seems to degrade the convergence rate; its quality seems to depend strongly on the inclusion set estimate.
This approach: minimize a quadratic norm of the residual polynomial,
∫_IS (1 − Q(λ))^2 w(λ) dλ
Jacobi weights (w(λ) = (b − λ)^α (λ − a)^β, α, β > −1) for IS = ⟨a, b⟩, or Legendre weights (w ≡ 1) for simple integration
Computing the polynomials from three-term recurrences (Stiefel, 1958), or by kernel polynomials (Stiefel, 1958), or from normal equations (Saad, 1983)
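The least-squares coefficients come from a small linear fit. A sketch with the Legendre weight w ≡ 1 and the integral replaced by a fine uniform grid (a simplification assumed here purely for illustration; the cited approaches use exact three-term recurrences or normal equations):

```python
import numpy as np

def ls_poly_coeffs(a, b, k, npts=2000):
    """Coefficients alpha_0..alpha_k of P_k minimizing the Legendre-weighted
    (w = 1) least-squares norm of the residual polynomial 1 - lambda*P_k(lambda)
    over IS = [a, b], with the integral discretized on a fine uniform grid."""
    lam = np.linspace(a, b, npts)
    # residual is 1 - sum_j alpha_j lambda^{j+1}: fit columns lambda^{j+1} to 1
    V = np.stack([lam ** (j + 1) for j in range(k + 1)], axis=1)
    alpha, *_ = np.linalg.lstsq(V, np.ones(npts), rcond=None)
    return alpha

# Residual polynomial 1 - lambda*P_3(lambda) on an illustrative interval [0.1, 4]
alpha = ls_poly_coeffs(0.1, 4.0, 3)
lam = np.linspace(0.1, 4.0, 1000)
resid = 1.0 - lam * np.polyval(alpha[::-1], lam)
```

A small mean-square residual over IS means λ P_3(λ) ≈ 1 there, i.e. P_3(A) acts as an approximate inverse on the part of the spectrum covered by the inclusion set.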
M. Tuma 161
11. Polynomial preconditioners: VI.
Preconditioning symmetric indefinite systems
DeBoor and Rice polynomials solve the minmax problem for general inclusion sets composed of two parts, IS = ⟨a, b⟩ ∪ ⟨c, d⟩, b < 0 < c (DeBoor, Rice, 1982)
For equal lengths (b − a = d − c): can be expressed in terms of Chebyshev polynomials of the first kind. Best behavior in this case.
Grcar polynomials solve a slightly modified minmax approximation problem, formulated for residual polynomials. But: more oscillatory behavior.
Both mentioned possibilities give a positive definite preconditioned matrix, not explicitly computable
Clustering eigenvalues around µ < 0 and 1 (Freund, 1991; bilevel polynomial of Ashby, 1991). Best behavior for nonequal intervals and b ≈ −c.
M. Tuma 162
11. Polynomial preconditioners: VII.
Further achievements
![Page 360: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/360.jpg)
M. Tuma 162
11. Polynomial preconditioners: VII.
Further achievements
Different weights for least-squares polynomials for solving symmetricindefinite systems (Saad, 1983).
![Page 361: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/361.jpg)
M. Tuma 162
11. Polynomial preconditioners: VII.
Further achievements
Different weights for least-squares polynomials for solving symmetricindefinite systems (Saad, 1983).
Adapting polynomials based on the information from the CG method(Ashby, 1987, 1990; Ashby, Manteuffel, Saylor, 1989; see also Fischer,Freund, 1994; O’Leary, 1991).
![Page 362: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/362.jpg)
M. Tuma 162
11. Polynomial preconditioners: VII.
Further achievements
Different weights for least-squares polynomials for solving symmetricindefinite systems (Saad, 1983).
Adapting polynomials based on the information from the CG method(Ashby, 1987, 1990; Ashby, Manteuffel, Saylor, 1989; see also Fischer,Freund, 1994; O’Leary, 1991).
Double use of minmax polynomial can bring some improvement (Perlot,1995)
![Page 363: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/363.jpg)
M. Tuma 162
11. Polynomial preconditioners: VII.
Further achievements
Different weights for least-squares polynomials for solving symmetric indefinite systems (Saad, 1983).
Adapting polynomials based on information from the CG method (Ashby, 1987, 1990; Ashby, Manteuffel, Saylor, 1989; see also Fischer, Freund, 1994; O'Leary, 1991).
Double use of the minmax polynomial can bring some improvement (Perlot, 1995).
Polynomial preconditioners for solving nonsymmetric systems are possible but typically not a method of choice (Manteuffel, 1977, 1978; Saad, 1986; Smolarski, Saylor, 1988).
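All of these polynomial preconditioners share one application pattern: the preconditioner acts through matrix-vector products only. A minimal sketch of that pattern (not the minmax or least-squares polynomials of the slides, but a plain truncated Neumann series p(A) = Σₖ (I − A)ᵏ, with hypothetical data):

```python
# Minimal sketch: a Neumann-series polynomial preconditioner
# p(A) = I + N + N^2 + ... with N = I - A, which approximates A^{-1}
# when the spectral radius of N is below 1.  The Chebyshev/least-squares
# polynomials above pick better coefficients, but the application --
# matrix-vector products only -- is the same.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def poly_prec_apply(A, r, degree):
    """Accumulate z = sum_{k=0}^{degree} (I - A)^k r."""
    z = list(r)                                     # k = 0 term
    t = list(r)
    for _ in range(degree):
        Av = matvec(A, t)
        t = [ti - avi for ti, avi in zip(t, Av)]    # t <- (I - A) t
        z = [zi + ti for zi, ti in zip(z, t)]
    return z

# Diagonally dominant SPD example (hypothetical data)
A = [[1.0, 0.2, 0.0],
     [0.2, 1.0, 0.2],
     [0.0, 0.2, 1.0]]
r = [1.0, 0.0, 0.0]
z = poly_prec_apply(A, r, degree=8)
# z approximates A^{-1} r: the residual r - A z should be small
res = [ri - azi for ri, azi in zip(r, matvec(A, z))]
```

The residual after degree d is (I − A)^{d+1} r, so it shrinks geometrically with the degree for this well-conditioned example.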
M. Tuma 163
12. Element-by-element preconditioners: I.
Basic notation
Assume that A is given as
A = ∑_e A_e
Consider
M_e = (D_A)_e + (A_e − D_e),
where (D_A)_e is the part of D_A corresponding to A_e. Set
M = ∏_{e=1}^{n_e} M_e
Introduced by Hughes, Levit, Winget, 1983 (and formulated for Jacobi-scaled A)
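A minimal sketch of this construction, assuming a Jacobi-scaled A so that each element factor becomes identity plus the off-diagonal element part (the Hughes-Levit-Winget variant); the element matrices and index sets below are hypothetical:

```python
# Minimal element-by-element (EBE) sketch: assemble A = sum_e A_e from
# element contributions, then build M = prod_e M_e where each factor is
# M_e = I + (A_e - D_e), i.e. the element matrix with its diagonal
# replaced by the identity (assumes A is roughly Jacobi-scaled).

import numpy as np

n = 4
# Two overlapping "element" matrices and their global dof indices
elems = [
    (np.array([0, 1, 2]), np.array([[0.5, 0.1, 0.0],
                                    [0.1, 0.4, 0.1],
                                    [0.0, 0.1, 0.5]])),
    (np.array([1, 2, 3]), np.array([[0.6, 0.1, 0.0],
                                    [0.1, 0.5, 0.1],
                                    [0.0, 0.1, 0.5]])),
]

A = np.zeros((n, n))
for dofs, Ae in elems:
    A[np.ix_(dofs, dofs)] += Ae            # A = sum_e A_e

M = np.eye(n)
for dofs, Ae in elems:
    Me = np.eye(n)
    off = Ae - np.diag(np.diag(Ae))        # A_e - D_e
    Me[np.ix_(dofs, dofs)] += off          # identity + element off-diagonals
    M = M @ Me                             # M = prod_e M_e
```

Each M_e differs from the identity only on a few rows, so applying M⁻¹ means a short sequence of cheap local solves; to first order the product reproduces the off-diagonal coupling of A.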
M. Tuma 164
12. Element-by-element preconditioners: II.
Other possibilities
Simple application to solving nonsymmetric systems Mz = y (as the product of easily invertible matrices)
For solving SPD systems, the M_e matrices can be decomposed as
M_e = L_e L_e^T
Another approach (Gustafsson, Lindskog, 1986):
M = ∑_{e=1}^{n_e} L_e,
where L_e can be modified to be positive definite (the individual A_e need not be regular)
Parallel implementations (van Gijzen, 1994; Daydé, L'Excellent, Gould, 1997)
M. Tuma 165
13. Vector / Parallel preconditioners: I.
Decoupling parts of triangular factors
Forced a posteriori annihilation in triangular factors (Seager, 1986)
[figure: bidiagonal factor with selected entries annihilated, decoupling it into independent diagonal blocks]
Can lead to slow convergence
M. Tuma 166
13. Vector / Parallel preconditioners: II.
Partial vectorization
Exploiting the vector potential of a special structure of the matrix
Example: factor from the 5-point stencil:
[figure: sparsity pattern of the factor from the 5-point stencil — entries line up in regular diagonals that can be processed as long vectors]
So nice only for regular grids
M. Tuma 167
13. Vector / Parallel preconditioners: III.
Generalized partial vectorization: jagged diagonal formats, modified jagged diagonal formats, stripes (Melhem, 1988; Anderson, 1988; Paolini, Di Brozolo, 1989)
Storing matrix as a small number of long diagonals
Example (rows sorted by decreasing length):
f g h
b c
d e
a
Construction: 1) row compression; 2) sorting the rows by length; 3) considering the matrix as a set of (jagged) columns
Other sophisticated variations: cf. Heroux, Vu, Yang, 1991.
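The three construction steps can be sketched directly (the row data below match the a, b c, d e, f g h example):

```python
# Minimal jagged diagonal storage (JDS) sketch: starting from
# row-compressed data, sort the rows by decreasing number of nonzeros,
# then read the matrix off as a small number of long "jagged" columns,
# each of which is a good candidate for vector processing.

rows = [["a"], ["b", "c"], ["d", "e"], ["f", "g", "h"]]  # 1) compressed rows

perm = sorted(range(len(rows)), key=lambda i: -len(rows[i]))  # 2) sort rows
sorted_rows = [rows[i] for i in perm]

njag = len(sorted_rows[0])                 # number of jagged diagonals
# 3) j-th jagged diagonal = j-th entry of every row long enough to have one
jagged = [[row[j] for row in sorted_rows if len(row) > j]
          for j in range(njag)]
```

In a real implementation the permutation and column indices are stored alongside the values; here only the value layout is shown.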
M. Tuma 168
13. Vector / Parallel preconditioners: IV.
Wavefront processing for 5-point stencil in 2D
Generalization to 7-point stencils in 3D: hyperplane approach
Block chequer-board distribution of processors
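The wavefront idea for the 5-point stencil can be sketched as follows: in the triangular sweep, grid points on an anti-diagonal i + j = const are mutually independent (the grid size n is a hypothetical parameter):

```python
# Minimal wavefront sketch for the 5-point stencil on an n x n grid:
# point (i, j) depends only on (i-1, j) and (i, j-1), so all points with
# the same i + j form a wavefront that can be processed in parallel.

n = 4
wavefronts = {}
for i in range(n):
    for j in range(n):
        wavefronts.setdefault(i + j, []).append((i, j))
# wavefronts[k] holds the grid points solvable in parallel at step k;
# there are 2n - 1 wavefronts, growing then shrinking in size
```

The 3D hyperplane approach is the same idea with i + j + k = const.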
M. Tuma 169
13. Vector / Parallel preconditioners: V.
Generalized wavefront / hyperplane processing: level scheduling
Structure of L or U can be described by a directed acyclic graph (DAG)
Level scheduling is an a posteriori reordering (applied to the graphs of the triangular factors of A) (Anderson, 1988)
[figures: sparsity pattern of a triangular factor and the corresponding dependency DAG; nodes in the same level of the DAG can be processed in parallel]
Suitable for unstructured matrices
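A minimal level-scheduling sketch: given a hypothetical lower triangular pattern as per-row dependency lists, each row's level is one more than the maximum level of its predecessors, and all rows in a level can be solved simultaneously in the triangular solve:

```python
# Minimal level-scheduling sketch: row i of a lower triangular solve
# depends on every row j < i with L[i][j] != 0.  level(i) = 1 + max
# level over dependencies; rows sharing a level are independent.

from collections import defaultdict

# pred[i] = column indices of off-diagonal nonzeros in row i (hypothetical)
pred = {0: [], 1: [], 2: [0], 3: [1], 4: [2, 3], 5: [4]}

level = {}
for i in sorted(pred):                     # rows in increasing order
    level[i] = 1 + max((level[j] for j in pred[i]), default=0)

level_sets = defaultdict(list)
for i, l in level.items():
    level_sets[l].append(i)
# level_sets[1] can be processed in parallel, then level_sets[2], ...
```

The number of levels bounds the sequential depth of the solve; for unstructured factors it is often far smaller than n.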
M. Tuma 170
13. Vector / Parallel preconditioners: VI.
Twisted factorization
Concurrent factorization from both ends of the domain (Babuška, 1972; Meurant, 1984; van der Vorst, 1987)
[figure: twisted sparsity pattern — two bidiagonal parts meeting at a middle row, so elimination can proceed from both ends at once]
Only two-way parallelism
Can be performed in a nested way (van der Vorst, 1987)
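A minimal sketch of a twisted factorization for a symmetric tridiagonal matrix (hypothetical data): one elimination sweep runs from the top, one from the bottom, and they meet at a twist row m; as a consistency check, the twisted pivots multiply to det(A):

```python
# Minimal twisted ("burn at both ends") factorization sketch for a
# symmetric tridiagonal matrix: the two pivot sweeps are independent,
# so they can run concurrently, meeting at the twist row m.

import numpy as np

d = np.array([4.0, 4.0, 4.0, 4.0, 4.0])   # diagonal (hypothetical)
e = np.array([-1.0, -1.0, -1.0, -1.0])    # sub-/super-diagonal
n, m = len(d), 2                          # twist at row m

p = np.zeros(n); q = np.zeros(n)
p[0] = d[0]
for i in range(1, m):                     # forward sweep: rows 0..m-1
    p[i] = d[i] - e[i - 1] ** 2 / p[i - 1]
q[n - 1] = d[n - 1]
for i in range(n - 2, m, -1):             # backward sweep: rows n-1..m+1
    q[i] = d[i] - e[i] ** 2 / q[i + 1]
# the two sweeps meet at the twist row
gamma = d[m] - e[m - 1] ** 2 / p[m - 1] - e[m] ** 2 / q[m + 1]

A = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
check = gamma * p[:m].prod() * q[m + 1:].prod()   # product of twisted pivots
```

The same twisted solve structure carries over to the triangular solves with the resulting factors, giving the two-way parallelism noted above.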
M. Tuma 171
13. Vector / Parallel preconditioners: VII.
Ordering from corners for regular grids in 2D
Can be generalized to 3D
M. Tuma 172
13. Vector / Parallel preconditioners: VIII.
Generalized ordering from corners: reorderings based on domains
Useful for general domains (matrices)
Sophisticated graph partitioning algorithms
M. Tuma 173
13. Vector / Parallel preconditioners: IX.
Generalized ordering from corners: reorderings based on domains: additional ideas
ILU with overlapped diagonal blocks (Radicati, Robert, 1987)
Chan, Govaerts, 1990: ILU by domains can make even sequential iterative methods faster
Tang (1992); Tan (1995): enhanced interface conditions for better coupling
Karypis, Kumar, 1996: but the convergence rate can be strongly deteriorated
Benzi, Marín, T., 1997: parallel approximate inverse preconditioners + parallelization by domains can solve some hard problems
M. Tuma 174
13. Vector / Parallel preconditioners: X.
Parallel preconditioning: distributed parallelism
[figure: grid partitioned into strips owned by processors P0, P1, P2, with subdomain boundaries between them]
Matrix-vector product: overlapping communication and computation
1) Initialize sends and receives of boundary nodes
2) Perform local matvecs
3) Complete receives of boundary data
4) Finish the computation
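These steps can be sketched without real message passing: the owned rows split into interior rows (using only owned data) and boundary rows (also needing ghost values from neighbours); the matrix, ownership, and ghost data below are hypothetical:

```python
# Minimal sketch of overlapping communication and computation in a
# distributed matvec y = A x: interior rows need no remote data and can
# be computed while ghost values are "in flight"; boundary rows are
# finished once the ghost data has arrived.

import numpy as np

A = np.array([[ 2.0, -1.0,  0.0,  0.0],
              [-1.0,  2.0, -1.0,  0.0],
              [ 0.0, -1.0,  2.0, -1.0],
              [ 0.0,  0.0, -1.0,  2.0]])
own = [0, 1]                         # rows/columns owned by this process
x_local = np.array([1.0, 2.0])       # owned entries of x
interior = [0]                       # rows touching only owned columns
boundary = [1]                       # rows also touching ghost column 2

# step 1: post sends/receives of boundary data (modelled as arriving later)
ghost = {2: 3.0}                     # value of x[2] owned by a neighbour

y = np.zeros(len(own))
# step 2: local matvec on interior rows, overlapped with communication
for i in interior:
    y[i] = A[i, own] @ x_local
# steps 3-4: after ghost data arrives, finish the boundary rows
for i in boundary:
    y[i] = A[i, own] @ x_local + sum(A[i, j] * v for j, v in ghost.items())
```

With a real library the "arrives later" step would be nonblocking sends/receives completed just before the boundary loop.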
M. Tuma 175
13. Vector / Parallel preconditioners: XI.
Multicolorings
Faster convergence; parallel; need to balance these effects Doi, 1991; Doi, Lichnewsky, 1991; Doi, Hoshi, 1992; Wang, Hwang,
1995 Nodes with the same colour: mutually as far as possible
![Page 422: Introduction into parallel computationstuma/ps/parallel.pdf · SISD SIMD MISD MIMD Simple processor processor Vector Array processor Shared memory Distributed memory Cache coherent](https://reader033.fdocuments.in/reader033/viewer/2022042001/5e6dbe6589d339515d0faa8f/html5/thumbnails/422.jpg)
14. Solving nonlinear systems: I.

Newton-Krylov paradigm

F(x) = 0

⇓

Sequences of linear systems of the form

J(x_k)Δx = −F(x_k),   J(x_k) ≈ F′(x_k)

solved for k = 1, 2, . . . until

‖F(x_k)‖ < tol

J(x_k) may change at points influenced by nonlinearities
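The outer loop above can be sketched as follows. This is a minimal illustration, not the slides' implementation: `newton_krylov_sketch` and the tiny test problem are hypothetical names, the Jacobian is given explicitly for simplicity, and a dense `numpy.linalg.solve` stands in for the inner Krylov solver.

```python
import numpy as np

def newton_krylov_sketch(F, J_approx, x0, tol=1e-8, max_it=20):
    """Outer loop of the Newton-Krylov paradigm: at each step k solve
    J(x_k) dx = -F(x_k) and update, until ||F(x_k)|| < tol."""
    x = x0.copy()
    for k in range(max_it):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:             # stopping test from the slide
            return x, k
        dx = np.linalg.solve(J_approx(x), -Fx)   # stand-in for a Krylov solve
        x = x + dx
    return x, max_it

# hypothetical test problem: F(x) = (x_1^2 - 2, x_2 - 1), exact Jacobian
F = lambda x: np.array([x[0] ** 2 - 2.0, x[1] - 1.0])
J = lambda x: np.array([[2.0 * x[0], 0.0], [0.0, 1.0]])
root, its = newton_krylov_sketch(F, J, np.array([1.0, 0.0]))
```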
14. Solving nonlinear systems: II.

Much easier if matrix approximations are readily available.
But: matrices are often given only implicitly.

For example: linear solvers in the Newton-Krylov framework (see, e.g., Knoll, Keyes, 2004)

J(x_k)Δx = −F(x_k),   J(x_k) ≈ F′(x_k)

Typically only matvecs F′(x_k)v for a given vector v are performed.
Finite differences can be used to get such products:

F′(x_k)v ≈ (F(x_k + εv) − F(x_k)) / ε

Matrices are always present in a more or less implicit form; the tradeoff implicitness × fast execution appears in many algorithms.

For strong algebraic preconditioners we need matrix approximations
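The finite-difference matvec above admits a direct sketch; `fd_jacvec` is an illustrative name, and the test function F and its Jacobian are hypothetical, chosen only to check the approximation.

```python
import numpy as np

def fd_jacvec(F, x, v, eps=1e-7):
    """Matrix-free product F'(x) v via the forward difference
    (F(x + eps*v) - F(x)) / eps from the slide."""
    return (F(x + eps * v) - F(x)) / eps

# hypothetical F(x) = (x_1^2, x_1 x_2) with Jacobian [[2 x_1, 0], [x_2, x_1]]
F = lambda x: np.array([x[0] ** 2, x[0] * x[1]])
x = np.array([2.0, 3.0])
v = np.array([1.0, 1.0])
Jv_exact = np.array([[4.0, 0.0], [3.0, 2.0]]) @ v   # J(x) at x = (2, 3)
Jv_fd = fd_jacvec(F, x, v)
```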
14. Solving nonlinear systems: III.

To summarize:

Jacobian J often provided only implicitly
Parallel functional evaluations
Efficient preconditioning of the linearized system
Efficient evaluation of the products Jv knowing the structure of J
14. Solving nonlinear systems: IV.

Efficient preconditioning of the linearized system

Can strongly simplify the problem to be parallelized:
Approximate inverse Jacobians
Jacobians of related discretizations (convection-diffusion preconditioned by diffusion, Brown, Saad, 1980)
Operator-split Jacobians:

J^{−1} = (αI + S + R)^{−1} ≈ (αI + R)^{−1}(I + α^{−1}S)^{−1}

Jacobians formed from only “strong” entries
Jacobians of low-order discretizations
Jacobians with frozen values for expensive terms
Jacobians with frozen and updated values
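A quick numerical check of the operator-split approximation, under the assumption that the factorized inverse neglects the α^{−1}SR cross term; the size n, the value of α and the random splitting terms S, R are arbitrary illustrative choices.

```python
import numpy as np

# illustrative sizes/values; a large alpha keeps the neglected term small
rng = np.random.default_rng(0)
n, alpha = 5, 10.0
S = 0.1 * rng.standard_normal((n, n))
R = 0.1 * rng.standard_normal((n, n))
I = np.eye(n)

exact = np.linalg.inv(alpha * I + S + R)
split = np.linalg.inv(alpha * I + R) @ np.linalg.inv(I + S / alpha)

# (I + S/alpha)(alpha I + R) = alpha I + S + R + (1/alpha) S R,
# so the split inverse differs from the exact one only through (1/alpha) S R
err = np.linalg.norm(exact - split) / np.linalg.norm(exact)
```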
14. Solving nonlinear systems: V.

Getting a matrix approximation stored implicitly: cases

The matrix A_{i+k} can be obtained by n matvecs Ae_j, j = 1, . . . , n (inefficient)
A sparse A_{i+k} can often be obtained via significantly fewer matvecs than n by grouping computed columns, if we know its pattern
the pattern (stencil) is often known (e.g., given by the problem grid in PDE problems)
often used in practice
but for approximating A_{i+k} we do not need that much:
it might be enough to use an approximate pattern of a different but structurally similar matrix
14. Solving nonlinear systems: VI.

How to approximate a matrix by a small number of matvecs if we know the matrix pattern:

Example 1: Efficient estimation of a banded matrix (· denotes a zero entry)

♠ ∗ · · · · · ·
♠ ∗ ∗ · · · · ·
· ∗ ∗ ♠ · · · ·
· · ∗ ♠ ∗ · · ·
· · · ♠ ∗ ∗ · ·
· · · · ∗ ∗ ♠ ·
· · · · · ∗ ♠ ∗
· · · · · · ♠ ∗

Columns with “red spades” can be computed at the same time in one matvec since the row sparsity patterns of these columns do not overlap. Namely,
A(e_1 + e_4 + e_7) computes the entries in columns 1, 4 and 7.
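The grouping idea of Example 1 can be sketched for a tridiagonal matrix, where columns with equal index mod 3 (such as 1, 4, 7 above, in 1-based numbering) are structurally orthogonal. `estimate_tridiagonal` is an illustrative name; the hidden matrix is accessed only through matvecs.

```python
import numpy as np

def estimate_tridiagonal(matvec, n):
    """Recover a tridiagonal n x n matrix from only 3 matvecs: columns with
    equal index mod 3 have disjoint row supports, so one product
    A(e_j + e_{j+3} + ...) reveals all of them at once."""
    A = np.zeros((n, n))
    for g in range(3):
        s = np.zeros(n)
        s[g::3] = 1.0                  # sum of the unit vectors of this group
        y = matvec(s)                  # one combined matvec
        for j in range(g, n, 3):       # unpack rows j-1, j, j+1 of column j
            for i in range(max(0, j - 1), min(n, j + 2)):
                A[i, j] = y[i]
    return A

# hidden tridiagonal matrix, accessible only through matvecs
n = 8
T = 2.0 * np.eye(n) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
A_est = estimate_tridiagonal(lambda v: T @ v, n)
```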
14. Solving nonlinear systems: VII.

How to approximate a matrix by a small number of matvecs if we know the matrix pattern:

Example 2: Efficient estimation of a general matrix (6 × 6, three nonzeros per row):

∗ ∗ ∗
∗ ∗ ∗
∗ ∗ ∗
∗ ∗ ∗
∗ ∗ ∗
∗ ∗ ∗

Again, one matvec can compute all columns whose row sparsity patterns do not overlap. Marking one such group by ♠:

♠ ∗ ∗
♠ ∗ ∗
∗ ♠ ∗
♠ ∗ ∗
∗ ∗ ♠
∗ ∗ ♠

For example, A(e_1 + e_3 + e_6) computes the entries in columns 1, 3 and 6.

All entries of A can be computed by four matvecs; in each matvec we need a group of structurally orthogonal columns.
14. Solving nonlinear systems: VIII.

Efficient matrix estimation: a well-established field

Structurally orthogonal columns can be grouped
Finding the minimum number of groups: a combinatorially difficult problem (NP-hard)
A classical field; a (very restricted) selection of references: Curtis, Powell, Reid, 1974; Coleman, Moré, 1983; Coleman, Moré, 1984; Coleman, Verma, 1998; Gebremedhin, Manne, Pothen, 2003.
extensions to SPD (Hessian) approximations
extensions using both A and A^T in automatic differentiation
not only direct determination of the resulting entries (substitution methods)
14. Solving nonlinear systems: IX.

Efficient matrix estimation: graph coloring problem

♠ ♠ ♠
♠ ♠ ♠
♠ ♠ ♠
♠ ♠ ♠
♠ ♠ ♠
♠ ♠ ♠

(figure: intersection graph with vertices 1–6)

In other words, columns which form an independent set in the graph of A^T A (called the intersection graph) can be grouped ⇒ a graph coloring problem for the graph of A^T A.

Problem: Find a coloring of the vertices of the graph of A^T A (G(A^T A)) with the minimum number of colors such that edges connect only vertices of different colors.
14. Solving nonlinear systems: X.

Our matrix is defined only implicitly.
⇓
Consider a new pattern: e.g., if the entries denoted by ♣ are small, the number of groups can be decreased:

♠ ♠ ♠          ♠ ♠ ♣
♠ ♠ ♠          ♠ ♣ ♠
♠ ♠ ♠    →    ♣ ♠ ♠
♠ ♠ ♠          ♠ ♠ ♣
♠ ♠ ♠          ♣ ♠ ♠
♠ ♠ ♠          ♠ ♣ ♠
14. Solving nonlinear systems: XI.

Our matrix is defined only implicitly.

♠ ♠ ♣
♠ ♣ ♠
♣ ♠ ♠
♠ ♠ ♣
♣ ♠ ♠
♠ ♣ ♠

But: the computation of the entries from matvecs is inexact.
14. Solving nonlinear systems: XII.

Computational procedure I.

Step 1: Compute a pattern for A_i or M_i, e.g., for A_i as a sparsification of A_i:

♠ ♠ ♣          ♠ ♠
♠ ♣ ♠          ♠ ♠
♣ ♠ ♠    →    ♠ ♠
♠ ♠ ♣          ♠ ♠
♣ ♠ ♠          ♠ ♠
♠ ♣ ♠          ♠ ♠

Step 2: Graph coloring problem for the graph G(pattern^T pattern) to get the groups.

(figure: coloring graph with vertices 1–6 for the sparsified pattern)
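The sparsify-then-color procedure can be sketched as follows, assuming a hypothetical matrix with large tridiagonal entries and a small outer band: dropping the small band shrinks the pattern and reduces the number of structurally orthogonal column groups (here from five to three). `sparsify` and `n_groups` are illustrative names, and the greedy count stands in for a proper coloring.

```python
import numpy as np

def sparsify(A, tol):
    """Step 1 sketch: the pattern keeps only entries with |a_ij| > tol."""
    return (np.abs(A) > tol).astype(int)

def n_groups(pattern):
    """Step 2 stand-in: greedy count of structurally orthogonal column
    groups (an upper bound on the chromatic number of G(pattern^T pattern))."""
    groups = []                        # each entry: rows already occupied
    for j in range(pattern.shape[1]):
        rows = set(np.nonzero(pattern[:, j])[0])
        for occ in groups:
            if not rows & occ:         # disjoint row supports: same group
                occ |= rows
                break
        else:
            groups.append(rows)
    return len(groups)

# hypothetical matrix: tridiagonal large entries plus a small (0.01) outer band
n = 8
A = (np.diag(np.ones(n)) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
     + 0.01 * np.diag(np.ones(n - 2), 2) + 0.01 * np.diag(np.ones(n - 2), -2))
full = n_groups(sparsify(A, 0.0))      # pentadiagonal pattern
reduced = n_groups(sparsify(A, 0.1))   # small band dropped: tridiagonal pattern
```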
14. Solving nonlinear systems: XII.

Computational procedure I.

Step 3: Use matvecs to get A_{i+k} for more indices k ≥ 0 as if the entries outside the pattern were not present.

Notes:

getting the entries from the matvecs is spoiled by errors
the approximation error for any estimated entry a_{ij} in A:

∑_{k: (i,k) ∈ A\P} |a_{ik}|

A\P: entries outside the given pattern
the error distribution can be strongly influenced by the column grouping ⇒ balancing the error
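The error expression can be illustrated on a tiny hypothetical matrix: if the pattern P drops the small entry a_14 and column 1 is grouped with column 4, the estimate of a_11 is contaminated by exactly |a_14| (1-based indices in this comment, 0-based in the code).

```python
import numpy as np

# hypothetical 4x4 matrix; the assumed pattern P drops the small entry a_14 = 0.01
A = np.array([[2.0, 0.0, 0.0, 0.01],
              [0.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 0.0],
              [0.0, 0.0, 0.0, 5.0]])

# under P, columns 1 and 4 look structurally orthogonal, so they are grouped
y = A @ np.array([1.0, 0.0, 0.0, 1.0])   # one matvec A(e_1 + e_4)

a11_est = y[0]                 # read off as the entry a_11 of the pattern
err = abs(a11_est - A[0, 0])   # equals |a_14|, the entry of row 1 in A\P
```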
14. Solving nonlinear systems: XIII.

Computational procedure II.
Preconditioner based on exact estimation of the off-diagonals of A_i (diagonal partial coloring problem)

Consider a new pattern: e.g., if the entries denoted by ♣ are small, the number of groups can be decreased:

♠ ♠ ♠          ♠ ♠ ♣
♠ ♠ ♠          ♠ ♣ ♠
♠ ♠ ♠    →    ♣ ♠ ♠
♠ ♠ ♠          ♠ ♠ ♣
♠ ♠ ♠          ♣ ♠ ♠
♠ ♠ ♠          ♠ ♣ ♠

♠ → ♠: all off-diagonals in columns 4 and 5 are computed precisely

♠ not → ♠: because of row 1