Decomposition, Locality and (finishing synchronization lecture) Barriers
Transcript of Decomposition, Locality and (finishing synchronization lecture) Barriers
![Page 1: Decomposition, Locality and (finishing synchronization lecture) Barriers](https://reader031.fdocuments.in/reader031/viewer/2022020111/56815fcc550346895dcec735/html5/thumbnails/1.jpg)
04/22/23 CS194 Lecture 1
Decomposition, Locality and (finishing synchronization lecture) Barriers
Kathy Yelick
www.cs.berkeley.edu/~yelick/cs194f07
Lecture Schedule for Next Two Weeks
• Fri 9/14 11:00-12:00
• Mon 9/17 10:30-11:30 Discussion (11:30-12:00 as needed)
• Wed 9/19 10:30-12:00 Decomposition and Locality
• Fri 9/21 10:30-12:00 NVIDIA Lecture
• Mon 9/24 10:30-12:00 NVIDIA Lecture (David Kirk)
• Tue 9/25 3:00-4:30 NVIDIA research talk in Woz (optional)
• Wed 9/26 10:30-11:30 Discussion
• Fri 9/27 11:00-12:00 Lecture topic TBD
Outline
• Task Decomposition
  • Review parallel decomposition and task graphs
  • Styles of decomposition
  • Extending task graphs with interaction information
• Example: Sharks and Fish (Wator)
  • Parallel vs. sequential versions
• Data decomposition
  • Partitioning rectangular grids (like matrix multiply)
  • Ghost regions (unlike matrix multiply)
• Bulk-synchronous programming
  • Barrier synchronization
Designing Parallel Algorithms
• Parallel software design starts with decomposition
• Decomposition techniques
  • Recursive decomposition
  • Data decomposition: input, output, or intermediate data
  • And others
• Characteristics of tasks and interactions
  • Task generation, granularity, and context
  • Characteristics of task interactions
Recursive Decomposition: Example
• Consider parallel quicksort: once the array is partitioned, each subarray can be processed in parallel
Source: Ananth Grama
Data Decomposition
• Identify the data on which computations are performed.
• Partition this data across various tasks.
• This partitioning induces a decomposition of the computation, often using the following rule:
  • Owner Computes Rule: the thread assigned to a particular data item is responsible for all computation associated with it.
• The owner computes rule is especially common for output data. Why?
Source: Ananth Grama
Input Data Decomposition
• Recall: applying a function square to the elements of array A, then computing its sum
• Conceptualize a decomposition as a task dependency graph: a directed graph with
  • nodes corresponding to tasks
  • edges indicating dependencies: the result of one task is required for processing the next

[Figure: independent tasks sqr(A[0]), sqr(A[1]), sqr(A[2]), …, sqr(A[n]) all feeding a final sum task]
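As a minimal sketch of this task graph (the function and pool names are ours, not from the slides): the sqr tasks have no edges between them, so they may run in parallel, while the sum task depends on all of their results.

```python
from concurrent.futures import ThreadPoolExecutor

def sqr(x):
    return x * x

def sum_of_squares(A):
    # The sqr(A[i]) tasks are mutually independent in the task graph,
    # so they may execute in parallel; sum is the single dependent task.
    with ThreadPoolExecutor() as pool:
        squares = list(pool.map(sqr, A))
    return sum(squares)
```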
Output Data Decomposition: Example

Consider matrix multiply C = A × B with each matrix partitioned into 2×2 blocks; decomposing by output, each task computes one block of C:

  Task 1: C1,1 = A1,1 B1,1 + A1,2 B2,1
  Task 2: C1,2 = A1,1 B1,2 + A1,2 B2,2
  Task 3: C2,1 = A2,1 B1,1 + A2,2 B2,1
  Task 4: C2,2 = A2,1 B1,2 + A2,2 B2,2

Source: Ananth Grama
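A sequential sketch of this owner-computes decomposition (helper names are ours): each task owns exactly one output block and performs all computation for it, so the four tasks could run independently.

```python
def block_matmul_tasks(A, B):
    # A, B are n-by-n lists of lists with n even; 2x2 blocking of C.
    n = len(A)
    h = n // 2
    C = [[0] * n for _ in range(n)]

    def task(bi, bj):
        # Task (bi, bj) owns output block (bi, bj) of C and does all
        # the work for it: full dot products over the shared dimension.
        for i in range(bi * h, bi * h + h):
            for j in range(bj * h, bj * h + h):
                C[i][j] = sum(A[i][k] * B[k][j] for k in range(n))

    for bi in (0, 1):
        for bj in (0, 1):
            task(bi, bj)   # independent: could run on four threads
    return C
```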
A Model Problem: Sharks and Fish
• Illustration of parallel programming
• Original version (discrete event only) proposed by Geoffrey Fox
  • Called WATOR
• Basic idea: sharks and fish living in an ocean
  • Rules for breeding, eating, and death
  • Could also add: forces in the ocean, between creatures, etc.
  • Ocean is toroidal (2D donut)
Serial vs. Parallel
• Updating in left-to-right, row-wise order gives a serial algorithm
• This cannot be parallelized because of the dependencies, so instead we use a "red-black" order:

  forall black points grid(i,j): update …
  forall red points grid(i,j): update …

• For 3D or a general graph, use graph coloring
  • Can use repeated Maximal Independent Sets to color
  • Graph(T) is bipartite => 2-colorable (red and black)
  • Nodes of the same color can be updated simultaneously
• Can also use two copies (old and new); in both cases, note we have changed the behavior of the original algorithm!
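A sketch of the red-black sweep (the `update` callback is hypothetical): cells of one color depend only on cells of the other color, so each inner forall could run in parallel.

```python
def red_black_sweep(grid, update):
    # One sweep: first all "black" cells ((i + j) even), then all "red"
    # cells ((i + j) odd). Within a color, the updates are independent.
    n, m = len(grid), len(grid[0])
    for color in (0, 1):
        for i in range(n):
            for j in range(m):
                if (i + j) % 2 == color:
                    grid[i][j] = update(grid, i, j)
```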
Parallelism in Wator
• The simulation is synchronous
  • use two copies of the grid (old and new)
  • the value of each new grid cell depends only on 9 cells (itself plus 8 neighbors) in the old grid
  • simulation proceeds in timesteps; each cell is updated at every step
• Easy to parallelize by dividing the physical domain: Domain Decomposition
• Locality is achieved by using large patches of the ocean
  • Only boundary values from neighboring patches are needed
• How to pick shapes of domains?

[Figure: 3×3 arrangement of ocean patches owned by processors P1 through P9]

  Repeat
    compute locally to update local system
    barrier()
    exchange state info with neighbors
  until done simulating
Regular Meshes (e.g., Game of Life)
• Suppose the graph is an n×n mesh with connections to NSEW neighbors
• Which partition has less communication (less potential false sharing)?
  • 1D strip partition: n*(p-1) edge crossings
  • 2D block partition: 2*n*(p^(1/2) - 1) edge crossings
• Minimizing communication on a mesh = minimizing the "surface-to-volume ratio" of the partition
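The two counts above can be checked with a little arithmetic (function names are ours): for p partitions of an n×n mesh, strips cross p-1 internal boundaries of length n, while √p × √p blocks cross 2(√p - 1) internal boundary lines of length n.

```python
import math

def strip_crossings(n, p):
    # p horizontal strips: p-1 internal boundaries, each of length n
    return n * (p - 1)

def block_crossings(n, p):
    # sqrt(p) x sqrt(p) blocks: 2*(sqrt(p)-1) boundary lines of length n
    s = math.isqrt(p)
    assert s * s == p, "p must be a perfect square for the block layout"
    return 2 * n * (s - 1)
```

For n = 64 and p = 16 the block layout crosses 384 edges versus 960 for strips, illustrating the surface-to-volume advantage.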
Ghost Nodes
• Overview of a (possible!) memory hierarchy optimization
• Normally done with a 1-wide ghost region
• Can you imagine a reason for a wider one?

[Figure: to compute green, copy yellow, compute blue]
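A 1D sketch of the ghost-region idea (layout and function name are ours): each patch keeps one ghost cell on each side, which is refreshed with the neighboring patch's boundary value before the local compute phase.

```python
def exchange_ghosts(patches):
    # Each patch is [left_ghost, interior..., right_ghost]; the ocean is
    # periodic, so patch 0's left neighbor is the last patch.
    p = len(patches)
    for i, patch in enumerate(patches):
        left = patches[(i - 1) % p]
        right = patches[(i + 1) % p]
        patch[0] = left[-2]    # left ghost <- neighbor's last interior cell
        patch[-1] = right[1]   # right ghost <- neighbor's first interior cell
```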
Bulk Synchronous Computation
• With this decomposition, computation is mostly independent
• Need to synchronize between phases
• Picking up from last time: barrier synchronization
Barriers
• Software algorithms implemented using locks, flags, counters
• Hardware barriers
  • Wired-AND line separate from address/data bus
    • Set input high when arriving, wait for output to be high to leave
  • In practice, multiple wires to allow reuse
  • Useful when barriers are global and very frequent
  • Difficult to support an arbitrary subset of processors
    • even harder with multiple processes per processor
  • Difficult to dynamically change the number and identity of participants
    • e.g., the latter due to process migration
  • Not common today on bus-based machines
A Simple Centralized Barrier
• Shared counter maintains the number of processes that have arrived: increment when arriving, check until it equals p

```c
struct bar_type {
  int counter;
  struct lock_type lock;
  int flag = 0;
} bar_name;

BARRIER(bar_name, p) {
  LOCK(bar_name.lock);
  if (bar_name.counter == 0)       /* first to reach: reset flag */
    bar_name.flag = 0;
  mycount = ++bar_name.counter;    /* mycount is private */
  UNLOCK(bar_name.lock);
  if (mycount == p) {              /* last to arrive */
    bar_name.counter = 0;          /* reset for next barrier */
    bar_name.flag = 1;             /* release waiters */
  }
  else
    while (bar_name.flag == 0) {}  /* busy wait */
}
```

What is the problem if barriers are done back-to-back?
A Working Centralized Barrier
• Consecutively entering the same barrier doesn't work
  • Must prevent a process from entering until all have left the previous instance
  • Could use another counter, but that increases latency and contention
• Sense reversal: wait for the flag to take a different value in consecutive barriers
  • Toggle this value only when all processes have reached the barrier

```c
BARRIER(bar_name, p) {
  local_sense = !(local_sense);    /* toggle private sense variable */
  LOCK(bar_name.lock);
  mycount = ++bar_name.counter;    /* mycount is private */
  if (bar_name.counter == p) {     /* last to arrive */
    UNLOCK(bar_name.lock);
    bar_name.counter = 0;          /* reset for next barrier */
    bar_name.flag = local_sense;   /* release waiters */
  }
  else {
    UNLOCK(bar_name.lock);
    while (bar_name.flag != local_sense) {}  /* busy wait */
  }
}
```
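The same sense-reversal idea as a runnable sketch with Python threads (the class and method names are ours; a real implementation would spin exactly as in the pseudocode):

```python
import threading

class SenseBarrier:
    def __init__(self, p):
        self.p = p
        self.counter = 0
        self.flag = False
        self.lock = threading.Lock()
        self.local = threading.local()   # per-thread private sense

    def wait(self):
        sense = not getattr(self.local, "sense", False)  # toggle private sense
        self.local.sense = sense
        with self.lock:
            self.counter += 1
            last = (self.counter == self.p)
            if last:
                self.counter = 0         # reset for the next barrier
        if last:
            self.flag = sense            # release waiters
        else:
            while self.flag != sense:    # busy wait on the shared flag
                pass
```

Back-to-back barriers are safe because each episode waits for a different flag value.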
Centralized Barrier Performance
• Latency
  • Centralized has critical path length at least proportional to p
• Traffic
  • About 3p bus transactions
• Storage cost
  • Very low: centralized counter and flag
• Fairness
  • The same processor should not always be last to exit the barrier
  • No such bias in the centralized scheme
• Key problems for the centralized barrier are latency and traffic
  • Especially with distributed memory, traffic all goes to the same node
Improved Barrier Algorithms for a Bus
• Software combining tree
  • Only k processors access the same location, where k is the degree of the tree (a flat counter has contention from all processors; the tree has little contention at any one node)
• Separate arrival and exit trees, and use sense reversal
• Valuable in a distributed network: communicate along different paths
• On a bus, all traffic goes on the same bus, and there is no less total traffic
• Higher latency (log p steps of work, and O(p) serialized bus transactions)
• Advantage on a bus is the use of ordinary reads/writes instead of locks
Combining Tree Barrier
Tree-Based Barrier Summary
Tree-Based Barrier Cost

[Figure: latency, traffic, and memory costs of the tree-based barrier]
Dissemination Based Barrier
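The slide figures for the dissemination barrier did not survive transcription; as a hedged sketch (structure and names are ours), it can be written with one counting semaphore per (thread, round) pair: in round k, thread i signals thread (i + 2^k) mod p and waits for a signal from thread (i - 2^k) mod p, so after ceil(log2 p) rounds every thread has transitively heard from all others, with no central counter.

```python
import threading

class DisseminationBarrier:
    def __init__(self, p):
        self.p = p
        self.rounds = max(1, (p - 1).bit_length())  # ceil(log2 p)
        # sems[i][k]: what thread i waits on in round k; counting
        # semaphores make the barrier safely reusable back-to-back.
        self.sems = [[threading.Semaphore(0) for _ in range(self.rounds)]
                     for _ in range(p)]

    def wait(self, i):
        for k in range(self.rounds):
            partner = (i + (1 << k)) % self.p
            self.sems[partner][k].release()  # signal partner for round k
            self.sems[i][k].acquire()        # await our own round-k signal
```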
[Figure: latency, traffic, and memory costs of the dissemination barrier]
Barrier Performance on SGI Challenge
• Centralized does quite well
  • Will discuss fancier barrier algorithms for distributed machines
• Helpful hardware support: piggybacking of read misses on the bus
  • Also for spinning on highly contended locks

[Figure: time vs. number of processors (1-8) for the Centralized, Combining tree, Tournament, and Dissemination barriers]
Synchronization Summary
• Rich interaction of hardware-software tradeoffs
• Must evaluate hardware primitives and software algorithms together
  • primitives determine which algorithms perform well
• Evaluation methodology is challenging
  • Use of delays, microbenchmarks
  • Should use both microbenchmarks and real workloads
• Simple software algorithms with common hardware primitives do well on a bus
• Will see more sophisticated techniques for distributed machines
• Hardware support is still a subject of debate
  • Theoretical research argues for swap or compare&swap, not fetch&op
    • Algorithms that ensure constant-time access, but complex
References
• John Mellor-Crummey's 422 Lecture Notes
Tree-Based Computation
• The broadcast and reduction operations in MPI are a good example of tree-based algorithms
  • For reductions: take n inputs and produce 1 output
  • For broadcast: take 1 input and produce n outputs
• What can we say about such computations in general?
A log n lower bound to compute any function of n variables
• Assume we can only use binary operations, one per time unit
• After 1 time unit, an output can only depend on two inputs
• Use induction to show that after k time units, an output can only depend on 2^k inputs
• After log2 n time units, the output depends on at most n inputs
• A binary tree performs such a computation
Broadcasts and Reductions on Trees
Parallel Prefix, or Scan
• If "+" is an associative operator, and x[0], …, x[p-1] are the input data, then the parallel prefix operation computes
  y[j] = x[0] + x[1] + … + x[j]   for j = 0, 1, …, p-1
• Notation: j:k means x[j] + x[j+1] + … + x[k]; in the slide figure, blue marks the final value
Mapping Parallel Prefix onto a Tree - Details
• Up-the-tree phase (from leaves to root):
  1) Get values L and R from the left and right children
  2) Save L in a local register Lsave
  3) Pass the sum L+R to the parent
  • By induction, Lsave = sum of all leaves in the left subtree
• Down-the-tree phase (from root to leaves):
  1) Get value S from the parent (the root gets 0)
  2) Send S to the left child
  3) Send S + Lsave to the right child
• By induction, S = sum of all leaves to the left of the subtree rooted at the parent
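The two phases can be simulated sequentially (the recursive structure and names are ours): Lsave is kept per internal node during the up phase, and each leaf finally holds S plus its own value, i.e., the inclusive prefix.

```python
def tree_scan(x):
    """Inclusive prefix sums via the up/down tree phases (sequential sketch)."""
    lsave = {}            # per internal node: sum of its left subtree

    def up(lo, hi):       # returns the sum of leaves in [lo, hi)
        if hi - lo == 1:
            return x[lo]
        mid = (lo + hi) // 2
        left = up(lo, mid)
        lsave[(lo, hi)] = left          # save L in a "local register"
        return left + up(mid, hi)       # pass L + R to the parent

    def down(lo, hi, s, out):           # s = sum of all leaves to the left
        if hi - lo == 1:
            out[lo] = s + x[lo]         # inclusive prefix at the leaf
            return
        mid = (lo + hi) // 2
        down(lo, mid, s, out)                    # send S to the left child
        down(mid, hi, s + lsave[(lo, hi)], out)  # send S + Lsave to the right

    out = [0] * len(x)
    up(0, len(x))
    down(0, len(x), 0, out)
    return out
```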
E.g., Fibonacci via Matrix Multiply Prefix
• Consider computing the Fibonacci numbers: F(n+1) = F(n) + F(n-1)
• Each step can be viewed as a matrix multiplication:

  [ F(n+1) ]   [ 1  1 ] [ F(n)   ]
  [ F(n)   ] = [ 1  0 ] [ F(n-1) ]

• Can compute all F(n) by matmul_prefix on n copies of [ [1, 1], [1, 0] ], then selecting the upper-left entry of each prefix product

Derived from: Alan Edelman, MIT
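A sketch with itertools.accumulate standing in for the parallel prefix (accumulate is sequential, but any parallel prefix of the same associative matrix product works): the upper-left entry of the k-th prefix product M^k is F(k+1), with F(1) = F(2) = 1.

```python
from itertools import accumulate

M = ((1, 1), (1, 0))

def matmul2(A, B):
    # 2-by-2 integer matrix product; associativity is what lets prefix apply
    return tuple(tuple(A[i][0] * B[0][j] + A[i][1] * B[1][j]
                       for j in range(2)) for i in range(2))

def fib_prefix(n):
    # prefix products M, M^2, ..., M^n; the upper-left of M^k is F(k+1)
    return [P[0][0] for P in accumulate([M] * n, matmul2)]
```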
Adding two n-bit integers in O(log n) time
• Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers
• We want their sum s = a + b = s[n]s[n-1]…s[0]
• The sequential algorithm computes the carries one at a time:

  c[-1] = 0                                                      … rightmost carry bit
  for i = 0 to n-1
    c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] )   … next carry bit
    s[i] = ( a[i] xor b[i] ) xor c[i-1]

• Challenge: compute all c[i] in O(log n) time via parallel prefix
• Used in all computers to implement addition - carry look-ahead

  for all (0 <= i <= n-1)  p[i] = a[i] xor b[i]   … propagate bit
  for all (0 <= i <= n-1)  g[i] = a[i] and b[i]   … generate bit

  c[i] = ( p[i] and c[i-1] ) or g[i], written in matrix form:

  [ c[i] ]   [ p[i]  g[i] ] [ c[i-1] ]            [ c[i-1] ]
  [  1   ] = [  0     1   ] [   1    ]   = C[i] * [   1    ]   … 2-by-2 Boolean matrix multiplication (associative)

                                        [ 0 ]
           = C[i] * C[i-1] * … * C[0] * [ 1 ]

  … evaluate each P[i] = C[i] * C[i-1] * … * C[0] by parallel prefix
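A runnable sketch of this prefix formulation (bit order and helper names are ours; bits are stored least-significant first, and accumulate again stands in for a parallel prefix):

```python
from itertools import accumulate

def bmat_mul(A, B):
    # 2-by-2 Boolean matrix product: "and" replaces *, "or" replaces +
    return tuple(tuple(any(A[i][k] and B[k][j] for k in range(2))
                       for j in range(2)) for i in range(2))

def add_cla(a_bits, b_bits):
    """Carry look-ahead addition via prefix products of the C[i] matrices."""
    n = len(a_bits)
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate bits
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate bits
    # C[i] = [[p[i], g[i]], [0, 1]]
    C = [((bool(p[i]), bool(g[i])), (False, True)) for i in range(n)]
    prefixes = list(accumulate(C, lambda acc, c: bmat_mul(c, acc)))
    # Applying P[i] = C[i]*...*C[0] to [c[-1]=0, 1] gives c[i] = P[i][0][1]
    c = [int(P[0][1]) for P in prefixes]
    s = [p[i] ^ (c[i - 1] if i > 0 else 0) for i in range(n)]
    return s + [c[n - 1]]   # n sum bits plus the final carry-out
```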
Other applications of scans
• There are several applications of scans, some more obvious than others:
  • add multi-precision numbers (represented as arrays of digits)
  • evaluate recurrences and expressions
  • solve tridiagonal systems (numerically unstable!)
  • implement bucket sort and radix sort
  • dynamically allocate processors
  • search for regular expressions (e.g., grep)
• Names: +\ (APL), cumsum (Matlab), MPI_SCAN
• Note: 2n operations are used when only n-1 are needed
Evaluating arbitrary expressions
• Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately
• Can think of E as arbitrary expression tree with n leaves (the variables) and internal nodes labeled by +, -, * and /
• Theorem (Brent): E can be evaluated in O(log n) time, if we reorganize it using laws of commutativity, associativity and distributivity
• Sketch of (modern) proof: evaluate the expression tree E greedily by
  • collapsing all leaves into their parents at each time step
  • evaluating all "chains" in E with parallel prefix
Multiplying n-by-n matrices in O(log n) time
• For all (1 <= i, j, k <= n): P(i,j,k) = A(i,k) * B(k,j)
  • cost = 1 time unit, using n^3 processors
• For all (1 <= i, j <= n): C(i,j) = Σ_{k=1}^{n} P(i,j,k)
  • cost = O(log n) time, using a tree with n^3 / 2 processors
Evaluating recurrences
• Let x_i = f_i(x_{i-1}), with f_i a rational function and x_0 given
• How fast can we compute x_n?
• Theorem (Kung): Suppose degree(f_i) = d for all i
  • If d = 1, x_n can be evaluated in O(log n) time using parallel prefix
  • If d > 1, evaluating x_n takes Ω(n) time, i.e., no speedup is possible
• Sketch of proof when d = 1:

  x_i = f_i(x_{i-1}) = (a_i * x_{i-1} + b_i) / (c_i * x_{i-1} + d_i) can be written as
  x_i = num_i / den_i = (a_i * num_{i-1} + b_i * den_{i-1}) / (c_i * num_{i-1} + d_i * den_{i-1}), or

  [ num_i ]   [ a_i  b_i ] [ num_{i-1} ]                               [ num_0 ]
  [ den_i ] = [ c_i  d_i ] [ den_{i-1} ]  =  M_i * M_{i-1} * … * M_1 * [ den_0 ]

  Can use parallel prefix with 2-by-2 matrix multiplication
• Sketch of proof when d > 1:
  • degree of x_i as a function of x_0 is d^i
  • After k parallel steps, degree(anything) <= 2^k
  • Computing x_i therefore takes Ω(i) steps
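A sketch of the d = 1 case (function and variable names are ours), using exact rationals and accumulate in place of a parallel prefix over the 2-by-2 coefficient matrices:

```python
from fractions import Fraction
from itertools import accumulate

def matmul2(A, B):
    # 2-by-2 matrix product over the (num, den) coefficient matrices
    return tuple(tuple(A[i][0] * B[0][j] + A[i][1] * B[1][j]
                       for j in range(2)) for i in range(2))

def eval_recurrence(coeffs, x0):
    """x_i = (a*x_{i-1} + b) / (c*x_{i-1} + d) for each (a, b, c, d) in coeffs."""
    Ms = [((a, b), (c, d)) for a, b, c, d in coeffs]
    out = []
    for P in accumulate(Ms, lambda acc, M: matmul2(M, acc)):
        num = P[0][0] * x0 + P[0][1]   # [num_i; den_i] = P * [x0; 1]
        den = P[1][0] * x0 + P[1][1]
        out.append(Fraction(num, den))
    return out
```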
Summary
• Message passing programming
  • Maps well to large-scale parallel hardware (clusters)
  • Most popular programming model for these machines
  • A few primitives are enough to get started
    • send/receive or broadcast/reduce plus initialization
    • More subtle semantics to manage message buffers to avoid copying and speed up communication
• Tree-based algorithms
  • Elegant model that is a key piece of data-parallel programming
  • Most common are broadcast/reduce
  • Parallel prefix (aka scan) produces partial answers and can be used for many surprising applications
  • Some of these are of more theoretical than practical interest
Extra Slides
Inverting dense n-by-n matrices in O(log^2 n) time
• Lemma 1: Cayley-Hamilton Theorem
  • expression for A^-1 via the characteristic polynomial in A
• Lemma 2: Newton's Identities
  • Triangular system of equations for the coefficients of the characteristic polynomial, with matrix entries s_k
• Lemma 3: s_k = trace(A^k) = Σ_{i=1}^{n} A^k[i,i] = Σ_{i=1}^{n} [λ_i(A)]^k
• Csanky's Algorithm (1976):
  1) Compute the powers A^2, A^3, …, A^(n-1) by parallel prefix; cost = O(log^2 n)
  2) Compute the traces s_k = trace(A^k); cost = O(log n)
  3) Solve the Newton identities for the coefficients of the characteristic polynomial; cost = O(log^2 n)
  4) Evaluate A^-1 using the Cayley-Hamilton Theorem; cost = O(log n)
• Completely numerically unstable
Summary of tree algorithms
• Lots of problems can be done quickly - in theory - using trees
• Some algorithms are widely used
  • broadcasts, reductions, parallel prefix
  • carry look-ahead addition
• Some are of theoretical interest only
  • Csanky's method for matrix inversion
  • Solving general tridiagonals (without pivoting)
  • Both numerically unstable
  • Csanky needs too many processors
• Embedded in various systems
  • CM-5 hardware control network
  • MPI, Split-C, Titanium, NESL, other languages