Matrix multiplication
Uploaded by sunawar-khan-ahsan. Category: Education.
MATRIX MULTIPLICATION
PARALLEL PROCESSING
Many day-to-day applications demand real-time solutions to problems. For example, weather forecasting has to be done in a timely fashion. If an expert system is used to aid a physician in surgical procedures, decisions have to be made within seconds. Programs written for such applications have to perform an enormous amount of computation. Even the fastest single-processor machine may not be able to come up with solutions within tolerable time limits. Parallel Random Access Machines (PRAM) offer the potential of decreasing the solution times enormously.
Introduction
Say there are 100 numbers to be added and there are two persons, A and B. Person A can add the first 50 numbers. At the same time, B can add the next 50 numbers. When they are done, one of them can add the two individual sums to get the final answer. So two people can add the 100 numbers in almost half the time required by one.
Example
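The two-person addition can be sketched in Python as follows. This is only an illustration of the structure of the computation: the function name `two_person_sum` is hypothetical, and in CPython the two threads are not expected to deliver the halved running time described above.

```python
from concurrent.futures import ThreadPoolExecutor


def two_person_sum(numbers):
    """Split the list in half, sum both halves concurrently, then combine."""
    mid = len(numbers) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        sum_a = pool.submit(sum, numbers[:mid])  # person A: first half
        sum_b = pool.submit(sum, numbers[mid:])  # person B: second half
        # One final addition combines the two individual sums.
        return sum_a.result() + sum_b.result()


print(two_person_sum(list(range(1, 101))))  # 5050
```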
Computing the Convex Hull
• Take the set of points and divide it into two halves
• Assume that the recursive calls compute the convex hulls of the two halves
• Conquer stage: take the two convex hulls and merge them to obtain the convex hull of the entire set
Another Example
In the RAM (Random Access Machine) we assume that any of the following operations can be done in one unit of time: addition, subtraction, multiplication, division, comparison, memory access, assignment, and so on. This model is widely accepted as a valid sequential model.
An important feature of parallel computing that is absent in sequential computing is the need for interprocessor communication. For example, given any problem, the processors have to communicate among themselves and agree on the subproblems each will work on. They also need to communicate to see whether everyone has finished its task, and so on.
Computational model
Standard Random Access Machine
Each operation (load, store, jump, add, etc.) takes one unit of time.
A simple, general model.
The basic model for sequential algorithms.
RAM Model for Single Processor Machine
We begin this discussion with an ideal parallel machine called the Parallel Random Access Machine, or PRAM.
Physical organization of Parallel Platforms
A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel Random Access Machine, or PRAM.
A PRAM consists of p processors and a global memory of unbounded size that is uniformly accessible to all processors.
Processors share a common clock but may execute different instructions in each cycle.
PRAM and Basic Algorithm
PRAM Architecture
Conceptual view of a parallel random-access machine (PRAM).
Processor i can do the following in three phases of one cycle:
1. Fetch a value from address $s_i$ in shared memory
2. Perform computations on data held in local registers
3. Store a value into address $d_i$ in shared memory
CLASSIFICATION OF PRAM
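Submodels of the PRAM model, classified by reads from the same location (exclusive or concurrent) and writes to the same location (exclusive or concurrent):
EREW (exclusive read, exclusive write): Least “powerful”, most “realistic”
CREW (concurrent read, exclusive write): Default
ERCW (exclusive read, concurrent write): Not useful
CRCW (concurrent read, concurrent write): Most “powerful”, further subdivided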
CRCW SUBMODEL
CRCW submodels:
CRCW-U: Undefined
CRCW-D: Detecting
CRCW-C: Common
CRCW-R: Random
CRCW-P: Priority
CRCW-M: Max/Min
Reduction: AND, XOR, SUM
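To make the differences between the submodels concrete, here is a small Python sketch of how concurrent writes to one memory cell might be resolved. The slides only name the submodels, so the rules below are assumptions based on the usual textbook definitions (in particular, that "priority" means the lowest-numbered processor wins). The function name `resolve_concurrent_write` is hypothetical, and the undefined and detecting submodels are left out.

```python
import random
from functools import reduce
from operator import and_, xor


def resolve_concurrent_write(writes, submodel, rng=random):
    """Return the value one cell holds after concurrent writes.

    writes: non-empty list of (processor_index, value) pairs aimed at the cell.
    """
    values = [value for _, value in writes]
    if submodel == "common":    # all writers must agree on the value
        if len(set(values)) != 1:
            raise ValueError("common submodel requires identical values")
        return values[0]
    if submodel == "random":    # an arbitrary writer succeeds
        return rng.choice(values)
    if submodel == "priority":  # assumed rule: lowest processor index wins
        return min(writes, key=lambda w: w[0])[1]
    if submodel == "max":
        return max(values)
    if submodel == "min":
        return min(values)
    if submodel == "and":       # reduction submodels combine all values
        return reduce(and_, values)
    if submodel == "xor":
        return reduce(xor, values)
    if submodel == "sum":
        return sum(values)
    raise ValueError(f"unknown submodel: {submodel}")


writes = [(2, 5), (0, 7), (1, 3)]
print(resolve_concurrent_write(writes, "priority"))  # 7
print(resolve_concurrent_write(writes, "sum"))       # 15
```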
• Exclusive Read (ER): p processors can simultaneously read the contents of p distinct memory locations.
• Concurrent Read (CR): p processors can simultaneously read the contents of p' memory locations, where p' < p.
• Exclusive Write (EW): p processors can simultaneously write to p distinct memory locations.
• Concurrent Write (CW): p processors can simultaneously write to p' memory locations, where p' < p.
Memory Access in PRAM
Processors and memories are connected via switches.
Since these switches must operate in O(1) time at the level of words, for a system of p processors and m words, the switch complexity is O(mp).
Clearly, for meaningful values of p and m, a true PRAM is not realizable.
Physical Complexity
Implications of the CRCW Hierarchy of Submodels
Our most powerful PRAM CRCW submodel can be emulated by the least powerful submodel with logarithmic slowdown.
Efficient parallel algorithms have polylogarithmic running times.
The running time is still polylogarithmic after the slowdown due to emulation.
A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p).
EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P
We need not be too concerned with the CRCW submodel used.
Simply use whichever submodel is most natural or convenient.
Let T(n) time and P(n) processors be used on a parallel machine for a problem with n inputs.
Cost: C(n) = P(n)T(n) is the time × processor product, or work, for a problem on n inputs.
An equivalent serial algorithm will run in time O(C(n)).
If p ≤ P(n) processors are available, we can implement the algorithm in time O(P(n)T(n)/p), that is, O(C(n)/p).
Measuring Performance of Parallel Programs
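As a quick numeric illustration of these measures, the sketch below computes the cost and the running time on fewer processors. The helper names `cost` and `time_with_fewer_processors` are hypothetical, and the second function assumes the simplest emulation scheme, in which each physical processor takes over ceil(P(n)/p) of the original processors in every step.

```python
import math


def cost(processors, time_steps):
    """Cost C(n) = P(n) * T(n), the time x processor product."""
    return processors * time_steps


def time_with_fewer_processors(processors, time_steps, p):
    """Steps needed when only p <= P(n) processors are available.

    Assumption: each physical processor emulates ceil(P(n) / p) of the
    original processors in every parallel step.
    """
    return math.ceil(processors / p) * time_steps


# Example figures: P(n) = 1024 processors, T(n) = 10 steps.
print(cost(1024, 10))                            # 10240
print(time_with_fewer_processors(1024, 10, 64))  # 160
```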
Matrix Multiplication
Sequential matrix multiplication
for i = 0 to m - 1 do
  for j = 0 to m - 1 do
    t := 0
    for k = 0 to m - 1 do
      t := t + a[i, k] * b[k, j]
    endfor
    c[i, j] := t
  endfor
endfor
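The sequential pseudocode translates almost line for line into Python. The sketch below assumes square m × m matrices stored as lists of lists; the function name `matmul_sequential` is hypothetical.

```python
def matmul_sequential(a, b):
    """Multiply two m x m matrices with the classic triple loop."""
    m = len(a)
    c = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            t = 0
            for k in range(m):
                t += a[i][k] * b[k][j]  # accumulate the inner product
            c[i][j] = t
    return c


print(matmul_sequential([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```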
Figure: row i of A and column j of B yield element (i, j) of C, for m × m matrices A, B, and C.
$c_{ij} := \sum_{k=0}^{m-1} a_{ik} b_{kj}$
PRAM solution with $m^3$ processors: each processor does one multiplication (not very efficient).
Consider n × n matrix multiplication with $n^3$ processors.
Each $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$ can be computed on the CREW PRAM in parallel using n processors in O(log n) time.
On the EREW PRAM, the exclusive reads of the $a_{ij}$ and $b_{ij}$ values can be satisfied by making n copies of A and B, which takes O(log n) time with n processors per element.
The total time is still O(log n).
The memory requirement is of course much higher for the EREW PRAM.
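The logarithmic running time comes from forming all products in one step and then summing each group of n products with a balanced tree. The sketch below simulates that schedule sequentially and counts the parallel steps; the name `matmul_n3_processors` and the step-counting convention (one step for the products, one per reduction round) are assumptions made for illustration.

```python
def matmul_n3_processors(a, b):
    """Simulate the n^3-processor schedule; return (C, parallel step count)."""
    n = len(a)
    # Step 1: processor (i, j, k) forms the single product a[i][k] * b[k][j].
    prod = [[[a[i][k] * b[k][j] for k in range(n)] for j in range(n)]
            for i in range(n)]
    steps = 1
    width = n
    # Tree reduction: each round roughly halves the number of partial sums.
    while width > 1:
        half = (width + 1) // 2
        for i in range(n):
            for j in range(n):
                for k in range(width // 2):
                    prod[i][j][k] += prod[i][j][k + half]
        width = half
        steps += 1
    c = [[prod[i][j][0] for j in range(n)] for i in range(n)]
    return c, steps


print(matmul_n3_processors([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# ([[19, 22], [43, 50]], 2)
```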
MATRIX MULTIPLICATION
Complexity: $\Theta(n^3)$
Better algorithms improve on this slightly
Multiplication by blocks
Takes advantage of the cache
Matrix Multiplication
PRAM Matrix Multiplication with $m^2$ Processors
PRAM matrix multiplication; p = $m^2$ processors.
Figure: row i of A and column j of B yield element (i, j) of C.
PRAM matrix multiplication using $m^2$ processors
Proc (i, j), 0 ≤ i, j < m, do
begin
  t := 0
  for k = 0 to m - 1 do
    t := t + a[i, k] * b[k, j]
  endfor
  c[i, j] := t
end
$\Theta(m)$ steps: Time-optimal
CREW model is implicit
Processors are numbered (i, j), instead of 0 to $m^2 - 1$
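A sequential simulation makes the lockstep structure of the $m^2$-processor algorithm visible: the outer loop is the time axis (m steps), and the two inner loops stand in for the processors that would all act at once. The function name `matmul_m2_processors` is hypothetical.

```python
def matmul_m2_processors(a, b):
    """Simulate m^2 processors, each holding one running sum t."""
    m = len(a)
    t = [[0] * m for _ in range(m)]  # t[i][j] is the register of Proc (i, j)
    for k in range(m):               # m lockstep time steps
        for i in range(m):           # these two loops stand in for the
            for j in range(m):       # m^2 processors working simultaneously
                t[i][j] += a[i][k] * b[k][j]
    return t                         # Proc (i, j) stores t into c[i][j]


print(matmul_m2_processors([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```

In every step k, all processors in row i read the same a[i][k] and all processors in column j read the same b[k][j], which is why concurrent reads are needed.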
PRAM Matrix Multiplication with m Processors
PRAM matrix multiplication using m processors
for j = 0 to m - 1 Proc i, 0 ≤ i < m, do
  t := 0
  for k = 0 to m - 1 do
    t := t + a[i, k] * b[k, j]
  endfor
  c[i, j] := t
endfor
Figure: row i of A and column j of B yield element (i, j) of C.
$\Theta(m^2)$ steps: Time-optimal
CREW model is implicit
Because the order of multiplications is immaterial, accesses to B can be skewed to allow the EREW model
- The m processors read A at once, each from its own row (no concurrent reads)
- All m processors read the same column of B at the same time, so concurrent reads must be allowed. If they are not, the memory accesses can be skewed to convert the CREW algorithm into an EREW one
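One way to skew the accesses is to let processor i start its inner loop at k = i and wrap around, so that in any given step the m processors touch m different elements of column j of B. The sketch below simulates this and asserts that no element of B is read twice in a step. The offset rule `(i + step) % m` is an assumed skewing pattern, not one spelled out in the slides, and the function name is hypothetical.

```python
def matmul_m_processors_skewed(a, b):
    """Simulate m processors with skewed (EREW-friendly) accesses to B."""
    m = len(a)
    c = [[0] * m for _ in range(m)]
    for j in range(m):                 # columns of C are handled one by one
        t = [0] * m                    # t[i] is the register of Proc i
        for step in range(m):          # m lockstep steps per column
            rows_read = set()
            for i in range(m):         # stands in for the m processors
                k = (i + step) % m     # skewed index: distinct for each i
                assert k not in rows_read  # b[k][j] is read by one processor
                rows_read.add(k)
                t[i] += a[i][k] * b[k][j]
        for i in range(m):
            c[i][j] = t[i]
    return c


print(matmul_m_processors_skewed([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```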
PRAM Matrix Multiplication with Fewer Processors
The algorithm is similar, except that each processor is in charge of computing m/p rows of C.
$\Theta(m^3/p)$ steps: Time-optimal
EREW model can be used
A drawback of all algorithms thus far is that only two arithmetic operations (one multiplication and one addition) are performed for each memory access.
This is particularly costly for NUMA shared-memory machines.
Figure: C = A × B, with each processor computing a band of m/p rows of C.
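The row-band assignment can be sketched as follows. The simulation runs the p processors one after another and assumes, for simplicity, that p divides m evenly; the function name `matmul_row_bands` is hypothetical.

```python
def matmul_row_bands(a, b, p):
    """Simulate p processors, each computing m/p consecutive rows of C."""
    m = len(a)
    assert m % p == 0, "this sketch assumes p divides m"
    rows_per_proc = m // p
    c = [None] * m
    for proc in range(p):  # stands in for p processors running in parallel
        first = proc * rows_per_proc
        for i in range(first, first + rows_per_proc):
            c[i] = [sum(a[i][k] * b[k][j] for k in range(m))
                    for j in range(m)]
    return c


print(matmul_row_bands([[1, 2], [3, 4]], [[5, 6], [7, 8]], p=2))
# [[19, 22], [43, 50]]
```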
More Efficient Matrix Multiplication (for NUMA)
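Partitioning the matrices for block matrix multiplication:
$$\begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} E & F \\ G & H \end{pmatrix} = \begin{pmatrix} AE+BG & AF+BH \\ CE+DG & CF+DH \end{pmatrix}$$
Block matrix multiplication follows the same algorithm as simple matrix multiplication.
Figure: the matrices are partitioned into p square blocks of size q × q, with $q = m/\sqrt{p}$, giving $\sqrt{p}$ block-bands (numbered 1, 2, ..., $\sqrt{p}$) in each dimension. One processor computes the elements of C in block (i, j), which it holds in local memory.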
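A blocked version of the triple loop shows how the same algorithm is applied at the level of q × q blocks. This is a sequential sketch under the assumption that q divides m; the function name `matmul_blocked` is hypothetical, and the three outer loops stand in for the work that would be spread across the p processors.

```python
def matmul_blocked(a, b, q):
    """Multiply m x m matrices block by block, using q x q blocks."""
    m = len(a)
    assert m % q == 0, "this sketch assumes q divides m"
    c = [[0] * m for _ in range(m)]
    for bi in range(0, m, q):          # block-row of C
        for bj in range(0, m, q):      # block-column of C
            for bk in range(0, m, q):  # C[bi, bj] += A[bi, bk] * B[bk, bj]
                for i in range(bi, bi + q):
                    for k in range(bk, bk + q):
                        a_ik = a[i][k]  # one element of A is reused for
                        for j in range(bj, bj + q):  # a whole row of the block
                            c[i][j] += a_ik * b[k][j]
    return c


print(matmul_blocked([[1, 2], [3, 4]], [[5, 6], [7, 8]], q=1))
# [[19, 22], [43, 50]]
```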
Details of Block Matrix Multiplication
How Processor (i, j) operates on an element of A and one block-row of B to update one block-row of C.
A multiply-add computation on q × q blocks needs $2q^2 = 2m^2/p$ memory accesses and $2q^3$ arithmetic operations.
So, q arithmetic operations are done per memory access.
Figure: the element in row iq + a, column kq + c of block (i, k) in matrix A is multiplied by the elements in row kq + c, columns jq to jq + q - 1, of block (k, j) in matrix B; the products are added to the elements in row iq + a, columns jq to jq + q - 1, of block (i, j) in matrix C.
Time Complexity Of Matrix Multiplication
Adding n numbers in parallel
Parallel Processing, Extreme Models
A simple parallel algorithm
Example for the addition of n numbers:
1. We start with 4 processors, and each of them adds 2 items in the first step.
2. The number of items is halved at every subsequent step. Hence log n steps are required for adding n numbers.
3. The processor requirement is O(n).
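The halving scheme can be simulated in a few lines of Python. Each pass of the loop below corresponds to one parallel step in which every active processor adds one pair of items; the function name `parallel_sum` is hypothetical, and the code runs sequentially, counting the steps a parallel machine would need.

```python
def parallel_sum(values):
    """Add numbers by pairwise halving; return (total, parallel step count)."""
    items = list(values)
    if not items:
        return 0, 0
    steps = 0
    while len(items) > 1:
        # One parallel step: each processor adds one adjacent pair of items.
        halved = [items[k] + items[k + 1] for k in range(0, len(items) - 1, 2)]
        if len(items) % 2:  # an unpaired last item is carried forward
            halved.append(items[-1])
        items = halved
        steps += 1
    return items[0], steps


print(parallel_sum(range(1, 9)))  # (36, 3)
```

With 8 numbers, the first step uses 4 processors and the sum is ready after 3 steps.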
A parallel algorithm is analyzed mainly in terms of its time, processor, and work complexities.
Time complexity T(n): How many time steps are needed?
Processor complexity P(n): How many processors are used?
Work complexity W(n): What is the total work done by all the processors?
Hence, for the addition example:
T(n) = O(log n)
P(n) = O(n)
W(n) = O(n log n)
How do we analyze a parallel algorithm?
• Let $P(n) = O(n^3)$
Read all cells of A at once: O(1)
Read all cells of B at once: O(1)
Each processor multiplies $a_{ik} \cdot b_{kj}$: O(1)
Parallel sum to get $c_{ij}$: O(log n)
Store value $c_{ij}$: O(1)
$T(n) = O(\log n)$, $P(n) = O(n^3)$, $W(n) = O(n^3 \log n)$ = total number of operations
CREW Cost
• Let $P(n) = O(n^3)$
Read all cells of A at once: O(1)
The cells of B cannot all be read in O(1), because concurrent reads are not allowed
Skew the memory accesses, replicate the data, or have the processors read in parallel in O(log n)
Each processor multiplies $a_{ik} \cdot b_{kj}$: O(1)
Parallel sum to get $c_{ij}$: O(log n)
Store value $c_{ij}$: O(1)
$T(n) = O(\log n)$, $P(n) = O(n^3)$, $W(n) = O(n^3 \log n)$ = total number of operations
EREW Cost
PRAM removes algorithmic details concerning synchronization and communication, allowing the algorithm designer to focus on problem properties.
A PRAM algorithm includes an explicit understanding of the operations performed at each time unit and an explicit allocation of processors to jobs at each time unit.
PRAM design paradigms have turned out to be robust and have been mapped efficiently onto many other parallel models and even network models.
Advantages of PRAM Model
Model Inaccuracies
Unbounded local memory (registers)
All operations take unit time
Processors run in lockstep
Unaccounted costs
Non-local memory access
Latency
Bandwidth
Memory Access Contention
PRAM Weaknesses
PRAM algorithm is the source of most fundamental ideas
It’s a source of inspiration for algorithms
PRAM is simple and easy to understand.
The improved locality of block matrix multiplication can also improve the running time on a uniprocessor, or on a distributed shared-memory multiprocessor with caches. Reason: higher cache hit rates.
Conclusion