Matrix multiplication

PARALLEL PROCESSING

Many everyday applications demand real-time solutions to problems. For example, weather forecasting has to be done in a timely fashion. If an expert system is used to aid a physician in surgical procedures, decisions have to be made within seconds. Programs written for such applications have to perform an enormous amount of computation. Even the fastest single-processor machine may not be able to come up with solutions within tolerable time limits. Parallel Random Access Machines (PRAM) offer the potential of decreasing the solution times enormously.

Introduction

Say there are 100 numbers to be added and there are two persons, A and B. Person A can add the first 50 numbers. At the same time, B can add the next 50 numbers. When they are done, one of them can add the two partial sums to get the final answer. So two people can add the 100 numbers in almost half the time required by one.

Example

Computing the Convex Hull

• Take the set of points and divide the set into two halves

• Assume that the recursive calls compute the convex hulls of the two halves

• Conquer stage: take the two convex hulls and merge them to obtain the convex hull of the entire set

Another Example

In the RAM (Random Access Machine) model we assume that any of the following operations can be done in one unit of time: addition, subtraction, multiplication, division, comparison, memory access, assignment, and so on. This model is widely accepted as a valid sequential model.

An important feature of parallel computing that is absent in sequential computing is the need for interprocessor communication. For example, given any problem, the processors have to communicate among themselves and agree on the subproblems each will work on. They also need to communicate to see whether every one of them has finished its task, and so on.

Computational model

Standard Random Access Machine

Each operation (load, store, jump, add, etc.) takes one unit of time.

A simple, general model.

Basic model for sequential algorithm

RAM Model for Single Processor Machine

We begin this discussion with an ideal parallel machine called the Parallel Random Access Machine, or PRAM.

Physical organization of Parallel Platforms

A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel Random Access Machine, or PRAM.

A PRAM consists of p processors and a global memory of unbounded size that is uniformly accessible to all processors.

Processors share a common clock but may execute different instructions in each cycle.

PRAM and Basic Algorithm

PRAM Architecture

Conceptual view of a parallel random-access machine (PRAM).

Processor i can do the following in three phases of one cycle:

1. Fetch a value from address s_i in shared memory

2. Perform computations on data held in local registers

3. Store a value into address d_i in shared memory

CLASSIFICATION OF PRAM

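EREW (exclusive read, exclusive write): least "powerful", most "realistic"

CREW (concurrent read, exclusive write): default

ERCW (exclusive read, concurrent write): not useful

CRCW (concurrent read, concurrent write): most "powerful", further subdivided

Submodels of the PRAM model, classified by whether reads from the same location and writes to the same location are exclusive or concurrent.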

CRCW SUBMODEL

CRCW submodels, by how write conflicts are resolved:

CRCW-U: Undefined

CRCW-D: Detecting

CRCW-C: Common

CRCW-R: Random

CRCW-P: Priority

CRCW-M: Max/Min

Reduction: AND, XOR, SUM

• Exclusive Read (ER): p processors can simultaneously read the contents of p distinct memory locations.

• Concurrent Read (CR): p processors can simultaneously read the contents of p' memory locations, where p' < p.

• Exclusive Write (EW): p processors can simultaneously write to p distinct memory locations.

• Concurrent Write (CW): p processors can simultaneously write to p' memory locations, where p' < p.
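A concurrent write needs a rule for deciding what ends up in the cell. The sketch below illustrates a few such rules in Python. The function name, the policy strings, and the convention that the lowest-indexed processor has the highest priority are assumptions made for this illustration, not part of the slides.

```python
def resolve_concurrent_write(values, policy):
    """Return the value stored when several processors write to one cell.

    `values` holds the attempted writes in processor-index order.
    The policy names are labels chosen for this sketch only.
    """
    if not values:
        raise ValueError("no writes attempted")
    if policy == "common":
        # Common: the write succeeds only if all writers agree.
        if any(v != values[0] for v in values):
            raise ValueError("common-write conflict: values differ")
        return values[0]
    if policy == "priority":
        # Priority: assumed here that the lowest-indexed processor wins.
        return values[0]
    if policy == "max":
        # Max: the largest attempted value is stored.
        return max(values)
    if policy == "sum":
        # Reduction (SUM): the sum of all attempted values is stored.
        return sum(values)
    raise ValueError(f"unknown policy: {policy}")
```

For example, if three processors try to write 3, 7 and 5 to the same cell, the priority rule stores 3, the max rule stores 7, and the sum rule stores 15.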

Memory Access in PRAM

Processors and memories are connected via switches.

Since these switches must operate in O(1) time at the level of words, for a system of p processors and m words, the switch complexity is O(mp).

Clearly, for meaningful values of p and m, a true PRAM is not realizable.

Physical Complexity

Implications of the CRCW Hierarchy of Submodels

Our most powerful PRAM CRCW submodel can be emulated by the least powerful submodel with logarithmic slowdown

Efficient parallel algorithms have polylogarithmic running times

Running time still polylogarithmic after slowdown due to emulation

A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with slowdown factor Θ(log p).

EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P

We need not be too concerned with the CRCW submodel used

Simply use whichever submodel is most natural or convenient

Let T(n) time and P(n) processors be used on a parallel machine for a problem with n inputs.

Cost: C(n) = P(n)T(n) is the time × processor product, or work, for the problem on n inputs.

An equivalent serial algorithm will run in time O(C(n)).

If p ≤ P(n) processors are available, we can implement the algorithm in time O(P(n)T(n)/p), that is, O(C(n)/p).
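The bound for fewer processors follows from letting p processors emulate each parallel step in several rounds. A minimal Python sketch of that arithmetic is given here. The function names are hypothetical, and the sketch assumes every parallel step keeps all P(n) processors busy, which is the worst case.

```python
import math

def cost(processors, time_steps):
    """Time x processor product C(n) = P(n) * T(n)."""
    return processors * time_steps

def time_with_fewer_processors(processors, time_steps, p):
    """Steps needed when only p <= P(n) processors are available.

    Each original parallel step is emulated in ceil(P(n) / p) rounds,
    so the total is ceil(P(n) / p) * T(n), which is O(C(n) / p).
    """
    if not 1 <= p <= processors:
        raise ValueError("p must satisfy 1 <= p <= P(n)")
    return math.ceil(processors / p) * time_steps
```

With P(n) = 1024 and T(n) = 10, the cost is 10240. Running on 256 processors takes 4 rounds per step, or 40 steps in total, and running on a single processor takes 10240 steps, matching the serial bound O(C(n)).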

Measuring Performance of Parallel Programs

Matrix Multiplication

Sequential matrix multiplication

for i = 0 to m - 1 do
    for j = 0 to m - 1 do
        t := 0
        for k = 0 to m - 1 do
            t := t + a_ik * b_kj
        endfor
        c_ij := t
    endfor
endfor

[Figure: A × B = C for m × m matrices; row i of A and column j of B produce element c_ij of C]

$c_{ij} := \sum_{k=0}^{m-1} a_{ik} b_{kj}$

PRAM solution with m^3 processors: each processor does one multiplication (not very efficient)

Consider n × n matrix multiplication with n^3 processors.

Each $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$ can be computed on the CREW PRAM in parallel using n processors in O(log n) time.

On the EREW PRAM, the exclusive-read requirement for the a_ik and b_kj values can be satisfied by making n copies of a and b, which takes O(log n) time with n processors.

Total time is still O(log n).

The memory requirement is of course much higher for the EREW PRAM.

MATRIX MULTIPLICATION

Complexity: Θ(n^3)

Better algorithms improve on this slightly

Multiplication by block

Takes advantage of the cache

Matrix Multiplication

PRAM Matrix Multiplication with m^2 Processors

PRAM matrix multiplication using p = m^2 processors:

Proc (i, j), 0 ≤ i, j < m, do
begin
    t := 0
    for k = 0 to m - 1 do
        t := t + a_ik * b_kj
    endfor
    c_ij := t
end

[Figure: A × B = C; processor (i, j) combines row i of A with column j of B to produce c_ij]

Θ(m) steps: time-optimal

CREW model is implicit

Processors are numbered (i, j), instead of 0 to m^2 - 1
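The step count of the m^2-processor algorithm can be checked with a small lock-step simulation. The Python sketch below is only an illustration: the function name is hypothetical, and the two inner loops stand in for the m^2 processors that would all act at the same time on a real PRAM.

```python
def pram_matmul_m2(a, b):
    """Simulate PRAM matrix multiplication with one processor per c_ij.

    Returns the product and the number of synchronous steps taken.
    """
    m = len(a)
    t = [[0] * m for _ in range(m)]  # private accumulator of Proc (i, j)
    steps = 0
    for k in range(m):
        # One parallel step: every Proc (i, j) does one multiply-add.
        # Row i of A is read by m processors at once, hence CREW.
        for i in range(m):
            for j in range(m):
                t[i][j] += a[i][k] * b[k][j]
        steps += 1
    return t, steps
```

For two 2 × 2 matrices the simulation finishes in 2 steps, in line with the Θ(m) bound, while the sequential algorithm performs 8 multiply-add operations.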

PRAM Matrix Multiplication with m Processors

PRAM matrix multiplication using m processors:

for j = 0 to m - 1 Proc i, 0 ≤ i < m, do
    t := 0
    for k = 0 to m - 1 do
        t := t + a_ik * b_kj
    endfor
    c_ij := t
endfor

[Figure: A × B = C; processor i combines row i of A with column j of B to produce c_ij]

Θ(m^2) steps: time-optimal

CREW model is implicit

Because the order of multiplications is immaterial, accesses to B can be skewed to allow the EREW model

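- The m processors read distinct rows of A at once (no concurrent reads)

- All m processors read the same column of B at the same time, so concurrent reads must be allowed; if they are not, the accesses to B can be skewed so that in each step every processor reads a different element of the column, which lets the algorithm run on the EREW model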
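One way to picture the skewing is to let processor i start its inner loop at k = i and wrap around. The Python sketch below simulates this and checks that no two processors read the same element of B in the same step. The function name and the particular offset (i + step) mod m are assumptions chosen for illustration; other skews would work as well.

```python
def skewed_matmul_m_processors(a, b):
    """Simulate the m-processor algorithm with skewed accesses to B.

    Returns the product and a flag telling whether every step
    used exclusive reads of B only.
    """
    m = len(a)
    c = [[0] * m for _ in range(m)]
    exclusive_reads = True
    for j in range(m):
        t = [0] * m  # private accumulator of each processor
        for step in range(m):
            # Processor i reads b[k][j] with k = (i + step) mod m.
            ks = [(i + step) % m for i in range(m)]
            if len(set(ks)) != m:
                exclusive_reads = False
            for i in range(m):
                t[i] += a[i][ks[i]] * b[ks[i]][j]
        for i in range(m):
            c[i][j] = t[i]
    return c, exclusive_reads
```

Each processor still visits every k exactly once, only in a rotated order, so the sums are unchanged and the step count stays at Θ(m^2).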

PRAM Matrix Multiplication with Fewer Processors

The algorithm is similar, except that each processor is in charge of computing m/p rows of C

Θ(m^3/p) steps: time-optimal

EREW model can be used

A drawback of all algorithms thus far is that only two arithmetic operations (one multiplication and one addition) are performed for each memory access.

This is particularly costly for NUMA shared-memory machines.

[Figure: A × B = C; each processor computes a band of m/p rows of C]

More Efficient Matrix Multiplication (for NUMA)

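Partitioning the matrices for block matrix multiplication:

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} E & F \\ G & H \end{pmatrix} = \begin{pmatrix} AE+BG & AF+BH \\ CE+DG & CF+DH \end{pmatrix}$$

Block matrix multiplication follows the same algorithm as simple matrix multiplication.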

[Figure: the matrices are partitioned into p square blocks of size q × q, with q = m/√p, arranged in √p block-bands of rows and √p block-bands of columns. One processor computes the elements of the block (i, j) of C that it holds in local memory.]
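The block scheme can be sketched sequentially in Python as follows. This is an illustrative sketch, not the slides' own code: the function name is hypothetical, the block size q is assumed to divide m, and the two outer loops stand in for the p processors that would each own one block of C.

```python
def block_matmul(a, b, q):
    """Multiply two m x m matrices using q x q blocks."""
    m = len(a)
    if m % q != 0:
        raise ValueError("block size q must divide m")
    blocks = m // q
    c = [[0] * m for _ in range(m)]
    for bi in range(blocks):
        for bj in range(blocks):
            # In the parallel version one processor owns block (bi, bj) of C.
            for bk in range(blocks):
                # Multiply block (bi, bk) of A by block (bk, bj) of B
                # and add the result into block (bi, bj) of C.
                for i in range(bi * q, (bi + 1) * q):
                    for k in range(bk * q, (bk + 1) * q):
                        a_ik = a[i][k]
                        for j in range(bj * q, (bj + 1) * q):
                            c[i][j] += a_ik * b[k][j]
    return c
```

With q = 1 this reduces to the simple algorithm, and with q = m it treats each matrix as a single block. The result is the same for every valid q; only the order of the memory accesses changes.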

Details of Block Matrix Multiplication

How Processor (i, j) operates on an element of A and one block-row of B to update one block-row of C.

A multiply-add computation on q × q blocks needs 2q^2 = 2m^2/p memory accesses and 2q^3 arithmetic operations.

So, q arithmetic operations are done per memory access.

[Figure: an element of block (i, k) in matrix A, at row iq + a and column kq + c, multiplies the elements of row kq + c of block (k, j) in matrix B (columns jq to jq + q - 1); the products are added to the elements of row iq + a of block (i, j) in matrix C.]

Time Complexity Of Matrix Multiplication

Adding n numbers in parallel

Parallel Processing, Extreme Models

A simple parallel algorithm

Example: adding n numbers (shown for n = 8)

1. We start with 4 processors, and each of them adds 2 items in the first step.

2. The number of items is halved at every subsequent step. Hence log n steps are required for adding n numbers.

3. The processor requirement is O(n).

A parallel algorithm is analyzed mainly in terms of its time, processor and work complexities.

Time complexity T(n): How many time steps are needed?

Processor complexity P(n): How many processors are used?

Work complexity W(n): What is the total work done by all the processors?

Hence, for the addition example:

T(n) = O(log n)

P(n) = O(n)

W(n) = O(n log n)
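The halving pattern of the addition example can be simulated in a few lines of Python. The sketch below is illustrative only: the function name is hypothetical, and the list comprehension stands in for the processors that would add their pairs simultaneously in one step.

```python
def parallel_sum(values):
    """Add numbers by pairwise reduction; return (sum, parallel steps)."""
    data = list(values)
    steps = 0
    while len(data) > 1:
        # One parallel step: each processor adds one disjoint pair.
        paired = [data[i] + data[i + 1] for i in range(0, len(data) - 1, 2)]
        if len(data) % 2 == 1:
            # An unpaired last item is carried over to the next step.
            paired.append(data[-1])
        data = paired
        steps += 1
    return (data[0] if data else 0), steps
```

For the 8 numbers 1 to 8, the list shrinks from 8 items to 4, 2 and 1, so the sum 36 is obtained in 3 steps, which is log 8 in base 2.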

How do we analyze a parallel algorithm?

• Let P(n) = O(n^3), one processor for each product a_ik * b_kj

Read A: all processors read their a_ik at once (concurrent reads allowed): O(1)

Read B: all processors read their b_kj at once (concurrent reads allowed): O(1)

Multiply: each processor computes a_ik * b_kj: O(1)

Parallel sum to get c_ij: O(log n)

Store value c_ij: O(1)

T(n) = O(log n), P(n) = O(n^3), W(n) = O(n^3 log n) = total number of operations

CREW Cost

• Let P(n) = O(n^3), one processor for each product a_ik * b_kj

Read A and B: each value is needed by n processors, and concurrent reads are not allowed, so the reads cannot all complete in O(1)

Either skew the memory accesses, or replicate the values so that each processor has its own copy; replication in parallel takes O(log n)

Multiply: each processor computes a_ik * b_kj: O(1)

Parallel sum to get c_ij: O(log n)

Store value c_ij: O(1)

T(n) = O(log n), P(n) = O(n^3), W(n) = O(n^3 log n) = total number of operations
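The O(log n) replication step can be pictured as repeated doubling: in every step each existing copy is read by exactly one processor, which writes it to a fresh cell, so no cell is ever read twice in the same step. The Python sketch below simulates this under that assumption. The function name is hypothetical.

```python
def replicate_erew(value, n):
    """Make n copies of a value by doubling; return (copies, steps)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    copies = [value]
    steps = 0
    while len(copies) < n:
        # One parallel step: each new cell is filled from a distinct
        # existing copy, so all reads and writes are exclusive.
        new_cells = min(len(copies), n - len(copies))
        copies.extend(copies[:new_cells])
        steps += 1
    return copies, steps
```

Making 8 copies takes 3 steps (1, 2, 4, 8 cells), so replication adds only a logarithmic term and the total time stays O(log n).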

EREW Cost

PRAM removes algorithmic details concerning synchronization and communication, allowing the algorithm designer to focus on problem properties.

A PRAM algorithm includes an explicit understanding of the operations performed at each time unit and an explicit allocation of processors to jobs at each time unit.

PRAM design paradigms have turned out to be robust and have been mapped efficiently onto many other parallel models and even network models.

Advantages of PRAM Model

Model inaccuracies

• Unbounded local memory (registers)

• All operations take unit time

• Processors run in lockstep

Unaccounted costs

• Non-local memory access

• Latency

• Bandwidth

• Memory access contention

PRAM Weaknesses

PRAM algorithms are the source of most fundamental ideas in parallel algorithm design

It’s a source of inspiration for algorithms

PRAM is simple and easy to understand.

The improved locality of block matrix multiplication can also improve the running time on a uniprocessor, or on a distributed shared-memory multiprocessor with caches

Conclusion

Reason: Higher Cache Hit Rates
