Parallelizing Compiler Technology
Dr. Stephen Tse
stephen_tse@qc.edu
Lesson 7


Vectorize by Properly Restructuring the Innermost Loop

• Presence of a cycle in the data dependence graph usually prevents vectorization, but if one of the edges forming the cycle is a counter-dependence or an output dependence, the compiler may still perform the vectorization by inserting a substitution statement that uses a temporary array.

(a)
    do I=1,N
      C(I)=A(I)+B(I)
      D(I)=C(I)+C(I+1)
    end do

(b)
    do I=1,N
      TEMP(I)=C(I+1)
      C(I)=A(I)+B(I)
      D(I)=C(I)+TEMP(I)
    end do

(c)
    TEMP(1:N)=C(2:N+1)
    C(1:N)=A(1:N)+B(1:N)
    D(1:N)=C(1:N)+TEMP(1:N)

(a) This do loop has a cycle consisting of a flow dependence and a counter-dependence

(b) The do loop can be rewritten by introducing the temporary array TEMP(I)

(c) Finally, it can be vectorized

Nodal Division restructuring method


Scalar Expansion Method

• These two methods are intended to vectorize by properly restructuring the innermost loop. Such loops cannot be vectorized directly because data dependence analysis shows that they contain counter-dependences or output dependences. The compiler may perform the vectorization by inserting a substitution statement that uses a temporary array.

(a)
    do I=1,N
      T=A(I)*B(I)
      C(I)=T+B(I)
    end do

(b)
    do I=1,N
      TEMP(I)=A(I)*B(I)
      C(I)=TEMP(I)+B(I)
    end do
    T=TEMP(N)

(c)
    TEMP(1:N)=A(1:N)*B(1:N)
    C(1:N)=TEMP(1:N)+B(1:N)
    T=TEMP(N)

(a) This method cancels out the counter-dependence on the scalar variable T in the loop

(b) T is expanded into the temporary array TEMP(I)

(c) Finally, the do loop can be vectorized

Nodal Division restructuring method


Other Restructuring Methods

• Many other restructuring methods have been proposed for better processing efficiency:
– by increasing the vector length
– by operating plural pipelines in parallel
– by using the vector registers effectively


A Loop Interchange Method

• When the vector length is short, loop interchange can increase the vector length and achieve better processing efficiency.

(a)
    do I=1,100
      do J=2,20
        B(I,J)=A(I,J)+B(I,J-1)
      end do
    end do

(b)
    do J=2,20
      do I=1,100
        B(I,J)=A(I,J)+B(I,J-1)
      end do
    end do

(a) The primary recurrence of this dual loop is in the inner loop, and the vector length is short.

(b) Increasing the vector length by exchanging the outer and inner loops for better processing efficiency.


Loop Decay Method

• Changing a dual loop into a single loop is intended to increase the vector length.

(a)
    real A(6,6), B(6,6)
    do I=1,6
      do J=1,6
        A(I,J)=B(I,J)+C(I,J)
      end do
    end do

(b)
    real A(36), B(36)
    do IJ=1,36
      A(IJ)=B(IJ)+C(IJ)
    end do

(a) The iteration count of the inner loop of the dual loop is low.

(b) Exchanging the dual loop for a single loop is called loop decay.


Strip Mining Method

• The strip mining method applies when a single loop has a large trip count N. Contrary to the loop decay method, the single loop is split into multiple nested loops.

• This method is used mainly for vector register control in the vectorization process.

(a)
    do I=1,N
      C(I)=A(I)+B(I)
    end do

(b)
    do I=1,N,64
      do J=I,min(I+63,N)
        C(J)=A(J)+B(J)
      end do
    end do

(a) If the machine's vector registers are only 64 elements long, then when N is large this single loop is time-consuming to process.

(b) To use those registers effectively, we can convert the loop into a dual loop so that each pass of the inner loop fits the vector registers.


Parallelizing Compilers for Multiprocessor Systems

• For further improvement of the performance of pipelined supercomputers in the hardware area, the methods considered include:
– increasing the number of pipeline stages
– raising the speed of processing at each stage (increasing the element speed)
– increasing the number of pipes which can operate in parallel

• In reality, there are difficulties in stepping up the performance by means of increasing the number of stages and pipes.

• If we rely on increases in element speed to boost effective performance, then even with the fastest current silicon bipolar logic ICs (as fast as 70 ps, with about 20,000-gate integration), it is still difficult to attain a speed increase of several orders of magnitude in the future.

• Therefore, the multiprocessor format, in which multiple conventional pipeline processors are connected together, is important.

• The second-generation supercomputers all incorporate the multiprocessor format; e.g., the CRAY-2, CRAY X-MP, and Nichiden (NEC) SX-3 are all made up of four processors connected through shared memory.


Automatic Parallelizing Compiler with Multiprocessor

The multiprocessor system is a combination of plural processors connected by an interconnection network or a shared memory. Each processor can process a different stream of instructions while one processor gives data to, and receives data from, the other processors.

[Figure: Multiprocessor System. Processors 1, 2, and 3, each with a local memory, connected through an interconnection network (bus, crossbar switch, multistage switching network, etc.) and a shared memory.]


Software: Parallel Programming Language

Observation 1: A regular sequential programming language (C, Fortran, C++, etc.) plus four communication statements (send, receive, myid, numnodes) are necessary and sufficient to form a parallel computing language.

1. Send: One processor sends a message to the network. Note that this processor does not have to know to which processor it is sending the message, but it does give a "name" to the message.

2. Receive: One processor receives a message from the network. Note that this processor does not have to know which processor sent the message, but it retrieves the message by name.

3. myid: An integer between 0 and P-1 identifying a processor. myid is always unique within one partition.

4. numnodes: An integer giving the total number of nodes in the system.

(A minimal sketch using these four primitives is given below.)
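As a concrete illustration (not part of the original slides), here is a minimal sketch of these four primitives expressed with MPI, one of the portability helpers mentioned later in this lesson. The mapping is an assumption made for illustration only: myid corresponds to MPI_Comm_rank, numnodes to MPI_Comm_size, Send/Receive to MPI_Send/MPI_Recv, and the MPI tag plays the role of the message "name".

    /* Hypothetical sketch: the generic send/receive/myid/numnodes
     * primitives realized with MPI (one possible mapping). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int myid, numnodes;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);      /* "myid": 0 .. P-1        */
        MPI_Comm_size(MPI_COMM_WORLD, &numnodes);  /* "numnodes": total nodes */

        int tag = 99;                              /* the message "name"      */
        double msg = 3.14;

        if (myid == 0) {
            /* "Send": node 0 sends the message, labeled by its tag. */
            MPI_Send(&msg, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
        } else if (myid == 1) {
            /* "Receive": node 1 retrieves the message by its tag,
             * without having to know which node sent it. */
            MPI_Recv(&msg, 1, MPI_DOUBLE, MPI_ANY_SOURCE, tag,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node %d of %d received %f\n", myid, numnodes, msg);
        }

        MPI_Finalize();
        return 0;
    }

The sketch needs at least two processes to run (for example, mpirun -np 2); all names and values above are illustrative.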


Send and Receive

Figure 1: Basic Message Passing
Sender: The circle on the left represents the "Sender", whose responsibility is to send a message to the "Network Buffer" without knowing who the receiver is.
Receiver: The "Receiver" on the right has to issue a request to the "buffer" to retrieve a message that is labeled for it.
Note:
1. This is the so-called single-sided message passing, which is popular on most distributed-memory supercomputers.
2. The Network Buffer, as labeled, does not in fact exist as an independent entity; it is only temporary storage. It is created either in the sender's RAM or in the receiver's RAM, depending on the readiness of the message routing information. For example, if a message's destination is known but the exact location at the destination is not yet known, the message will be copied to the receiver's RAM for easier transmission.

[Figure 1: The basic concept of message passing. Sender → Network Buffer → Receiver.]


THREE WAYS TO COMMUNICATE

1. Synchronous: The sender will not proceed to the next task until the receiver retrieves the message from the network (hand delivery: slow!).

2. Asynchronous: The sender will proceed to the next task whether or not the receiver has retrieved the message from the network (mailing a letter: will not tie up the sender!). There is no protection for the message in the buffer. (An example of asynchronous message passing is given in Figure 4.)

3. Interrupt: The receiver interrupts the sender's current activity to pull messages from the sender (ordering a package: interrupt the sender!).

(A sketch contrasting the first two modes follows.)
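As a hedged illustration, assuming MPI as the message-passing layer, the first two modes can be contrasted as follows: MPI_Ssend blocks until the receiver has started to take the message (synchronous), while MPI_Isend returns immediately and the buffer stays unprotected until MPI_Wait reports completion (asynchronous). The interrupt mode has no direct counterpart in standard MPI and is not shown here.

    /* Hypothetical MPI sketch contrasting synchronous and asynchronous sends. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int myid, tag = 7;
        double msg = 1.0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        if (myid == 0) {
            /* 1. Synchronous: does not return until the receiver has begun
             *    to retrieve the message ("hand deliver": slow, but safe). */
            MPI_Ssend(&msg, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);

            /* 2. Asynchronous: returns at once; the sender keeps working
             *    ("mailing a letter"), but the buffer has no protection
             *    until MPI_Wait says the transfer is complete. */
            MPI_Request req;
            MPI_Isend(&msg, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD, &req);
            /* ... do other useful work here ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (myid == 1) {
            MPI_Recv(&msg, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&msg, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }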


Synchronous Communication

Figure 2: Synchronous Message Passing

The circle on the left sends message 1 to the imaginary Network "Buffer", which then requests the destination to stop its current activities and get ready to receive a message from the sender. In synchronous mode, the Receiver will immediately halt its current processing stream and issue an acknowledgement to the Sender saying "OK" to send the message. After receiving this acknowledgement, the Sender will immediately deliver the originally intended message to the Receiver at the exact location.

[Figure 2: Synchronous message passing. Send (msg 1), "OK?"/"Yes" handshake, Receive msg 1.]


Asynchronous Communication

Figure 3: Asynchronous message passing: The Sender issues a message with the appropriate addressing header (envelope information) and, regardless of whether the message has arrived at the Receiver end, the Sender continues its execution without waiting for any confirmation from the Receiver. The Receiver, on the other hand, will also continue its own execution stream until the "receive" statement is reached.

Note: The advantage of asynchronous message passing is its speed; there is no need for either party to wait. The risk lies in misusing the message while it sits unprotected in the buffer.

[Figure 3: Asynchronous message passing. Send (msg 1), Receive msg 1, no acknowledgement handshake.]


Asynchronous Message Passing Example

SRC Processor (Sender):
    doing_something_useful
    ...
    msg_sid = isend()    /* send msg */
    ...
    doing_sth_without_messing_msg

DST Processor (Receiver), Choice I: msgwait
    msg_rid = irecv()    /* no need of msgs yet */
    ...
    doing_sth_without_needing_msgs
    ...
    msgwait(msg_rid);    /* does not return until msg arrives */
    doing_sth_using_msgs

Choice II: msgdone
    if (msgdone(msg_rid))
        doing_sth_with_it;
    else
        doing_other_stuff;

Choice III: msgignore
    msgignore(msg_rid);    /* oops, wrong number */

Choice IV: msgmerge
    mid = msgmerge(mid1, mid2);    /* to group msgs for a purpose */

Figure 4: Asynchronous message passing example. The Sender issues a message and then continues its execution regardless of the Receiver's response to that message. The Receiver has several options with regard to the message already issued by the Sender; this message now sits somewhere called the Buffer:
1. The first option for the Receiver is to wait until the message has arrived and then make use of it.
2. The second is to check whether the message has indeed arrived. If yes, do something with it; otherwise, carry on with its own work.
3. The third option is to ignore the message, telling the buffer that this message was not for it.
4. The fourth option is to merge this message with another existing message in the Buffer; etc.
(A rough MPI equivalent of these choices is sketched below.)
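The calls in Figure 4 (isend, irecv, msgwait, msgdone, msgignore, msgmerge) follow the style of older native message-passing libraries. As a rough, assumed modern equivalent, the same receiver choices can be sketched with MPI's nonblocking operations: MPI_Wait for msgwait, MPI_Test for msgdone, MPI_Cancel for msgignore; MPI has no direct msgmerge, but MPI_Waitall over several requests plays a similar grouping role.

    /* Hypothetical MPI sketch of the Receiver's four choices in Figure 4. */
    #include <mpi.h>

    void receiver_choices(void)
    {
        double buf1, buf2;
        int arrived;
        MPI_Request r1, r2;

        /* Post the receives early, before the messages are needed (cf. irecv). */
        MPI_Irecv(&buf1, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &r1);
        MPI_Irecv(&buf2, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 2, MPI_COMM_WORLD, &r2);

        /* ... doing_sth_without_needing_msgs ... */

        /* Choice I: wait until the message has arrived (cf. msgwait). */
        MPI_Wait(&r1, MPI_STATUS_IGNORE);
        /* ... doing_sth_using_msgs: buf1 is now valid ... */

        /* Choice II: check whether the message has arrived (cf. msgdone). */
        MPI_Test(&r2, &arrived, MPI_STATUS_IGNORE);
        if (arrived) {
            /* doing_sth_with_it */
        } else {
            /* doing_other_stuff, and then perhaps:                  */
            /* Choice III: give the message back (cf. msgignore).    */
            MPI_Cancel(&r2);
            MPI_Wait(&r2, MPI_STATUS_IGNORE);  /* complete the cancelled request */
        }

        /* Choice IV: handle several pending requests as one group, loosely
         * comparable to msgmerge:
         *     MPI_Waitall(count, requests, MPI_STATUSES_IGNORE);            */
    }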


Interrupt Communication

Figure 5: Interrupt message passing:
1. The Sender first issues a short message to interrupt the Receiver's current execution stream, so that the Receiver is ready to receive a long message from the Sender.
2. After an appropriate delay (for the interrupt handler to return control to the messaging process), the Sender pushes the message through to the right location on the Receiver without further delay.
(A rough sketch of this two-step protocol follows the figure.)

[Figure 5: Interrupt message passing. The Sender sends a short message to interrupt the Receiver, then sends the full message.]
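As a rough illustration (an assumption, since standard MPI has no true interrupt-driven receive), the two-step protocol of Figure 5 can be imitated by sending a short control message that announces the long message, followed by the payload itself:

    /* Hypothetical two-step ("interrupt"-style) message passing in MPI:
     * a short control message first, then the long message. */
    #include <mpi.h>

    #define CTRL_TAG 1
    #define DATA_TAG 2

    void send_long(double *data, int n, int dest)
    {
        /* Step 1: short message that plays the role of the interrupt. */
        MPI_Send(&n, 1, MPI_INT, dest, CTRL_TAG, MPI_COMM_WORLD);
        /* Step 2: push the long message through. */
        MPI_Send(data, n, MPI_DOUBLE, dest, DATA_TAG, MPI_COMM_WORLD);
    }

    void recv_long(double *data, int maxn, int src)
    {
        int n;
        /* The short control message tells the receiver to get ready. */
        MPI_Recv(&n, 1, MPI_INT, src, CTRL_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (n <= maxn)
            MPI_Recv(data, n, MPI_DOUBLE, src, DATA_TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }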


NINE COMMUNICATION PATTERNS

1. 1 to 1
2. 1 to Partial
3. 1 to All
4. Partial to 1
5. Partial to Partial
6. Partial to All
7. All to 1
8. All to Partial
9. All to All
(Several of these patterns are sketched below.)
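As a rough illustration (an assumption, not part of the slides), several of these patterns correspond to standard MPI collectives: 1 to All is a broadcast, All to 1 is a gather (or a reduction), and All to All is an all-to-all exchange; the "Partial" cases can be expressed by running the same collectives on a sub-communicator.

    /* Hypothetical MPI sketch of a few of the nine communication patterns. */
    #include <mpi.h>

    void patterns(int myid, int numnodes)
    {
        double x = (double)myid;
        double in[1024], out[1024], all[1024];   /* assumes numnodes <= 1024 */

        /* 1 to All: processor 0 broadcasts one value to everybody. */
        MPI_Bcast(&x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* All to 1: every processor sends its value to processor 0. */
        MPI_Gather(&x, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* All to All: every processor exchanges one value with every other. */
        for (int i = 0; i < numnodes; i++) in[i] = x;
        MPI_Alltoall(in, 1, MPI_DOUBLE, out, 1, MPI_DOUBLE, MPI_COMM_WORLD);

        /* Partial cases: split off a sub-group and reuse the same collectives. */
        MPI_Comm sub;
        MPI_Comm_split(MPI_COMM_WORLD, myid % 2, myid, &sub);
        MPI_Bcast(&x, 1, MPI_DOUBLE, 0, sub);    /* 1 to Partial */
        MPI_Comm_free(&sub);
    }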


Communication Patterns


Figure 6: Nine Communication Patterns: (A) A single processor can send one message (the same message) to one processor, to a sub-group of M processors, or to the entire system. (B) A subgroup of M processors, or all processors, can send M different messages, or all different messages, to one processor. (C) A sub-group of K processors (how the messages are partitioned is a separate issue) can send messages to the entire system.

Finally, the entire system of P processors can send P different messages to one processor, a sub-group of N processors, or to the entire system.

Note: 1. In the obvious case of one message sent to 1, K, or P processors (and the same case in reverse), the messages are partitioned naturally. 2. But in the case of M messages sent to K processors, the matter is a different problem, and we will discuss that later.

[Figure 6: Nine communication patterns, sender to receiver. (A) 1→1, 1→M, 1→All; (B) M→1, All→1; (C) K→P, All(P)→All(P).]


PARALLEL PROGRAMMING TOOLS

1. Parallel computing languages (parallel FORTRAN, C, C++, etc.)

1.1 Message-passing assistant

1.2 Portability helpers: PVM, MPI, ...

2. Debuggers

3. Performance analyzer

4. Queuing system (same as in sequential)


Parallel Performance Measurement: 1. Speedup

Let T(1,N) be the time required for the best serial algorithm to solve a problem of size N on one processor, and T(P,N) be the time for a given parallel algorithm to solve the same problem of the same size N on P processors. Speedup is defined as

S(P, N) = T(1,N)/T(P, N)

Remarks:
1. Normally, S(P,N) < P; ideally, S(P,N) = P; rarely, S(P,N) > P (super speedup).
2. Linear speedup: S(P,N) = c*P, where c is a constant independent of N and P.
3. Algorithms with S(P,N) = c*P are called scalable algorithms.


2. Parallel Efficiency

Let T(1,N) be the time required for the best serial algorithm to solve a problem of size N on one processor, and T(P,N) be the time for a given parallel algorithm to solve the same problem of the same size N on P processors. Parallel efficiency is defined as

E(P,N) = T(1,N)/[P*T(P,N)] = S(P,N)/P

Remarks:
1. Normally, E(P,N) < 1; ideally, E(P,N) = 1; rarely, E(P,N) > 1. An efficiency of about 0.6 is acceptable; of course, this is problem-dependent.
2. Linear speedup: E(P,N) = c, where c is a constant independent of N and P.
3. Algorithms with E(P,N) = c are called scalable algorithms.
(A small worked example of S and E follows.)
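A small worked sketch with assumed numbers (not from the slides): with T(1,N) = 100 s and T(8,N) = 16 s, S(8,N) = 100/16 = 6.25 and E(8,N) = 6.25/8 ≈ 0.78, which is acceptable by the rule of thumb above.

    /* Illustrative calculation of speedup and parallel efficiency. */
    #include <stdio.h>

    int main(void)
    {
        double T1 = 100.0;   /* assumed serial time T(1,N), in seconds   */
        double Tp = 16.0;    /* assumed parallel time T(P,N), in seconds */
        int    P  = 8;       /* number of processors                     */

        double S = T1 / Tp;          /* speedup    S(P,N) = T(1,N)/T(P,N) */
        double E = S / (double)P;    /* efficiency E(P,N) = S(P,N)/P      */

        printf("S(%d,N) = %.2f, E(%d,N) = %.2f\n", P, S, P, E);
        return 0;
    }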


3. Load Imbalance Ratio I(P,N)
• Processor i spends ti doing useful work; tmax = max{ti} is the maximum time spent by any processor, and tavg = (Σ_{i=0..P-1} ti)/P is the average time. The total time spent on useful work (computation and communication) is Σ_{i=0..P-1} ti, while the time the system is occupied (computation, communication, or idle) is P*tmax. Thus, we define a parameter called the load imbalance ratio:

I(P,N) = [P*tmax - Σ_{i=0..P-1} ti] / Σ_{i=0..P-1} ti = tmax/tavg - 1

Remarks:
1. I(P,N) is the average time wasted by each processor due to load imbalance, measured relative to the average useful time.
2. If ti = t for all i, then tmax = tavg = t and I(P,N) = 0: complete load balance.
3. One slow processor (large tmax) can mess up the entire team. This observation shows that the slave-master scheme is usually very inefficient because of the load imbalance caused by the slow master processor.
(A small worked example of I(P,N) follows.)
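A minimal sketch with assumed per-processor times, to make the definition concrete: if four processors spend 10, 10, 10, and 14 seconds on useful work, then tmax = 14, tavg = 11, and I(P,N) = 14/11 - 1 ≈ 0.27, i.e. about 27% extra time is wasted waiting for the slowest processor.

    /* Illustrative calculation of the load imbalance ratio I(P,N). */
    #include <stdio.h>

    int main(void)
    {
        double t[] = {10.0, 10.0, 10.0, 14.0};  /* assumed useful time per processor */
        int P = sizeof(t) / sizeof(t[0]);

        double sum = 0.0, tmax = t[0];
        for (int i = 0; i < P; i++) {
            sum += t[i];
            if (t[i] > tmax) tmax = t[i];
        }
        double tavg = sum / P;

        /* I(P,N) = [P*tmax - sum(ti)] / sum(ti) = tmax/tavg - 1 */
        double I = (P * tmax - sum) / sum;

        printf("tmax = %.1f, tavg = %.2f, I(P,N) = %.3f\n", tmax, tavg, I);
        return 0;
    }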


Load Balance: ti on P Nodes within Synchronization
