Performance of Program


Performance Oriented Programming:

OpenMP provides a set of important pragmas and runtime functions that enable thread synchronization and related actions to facilitate correct parallel programming. Using these pragmas and runtime functions effectively with minimum overhead and thread waiting time is extremely important for achieving optimal performance from your applications.

Performance Issues in OpenMP:

- Using Barrier and nowait
- Interleaving Single-Thread & Multiple-Thread Execution
- Data Copy-in & Copy-out
- Protecting Updates of Shared Variables

Using Barrier and nowait: Barriers are a synchronization mechanism that OpenMP uses to coordinate threads.

Threads will wait at a barrier until all the threads in the parallel region have reached the same point.

At the end of the parallel, for, sections, and single constructs, an implicit barrier is generated by the compiler or invoked in the runtime library. The barrier causes execution to wait for all threads to finish the work of the loop, sections, or region before any thread goes on to execute additional work.

This barrier can be removed with the nowait clause, as shown in the following code:

#pragma omp parallel
{
    #pragma omp for nowait
    for ( k = 0; k < m; k++ ) {
        fn10(k);
        fn20(k);
    }

    #pragma omp sections private(y, z)
    {
        #pragma omp section
        {
            y = sectionD();
            fn70(y);
        }
        #pragma omp section
        {
            z = sectionC();
            fn80(z);
        }
    }
}

The nowait clause can also be used with the work-sharing sections construct and the single construct to remove the implicit barrier at the end of the code block.
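
For example, nowait on a single construct lets one thread handle a one-time task while the remaining threads move straight on to the next loop. The sketch below is illustrative; fn_log, fn10, and m are assumed names in the style of the surrounding examples:

int k;
#pragma omp parallel
{
    // one thread performs the one-time task; nowait means the
    // other threads do not wait for it here
    #pragma omp single nowait
    fn_log();

    // the remaining threads start sharing this loop immediately,
    // and the single thread joins in once it finishes
    #pragma omp for
    for ( k = 0; k < m; k++ )
        fn10(k);
}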

OpenMP also supports adding an explicit barrier through the barrier pragma, as shown in the following example. It is especially useful when all threads need to finish a task before any more work can be computed.

#pragma omp parallel shared(x, y, z) num_threads(2)
{
    int tid = omp_get_thread_num();
    if (tid == 0) {
        y = fn70(tid);
    } else {
        z = fn80(tid);
    }

    #pragma omp barrier

    #pragma omp for
    for ( k = 0; k < 100; k++ ) {
        x[k] = y + z + fn10(k) + fn20(k);
    }
}

Interleaving Single-Thread & Multiple-Thread Execution:

In large real-world applications, a program may consist of both serial and parallel code segments for various reasons, such as data dependence constraints and I/O operations.

Within a parallel region, there is often a need to execute something only once, by only one thread; the master and single constructs can be used to achieve this.

Example:

#pragma omp parallel
{
    // every thread calls this function
    int tid = omp_get_thread_num();

    // this loop is divided among the threads
    #pragma omp for nowait
    for ( k = 0; k < 100; k++ ) x[k] = fn1(tid);

    // nowait removes the implicit barrier at the end of the
    // above loop, so threads do not synchronize here
    #pragma omp master
    y = fn_input_only();    // only the master thread calls this

    // add an explicit barrier to synchronize all threads,
    // making sure x[0..99] and y are ready for use
    #pragma omp barrier

    // again, this loop is divided among the threads
    #pragma omp for nowait
    for ( k = 0; k < 100; k++ ) x[k] = y + fn2(x[k]);

    // The above loop has no implicit barrier, so threads will not
    // wait for each other. One thread -- presumably the first one
    // done with the loop -- will continue and execute the following.
    #pragma omp single
    fn_single_print(y);     // only one of the threads calls this

    // The above single construct has an implicit barrier, so all
    // threads synchronize here before printing x[].
    #pragma omp master
    fn_print_array(x);      // only the master thread prints x[]
}


By using the single and master pragmas along with the barrier pragma and the nowait clause in a clever way, you should be able to maximize the scope of a parallel region and the overlap of computations, reducing threading overhead effectively while obeying all data dependences and I/O constraints in your programs.
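
As an illustration of widening a parallel region, consider the following sketch; a, b, n, fn10, and fn20 are illustrative names, and the two loops are assumed to be independent:

// Before: two parallel regions pay the fork/join overhead twice.
#pragma omp parallel for
for ( k = 0; k < n; k++ ) a[k] = fn10(k);
#pragma omp parallel for
for ( k = 0; k < n; k++ ) b[k] = fn20(k);

// After: one parallel region; because the loops are independent,
// the first loop's implicit barrier can also be removed.
#pragma omp parallel
{
    #pragma omp for nowait
    for ( k = 0; k < n; k++ ) a[k] = fn10(k);
    #pragma omp for
    for ( k = 0; k < n; k++ ) b[k] = fn20(k);
}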

Data Copy-in & Copy-out: When you parallelize a program, you normally have to deal with how to copy in the initial value of a private variable to initialize its private copy for each thread in the team. You may also need to copy out the value of the private variable computed in the last iteration/section to its original variable for the master thread at the end of the parallel region.

The OpenMP standard provides four clauses, firstprivate, lastprivate, copyin, and copyprivate, for you to accomplish the data copy-in and copy-out operations whenever necessary, based on your program and parallelization scheme.

The semantics of these clauses are described below:

firstprivate provides a way to initialize the value of a private variable for each thread with the value of the variable from the master thread. Normally, temporary private variables have an undefined initial value, saving the performance overhead of the copy.
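
A minimal sketch of firstprivate (offset and x are illustrative names, not from the original example):

int k, offset = 10;              // set by the master thread
#pragma omp parallel for firstprivate(offset)
for ( k = 0; k < 100; k++ ) {
    // each thread's private copy of offset starts at 10; with a
    // plain private(offset) clause it would start undefined
    x[k] = k + offset;
}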

lastprivate provides a way to copy out the value of the private variable computed in the last iteration/section to the copy of the variable in the master thread. Variables can be declared both firstprivate and lastprivate at the same time.
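
A minimal sketch of lastprivate (last is an illustrative name; fn10 follows the style of the earlier examples):

int k;
int last;
#pragma omp parallel for lastprivate(last)
for ( k = 0; k < 100; k++ ) {
    last = fn10(k);              // each thread writes its private copy
}
// after the loop, last holds the value from the sequentially
// final iteration (k == 99), just as a serial loop would leave it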

copyin provides a way to copy the master thread's threadprivate variable to the threadprivate variable of each other member of the team executing the parallel region.
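
A minimal sketch of copyin (counter is an illustrative file-scope variable, and omp.h is assumed to be included):

int counter = 0;                 // file scope: one copy per thread
#pragma omp threadprivate(counter)

void example(void)
{
    counter = 50;                // set in the master thread
    #pragma omp parallel copyin(counter)
    {
        // copyin initializes each thread's threadprivate counter
        // with the master thread's value (50) on region entry
        counter += omp_get_thread_num();
    }
}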

copyprivate provides a way to use a private variable to broadcast a value from one member of the team to the other members executing the parallel region. The copyprivate clause can only be associated with the single construct; the broadcast action is completed before any of the threads in the team leave the barrier at the end of the construct.
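
A minimal sketch of copyprivate (fn_input_only and fn2 follow the naming style of the earlier examples and are illustrative):

float value;
#pragma omp parallel private(value)
{
    // one thread reads the input; copyprivate then broadcasts its
    // private value to every thread's private copy before any
    // thread leaves the implicit barrier of the single construct
    #pragma omp single copyprivate(value)
    value = fn_input_only();

    fn2(value);                  // every thread sees the same value
}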

Example: the following code converts a color image to black and white.

for ( row = 0; row < height; row++ ) {
    for ( col = 0; col < width; col++ ) {
        pGray[col] = (BYTE)( pRGB[col].red   * 0.299 +
                             pRGB[col].green * 0.587 +
                             pRGB[col].blue  * 0.114 );
    }
    pGray += GrayStride;
    pRGB  += RGBStride;
}

The address computation for each pixel can be done with the following code:

pDestLoc = pGray + col + row * GrayStride;
pSrcLoc  = pRGB  + col + row * RGBStride;

Improved Version:

#pragma omp parallel for private(row, col) \
        firstprivate(doInit, pGray, pRGB)
for ( row = 0; row < height; row++ ) {
    // Need this init test to be able to start at an
    // arbitrary point within the image after threading.
    if (doInit == TRUE) {
        doInit = FALSE;
        pRGB  += ( row * RGBStride );
        pGray += ( row * GrayStride );
    }
    for ( col = 0; col < width; col++ ) {
        pGray[col] = (BYTE)( pRGB[col].red   * 0.299 +
                             pRGB[col].green * 0.587 +
                             pRGB[col].blue  * 0.114 );
    }
    pGray += GrayStride;
    pRGB  += RGBStride;
}

Because each thread starts at an arbitrary row, the firstprivate clause gives every thread its own copy of doInit, pGray, and pRGB initialized from the master thread; the one-time init test then advances each thread's private pointers to its starting row, after which the per-row stride updates keep them correct.