
1. Challenges in Threading a Loop:

Threading a loop means converting independent loop iterations to threads and running these threads in parallel. In some sense, this is a re-ordering transformation in which the original order of loop iterations is converted into an undetermined order. Doing so raises the following challenges:

- Loop-carried dependence
- Data race conditions
- Managing shared and private data
- Loop scheduling and partitioning
- Effective use of reduction

Loop-carried Dependence:

The theory of data dependence imposes two requirements that must be met for a statement S2 to be data dependent on statement S1:

- There must exist a possible execution path such that statements S1 and S2 both reference the same memory location L.
- The execution of S1 that references L occurs before the execution of S2 that references L.

In order for S2 to depend upon S1, it is necessary for some execution of S1 to write to a memory location L that is later read by an execution of S2. This is also called flow dependence. Other dependencies exist when two statements write the same memory location L, called an output dependence, or when a read occurs before a write, called an anti-dependence.
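
As a brief illustration of the three kinds of dependence (the statements here are hypothetical, not from the original example):

// Flow dependence: S2 reads the location written by S1.
a = b + 1; // S1 writes a
c = a * 2; // S2 reads a

// Output dependence: S1 and S2 write the same location.
a = b + 1; // S1 writes a
a = c * 2; // S2 also writes a

// Anti-dependence: a read occurs before a write.
c = a * 2; // S1 reads a
a = b + 1; // S2 writes a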

This pattern can occur in one of two ways:

- S1 can reference the memory location L on one iteration of a loop; on a subsequent iteration, S2 can reference the same memory location L.
- S1 and S2 can reference the same memory location L on the same loop iteration, but with S1 preceding S2 during execution of the loop iteration.

The first case is an example of loop-carried dependence, since the dependence exists when the loop is iterated. The second case is an example of loop-independent dependence; the dependence exists because of the position of the code within the loops. A short illustration of both cases follows.
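
A brief illustration of the two cases (the arrays and variables are hypothetical):

// Loop-carried dependence: iteration k reads the value written
// by the previous iteration, so iterations are not independent.
for ( k = 1; k < 100; k++ )
    a[k] = a[k-1] + 1;

// Loop-independent dependence: S1 and S2 reference the same
// location t within one iteration; iterations stay independent.
for ( k = 0; k < 100; k++ ) {
    t = b[k] + 1;  // S1 writes t
    c[k] = t * 2;  // S2 reads t
}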


Example:

// This loop will fail when threaded, due to loop-carried
// dependencies.
x[0] = 0;
y[0] = 1;
#pragma omp parallel for private(k)
for ( k = 1; k < 100; k++ ) {
    x[k] = y[k-1] + 1; // S1
    y[k] = x[k-1] + 2; // S2
}

The only way to fix this kind of problem is to rewrite the loop or to pick a different algorithm that does not contain the loop-carried dependence. With this example, you can first predetermine the initial values of x[49] and y[49]; then, you can apply the loop strip-mining technique to create a loop-carried-dependence-free loop m. By applying this transformation, the original loop can be executed by two threads on a dual-core processor system.

// Effective threading of the loop using strip-mining
// transformation.
x[0] = 0;
y[0] = 1;
x[49] = 74; // derived from the equation x(k) = x(k-2) + 3
y[49] = 74; // derived from the equation y(k) = y(k-2) + 3
#pragma omp parallel for private(m, k)
for ( m = 0; m < 2; m++ ) {
    // Each strip m covers half of the iteration space:
    // m = 0 handles k = 1..49, m = 1 handles k = 50..99.
    for ( k = m*49 + 1; k < m*50 + 50; k++ ) {
        x[k] = y[k-1] + 1; // S1
        y[k] = x[k-1] + 2; // S2
    }
}


// Effective threading of a loop using parallel sections
#pragma omp parallel sections private(k)
{
    #pragma omp section
    {
        x[0] = 0; y[0] = 1;
        for ( k = 1; k < 49; k++ ) {
            x[k] = y[k-1] + 1; // S1
            y[k] = x[k-1] + 2; // S2
        }
    }
    #pragma omp section
    {
        x[49] = 74; y[49] = 74;
        for ( k = 50; k < 100; k++ ) {
            x[k] = y[k-1] + 1; // S3
            y[k] = x[k-1] + 2; // S4
        }
    }
}

Data Race Conditions:

The compiler honors OpenMP pragmas or directives when it encounters them during the compilation phase; however, it does not detect data-race conditions. Thus, a loop similar to the following example, in which multiple threads update the variable x, will lead to undesirable results. In such a situation, the code needs to be modified via privatization or synchronized using mechanisms like mutexes. For example, you can simply add the private(x) clause to the parallel for pragma to eliminate the data-race condition on variable x for this loop.

// A data race condition exists for variable x;
// you can eliminate it by adding the private(x) clause.
#pragma omp parallel for
for ( k = 0; k < 80; k++ )
{
    x = sin(k*2.0)*100 + 1;
    if ( x > 60 ) x = x % 60 + 1;
    printf ( "x %d = %d\n", k, x );
}
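
For reference, a minimal sketch of the corrected loop described above, assuming x is declared as an int and that <math.h> and <stdio.h> are included:

// The data race is gone: each thread works on its own copy of x.
#pragma omp parallel for private(x)
for ( k = 0; k < 80; k++ )
{
    x = sin(k*2.0)*100 + 1;
    if ( x > 60 ) x = x % 60 + 1;
    printf ( "x %d = %d\n", k, x );
}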

With the ease of programming offered by OpenMP, it is easier to overlook data-race conditions. One tool that helps identify such situations is Intel Thread Checker, which is an add-on to the Intel VTune Performance Analyzer.

Managing shared and private data:

In writing multithreaded programs, understanding which data is shared and which is private matters not only for performance, but also for program correctness.

OpenMP makes this distinction apparent to the programmer through a set of clauses such as shared, private, and default.

With OpenMP, it is the developer's responsibility to indicate to the compiler which pieces of memory should be shared among the threads and which pieces should be kept private.
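
As an illustration, a minimal sketch of these clauses on a loop (the names a, n, and t are hypothetical; default(none) forces every variable to be listed explicitly):

// The loop index k is predetermined private and need not be
// listed; a and n are shared, t is per-thread scratch space.
#pragma omp parallel for default(none) shared(a, n) private(t)
for ( k = 0; k < n; k++ )
{
    t = a[k] * 2;
    a[k] = t + 1; // safe: each iteration touches a distinct element
}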


When memory is identified as shared, all threads access the exact same memory location. When memory is identified as private, however, a separate copy of the variable is made for each thread.

By default, all the variables in a parallel region are shared, with three exceptions:

- In parallel for loops, the loop index is private (in the next example, the k variable is private).
- Variables that are local to the block of the parallel region are private.
- Any variables listed in the private, firstprivate, lastprivate, or reduction clauses are private.

Private variables are initialized with the default value, using the default constructor where applicable. In OpenMP, memory can be declared as private in the following three ways:

- Use the private, firstprivate, lastprivate, or reduction clause to specify variables that need to be private for each thread.
- Use the threadprivate pragma to specify global variables that need to be private for each thread (see the sketch after this list).
- Declare the variable inside the loop (really, inside the OpenMP parallel region) without the static keyword. Because static variables are statically allocated in a designated memory area by the compiler and linker, they are not truly private.
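
The threadprivate pragma is not demonstrated elsewhere in this text, so here is a minimal sketch under assumed names (the global variable counter and the surrounding program are hypothetical):

#include <stdio.h>
#include <omp.h>

int counter = 0;                   // hypothetical global variable
#pragma omp threadprivate(counter) // each thread gets its own copy

int main(void)
{
    #pragma omp parallel
    {
        // Each thread updates and prints only its private copy.
        counter = omp_get_thread_num() + 1;
        printf("thread %d: counter = %d\n",
               omp_get_thread_num(), counter);
    }
    return 0;
}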

Example:

In the following loop, the variable x is shared among all the threads, so concurrent updates to it from different iterations constitute a data race:

#pragma omp parallel for
for ( k = 0; k < 100; k++ ) {
    x = array[k];
    array[k] = do_work(x);
}

This problem can be fixed in either of the following two ways, both of which declare the variable x as private memory.

// This works. The variable x is specified as private.
#pragma omp parallel for private(x)
for ( k = 0; k < 100; k++ )
{
    x = array[k];
    array[k] = do_work(x);
}

// This also works. The variable x is now private.
#pragma omp parallel for
for ( k = 0; k < 100; k++ )
{
    int x; // variables declared within a parallel
           // construct are, by definition, private
    x = array[k];
    array[k] = do_work(x);
}
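
Relatedly, the firstprivate and lastprivate clauses listed above combine privacy with initialization and finalization. A minimal sketch, with hypothetical variable names and the array from the example above:

int bias = 10, last_val = 0;
// firstprivate copies the value of bias into each thread's
// private copy; lastprivate copies last_val back out from the
// sequentially last iteration (k == 99).
#pragma omp parallel for firstprivate(bias) lastprivate(last_val)
for ( k = 0; k < 100; k++ )
{
    last_val = array[k] + bias;
}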


Loop Scheduling and Partitioning:

To have good load balancing and thereby achieve optimal performance in a multithreaded application, you must have effective loop scheduling and partitioning. The ultimate goal is to ensure that the execution cores are busy most, if not all, of the time, with minimum overhead from scheduling, context switching, and synchronization.

OpenMP offers four scheduling schemes, selected with the schedule clause as sketched below:

- Static
- Dynamic
- Guided
- Runtime
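
For illustration, a minimal sketch of the schedule clause (the chunk size of 4 is arbitrary, and array and do_work are reused from the earlier examples):

// Iterations are handed out in chunks of 4 as threads become
// free; replace "dynamic" with static, guided, or runtime to
// select the other schemes.
#pragma omp parallel for schedule(dynamic, 4)
for ( k = 0; k < 100; k++ )
{
    array[k] = do_work(array[k]);
}

With schedule(runtime), the choice is deferred until run time and taken from the OMP_SCHEDULE environment variable.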

Effective Use of Reduction:

Loops that reduce a collection of values to a single value are fairly common. Consider the following simple loop, which calculates the sum of the return values of the integer-type function call func(k), with the loop index value as input data.

sum = 0;
for ( k = 0; k < 100; k++ )
{
    sum = sum + func(k); // func has no side-effects
}


Instead of providing synchronization, use reduction:

sum = 0;
#pragma omp parallel for reduction(+:sum)
for ( k = 0; k < 100; k++ )
{
    sum = sum + func(k);
}

Given the reduction clause, the compiler creates private copies of the variable sum for each thread, and when the loop completes, it adds the values together and places the result in the original variable.

For each variable specified in a reduction clause, a private copy is created, one for each thread, as if the private clause were used. The private copy is then initialized to the initialization value for the operator. At the end of the region or loop for which the reduction clause was specified, the original reduction variable is updated by combining its original value with the final value from each thread.
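
Conceptually, the reduction above behaves like the following hand-written equivalent; this is a sketch of the semantics only, not the code the compiler actually generates:

sum = 0;
#pragma omp parallel
{
    int local_sum = 0; // private copy, initialized to 0, the
                       // identity value for the + operator
    #pragma omp for
    for ( k = 0; k < 100; k++ )
        local_sum = local_sum + func(k);
    #pragma omp critical
    sum = sum + local_sum; // combine with the original variable
}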

While identifying opportunities to use the reduction clause for threading, you should keep the following three points in mind.

The value of the original reduction variable becomes undefined when the first thread reaches the region or loop that specifies the reduction clause, and it remains undefined until the reduction computation is completed.

If the reduction clause is used on a loop to which the nowait clause is also applied, the value of the original reduction variable remains undefined until a barrier synchronization is performed to ensure that all threads have completed the reduction.

The order in which the values are combined is unspecified. Therefore, comparing results between sequential and parallel runs, or even between two parallel runs, does not guarantee that bit-identical results will be obtained or that side effects, such as floating-point exceptions, will be identical.