
Empirical (so far) Understanding of Communication Optimizations for GAS Languages

Costin Iancu, LBNL


$10,000 Questions

• Can GAS languages do better than message passing?

• Claim: maybe, if programs are optimized simultaneously for both serial and parallel performance.

• If not, is there any advantage?

• Claim: flexibility in choosing the best implementation strategy.


Motivation

• Parallel programming follows a tuning cycle: tune parallel, tune serial, repeat.

• Serial and parallel optimizations are treated as disjoint spaces.

• Previous experience with GAS languages showed performance comparable to hand-tuned MPI codes.


Optimizations/Previous Work

• Traditionally, parallel programming has been done in terms of two-sided communication.

• Previous work on parallelizing compilers and communication optimizations reasoned mostly in terms of two-sided communication.

• Focus on domain decomposition, lowering synchronization costs, or finding the best schedule.

• GAS languages are based on one-sided communication. Domain decomposition is done by the programmer; optimizations are done by the compiler.


Optimization Spaces

• Serial optimizations -> interested mostly in loop optimizations (CACHE):
  - Unrolling
  - Software pipelining
  - Tiling

• Parallel optimizations (NETWORK):
  - Communication scheduling (comm/comm overlap, comm/comp overlap)
  - Message vectorization
  - Message coalescing and aggregation
  - Inspector-executor


Parameters

• Architectural:
  - Processor -> cache
  - Network -> L, o, g, G, contention (LogPC)

• Software interface: blocking/non-blocking primitives, explicit/implicit synchronization, scatter/gather...

• Application characteristics: memory and network footprint
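For orientation, a rough sketch of how the network parameters compose, in the spirit of the LogGP family of models (my reading, not necessarily the exact cost model used in this talk):

  % Approximate cost of one n-byte message: send overhead, latency,
  % per-byte gap, receive overhead (contention adds on top in LogPC).
  T(n) \approx o_s + L + (n-1)G + o_r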


Modern Systems

• Large memory-processor distance: 2-10/20 cycles cache miss latency

• High bandwidth networks: 200 MB/s-500 MB/s => it can be cheaper to bring a byte over the network than to service a cache miss

• Natural question: by combining serial and parallel optimizations, can one trade cache misses for network bandwidth and/or overhead?
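A back-of-the-envelope illustration (assumed numbers, not measurements from this talk): at 500 MB/s, streaming one byte costs about 2 ns of bandwidth, while a miss that stalls a 1 GHz processor for 100 cycles costs about 100 ns, so bulk network bytes can be far cheaper than per-element misses:

  t_byte = 1 / (500 MB/s) = 2 ns        (illustrative)
  t_miss = 100 cycles / 1 GHz = 100 ns  (illustrative)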


Goals

Given a UPC program and the optimization space parameters, choose the combination of parameters that minimizes the total running time.
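Written as an optimization problem (my formalization of the slide; the symbols for schedule s, unroll depth U, and tile size B are illustrative):

  % Hypothetical formalization of the goal above.
  (s^*, U^*, B^*) = \arg\min_{s,\,U,\,B} \; T_{total}(\text{program};\, s, U, B)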


(What am I really talking about) LOOPS

for (i = 0; i < N; i++)
    dest[g(i)] = f(src[h(i)]);

• g(i), h(i) indirect access -> unlikely to vectorize; either fine grained communication or inspector-executor

• g(i) direct access -> can be vectorized:

get_bulk(local_src, src);
for (...)
    local_dest[g[i]] = local_src[g[i]];
put_bulk(dest, local_dest);
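A concrete UPC sketch of the vectorized form (my illustration: the slide's get_bulk/put_bulk mapped onto the standard upc_memget/upc_memput library calls; the array names, sizes, and the contiguous-remote-data assumption are mine):

#include <upc.h>                    /* upc_memget, upc_memput */

#define N 1024

shared [] double *src;              /* assumed: contiguous data on a remote thread */
shared [] double *dest;

void vectorized_loop(const int *g) {
    static double local_src[N], local_dest[N];

    /* one bulk get instead of N fine-grained remote reads */
    upc_memget(local_src, src, N * sizeof(double));

    for (int i = 0; i < N; i++)
        local_dest[g[i]] = local_src[g[i]];   /* all-local computation */

    /* one bulk put instead of N fine-grained remote writes */
    upc_memput(dest, local_dest, N * sizeof(double));
}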


Fine Grained Loops

• Fine grained loops - unrolling, software pipelining and communication scheduling

for (...) {
    init 1; sync 1;
    compute 1;
    write back 1;
    init 2; sync 2;
    compute 2;
    write back 2;
    ...
}


Fine Grained Loops

for (...) {
    init 1; sync 1;
    compute 1;
    write 1;
    init 2; sync 2;
    compute 2;
    write 2;
    ...
}
(base)

for (...) {
    init 1;
    init 2;
    init 3;
    ...
    sync_all;
    compute all;
}

for (...) {
    init 1;
    init 2;
    sync 1;
    compute 1;
    ...
}

• Problem to solve: find the schedule of operations and the unrolling depth that minimize the total running time. A sketch of the batched-init schedule follows.
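A minimal UPC sketch of the second schedule above (issue all inits, then sync and compute), assuming the UPC 1.3 non-blocking transfer extensions in <upc_nb.h>; the unroll depth U, array names, and placeholder computation are illustrative:

#include <stddef.h>
#include <upc.h>
#include <upc_nb.h>                 /* assumed: UPC 1.3 non-blocking extensions */

#define U 8                         /* unroll depth = number of in-flight gets */

shared [] double *remote;           /* assumed: contiguous remote data */

void pipelined_reads(double *out, size_t n) {   /* assumes U divides n */
    double buf[U];
    upc_handle_t h[U];

    for (size_t i = 0; i < n; i += U) {
        for (int k = 0; k < U; k++)             /* init 1..U */
            h[k] = upc_memget_nb(&buf[k], &remote[i + k], sizeof(double));
        for (int k = 0; k < U; k++)             /* sync_all */
            upc_sync(h[k]);
        for (int k = 0; k < U; k++)             /* compute all */
            out[i + k] = 2.0 * buf[k];          /* placeholder compute */
    }
}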


Coarse Grained Loops

get_bulk(local_src, src);
for (...) {
    local_dest[g[i]] = local_src[g[i]];
}
put_bulk(dest, local_dest);
(base)

for (...) {
    get B1;
    get B2;
    sync B1;
    compute B1;
    sync B2;
    compute B2;
    ...
}
(reg)

get B1;
for (...) {
    sync Bi;
    get Bi+1;
    compute Bi;
    sync Bi+1;
    compute Bi+1;
    ...
}
(ovlp)

• Coarse grained loops - unrolling, software pipelining and communication scheduling + “blocking/tiling”
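A hedged UPC rendering of the (ovlp) schedule with double buffering (again assuming the <upc_nb.h> non-blocking extensions; the block size B, names, and placeholder compute are mine):

#include <stddef.h>
#include <upc.h>
#include <upc_nb.h>                 /* assumed: UPC 1.3 non-blocking extensions */

#define B 4096                      /* tile ("block") size in elements */

shared [] double *src;              /* assumed: contiguous remote source */

void ovlp_loop(double *out, int nblocks) {
    static double buf[2][B];        /* double buffer: compute one, fill one */
    upc_handle_t h[2];

    h[0] = upc_memget_nb(buf[0], src, B * sizeof(double));   /* get B1 */
    for (int i = 0; i < nblocks; i++) {
        int cur = i & 1, nxt = cur ^ 1;

        upc_sync(h[cur]);                                    /* sync Bi */
        if (i + 1 < nblocks)                                 /* get Bi+1 */
            h[nxt] = upc_memget_nb(buf[nxt], &src[(size_t)(i + 1) * B],
                                   B * sizeof(double));
        for (int j = 0; j < B; j++)                          /* compute Bi */
            out[(size_t)i * B + j] = 2.0 * buf[cur][j];      /* placeholder */
    }
}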


Coarse Grained Loops

• Coarse grained loops can be "tiled": add the tile size as a parameter to the optimization problem.

• Problem to solve: find the schedule of operations, unrolling depth, and "tile" size that minimize the total running time.

• Questions:
  - Is the tile size constant?
  - Is the tile size a function of the cache size and/or the network parameters?


How to Evaluate?

• Synthetic benchmarks: fine grained messages and large messages.

• The distribution of the access stream varies: uniform, clustered, and hotspot => UPC datatypes.

• Variable computation per message size: N, k*N, N².

• Variable memory access pattern: strided and linear.


Evaluation Methodology

• Alpha/Quadrics cluster

• x86/Myrinet cluster

• All programs compiled at the highest optimization level with aggressive inlining.

• 10 runs; averages reported.


Fine Grained Communication

(The base and pipelined loop schedules from the "Fine Grained Loops" slides apply here.)

• Interested in the benefits of communication/communication overlap.

[Figure: Read Pipelining and Write Pipelining (uniform distribution): time (sec) vs. percentage of remote accesses; curves for base and unroll factors 2-64. x86/Myrinet (os > g).]

• comm/comm overlap is beneficial

• loop unrolling helps; best factor 32 < U < 64

[Figure: Write Pipelining and Read Pipelining (clustered distribution): time (sec) vs. cluster length; curves for unroll factors 2-64. x86/Myrinet (os > g).]


[Figure: Read Pipelining vs. Write Pipelining (hotspot): time (sec) vs. unroll length 2-64; read-pipe vs. write-pipe. Myrinet.]

Myrinet: communication/communication overlap works; use non-blocking primitives for fine grained messages. There is a limit on the number of outstanding messages (32 < L < 64).
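A hedged sketch of respecting such a limit: issue non-blocking puts into a fixed window of outstanding handles and retire the oldest before issuing more (again assuming <upc_nb.h>; the window size mirrors the observed 32-64 knee, and the names are illustrative):

#include <stddef.h>
#include <upc.h>
#include <upc_nb.h>                 /* assumed: UPC 1.3 non-blocking extensions */

#define WINDOW 32                   /* cap on in-flight messages (observed knee: 32-64) */

shared [] double *dst;              /* assumed: contiguous remote destination */

void windowed_puts(const double *vals, size_t n) {
    upc_handle_t h[WINDOW];

    for (size_t i = 0; i < n; i++) {
        if (i >= WINDOW)
            upc_sync(h[i % WINDOW]);            /* retire the oldest put first */
        h[i % WINDOW] = upc_memput_nb(&dst[i], &vals[i], sizeof(double));
    }
    /* drain whatever is still in flight */
    for (size_t k = 0; k < (n < WINDOW ? n : WINDOW); k++)
        upc_sync(h[k]);
}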

[Figure: Read Pipelining and Write Pipelining (uniform distribution): time (sec) vs. percentage of remote accesses; curves for base and unroll factors u2-u64. Alpha/Quadrics (g > os).]

[Figure: Write Pipelining and Read Pipelining (clustered sequence): time (sec) vs. cluster length; curves for unroll factors 2-64. Alpha/Quadrics.]


[Figure: Read vs. Write Pipelining (hotspot): time (sec) vs. number of pipelined operations 2-64; read-pipe vs. write-pipe. Alpha/Quadrics.]

On Quadrics, for fine grained messages where the amount of computation available for overlap is small, use blocking primitives.


Coarse Grained Communication


Benchmark

• Fixed amount of computation.

• Vary the message size.

• Vary the loop unrolling depth.

(Uses the same (base), (reg), and (ovlp) loop variants shown on the "Coarse Grained Loops" slide.)

[Figure: Uniform distribution, linear computation (1.0): time (sec) vs. message size (2^x bytes); curves for base, ovlp-2..64, and reg-2..64. Alpha/Quadrics.]

Software pipelining with staggered gets is slower.

[Figures: Cluster 2 and Cluster 16, linear computation (1.0): time (sec) vs. message size (2^x bytes); curves for base, ovlp-2..64, and reg-2..64. Alpha/Quadrics.]

• Both optimizations help.

• Again, there is a knee around tile x unroll = cache_size.

• For the blocking case, is the optimal value a function of contention or of some other factor (packet size, TLB size)?

[Figure: Hotspot, linear computation (1.0): time (sec) vs. message size (2^x bytes); curves for base, ovlp-2..64, and reg-2..64. Alpha/Quadrics.]

Staggered gets perform better than back-to-back gets, a result of contention.


Conclusion

• A unified optimization model (serial + parallel) is likely to improve performance over separate optimization stages.

• Fine grained messages:
  - os > g -> comm/comm overlap helps
  - g > os -> comm/comm overlap might not be worthwhile

• Coarse grained messages:
  - Blocking improves the total running time by offering better opportunities for comm/comp overlap and by reducing pressure.
  - "Software pipelining" + loop unrolling is usually better than unrolling alone.


Future Work

• Worth further investigation: trading bandwidth for cache performance (region based allocators, inspector-executor, scatter/gather).

• Message aggregation/coalescing?


Other Questions

• Fact: cache miss time is of the same order of magnitude as G.
  Question: can one somehow trade cache misses for bandwidth? (scatter/gather, inspector-executor)

• Fact: program analysis is often overly conservative.
  Question: given some communication/computation overlap, how much bandwidth can be wasted without a noticeable effect on the total running time? (prefetch and region based allocators)


[Figure: Effect of contention (cluster length): time (sec) vs. message size (2^x bytes); curves for base, ovlp, and reg at cluster lengths 2-16.]