Eliminating affinity tests and simplifying shared accesses in UPC Rahul Garg*, Kit Barton*, Calin...

Eliminating affinity tests and simplifying shared accesses in UPC. Rahul Garg*, Kit Barton*, Calin Cascaval**, Gheorghe Almasi**, Jose Nelson Amaral* (*University of Alberta, **IBM Research)


Eliminating affinity tests and simplifying shared accesses in UPC

Rahul Garg*, Kit Barton*, Calin Cascaval**, Gheorghe Almasi**, Jose Nelson Amaral*

*University of Alberta, **IBM Research


UPC: Unified Parallel C

Partitioned Global Address Space

[Figure: threads 0 1 2 3 4 5, THREADS = 6]


Shared arrays

Arrays can be shared between all threads
Example: shared [2] double A[9];
Assuming THREADS=3
1-d block-cyclic distribution: similar to HPF cyclic(k)

[Figure: array elements 0 1 2 3 4 5 6 7 8]
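The layout on this slide can be checked with a small sketch in plain C (not UPC). It assumes the usual block-cyclic rule, in which element i lives on thread (i / BF) % THREADS. The function names owner_1d and elements_on_thread are illustrative and do not come from the talk or from any UPC runtime.

```c
#include <assert.h>

/* Owner thread of element i in a 1-d block-cyclic layout:
   blocks of bf consecutive elements are dealt to threads round-robin.
   Plain C stand-in for upc_threadof(&A[i]); the name is hypothetical. */
int owner_1d(int i, int bf, int threads)
{
    assert(i >= 0 && bf > 0 && threads > 0);
    return (i / bf) % threads;
}

/* Number of elements of an n-element array that land on thread t. */
int elements_on_thread(int t, int n, int bf, int threads)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (owner_1d(i, bf, threads) == t)
            count++;
    return count;
}
```

For shared [2] double A[9] with THREADS=3, this rule assigns elements 0 to 8 to threads 0 0 1 1 2 2 0 0 1, so thread 0 holds four elements, thread 1 three, and thread 2 two.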


Vector addition example

#include <upc.h>
#include <stdio.h>

shared [2] double A[10];
shared [3] double B[10], C[10];

int main(){
    int i;
    upc_forall(i=0; i<10; i++; &C[i])
        C[i] = A[i] + B[i];
}


Outline of talk

upc_forall loops: syntax and uses
Compiling upc_forall loops
Data distributions in UPC
Multiblocking distributions
Privatization of access
Results


upc_forall and affinity tests

upc_forall is a work distribution construct
Form:
    shared [BF] double A[M];
    upc_forall(i=0; i<N; i++; &A[i]){
        //loop body
    }
The “affinity test” expression (&A[i] in this form) determines which thread executes which iteration.


Affinity test elimination : naive

Source loop:
    shared [BF] double A[M];
    upc_forall(i=0; i<M; i++; &A[i]){
        //loop body
    }

Naive translation:
    shared [BF] double A[M];
    for(i=0; i<M; i++){
        if(upc_threadof(&A[i])==MYTHREAD){
            //loop body
        }
    }


Affinity test elimination : optimized

Source loop:
    shared [BF] double A[M];
    upc_forall(i=0; i<M; i++; &A[i]){
        //loop body
    }

Optimized translation:
    shared [BF] double A[M];
    for(i=MYTHREAD*BF; i<M; i+=(BF*THREADS)){
        for(j=i; j<i+BF && j<M; j++){
            //loop body, with j as the iteration index
        }
    }
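As a sanity check on the transformation, the sketch below runs both loop shapes in plain C and compares the iterations each thread executes. It is not UPC: M, BF, THREADS and MYTHREAD become ordinary parameters, (i / bf) % threads stands in for upc_threadof(&A[i]), and the function names are hypothetical. The inner loop is clamped to m so that a final partial block is handled when M is not a multiple of BF.

```c
#include <assert.h>

/* Naive form: every thread scans all m iterations and keeps its own.
   (i / bf) % threads stands in for upc_threadof(&A[i]). */
unsigned long naive_trace(int me, int threads, int bf, int m)
{
    assert(threads > 0 && bf > 0 && me >= 0 && me < threads);
    unsigned long h = 0;
    for (int i = 0; i < m; i++)
        if ((i / bf) % threads == me)
            h = h * 31u + (unsigned long)i + 1u;   /* order-sensitive checksum */
    return h;
}

/* Strip-mined form: jump straight to the blocks owned by this thread. */
unsigned long stripmined_trace(int me, int threads, int bf, int m)
{
    assert(threads > 0 && bf > 0 && me >= 0 && me < threads);
    unsigned long h = 0;
    for (int i = me * bf; i < m; i += bf * threads)
        for (int j = i; j < i + bf && j < m; j++)
            h = h * 31u + (unsigned long)j + 1u;
    return h;
}

/* 1 if the checksums of both forms agree on every thread. */
int same_iterations(int threads, int bf, int m)
{
    for (int me = 0; me < threads; me++)
        if (naive_trace(me, threads, bf, m) != stripmined_trace(me, threads, bf, m))
            return 0;
    return 1;
}
```

The strip-mined form does no per-iteration ownership test at all, which is the point of the optimization: each thread touches only its own blocks instead of scanning all M iterations.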


Integer Affinity Tests

Source loop:
    upc_forall(i=0; i<M; i++; i){
        //loop body
    }

Translation:
    for(i=MYTHREAD; i<M; i+=THREADS){
        //loop body
    }


Data distributions for shared arrays

UPC official spec only supports 1-d block-cyclic distribution
IBM xlupc compiler supports a more general data distribution: 'multi-dimensional blocking'
Example: shared [2][3] double A[5][5];
Divide the array into multidimensional tiles
Distribute the tiles among processors in cyclic fashion
More general than UPC spec, but not as general as ScaLAPACK or HPF


Multidimensional Blocking

shared [2][2] double A[5][5];

[Figure: owning thread of each element of A]

0 0 1 1 2
0 0 1 1 2
3 3 0 0 1
3 3 0 0 1
2 2 3 3 0
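The picture is consistent with a simple rule: number the tiles in row-major order and deal them to threads round-robin. The sketch below encodes that rule in plain C. It is an assumption that happens to reproduce the figure with four threads, not a description of how xlupc computes ownership, and owner_2d is a hypothetical name.

```c
#include <assert.h>

/* Owner thread of A[i][j] for an array with `cols` columns, blocked into
   bf0 x bf1 tiles. Assumed rule: tiles are numbered row-major and dealt
   to threads round-robin. */
int owner_2d(int i, int j, int cols, int bf0, int bf1, int threads)
{
    assert(i >= 0 && j >= 0 && j < cols);
    assert(bf0 > 0 && bf1 > 0 && threads > 0);
    int tiles_per_row = (cols + bf1 - 1) / bf1;        /* ceiling division */
    int tile = (i / bf0) * tiles_per_row + (j / bf1);  /* row-major tile number */
    return tile % threads;
}
```

For shared [2][2] double A[5][5] with four threads, row 2 comes out as 3 3 0 0 1 and element A[4][4] lands on thread 0, as in the figure.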


Locality analysis and privatization

Consider:
    shared [2][3] double A[5][6], B[5][6];
    for(i=0; i<4; i++){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            A[i][j] = B[i+1][j];
        }
    }
What code should we generate for references A[i][j] and B[i+1][j]?


Shared access code generation

Source loop:
    for(i=0; i<4; i++){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            A[i][j] = B[i+1][j];
        }
    }

Generated code:
    for(i=0; i<4; i++){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            val = shared_deref(B,i+1,j);
            shared_assign(A,i,j,val);
        }
    }


Shared access code generation

Do we really need the function calls?
A[i][j] should only be a memory load/store?
What about B[i+1][j] on SMP? This should be just a load? On hybrids?

    for(i=0; i<4; i++){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            A[i][j] = B[i+1][j];
        }
    }


Locality Analysis

[Figure legend: area belonging to thread 0; area referenced by thread 0 for B[i+1][j]]

    for(i=0; i<4; i++)
        upc_forall(j=0; j<4; j++; &A[i][j])
            A[i][j] = B[i+1][j];


Locality Analysis : Intuition

The locality can only change if the index (i+1) crosses a block boundary in a given dimension
Block boundaries: 0, BF, 2*BF, ...
(i+1)%BF==0 gives a block boundary
So we only need to check whether (i+1)%BF==0 to find places where locality can change!

    for(i=0; i<4; i++){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            A[i][j] = B[i+1][j];
        }
    }


Locality Analysis

Define offset vector: [k1 k2]
k1=1, k2=0
k1 and k2 are integer constants
Cross block boundary at (i+k1)%BF==0
Cases: i%BF<(BF-k1%BF) and i%BF>=(BF-k1%BF)
i%BF<(BF-k1): we refer to it as a 'cut'

    for(i=0; i<4; i++){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            A[i][j] = B[i+1][j];
        }
    }
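The cut condition can be checked against the direct definition of "stays in the same block". The sketch below does this in plain C for one dimension, under the simplifying assumption 0 <= k < BF (the slide's k1%BF form is not modelled). The function names are hypothetical.

```c
#include <assert.h>

/* Cut test from the slide: with 0 <= k < bf, index i lies before the cut
   when adding k does not carry it out of its block. */
int before_cut(int i, int k, int bf)
{
    assert(i >= 0 && bf > 0 && k >= 0 && k < bf);
    return i % bf < bf - k;
}

/* Direct definition: do i and i + k fall in the same block? */
int same_block(int i, int k, int bf)
{
    assert(i >= 0 && bf > 0 && k >= 0 && k < bf);
    return i / bf == (i + k) / bf;
}

/* 1 if the two tests agree for every i in [0, n). */
int cut_matches_blocks(int n, int k, int bf)
{
    for (int i = 0; i < n; i++)
        if (before_cut(i, k, bf) != same_block(i, k, bf))
            return 0;
    return 1;
}
```

With BF=2 and k1=1, as in the running example, the test holds for i = 0 and i = 2 and fails for i = 1 and i = 3. Before the cut, B[i+1][j] sits in the same tile as A[i][j] and so has the same owner; past the cut the owner has to be assumed different.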


Shared access code generation

for(i=0; i<4; i++){
    if(i%2<1){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            val = memory_load(B,i+1,j);
            memory_store(A,i,j,val);
        }
    }else{
        upc_forall(j=0; j<4; j++; &A[i][j]){
            val = shared_deref(B,i+1,j);
            memory_store(A,i,j,val);
        }
    }
}
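To see that the memory_load branch is safe, the split loop nest can be walked in plain C while checking ownership on every privatized read. The sketch assumes the same layout for A and B (shared [2][3], 5 by 6) and a row-major, round-robin assignment of tiles to threads. That owner rule and the names tile_owner and privatized_reads are assumptions made for illustration.

```c
#include <assert.h>

/* shared [2][3] double A[5][6], B[5][6] */
enum { ROWS = 5, COLS = 6, BF0 = 2, BF1 = 3 };

/* Assumed owner rule: tiles numbered row-major and dealt round-robin. */
int tile_owner(int i, int j, int threads)
{
    assert(i >= 0 && i < ROWS && j >= 0 && j < COLS && threads > 0);
    int tiles_per_row = (COLS + BF1 - 1) / BF1;
    return ((i / BF0) * tiles_per_row + j / BF1) % threads;
}

/* Walk the split loop nest. Returns the number of B[i+1][j] reads that
   take the memory_load path, or -1 if one of them would touch data owned
   by a thread other than the one executing the iteration. */
int privatized_reads(int threads)
{
    int local = 0;
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            if (i % BF0 < BF0 - 1) {   /* before the cut: i % 2 < 1 */
                /* iteration (i, j) runs on the owner of A[i][j] */
                if (tile_owner(i + 1, j, threads) != tile_owner(i, j, threads))
                    return -1;
                local++;
            }
            /* past the cut the read stays a shared_deref call */
        }
    }
    return local;
}
```

Under these assumptions, 8 of the 16 reads of B are privatized regardless of the thread count, because rows i and i+1 share a tile whenever i is even.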


Locality analysis : algorithm

For each shared reference in loop:
    Check if blocking factor matches
    Check if distance vector is constant
If reference is eligible:
    Generate cut expressions
    Put cut in a sorted “cut list”
Replicate loop body as necessary
Insert memory load/store if local reference, otherwise insert RTS call
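One plausible reading of the cut-list step, sketched in plain C for a single dimension: each eligible reference with constant offset k contributes a cut at BF - k, and the sorted list of distinct cuts splits the range of i%BF into regions where every reference is either always local or always remote. This is an illustration under the assumption 0 <= k < BF, not the compiler's actual data structure, and both function names are hypothetical.

```c
#include <assert.h>

/* Insert `cut` into the ascending array cuts[0..n-1], skipping duplicates.
   Returns the new length. */
int insert_cut(int *cuts, int n, int cut)
{
    int pos = 0;
    while (pos < n && cuts[pos] < cut)
        pos++;
    if (pos < n && cuts[pos] == cut)
        return n;                      /* already present */
    for (int k = n; k > pos; k--)      /* shift the tail right by one */
        cuts[k] = cuts[k - 1];
    cuts[pos] = cut;
    return n + 1;
}

/* Build the sorted cut list for references with constant offsets
   offsets[0..nrefs-1] in a dimension blocked by bf.
   `cuts` must have room for nrefs entries. Returns the number of cuts. */
int build_cut_list(const int *offsets, int nrefs, int bf, int *cuts)
{
    assert(bf > 0 && nrefs >= 0);
    int n = 0;
    for (int r = 0; r < nrefs; r++) {
        assert(offsets[r] >= 0 && offsets[r] < bf);
        if (offsets[r] != 0)           /* offset 0 never leaves the block */
            n = insert_cut(cuts, n, bf - offsets[r]);
    }
    return n;
}
```

For BF=4 and offsets 1, 3, 1, 0 the list is {1, 3}, so the loop body would be replicated for i%4 in [0,1), [1,3) and [3,4). For the running example (BF=2, offset 1) the only cut is 1, matching the test i%2<1.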


Improvements of locality analysis in isolation

[Bar chart: % improvement, computed as 100*(base-opt)/opt (y-axis 0 to 500), on 1, 2, and 3 nodes; series: 1, 2, 4, and 8 threads/node]


Improvements of affinity test elimination in isolation

[Bar chart: percentage improvements (y-axis 0 to 300), on 1, 2, and 3 nodes; series: 1, 2, 4, and 8 threads/node]


Results : Vector addition

[Bar chart: percentage improvements in runtime (y-axis 0 to 10000), on 1, 2, and 3 nodes; series: 1, 2, 4, and 8 threads/node]


Matrix-vector multiplication

[Bar chart: percentage improvements in runtime (y-axis 0 to 3000), on 1, 2, and 3 nodes; series: 1, 2, 4, and 8 threads/node]


Matrix-vector scalability

[Bar chart: speedup over C (y-axis 0 to 2.5), for 1, 2, 4, and 8 threads/node; series: 1, 2, and 3 nodes]


Conclusions

UPC requires extensive compiler support
upc_forall is a challenging construct to compile efficiently
Shared access implementation requires compiler support
Optimizations working together produce good results
Compiler optimizations can produce >80x speedup over unoptimized code
If one optimization fails, then results can still be bad