
CSC 7600 Lecture 17: Applied Parallel Algorithms 3, Spring 2011

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

APPLIED PARALLEL ALGORITHMS 3

Prof. Thomas Sterling
Dr. Hartmut Kaiser
Center for Computation and Technology
Louisiana State University
March 22, 2011


Topics

• LU Decomposition
• N-Body Problem
• Parallel Sorting


Puzzle of the Day

The expected output of the following C program is to print the elements in the array. But when actually run, it doesn't do so. Why?

    #include <stdio.h>

    int array[] = {23, 34, 12, 17, 204, 99, 16};

    #define TOTAL_ELEMENTS (sizeof(array) / sizeof(array[0]))

    int main()
    {
        int d;
        for (d = -1; d <= (TOTAL_ELEMENTS-2); ++d)
            printf("%d\n", array[d+1]);
        return 0;
    }




LU Factorization

• Gaussian Elimination is simple, but
  – What if we have to solve many Ax = b systems for different values of b?
  – This happens a LOT in real applications
• Another method is the "LU Factorization" (LU Decomposition)
• Ax = b
• Say we could rewrite A = L U, where L is a lower triangular matrix and U is an upper triangular matrix: O(n^3)
• Then Ax = b is written L U x = b
• Solve L y = b: O(n^2)
• Solve U x = y: O(n^2)

[Figure: a lower triangular system, in which equation i has i unknowns, and an upper triangular system, in which equation n-i has i unknowns.]

Triangular system solves are easy.

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
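The two O(n^2) triangular solves can be sketched in C as follows. This is a minimal illustration, not code from the lecture: the function names `forward_subst` and `back_subst` and the row-major storage are assumptions, and no check is made for zero diagonal entries.

```c
#include <stddef.h>

/* Forward substitution: solve L y = b for a lower triangular L
   (n x n, row-major). Row i only involves y[0..i]. */
void forward_subst(size_t n, const double *L, const double *b, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double s = b[i];
        for (size_t j = 0; j < i; j++)
            s -= L[i * n + j] * y[j];
        y[i] = s / L[i * n + i];
    }
}

/* Back substitution: solve U x = y for an upper triangular U
   (n x n, row-major). Rows are processed from the last to the first. */
void back_subst(size_t n, const double *U, const double *y, double *x)
{
    for (size_t i = n; i-- > 0; ) {
        double s = y[i];
        for (size_t j = i + 1; j < n; j++)
            s -= U[i * n + j] * x[j];
        x[i] = s / U[i * n + i];
    }
}
```

Once A = L U is known, each new right-hand side b costs only one call to each routine, which is why the O(n^3) factorization pays off when many systems share the same A.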



LU Factorization: Principle

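• It works just like Gaussian Elimination, but instead of zeroing out elements, one "saves" the scaling coefficients.
• Magically, A = L × U!
• Should be done with pivoting as well

Worked example. Each step performs one Gaussian elimination and saves the scaling factor in place of the eliminated entry:

\[
\begin{pmatrix} 1 & 2 & -1 \\ 4 & 3 & 1 \\ 2 & 2 & 3 \end{pmatrix}
\rightarrow
\begin{pmatrix} 1 & 2 & -1 \\ 4 & -5 & 5 \\ 2 & 2 & 3 \end{pmatrix}
\rightarrow
\begin{pmatrix} 1 & 2 & -1 \\ 4 & -5 & 5 \\ 2 & -2 & 5 \end{pmatrix}
\rightarrow
\begin{pmatrix} 1 & 2 & -1 \\ 4 & -5 & 5 \\ 2 & 2/5 & 3 \end{pmatrix}
\]

\[
L = \begin{pmatrix} 1 & 0 & 0 \\ 4 & 1 & 0 \\ 2 & 2/5 & 1 \end{pmatrix},
\qquad
U = \begin{pmatrix} 1 & 2 & -1 \\ 0 & -5 & 5 \\ 0 & 0 & 3 \end{pmatrix}
\]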


LU Factorization

• We're going to look at the simplest possible version
  – No pivoting: it just creates a bunch of indirections that are easy but make the code look complicated without changing the overall principle

[Figure: at step k, column k below the diagonal stores the scaling factors; the entries a[i,j] with i > k and j > k are updated.]

LU-sequential(A,n) {
  for k = 0 to n-2 {
    // preparing column k
    for i = k+1 to n-1
      a[i,k] ← -a[i,k] / a[k,k]
    for j = k+1 to n-1
      // Task Tkj: update of column j
      for i = k+1 to n-1
        a[i,j] ← a[i,j] + a[i,k] * a[k,j]
  }
}
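A direct C transcription of the LU-sequential pseudocode might look like the sketch below. It assumes a row-major n x n array and, like the slide, does no pivoting, so a zero a[k,k] is not handled. Note that the pseudocode stores the negated scaling factors below the diagonal, so that the update step is an addition.

```c
#include <stddef.h>

/* In-place LU factorization without pivoting (row-major, n x n).
   After the call, the upper triangle holds U and the strict lower
   triangle holds the NEGATED multipliers, as in LU-sequential. */
void lu_sequential(size_t n, double *a)
{
    for (size_t k = 0; k + 1 < n; k++) {
        /* preparing column k */
        for (size_t i = k + 1; i < n; i++)
            a[i * n + k] = -a[i * n + k] / a[k * n + k];
        /* task T(k,j): update of column j */
        for (size_t j = k + 1; j < n; j++)
            for (size_t i = k + 1; i < n; i++)
                a[i * n + j] += a[i * n + k] * a[k * n + j];
    }
}
```

For the 3 x 3 matrix with rows (1, 2, -1), (4, 3, 1), (2, 2, 3), this leaves U = (1, 2, -1), (0, -5, 5), (0, 0, 3) in the upper triangle and -4, -2, -2/5 below the diagonal.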



Parallel LU on a ring

• Since the algorithm operates by columns from left to right, we should distribute columns to processors
• Principle of the algorithm
  – At each step, the processor that owns column k does the "prepare" task and then broadcasts the bottom part of column k to all others
    • Annoying if the matrix is stored in row-major fashion
    • Remember that one is free to store the matrix in any way one wants, as long as it's coherent and the right output is generated
  – After the broadcast, the other processors can then update their data.
• Assume there is a function alloc(k) that returns the rank of the processor that owns column k
  – Basically so that we don't clutter our program with too many global-to-local index translations
• In fact, we will first write everything in terms of global indices, so as to avoid all annoying index arithmetic


LU-broadcast algorithm

LU-broadcast(A,n) {
  q ← MY_NUM()
  p ← NUM_PROCS()
  for k = 0 to n-2 {
    if (alloc(k) == q)
      // preparing column k
      for i = k+1 to n-1
        buffer[i-k-1] ← a[i,k] ← -a[i,k] / a[k,k]
    broadcast(alloc(k),buffer,n-k-1)
    for j = k+1 to n-1
      if (alloc(j) == q)
        // update of column j
        for i = k+1 to n-1
          a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
  }
}


Dealing with local indices

• Assume that p divides n
• Each processor needs to store r = n/p columns, and its local indices go from 0 to r-1
• After step k, only columns with indices greater than k will be used
• Simple idea: use a local index, l, that everyone initializes to 0
• At step k, processor alloc(k) increases its local index so that next time it will point to its next local column


LU-broadcast algorithm

...
double a[n][r];   // rows 0..n-1, local columns 0..r-1

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (alloc(k) == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
  }
  broadcast(alloc(k),buffer,n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08


Bad load balancing

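[Figure: columns distributed in contiguous blocks over P1, P2, P3, P4. The leftmost blocks are already done, one processor is working on the current block, and the processors that are done sit idle.]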


Good Load Balancing?

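[Figure: cyclic distribution of the columns. The already-done columns are spread over all processors, so every processor still owns columns that are being worked on.]

Cyclic distribution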
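The difference between the two distributions can be made concrete by counting, for each processor, how many of its columns are still active after step k. The sketch below is illustrative only: `alloc_block`, `alloc_cyclic` and `active_columns` are hypothetical helper names, and p is assumed to divide n.

```c
/* Owner of global column k when each processor holds a contiguous
   block of r = n/p columns (p is assumed to divide n). */
int alloc_block(int k, int n, int p) { return k / (n / p); }

/* Owner of global column k under a cyclic distribution. */
int alloc_cyclic(int k, int p) { return k % p; }

/* Number of columns with global index > k (the ones still updated
   after step k) that belong to processor q. */
int active_columns(int q, int k, int n, int p, int cyclic)
{
    int count = 0;
    for (int j = k + 1; j < n; j++) {
        int owner = cyclic ? alloc_cyclic(j, p) : alloc_block(j, n, p);
        if (owner == q)
            count++;
    }
    return count;
}
```

With n = 8, p = 4 and k = 3, the block distribution leaves processors 0 and 1 with no active columns and processors 2 and 3 with two each, while the cyclic distribution leaves every processor with exactly one.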




Load-balanced program

...
double a[n][r];   // rows 0..n-1, local columns 0..r-1

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k mod p == q) {          // cyclic distribution: alloc(k) = k mod p
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
  }
  broadcast(alloc(k),buffer,n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08


Performance Analysis

• How long does this code take to run?
  – This is not an easy question because there are many tasks and many communications
• A little bit of analysis shows that the execution time is the sum of three terms
  – n-1 communications: n L + (n^2/2) b + O(1)
  – n-1 column preparations: (n^2/2) w' + O(1)
  – column updates: (n^3/(3p)) w + O(n^2)
• Therefore, the execution time is O(n^3/p)
  – Note that the sequential time is O(n^3)
• Therefore, we have perfect asymptotic efficiency!
  – This is good, but isn't always the best in practice
• How can we improve this algorithm?
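The efficiency claim can be checked with a short calculation. The sketch below makes two assumptions that the slide does not state explicitly: L, b, w' and w are read as the message latency, the per-element transfer cost, the per-element preparation cost and the per-element update cost, and the sequential time is taken to be dominated by the same update term, (n^3/3) w.

```latex
\begin{aligned}
T_{\text{par}}(n,p) &\approx nL + \frac{n^2}{2}\,b + \frac{n^2}{2}\,w' + \frac{n^3}{3p}\,w,
\qquad
T_{\text{seq}}(n) \approx \frac{n^3}{3}\,w \\[4pt]
E(n,p) &= \frac{T_{\text{seq}}(n)}{p\,T_{\text{par}}(n,p)}
        = \frac{1}{1 + \dfrac{3pL}{n^2 w} + \dfrac{3p\,(b + w')}{2nw}}
\;\longrightarrow\; 1 \quad \text{as } n \to \infty \text{ for fixed } p
\end{aligned}
```

The correction terms shrink only like p/n, which is one way to see why the result, although asymptotically perfect, is not always the best in practice for moderate n.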




Pipelining on the Ring

• So far, in the algorithm we've used a simple broadcast
• Nothing was specific to being on a ring of processors, and it's portable
  – In fact, you could just write raw MPI that looks like our pseudo-code and have a very limited LU factorization, inefficient for small n, that works only for some numbers of processors
• But it's not efficient
  – The n-1 communication steps are not overlapped with computations
  – Therefore Amdahl's law, etc.
• Turns out that on a ring, with a cyclic distribution of the columns, one can interleave pieces of the broadcast with the computation
  – It almost looks like inserting the source code of the broadcast we saw at the very beginning throughout the LU code


Previous program

...
double a[n][r];   // rows 0..n-1, local columns 0..r-1

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k mod p == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
  }
  broadcast(alloc(k),buffer,n-k-1)
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08


LU-pipeline algorithm

double a[n][r];   // rows 0..n-1, local columns 0..r-1

q ← MY_NUM()
p ← NUM_PROCS()
l ← 0
for k = 0 to n-2 {
  if (k mod p == q) {
    for i = k+1 to n-1
      buffer[i-k-1] ← a[i,l] ← -a[i,l] / a[k,l]
    l ← l+1
    send(buffer,n-k-1)
  } else {
    recv(buffer,n-k-1)
    if (q ≠ (k-1) mod p) send(buffer,n-k-1)
  }
  for j = l to r-1
    for i = k+1 to n-1
      a[i,j] ← a[i,j] + buffer[i-k-1] * a[k,j]
}
}

src: navet.ics.hawaii.edu/~casanova/courses/ics632_fall08
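The forwarding rule in the pipelined version can be isolated in a few lines of C. This is only a sketch of the send/recv decision, not an MPI program; `sends_at_step` is a hypothetical helper, send is assumed to target the successor on the ring, and p is assumed to be at least 2.

```c
/* Returns 1 if processor q sends the buffer to its ring successor at
   step k, 0 otherwise. The owner of column k (rank k mod p) starts
   the chain; every other processor forwards what it received, except
   the owner's predecessor, which is the last one to receive. */
int sends_at_step(int q, int k, int p)
{
    int owner = k % p;
    int last = (owner + p - 1) % p;  /* same rank as (k-1) mod p for k >= 1 */
    if (q == owner)
        return 1;
    return q != last;
}
```

At every step exactly p-1 processors send, so the buffer travels once around the ring, and each processor can begin its column updates as soon as it has forwarded the buffer rather than waiting for a full broadcast to finish.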


N Bodies

Source: OU Supercomputing Center for Education & Research
Img src: http://www.lsbu.ac.uk/water

N-Body Problems

An N-body problem is a problem involving N “bodies” – that is, particles (e.g., stars, atoms) – each of which applies a force to all of the others.

For example, if you have N stars, then each of the N stars exerts a force (gravity) on all of the other N–1 stars.

Likewise, if you have N atoms, then every atom exerts a force on all of the other N–1 atoms. The forces are Coulombic and van der Waals.


2-Body Problem

When N is 2, you have – surprise! – a 2-Body Problem: exactly two particles, each exerting a force that acts on the other.

The relationship between the 2 particles can be expressed as a differential equation that can be solved analytically, producing a closed-form solution.

So, given the particles’ initial positions and velocities, you can immediately calculate their positions and velocities at any later time.



N-Body Problems

• For N of 3 or more, no one knows how to solve the equations to get a closed form solution.

• So, numerical simulation is pretty much the only way to study groups of 3 or more bodies.

• Popular applications of N-body codes include astronomy and chemistry.

• Note that, for N bodies, there are on the order of N^2 forces, denoted O(N^2).



N-Body Problems

• Given N bodies, each body exerts a force on all of the other N–1 bodies.

• Therefore, there are N • (N–1) forces in total.
• You can also think of this as (N • (N–1))/2 forces, in the sense that the force from particle A to particle B is the same (except in the opposite direction) as the force from particle B to particle A.



N-Body Problems

• Given N bodies, each body exerts a force on all of the other N–1 bodies.

• Therefore, there are N • (N–1) forces in total.
• In Big-O notation, that's O(N^2) forces to calculate.
• So, calculating the forces takes O(N^2) time to execute.
• But, there are only N particles, each taking up the same amount of memory, so we say that N-body codes are of:
  – O(N) spatial complexity (memory)
  – O(N^2) time complexity



How to Calculate?

• Whatever your physics is, you have some function, F(A,B), that expresses the force between two bodies A and B.

• For example,

F(A,B) = G · m_A · m_B / dist(A,B)^2

where G is the gravitational constant and m is the mass of the particle in question.

• If you have all of the forces for every pair of particles, then you can calculate their sum, obtaining the force on every particle.



Gravitational N-Body Problem

• Objective is to find positions and movements of bodies in space (say planets) that are subject to gravitational forces from other bodies, using Newtonian laws of physics.
• The gravitational force between two bodies of masses m_a and m_b separated by a distance r is

\[ F = \frac{G m_a m_b}{r^2} \]

where G is the gravitational constant.
• Subject to forces, a body will accelerate according to Newton's second law:

\[ F = ma \]

where m is the mass of the body, F is the force it experiences, and a is the resultant acceleration.


• For a precise numeric description, differential equations would be used:

\[ F = m \frac{dv}{dt} \quad \text{and} \quad v = \frac{dx}{dt} \]

• Let the time interval be \Delta t. Then, for a particular body of mass m, the force is given by

\[ F = \frac{m (v^{t+1} - v^{t})}{\Delta t} \]

and a new velocity

\[ v^{t+1} = v^{t} + \frac{F \Delta t}{m} \]

where v^{t+1} is the velocity of the body at time t + 1 and v^{t} is the velocity of the body at time t.
• If a body is moving at a velocity v over the time interval \Delta t, its position changes by

\[ x^{t+1} - x^{t} = v \Delta t \]

where x^{t} is its position at time t.
• Once bodies move to new positions, the forces change and the computation has to be repeated.


• The velocity is not actually constant over the time interval \Delta t.
• It can help to have a "leap-frog" computation in which velocity and position are computed alternately:

\[ F^{t} = \frac{m (v^{t+1/2} - v^{t-1/2})}{\Delta t} \]

and

\[ x^{t+1} - x^{t} = v^{t+1/2} \Delta t \]

where the positions are computed for times t, t + 1, t + 2, etc. and the velocities are computed for times t + 1/2, t + 3/2, t + 5/2, etc.
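One leap-frog step for a single body in one dimension can be sketched as follows, with the velocity kept at half steps and the position at whole steps. The names `leapfrog_step`, `force_fn` and `spring_force` are hypothetical; the spring force is just a stand-in for whatever force routine the simulation uses.

```c
/* Force as a function of position; stands in for the real force routine. */
typedef double (*force_fn)(double x);

/* Example force for illustration only: a unit spring, F = -x. */
double spring_force(double x) { return -x; }

/* Advances one body by one time step dt.
   On entry:  *x = x^t,     *v_half = v^(t-1/2)
   On exit:   *x = x^(t+1), *v_half = v^(t+1/2) */
void leapfrog_step(double *x, double *v_half, double m, double dt,
                   force_fn force)
{
    double F = force(*x);      /* F^t evaluated at x^t */
    *v_half += F * dt / m;     /* v^(t+1/2) = v^(t-1/2) + F^t dt / m */
    *x += *v_half * dt;        /* x^(t+1) = x^t + v^(t+1/2) dt */
}
```

Because the position update uses the velocity from the middle of the interval, the scheme is more accurate than the plain update with v^t at essentially the same cost per step.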




Three-Dimensional Space

• In a three-dimensional space having a coordinate system (x, y, z), the distance between the bodies at (x_a, y_a, z_a) and (x_b, y_b, z_b) is given by

\[ r = \sqrt{(x_b - x_a)^2 + (y_b - y_a)^2 + (z_b - z_a)^2} \]

• The forces are resolved in the three directions, using, for example,

\[ F_x = \frac{G m_a m_b}{r^2} \left( \frac{x_b - x_a}{r} \right), \qquad
   F_y = \frac{G m_a m_b}{r^2} \left( \frac{y_b - y_a}{r} \right), \qquad
   F_z = \frac{G m_a m_b}{r^2} \left( \frac{z_b - z_a}{r} \right) \]

where the particles are of mass m_a and m_b and have coordinates (x_a, y_a, z_a) and (x_b, y_b, z_b).


O(N^2) Forces

[Figure: a set of bodies with force arrows.] Note that this picture shows only the forces between A and everyone else.



Sequential Code

Overall gravitational N-body computation can be described by:

    for (t = 0; t < tmax; t++) {              /* for each time period */
        for (i = 0; i < N; i++) {             /* for each body */
            F = Force_routine(i);             /* compute force on ith body */
            vnew[i] = v[i] + F * dt / m;      /* compute new velocity */
            xnew[i] = x[i] + vnew[i] * dt;    /* and new position */
        }
        for (i = 0; i < N; i++) {             /* for each body */
            x[i] = xnew[i];                   /* update velocity & position */
            v[i] = vnew[i];
        }
    }

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.


How to Parallelize?

Okay, so let’s say you have a nice serial (single-CPU) code that does an N-body calculation.

How are you going to parallelize it?

You could:
• have a master feed particles to processes;
• have a master feed interactions to processes;
• have each process decide on its own subset of the particles, and then share around the forces;
• have each process decide its own subset of the interactions, and then share around the forces.



Do You Need a Master?

• Let's say that you have N bodies, and therefore you have ½N(N-1) interactions (every particle interacts with all of the others, but you don't need to calculate both A→B and B→A).
• Do you need a master?
• Well, can each processor determine on its own either (a) which of the bodies to process, or (b) which of the interactions?

• If the answer is yes, then you don’t need a master.
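Option (a) works without a master as soon as each process can compute its own range of bodies from its rank alone. A minimal sketch, in which `my_body_range` is a hypothetical helper and the leftover bodies are handed to the lowest ranks:

```c
/* Computes the half-open range [*start, *end) of bodies owned by
   process `rank` out of `nprocs`. When nprocs does not divide
   nbodies, the first (nbodies % nprocs) ranks get one extra body. */
void my_body_range(int rank, int nprocs, int nbodies, int *start, int *end)
{
    int base = nbodies / nprocs;
    int extra = nbodies % nprocs;
    *start = rank * base + (rank < extra ? rank : extra);
    *end = *start + base + (rank < extra ? 1 : 0);
}
```

Every process evaluates the same formula with its own rank, so no communication is needed to agree on the decomposition; only the computed forces or positions have to be shared afterwards.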



Nbody - OpenMP

#ifndef _NBODY_H
#define _NBODY_H

/* Needed includes */
#include <stdlib.h>   /* atoi */
#include <stdio.h>    /* fprintf */
#include <time.h>     /* clock/clock_gettime */
#include <malloc.h>   /* malloc */
#include <math.h>     /* sqrt */
#ifdef WITH_OPENMP
#include <omp.h>
#endif

/* Some constants */
#define GRAV_CONST 6.673e-11   /* m^3/(kg*s^2) */
#define EPSILON 1e-12

/* for init_bodies */
#define INIT_LINEAR 0
#define INIT_SPIRAL 1

/* ickiness in naming of time stuff */
#define GNU_TIME
#ifdef GNU_TIME
# define __need_clock_t
# define mytspec clock_t
# define get_time(tspec) tspec = clock()
#else
# define mytspec timespec_t
# define get_time(tspec) clock_gettime(CLOCK_SGI_CYCLE,&tspec)
#endif

#define X 0
#define Y 1

#if 0
#ifndef _BOOL

#endif
#ifdef _STANDARD_C_PLUS_PLUS
// If -LANG:std is specified, it defines the macro
// in the preceding line. Use new-style headers
// and a using directive to bring names from the
// std namespace into the global namespace.
#include<complex>
#include<iostream>
using namespace std;
#else
// If -LANG:std is not specified, use old-style headers,
// and there is no need for a using directive.
#include<complex.h>
#include<iostream.h>
#endif
complex<float> x(1,2);
#endif

/* the Body structure */
struct Body_struct {
    double mass;
    double pos[2];
    double vel[2];
};
typedef struct Body_struct Body;

/* Subroutines in nbody.c */
Body* init_bodies(unsigned int num_bodies, int init_type);
int check_simulation(Body *bodies, int num_bodies);
double elapsed_time(const mytspec t2, const mytspec t1);

#endif /* _NBODY_H */


Nbody - OpenMP

#include "nbody.h"
#include <omp.h>

int main(int argc, char **argv)
{
    mytspec start_time, end_time;
    int num_bodies, num_steps, max_threads=1;
    int i, j, k, l;
    double dt=1.0, dv[2];
    double r[2], dist, force_len, force_ij[2], tot_force_i[2];
#if NEWTON_OPT
    double *forces_matrix;
#endif
    Body *bodies;
    const char Usage[] = "Usage: nbody <num bodies> <num_steps>\n";

    /* both arguments are required, since argv[2] is read below */
    if (argc < 3) { fprintf(stderr, "%s", Usage); exit(1); }
    num_bodies = atoi(argv[1]);
    num_steps = atoi(argv[2]);

    /* Initialize with OpenMP */
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#else
    printf("Warning: no OpenMP!\n");
#endif
#if NEWTON_OPT > 0
    printf("Using Newton's third law optimization, variant %d.\n", NEWTON_OPT);
#endif
    printf("Initializing with %d threads, %d bodies, %d time steps\n",
           max_threads, num_bodies, num_steps);
    bodies = init_bodies(num_bodies, INIT_SPIRAL);
    check_simulation(bodies, num_bodies);


Nbody - OpenMP

    printf("Running "); fflush(stdout);
    get_time(start_time);

#define PRIVATE_VARS r, dist, force_len, force_ij, tot_force_i, dv
/* dist holds the squared distance, so dist*sqrt(dist) is distance cubed */
#define Calc_Force_ij() \
    r[X] = bodies[j].pos[X] - bodies[i].pos[X]; \
    r[Y] = bodies[j].pos[Y] - bodies[i].pos[Y]; \
    dist = r[X]*r[X] + r[Y]*r[Y]; \
    force_len = GRAV_CONST * bodies[i].mass * bodies[j].mass \
                / (dist*sqrt(dist)); \
    force_ij[X] = force_len * r[X]; \
    force_ij[Y] = force_len * r[Y]
…

#pragma omp parallel for private(j, PRIVATE_VARS)
    for (i=0; i<num_bodies; i++) {
        tot_force_i[X] = 0.0;
        tot_force_i[Y] = 0.0;
        for (j=0; j<num_bodies; j++) {
            if (j==i) continue;
            Calc_Force_ij();
            tot_force_i[X] += force_ij[X];
            tot_force_i[Y] += force_ij[Y];
        }
        Step_Body_i();
    }
#elif NEWTON_OPT == 1
#define forces(i,j,x) forces_matrix[x + 2*(j + num_bodies*i)]


Nbody - OpenMP

…

Body *init_bodies(unsigned int num_bodies, int init_type)
{
    int i;
    double n = num_bodies;
    Body *bodies = (Body *) malloc(num_bodies * sizeof(Body));

    for (i=0; i<num_bodies; i++) {
        switch (init_type) {
        case INIT_LINEAR:
            bodies[i].mass = 1.0;
            bodies[i].pos[X] = i/n;
            bodies[i].pos[Y] = i/n;
            bodies[i].vel[X] = 0.0;
            bodies[i].vel[Y] = 0.0;
            break;
        case INIT_SPIRAL:
            bodies[i].mass = (n-i)/n;
            bodies[i].pos[X] = (1+i/n) * cos(2*M_PI*i/n) / 2;
            bodies[i].pos[Y] = (1+i/n) * sin(2*M_PI*i/n) / 2;
            bodies[i].vel[X] = 0.0;
            bodies[i].vel[Y] = 0.0;
            break;
        }
    }
    return bodies;   /* return only after every body is initialized */
}


Nbody - OpenMP

/*
 * Verify that the simulation is running correctly;
 * it should satisfy the invariant of conservation of momentum
 */
int check_simulation(Body *bodies, int num_bodies)
{
    int i, check_ok;
    double momentum[2] = { 0.0, 0.0 };

    for (i=0; i<num_bodies; i++) {
        momentum[X] += bodies[i].mass * bodies[i].vel[X];
        momentum[Y] += bodies[i].mass * bodies[i].vel[Y];
    }
    /* fabs, not abs: the momentum components are doubles */
    check_ok = ((fabs(momentum[X]) < EPSILON) && (fabs(momentum[Y]) < EPSILON));
    if (!check_ok)
        printf("Warning: total momentum = (%3.3f, %3.3f)\n",
               momentum[X], momentum[Y]);
    return check_ok;
}

#ifdef GNU_TIME
double elapsed_time(const mytspec t2, const mytspec t1)
{
    return 1.0 * (t2 - t1) / CLOCKS_PER_SEC;
}
#else
double elapsed_time(const mytspec t2, const mytspec t1)
{
    return (((double)t2.tv_sec) + ((double)t2.tv_nsec / 1e9))
         - (((double)t1.tv_sec) + ((double)t1.tv_nsec / 1e9));
}
#endif


N-Body “Pipeline” Implementation Flowchart

1. Initialize MPI environment
2. Create ring communicator
3. Initialize particle parameters
4. Copy local particle data to send buffer
5. Initiate transmission of send buffer to the RIGHT neighbor in ring
6. Initiate reception of data from the LEFT neighbor in ring
7. Compute forces between local and send buffer particles
8. Wait for message exchange to complete
9. Copy particle data from receive buffer to send buffer
10. Processed particles from all remote nodes? N: go back to step 5. Y: continue.
11. Update positions of local particles
12. All iterations done? N: go back to step 4. Y: continue.
13. Finalize MPI


N-Body (source code)

#include "mpi.h"
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

/* Pipeline version of the algorithm... */
/* we really need the velocities as well… */

/* Simplified structure describing parameters of a single particle */
typedef struct { double x, y, z; double mass; } Particle;

/* We use leapfrog for the time integration ... */

/* Structure to hold force components and old position coordinates of a particle */
typedef struct { double xold, yold, zold; double fx, fy, fz; } ParticleV;

void InitParticles( Particle[], ParticleV [], int );
double ComputeForces( Particle [], Particle [], ParticleV [], int );
double ComputeNewPos( Particle [], ParticleV [], int, double, MPI_Comm );

#define MAX_PARTICLES 4000
#define MAX_P 128

int main( int argc, char *argv[] )
{
    Particle particles[MAX_PARTICLES];   /* Particles on ALL nodes */
    ParticleV pv[MAX_PARTICLES];         /* Particle velocity */
    Particle sendbuf[MAX_PARTICLES],     /* Pipeline buffers */
             recvbuf[MAX_PARTICLES];
    MPI_Request request[2];
    int counts[MAX_P],                   /* Number on each processor */
        displs[MAX_P];                   /* Offsets into particles */
    int rank, size, npart, i, j,
        offset;                          /* location of local particles */
    int totpart,                         /* total number of particles */
        cnt;                             /* number of times in loop */
    MPI_Datatype particletype;
    double sim_t;                        /* Simulation time */
    double time;                         /* Computation time */
    int pipe, left, right, periodic;
    MPI_Comm commring;
    MPI_Status statuses[2];

    /* Initialize MPI Environment */
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    /* Create 1-dimensional periodic Cartesian communicator (a ring) */
    periodic = 1;
    MPI_Cart_create( MPI_COMM_WORLD, 1, &size, &periodic, 1, &commring );
    MPI_Cart_shift( commring, 0, 1, &left, &right );  /* Find the closest neighbors in ring */

    /* Calculate local fraction of particles */
    if (argc < 2) {
        fprintf( stderr, "Usage: %s n\n", argv[0] );
        MPI_Abort( MPI_COMM_WORLD, 1 );
    }
    npart = atoi(argv[1]) / size;
    if (npart * size > MAX_PARTICLES) {
        fprintf( stderr, "%d is too many; max is %d\n", npart*size, MAX_PARTICLES );
        MPI_Abort( MPI_COMM_WORLD, 1 );
    }
    MPI_Type_contiguous( 4, MPI_DOUBLE, &particletype );  /* Data type corresponding to Particle struct */
    MPI_Type_commit( &particletype );

    /* Get the sizes and displacements */
    MPI_Allgather( &npart, 1, MPI_INT, counts, 1, MPI_INT, commring );
    displs[0] = 0;
    for (i=1; i<size; i++)
        displs[i] = displs[i-1] + counts[i-1];
    totpart = displs[size-1] + counts[size-1];

    /* Generate the initial values */
    InitParticles( particles, pv, npart);
    offset = displs[rank];
    cnt = 10;
    time = MPI_Wtime();
    sim_t = 0.0;

    /* Begin simulation loop */
    while (cnt--) {
        double max_f, max_f_seg;



N-Body (source code)

        /* Load the initial send buffer */
        memcpy( sendbuf, particles, npart * sizeof(Particle) );
        max_f = 0.0;
        for (pipe=0; pipe<size; pipe++) {
            if (pipe != size-1) {
                /* Initialize send to the "right" neighbor, while receiving from the "left" */
                MPI_Isend( sendbuf, npart, particletype, right, pipe, commring, &request[0] );
                MPI_Irecv( recvbuf, npart, particletype, left, pipe, commring, &request[1] );
            }
            /* Compute forces */
            max_f_seg = ComputeForces( particles, sendbuf, pv, npart );
            if (max_f_seg > max_f) max_f = max_f_seg;

            /* Wait for updates to complete and copy received particles to the send buffer */
            if (pipe != size-1)
                MPI_Waitall( 2, request, statuses );
            memcpy( sendbuf, recvbuf, counts[pipe] * sizeof(Particle) );
        }
        /* Compute the changes in position using the already calculated forces */
        sim_t += ComputeNewPos( particles, pv, npart, max_f, commring );

        /* We could do graphics here (move particles on the display) */
    }
    time = MPI_Wtime() - time;
    if (rank == 0) {
        printf( "Computed %d particles in %f seconds\n", totpart, time );
    }
    MPI_Finalize();
    return 0;
}


N-Body (source code)

/* Initialize particle positions, masses and forces */
void InitParticles( Particle particles[], ParticleV pv[], int npart )
{
    int i;
    for (i=0; i<npart; i++) {
        particles[i].x = drand48();
        particles[i].y = drand48();
        particles[i].z = drand48();
        particles[i].mass = 1.0;
        pv[i].xold = particles[i].x;
        pv[i].yold = particles[i].y;
        pv[i].zold = particles[i].z;
        pv[i].fx = 0;
        pv[i].fy = 0;
        pv[i].fz = 0;
    }
}

/* Compute forces (2-D only) */
double ComputeForces( Particle myparticles[], Particle others[], ParticleV pv[], int npart )
{
    double max_f, rmin;
    int i, j;

    max_f = 0.0;
    for (i=0; i<npart; i++) {
        double xi, yi, mi, rx, ry, mj, r, fx, fy;
        rmin = 100.0;
        xi = myparticles[i].x;
        yi = myparticles[i].y;
        fx = 0.0;
        fy = 0.0;


N-Body (source code)

        for (j=0; j<npart; j++) {
            rx = xi - others[j].x;
            ry = yi - others[j].y;
            mj = others[j].mass;
            r = rx * rx + ry * ry;
            /* ignore overlap and same particle */
            if (r == 0.0) continue;
            if (r < rmin) rmin = r;
            /* compute forces */
            r = r * sqrt(r);
            fx -= mj * rx / r;
            fy -= mj * ry / r;
        }
        pv[i].fx += fx;
        pv[i].fy += fy;
        /* Compute a rough estimate of (1/m)|df / dx| */
        fx = sqrt(fx*fx + fy*fy)/rmin;
        if (fx > max_f) max_f = fx;
    }
    return max_f;
}

/* Update particle positions (2-D only) */
double ComputeNewPos( Particle particles[], ParticleV pv[], int npart, double max_f, MPI_Comm commring )
{
    int i;
    double a0, a1, a2;
    static double dt_old = 0.001, dt = 0.001;
    double dt_est, new_dt, dt_new;


N-Body (source code)

    /* integration is a0 * x^+ + a1 * x + a2 * x^- = f / m */
    a0 = 2.0 / (dt * (dt + dt_old));
    a2 = 2.0 / (dt_old * (dt + dt_old));
    a1 = -(a0 + a2);   /* also -2/(dt*dt_old) */
    for (i=0; i<npart; i++) {
        double xi, yi;
        /* Very, very simple leapfrog time integration. We use a variable
           step version to simplify time-step control. */
        xi = particles[i].x;
        yi = particles[i].y;
        particles[i].x = (pv[i].fx - a1 * xi - a2 * pv[i].xold) / a0;
        particles[i].y = (pv[i].fy - a1 * yi - a2 * pv[i].yold) / a0;
        pv[i].xold = xi;
        pv[i].yold = yi;
        pv[i].fx = 0;
        pv[i].fy = 0;
    }
    /* Recompute a time step. Stability criterion is roughly
       2/sqrt(1/m |df/dx|) >= dt. We leave a little room */
    dt_est = 1.0/sqrt(max_f);
    if (dt_est < 1.0e-6) dt_est = 1.0e-6;
    MPI_Allreduce( &dt_est, &dt_new, 1, MPI_DOUBLE, MPI_MIN, commring );
    /* Modify time step */
    if (dt_new < dt) {
        dt_old = dt;
        dt = dt_new;
    } else if (dt_new > 4.0 * dt) {
        dt_old = dt;
        dt *= 2.0;
    }
    return dt_old;
}


Demo: N-Body Problem

> mpiexec -np 4 ./nbodypipe 4000
Computed 4000 particles in 1.119051 seconds


Barnes-Hut Algorithm

• Start with the whole space, in which one cube contains the bodies (or particles).
  – First, this cube is divided into eight subcubes.
  – If a subcube contains no particles, the subcube is deleted from further consideration.
  – If a subcube contains more than one body, it is recursively divided until every subcube contains not more than one body.
• This process creates an octtree; that is, a tree with up to eight edges from each node.
• The leaves represent cells each containing one body.
• The decomposition for a two-dimensional case follows the same construction except with up to four edges from each node (a quadtree).


• In the Barnes-Hut algorithm, after the tree has been constructed, the total mass and center of mass of the subcube are stored at each node.
• The force on each body can then be obtained by traversing the tree starting at the root, stopping at a node when the clustering approximation can be used, e.g. when:

\[ r \ge \frac{d}{\theta} \]

where r is the distance from the body to the center of mass of the cell, d is the side length of the cell, and \theta is a constant typically 1.0 or less (\theta is called the opening angle).


• Once all the bodies have been given new positions and velocities,
  – the process is repeated for each time period.
  – This means that the whole octtree must be reconstructed for each time period (because the bodies have moved).
  – Constructing the tree requires a time of O(n log n), and so does computing all the forces, so that the overall time complexity of the method is O(n log n).
• The algorithm can be described by the following:

    for (t = 0; t < tmax; t++) {    /* for each time period */
        Build_Octtree();            /* construct Octtree (or Quadtree) */
        Tot_Mass_Center();          /* compute total mass & center */
        Comp_Force();               /* traverse tree/computing forces */
        Update();                   /* update position/velocity */
    }

  – Build_Octtree(): the tree can be constructed from the positions of the bodies, considering each body in turn.
  – Tot_Mass_Center(): must traverse the tree, computing the total mass and center of mass at each node:

\[ M = \sum_{i=0}^{7} m_i, \qquad C = \frac{1}{M} \sum_{i=0}^{7} (m_i \times c_i) \]

    » where the positions of the centers of mass have three components, in the x, y, and z directions.
  – Comp_Force(): must visit nodes, ascertaining whether the clustering approximation can be applied to compute the force of all the bodies in that cell.
    » If the clustering approximation cannot be applied, the children of the node must be visited.
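The per-node work in Tot_Mass_Center() and the stopping test in Comp_Force() can be sketched in C as below. The `Cell` struct and both function names are hypothetical, and the test assumes the usual Barnes-Hut criterion r >= d / theta, with r the distance to the cell's center of mass and d the side length of the cell.

```c
/* Total mass and center of mass of a tree node; illustrative only. */
typedef struct {
    double mass;
    double c[3];   /* center of mass: x, y, z components */
} Cell;

/* Combines up to eight children into their parent:
   M = sum of m_i, C = (1/M) * sum of m_i * c_i. */
Cell combine_children(const Cell child[], int nchild)
{
    Cell parent = { 0.0, { 0.0, 0.0, 0.0 } };
    for (int i = 0; i < nchild; i++) {
        parent.mass += child[i].mass;
        for (int a = 0; a < 3; a++)
            parent.c[a] += child[i].mass * child[i].c[a];
    }
    if (parent.mass > 0.0)
        for (int a = 0; a < 3; a++)
            parent.c[a] /= parent.mass;
    return parent;
}

/* Returns 1 when a cell of side d at distance r may be treated as a
   single body located at its center of mass, 0 when its children
   must be visited instead. */
int can_cluster(double r, double d, double theta)
{
    return r >= d / theta;
}
```

A smaller theta makes `can_cluster` succeed less often, so more children are visited: the result is more accurate and the traversal more expensive.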


Orthogonal Recursive Bisection

(For a 2-dimensional area) First, find a vertical line that divides the area into two areas, each with an equal number of bodies. For each area, find a horizontal line that divides it into two areas, each with an equal number of bodies. Repeat as required.


O(NlogN) TreeCode

DEMO

[cdekate@celeritas tree]$ ./treecode nbody=4000
Hierarchical N-body code (theta scan)

   nbody    dtime     eps   theta  usequad    dtout   tstop
    4000  0.03125  0.0250    1.00    false  0.25000  2.0000
……
    time    |T+U|        T       -U     -T/U   |Vcom|   |Jtot|  CPUtot
   2.000  0.24094  0.23061  0.47155  0.48904  0.00019  0.00494   0.085


N Body Viz


The Millennium Run used more than 10 billion particles to trace the evolution of the matter distribution in a cubic region of the Universe over 2 billion light-years on a side. It kept busy the principal supercomputer at the Max Planck Society's Supercomputing Centre in Garching, Germany for more than a month. By applying sophisticated modelling techniques to the 25 Tbytes of stored output, Virgo scientists have been able to recreate evolutionary histories both for the 20 million or so galaxies which populate this enormous volume and for the supermassive black holes which occasionally power quasars at their hearts. By comparing such simulated data to large observational surveys, one can clarify the physical processes underlying the buildup of real galaxies and black holes.


Topics


– Bubble Sort
– Merge Sort
– Heap Sort
– Quick Sort


Parallel Sorting

• Finding a permutation of a sequence [a_1, a_2, ..., a_{n-1}] such that a_1 <= a_2 <= … <= a_{n-1}
• Often we sort records based on a key
• Parallel sort results in:
  – Partial sequences are sorted on all nodes
  – Largest value on node N-1 is smaller than or equal to the smallest value on node N
• Several ways to parallelize
  – Chunk sequence, sort locally, merge back (bubblesort)
  – Project algorithm structure onto communication and distribution scheme (quicksort)




Bubble Sort

• The bubble sort is the oldest and simplest sort in use. Unfortunately, it's also the slowest.
• The bubble sort works by comparing each item in the list with the item next to it, and swapping them if required.
• The algorithm repeats this process until it makes a pass all the way through the list without swapping any items (in other words, all items are in the correct order).
• This causes larger values to "bubble" to the end of the list while smaller values "sink" towards the beginning of the list.
• The bubble sort is generally considered to be the most inefficient sorting algorithm in common usage. Under best-case conditions (the list is already sorted), the bubble sort can approach a linear O(n) level of complexity. The general case is O(n^2).
• Pros: Simplicity and ease of implementation.
• Cons: Extremely inefficient.

Reference: http://math.hws.edu/TMCM/java/xSortLab/

Source: http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/sorting/bubblesort.c

http://www.sci.hkbu.edu.hk



Bubblesort

    void sort(int *v, int n)
    {
        int i, j, tmp;
        for (i = n-2; i >= 0; i--)
            for (j = 0; j <= i; j++)
                if (v[j] > v[j+1]) {
                    /* swap v[j] and v[j+1] */
                    tmp = v[j];
                    v[j] = v[j+1];
                    v[j+1] = tmp;
                }
    }
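The version above always performs every pass. The variant described in the text, which stops after a pass that makes no swaps and therefore reaches O(n) on already-sorted input, can be sketched as follows; `bubble_sort_early_exit` is a hypothetical name and this is not the referenced bubblesort.c.

```c
#include <stddef.h>

/* Bubble sort that stops as soon as one full pass makes no swap. */
void bubble_sort_early_exit(int *v, size_t n)
{
    if (n < 2)
        return;
    for (size_t end = n - 1; end > 0; end--) {
        int swapped = 0;
        for (size_t j = 0; j < end; j++) {
            if (v[j] > v[j + 1]) {
                int tmp = v[j];
                v[j] = v[j + 1];
                v[j + 1] = tmp;
                swapped = 1;
            }
        }
        if (!swapped)
            break;   /* no swaps in this pass: the list is sorted */
    }
}
```

On sorted input the first pass makes n-1 comparisons and no swaps, so the function returns after a single pass.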



Discussion

• Bubble sort takes time proportional to N*N/2 for N data items
• This parallelization splits N data items into chunks of N/P, so the time on one of the P processors is now proportional to (N/P * N/P)/2, i.e. the time is reduced by a factor of P*P!
• Bubble sort is much slower than quick sort!
  – Better to run quick sort on a single processor than bubble sort on many processors!




Merge Sort

• The merge sort splits the list to be sorted into two equal halves, and places them in separate arrays.

• Each array is recursively sorted, and then merged back together to form the final sorted list.

• Like most recursive sorts, the merge sort has an algorithmic complexity of O(n log n).
• Elementary implementations of the merge sort make use of three arrays: one for each half of the data set and one to store the sorted list in. The below algorithm merges the arrays in-place, so only two arrays are required. There are non-recursive versions of the merge sort, but they don't yield any significant performance enhancement over the recursive algorithm on most machines.

Pros: Marginally faster than the heap sort for larger sets.

Cons: At least twice the memory requirements of the other sorts; recursive.

Reference

http://math.hws.edu/TMCM/java/xSortLab/

64


Merge Sort

[cdekate@celeritas sort]$ mpiexec -np 4 ./mergesort
1000000; 4 processors; 0.250000 secs
[cdekate@celeritas sort]$

65


Mergesort

void msort(int *A, int min, int max)
{
    int *C; /* dummy, just to fit the function */
    int mid = (min + max) / 2;
    int lowerCount = mid - min + 1;
    int upperCount = max - mid;

    /* If the range holds at most one element, it's already sorted */
    if (max <= min) {
        return;
    } else {
        /* Otherwise, sort the first half */
        msort(A, min, mid);
        /* Now sort the second half */
        msort(A, mid + 1, max);
        /* Now merge the two halves */
        C = merge(A + min, lowerCount, A + mid + 1, upperCount);
    }
}
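The slide does not show the merge step itself. The following is a minimal sketch of one way it could work, assuming the two sorted runs are adjacent in memory and are merged through a temporary buffer that is copied back. The names merge_sketch and msort_sketch are hypothetical and are not the functions from the original source.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: merge the sorted runs a[0..na-1] and b[0..nb-1],
 * where b == a + na (the runs are adjacent), through a temporary buffer,
 * then copy the result back over a. Returns a, or NULL if malloc fails. */
int *merge_sketch(int *a, int na, int *b, int nb)
{
    int *tmp = malloc((size_t)(na + nb) * sizeof *tmp);
    int i = 0, j = 0, k = 0;

    if (tmp == NULL)
        return NULL;
    while (i < na && j < nb)
        tmp[k++] = (b[j] < a[i]) ? b[j++] : a[i++]; /* take a on ties */
    while (i < na)
        tmp[k++] = a[i++];
    while (j < nb)
        tmp[k++] = b[j++];
    memcpy(a, tmp, (size_t)k * sizeof *tmp);
    free(tmp);
    return a;
}

/* Recursive merge sort of A[min..max] (inclusive bounds).
 * Allocation failure in merge_sketch is ignored in this sketch. */
void msort_sketch(int *A, int min, int max)
{
    int mid;

    if (max <= min)
        return; /* zero or one element: already sorted */
    mid = min + (max - min) / 2;
    msort_sketch(A, min, mid);
    msort_sketch(A, mid + 1, max);
    merge_sketch(A + min, mid - min + 1, A + mid + 1, max - mid);
}
```

A call such as msort_sketch(data, 0, n - 1) sorts the n items of data in ascending order. The temporary buffer is the extra memory that the "Cons" entry for merge sort refers to.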

66


Mergesort

67


Topics



68


Heap Sort

• The heap sort is the slowest of the O(n log n) sorting algorithms, but unlike the merge and quick sorts it doesn't require massive recursion or multiple arrays to work. This makes it the most attractive option for very large data sets of millions of items.

• The heap sort works as its name suggests:
1. It begins by building a heap out of the data set.
2. It then removes the largest item and places it at the end of the sorted array.
3. After removing the largest item, it reconstructs the heap, removes the largest remaining item, and places it in the next open position from the end of the sorted array.
4. This is repeated until there are no items left in the heap and the sorted array is full.

• Elementary implementations require two arrays: one to hold the heap and the other to hold the sorted elements.

• To do an in-place sort and save the space the second array would require, the algorithm below "cheats" by using the same array to store both the heap and the sorted array. Whenever an item is removed from the heap, it frees up a space at the end of the array that the removed item can be placed in.

• Pros: In-place and non-recursive, making it a good choice for extremely large data sets.

• Cons: Slower than the merge and quick sorts.

Reference: http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/heapsort.html

Source: http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/heapsort/heapsort.c
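The in-place scheme described on this slide can be sketched as follows. This is an illustrative version, not the code from the source file above; sift_down and heap_sort_sketch are hypothetical names, and a max-heap stored in the array itself is assumed.

```c
#include <stddef.h>

/* Restore the max-heap property for the subtree rooted at `root`,
 * looking only at the first `n` items of v. */
static void sift_down(int *v, size_t root, size_t n)
{
    for (;;) {
        size_t child = 2 * root + 1; /* left child */
        if (child >= n)
            return;
        if (child + 1 < n && v[child + 1] > v[child])
            child++; /* pick the larger of the two children */
        if (v[root] >= v[child])
            return;
        int tmp = v[root];
        v[root] = v[child];
        v[child] = tmp;
        root = child;
    }
}

/* In-place, non-recursive heap sort (ascending order). */
void heap_sort_sketch(int *v, size_t n)
{
    if (n < 2)
        return;
    /* Step 1: build a max-heap out of the data set. */
    for (size_t i = n / 2; i-- > 0; )
        sift_down(v, i, n);
    /* Steps 2-4: move the largest item into the slot freed at the end
     * of the shrinking heap, then rebuild the heap. */
    for (size_t end = n - 1; end > 0; end--) {
        int tmp = v[0];
        v[0] = v[end];
        v[end] = tmp;
        sift_down(v, 0, end);
    }
}
```

The heap occupies v[0..end-1] and the sorted items accumulate in v[end..n-1], so no second array is needed.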

69


Heapsort

70


Topics



71


Quick Sort

• The quick sort is an in-place, divide-and-conquer, massively recursive sort.

• Divide and Conquer Algorithms
– Algorithms that solve (conquer) problems by dividing them into smaller sub-problems until the problem is so small that it is trivially solved.

• In Place
– In-place sorting algorithms don't require additional temporary space to store elements as they sort; they use the space originally occupied by the elements.

• Quicksort takes time proportional to N*N for N data items in the worst case, but usually time proportional to N*log2N, which is much faster
– for 1,000,000 items, N*log2N ~ 1,000,000*20

• Constant communication cost – 2*N data items
– for 1,000,000 items, must send/receive 2*1,000,000 items from/to root

• In general, processing/communication is proportional to N*log2N/(2*N) = log2N/2
– so for 1,000,000 items, only 20/2 = 10 times as much processing as communication

• Suggests that this parallelization can only yield a speedup for very large N

Reference: http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/qsort.html

Source: http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/qsort/qsort.c
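The processing-to-communication estimate can be reproduced with a short calculation. The sketch below rounds log2N up to an integer, as the slide does when it uses 20 for 1,000,000 items; the function names are hypothetical.

```c
/* Smallest k with 2^k >= n, i.e. log2(n) rounded up, for n >= 1. */
int ceil_log2(long long n)
{
    int k = 0;
    long long power = 1;

    while (power < n) {
        power *= 2;
        k++;
    }
    return k;
}

/* Estimated ratio of processing (N * log2 N operations) to
 * communication (2 * N data items sent to and from the root). */
double processing_to_communication(long long n)
{
    return ((double)n * ceil_log2(n)) / (2.0 * (double)n);
}
```

For N = 1,000,000 this gives ceil_log2(N) = 20 and a ratio of 10, matching the figure above. The ratio grows only logarithmically with N, which is why a speedup needs very large N.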


72


Quick Sort

• The recursive algorithm consists of four steps (which closely resemble the merge sort):

1. If there is one element or fewer in the array to be sorted, return immediately.

2. Pick an element in the array to serve as a "pivot" point. (Usually the left-most element in the array is used.)

3. Split the array into two parts - one with elements larger than the pivot and the other with elements smaller than the pivot.

4. Recursively repeat the algorithm for both halves of the original array.

• The efficiency of the algorithm depends strongly on which element is chosen as the pivot point.

• The worst-case efficiency of the quick sort, O(n^2), occurs when the list is already sorted and the left-most element is chosen as the pivot.

• If the data to be sorted isn't random, randomly choosing a pivot point is recommended. As long as the pivot point is chosen randomly, the quick sort has an expected algorithmic complexity of O(n log n).

Pros: Extremely fast.

Cons: Very complex algorithm, massively recursive.
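The four steps, with the recommended random pivot, can be sketched as below. This is an illustrative version rather than the code from the source file referenced earlier: the names quick_sort_sketch and swap_ints are hypothetical, the pivot is picked with the standard rand() function, and a simple single-scan partition is assumed.

```c
#include <stdlib.h>

static void swap_ints(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Sort v[lo..hi] (inclusive bounds) in place. */
void quick_sort_sketch(int *v, int lo, int hi)
{
    int pivot, store, i;

    /* Step 1: one element or fewer, nothing to do. */
    if (lo >= hi)
        return;

    /* Step 2: pick a random pivot and park it at the right end. */
    swap_ints(&v[lo + rand() % (hi - lo + 1)], &v[hi]);
    pivot = v[hi];

    /* Step 3: move the elements smaller than the pivot to the left part. */
    store = lo;
    for (i = lo; i < hi; i++)
        if (v[i] < pivot)
            swap_ints(&v[i], &v[store++]);
    swap_ints(&v[store], &v[hi]); /* pivot lands in its final position */

    /* Step 4: recurse on both parts. */
    quick_sort_sketch(v, lo, store - 1);
    quick_sort_sketch(v, store + 1, hi);
}
```

A call such as quick_sort_sketch(data, 0, n - 1) sorts n items. Because the pivot is random, an already sorted input no longer triggers the O(n^2) case systematically; call srand() first if a different pivot sequence is wanted on each run.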


73


Quicksort

74


Summary: Material for the Test

• LU decomposition: Slides 5-19
• N-body problem: Slides 33-48
• Sorting Algorithms: Slides 57-74

75
