P0 P1 P2 P3 P4 P5 P6 P7 - Argonne National Laboratory · 2009-02-07 · P0 P1 P2 P3 P4 P5 P6 P7 S...

[Figure: communication among processes P0 to P7, shown for Step 1, Step 2, and Step 3.]

[Figure: data blocks 0 to 5 held by processes P0 to P5. Panels: Initial data, After step 0, After step 1, After step 2, After local shift.]

[Plot: Myrinet Cluster, 16 bytes message size. x-axis: Number of processes (0 to 35); y-axis: time (microsec.) (0 to 140). Series: Recursive Doubling, Bruck Algorithm.]

[Plot: Myrinet Cluster. x-axis: message length (bytes) (0 to 9000); y-axis: time (microsec.) (0 to 1800). Series: MPICH Old, MPICH New.]

[Plot: Myrinet Cluster. x-axis: label not recovered (ticks 0 to 9e+06); y-axis: time (microsec.) (0 to 120000). Series: Recursive doubling, Ring.]

[Plot: IBM SP. x-axis: label not recovered (ticks 0 to 9e+06); y-axis: time (microsec.) (0 to 180000). Series: Recursive doubling, Ring.]

[Plot: Myrinet Cluster. x-axis: label not recovered (ticks 0 to 9e+06); y-axis: time (microsec.) (0 to 250000). Series: MPICH Old, MPICH New.]

[Plot: IBM SP. x-axis: label not recovered (ticks 0 to 9e+06); y-axis: time (microsec.) (0 to 160000). Series: IBM MPI, MPICH New.]

[Plot: Myrinet Cluster, 64 nodes. x-axis: label not recovered (ticks 0 to 300); y-axis: time (microsec.) (200 to 900). Series: MPICH Old, MPICH New.]

[Figure: data blocks with two-digit labels 00 to 55 held by processes P0 to P5. Panels: Initial Data, After local rotation, After communication step 0, After local inverse rotation.]

[Figure: communication among processes P0 to P7, with stages labeled Step 1, Step 2, and Step 3.]

[Plot: IBM SP. x-axis: label not recovered (ticks 0 to 9000); y-axis: time (microsec.) (0 to 1600). Series: IBM MPI, MPICH New.]

[Plot: Myrinet Cluster. x-axis: label not recovered (ticks 0 to 9e+06); y-axis: time (microsec.) (0 to 400000). Series: MPICH Old, MPICH New.]

[Plot: Myrinet Cluster. x-axis: label not recovered (ticks 0 to 9e+06); y-axis: time (microsec.) (0 to 450000). Series: MPICH Old, MPICH New.]

[Plot: Fastest Protocol for Allreduce(sum,dbl). x-axis: buffersize [bytes] (8 to 8M); y-axis: number of MPI processes (2 to 512). Legend: vendor; binary tree; pairwise + ring; halving + doubling; recursive doubling; binary blocks halving + doubling; break-even points: size=1k and 2k and min((size/256)^(9/16), ...).]

[Plot: Allreduce(sum,dbl), buffersize = 32 kb. x-axis: number of MPI processes (2 to 256); y-axis: bandwidth [Mb/s] (0 to 100). Series: vendor; binary tree; pairwise + ring; halving + doubling; binary blocks halving + doubling; recursive doubling; chosen best.]

[Plot: title truncated in the source as "Allreduce(sum,dbl) - ratio := best bandwidth of 4 new al". x-axis: buffersize [bytes] (8 to 8M); y-axis: number of MPI processes (16 to 512).]

[Plot: Allreduce(sum,dbl) - ratio := best bandwidth of 4 new algo.s / vendor’s bandwidth. x-axis: buffersize [bytes] (8 to 8M); y-axis: number of MPI processes (2 to 128). Legend (ratio bins): 100 <= ratio; 50 <= ratio < 100; 20 <= ratio < 50; 10 <= ratio < 20; 7.0 <= ratio < 10; 5.0 <= ratio < 7.0; 3.0 <= ratio < 5.0; 2.0 <= ratio < 3.0; 1.5 <= ratio < 2.0; 1.1 <= ratio < 1.5; 0.9 <= ratio < 1.1; 0.7 <= ratio < 0.9; 0.0 <= ratio < 0.7.]

[Plot: title not recovered. x-axis: buffersize [bytes] (8 to 8M); y-axis: number of MPI processes (4 to 512).]


[Plot: Allreduce(sum,dbl) - ratio := best bandwidth of 4 new algo.s / vendor’s bandwidth. x-axis: buffersize [bytes] (8 to 8M); y-axis: number of MPI processes (2 to 256).]


[Plot: title not recovered. x-axis: buffersize [bytes] (8 to 8M); y-axis: number of MPI processes (2 to 256).]


[Plot: Reduce(sum,dbl) - ratio := best bandwidth of 4 new algo.s / vendor’s bandwidth. x-axis: buffersize [bytes] (8 to 8M); y-axis: number of MPI processes (2 to 256).]


[Plot: title truncated in the source as "Allreduce(maxloc,dbl) - ratio := best bandwidth of 5 new". x-axis: buffersize [bytes] (8 to 8M); y-axis: number of MPI processes (2 to 256).]

[Plot: Reduce(maxloc,dbl) - ratio := best bandwidth of 4 new algo.s / vendor’s bandwidth. x-axis: buffersize [bytes] (8 to 8M); y-axis: number of MPI processes (2 to 256).]

