Kernel-Level Support for Scalable Intra-Node Collective Communications

Hyun-Wook Jin and Joong-Yeon Cho
System Software Laboratory
Dept. of Computer Science and Engineering
Konkuk University
[email protected]


Page 1: Kernel-Level Support for Scalable Intra-Node Collective ...mug.mvapich.cse.ohio-state.edu/static/media/mug/... · X100 Series (Knights Corner) Xeon Phi 7200 Series (Knights Landing)


Contents

• MPI intra-node communication
• Intra-node collective communication
  – MPI_Bcast()
  – MPI_Gather()
• Conclusions and future work


Multi/Many-Core Processors

• Multi-core (cores per processor)
  – Xeon 5100 Series (Woodcrest): 2
  – Xeon 5500 Series (Gainestown): 4
  – Xeon 5600 Series (Westmere-EP): 6
  – Xeon E5-2600 Series (Sandy Bridge-EP): 8
  – Xeon E5-2600 v2 Series (Ivy Bridge-EP): 12
  – Xeon E5-2600 v3 Series (Haswell-EP): 18
  – Xeon E7 v4 Family (Broadwell): 24
  – Xeon Platinum Series (Skylake): 28
• Many-core (cores per processor)
  – Xeon Phi X100 Series (Knights Corner): 61
  – Xeon Phi 7200 Series (Knights Landing): 72


MPI Intra-Node Communication

• Loopback
  – NIC provides a loopback path
  – Two DMAs
• Shared memory
  – Communicate through a memory area shared between MPI processes
  – Two data copies

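[Figure, loopback: Process A (Processor i) and Process B (Processor j); the Send Buf is moved to the Recv Buf through the NIC with two DMAs]

[Figure, shared memory: Process A (Processor i) copies the Send Buf into Shared Memory, and Process B (Processor j) copies it from Shared Memory into the Recv Buf]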


MPI Intra-Node Communication

• Memory mapping
  – Directly move a message from source to destination buffer by means of kernel-level support
  – Single data copy
    • Beneficial for large messages

[Figure, memory mapping: the Send Buf of Process A (Processor i) is moved into the Recv Buf of Process B (Processor j) with a single Direct Copy]
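The difference between the shared-memory path (two copies) and the memory-mapping path (one copy) can be sketched in a few lines of Python. This is only a single-process model of the copy counts, not MVAPICH2 or LiMIC2 code; the function names and the 8 KB staging area are assumptions made for illustration.

```python
def two_copy_transfer(send_buf, recv_buf, shared):
    """Shared-memory style transfer through a staging area.

    Returns the total number of bytes copied.
    """
    copied = 0
    chunk = len(shared)
    for off in range(0, len(send_buf), chunk):
        n = min(chunk, len(send_buf) - off)
        shared[:n] = send_buf[off:off + n]   # copy 1: sender -> shared area
        recv_buf[off:off + n] = shared[:n]   # copy 2: shared area -> receiver
        copied += 2 * n
    return copied


def one_copy_transfer(send_buf, recv_buf):
    """Memory-mapping style transfer: one direct copy, no staging area."""
    recv_buf[:len(send_buf)] = send_buf
    return len(send_buf)


message = bytes(range(256)) * 1024           # 256 KB of sample data
dst_two = bytearray(len(message))
dst_one = bytearray(len(message))
bytes_two = two_copy_transfer(message, dst_two, bytearray(8 * 1024))
bytes_one = one_copy_transfer(message, dst_one)
```

Both receivers end up with the same data, but the staged path moves every byte twice, which is why the single-copy path pays off mainly for large messages.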


Kernel-Level Support for MMapping

• LiMIC2
  – Opened the era of one-copy intra-node communication
    • H.-W. Jin, S. Sur, L. Chai, and D. K. Panda, “LiMIC: Support for High-Performance MPI Intra-Node Communication on Linux Cluster,” In Proc. of ICPP-05, Jun. 2005.
    • H.-W. Jin, S. Sur, L. Chai, and D. K. Panda, “Lightweight Kernel-Level Primitives for High-Performance MPI Intra-Node Communication over Multi-Core Systems,” In Proc. of IEEE Cluster 2007, Sep. 2007.
  – LiMIC2-0.5 was publicly released with MVAPICH2-1.4RC1 (Jun. 2009)
  – LiMIC2-0.5.6 is being released with the latest MVAPICH2
    • mvapich2-src]$ ./configure --with-limic2 [omit other configure options]
    • mvapich2-src]$ mpirun_rsh -np 4 -hostfile ~/hosts MV2_SMP_USE_LIMIC2=1 [path to application]


Kernel-Level Support for MMapping

• CMA
  – In-kernel implementation + New system calls
    • J. Vienne, “Benefits of Cross Memory Attach for MPI Libraries on HPC Clusters,” In Proc. of XSEDE 14, Jul. 2014.
  – Default intra-node communication channel for large messages in MVAPICH2
• XPMEM
  – Supports memory mapping to user-level address space
    • B. Kocoloski and J. Lange, “XEMEM: Efficient Shared Memory for Composed Applications on Multi-OS/R Exascale Systems,” In Proc. of HPDC 2015, 2015.


Intra-Node Collective Communication

• MPI_Bcast()
  – Broadcasts a message from the root to all other processes of the communicator
    • One-to-Many: Root -> Other processes
  – MVAPICH2 (version 2.3) uses the collective-aware shared memory
• MPI_Gather()
  – Gathers together values from a group of processes
    • Many-to-One: All processes -> Root
  – MVAPICH2 (version 2.3) uses the kernel-level support (either CMA or LiMIC2) for large messages
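The one-to-many and many-to-one data movement of the two collectives can be modeled without any MPI library. The sketch below is a pure-Python model of the semantics only; `bcast` and `gather` are hypothetical helpers, with a list index standing in for the rank.

```python
def bcast(buffers, root):
    """One-to-many: every rank ends up with a copy of the root's message."""
    message = buffers[root]
    return [bytes(message) for _ in buffers]


def gather(buffers, root):
    """Many-to-one: the root receives all messages in rank order.

    Non-root ranks receive nothing, modeled here as None.
    """
    gathered = b"".join(buffers)
    return [gathered if rank == root else None for rank in range(len(buffers))]


per_rank = [b"A", b"B", b"C", b"D"]          # one buffer per rank, rank 0 is the root
after_bcast = bcast(per_rank, root=0)
after_gather = gather(per_rank, root=0)
```

In the broadcast the root is the only reader of its source buffer's owner side and every other rank is a writer of its own destination; in the gather the roles flip, and the root's destination buffer is the single target.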


MPI_BCAST


MPI_Bcast() in MVAPICH2 (v.2.3)

[Figure: Root Process Source Buffer -> Collective-aware Shared Memory -> Destination Buffer of the other processes (Processes B, C, …, N)]

1. Copies 8KB data blocks to the shared memory (by the root process)
2. Copies 8KB data blocks to the destination buffer (by the other processes)
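The two steps above can be sketched as a block-wise loop. This is a sequential Python model, assuming a single 8 KB shared slot and hypothetical names (`shm_bcast`, `BLOCK`); the real implementation runs the root and the receivers concurrently and may use several slots.

```python
BLOCK = 8 * 1024  # block size stated on the slide


def shm_bcast(src, num_receivers, block=BLOCK):
    """Broadcast src block by block through one shared slot.

    Returns the receivers' buffers and the number of block copies
    performed by the root and by the receivers.
    """
    dests = [bytearray(len(src)) for _ in range(num_receivers)]
    shared = bytearray(block)            # assumed: a single shared slot
    root_copies = 0
    recv_copies = 0
    for off in range(0, len(src), block):
        n = min(block, len(src) - off)
        shared[:n] = src[off:off + n]    # step 1: root -> shared memory
        root_copies += 1
        for dest in dests:
            dest[off:off + n] = shared[:n]   # step 2: shared memory -> destination
            recv_copies += 1
    return dests, root_copies, recv_copies


payload = bytes(range(256)) * 1024       # 256 KB, the size used in the profiling slides
received, n_root, n_recv = shm_bcast(payload, num_receivers=3)
```

For a 256 KB message the root performs 32 block copies and each receiver performs 32 more, so the achievable speedup depends on how well step 2 for one block overlaps with step 1 for the next.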


How bad is LiMIC2 for MPI_Bcast()?

• Experimentally applied LiMIC2 instead of shared memory
  – Shows up to 548% higher latency


Why not use LiMIC2 in MPI_Bcast()?

• What we expected…

[Figure: timeline for P0 (Root), P1, P2, and P3 inside MPI_Bcast(): Send Descriptor, Memory Mapping, Data Copy, Memory Unmapping, return]


Why not use LiMIC2 in MPI_Bcast()?

• What actually happened…

[Figure: timeline for P0 (Root), P1, P2, and P3 inside MPI_Bcast(): Send Descriptor, Memory Mapping (get_user_pages()), Data Copy, Memory Unmapping, return]


MPI_Bcast() with LiMIC2-overlap

• The root performs memory mapping and the others reuse (share) the mapped area

[Figure: timeline for P0 (Root), P1, P2, and P3 inside MPI_Bcast(): Send Descriptor, Memory Mapping, Data Copy, Memory Unmapping, return]
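A rough way to see why reusing the mapping helps is to count mapping operations. The sketch below assumes that, without reuse, each non-root process maps the root's buffer on its own, and that a page is 4096 bytes; both are assumptions for illustration, and `mapping_ops` is a hypothetical helper rather than a LiMIC2 interface.

```python
import math

PAGE_SIZE = 4096  # assumed page size in bytes


def mapping_ops(msg_bytes, num_procs, reuse_mapping):
    """Count mapping operations and pinned pages for one broadcast.

    Without reuse, each of the num_procs - 1 receivers maps the root's
    buffer itself (assumption); with reuse, the root maps it once.
    """
    pages = math.ceil(msg_bytes / PAGE_SIZE)
    mappings = 1 if reuse_mapping else num_procs - 1
    return {"mappings": mappings, "pages_pinned": mappings * pages}


per_receiver = mapping_ops(256 * 1024, num_procs=20, reuse_mapping=False)
shared_once = mapping_ops(256 * 1024, num_procs=20, reuse_mapping=True)
```

Under these assumptions the number of mapping operations no longer grows with the process count, so the mapping cost is paid once instead of once per receiver.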


Preliminary Measurement Results

• 20-core system
  – Intel Xeon Haswell Deca-Core x 2
  – LiMIC2-overlap reduces the latency by up to 68%
• 120-core system
  – Intel Xeon IvyBridge 15-Core x 8
  – LiMIC2-overlap reduces the latency by up to 84%
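To relate a latency reduction in percent to a speedup factor, the arithmetic can be written out as below. The helper names are hypothetical and the timings in the example are made-up values chosen only to reproduce a 68% reduction; they are not measurements from the slides.

```python
def latency_reduction(baseline_us, new_us):
    """Percentage by which new_us is lower than baseline_us."""
    return 100.0 * (baseline_us - new_us) / baseline_us


def speedup_from_reduction(reduction_pct):
    """Speedup factor implied by a latency reduction given in percent."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Hypothetical timings: 100 us before, 32 us after.
example_reduction = latency_reduction(100.0, 32.0)
speedup_20_core = speedup_from_reduction(68.0)
speedup_120_core = speedup_from_reduction(84.0)
```

A 68% reduction corresponds to a speedup of roughly 3.1x, and an 84% reduction to roughly 6.25x.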


What’s going on in MVAPICH2 Bcast?

[Figure: Root Process Source Buffer -> Collective-aware Shared Memory -> Destination Buffer of the other processes (Processes B, C, …, N)]

1. Copies 8KB data blocks to the shared memory (by the root process)
2. Copies 8KB data blocks to the destination buffer (by the other processes)


What’s going on in MVAPICH2 Bcast?

• Collective-aware shared memory

• LiMIC2-overlap

* Message size: 256KB
* Some profiling overheads are included


What’s going on in MVAPICH2 Bcast?

• Data copy operations are not overlapped as much as expected

[Figure: timeline of the data copy operations; axes: Block ID, Time]


MPI_GATHER


MPI_Gather() in MVAPICH2 (v.2.3)

[Figure: the Source Buffers of Process P1, Process P2, ∙∙∙, Process P(N-1) are moved into an Intermediate Buffer, which is then copied into the Destination Buffer of the Root Process (P0)]

1. Allocates an intermediate buffer
2. Moves messages to the intermediate buffer via point-to-point communication
3. Copies the gathered messages to the destination buffer
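The three steps above can be sketched as follows. This is a single-process Python model with a hypothetical function name; slice assignments stand in for the point-to-point transfers, and all messages are assumed to have the same length.

```python
def gather_via_intermediate(sources, dest):
    """Gather equal-sized messages into dest through an intermediate buffer."""
    msg_len = len(sources[0])
    # Step 1: allocate an intermediate buffer large enough for all messages.
    intermediate = bytearray(msg_len * len(sources))
    # Step 2: move each message into its slot (models point-to-point transfers).
    for slot, src in enumerate(sources):
        intermediate[slot * msg_len:(slot + 1) * msg_len] = src
    # Step 3: copy the gathered messages into the destination buffer.
    dest[:len(intermediate)] = intermediate
    return dest


non_root_sources = [b"aa", b"bb", b"cc"]     # sample messages from P1, P2, P3
root_dest = gather_via_intermediate(non_root_sources, bytearray(6))
```

Step 3 is an extra pass over the whole gathered payload, which is one reason a direct copy into the destination buffer is attractive for large messages.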


Is it OK to use LiMIC2 in MPI_Gather()?

* Message size: 256KB
* Some profiling overheads are included


Why not use LiMIC2 in MPI_Gather()?

[Figure: timeline for P0 (Root), P1, P2, and P3 inside MPI_Gather(): after Send Descriptor, three separate sequences of Memory Mapping, Data Copy, and Memory Unmapping, labeled Gather for P1, Gather for P2, and Gather for P3, precede return]


MPI_Gather() with LiMIC2-overlap

[Figure: timeline for P0 (Root), P1, P2, and P3 inside MPI_Gather(): Memory Mapping, Send Descriptor, Data Copy, Memory Unmapping, return]
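One possible reading of this scheme is that the root's destination buffer is mapped once and each non-root process then copies its own message into its slot. The slides do not spell out the protocol, so the sketch below is an assumption: threads stand in for processes, a `memoryview` stands in for the mapped area, and `gather_senders_copy` is a hypothetical name.

```python
import threading


def gather_senders_copy(sources, dest):
    """Each sender copies its own message into its slot of dest."""
    msg_len = len(sources[0])
    mapped = memoryview(dest)   # stands in for mapping the destination once

    def copy_in(slot, src):
        # Slots are disjoint, so the senders never write the same bytes.
        mapped[slot * msg_len:(slot + 1) * msg_len] = src

    workers = [threading.Thread(target=copy_in, args=(slot, src))
               for slot, src in enumerate(sources)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    mapped.release()            # stands in for the single unmapping
    return dest


sender_msgs = [b"aa", b"bb", b"cc"]          # sample messages from P1, P2, P3
gathered = gather_senders_copy(sender_msgs, bytearray(6))
```

Compared with mapping, copying, and unmapping once per sender at the root, this arrangement needs one mapping and lets the copies proceed independently of each other.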


Preliminary Measurement Results

• 20-core system
  – LiMIC2-overlap reduces the latency by up to 88%
• 120-core system
  – LiMIC2-overlap reduces the latency by up to 50%
  – Different algorithms matter (e.g., binomial tree algorithm)


CONCLUSIONS


Concluding Remarks

• Intra-node collective communication
  – MPI_Bcast()
    • One-to-Many communication
    • Implemented using collective-aware shared memory
  – MPI_Gather()
    • Many-to-One communication
    • Implemented using point-to-point communication
• LiMIC2-overlap
  – New interfaces
    • Memory mapping reuse
    • Flexibility of who can perform data copy
  – Up to 84% improvement for MPI_Bcast()
  – Up to 88% improvement for MPI_Gather()


Ongoing Work

• Other collectives
  – MPI_Scatter()
    • LiMIC2-overlap reduces the latency by up to 78% on the 20-core system
  – MPI_Allgather()
    • LiMIC2-overlap reduces the latency by up to 38% on the 20-core system


Ongoing Work

• Overlapping between collective communication and computation

[Figure: two timelines for P0 (Root), P1, P2, and P3 showing MPI_Bcast(), return, and Computation; the second timeline adds a sync step before return]


Future Work

LiMIC3


ParaMo 2019

• The 1st International Workshop on Parallel Programming Models in High-Performance Cloud
  – Co-located with Euro-Par 2019
  – Date: August 26, 2019
  – Venue: Göttingen, Germany


Thank You!
