OpenMP for Networks of SMPs Y. Charlie Hu, Honghui Lu, Alan L. Cox, Willy Zwaenepoel
ECE1747 – Parallel Programming
Vicky Tsang
Background
Published in the Journal of Parallel and Distributed Computing, vol. 60 (12), pp. 1512-1530, December 2000
Work to further improve TreadMarks
Presents an alternative solution to MPI
Roadmap
Motivation
Solution
OpenMP API
TreadMarks
OpenMP Translator
Performance Measurement
Results
Conclusion
Motivation
To enable the programmer to rely on a single, standard, shared-memory API for parallelization within and between multiprocessors.
To provide another standard other than MPI?
Solution
Presents the first system that implements OpenMP on a network of shared-memory multiprocessors
Implemented via a translator converting OpenMP directives to calls in modified TreadMarks
Modified TreadMarks uses POSIX threads for parallelism within an SMP node
Solution
Original version of TreadMarks:
  A Unix process was executed on each processor of the multiprocessor node and communication between processes was achieved through message passing
  Fails to take advantage of hardware shared memory
Solution
Modified version of TreadMarks:
  POSIX threads used to implement parallelism
  OpenMP threads within a multiprocessor share a single address space
Positive:
  Reduces the number of changes to TreadMarks to support multithreading on a multiprocessor
  OS maintains the coherence of page mappings automatically
Negative:
  More difficult to provide uniform sharing of memory between threads on the same node and threads on different nodes
OpenMP API
Three kinds of directives:
  Parallelism/work sharing
  Data environment
  Synchronization
Based on a fork-join model:
  Sequential code sections executed by master thread
  Parallel code sections are executed by all threads, including the master thread
OpenMP API
Parallel directive – all threads perform the same computation
Work sharing directive – computation is divided among the threads
Data environment directive – controls the sharing of program variables
Synchronization directive – controls the synchronization between threads
TreadMarks
User-level software distributed shared memory (SDSM) system
Provides a global shared address space on top of physically distributed memories
Key functions performed are memory coherence and synchronization
TreadMarks – Memory Coherence
Minimizes the amount of communication performed to maintain memory consistency by:
  using a lazy implementation of release consistency
  reducing the impact of false sharing by allowing multiple concurrent writers to modify a page
Propagation of consistency information is postponed until the time of an acquire
TreadMarks – Synchronization
Barrier implemented as acquire and release messages
Governed by a centralized manager
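A centralized barrier of this kind can be sketched in Python with queues standing in for network messages. This is only a model under stated assumptions: node 0 is taken to be the manager, an arrival message plays the role of a release, a departure message plays the role of an acquire, and the consistency information that real messages would carry is omitted. `NUM_NODES`, `send`, `manager`, and `client` are hypothetical names.

```python
import queue
import threading

NUM_NODES = 4                                    # hypothetical cluster size
to_manager = queue.Queue()                       # arrival messages
to_client = {i: queue.Queue() for i in range(1, NUM_NODES)}  # departures
sent = []                                        # log of every message sent
after_barrier = []                               # nodes that passed the barrier


def send(q, msg):
    """Deliver a message and record it in the log."""
    sent.append(msg)
    q.put(msg)


def manager():
    # Node 0 collects one arrival message from every other node ...
    for _ in range(NUM_NODES - 1):
        to_manager.get()
    # ... and only then lets every node continue.
    for node_id, q in to_client.items():
        send(q, ("depart", node_id))
    after_barrier.append(0)


def client(node_id):
    send(to_manager, ("arrive", node_id))   # acts as a release
    to_client[node_id].get()                # acts as an acquire
    after_barrier.append(node_id)


threads = [threading.Thread(target=manager)]
threads += [threading.Thread(target=client, args=(i,))
            for i in range(1, NUM_NODES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With one manager and three other nodes, the log contains three arrival and three departure messages, that is, 2 × (n − 1) messages per barrier in this model.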
TreadMarks – Modifications for OpenMP
Inclusion of two primitives:
  Tmk_fork
  Tmk_join
All threads created at the start of a program’s execution to minimize overhead.
Slave threads are blocked during sequential execution until the next Tmk_fork is issued by the master thread.
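The behaviour described above (threads created once, then blocked between parallel regions) can be modelled with a small Python class. It is a sketch of the idea only, not the TreadMarks implementation; `ForkJoinPool`, `fork`, `join`, and `shutdown` are hypothetical names, and Python threads stand in for the real worker threads.

```python
import threading


class ForkJoinPool:
    """Workers are created once and sleep between parallel regions."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self._cond = threading.Condition()
        self._generation = 0      # incremented by every fork
        self._task = None
        self._pending = 0         # workers still inside the current region
        self._shutdown = False
        self._workers = [
            threading.Thread(target=self._worker_loop, args=(tid,), daemon=True)
            for tid in range(1, num_threads)
        ]
        for w in self._workers:
            w.start()

    def _worker_loop(self, tid):
        seen = 0
        while True:
            with self._cond:
                # Block during sequential execution until the next fork.
                while self._generation == seen and not self._shutdown:
                    self._cond.wait()
                if self._shutdown:
                    return
                seen = self._generation
                task = self._task
            task(tid)
            with self._cond:
                self._pending -= 1
                self._cond.notify_all()

    def fork(self, task):
        """Start a parallel region: wake the workers, then run as thread 0."""
        with self._cond:
            self._task = task
            self._pending = self.num_threads - 1
            self._generation += 1
            self._cond.notify_all()
        task(0)

    def join(self):
        """End the parallel region: wait until every worker has finished."""
        with self._cond:
            while self._pending > 0:
                self._cond.wait()

    def shutdown(self):
        with self._cond:
            self._shutdown = True
            self._cond.notify_all()
        for w in self._workers:
            w.join()


pool = ForkJoinPool(4)
results = [0] * 4


def square(tid):
    results[tid] = tid * tid


def increment(tid):
    results[tid] += 1


pool.fork(square)       # first parallel region
pool.join()
# ... sequential code would run here on the master thread only ...
pool.fork(increment)    # second region reuses the same three workers
pool.join()
pool.shutdown()
```

Both regions run on the same worker threads, so the cost of thread creation is paid once at start-up rather than at every fork.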
TreadMarks – Modifications for Networks of Multiprocessors
POSIX threads enable sharing of data between processors.
Some data structures, such as message buffers, were placed in thread-private memory because they must remain private to each thread.
A per-page mutex was added to allow greater concurrency in the page fault handler.
Synchronization functions in TreadMarks were modified to use POSIX thread-based synchronization between processors within a node and existing TreadMarks synchronization functions between nodes.
A second mapping was added for the memory that is shared between nodes so shared-memory pages can be updated while the first mapping remains invalid until the update is complete. This reduces the number of page protection operations performed by TreadMarks.
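The second-mapping idea can be tried out with Python's `mmap` module. This is a rough sketch under several assumptions: a temporary file stands in for the shared-memory object, a read-only mapping stands in for the application's protected mapping, and no page protections are changed at any point. The names `app_view` and `update_view` are hypothetical.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE

# Backing object for one "shared page" (a temporary file is used here
# purely for illustration).
backing = tempfile.TemporaryFile()
backing.truncate(PAGE)

# First mapping: what application threads see. Read-only here as a
# stand-in for a page that stays protected while it is being updated.
app_view = mmap.mmap(backing.fileno(), PAGE, access=mmap.ACCESS_READ)

# Second mapping of the same memory: always writable, used only by the
# runtime to install incoming updates.
update_view = mmap.mmap(backing.fileno(), PAGE, access=mmap.ACCESS_WRITE)

# The runtime installs an update through the second mapping ...
update_view[0:5] = b"fresh"

# ... while the application mapping never became writable.
try:
    app_view[0:5] = b"stale"
    app_write_allowed = True
except TypeError:
    app_write_allowed = False

# The update is nevertheless visible through the application mapping.
seen_by_app = bytes(app_view[0:5])

app_view.close()
update_view.close()
backing.close()
```

Because the update goes through a separate mapping, the application's mapping does not have to be opened for writing and then protected again, which is the saving in page protection operations that the slide refers to.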
OpenMP Translator
Synchronization directives translate directly to TreadMarks synchronization operations.
The compiler translates the code sections marked with parallel directives into fork-join code.
Data environment directives are implemented to work with both TreadMarks and POSIX threads, hiding the interface issues from the programmer.
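The shape of the translated code can be pictured with a Python analogue. The real translator works on source code annotated with OpenMP directives; here the directive appears only as a comment, and `fork_join` and `outlined_region` are hypothetical names that mimic the structure of the output: the parallel region becomes its own function, shared variables are reached by reference, and private variables become locals.

```python
import threading

NUM_THREADS = 4  # hypothetical thread count


def fork_join(region, shared):
    """Stand-in for the runtime: run region on every thread, then wait."""
    workers = [
        threading.Thread(target=region, args=(tid, NUM_THREADS, shared))
        for tid in range(1, NUM_THREADS)
    ]
    for w in workers:
        w.start()
    region(0, NUM_THREADS, shared)   # the master takes part as thread 0
    for w in workers:
        w.join()


# Source form (shown as comments): a loop marked with a parallel
# work-sharing directive.
#
#   parallel for, shared(a, b), private(i)
#   for i in 0 .. n-1:
#       b[i] = 2 * a[i]


def outlined_region(tid, nthreads, shared):
    """Body of the parallel loop, moved into its own function."""
    a, b = shared["a"], shared["b"]      # shared data reached by reference
    n = len(a)
    chunk = (n + nthreads - 1) // nthreads
    lo = tid * chunk
    hi = min(n, lo + chunk)
    for i in range(lo, hi):              # i is private: a local variable
        b[i] = 2 * a[i]


a = list(range(10))
b = [0] * 10
fork_join(outlined_region, {"a": a, "b": b})
```

Each thread receives a contiguous block of iterations, so no two threads write the same element of `b` and no synchronization is needed inside the loop.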
Performance Measurement
Platform: IBM SP2 consisting of four SMP nodes
Per node:
  Four IBM PowerPC 604 processors
  1 GB memory
  Running AIX 4.2
Performance Measurement
Applications:
  SPLASH-2 Barnes-Hut
  NAS 3D-FFT
  SPLASH-2 CLU
  SPLASH-2 Water
  Red-Black SOR
  TSP
  Modified Gram-Schmidt (MGS)
Results
Conclusion
Enables the programmer to rely on a single, standard, shared-memory API for parallelization within and between multiprocessors.
Using hardware shared memory reduced the amount of data and the number of messages transmitted.
The speedups of multithreaded TreadMarks codes on four four-way SMP SP2 nodes are within 7-30% of the MPI versions.
Critique
Solution allows easier implementation of program parallelization across multiprocessors if speedup is not crucial
OpenMP is easier on the programmer, but its speedup is still not as good as that of MPI
Critique
Issues:
  AIX has an inefficient implementation of page protection
  Paper claims that every other brand of Unix, including Linux, uses data structures that handle mprotect operations more efficiently
  Why wasn’t the solution implemented on another platform?
Paper failed to present a big motivation for using this solution over MPI.
Thank You