AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER Author : Y. Zhao 、 C. Hu...
-
Upload
stella-fowler -
Category
Documents
-
view
217 -
download
0
Transcript of AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER Author : Y. Zhao 、 C. Hu...
AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE
OF SMP-CLUSTER
AuthorAuthor :: Y. ZhaoY. Zhao 、 、 C. HuC. Hu 、、 S. WangS. Wang 、 、 S. ZhangS. ZhangSource Source :: Proceedings of the 2nd IASTED international Proceedings of the 2nd IASTED international
conference on Advances in computer science conference on Advances in computer science and technology and technology
Speaker Speaker : : Cheng-Jung WuCheng-Jung Wu
OutlineOutline
Introduction Extensions in EOMP
Computing Resource Definition Hierarchical Data Layout and Data Mapping
Execution Model for EOMP Experiments and Results
Dot Product Matrix multiplication under EOMP execution mode
l on SMP cluster Conclusion
Introduction Clusters of shared-memory multiprocessors (SMPs)
More and more popular in High Performance Computing area
SMP clusters’ hybrid architectures Supports for a wide range of parallel paradigm
Three programming paradigms
Standard message passing Hybrid paradigm corresponding to the underlying architecture Shared memory paradigm built on a Software Distribute Softwar
e Memory (SDSM)
Three major metrics Performance、 Portability 、 Programmability
Introduction
None of the three parallel programming paradigms can meet on all of the three metrics
New parallel paradigm (EOMP) A compromising model
Balance the three major metrics Features
Good programmability Acceptable performance
Improve memory behavior Data locality The programs running on SMP cluster
Inter-node and intra-node data locality
Extensions in EOMP
Since OpenMP Shared-memory systems Lacks the support for distributed memory system
New directives Computing resource definition Data mapping
Computing Resource Definition
Definitions Virtual node (VN) Virtual processor (VP)
VNs Physical nodes Target units of inter-node data distribution
VPs Physical processors Target units of intra-node data reallocation and task s
cheduling during compilation
Computing Resource Definition
Semantics of computing resource definition directives Examples for processor mapping are given
Hierarchical Data Layout and Data Mapping : Inter-node Data Mapping
Scalar data defined in the EOMP Shared data at default
Every node gets an own copy of the data
Inter-node task parallel Allows the shared scalar data be modified in certain nodes
Global addresses of distributed arrays Llocal addresses
Inter-node data mapping distributes the mapped arrays to VNs
Semantics for inter-node data distribution directive: #pragma eomp distribute a (BLOCK*) onto N
Hierarchical Data Layout and Data Mapping : Intra-node Data Mapping
Shared memory data layout takes the advantage of global address
Technically No further data mapping is required inside the nodes
In certain cases Improper order of data access would decrease cache performance
false sharing or long-stride access
For instance: two threads always access the nearby array elements in memory at same time cache performance may be very poor due to severe false sharing
Optimizations for intra-node data layout will be necessary
Hierarchical Data Layout and Data Mapping : Intra-node Data Mapping
An extreme example Experiment on that circumstance shows an overall 90% reductio
n of L1 cache miss after the intra-node data reallocation optimization
(On 4-cpu IA64 SMP; the array a is of 1M size).
Hierarchical Data Layout and Data Mapping : Intra-node Data Mappin
g Two strategies can be adopted to reduce cache miss
Rearrange the access order of each thread Not always possible for compiler optimization It depends closely on the source program structure In the interleaving data case above, this means to avoid accessing t
he neighboring data in memory at the same time.
Reallocate the data layout in memory, Not change data dependencies of the source program Assures the correctness of this optimization Store the data that accessed by the same thread in a contiguous me
mory block
Hierarchical Data Layout and Data Mapping : Intra-node Data Mapping Intra-node data reallocation
programmer-specified directives compiler reference analysis
Intra-node data reallocating data in memory Additional time and space overheads Evaluating the performance speedup of this optimization The data locations have been changed
The reallocated data should be forbidden
Semantics for intra-node data reallocation directive #pragma eomp distribute a (CYCLIC,*) intra
Execution Model for EOMP
Inter-node barriers and broadcasts Modifications of shared vari
ables in the parallel section at the edges of task parallel region
Maintain data consistency
Inter-node communications use explicit message passing
Execution Model for EOMP
Massage passing & multithreading program generated By compiler first distributes d
ata and schedule the tasks across nodes
Then deals with the intra-node data reallocation and task scheduling
Experiments and Results : Dot Product
Experiments and Results : Dot Product
The experiment result shows that the efficiency of the EOMP based on the runtime library is similar to the MPI+OpenMP program (better under some cases)
But not good as pure MPI, because the amount of calculations in the dot product operation are not enough, comp
aring to the cost of inside-node scheduling
Matrix multiplication under EOMP execution model on SMP cluster
C=A*B A and C is distributed in rows B is distributed in columns
Matrix multiplication under EOMP execution model on SMP cluster
Matrix size is small The cost of inter-node scheduling and communications are relati
vely high (compared with the computation cost) The three distributed memory models can not acquire a speedup
Matrix size becomes larger The three distributed memory models achieve reasonable speed
ups
Notice that the EOMP model after intra-node data reallocation Gets a high speedup when the matrix size is large Showing that the improved intra-node cache performance can gr
eatly benefit the overall performance of the program on SMP clusters
Matrix multiplication under EOMP execution model on SMP cluster
Peaks of EOMP-INDR curves in 500*500 and 1000*1000 cases The effect of data reallocation is related with both the size of cac
he line local b As the nodes become more and more, the size of local b on eac
h node becomes smaller That means the cache line may fill in more rows of local b, thus t
he cache misses is reduced Explain why the peak in 500*500 multiplication case comes earli
er than that of the 1000*1000 case
Matrix multiplication under EOMP execution model on SMP cluster
Conclusion
The experiment result Feasibility of our execution model The benefit gained from intra-node data reallocation
For future work, we plan to develop a complete source to source EOMP compiler Be based on ORC (Open Resource Compiler for IA64) Our current runtime library prototype Focusing on the communication generation and data manageme
nt.