Compiler, Languages, and Libraries
Transcript of Compiler, Languages, and Libraries
Compiler, Languages, and Libraries
ECE Dept., University of Tehran
Parallel Processing Course Seminar
Hadi Esmaeilzadeh
810181079
[email protected]
Introduction
Distributed systems are heterogeneous:
Power
Architecture
Data representation
Data access latencies are significantly long and vary with the underlying network traffic
Network bandwidths are limited and can vary dramatically with the underlying load
Programming Support Systems: Principles
Principle: each component of the system should do what it does best
The application developer should be able to concentrate on problem analysis and decomposition at a fairly high level of abstraction
Programming Support Systems: Goals
They should make applications easy to develop
Build applications that are portable across different architectures and computing configurations
Achieve high performance, close to what an expert programmer can achieve using the underlying features of the network and computing configuration
Exploit various forms of parallelism to balance work across a heterogeneous configuration
Minimizing the computation time
Matching the communication to the underlying bandwidths and latencies
Ensuring that the performance variability remains within certain bounds
Autoparallelization
The user focuses on what is being computed rather than how
The performance penalty should not be worse than a factor of two
Automatic vectorization
Dependence analysis
Asynchronous (MIMD) parallel processing
Symmetric multiprocessors (SMP)
Distributed Memory Architecture
Caches
Higher latency of large memories
Determine how to apportion data to the memories of processors in a way that:
Maximizes local memory access
Minimizes communication
Regions of parallel execution had to be large enough to compensate for the overhead of initiation and synchronization
Interprocedural analysis and optimization
Mechanisms that involve the programmer in the design of the parallelization, as well as the problem solution, will be required
Explicit Communication
Message passing to get data from remote memories
A single version of the program runs on all processors
The computation is specialized to specific processors by extracting the processor number and indexing the processor's own data
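As an illustration of this SPMD style (a minimal sketch, not taken from the slides), the following MPI program in C++ runs identically on every processor; each copy extracts its own rank and uses it to index its own slice of the data. The array size N is an arbitrary example value.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this processor's number
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        // The same executable runs everywhere; each copy works on its
        // own slice of a global array of N elements.
        const int N = 1000000;
        int chunk = N / nprocs;
        int lo = rank * chunk;
        int hi = (rank == nprocs - 1) ? N : lo + chunk;
        std::printf("rank %d of %d owns [%d, %d)\n", rank, nprocs, lo, hi);

        MPI_Finalize();
        return 0;
    }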
Send-Receive Model
Unlike a shared-memory environment, each processor not only receives the data it needs but also sends the data that other processors require
PVM
MPI
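A hedged sketch of the send-receive style with MPI (this particular neighbor exchange is an assumed example): each processor explicitly sends the block its neighbor needs and receives the block it needs; MPI_Sendrecv combines the two so that the blocking calls cannot deadlock.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        std::vector<double> mine(100, rank);   // data this processor owns
        std::vector<double> ghost(100, 0.0);   // copy of the right neighbor's data
        int right = (rank + 1) % nprocs;
        int left  = (rank + nprocs - 1) % nprocs;

        // Both sides take explicit action: send to the left neighbor and
        // receive from the right neighbor in one combined call.
        MPI_Sendrecv(mine.data(),  100, MPI_DOUBLE, left,  0,
                     ghost.data(), 100, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }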
Get-Put Model
The processor that needs data from a remote memory is able to explicitly get it, without requiring any explicit action by the remote processor
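The same exchange in the get-put style, again only as an illustrative sketch: with MPI one-sided communication the requesting processor pulls the data itself from a window exposed by the owner, and the owner takes no per-transfer action.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        // Each processor exposes 100 doubles that others may read.
        std::vector<double> local(100, rank);
        MPI_Win win;
        MPI_Win_create(local.data(), 100 * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        std::vector<double> remote(100);
        int target = (rank + 1) % nprocs;

        MPI_Win_fence(0, win);
        // Explicit "get" by the requester; no matching send on the target.
        MPI_Get(remote.data(), 100, MPI_DOUBLE, target,
                0, 100, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }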
Discussion
The programmer is responsible for:
Decomposition of the computation, taking the power of each individual processor into account
Load balancing
Layout of the memory
Management of latency
Organization and optimization of communication
Explicit communication can be thought of as an assembly language for grids
Distributed Shared Memory
DSM as a vehicle for hiding the complexities of memory and communication management
The address space appears to the programmer as flat as on a single-processor machine
The hardware/software is responsible for data retrieval, generating the needed communication with remote memories
Hardware Approach
Stanford DASH, HP/Convex Exemplar, SGI Origin
Local cache misses initiate data transfer from remote memory if needed
Software Scheme
Shared Virtual Memory, TreadMarks
Rely on the paging mechanism in the operating system
Transfer whole pages on demand between nodes
This makes the granularity and latency significantly large
Used in conjunction with relaxed memory consistency models and support for latency hiding
Discussion
The programmer is free from handling thread packaging and parallel loops
Has performance penalties and is therefore best suited to coarser-grained parallelism
Works best with some help from the programmer on the layout of memory
Is a promising strategy for simplifying the programming model
Data-Parallel Languages
High performance on distributed memory:
Allocate data to the various processor memories to maximize locality and minimize communication
For scaling parallelism to hundreds or thousands of processors, data parallelism is necessary
Data parallelism: subdividing the data domain in some manner and assigning the subdomains to different processors (data layout)
These are the foundations for data-parallel languages
Fortran D, Vienna Fortran, CM Fortran, C*, data-parallel C, and pC++
High Performance Fortran (HPF) and High Performance C++ (HPC++)
HPF
Provides directives for data layout on top of Fortran 90 and Fortran 95
Directives have no effect on the meaning of the program
They advise the compiler on how to assign elements of the program's arrays and data structures to different processors
These specifications are relatively machine independent
The principal focus is the layout of arrays
Arrays are typically associated with the data domains of the underlying problem
The principal drawback: limited support for problems on irregular meshes
Distribution via a run-time array
Generalized block distribution (blocks are allowed to be of different sizes)
For heterogeneous machines: block sizes can be adapted to the powers of the target machines (generalized block distribution)
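A rough sketch of the idea behind the generalized block distribution (the function name and interface are assumptions for illustration, not HPF syntax): block boundaries are chosen in proportion to the relative power of each target node rather than being equal.

    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Split n array elements into contiguous blocks whose sizes are roughly
    // proportional to each node's relative power. Returns the first index
    // owned by each node, plus a final entry equal to n.
    std::vector<int> blockStarts(int n, const std::vector<double>& power) {
        double total = std::accumulate(power.begin(), power.end(), 0.0);
        std::vector<int> starts(power.size() + 1, 0);
        double before = 0.0;
        for (std::size_t p = 0; p < power.size(); ++p) {
            starts[p] = static_cast<int>(before / total * n);
            before += power[p];
        }
        starts[power.size()] = n;
        return starts;
    }

With equal powers this degenerates to the ordinary BLOCK distribution of equal-sized pieces.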
HPC++
Unsynchronized for-loops
Parallel template libraries, with parallel or distributed data structures as a basis
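HPC++ compilers are not widely available today, so as a loose modern analogue (an assumption, not HPC++ syntax) here is an unsynchronized loop expressed with a standard C++17 parallel algorithm: the iterations are independent, so the runtime may spread them across threads with no synchronization in the body.

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<double> a(1000000, 1.0);
        // No iteration depends on another, so no locks or barriers are
        // needed inside the loop body.
        std::for_each(std::execution::par, a.begin(), a.end(),
                      [](double& x) { x = 2.0 * x + 1.0; });
        return 0;
    }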
Task Parallelism
Different components of the same computation are executed in parallel
Different tasks can be allocated to different nodes of the grid
Object parallelism (different tasks may be components of objects of different classes)
Task parallelism need not be restricted to shared-memory systems and can be defined in terms of a communication library
HPF 2.0 Extensions for Task Parallelism
Can be implemented on both shared- and distributed-memory systems
Provide a way for a set of cases to be run in parallel, with no communication until synchronization at the end
Remaining problems in using HPF on a computational grid:
Load matching
Communication optimization
Coarse-Grained Software Integration
A complete application is not a single, simple program
It is a collection of programs that must all be run, passing data to one another
The main technical challenge of the integration is how to prevent performance degradation due to sequential processing of the various programs
Each program could be viewed as a task
Tasks are collected and matched to the power of the various nodes in the grid
Latency Tolerance
Dealing with long memory or communication latencies
Latency hiding: data communication is overlapped with computation (software prefetching); see the sketch after this list
Latency reduction: programs are reorganized to reuse more data in local memories (loop blocking for cache)
Both are more complex to implement on heterogeneous distributed computers:
Latencies are large and variable
More time must be spent on estimating running times
Load Balancing
Spreading the calculation evenly across processors while minimizing communication
Simulated annealing, neural nets
Recursive bisection: at each stage, the work is divided into two equal parts (see the sketch after this list)
For the grid: the power of each node must be taken into account
Performance prediction of components is essential
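A sketch of the recursive bisection idea (assuming the work items are already ordered, e.g. along a space-filling curve; the function name and layout are illustrative): at each level the items are split into two parts of roughly equal total weight and handed to the two halves of the processor set.

    #include <numeric>
    #include <vector>

    // owner[i] receives the processor assigned to work item i.
    void bisect(const std::vector<double>& weight, int first, int last,
                int firstProc, int nProcs, std::vector<int>& owner) {
        if (nProcs == 1) {
            for (int i = first; i < last; ++i) owner[i] = firstProc;
            return;
        }
        double total = std::accumulate(weight.begin() + first,
                                       weight.begin() + last, 0.0);
        double half = 0.0;
        int split = first;
        while (split < last && half + weight[split] <= total / 2.0)
            half += weight[split++];
        bisect(weight, first, split, firstProc, nProcs / 2, owner);
        bisect(weight, split, last, firstProc + nProcs / 2,
               nProcs - nProcs / 2, owner);
    }

For a grid, the split point would instead be weighted by the relative power of the two processor groups rather than aiming at exactly half of the work.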
Runtime Compilation
A problem with automatic load balancing (especially on irregular grids):
Unknown loop upper bounds
Unknown array sizes
Inspector/executor model (sketched below):
Inspector: executed a single time at runtime; establishes a plan for efficient execution
Executor: executed on each iteration; carries out the plan defined by the inspector
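A schematic inspector/executor sketch (the Plan structure and function names are invented for illustration): the inspector scans the indirection array once, after the loop bounds and array sizes become known, and records which references are local and which are remote; the executor then reuses that plan on every iteration.

    #include <vector>

    struct Plan {
        std::vector<int> localRefs;   // references resolved in local memory
        std::vector<int> remoteRefs;  // references that must be fetched remotely
    };

    // Inspector: run once at runtime.
    Plan buildPlan(const std::vector<int>& idx, int lo, int hi) {
        Plan p;
        for (int i : idx)
            (i >= lo && i < hi ? p.localRefs : p.remoteRefs).push_back(i);
        return p;
    }

    // Executor: run every iteration, carrying out the plan.
    void runIteration(const Plan& p /*, data buffers, communicator */) {
        // 1. exchange the values listed in p.remoteRefs (e.g. with MPI)
        // 2. perform the computation using local and freshly received values
    }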
Libraries
Functional library: parallelized versions of standard functions are applied to user-defined data structures (ScaLAPACK, FFTPACK)
Data structure library: a parallel data structure is maintained within the library, and its representation is hidden from the user (DAGH)
Well suited for OO languages
Provides maximum flexibility to the library developer to manage runtime challenges:
Heterogeneous networks
Adaptive gridding
Variable latencies
Drawback: their components are currently treated by compilers as black boxes
Some sort of collaboration between compiler and library might be possible, particularly in an interprocedural compilation
Programming Tools
Tools like Pablo, Gist, and Upshot can show where performance bottlenecks exist
Performance-tuning tools
Future Directions (Assumptions)
The user is responsible for both problem decomposition and assignment
Some kind of service negotiator runs prior to execution and determines the available nodes and their relative power
Some portion of compilation will be invoked after this service
Task Compilation
Constructing a task graph, along with an estimate of the running time for each task:
Task graph construction and decomposition
Performance estimation
Restructuring the program to better suit the target grid configuration
Assignment of the components of the task graph to the available nodes (see the sketch after this list)
Java
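As a loose illustration of that last step, assigning task-graph components to nodes of different power (the greedy rule below is an assumption, not a method from the slides): each task is placed on the node that would finish it earliest, given the node's relative power and current load.

    #include <cstddef>
    #include <vector>

    std::vector<int> assignTasks(const std::vector<double>& taskCost,
                                 const std::vector<double>& nodePower) {
        std::vector<int> nodeOf(taskCost.size());
        std::vector<double> busyUntil(nodePower.size(), 0.0);
        for (std::size_t t = 0; t < taskCost.size(); ++t) {
            std::size_t best = 0;
            for (std::size_t n = 1; n < nodePower.size(); ++n) {
                double finishN    = busyUntil[n]    + taskCost[t] / nodePower[n];
                double finishBest = busyUntil[best] + taskCost[t] / nodePower[best];
                if (finishN < finishBest) best = n;
            }
            busyUntil[best] += taskCost[t] / nodePower[best];
            nodeOf[t] = static_cast<int>(best);
        }
        return nodeOf;
    }

A real task compiler would also have to respect dependences in the task graph and the communication costs between nodes.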
Grid Shared Memory (Challenges)
Different nodes have different page sizes and paging mechanisms
Good performance estimation
Managing the system-level interactions that provide DSM
Global Grid Compilation
Providing a programming language and compilation strategy targeted to the grid
A mixture of parallelism styles: data parallelism and task parallelism
Data decomposition
Function decomposition