Towards Future Programming Languages and Models for High-Productivity Computing
Hans P. Zima
University of Vienna, Austria, and
JPL, California Institute of Technology, Pasadena, CA
IFIP WG 10.3 Seminar September 7, 2004
Contents

1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
4 Cascade and the Chapel Language Design
5 Conclusion
Abstraction in Programming

Programming models and languages bridge the gap between "reality" and hardware, at different levels of abstraction, e.g.:
- assembly languages
- general-purpose procedural languages
- functional languages
- very high-level domain-specific languages

Abstraction implies loss of information: a gain in simplicity, clarity, verifiability, and portability, versus a potential performance degradation.
The Emergence of High-Level Sequential Languages

The designers of the very first high-level programming language were aware that their success depended on acceptable performance of the generated target programs:

John Backus (1957): "... It was our belief that if FORTRAN ... were to translate any reasonable scientific source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger ..."

High-level algorithmic languages became generally accepted standards for sequential programming because their advantages outweighed any performance drawbacks. For parallel programming, no similar development has taken place.
The Crisis of High Performance Computing

Current HPC hardware: large clusters at reasonable cost
- commodity clusters or custom MPPs
- off-the-shelf processors and memory components, built for the mass market
- latency and bandwidth problems

Current HPC software:
- application efficiency sometimes in single digits
- low-productivity programming models dominate
- focus on programming-in-the-small
- inadequate programming environments and tools

Resemblance to the pre-Fortran era of sequential programming:
- assembly language programming vs. MPI
- wide gap between the domain of the scientist and the programming language
Domination of Low-Level Paradigms

Exploiting performance on current parallel architectures requires low-level control and "heroic" programmers.

This has led to "local view" programming models, as exemplified by MPI-based programming (as well as Co-Array Fortran etc.):
- explicit processors: local views of data
- program state associated with memory regions
- explicit communication intertwined with the algorithm
The High Performance Fortran (HPF) Effort

The 1990s saw a strong (mainly, but not exclusively, academic) effort to improve this situation.

Focus on higher-level paradigms:
- CM Fortran
- Fortran D
- Vienna Fortran
- High Performance Fortran (HPF)
- and many others...

A wealth of research on related compiler technology.

These languages have not been broadly accepted, for a number of reasons...
Example: Parallel Jacobi With MPI

Assuming block column distribution:

  initialize MPI
  ...
  do while (.not. converged)
    ! local computation
    do J = 1, M
      do I = 1, N
        B(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
      end do
    end do
    A(1:N,1:M) = B(1:N,1:M)
    ! boundary exchange with neighboring processes
    if (MOD(myrank,2) .eq. 1) then
      call MPI_SEND(B(1,1),N,...,myrank-1,..)
      call MPI_RECV(A(1,0),N,...,myrank-1,..)
      if (myrank .lt. s-1) then
        call MPI_SEND(B(1,M),N,...,myrank+1,..)
        call MPI_RECV(A(1,M+1),N,...,myrank+1,..)
      endif
    else
      ...
    endif
  end do
Example: Parallel Jacobi With HPF

  !HPF$ PROCESSORS P(NUMBER_OF_PROCESSORS())
  !HPF$ DISTRIBUTE (*,BLOCK) ONTO P :: A, B   ! block column distribution

  ...
  do while (.not. converged)
    ! global computation
    do J = 1, N
      do I = 1, N
        B(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
      end do
    end do
    A(1:N,1:N) = B(1:N,1:N)
  end do

Communication is automatically generated by the compiler. A change of distribution requires modifying just one line in the source program; the algorithm itself does not change.
Example: Observations

- The HPF code is far simpler than the MPI code.
- Compiler-generated object code is as good as the MPI code:
  - parallelization of the loop
  - static analysis of access patterns and communication
  - aggregation and fusion of communication
- The data distribution directives could even be generated automatically from the sequential code.

So where is the problem with HPF?
HPF Problems

HPF 1 lacked important functionality:
- data distributions do not support irregular algorithms
- lack of flexibility for processor mapping and data/thread affinity
- focus on support for array types and SPMD data parallelism

HPF includes features that are difficult to implement.

Current architectures do not support well the dynamic and irregular features of advanced applications.

However: an HPF+/JA plasma code reached 12.5 TFLOPS on the Earth Simulator, using
- explicit high-level formulation of communication patterns
- explicit high-level control of communication schedules
- explicit high-level control of "halos"
The Effect of Schedule Reuse and Halo Management

[Figure: execution times (seconds, log scale from 10 to 100,000) of an FEM crash simulation kernel on 2 to 128 processors, comparing HPF (PGI), HPF/VFC, HPF+/VFC with schedule reuse, MPI/F77, and sequential F90. Language extensions (HPF+ / Earth Simulator): schedule reuse, halo control, pure procedures.]
Contents

1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
4 Cascade and the Chapel Language Design
5 Conclusion
State-of-the-Art

Current parallel programming language, compiler, and tool technologies are unable to support high productivity computing.

New programming models, languages, compiler, and tool technologies are necessary to address the productivity demands of future systems.
Global Goal

Make scientists and engineers more productive: provide a higher level of abstraction.

Support "Abstraction without Guilt" [Ken Kennedy]: increase programming language usability without sacrificing performance.
Productivity

The major goal must be productivity:
- performance
- user productivity: time to solution
- portability
- robustness

This requires significant progress in many fields:
- parallel programming languages
- compiler technology
- runtime systems
- intelligent programming environments

... and better support from hardware architectures.
Challenges of Peta-Scale Architectures

Large-scale parallelism:
- hundreds of thousands of processors
- component failures will occur at relatively short intervals

Non-uniform data access:
- deep memory hierarchies
- severe differences in latency: 1000+ cycles for accessing data from memory; larger latencies for remote memory
- severe differences in bandwidth
Application Challenges

Long-lived applications, surviving many generations of hardware.

Applications are becoming larger and more complex:
- multi-disciplinary
- multi-language
- multi-paradigm

Legacy codes pose a big problem:
- from F77 to F90, C/C++, MPI, Co-Array Fortran, etc.
- intellectual content must be preserved
- automatic rewriting under the constraint of performance portability is a difficult research problem

Many advanced applications are dynamic, irregular, and adaptive.
Opportunities for Peta-Scale Architectures

High-performance networks and multithreading can contribute to tolerating memory latency and improving memory bandwidth.

Hardware support for locality-aware programming can avoid serious performance problems seen in current architectures.

Abandoning global cache coherence in favor of a more flexible scheme can eliminate a major source of bottlenecks.

Lightweight processors (possibly in the millions) can be used both as a computational fabric and as a service layer:
- exploitation of spatial locality
- lightweight synchronization
- an introspection infrastructure for fault tolerance, performance tuning, and intrusion detection
Programming Models for High Productivity Computing

[Diagram: a programming model, i.e., a conceptual view of data and control, is characterized by its semantics, a productivity model, and an execution model (an abstract machine). Its realizations include a programming language, a programming language plus directives, a library, and a command-line interface.]
Programming Model Issues

Programming models and their realizations can be characterized along (at least) three dimensions: semantics, user productivity, and performance.

Semantics: a mapping from programs to functions specifying the input/output behavior of the program: S: P → F, where each f in F is a function f: I → O.

User productivity: a mapping from programs to a characterization of structural complexity: U: P → N.

Performance: a mapping from programs to functions specifying the complexity of the program in terms of its execution on a real or abstract target machine: C: P → G, where each g in G is a function g: I → N*.
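Written out compactly in standard notation (a direct restatement of the three mappings above, with \mathbb{N}^{*} standing for the N* of this slide):

\[ S : P \to F, \qquad f : I \to O \ \text{ for each } f \in F \]
\[ U : P \to \mathbb{N} \]
\[ C : P \to G, \qquad g : I \to \mathbb{N}^{*} \ \text{ for each } g \in G \]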
Contents

1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
   1 General-Purpose Languages
   2 Domain-Specific Languages
   3 Programming Environments and Tools
4 Cascade and the Chapel Language Design
5 Conclusion
Key Issues in General Purpose Languages

Critical functionality:
- high-level features for explicit concurrency
- high-level features for locality control
- high-level support for distributed collections
- high-level support for programming-in-the-large

Orthogonal language issues:
- global address space
- safety
- object orientation
- performance transparency
- generic programming
- extensibility
Locality Awareness: Distribution, Alignment, Affinity

[Diagram: work is aligned with data; data domains are aligned with each other and distributed onto the processor-memory units (PM) of an HPCS architecture.]
Requirements for Data Distribution Features

- Data distributions must be able to reflect the partitioning of physical domains: exploiting locality in physical space.
- Data distributions must be able to reflect dynamic reshaping of data domains (e.g., Gaussian elimination): a load balancing issue.
- Affinity, the co-location of threads and the data they operate upon, is an important performance issue.
- Language features are needed for expressing user-defined distribution and alignment.
Domain-Specific Languages (DSLs)

- DSLs provide high-level abstractions tailored to a specific domain.
- DSLs separate domain expertise from parallelization knowledge.
- DSLs allow a domain expert to work closer to the scientific process (removed from resource management): a potentially easier route to user productivity than general-purpose languages.
- Potential problems include fragility with respect to changes in the application domain, and target code performance.
- DSLs can be built on top of general-purpose languages, providing a transition path for applications.
Exploiting Domain-Specific Information

- Building domain-specific skeletons for applications (with plug-ins for user-defined functions)
- Using domain knowledge for extending and improving standard compiler analysis
- Developing domain-specific libraries
- Detecting and optimizing domain-specific patterns in serial legacy codes
- Using domain knowledge for verification, validation, and fault tolerance
- Telescoping: automatic generation of optimizing compilers for domain-specific languages defined by libraries
Legacy Code Migration

Rewriting legacy codes:
- preservation of intellectual content
- opportunity for exploiting new hardware, including new algorithms
- code size may preclude the practicality of a rewrite

Language, compiler, tool, and runtime support:
- (semi-)automatic tools for migrating code
- incremental porting
- transition of performance-critical sections requires highly sophisticated software for automatic adaptation:
  - high-level analysis
  - pattern matching and concept comprehension
  - optimization and specialization
Issues in Programming Environments and Tools

Reliability challenges:
- massive parallelism poses new problems
- fault prognostics, detection, recovery
- data distribution may cause vital data to be spread across all nodes

(Semi-)automatic tuning:
- closed-loop adaptive control: measurement, decision-making, actuation
- information exposure: users, compilers, runtime systems
- learning from experience: databases, data mining, reasoning systems

Introspection:
- a technology for the support of validation, fault detection, and performance tuning
Example: Offline Performance Tuning

[Diagram: a source program is translated by a parallelizing compiler, supported by an expert advisor and a transformation system, into a target program; execution on the target machine feeds a program/execution knowledge base, which drives further compilation and transformation until the end of the tuning cycle.]
Contents

1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
4 Cascade and the Chapel Language Design
   1 The DARPA-Sponsored HPCS Program
   2 An Overview of the Cascade Architecture
   3 Design Aspects of the Chapel Programming Language
5 Conclusion
HPCS Program Focus Areas

High Productivity Computing Systems

Goals: provide a new generation of economically viable high productivity computing systems for the national security and industrial user community (2007-2010).

Impact:
- Performance (efficiency): speed up critical national security applications by a factor of 10x to 40x
- Productivity (time-to-solution)
- Portability (transparency): insulate research and operational application software from the system
- Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors

Fill the critical technology and capability gap: from today (late-80s HPC technology) to the future (quantum/bio computing).

Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology.
The Cascade Project

- One-year concept study, July 2002 - June 2003
- Three-year prototyping phase, July 2003 - June 2006
- Led by Cray Inc. (Burton Smith)
- Partners: Caltech/JPL, University of Notre Dame, Stanford University
Cascade Architecture: Key Elements

- Hierarchical architecture: two levels of processing elements
- Lightweight processors in "smart memory"
- Shared address space
- Program-controlled selection of UMA or NUMA data access
- Hybrid UMA/NUMA programming paradigm
Using Bandwidth Wisely

Tolerate latency with processor concurrency:
- vectors provide concurrency within a thread
- multithreading provides concurrency between threads

Exploit locality to reduce bandwidth demand:
- "heavyweight" processors (HWPs) to exploit temporal locality
- "lightweight" processors (LWPs) to exploit spatial locality

Use other techniques to reduce network traffic:
- atomic memory operations
- efficient remote thread creation
A Simplified Global View of the Cascade Architecture

[Diagram: a set of locales connected by a global interconnection network; each locale contains a heavyweight processor (HWP) with a software-controlled cache and smart memory.]
A Cascade Locale

[Diagram: a locale combines a heavyweight processor (vector, multithreaded, streaming, with a compiler-assisted cache) with multiple lightweight processor chips; each chip holds several multithreaded LWPs, a cache, and DRAM. A locale interconnect links these components to a network router leading to other locales.]

Source: David Callahan, Cray Inc.
Lightweight Processors

- Multiple processors per die
- Heavily multithreaded; local memory available to hold thread state
- Large on-chip DRAM cache
- Specialized instruction set: fork, quit, spawn
- MTA-style F/E (full/empty) bits for synchronization
- Optimized to start, stop, and move threads

Source: David Callahan, Cray Inc.
Lightweight Processors and Threads

Lightweight processors:
- co-located with memory
- focus on availability
- full exploitation is not a primary system goal

Lightweight threads:
- minimal state, high-rate context switch
- spawned by sending a parcel to memory

Exploiting spatial locality:
- fine-grain: reductions, prefix operations, search
- coarse-grain: data distribution and alignment

Saving bandwidth by migrating threads to data.
Uses of Lightweight Threads

- Fine-grain application parallelism
- Implementation of a service layer
- Components of agent systems that asynchronously monitor the computation, performing introspection and dealing with:
  - dynamic program validation
  - validation of user directives
  - fault tolerance
  - intrusion prevention and detection
  - performance analysis and tuning
  - support of feedback-oriented compilation
Introspection

Introspection can be defined as a system's ability to:
- explore its own properties
- reason about its internal state
- make decisions about appropriate state changes where necessary

It is an enabling technology for building "autonomic" systems (among other purposes).
A Case Study: Automatic Performance Tuning

- Performance tuning is critical for distributed high-performance applications on a massively parallel architecture.
- Trace-based offline analysis may not be practical due to the volume of data.
- The complexity of the machine architecture makes manual control difficult.
- Automatic support for online tuning is essential for dealing with the problem efficiently.
Elements of an Approach to On-the-Fly Performance Tuning

Program/performance database: stores information about the program, its executions (inputs, performance data), and the system. It provides the central interface in the system.

Asynchronous parallel agents for various functions, including:
- simplification: data reduction and filtering
- invariant checking
- assertion checking
- local problem analysis

Performance exception handler: performs a model-based analysis of performance problems communicated by agents, and decides on an appropriate action if necessary. Such an action may trigger:
- new method selection
- re-instrumentation
- re-compilation
Example: A Society of Agents for Performance Analysis and Feedback-Oriented Tuning

[Diagram: the compiler and instrumenter produce an instrumented application program; data collected during its execution flows through a simplification agent (data reduction and filtering), an invariant-checking agent, and an analysis agent into the program/performance database, which in turn feeds the performance exception handler.]
Cascade Programming Model

Base model:
- unbounded thread parallelism in a flat shared memory
- explicit synchronization for access to shared data
- no consideration of temporal or spatial locality, no "processors"

Extended model:
- explicit memory locales, data distribution, data/thread affinity
- explicit cache management
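As a concrete illustration of the base model, here is a minimal sketch in present-day Chapel, whose sync variables descend from the MTA-style full/empty bits mentioned earlier; the syntax postdates this 2004 talk, and the variable names are illustrative:

  var total: sync int = 0;        // synchronized shared variable, starts "full" holding 0

  coforall tid in 1..4 {          // four concurrent threads sharing 'total'
    const t = total.readFE();     // read takes the value and leaves the variable "empty"
    total.writeEF(t + tid);       // write refills it, releasing the next reader
  }

  writeln(total.readFF());        // reads without emptying; prints 10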
Language Design Criteria

- Global name space, even in the context of a NUMA model: avoid the "local view" programming model of MPI, Co-Array Fortran, UPC
- Multiple models of parallelism
- Provide support for:
  - explicit parallel programming
  - locality-aware programming
  - interoperability with legacy codes (MPI, Co-Array Fortran, UPC, etc.)
  - generic programming
Language Design Highlights

The "concrete language" enhances HPF and ZPL:
- generalized HPF-type data distributions
- domains are first-class objects: a central repository for an index space, its distribution, and the associated set of arrays
- support for automatic partitioning of dynamic graph-based data structures
- high-level control of communication (halos, ...)?

The "abstract language" supports generic programming (see the iterator sketch below):
- abstraction of types: type inference from context
- abstraction of iteration: generalization of the CLU iterator
- data structure inference: system-selected implementations for programmer-specified object categories

Note: The "concrete" and "abstract" languages as used above do not in reality refer to different languages; the terms are only used here to structure this discussion.
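For a flavor of the iteration abstraction, a small sketch in present-day Chapel syntax (which postdates this talk; 'evens' is an invented example, not from the slides):

  // An iterator in the CLU tradition: yields a sequence of values on demand.
  iter evens(n: int): int {
    for i in 1..n do
      if i % 2 == 0 then
        yield i;
  }

  for e in evens(10) do
    writeln(e);               // prints 2, 4, 6, 8, 10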
Basics

A modern base language:
- strongly typed
- Fortran-like array features
- object-oriented
- module structure for name space management
- optional automatic storage management

High-performance features:
- abstractions for parallelism:
  - data parallelism (domains, forall)
  - task parallelism (cobegin/coend)
- locality management via data distributions and affinity
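A minimal sketch of these abstractions, assuming present-day Chapel syntax (which differs in detail from the 2004 draft notation used elsewhere in this talk; the size n and the two printed reductions are illustrative):

  config const n = 8;

  const D: domain(1) = {1..n};     // a domain: a first-class index set
  var A: [D] real;                 // an array declared over the domain

  forall i in D do                 // data parallelism over the domain
    A[i] = 0.25 * i;

  cobegin {                        // task parallelism: two tasks run concurrently
    writeln("sum = ", + reduce A);
    writeln("max = ", max reduce A);
  }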
Type Tree

- primitive
- domain
- collection
  - homogeneous collection: sequence, array, set
  - heterogeneous collection: class, record
- function: iterator, other function
Locality Management

Low-level tools direct object and thread placement at the time of creation: natural extensions to existing partitioned address space languages.

Evolve a higher-level concept from HPF and ZPL: introduce domains as first-class program entities:
- sets of names (indices) for data
- centralization of distribution, layout, and iteration ordering
Domains

- index sets: Cartesian products, sparse, opaque
- locale view: a logical view for a set of locales
- distribution: a mapping of an index set to a locale view
- array: a map from an index set to a collection of variables

The matrix-vector multiplication examples that follow illustrate these concepts.
Example: Matrix Vector Multiplication V1

Separation of concerns: definition.

  var Mat:    domain(2) = [1..m, 1..n];
  var MatCol: domain(1) = Mat(2);
  var MatRow: domain(1) = Mat(1);

  var A: array [Mat]    of float;
  var v: array [MatCol] of float;
  var s: array [MatRow] of float;

  s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);
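For comparison, a sketch of the same computation in present-day Chapel, which evolved from the draft notation above (the sizes m and n and the missing initialization of A and v are placeholders):

  config const m = 4, n = 4;

  const Mat: domain(2)    = {1..m, 1..n};
  const MatRow: domain(1) = {1..m};
  const MatCol: domain(1) = {1..n};

  var A: [Mat] real;
  var v: [MatCol] real;
  var s: [MatRow] real;

  // s(i) = sum over j of A(i,j) * v(j)
  forall i in MatRow do
    s[i] = + reduce [j in MatCol] (A[i, j] * v[j]);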
Example: Matrix Vector Multiplication V2

  var L: array [1..p1, 1..p2] of locale;

  var Mat:    domain(2) dist(block,block) to L = [1..m, 1..n];
  var MatCol: domain(1) align(*, Mat(2)) = Mat(2);
  var MatRow: domain(1) align(Mat(1), *) = Mat(1);

  var A: array [Mat]    of float;
  var v: array [MatCol] of float;
  var s: array [MatRow] of float;

  s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);

Note that only the declarations change relative to V1: the domain is now block-distributed over a grid of locales, and the vectors are aligned with its columns and rows. The computation statement itself is unchanged.
Sparse Matrix Distribution

[Figure: a sparse matrix is block-distributed over four locales; each locale k stores its local nonzeroes in compressed sparse row form, as a data vector Dk (the nonzero values), a column-index vector Ck, and a row-pointer vector Rk. For example, locale 0 holds D0 = (53, 19, 17, 93), C0 = (2, 1, 4, 5), R0 = (1, 2, 2, 3, 3, 4, 5, 5).]
Example: Matrix Vector Multiplication V3

  var L: array [1..p1, 1..p2] of locale;

  var Mat: domain(sparse2) dist(mysdist) to L layout(myslay)
           = [1..m, 1..n] where enumeratenonzeroes();
  var MatCol: domain(1) align(*, Mat(2)) = Mat(2);
  var MatRow: domain(1) align(Mat(1), *) = Mat(1);

  var A: array [Mat]    of float;
  var v: array [MatCol] of float;
  var s: array [MatRow] of float;

  s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);

Again, only the domain declaration changes: Mat is now a sparse domain with a user-defined distribution and layout, while the computation statement remains identical to V1 and V2.
Language Summary

- Global name space
- High-level control features supporting explicit parallelism
- High-level locality management
- High-level support for collections
- Static typing
- Support for generic programming
Conclusion

- Today's programming languages, models, and tools are not adequate to deal with the challenges of 2010 architectures and application requirements.
- Peta-scale architectures will pose additional problems, but may also provide better structural support for high-level languages and compilers.
- A focused and persistent research and development effort, along a number of different paths, will be needed to create viable language systems and tool support for economically feasible and robust high productivity computing in the future.