Towards Future Programming Languages and Models for High-Productivity Computing
Hans P. Zima
University of Vienna, Austria, and
JPL, California Institute of Technology, Pasadena, CA
IFIP WG 10.3 Seminar September 7, 2004
Contents

1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
4 Cascade and the Chapel Language Design
5 Conclusion
Abstraction in Programming

Programming models and languages bridge the gap between "reality" and hardware, at different levels of abstraction, e.g.:
- assembly languages
- general-purpose procedural languages
- functional languages
- very high-level domain-specific languages

Abstraction implies loss of information: a gain in simplicity, clarity, verifiability, and portability, versus a potential performance degradation.
The Emergence of High-Level Sequential Languages

The designers of the very first high-level programming language were aware that their success depended on acceptable performance of the generated target programs:

John Backus (1957): "... It was our belief that if FORTRAN ... were to translate any reasonable scientific source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger ..."

High-level algorithmic languages became generally accepted standards for sequential programming because their advantages outweighed any performance drawbacks. For parallel programming, no similar development has taken place.
The Crisis of High Performance Computing

Current HPC hardware: large clusters at reasonable cost
- commodity clusters or custom MPPs
- off-the-shelf processors and memory components, built for the mass market
- latency and bandwidth problems

Current HPC software:
- application efficiency sometimes in single digits
- low-productivity programming models dominate
- focus on programming-in-the-small
- inadequate programming environments and tools

Resemblance to the pre-Fortran era of sequential programming:
- assembly language programming vs. MPI
- wide gap between the domain of the scientist and the programming language
Domination of Low-Level Paradigms

Exploiting performance on current parallel architectures requires low-level control and "heroic" programmers.

This has led to "local view" programming models, as exemplified by MPI-based programming (as well as Co-Array Fortran etc.):
- explicit processors: local views of data
- program state associated with memory regions
- explicit communication intertwined with the algorithm
The High Performance Fortran (HPF) Effort

The 1990s saw a strong (mainly, but not exclusively, academic) effort to improve this situation.

Focus on higher-level paradigms:
- CM Fortran
- Fortran D
- Vienna Fortran
- High Performance Fortran (HPF)
- and many others...

A wealth of research on related compiler technology.

These languages have not been broadly accepted, for a number of reasons...
Example: Parallel Jacobi With MPI

Assuming block column distribution:

  initialize MPI
  ...
  do while (.not. converged)
    ! local computation
    do J = 1, M
      do I = 1, N
        B(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
      end do
    end do
    A(1:N,1:M) = B(1:N,1:M)
    ! boundary exchange with neighboring processes
    if (MOD(myrank,2) .eq. 1) then
      call MPI_SEND(B(1,1),N,...,myrank-1,..)
      call MPI_RECV(A(1,0),N,...,myrank-1,..)
      if (myrank .lt. s-1) then
        call MPI_SEND(B(1,M),N,...,myrank+1,..)
        call MPI_RECV(A(1,M+1),N,...,myrank+1,..)
      endif
    else
      ...
    endif
  end do
Example: Parallel Jacobi With HPF

  !HPF$ PROCESSORS P(NUMBER_OF_PROCESSORS())
  !HPF$ DISTRIBUTE (*,BLOCK) ONTO P :: A, B   ! block column distribution

  ...
  do while (.not. converged)
    ! global computation
    do J = 1, N
      do I = 1, N
        B(I,J) = 0.25 * (A(I-1,J) + A(I+1,J) + A(I,J-1) + A(I,J+1))
      end do
    end do
    A(1:N,1:N) = B(1:N,1:N)
  end do

Communication is automatically generated by the compiler. A change of distribution requires modifying just one line in the source program; the algorithm itself does not change.
Example: Observations

- The HPF code is far simpler than the MPI code.
- Compiler-generated object code is as good as the MPI code:
  - parallelization of the loop
  - static analysis of access patterns and communication
  - aggregation and fusion of communication
- The data distribution directives could even be generated automatically from the sequential code.

So where is the problem with HPF?
HPF Problems

HPF 1 lacked important functionality:
- data distributions do not support irregular algorithms
- lack of flexibility for processor mapping and data/thread affinity
- focus on support for array types and SPMD data parallelism

HPF includes features that are difficult to implement.

Current architectures do not support well the dynamic and irregular features of advanced applications.

However: an HPF+/JA plasma code reached 12.5 TFLOPS on the Earth Simulator, using
- explicit high-level formulation of communication patterns
- explicit high-level control of communication schedules
- explicit high-level control of "halos"
The Effect of Schedule Reuse and Halo Management

[Figure: execution times (seconds, log scale from 10 to 100,000) of an FEM crash simulation kernel on 2 to 128 processors, comparing HPF (PGI), HPF/VFC, HPF+/VFC with schedule reuse, MPI/F77, and sequential F90. Language extensions (HPF+ / Earth Simulator): schedule reuse, halo control, pure procedures.]
Contents

1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
4 Cascade and the Chapel Language Design
5 Conclusion
State-of-the-Art

Current parallel programming language, compiler, and tool technologies are unable to support high productivity computing.

New programming models, languages, compiler, and tool technologies are necessary to address the productivity demands of future systems.
Global Goal

Make scientists and engineers more productive: provide a higher level of abstraction.

Support "Abstraction without Guilt" [Ken Kennedy]: increase programming language usability without sacrificing performance.
Productivity

The major goal must be productivity:
- performance
- user productivity: time to solution
- portability
- robustness

This requires significant progress in many fields:
- parallel programming languages
- compiler technology
- runtime systems
- intelligent programming environments

... and better support from hardware architectures.
Challenges of Peta-Scale Architectures

Large-scale parallelism:
- hundreds of thousands of processors
- component failures will occur at relatively short intervals

Non-uniform data access:
- deep memory hierarchies
- severe differences in latency: 1000+ cycles for accessing data from memory; larger latencies for remote memory
- severe differences in bandwidth
Application Challenges

Long-lived applications, surviving many generations of hardware.

Applications are becoming larger and more complex:
- multi-disciplinary
- multi-language
- multi-paradigm

Legacy codes pose a big problem:
- from F77 to F90, C/C++, MPI, Co-Array Fortran, etc.
- intellectual content must be preserved
- automatic rewriting under the constraint of performance portability is a difficult research problem

Many advanced applications are dynamic, irregular, and adaptive.
Opportunities for Peta-Scale Architectures

High-performance networks and multithreading can contribute to tolerating memory latency and improving memory bandwidth.

Hardware support for locality-aware programming can avoid serious performance problems seen in current architectures.

Abandoning global cache coherence in favor of a more flexible scheme can eliminate a major source of bottlenecks.

Lightweight processors (possibly in the millions) can be used both as a computational fabric and as a service layer:
- exploitation of spatial locality
- lightweight synchronization
- an introspection infrastructure for fault tolerance, performance tuning, and intrusion detection
Programming Models for High Productivity Computing

[Diagram: a programming model, i.e., a conceptual view of data and control, is characterized by its semantics, a productivity model, and an execution model (an abstract machine). Its realizations include a programming language, a programming language plus directives, a library, and a command-line interface.]
Programming Model Issues

Programming models and their realizations can be characterized along (at least) three dimensions: semantics, user productivity, and performance.

Semantics: a mapping from programs to functions specifying the input/output behavior of the program: S: P → F, where each f in F is a function f: I → O.

User productivity: a mapping from programs to a characterization of structural complexity: U: P → N.

Performance: a mapping from programs to functions specifying the complexity of the program in terms of its execution on a real or abstract target machine: C: P → G, where each g in G is a function g: I → N*.
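Written out compactly in standard notation (a direct restatement of the three mappings above, with \mathbb{N}^{*} standing for the N* of this slide):

\[ S : P \to F, \qquad f : I \to O \ \text{ for each } f \in F \]
\[ U : P \to \mathbb{N} \]
\[ C : P \to G, \qquad g : I \to \mathbb{N}^{*} \ \text{ for each } g \in G \]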
Contents

1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
   1 General-Purpose Languages
   2 Domain-Specific Languages
   3 Programming Environments and Tools
4 Cascade and the Chapel Language Design
5 Conclusion
Key Issues in General Purpose Languages

Critical functionality:
- high-level features for explicit concurrency
- high-level features for locality control
- high-level support for distributed collections
- high-level support for programming-in-the-large

Orthogonal language issues:
- global address space
- safety
- object orientation
- performance transparency
- generic programming
- extensibility
Locality Awareness: Distribution, Alignment, Affinity

[Diagram: work is aligned with data; data domains are aligned with each other and distributed onto the processor-memory units (PM) of an HPCS architecture.]
Requirements for Data Distribution Features

- Data distributions must be able to reflect the partitioning of physical domains: exploiting locality in physical space.
- Data distributions must be able to reflect dynamic reshaping of data domains (e.g., Gaussian elimination): a load balancing issue.
- Affinity, the co-location of threads and the data they operate upon, is an important performance issue.
- Language features are needed for expressing user-defined distribution and alignment.
Domain-Specific Languages (DSLs)

- DSLs provide high-level abstractions tailored to a specific domain.
- DSLs separate domain expertise from parallelization knowledge.
- DSLs allow a domain expert to work closer to the scientific process (removed from resource management): a potentially easier route to user productivity than general-purpose languages.
- Potential problems include fragility with respect to changes in the application domain, and target code performance.
- DSLs can be built on top of general-purpose languages, providing a transition path for applications.
Exploiting Domain-Specific Information

- Building domain-specific skeletons for applications (with plug-ins for user-defined functions)
- Using domain knowledge for extending and improving standard compiler analysis
- Developing domain-specific libraries
- Detecting and optimizing domain-specific patterns in serial legacy codes
- Using domain knowledge for verification, validation, and fault tolerance
- Telescoping: automatic generation of optimizing compilers for domain-specific languages defined by libraries
Legacy Code Migration

Rewriting legacy codes:
- preservation of intellectual content
- opportunity for exploiting new hardware, including new algorithms
- code size may preclude the practicality of a rewrite

Language, compiler, tool, and runtime support:
- (semi-)automatic tools for migrating code
- incremental porting
- transition of performance-critical sections requires highly sophisticated software for automatic adaptation:
  - high-level analysis
  - pattern matching and concept comprehension
  - optimization and specialization
Issues in Programming Environments and Tools

Reliability challenges:
- massive parallelism poses new problems
- fault prognostics, detection, recovery
- data distribution may cause vital data to be spread across all nodes

(Semi-)automatic tuning:
- closed-loop adaptive control: measurement, decision-making, actuation
- information exposure: users, compilers, runtime systems
- learning from experience: databases, data mining, reasoning systems

Introspection:
- a technology for the support of validation, fault detection, and performance tuning
Example: Offline Performance Tuning

[Diagram: a source program is translated by a parallelizing compiler, supported by an expert advisor and a transformation system, into a target program; execution on the target machine feeds a program/execution knowledge base, which drives further compilation and transformation until the end of the tuning cycle.]
Contents

1 Introduction
2 Programming Models for High Productivity Computing Systems (HPCS)
3 Issues in HPCS Languages and Paradigms
4 Cascade and the Chapel Language Design
   1 The DARPA-Sponsored HPCS Program
   2 An Overview of the Cascade Architecture
   3 Design Aspects of the Chapel Programming Language
5 Conclusion
HPCS Program Focus Areas

High Productivity Computing Systems

Goals: provide a new generation of economically viable high productivity computing systems for the national security and industrial user community (2007-2010).

Impact:
- Performance (efficiency): speed up critical national security applications by a factor of 10x to 40x
- Productivity (time-to-solution)
- Portability (transparency): insulate research and operational application software from the system
- Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors

Fill the critical technology and capability gap: from today (late-80s HPC technology) to the future (quantum/bio computing).

Applications: intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology.
The Cascade Project

- One-year concept study, July 2002 - June 2003
- Three-year prototyping phase, July 2003 - June 2006
- Led by Cray Inc. (Burton Smith)
- Partners: Caltech/JPL, University of Notre Dame, Stanford University
Cascade Architecture: Key Elements

- Hierarchical architecture: two levels of processing elements
- Lightweight processors in "smart memory"
- Shared address space
- Program-controlled selection of UMA or NUMA data access
- Hybrid UMA/NUMA programming paradigm
Using Bandwidth Wisely

Tolerate latency with processor concurrency:
- vectors provide concurrency within a thread
- multithreading provides concurrency between threads

Exploit locality to reduce bandwidth demand:
- "heavyweight" processors (HWPs) to exploit temporal locality
- "lightweight" processors (LWPs) to exploit spatial locality

Use other techniques to reduce network traffic:
- atomic memory operations
- efficient remote thread creation
A Simplified Global View of the Cascade Architecture

[Diagram: a set of locales connected by a global interconnection network; each locale contains a heavyweight processor (HWP) with a software-controlled cache and smart memory.]
A Cascade Locale

[Diagram: a locale combines a heavyweight processor (vector, multithreaded, streaming, with a compiler-assisted cache) with multiple lightweight processor chips; each chip holds several multithreaded LWPs, a cache, and DRAM. A locale interconnect links these components to a network router leading to other locales.]

Source: David Callahan, Cray Inc.
Lightweight Processors

- Multiple processors per die
- Heavily multithreaded; local memory available to hold thread state
- Large on-chip DRAM cache
- Specialized instruction set: fork, quit, spawn
- MTA-style F/E (full/empty) bits for synchronization
- Optimized to start, stop, and move threads

Source: David Callahan, Cray Inc.
Lightweight Processors and Threads

Lightweight processors:
- co-located with memory
- focus on availability
- full exploitation is not a primary system goal

Lightweight threads:
- minimal state, high-rate context switch
- spawned by sending a parcel to memory

Exploiting spatial locality:
- fine-grain: reductions, prefix operations, search
- coarse-grain: data distribution and alignment

Saving bandwidth by migrating threads to data.
Uses of Lightweight Threads

- Fine-grain application parallelism
- Implementation of a service layer
- Components of agent systems that asynchronously monitor the computation, performing introspection and dealing with:
  - dynamic program validation
  - validation of user directives
  - fault tolerance
  - intrusion prevention and detection
  - performance analysis and tuning
  - support of feedback-oriented compilation
Introspection

Introspection can be defined as a system's ability to:
- explore its own properties
- reason about its internal state
- make decisions about appropriate state changes where necessary

It is an enabling technology for building "autonomic" systems (among other purposes).
A Case Study: Automatic Performance Tuning

- Performance tuning is critical for distributed high-performance applications on a massively parallel architecture.
- Trace-based offline analysis may not be practical due to the volume of data.
- The complexity of the machine architecture makes manual control difficult.
- Automatic support for online tuning is essential for dealing with the problem efficiently.
Elements of an Approach to On-the-Fly Performance Tuning

Program/performance database: stores information about the program, its executions (inputs, performance data), and the system. It provides the central interface in the system.

Asynchronous parallel agents for various functions, including:
- simplification: data reduction and filtering
- invariant checking
- assertion checking
- local problem analysis

Performance exception handler: performs a model-based analysis of performance problems communicated by agents, and decides on an appropriate action if necessary. Such an action may trigger:
- new method selection
- re-instrumentation
- re-compilation
Example: A Society of Agents for Performance Analysis and Feedback-Oriented Tuning

[Diagram: the compiler and instrumenter produce an instrumented application program; data collected during its execution flows through a simplification agent (data reduction and filtering), an invariant-checking agent, and an analysis agent into the program/performance database, which in turn feeds the performance exception handler.]
Cascade Programming Model

Base model:
- unbounded thread parallelism in a flat shared memory
- explicit synchronization for access to shared data
- no consideration of temporal or spatial locality, no "processors"

Extended model:
- explicit memory locales, data distribution, data/thread affinity
- explicit cache management
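As a concrete illustration of the base model, here is a minimal sketch in present-day Chapel, whose sync variables descend from the MTA-style full/empty bits mentioned earlier; the syntax postdates this 2004 talk, and the variable names are illustrative:

  var total: sync int = 0;        // synchronized shared variable, starts "full" holding 0

  coforall tid in 1..4 {          // four concurrent threads sharing 'total'
    const t = total.readFE();     // read takes the value and leaves the variable "empty"
    total.writeEF(t + tid);       // write refills it, releasing the next reader
  }

  writeln(total.readFF());        // reads without emptying; prints 10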
Language Design Criteria

- Global name space, even in the context of a NUMA model: avoid the "local view" programming model of MPI, Co-Array Fortran, UPC
- Multiple models of parallelism
- Provide support for:
  - explicit parallel programming
  - locality-aware programming
  - interoperability with legacy codes (MPI, Co-Array Fortran, UPC, etc.)
  - generic programming
Language Design Highlights

The "concrete language" enhances HPF and ZPL:
- generalized HPF-type data distributions
- domains are first-class objects: a central repository for an index space, its distribution, and the associated set of arrays
- support for automatic partitioning of dynamic graph-based data structures
- high-level control of communication (halos, ...)?

The "abstract language" supports generic programming (see the iterator sketch below):
- abstraction of types: type inference from context
- abstraction of iteration: generalization of the CLU iterator
- data structure inference: system-selected implementations for programmer-specified object categories

Note: The "concrete" and "abstract" languages as used above do not in reality refer to different languages; the terms are only used here to structure this discussion.
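For a flavor of the iteration abstraction, a small sketch in present-day Chapel syntax (which postdates this talk; 'evens' is an invented example, not from the slides):

  // An iterator in the CLU tradition: yields a sequence of values on demand.
  iter evens(n: int): int {
    for i in 1..n do
      if i % 2 == 0 then
        yield i;
  }

  for e in evens(10) do
    writeln(e);               // prints 2, 4, 6, 8, 10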
Basics

A modern base language:
- strongly typed
- Fortran-like array features
- object-oriented
- module structure for name space management
- optional automatic storage management

High-performance features:
- abstractions for parallelism:
  - data parallelism (domains, forall)
  - task parallelism (cobegin/coend)
- locality management via data distributions and affinity
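A minimal sketch of these abstractions, assuming present-day Chapel syntax (which differs in detail from the 2004 draft notation used elsewhere in this talk; the size n and the two printed reductions are illustrative):

  config const n = 8;

  const D: domain(1) = {1..n};     // a domain: a first-class index set
  var A: [D] real;                 // an array declared over the domain

  forall i in D do                 // data parallelism over the domain
    A[i] = 0.25 * i;

  cobegin {                        // task parallelism: two tasks run concurrently
    writeln("sum = ", + reduce A);
    writeln("max = ", max reduce A);
  }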
Type Tree

- primitive
- domain
- collection
  - homogeneous collection: sequence, array, set
  - heterogeneous collection: class, record
- function: iterator, other function
Locality Management

Low-level tools direct object and thread placement at the time of creation: natural extensions to existing partitioned address space languages.

Evolve a higher-level concept from HPF and ZPL: introduce domains as first-class program entities:
- sets of names (indices) for data
- centralization of distribution, layout, and iteration ordering
Domains

- index sets: Cartesian products, sparse, opaque
- locale view: a logical view for a set of locales
- distribution: a mapping of an index set to a locale view
- array: a map from an index set to a collection of variables

The matrix-vector multiplication examples that follow illustrate these concepts.
Example: Matrix Vector Multiplication V1

Separation of concerns: definition.

  var Mat:    domain(2) = [1..m, 1..n];
  var MatCol: domain(1) = Mat(2);
  var MatRow: domain(1) = Mat(1);

  var A: array [Mat]    of float;
  var v: array [MatCol] of float;
  var s: array [MatRow] of float;

  s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);
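For comparison, a sketch of the same computation in present-day Chapel, which evolved from the draft notation above (the sizes m and n and the missing initialization of A and v are placeholders):

  config const m = 4, n = 4;

  const Mat: domain(2)    = {1..m, 1..n};
  const MatRow: domain(1) = {1..m};
  const MatCol: domain(1) = {1..n};

  var A: [Mat] real;
  var v: [MatCol] real;
  var s: [MatRow] real;

  // s(i) = sum over j of A(i,j) * v(j)
  forall i in MatRow do
    s[i] = + reduce [j in MatCol] (A[i, j] * v[j]);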
Example: Matrix Vector Multiplication V2

  var L: array [1..p1, 1..p2] of locale;

  var Mat:    domain(2) dist(block,block) to L = [1..m, 1..n];
  var MatCol: domain(1) align(*, Mat(2)) = Mat(2);
  var MatRow: domain(1) align(Mat(1), *) = Mat(1);

  var A: array [Mat]    of float;
  var v: array [MatCol] of float;
  var s: array [MatRow] of float;

  s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);

Note that only the declarations change relative to V1: the domain is now block-distributed over a grid of locales, and the vectors are aligned with its columns and rows. The computation statement itself is unchanged.
Sparse Matrix Distribution

[Figure: a sparse matrix is block-distributed over four locales; each locale k stores its local nonzeroes in compressed sparse row form, as a data vector Dk (the nonzero values), a column-index vector Ck, and a row-pointer vector Rk. For example, locale 0 holds D0 = (53, 19, 17, 93), C0 = (2, 1, 4, 5), R0 = (1, 2, 2, 3, 3, 4, 5, 5).]
Example: Matrix Vector Multiplication V3

  var L: array [1..p1, 1..p2] of locale;

  var Mat: domain(sparse2) dist(mysdist) to L layout(myslay)
           = [1..m, 1..n] where enumeratenonzeroes();
  var MatCol: domain(1) align(*, Mat(2)) = Mat(2);
  var MatRow: domain(1) align(Mat(1), *) = Mat(1);

  var A: array [Mat]    of float;
  var v: array [MatCol] of float;
  var s: array [MatRow] of float;

  s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);

Again, only the domain declaration changes: Mat is now a sparse domain with a user-defined distribution and layout, while the computation statement remains identical to V1 and V2.
Language Summary

- Global name space
- High-level control features supporting explicit parallelism
- High-level locality management
- High-level support for collections
- Static typing
- Support for generic programming
Conclusion

- Today's programming languages, models, and tools are not adequate to deal with the challenges of 2010 architectures and application requirements.
- Peta-scale architectures will pose additional problems, but may also provide better structural support for high-level languages and compilers.
- A focused and persistent research and development effort, along a number of different paths, will be needed to create viable language systems and tool support for economically feasible and robust high productivity computing in the future.