N. Vasilache, R. Lethin
Transcript of a presentation by N. Vasilache and R. Lethin, Reservoir Labs
Reservoir Labs. Copyright © 2010 Reservoir Labs, Inc. All Rights Reserved. Use or disclosure of data contained on this slide is subject to the restrictions on the title page of this presentation. This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
The R-Stream High-Level Transformation Tool: State of the Art and Objectives Within the UHPC Program
• Government Purpose Rights
• Purchase Order Number: N/A
• Agreement No.: HR001-10-3-0007
• Contractor Name: Intel Corporation
• Contractor Address: 2111 NE 25th Ave M/S JF2-60, Hillsboro, OR 97124
• Expiration Date: None
• The Government's rights to use, modify, reproduce, release, perform, display, or disclose this technical data are restricted by paragraphs B (1), (3) and (6) of Article VIII as incorporated within the above purchase order and Agreement. No restrictions apply after the expiration date shown above. Any reproduction of the software or portions thereof marked with this legend must also reproduce the markings. The following entities, their respective successors and assigns, shall possess the right to exercise said property rights, as if they were the Government, on behalf of the Government: University of Delaware – www.udel.edu; ETI International – www.etinternational.com; Intel Corporation – www.intel.com; Reservoir Labs – www.reservoir.com; University of California, San Diego – www.ucsd.edu; University of Illinois at Urbana-Champaign – www.illinois.edu.
Outline
• R-Stream Overview
• UHPC Goals
• Some Performance Results
Power Efficiency Driving Architectures
[Figure: a tiled architecture made of four identical nodes, each containing a GPP, SIMD units, local memory, a DMA engine, and an FPGA with SIMD units.]

• Heterogeneous processing
• Distributed local memories
• Explicitly managed architecture
• NUMA
• Bandwidth starved
• Hierarchical (including board, chassis, cabinet)
• Multiple execution models
• Mixed parallelism types
• Multiple spatial dimensions
Computation Choreography
• Expressing it in the program:
  • Annotations and pragma dialects for C
  • Chapel subset (UHPC in progress with UIUC)
  • CnC subset (UHPC in progress with Intel)
• Generating it:
  • Explicitly (e.g., new languages like CUDA, target-specific)
  • Implicitly (UHPC in progress: libraries, runtime abstractions, CnC)
• But before expressing it, how can programmers find it?
  • Manual constructive procedures, art, sweat, time (not our focus)
    – Artisans get complete control over every detail
  • Fully automatic
    – Operations research problems and (advanced) autotuning
    – Faster, sometimes better, than a human
Program Transformations Specification

[Figure: the iteration space of a statement S(i,j), shown in the original (i, j) coordinates and after transformation to time coordinates (t1, t2).]

• Schedule maps iterations to multi-dimensional time:
  • A feasible schedule preserves dependences
• Placement maps iterations to multi-dimensional space:
  • UHPC in progress, partially done
• Layout maps data elements to multi-dimensional space:
  • UHPC in progress
• Hierarchical by design; tiling serves separation of concerns
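The schedule vocabulary above can be made concrete with a small sketch (our own illustration, not from the slides): skewing a dependent 2-D loop nest by the affine schedule (i, j) → (i + j, j) is feasible because every dependence still moves strictly forward in time, and it exposes a parallel wavefront in the inner dimension.

```c
#include <string.h>

#define N 6

/* Original nest: S(i,j) reads S(i-1,j) and S(i,j-1). */
static void original(int A[N][N]) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            A[i][j] = A[i-1][j] + A[i][j-1];
}

/* Schedule (i,j) -> (t1,t2) = (i+j, j): feasible, because both
 * dependences advance t1 by exactly 1; all iterations sharing a t1
 * value form an independent wavefront (the t2 loop could be a doall). */
static void skewed(int A[N][N]) {
    for (int t1 = 2; t1 <= 2 * (N - 1); t1++)
        for (int t2 = 1; t2 < N; t2++) {
            int i = t1 - t2, j = t2;
            if (i >= 1 && i < N)
                A[i][j] = A[i-1][j] + A[i][j-1];
        }
}

/* A feasible schedule preserves semantics: both orders must agree. */
static int schedules_agree(void) {
    int A[N][N] = {{0}}, B[N][N];
    for (int k = 0; k < N; k++) A[0][k] = A[k][0] = 1;
    memcpy(B, A, sizeof A);
    original(A);
    skewed(B);
    return memcmp(A, B, sizeof A) == 0;
}
```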
Polyhedral Slogans
• Parametric imperfect loop nests
• Subsumes classical transformations
• Compacts the transformation search space
• Parallelization, locality optimization (communication avoiding)
• Preserves semantics
• Analytic joint formulations of optimizations
• Not just for affine static control programs
R-Stream Blueprint

[Figure: compiler blueprint. An EDG C front end and a CnC/Chapel front end (UHPC in progress) produce a scalar representation; raising lifts it into the extended representation (CnC high-level, C low-level) used by the polyhedral mapper, which is guided by a machine model; lowering feeds a pretty printer that emits CUDA, C + annotations, pthreads, etc.]
Mapping Process for Explicitly Managed Memories
1. Scheduling: parallelism, locality, tilability
2. Task formation:
   - Coarse-grain atomic tasks
   - Master/slave side operations
3. Placement: assign tasks to blocks/threads
Followed by:
   - Local / global data layout optimization
   - Multi-buffering (explicitly managed)
   - Synchronization (barriers)
   - Bulk communications
   - Thread generation → master/slave
   - Target-specific optimizations
The process is driven by dependencies.
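Multi-buffering for an explicitly managed memory can be sketched as classic double buffering. This illustrative C fragment is our own sketch, not R-Stream's generated code, and memcpy stands in for an asynchronous DMA engine: the next tile is transferred into one buffer while the current tile is processed out of the other.

```c
#include <string.h>

#define TILES 4
#define TILE  8

/* Double buffering: while tile t is consumed from buf[t % 2], tile
 * t+1 is "DMA'd" into the other buffer (memcpy models the transfer). */
static long process_double_buffered(int src[TILES][TILE]) {
    int buf[2][TILE];
    long sum = 0;
    memcpy(buf[0], src[0], sizeof buf[0]);     /* prefetch tile 0 */
    for (int t = 0; t < TILES; t++) {
        if (t + 1 < TILES)                     /* start next transfer */
            memcpy(buf[(t + 1) % 2], src[t + 1], sizeof buf[0]);
        for (int k = 0; k < TILE; k++)         /* compute on current tile */
            sum += buf[t % 2][k];
    }
    return sum;
}

/* Reference: plain sum over all tiles, no buffering. */
static long reference_sum(int src[TILES][TILE]) {
    long s = 0;
    for (int t = 0; t < TILES; t++)
        for (int k = 0; k < TILE; k++)
            s += src[t][k];
    return s;
}

/* Buffering must not change the computed result. */
static int buffering_correct(void) {
    int src[TILES][TILE];
    for (int t = 0; t < TILES; t++)
        for (int k = 0; k < TILE; k++)
            src[t][k] = t * 10 + k;
    return process_double_buffered(src) == reference_sum(src);
}
```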
Model for Scheduling Trades 3 Objectives Jointly

[Figure: trade-off diagram (patent pending). Loop fusion moves toward more locality and fewer global memory accesses; loop fission moves toward more parallelism and sufficient occupancy; memory contiguity, plus successive-thread contiguity, yields better effective bandwidth.]
Inside the R-Stream Mapper

[Figure: the Tactics Module orchestrates optimization modules over an extended GDG representation (built on Jolylib and others): parallelization, locality optimization, tiling, placement, communication generation, memory promotion, synchronization generation, layout optimization, polyhedral scanning, and more.]

Optimization modules are engineered to expose advanced "knobs" used by the auto-tuner.
Optimization Across BLAS Calls

/* Optimization with BLAS */
for loop {
  ...
  BLAS call 1
  ...
  BLAS call 2
  ...
  ...
  BLAS call n
  ...
}

The outer loop(s) retrieve data Z from disk, store Z back to disk, then retrieve Z from disk again: numerous cache misses.

vs.

/* Global optimization */
doall loop {
  ...
  for loop {
    ...
    [read from Z]
    ...
    [write to Z]
    ...
    [read from Z]
  }
  ...
}

Loop fusion can improve locality, and the outer loop(s) can be parallelized.

→ Global optimization exposes better parallelism and locality (significant speedups)
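In miniature, the contrast looks like this (an illustrative C sketch of the idea, not the radar code): two black-box passes over Z stream the array through the memory hierarchy twice, while the globally optimized version fuses them into one parallelizable pass.

```c
#define N 1024

/* Two separate passes over Z (like two opaque library calls):
 * Z is read and written twice. */
static void two_passes(double *Z, int n) {
    for (int i = 0; i < n; i++) Z[i] = 2.0 * Z[i];   /* "call 1" */
    for (int i = 0; i < n; i++) Z[i] = Z[i] + 1.0;   /* "call 2" */
}

/* Globally optimized: the calls are fused, so each element is touched
 * once while hot in cache, and the single loop is a doall that an
 * outer-level parallelizer can distribute. */
static void fused(double *Z, int n) {
    for (int i = 0; i < n; i++)
        Z[i] = 2.0 * Z[i] + 1.0;
}

/* Fusion must preserve the results exactly (same operations per element). */
static int fusion_agrees(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = b[i] = (double)i;
    two_passes(a, N);
    fused(b, N);
    for (int i = 0; i < N; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```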
Outline
• R-Stream Overview
• UHPC Goals
• Some Performance Results
Codelets From an HLC Perspective
• Codelets have:
  • Fine granularity
  • Explicit communication
  • Point-to-point and other kinds of synchronization
• Can utilize scheduling and dependence information hints
• Should also use hints for placement of data and computation
• Work from local scratch-pad memories
• Good match for UHPC hardware; allows good control for energy, resilience, etc.
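A toy model of the firing rule behind such codelets (our own sketch; the struct layout and names are assumptions, not the UHPC runtime API): a codelet runs only once its dependence counter drains to zero, does a small amount of work on local operands, then signals its consumer point to point.

```c
/* Toy codelet: fires when all inputs have arrived, works on small
 * local operands, and signals its consumer (illustrative only). */
typedef struct Codelet {
    int deps_left;              /* dependence counter */
    int in[2], out;             /* local "scratch-pad" operands */
    struct Codelet *consumer;   /* point-to-point successor, if any */
    int slot;                   /* which input of the consumer we feed */
} Codelet;

static void signal(Codelet *c, int slot, int value);

static void fire(Codelet *c) {
    c->out = c->in[0] + c->in[1];          /* fine-grained work */
    if (c->consumer)
        signal(c->consumer, c->slot, c->out);
}

static void signal(Codelet *c, int slot, int value) {
    c->in[slot] = value;
    if (--c->deps_left == 0)               /* firing rule */
        fire(c);
}

/* Build a two-codelet dataflow: c1 = 1 + 2, then c2 = c1.out + 4. */
static int codelet_demo(void) {
    Codelet c2 = { 2, {0, 0}, 0, 0, 0 };
    Codelet c1 = { 2, {0, 0}, 0, &c2, 0 };
    signal(&c1, 0, 1);          /* first input of c1 arrives */
    signal(&c1, 1, 2);          /* second input: c1 fires, feeds c2 */
    signal(&c2, 1, 4);          /* c2's other input: c2 fires */
    return c2.out;
}
```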
UHPC from the HLC Perspective
• Energy: must minimize data motion/communication
• Near-threshold voltage: must find even more parallelism
• Resilience: synergy needed with new checkpointing/recovery models
• Self-awareness: dynamic distributed feedback and regulation
Another Observation
• But programming directly in codelets is impractical:
  • Exposing machine details is a good thing, but we don't want programmers to manage them.
    – Too complicated: getting it done, getting it right, getting it fast. (Complexity = parallelism × locality × resilience …)
  • Writing directly in codelets will also over-specify the program, bake it to one machine, and defeat portability
• The role of the HLC is to take high-level abstractions from the programmer:
  – sequential code,
  – Chapel, CnC,
  – data-parallel idioms,
  – math language
• Perform optimization to various levels of the target hardware hierarchy
Based on R-Stream Technology

• Energy. Existing: locality opt; explicit comm gen; map to accelerators; hierarchical barriers; exact dependence; imperfect loops. New for codelets: deep hierarchical scheduling; point-to-point sync; data placement opts; more parallelism; dynamic schedules and placements; emit scheduling and placement hints.
• Resilience. New for codelets: emit interaction sets; ABFT support; memory reuse opt; checkpointing opt.
• High-level programming. Existing: sequential C. New for codelets: Chapel, CnC, math, data-parallel idioms.
• Self-awareness. New for codelets: dynamic mappings.
Goal: Generating CnC
• Assume a mapping from CnC → codelets
• Advantages of CnC:
  • More succinct expression of parallelism (the skewing problem)
  • Adaptable parallelism and load balancing
  • High-level representation of data-parallel idioms
    – CnC helps solve the irregular, idiomatic part of the problem
    – R-Stream can target optimizations across irregular idioms
• Easy to test the generated code for correctness and execute it efficiently on x86 / clusters
Goal: Synergy with CnC
• Represent CnC action-attribute graphs explicitly in R-Stream
• Benefit from optimization across multiple CnC steps
• Explore the tradeoff between fusing steps and running them in parallel:
  – Fused steps reduce the runtime overhead
  – And also the memory footprint
• Generate many semantically equivalent versions and explore the design-space tradeoffs
  – R-Stream's auto-tuning mode will help a lot here
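As a minimal picture of "many semantically equivalent versions" (our own illustration, not R-Stream output): every legal tile size of a loop nest is one point in the design space, each point must compute the same answer, and an auto-tuner simply times them and keeps the fastest.

```c
#define N 64

/* Reference version of a reduction over a 2-D nest. */
static long untiled(int A[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += (long)A[i][j] * (i + j);
    return s;
}

/* One point of the design space: the same nest tiled by T. Any legal
 * tile size yields a semantically equivalent version; an auto-tuner
 * would time each candidate and pick the best. */
static long tiled(int A[N][N], int T) {
    long s = 0;
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T)
            for (int i = ii; i < ii + T && i < N; i++)
                for (int j = jj; j < jj + T && j < N; j++)
                    s += (long)A[i][j] * (i + j);
    return s;
}

/* Semantic equivalence across the explored tile sizes. */
static int versions_agree(void) {
    static int A[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = i * 31 + j;
    long ref = untiled(A);
    int sizes[] = { 4, 8, 16, 5, 64 };
    for (int k = 0; k < 5; k++)
        if (tiled(A, sizes[k]) != ref) return 0;
    return 1;
}
```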
Goal: Synergy with Chapel and UIUC
• Extensions to blackboxing:
  • User interface; can represent any program
  • Supports even linking with precompiled code
• Integrate user-specified data distributions within R-Stream:
  • HTAs
  • Locales
  • Find the right abstraction
  • The goal is for R-Stream to understand the abstraction and make good mapping decisions, not to replace the user's choices
• Iterative, feedback-directed design:
  • Language / transformation tool
  • Transformation tool / runtime
  • Language / runtime
Goal: Pragmatic Approach
• Support multiple kinds of placement:
  • Explicit / implicit; virtual / physical; linear / cyclic / block-cyclic / general
• Build on R-Stream's current over-provisioning for performance:
  • Originally built for CUDA performance
  • Concepts extend to any architecture with dynamic scheduling decisions
  • Has implications for locality/communication granularity
• Examine implications for power
• Use advanced auto-tuning features for design-space exploration
• Explore which modes perform best with CnC:
  • Dependent on how over-provisioning is implemented
• Over-provisioning may have implications for memory persistence:
  • Opportunities for, or loss of, high-level reuse and communication optimizations
Goal: HLC Support for Challenge Applications
• Go beyond loop-nest optimizations
  • Chapel / data-parallel support
  • CnC attribute-action-graph optimization
• SAR
  • New locality transformations demonstrated speedups on linear flight path (reported to DARPA)
• MD
  • Exploring HLC optimization of neutral-territory methods
• Graph
  • High-level approaches to optimizing graph algorithms and increasing locality; new lock-free data-parallel algorithm for BFS
• Chess, hydrodynamics
  • TBD
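For context, the usual data-parallel starting point for BFS is level-synchronous traversal; the sketch below (our own, and deliberately not the lock-free algorithm mentioned above) shows why: each level's frontier can be expanded in parallel.

```c
#include <string.h>

#define V 6

/* Level-synchronous BFS on a small dense adjacency matrix: vertices
 * are labeled with their distance from src, one level at a time. The
 * loop over the current frontier is data-parallel within each level. */
static void bfs_levels(int adj[V][V], int src, int level[V]) {
    int frontier[V], next[V], nf = 1, nn;
    memset(level, -1, V * sizeof(int));    /* -1 = unreached */
    level[src] = 0;
    frontier[0] = src;
    for (int d = 1; nf > 0; d++) {
        nn = 0;
        for (int f = 0; f < nf; f++)       /* parallel over the frontier */
            for (int v = 0; v < V; v++)
                if (adj[frontier[f]][v] && level[v] < 0) {
                    level[v] = d;
                    next[nn++] = v;
                }
        memcpy(frontier, next, nn * sizeof(int));
        nf = nn;
    }
}

/* Path 0-1-2-3 plus edge 0-4; vertex 5 stays unreachable. */
static int bfs_demo_ok(void) {
    int adj[V][V] = {{0}};
    int lvl[V];
    adj[0][1] = adj[1][0] = 1;
    adj[1][2] = adj[2][1] = 1;
    adj[2][3] = adj[3][2] = 1;
    adj[0][4] = adj[4][0] = 1;
    bfs_levels(adj, 0, lvl);
    return lvl[0] == 0 && lvl[1] == 1 && lvl[2] == 2 &&
           lvl[3] == 3 && lvl[4] == 1 && lvl[5] == -1;
}
```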
Outline
• R-Stream Overview
• UHPC Goals
• Some Performance Results
CSLC-LMS (Mapping Across Function/Library Calls)
• Main comparisons:
  • R-Stream High-Level C Compiler 3.1.2
  • Intel MKL 10.2.1
• Dual quad-core E5405 Xeon processors (8 cores total), 9 GB memory, 8 threads
[Figure: three configurations. Configuration 1 (MKL): the radar code calls MKL. Configuration 2 (low-level compilers): the radar code is compiled directly by GCC or ICC. Configuration 3 (R-Stream): the radar code is first optimized by R-Stream, then compiled by GCC or ICC.]
CSLC-LMS (Mapping Across Function/Library Calls)
RTM (Exploiting Over-Provisioning for Performance)

25-point, 8th-order (in space) stencil:

void RTM_3D(double (*U1)[Y][X], double (*U2)[Y][X], double (*V)[Y][X],
            int pX, int pY, int pZ) {
  double temp;
  int i, j, k;
  for (k = 4; k < pZ-4; k++) {
    for (j = 4; j < pY-4; j++) {
      for (i = 4; i < pX-4; i++) {
        temp = C0 * U2[k][j][i] +
               C1 * (U2[k-1][j][i] + U2[k+1][j][i] +
                     U2[k][j-1][i] + U2[k][j+1][i] +
                     U2[k][j][i-1] + U2[k][j][i+1]) +
               C2 * (U2[k-2][j][i] + U2[k+2][j][i] +
                     U2[k][j-2][i] + U2[k][j+2][i] +
                     U2[k][j][i-2] + U2[k][j][i+2]) +
               C3 * (U2[k-3][j][i] + U2[k+3][j][i] +
                     U2[k][j-3][i] + U2[k][j+3][i] +
                     U2[k][j][i-3] + U2[k][j][i+3]) +
               C4 * (U2[k-4][j][i] + U2[k+4][j][i] +
                     U2[k][j-4][i] + U2[k][j+4][i] +
                     U2[k][j][i-4] + U2[k][j][i+4]);
        U1[k][j][i] = 2.0f * U2[k][j][i] - U1[k][j][i] + V[k][j][i] * temp;
      }
    }
  }
}
RTM (Exploiting Over-Provisioning for Performance)
• 3-D discretized wave-equation kernel with a single time iteration
• Run on an NVIDIA GTX 480
• Double precision, 256³ problem
• High performance from over-provisioning space exploration and explicit optimization of register rotation and shared-memory reuse
R-Stream to CnC Proof of Concept
• Examined feasibility and benefits of automatic coordination-language (CnC) generation from R-Stream:
  • On a 4-D stencil, in-place, kernel application
  • Coarse-grained parallelism is pipelined (i.e., wavefronts of parallel tasks) and representative of other streaming kernels
• R-Stream generates a non-trivial OpenMP version
• We manually transform this OpenMP version to CnC code
• The process is completely automatable
R-Stream to CnC Proof of Concept
Conclusion
• R-Stream simplifies software development and maintenance
• It does this by automatically parallelizing loop code
• While optimizing for data locality, coalescing, communication reuse, etc.
• Many exciting developments within UHPC