Challenges in Binary Translation for Desktop Supercomputing

David Kaeli and Rodrigo Dominguez
Department of Electrical and Computer Engineering, Northeastern University
Boston, MA
Current trends in Many-core Computing

• The CPU industry has elected to jump off the cycle-time scaling bandwagon
  – Power/thermal constraints have become a limiting factor
• We now see CPU vendors placing multiple (tens of) cores on a single chip
  – Clock speeds have not changed
• The memory wall persists, and multiple cores that assume a shared-memory model place further pressure on this problem
• Software vendors are looking for new parallelization technology
  – Multi-core aware operating systems
  – Semi-automatic parallelizing compilers
Current trends in Many-core Computing (continued)

• There has been a renewed interest in parallel computing paradigms and languages
• Existing many-core architectures are being considered for general-purpose platforms (e.g., Cell, GPUs, DSPs)
• Heterogeneous systems are becoming a common theme
• The trend will only accelerate if proper programming frameworks are available to effectively exploit many-core resources
Graphics Processors

• More than 64% of Americans played a video game in 2009
• High-end GPUs: primarily used for 3-D rendering for videogame graphics and movie animation
• Mid/low-end GPUs: primarily used for computer displays
• Manufacturers include NVIDIA, AMD/ATI, and IBM (Cell)
• A very competitive commodity market
GPU Performance

• GPUs provide a path for performance growth
• Cost and power-usage numbers are also impressive

[Chart: near-exponential growth in GPU performance. Source: NVIDIA, 2009]
Comparison of CPU and GPU Hardware Architectures

• CPU: cache-heavy, focused on individual thread performance
• GPU: ALU-heavy, massively parallel, throughput-oriented
CPU/GPU Relationship

[Diagram: the CPU (host) connected to a GPU with its own local DRAM (the device)]
A wide range of GPU apps

3D image analysis, adaptive radiation therapy, acoustics, astronomy, audio, automobile vision, bioinformatics, biological simulation, broadcast, cellular automata, fluid dynamics, computer vision, cryptography, CT reconstruction, data mining, digital cinema/projections, electromagnetic simulation, equity trading, film, financial, languages, GIS, holographic cinema, machine learning, mathematics research, military, mine planning, molecular dynamics, MRI reconstruction, multispectral imaging, N-body simulation, network processing, neural networks, oceanographic research, optical inspection, particle physics, protein folding, quantum chemistry, ray tracing, radar, reservoir simulation, robotic vision/AI, robotic surgery, satellite data analysis, seismic imaging, surgery simulation, surveillance, ultrasound, video conferencing, telescope, video, visualization, wireless, X-ray
GPU as a General Purpose Computing Platform

Speedups are impressive and ever increasing:

• Genetic algorithm: 2600x
• Real-time elimination of undersampling artifacts: 2300x
• Lattice-Boltzmann method for numerical fluid mechanics: 1840x
• Total variation modeling: 1000x
• Fast total variation for computer vision: 1000x
• Monte Carlo simulation of photon migration: 1000x
• Stochastic differential equations: 675x
• K-nearest neighbor search: 470x

Source: CUDA Zone at www.nvidia.com/cuda/
GPGPU is becoming mainstream research

Research activities are expanding significantly.

[Chart: search results for the keyword "GPGPU" in IEEE and ACM]
NVIDIA GT200 architecture

[Diagram: the Streaming Processor Array contains 10 Texture Processor Clusters (TPCs); each TPC contains 3 Streaming Multiprocessors (SMs) and a texture unit; each SM contains 8 streaming processors (SPs) and 2 special function units (SFUs). The grid of thread blocks maps onto the array, multiple thread blocks (many warps of threads) map onto an SM, and individual threads map onto SPs.]

• 240 shader cores
• 1.4B transistors
• Up to 2GB of onboard memory
• ~150 GB/sec memory bandwidth
• 1.06 TFLOPS single precision
• CUDA and OpenCL support
• Programmable memory spaces
• Tesla S1070 provides 4 GPUs in a 1U unit
AMD/ATI Radeon HD 5870

• Codename "Evergreen"
• 1600 SIMD cores
• L1/L2 memory architecture
• 153 GB/sec memory bandwidth
• 2.72 TFLOPS single precision
• OpenCL and DirectX 11 support
• Hidden memory microarchitecture
• Provides for vectorized operation
Comparison of CPU and GPU Hardware Architectures

CPU/GPU        Single precision TFLOPs   Cores   GFLOPs/Watt   $/GFLOP
NVIDIA 285     1.06                      240     5.8           $3.12
NVIDIA 295     1.79                      480     6.2           $3.80
AMD HD 5870    2.72                      1600    14.5          $0.16
AMD HD 4890    1.36                      800     7.2           $0.18
Intel i7 965   0.051                     4       0.39          $11.02

Source: NVIDIA, AMD, and Intel
AMD vs. NVIDIA

                       AMD                         NVIDIA
Hardware architecture  Vector                      Scalar
Programming language   Brook+, IL, OpenCL          CUDA, OpenCL
Programming model      SIMD vector                 SIMT
Thread hierarchy       Single level                Two level
Memory exposure        Uniform space               Multiple spaces
Source of horsepower   Vectorization and           Memory space utilization,
                       multiple output             including shared memory
Pros                   Easier programming          More flexible programming
Challenges             Harnessing the potential horsepower
Talk Outline

• Introduction to GPUs
• Overview of the tool chains for both CUDA and OpenCL
• Motivation for pursuing this work
  – Comparing intermediate representations
  – Leveraging/analyzing benefits of Open64 optimization on AMD GPUs
  – Comparing challenges with fundamentally different ISAs (superscalar SIMT versus VLIW SIMT)
• Discuss PTX and IL
• Describe the new common IR
• Two examples of PTX->IR->IL binary translation
• Discuss status of the project and future work
GPU Programming Model

• Single Instruction Multiple Threads (SIMT)
• Parallelism is implicit
• Programs (also called kernels or shaders) are generally small and contain nested loops
• Synchronization is handled explicitly
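The SIMT idea above can be sketched in plain Python: every "thread" runs the same kernel body and is distinguished only by its thread index, just as CUDA threads read threadIdx. The names vec_add_kernel and launch are illustrative, not a real runtime API.

```python
# Hypothetical sketch of SIMT execution: one kernel body, many thread
# indices. Parallelism lives in the launch, not in the kernel source.
def vec_add_kernel(tid, A, B, C):
    # Each thread handles exactly one element, selected by its index.
    C[tid] = A[tid] + B[tid]

def launch(kernel, n_threads, *args):
    # A real GPU schedules these threads in lockstep warps across many
    # cores; here they simply run one after another.
    for tid in range(n_threads):
        kernel(tid, *args)

A = [1, 2, 3, 4]
B = [10, 20, 30, 40]
C = [0] * 4
launch(vec_add_kernel, 4, A, B, C)
print(C)  # [11, 22, 33, 44]
```

Note that nothing in the kernel body says "parallel": the same source scales to any thread count, which is why the model is attractive for small, data-parallel kernels.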
Toolchains

Toolchain = compiler + runtime library

[Diagram of the two software stacks:
 NVIDIA: C for CUDA / OpenCL -> CUDA Runtime -> graphics driver -> GPU
 AMD: Brook+ / OpenCL -> CAL Runtime -> graphics driver -> GPU]
CUDA Compiler

[Diagram: at compile time, cudafe splits C for CUDA source into host code, which the host compiler builds into the exe, and GPU code, which Open64 compiles to PTX*. At execution time, the exe calls into the CUDA runtime, and the driver compiles the PTX into a GPU binary.]

* PTX is included as data in the host application
OpenCL (Dynamic) Compiler

[Diagram: at compile time, the host compiler builds the exe against the OpenCL library. At execution time, OpenCL kernel source is compiled through LLVM in the runtime/driver into a GPU binary.]
Objectives of our work

• Compare two different IRs from similar massively-threaded architectures
• Influence future IR design (an active topic in GPGPU research)
• Leverage/analyze benefits of Open64 optimizations
• Compare challenges with fundamentally different ISAs: superscalar/SIMT versus VLIW/SIMT
CUDA Runtime

• Device management: cudaSetDevice, cudaGetDevice
• Memory management
  – Allocation: cudaMalloc, cudaFree
  – Transfer: cudaMemcpy, cudaMemset
• Execution control
  – Kernel launch: cudaLaunch
  – Configuration: cudaConfigureCall
• Thread management: cudaThreadSynchronize
CUDA Runtime (Vector Add example)

__global__ void vecAdd(int A[], int B[], int C[]) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    int hA[] = {...};
    int hB[] = {...};

    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(dA, dB, dC);

    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
}

The <<<...>>> kernel launch expands into cudaConfigureCall, cudaSetupArgument, and cudaLaunch.
NVIDIA PTX

• Low-level IR (close to the ISA)
• Pseudo-assembly style syntax
• Load-store instruction set
• Strongly typed language, e.g.: cvt.s32.u16 %r1, %tid.x;
• Unlimited virtual registers
• Predicate registers
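A strongly typed, pseudo-assembly IR like this is straightforward to parse into records: the dotted opcode carries the type suffixes, and an @%p prefix marks a predicated instruction. A toy Python sketch follows; PtxInstr and parse_ptx_line are illustrative names, not any real tool's API.

```python
# Toy PTX-instruction parser: split a line into opcode, type suffixes,
# operands, and an optional guard predicate.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PtxInstr:
    opcode: str                      # e.g. "cvt"
    types: list                      # e.g. ["s32", "u16"]
    operands: list                   # e.g. ["%r1", "%tid.x"]
    predicate: Optional[str] = None  # e.g. "%p1" for "@%p1 bra ..."

def parse_ptx_line(line: str) -> PtxInstr:
    line = line.strip().rstrip(';')
    pred = None
    if line.startswith('@'):         # predicated instruction
        guard, line = line.split(None, 1)
        pred = guard.lstrip('@')
    op_part, _, rest = line.partition(' ')
    fields = op_part.split('.')      # opcode followed by type suffixes
    operands = [o.strip() for o in rest.split(',')] if rest else []
    return PtxInstr(fields[0], fields[1:], operands, pred)

print(parse_ptx_line("cvt.s32.u16 %r1, %tid.x;"))
print(parse_ptx_line("@%p1 bra $LabelA;"))
```

This is the flavor of front-end work needed before any analysis or translation can run: turn flat text into typed records that a CFG builder can consume.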
AMD IL

• High-level IR
• Structured control flow (if-endif, while-end, switch-end)
• No predication
• 32-bit registers with 4 components - enables vectorization
Common PTX and IL instructions

vectorAdd (PTX):

mov.u16       %rh1, %ctaid.x;
mov.u16       %rh2, %ntid.x;
mul.wide.u16  %r1, %rh1, %rh2;
cvt.u32.u16   %r2, %tid.x;
add.u32       %r3, %r2, %r1;
ld.param.s32  %r4, [N];
setp.le.s32   %p1, %r4, %r3;
@%p1 bra      $LabelA;
cvt.u64.s32   %rd1, %r3;
mul.lo.u64    %rd2, %rd1, 4;
ld.param.u64  %rd3, [A];
add.u64       %rd4, %rd3, %rd2;
ld.global.f32 %f1, [%rd4+0];
ld.param.u64  %rd5, [B];
add.u64       %rd6, %rd5, %rd2;
ld.global.f32 %f2, [%rd6+0];
add.f32       %f3, %f1, %f2;
ld.param.u64  %rd7, [C];
add.u64       %rd8, %rd7, %rd2;
st.global.f32 [%rd8+0], %f3;
$LabelA:
exit;

• Data movement (mov)
• Memory access (ld, st)
• Arithmetic (mul, add)
• Conversion (cvt)
• Comparison and selection (setp)
• Control flow (bra): uses predication for conditional branches
Common PTX and IL instructions (continued)

vectorAdd (IL):

mov  r0, vThreadGrpId.x
mov  r1, cb0[0].x
imul r2, r0, r1
mov  r3, vTidInGrp.x
iadd r4, r3, r2
mov  r5, cb1[3]
ige  r6, r4, r5
if_logicalz r6
  mov  r7, r4
  imul r8, r7, l0
  mov  r9, cb1[0]
  iadd r10, r9, r8
  uav_raw_load_id(0) r11, r10
  mov  r12, cb1[1]
  iadd r13, r12, r8
  uav_raw_load_id(0) r14, r13
  add  r15, r11, r14
  mov  r16, cb1[2]
  iadd r17, r16, r8
  uav_raw_store_id(0) mem.xyzw, r17, r15
endif
end

• Data movement (mov)
• Memory access (uav_raw_load, uav_raw_store)
• Arithmetic (imul, iadd)
• No conversion instructions
• Comparison and selection (ige)
• Control flow (if_logicalz): structured statements
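Much of the opcode correspondence visible between the two listings can be captured as a lookup table. The Python sketch below is illustrative only: the PTX_TO_IL mapping covers just a few opcodes from the listings above, and a real translator must also handle register allocation, type conversion, and memory spaces.

```python
# Hypothetical opcode-level PTX -> IL mapping, keyed on (opcode, type):
# integer add/mul become iadd/imul, float add stays add, and global
# loads/stores become UAV raw accesses. Not the actual translator.
PTX_TO_IL = {
    ("add", "f32"): "add",
    ("add", "u32"): "iadd",
    ("mul", "u64"): "imul",
    ("mov", "u16"): "mov",
    ("ld",  "f32"): "uav_raw_load_id(0)",
    ("st",  "f32"): "uav_raw_store_id(0)",
}

def translate(opcode, dtype, operands):
    il_op = PTX_TO_IL[(opcode, dtype)]
    # IL temporaries are untyped 4-component registers; as a stand-in,
    # just drop PTX's '%' sigils from the operand names.
    ops = ", ".join(o.lstrip('%') for o in operands)
    return f"{il_op} {ops}"

print(translate("add", "f32", ["%f3", "%f1", "%f2"]))  # add f3, f1, f2
```

The table also makes the asymmetries obvious: PTX's cvt has no IL counterpart, and PTX's predicated branches have no entry at all, which is why control flow needs the structural analysis described later.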
Ocelot Framework*

• Implemented as a CUDA library; intercepts library calls
• PTX emulation on the CPU
• Parses PTX into an internal IR
• Analysis: CFG, SSA, data flow, optimizations
• Our work:
  – IR for IL programs
  – PTX IR -> IL IR translation
  – AMD/CAL backend

* Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. Modeling GPU-CPU workloads and systems. In GPGPU '10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 31-42, New York, NY, USA, 2010. ACM.
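The interception idea can be sketched as follows. This is not Ocelot's actual code: the Backend interface and Recorder class are hypothetical stand-ins showing how a shim that implements the CUDA runtime entry points can forward every call to a pluggable backend, e.g. one that targets an AMD device instead of an NVIDIA one.

```python
# Illustrative CUDA-runtime shim: intercept API calls and forward them
# to a backend object, so the application never knows which device
# actually services its cudaMalloc/cudaLaunch requests.
class Backend:
    def malloc(self, nbytes): raise NotImplementedError
    def launch(self, kernel, grid, block, args): raise NotImplementedError

class Recorder(Backend):
    """A backend that logs calls and fakes device memory with bytearrays."""
    def __init__(self):
        self.calls = []
    def malloc(self, nbytes):
        self.calls.append(("cudaMalloc", nbytes))
        return bytearray(nbytes)          # stand-in for a device buffer
    def launch(self, kernel, grid, block, args):
        self.calls.append(("cudaLaunch", kernel))

shim = Recorder()
buf = shim.malloc(16)
shim.launch("vecAdd", (1, 1, 1), (4, 1, 1), [buf])
print([name for name, _ in shim.calls])  # ['cudaMalloc', 'cudaLaunch']
```

Because the application links against the shim rather than the vendor runtime, redirection to a different GPU requires no source changes: only the backend behind the shim is swapped.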
Translation Framework

[Diagram: at compile time, the exe feeds Ocelot's PTX parser; analysis passes run, followed by translation to IL; the CAL back-end then hands the result to the ATI driver.]
IL Control Tree

Based on structural analysis*:

• Build a DFS spanning tree of the control flow graph and traverse it in postorder
• Form regions and collapse the nodes in the CFG
• Construct the control tree in the process
• Repeat until only 1 node is left in the CFG
*S. Muchnick. Advanced Compiler Design and Implementation, chapter 7.7. Morgan Kaufmann, 1997.
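One step of this algorithm, recognizing and collapsing an if-then region, can be sketched in Python. This is a deliberately simplified schema matcher under the assumption that the CFG is a plain successor map; Muchnick's structural analysis handles many more region schemas (if-then-else, loops, improper regions) and drives them from the postorder traversal.

```python
# Minimal structural-analysis step: find a node whose two successors
# form an if-then shape (the "then" block falls through to the same
# merge point), collapse it into an abstract IF region node, and
# rewrite the graph. Repeating such steps builds the control tree.
def collapse_if_then(succs):
    """succs: dict mapping node -> list of successor nodes.
    Returns (new_graph, region) with the first if-then collapsed,
    or (succs, None) if no if-then region exists."""
    for cond, out in succs.items():
        if len(out) != 2:
            continue
        for then, merge in (out, out[::-1]):
            if succs.get(then) == [merge]:     # then falls into merge
                region = ("IF", cond, then)    # abstract region node
                new = {n: [region if s in (cond, then) else s for s in ss]
                       for n, ss in succs.items() if n not in (cond, then)}
                new[region] = [merge]
                return new, region
    return succs, None

# CFG of Example 1: entry -> setp (cond) -> {cvt, exit}; cvt -> exit
cfg = {"entry": ["setp"], "setp": ["cvt", "exit"], "cvt": ["exit"]}
cfg, region = collapse_if_then(cfg)
print(region)  # ('IF', 'setp', 'cvt')
```

After the collapse, the graph is entry -> IF -> exit; one more block collapse would leave a single node, at which point the nesting of the abstract nodes is exactly the control tree the IL emitter walks.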
IL Control Tree (continued)

[Diagram: an example control tree in which abstract nodes represent regions: an Entry node leading to a WHILE region (with cond and body children) and an IF region (with cond, true, and false children); the leaves are basic blocks (BB).]
Example 1 (if-then)

PTX:
mov.u16 ...
setp.le.s32 p1, r4, r3
@p1 bra LabelA
cvt.u64.s32 ...
LabelA:
exit

[Control tree: Entry -> Block { BB:mov.., IF (cond: BB:setp.., true: BB:cvt...), BB:exit }]
Example 1 (if-then), continued

[Control tree: Entry -> Block { BB:mov.., IF (cond: BB:setp.., true: BB:cvt...), BB:exit }]

IL:
mov ...
ige r6, r4, r5
if_logicalz r6
  mov ...
endif
end
Example 2 (for-loop)

PTX:
mov.u16 ...
setp.le.s32 p1, r5, r3
@p1 bra LabelA
cvt.u64.s32 ...
LabelB:
...
setp.lt.s32 p2, r4, r5
@p2 bra LabelB
LabelA:
exit

[Control tree: Entry+ -> Block { BB:mov.., IF (cond: BB:setp.., true: Block { BB:cvt..., WHILE (cond/body: setp ...) }), BB:exit }]
Example 2 (for-loop), continued

[Control tree: Entry+ -> Block { BB:mov.., IF (cond: BB:setp.., true: Block { BB:cvt..., WHILE (cond/body: setp ...) }), BB:exit }]

IL:
mov ...
ige r7, r4, r6
if_logicalz r7
  mov ...
  whileloop
    ...
    if_logicalz r17
      break
    endif
  endloop
endif
end
Other BT Challenges

• Pointer arithmetic in CUDA needs to be emulated in CAL
• Translating the Application Binary Interface (ABI), e.g., different calling conventions
• Architectural bitness: Tesla and Cypress are 32-bit architectures, but Fermi is 64-bit
Project Status

• Main CUDA library APIs are implemented (cudaMalloc, cudaMemcpy, cudaLaunch, etc.)
• 3 CUDA applications from the SDK are running
• Code quality is comparable to LLVM code generation
Next Steps

• Enhance translation of the control tree to support other IL constructs (e.g., switch-case)
• Implement other GPGPU abstractions (e.g., shared memory, textures)
• Handle PTX predicated instructions (since IL does not support predication directly)
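The core of that last item can be sketched as follows. In PTX, "@p bra Label" skips ahead when the predicate is true, so the fall-through code runs when the predicate is false; IL has no predication, so that code is wrapped in if_logicalz (execute when the register is zero). The helper name and string-based instruction formatting below are illustrative only.

```python
# Hypothetical sketch of rewriting a PTX guarded-branch pattern
#   @p bra Label ; <guarded code> ; Label:
# into IL's structured form
#   if_logicalz p ; <guarded code> ; endif
def branch_to_if(pred_reg, guarded_body):
    """guarded_body: IL instructions that PTX skips when @pred is taken."""
    return [f"if_logicalz {pred_reg}",
            *(f"  {ins}" for ins in guarded_body),  # indent for readability
            "endif"]

il = branch_to_if("r6", ["mov r7, r4", "add r15, r11, r14"])
for line in il:
    print(line)
```

General predicated instructions (not just branches) need more care, e.g. select-style instructions or single-instruction if bodies, which is why the slide lists them as future work.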
Summary and Future Work

• GPUs are revolutionizing desktop supercomputing
• A number of critical applications have been migrated successfully
• CUDA and OpenCL have made these platforms much more accessible for general-purpose computing
• AMD presently has the highest DP FP performance
• CUDA presently produces higher-performance code for NVIDIA
• We are developing a platform that leverages the best of both worlds