Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel...
![Page 1: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/1.jpg)
Portable Performance on Heterogeneous Architectures
Phitchaya Mangpo Phothilimthana, Jason Ansel, Jonathan Ragan-Kelley, Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
![Page 2: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/2.jpg)
Programming on Heterogeneous Architectures …
2D Convolution
2D Convolution
![Page 3: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/3.jpg)
Programming on Heterogeneous Architectures …
Sep Convolution
![Page 4: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/4.jpg)
Porting to Another System …
Sep Convolution
Sep Convolution
![Page 5: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/5.jpg)
Porting to Another System …
Sep Convolution
Sep Convolution
2D w/ local scratchpad
![Page 6: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/6.jpg)
Concrete Example: Convolution
Desktop, Server, Laptop
At kernel width = 7; at kernel width = 15
All choices are in OpenCL
![Page 7: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/7.jpg)
Search Space is Huge and Complex …
• Which devices?
• Which algorithms?
• Which memory?
• How many threads per block?
• How to divide workload?
• Transfer data to a faster device or keep the computation local?
• …
![Page 8: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/8.jpg)
Search Space is Huge and Complex …
Need to build programs to automatically adapt!
Infeasible to find the best choice manually.
Unified model-driven analysis across tool chains is hard.
![Page 9: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/9.jpg)
Portable Programming Model for Heterogeneous Architectures
Compiler that automatically converts the input program into optimized code for different devices

Runtime system that schedules tasks efficiently and manages memory cleverly
• Hybrid CPU work-stealing, GPU work-pushing model

Empirical autotuner that automatically finds the best program configuration:
• Mapping of computations to devices
• Types of memory to use
• Workload balance among devices
• Algorithms
![Page 10: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/10.jpg)
PetaBricks
[Diagram labels: PetaBricks Program, Compiler, C++ output Program, Autotuner, Training Information, Choice Configuration, Runtime System]

Original
Compiler
- dependency analysis
- task creations
- task scheduler
- C++ code gen
- etc.
Autotuner
- algorithmic choices
- parallelization techniques
- data distributions
- transformations
- etc.
Runtime System
- CPU work-stealing model

Extended for heterogeneous architectures
Compiler
- dependency analysis
- data movement analysis
- CPU/GPU task creations
- task scheduler
- C++ code gen
- OpenCL code gen
- etc.
Autotuner
- algorithmic choices
- parallelization techniques
- data distributions
- transformations
- CPU/GPU choices
- global/local memory
- CPU-GPU workload ratio
- GPU local work size
- etc.
Runtime System
- CPU work-stealing model
- GPU work-pushing model
- memory management
![Page 11: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/11.jpg)
Compiler
![Page 12: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/12.jpg)
Algorithmic Choices of Convolution
2D Convolution
[Figure: a 2D kernel with entries k_i k_j, built from the 1D kernel (k1, k2, k3), is applied at each position of the input to produce the output.]
![Page 13: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/13.jpg)
Algorithmic Choices of Convolution
2D Convolution vs. Separable Convolution
[Figure, 2D convolution: the 2D kernel (entries k_i k_j) is applied to the input to produce the output.]
[Figure, separable convolution: the 1D kernel (k1, k2, k3) is applied in two passes. Convolve Row maps input to intermediate; Convolve Column maps intermediate to output.]
![Page 14: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/14.jpg)
Language [PLDI’09]
transform SeparableConvolution
from In[w, h], Kernel[KWIDTH]
to Out[w - KWIDTH+1, h - KWIDTH+1]
{
  // Choice 1: single pass 2D convolution
  to(Out out) from(In in, Kernel kernel) {
    Convolve2D(out, in, kernel);
  }

  // Choice 2: two pass separable convolution
  to(Out out) from(In in, Kernel kernel) using(buffer[w - KWIDTH+1, h]) {
    ConvolveRows(buffer, in, kernel);
    ConvolveColumns(out, buffer, kernel);
  }
}
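The two choices in the transform compute the same output when the 2D kernel is the outer product of a 1D kernel with itself. The following is a minimal Python sketch of that equivalence, not PetaBricks code: the function names mirror the rule names above, the list-of-lists image layout is an assumption made for illustration, and the kernel is applied without flipping.

```python
def convolve_2d(inp, k):
    """Choice 1: single pass with the 2D kernel k[i] * k[j]."""
    kw = len(k)
    h, w = len(inp), len(inp[0])
    return [[sum(k[i] * k[j] * inp[y + i][x + j]
                 for i in range(kw) for j in range(kw))
             for x in range(w - kw + 1)]
            for y in range(h - kw + 1)]


def convolve_rows(inp, k):
    """First pass: 1D kernel along each row; width shrinks by KWIDTH - 1."""
    kw = len(k)
    return [[sum(k[j] * row[x + j] for j in range(kw))
             for x in range(len(row) - kw + 1)]
            for row in inp]


def convolve_columns(buf, k):
    """Second pass: 1D kernel along each column; height shrinks by KWIDTH - 1."""
    kw = len(k)
    h, w = len(buf), len(buf[0])
    return [[sum(k[i] * buf[y + i][x] for i in range(kw))
             for x in range(w)]
            for y in range(h - kw + 1)]


def separable_convolution(inp, k):
    """Choice 2: two passes through an intermediate buffer."""
    buffer = convolve_rows(inp, k)       # shape [w - KWIDTH + 1, h]
    return convolve_columns(buffer, k)   # shape [w - KWIDTH + 1, h - KWIDTH + 1]
```

The single-pass version performs KWIDTH × KWIDTH multiply-adds per output element, while the two-pass version performs roughly 2 × KWIDTH but needs the extra buffer and a second sweep over memory, which is one reason the faster choice can differ between kernel widths and machines.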
![Page 15: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/15.jpg)
Automatic OpenCL Code Generation
STEP 1: dependency analysis
Allow sequential dependency and data-parallel dependency patterns, and reject complex data dependencies.

STEP 2: syntactic conversion
Rewrite data accesses to GPU global memory.

STEP 3: GPU local memory utilization
When there is a stencil computation pattern, a GPU local-memory version of the kernel is generated.
Phase 1: work-items cooperate to load the data that will be accessed by their work-group into local memory.
Phase 2: the actual computation, derived from the basic version by replacing global memory accesses with local memory accesses.
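The two phases of the local-memory kernel can be emulated on the CPU to show the access pattern. This is a hedged Python sketch for a 1D stencil (a row convolution), not generated OpenCL: `group_size`, the strided cooperative load, and the sequential loops standing in for parallel work-items are all assumptions made for illustration.

```python
def convolve_row_global(row, k):
    """Basic version: every work-item reads its inputs from global memory."""
    kw = len(k)
    return [sum(k[j] * row[x + j] for j in range(kw))
            for x in range(len(row) - kw + 1)]


def convolve_row_local(row, k, group_size):
    """Local-memory version, with loops emulating work-groups and work-items."""
    kw = len(k)
    n_out = len(row) - kw + 1
    out = [0] * n_out
    for group_start in range(0, n_out, group_size):
        # A work-group producing n outputs needs n + kw - 1 inputs (tile plus halo).
        tile_len = min(group_size, n_out - group_start) + kw - 1
        local = [0] * tile_len

        # Phase 1: work-items cooperate, each loading a strided share of the tile.
        for lid in range(group_size):
            for t in range(lid, tile_len, group_size):
                local[t] = row[group_start + t]
        # On a real device a work-group barrier would separate the two phases.

        # Phase 2: same arithmetic as the basic version, reading local memory.
        for lid in range(group_size):
            x = group_start + lid
            if x < n_out:
                out[x] = sum(k[j] * local[lid + j] for j in range(kw))
    return out
```

Each input element is fetched from global memory once per work-group instead of up to `len(k)` times, which is where the benefit for stencil patterns comes from.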
![Page 16: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/16.jpg)
Scheduling Choices: Convolution

Before adding OpenCL
Schedule 1: Convolve2D();
Schedule 2: ConvolveRows(); ConvolveColumns();

After adding OpenCL
Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: ConvolveRows(); ConvolveColumns();
Schedule 4: ConvolveRows(); ConvolveColumns_opencl();
Schedule 5: ConvolveRows_opencl(); ConvolveColumns();
Schedule 6: ConvolveRows_opencl(); ConvolveColumns_opencl();
![Page 17: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/17.jpg)
Scheduling Choices: Convolution

Original Choices
Schedule 1: Convolve2D();
Schedule 2: ConvolveRows(); ConvolveColumns();

After adding OpenCL
Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: ConvolveRows(); ConvolveColumns();
Schedule 4: ConvolveRows(); ConvolveColumns_opencl();
Schedule 5: ConvolveRows_opencl(); ConvolveColumns();
Schedule 6: ConvolveRows_opencl(); ConvolveColumns_opencl();

After adding local mem version
Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: Convolve2D_opencl_local();
Schedule 4: ConvolveRows(); ConvolveColumns();
Schedule 5: ConvolveRows(); ConvolveColumns_opencl();
Schedule 6: ConvolveRows(); ConvolveColumns_opencl_local();
Schedule 7: ConvolveRows_opencl(); ConvolveColumns();
Schedule 8: ConvolveRows_opencl_local(); ConvolveColumns();
Schedule 9: ConvolveRows_opencl(); ConvolveColumns_opencl();
Schedule 10: ConvolveRows_opencl(); ConvolveColumns_opencl_local();
Schedule 11: ConvolveRows_opencl_local(); ConvolveColumns_opencl();
Schedule 12: ConvolveRows_opencl_local(); ConvolveColumns_opencl_local();

Local memory = scratchpad memory shared by all work-items (GPU threads) in the block
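The count of twelve follows from giving every rule three variants (CPU, OpenCL with global memory, OpenCL with local memory): 3 single-pass schedules plus 3 × 3 two-pass schedules. A small Python sketch that enumerates them is shown below; the `VARIANTS` list and `schedules` helper are illustrative names, and the enumeration order is not meant to match the slide numbering.

```python
from itertools import product

# Suffixes for the three variants of each rule:
# CPU, OpenCL with global memory, OpenCL with local memory.
VARIANTS = ["", "_opencl", "_opencl_local"]


def schedules():
    """Enumerate every schedule for the convolution transform."""
    # Single-pass choice: one rule, three variants.
    result = [[f"Convolve2D{v}()"] for v in VARIANTS]
    # Two-pass choice: each pass independently picks one of three variants.
    result += [[f"ConvolveRows{r}()", f"ConvolveColumns{c}()"]
               for r, c in product(VARIANTS, repeat=2)]
    return result


for number, steps in enumerate(schedules(), start=1):
    print(f"Schedule {number}: " + " ".join(step + ";" for step in steps))
```

Because the variants multiply across rules, the number of schedules grows quickly as transforms are composed, before any numeric parameters are even considered.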
![Page 18: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/18.jpg)
Data Movement Analysis
Goal: minimize data transfer between CPU and GPU
TRANSFORM
Input: A
Output: D, E

Task 1 (GPU)
Input: A
Output: B, C

Task 2 (CPU)
Input: B
Output: D

Task 3 (GPU)
Input: C
Output: E

• must copy-out region: B
• reused region: C
• may copy-out region: E
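The classification in this example can be reproduced with a very small rule set. The Python sketch below is a simplification and not the compiler's actual analysis: it assumes a GPU output is "must copy-out" if a later CPU task reads it, "reused" if only later GPU tasks read it, and "may copy-out" if nothing inside the transform reads it, so whether it is needed depends on code outside the transform. The `Task` type and `classify_gpu_outputs` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    device: str      # "CPU" or "GPU"
    inputs: tuple
    outputs: tuple


def classify_gpu_outputs(tasks):
    """Label each region written by a GPU task (simplified rules, see lead-in)."""
    labels = {}
    for i, task in enumerate(tasks):
        if task.device != "GPU":
            continue
        for region in task.outputs:
            # Devices of all later tasks that read this region.
            consumers = [t.device for t in tasks[i + 1:] if region in t.inputs]
            if "CPU" in consumers:
                labels[region] = "must copy-out"
            elif consumers:
                labels[region] = "reused"
            else:
                labels[region] = "may copy-out"
    return labels


tasks = [
    Task("Task 1", "GPU", ("A",), ("B", "C")),
    Task("Task 2", "CPU", ("B",), ("D",)),
    Task("Task 3", "GPU", ("C",), ("E",)),
]
print(classify_gpu_outputs(tasks))
```

Region C never leaves the GPU, which is the transfer the analysis is trying to avoid.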
![Page 19: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/19.jpg)
Runtime System
![Page 20: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/20.jpg)
Runtime System
[Diagram labels: CPU Worker (×3), GPU Manager, Randomized Work-stealing, GPU Task Pushing, Local Task Creation, Runnable Task Deques, Non-Runnable Tasks]
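The hybrid model named in the diagram can be illustrated with a toy, single-threaded simulation. This Python sketch is an assumption-heavy illustration rather than the PetaBricks runtime: `run_hybrid`, the round-based loop, and the choice to create every task on worker 0 are all hypothetical. It only shows the two mechanisms: idle CPU workers steal from a random victim's deque, while GPU tasks are never executed by CPU workers and are instead pushed to the GPU manager's queue.

```python
import random
from collections import deque


def run_hybrid(tasks, n_cpu_workers=3, seed=0):
    """tasks: list of (name, device) pairs, device being "CPU" or "GPU".

    Returns a dict mapping each executor to the task names it ran.
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(n_cpu_workers)]
    gpu_queue = deque()
    executed = {f"cpu{i}": [] for i in range(n_cpu_workers)}
    executed["gpu_manager"] = []

    # Local task creation: in this toy setup worker 0 creates every task.
    deques[0].extend(tasks)

    while any(deques) or gpu_queue:
        for i, own in enumerate(deques):
            if not own:
                # Randomized work-stealing: take the oldest task of a random victim.
                victims = [j for j, d in enumerate(deques) if d and j != i]
                if victims:
                    own.append(deques[rng.choice(victims)].popleft())
            if own:
                name, device = own.pop()      # own work: newest task first
                if device == "GPU":
                    gpu_queue.append(name)    # GPU task pushing
                else:
                    executed[f"cpu{i}"].append(name)
        # The GPU manager drains its queue in FIFO order, one task per round.
        if gpu_queue:
            executed["gpu_manager"].append(gpu_queue.popleft())
    return executed
```

Stealing suits CPU workers because any of them can run any CPU task; pushing suits the GPU because a single manager thread owns the device and its buffers.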
![Page 21: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/21.jpg)
GPU Tasks
• Prepare tasks allocate buffers on the GPU and update metadata for GPU execution.
• Copy-in tasks copy the required input data to the GPU.
• Execute tasks initiate the asynchronous execution of the kernel and perform non-blocking reads from GPU buffers.
• Copy-out completion tasks check the status of the non-blocking reads called by the execute task.
Depending on the result of data movement analysis, tasks to prepare, copy-in, execute, and copy-out completion are inserted into the schedule by the compiler.
![Page 22: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/22.jpg)
Memory Management
GPU memory is allocated and managed by the GPU management thread, which:
• keeps a table of data stored on the GPU
• releases stale buffers
• copies data back to main memory when the data is needed or flagged for eager copy-out
• handles CPU-GPU data division

Optimization
Copy-in Management
• If the data in a copy-in task is already on the GPU, change the status of the task to complete without actually executing the task.
• Otherwise, perform the required copy.
Copy-out Management
• One buffer for one output matrix.
• Multiple rules may write to the same buffer.
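The copy-in optimization amounts to a lookup in the manager's table of resident data before any transfer is issued. A minimal Python sketch follows; the `GpuManager` class, its method names, and the use of a plain list as a stand-in for a device buffer are hypothetical and chosen only to show the bookkeeping.

```python
class GpuManager:
    """Toy model of the table kept by the GPU management thread."""

    def __init__(self):
        self.resident = {}    # region name -> stand-in for a device buffer
        self.copies_done = 0  # number of host-to-device transfers performed

    def copy_in(self, region, host_data):
        """Run a copy-in task, skipping the transfer if the data is resident."""
        if region in self.resident:
            return "complete (skipped)"
        self.resident[region] = list(host_data)
        self.copies_done += 1
        return "complete (copied)"

    def release(self, region):
        """Drop a stale buffer so that a later copy-in transfers fresh data."""
        self.resident.pop(region, None)


manager = GpuManager()
print(manager.copy_in("A", [1, 2, 3]))  # first use: data is transferred
print(manager.copy_in("A", [1, 2, 3]))  # already on the GPU: no transfer
```

Marking the task complete without running it keeps the schedule unchanged, so the compiler can insert copy-in tasks conservatively and let the runtime elide the redundant ones.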
![Page 23: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/23.jpg)
CPU-GPU Workload Balancing
CPU/GPU ratio parameter statically defines how much of the data should be computed on each device.
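A static ratio can be applied by cutting the output into two contiguous parts. The sketch below is an illustrative Python helper, not the runtime's code: `split_rows` is a hypothetical name, the split along rows is an assumption, and the ratio is expressed in eighths to match the 1/8 to 8/8 range used for the tunable GPU-CPU ratio.

```python
def split_rows(n_rows, gpu_eighths):
    """Split rows [0, n_rows) between GPU and CPU for a ratio of gpu_eighths/8.

    Returns two half-open (start, end) row ranges: (gpu_range, cpu_range).
    """
    if not 1 <= gpu_eighths <= 8:
        raise ValueError("gpu_eighths must be between 1 and 8")
    gpu_rows = n_rows * gpu_eighths // 8
    return (0, gpu_rows), (gpu_rows, n_rows)


# A ratio of 6/8 sends 75% of the rows to the GPU and 25% to the CPU.
print(split_rows(1024, 6))
```

Since the ratio is fixed before the run, the right value depends on the relative speed of the two devices, which is why it is left to the autotuner rather than hard-coded.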
![Page 24: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/24.jpg)
Autotuner
![Page 25: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/25.jpg)
GPU Choice Representation
TYPE 1: decision of if and when to use the GPU
• possible to use the GPU for some input sizes and not others
• possible to have poly-algorithms that run some parts of the computation on the GPU and others on the CPU

TYPE 2: global or local memory

TYPE 3: number of work-items in work-groups (local work size)
• different for different OpenCL kernels

TYPE 4: GPU-CPU workload ratio
• different for each transform
• ranges from 1/8 to 8/8
Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: Convolve2D_opencl_local();
Schedule 4: ConvolveRows(); ConvolveColumns();
Schedule 5: ConvolveRows (); ConvolveColumns_opencl();
Schedule 6: ConvolveRows (); ConvolveColumns_opencl_local();
Schedule 7: ConvolveRows_opencl(); ConvolveColumns();
Schedule 8: ConvolveRows_opencl_local(); ConvolveColumns();
Schedule 9: ConvolveRows_opencl(); ConvolveColumns_opencl();
Schedule 10: ConvolveRows_opencl(); ConvolveColumns_opencl_local();
Schedule 11: ConvolveRows_opencl_local(); ConvolveColumns_opencl();
Schedule 12: ConvolveRows_opencl_local(); ConvolveColumns_opencl_local();
![Page 26: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/26.jpg)
GPU Choice Representation
Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: Convolve2D_opencl_local();
Schedule 4: ConvolveRows(); ConvolveColumns();
Schedule 5: ConvolveRows (); ConvolveColumns_opencl();
Schedule 6: ConvolveRows (); ConvolveColumns_opencl_local();
Schedule 7: ConvolveRows_opencl(); ConvolveColumns();
Schedule 8: ConvolveRows_opencl_local(); ConvolveColumns();
Schedule 9: ConvolveRows_opencl(); ConvolveColumns_opencl();
Schedule 10: ConvolveRows_opencl(); ConvolveColumns_opencl_local();
Schedule 11: ConvolveRows_opencl_local(); ConvolveColumns_opencl();
Schedule 12: ConvolveRows_opencl_local(); ConvolveColumns_opencl_local();
Local Work Size: 4, 9, 16, 25
GPU-CPU Ratio: 1/8, 2/8, 3/8, …, 8/8
![Page 27: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/27.jpg)
GPU Choice Representation
Schedule 1: Convolve2D();
Schedule 2: Convolve2D_opencl();
Schedule 3: Convolve2D_opencl_local();
Schedule 4: ConvolveRows(); ConvolveColumns();
Schedule 5: ConvolveRows (); ConvolveColumns_opencl();
Schedule 6: ConvolveRows (); ConvolveColumns_opencl_local();
Schedule 7: ConvolveRows_opencl(); ConvolveColumns();
Schedule 8: ConvolveRows_opencl_local(); ConvolveColumns();
Schedule 9: ConvolveRows_opencl(); ConvolveColumns_opencl();
Schedule 10: ConvolveRows_opencl(); ConvolveColumns_opencl_local();
Schedule 11: ConvolveRows_opencl_local(); ConvolveColumns_opencl();
Schedule 12: ConvolveRows_opencl_local(); ConvolveColumns_opencl_local();
Local Work Size: 4, 9, 16, 25
GPU-CPU Ratio: 1/8, 2/8, 3/8, …, 8/8
Other Parameters: …

Big Search Space!
Up to 10^1040 choices
Bottom-up evolutionary algorithm [GECCO’11]
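To make the idea of empirical search over such a space concrete, here is a generic mutate-and-select loop in Python. It is deliberately not the bottom-up evolutionary algorithm of [GECCO’11]: the `SPACE` dictionary, the `autotune` and `mutate` functions, and the population settings are all hypothetical, and `measure` stands for whatever function runs a configuration and returns its execution time.

```python
import random

# Hypothetical, heavily reduced search space using the parameters shown above.
SPACE = {
    "schedule": list(range(1, 13)),
    "local_work_size": [4, 9, 16, 25],
    "gpu_eighths": list(range(1, 9)),
}


def mutate(config, rng):
    """Copy a configuration and re-draw one randomly chosen parameter."""
    child = dict(config)
    key = rng.choice(sorted(SPACE))
    child[key] = rng.choice(SPACE[key])
    return child


def autotune(measure, generations=200, population=4, seed=0):
    """Keep the fastest configurations, mutate them, and repeat."""
    rng = random.Random(seed)
    pop = [{k: rng.choice(v) for k, v in SPACE.items()}
           for _ in range(population)]
    pop.sort(key=measure)
    for _ in range(generations):
        candidates = pop + [mutate(c, rng) for c in pop]
        candidates.sort(key=measure)   # lower measured time is better
        pop = candidates[:population]  # parents survive, so the best never regresses
    return pop[0]
```

In use, `measure` would compile or load the configuration and time it on the target machine, so the same loop yields different winners on the desktop, server, and laptop without any machine model.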
![Page 28: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/28.jpg)
Experimental Results
![Page 29: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/29.jpg)
Experimental Results
• Convolution
• Black-Scholes
• Poisson2D SOR
• Sort
• Strassen
• Tridiagonal Solver
• Singular Value Decomposition
![Page 30: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/30.jpg)
Experiment: Convolution
• Autotune on each machine
• Test each configuration on every machine (cross-run)
• Normalize execution time by the best config

Lower is better.

Desktop config: separable convolution w/ local memory on GPU
Server config: separable convolution on OpenCL
Laptop config: 2D convolution w/ local memory on GPU
Hand-coded OpenCL
![Page 31: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/31.jpg)
Experiment: Strassen (Matrix Multiply)
The right configuration can provide a huge performance improvement.
16.5x
Desktop config: data parallel on GPU
Server config: recursive decomposition -> LAPACK on CPU
Laptop config: LAPACK on CPU
Hand-coded OpenCL
![Page 32: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/32.jpg)
Experiment: Poisson 2D SOR
The optimal placement on one machine is almost the opposite of that on another.
Desktop config: split on CPU, compute on GPU
Server config: split on OpenCL, compute on CPU
Laptop config: split on CPU, compute on GPU
![Page 33: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/33.jpg)
Experiment: Tridiagonal Solver
Algorithmic choice dramatically affects performance.
Desktop config: cyclic reduction on GPU
Server config: direct solve on CPU
Laptop config: direct solve on CPU
![Page 34: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/34.jpg)
Experiment: Sort
It is not always best to use accelerators.
Desktop config: 2MS -> QS -> 4MS -> IS on CPU
Server config: 4MS -> 2MS -> IS on CPU
Laptop config: 4MS -> 2MS -> 4MS -> IS on CPU
GPU-only config: bitonic sort
Hand-coded OpenCL: radix sort
![Page 35: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/35.jpg)
Experiment: SVD
GPU-CPU task parallel division on some machines
Desktop config: task parallelism between CPU/GPU
Server config: all on CPU
Laptop config: all on CPU
![Page 36: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/36.jpg)
Experiment: Black-Scholes
GPU-CPU task workload division on some machines
Desktop config: all on GPU
Server config: all on OpenCL
Laptop config: 25% on CPU, 75% on GPU
![Page 37: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/37.jpg)
Choice Differences Across Machines
[Table. Rows: Convolution, Strassen, SOR, Tridiagonal Solver, Sort, SVD, Black-Scholes. Columns: Devices (C++/OpenCL), Algorithms, GPU-CPU ratio, GPU/CPU task parallelism, Global/local memory.]
![Page 38: Portable Performance on Heterogeneous Architectures Phitchaya Mangpo Phothilimthana Jason Ansel Jonathan Ragan-Kelley Saman Amarasinghe Computer Science.](https://reader033.fdocuments.in/reader033/viewer/2022042702/56649ce35503460f949afd3b/html5/thumbnails/38.jpg)
Best algorithms and mapping strategies on one system are often not the same on another.
Model-driven analysis alone is not enough.
Empirical exploration is essential when faced with programs and machines of ever-increasing complexity.