Harnessing OpenCL in Modern Coprocessors

Unai Lopez-Novoa, unai.lopez@ehu.es

06 Aug 2014

Intelligent Systems Group

University of the Basque Country UPV/EHU

Outline

• Previous work

• Work @ UniMan: Relational Join

  1. Motivation

  2. Algorithm

  3. Results

  4. Conclusions

About Myself

• PhD Student @ Intelligent Systems Group: 2011 – Now

• Research interest: Efficient use of modern coprocessors

  • Performance modeling

  • Code acceleration

• Development of parallel implementations

  • Molecular Dynamics simulation code (MSc thesis)

  • Kernel Density Estimation (under review)

  • Relational Join (Work @ UniMan)

Kernel Density Estimation

• Estimate the Probability Density Function of a population

• Our use case: Climate models

• Challenge: large volumes of data

[Figure: histogram and KDE of the same data, side by side]

Kernel Density Estimation

• 1st: Algorithmic rework

• 2nd: Parallel implementation: multi/many core processors

  • Compared to R+MKL and CUDA implementations

Naive approach

  for each evaluation_point e
    for each sample s
      d = distance(e, s)
      pdf[e] += density(d)

Our approach

  B = computeBoundingBox()
  for each sample s
    b = fitBoundingBox(B, s)
    for each evaluation_point e in b
      d = distance(e, s)
      pdf[e] += density(d)
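The two loops above can be compared with a minimal Python sketch. It assumes 1-D data, a regular grid of evaluation points, and an Epanechnikov kernel, whose compact support means each sample only affects grid points within one bandwidth `h`. The kernel choice, the function names, and the sample values are illustrative assumptions, not taken from the slides.

```python
import math

def epanechnikov(u):
    """Compact-support kernel: zero whenever |u| >= 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def kde_naive(grid, samples, h):
    """Every evaluation point visits every sample."""
    pdf = [0.0] * len(grid)
    for i, e in enumerate(grid):
        for s in samples:
            pdf[i] += epanechnikov((e - s) / h)
    return [p / (len(samples) * h) for p in pdf]

def kde_bounded(grid, samples, h):
    """Each sample only updates grid points inside its box [s - h, s + h]."""
    x0 = grid[0]
    step = grid[1] - grid[0]
    pdf = [0.0] * len(grid)
    for s in samples:
        # Conservative index range: one extra point on each side is harmless,
        # because the kernel evaluates to zero outside its support.
        lo = max(0, math.floor((s - h - x0) / step))
        hi = min(len(grid) - 1, math.ceil((s + h - x0) / step))
        for i in range(lo, hi + 1):
            pdf[i] += epanechnikov((grid[i] - s) / h)
    return [p / (len(samples) * h) for p in pdf]

grid = [i * 0.1 for i in range(101)]      # evaluation points on [0, 10]
samples = [1.0, 2.5, 2.7, 6.0, 9.3]       # hypothetical data
naive = kde_naive(grid, samples, h=0.5)
bounded = kde_bounded(grid, samples, h=0.5)
```

Both functions produce the same density estimate; the second one simply skips the evaluation points that a sample cannot influence, which is where the savings come from when the grid is large.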

Work @ UniMan

Slide based on: Wu, Lisa, et al. "Navigating big data with high-throughput, energy-efficient data partitioning." Proc. of the 40th Annual International Symposium on Computer Architecture. ACM, 2013.

Do sunblock sales correlate with weather?

[Figure: the Sales and Weather tables combined by Join-Date(Sales,Weather)]

• Join is an everyday operation

Goal: Develop a parallel implementation of relational join targeting today's heterogeneous systems

Heterogeneous systems

• Performance depends on the nature of the application

• Multi-core: 16 cores, 250 GFLOP/s

• Many-core: 61 cores, 1 TFLOP/s

• GPU: 2880 cores, 1.3 TFLOP/s

[Figure: the devices placed on a spectrum from "Complex control flow" to "Number crunching"]

• Wide variety of programming environments in HPC

  • OpenMP, CUDA, MPI, TBB, …

• Our choice: OpenCL

  • Write once

  • Compile (NVIDIA SDK, Intel SDK, AMD SDK)

  • Run many

• Cross-platform portability != Performance portability

  • OpenCL: Abstraction layer

• Solution 1: per-device hand-made tuning

• Not portable at all

• Solution 2: auto tuning

• Rely on performance models

Previous work

• Collection of performance modeling proposals for latest GPUs and Intel Xeon Phi

• Comprehensive analysis of the literature since ~2007

• Organized as:

  • Execution time estimation

  • Bottleneck highlighting

  • Power consumption estimation

  • Simulators

Unai Lopez-Novoa et al. "A Survey of Performance Modeling and Simulation Techniques for Accelerator-based Computing." IEEE Transactions on Parallel and Distributed Systems, DOI: 10.1109/TPDS.2014.2308216

Types of Join

• Inner

• Left Outer

• Right Outer

• Full Outer

[Figure: each join type illustrated on Table A and Table B]
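As a small illustration of how the four join types differ, the Python sketch below joins two tables stored as dictionaries. The `join` helper, the unique-key assumption, and the table contents are hypothetical and chosen only for this example.

```python
def join(a, b, how="inner"):
    """Join two tables stored as {key: value} dicts (unique keys assumed).

    Returns (key, a_value, b_value) rows sorted by key; None marks the
    side that has no matching row.
    """
    if how == "inner":
        keys = a.keys() & b.keys()   # keys present in both tables
    elif how == "left":
        keys = a.keys()              # every key of A
    elif how == "right":
        keys = b.keys()              # every key of B
    elif how == "full":
        keys = a.keys() | b.keys()   # keys present in either table
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(k, a.get(k), b.get(k)) for k in sorted(keys)]

table_a = {100: "a0", 101: "a1"}
table_b = {100: "b0", 102: "b2"}
inner = join(table_a, table_b, "inner")
left = join(table_a, table_b, "left")
right = join(table_a, table_b, "right")
full = join(table_a, table_b, "full")
```

Only key 100 appears in both tables, so it is the single row of the inner join; the outer variants add the unmatched rows of A, of B, or of both.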

Algorithm

• Biggest debate: Sort or Hash?

Hash-join

• Procedure: 1. Hash smaller table 2. Scan larger table

• Complexity: O(n + m)

• Limitation: Extensive use of atomics prevents efficient parallelization

Sort-join

• Procedure: 1. Sort keys 2. Scan interleaved

• Complexity: O(n·log(n))

• Limitation: Sorting increases complexity
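The hash-join procedure (hash the smaller table, then scan the larger one) can be sketched sequentially in Python as follows. The function name and the sample tables are assumptions made for illustration; this is not the implementation evaluated in the talk.

```python
def hash_join(small, large):
    """Inner hash join over lists of (key, payload) tuples."""
    # Build phase: hash the smaller table, one insertion per row.
    buckets = {}
    for key, payload in small:
        buckets.setdefault(key, []).append(payload)
    # Probe phase: scan the larger table once and look each key up.
    result = []
    for key, payload in large:
        for match in buckets.get(key, []):
            result.append((key, match, payload))
    return result

weather = [(1, "sunny"), (2, "rain")]           # smaller table
sales = [(1, 30), (1, 12), (2, 5), (3, 7)]      # larger table
rows = hash_join(weather, sales)
```

In a parallel version many threads would insert into the shared buckets during the build phase, and that shared update is where the atomic operations mentioned above come in.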

Algorithm

• Step 1: Sort keys in both tables

• Radix sort: speed/scalability sweet spot
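For readers unfamiliar with radix sort, here is a minimal sequential sketch of the least-significant-digit variant in Python. The slides do not say which variant or digit width was used, so the 8-bit digits, the 32-bit key width, and the function name are assumptions.

```python
def radix_sort(keys, key_bits=32, digit_bits=8):
    """LSD radix sort for non-negative integers smaller than 2**key_bits."""
    radix = 1 << digit_bits
    mask = radix - 1
    for shift in range(0, key_bits, digit_bits):
        # Histogram of the current digit.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum: first output slot of each digit value.
        offsets = [0] * radix
        for d in range(1, radix):
            offsets[d] = offsets[d - 1] + counts[d - 1]
        # Stable scatter into a fresh output buffer.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys

unsorted_keys = [102, 100, 7, 2**31, 100, 0]
sorted_keys = radix_sort(unsorted_keys)
```

Each pass is a histogram, a prefix sum, and a scatter, three patterns that map well onto data-parallel devices, and the number of passes depends only on the key width.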

Algorithm

• Step 2: Merge

• Add non-matching keys for outer joins

[Figure: sorted keys of Table A and Table B (100, 102) merged into "Result – Inner Join"]
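The merge step can be sketched as a single interleaved scan over both sorted key lists. The Python version below assumes unique keys in each table; the `keep_a` and `keep_b` flags and the sample keys are hypothetical names and values used to show how non-matching keys are added for outer joins.

```python
def merge_join(a_keys, b_keys, keep_a=False, keep_b=False):
    """Merge two sorted lists of unique keys into (a_key, b_key) rows.

    keep_a / keep_b also emit the non-matching keys of that table, with
    None on the other side, turning the inner join into an outer join.
    """
    rows = []
    i = j = 0
    while i < len(a_keys) and j < len(b_keys):
        if a_keys[i] == b_keys[j]:
            rows.append((a_keys[i], b_keys[j]))
            i += 1
            j += 1
        elif a_keys[i] < b_keys[j]:
            if keep_a:
                rows.append((a_keys[i], None))
            i += 1
        else:
            if keep_b:
                rows.append((None, b_keys[j]))
            j += 1
    # One table is exhausted: whatever remains in the other cannot match.
    if keep_a:
        rows.extend((k, None) for k in a_keys[i:])
    if keep_b:
        rows.extend((None, k) for k in b_keys[j:])
    return rows

a_keys = [100, 101, 104]
b_keys = [100, 102, 104, 105]
inner = merge_join(a_keys, b_keys)
left = merge_join(a_keys, b_keys, keep_a=True)
full = merge_join(a_keys, b_keys, keep_a=True, keep_b=True)
```

Because both inputs are already sorted, the scan touches each key once, so the cost of the whole sort-join is dominated by the sorting step.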

Implementation

• Steps:

  1) Develop a naive OpenCL implementation

  2) Optimize per device type

  3) Add a cost model for load balancing and partitioning

• Experimental setup:

  • M1: Xeon with 4 cores (x2 SMT) + Xeon Phi + GPU with 384 cores

  • M2: Xeon with 12 cores (x2 SMT) + Xeon Phi + GPU with 2496 cores

• Baseline: ModernGPU (CUDA)

Results

Per-device tuning

• Optimizations:

  • Thread scheduling

  • Memory management

• Overheads:

  • Compilation

  • Memory allocation

Optimizations

• Per device thread scheduling

[Figure: the threads and groups of an OpenCL kernel mapped onto OpenCL devices: cores 0 to 3 of a four core CPU and cores 0 to 60 of a 61 core Xeon Phi]

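Optimizations

• Per device memory management

OpenCL device memory hierarchy

• Private: scope is a single thread

• Local: scope is a thread-group

• Global: scope is any thread

[Figure: hardware backing each memory level per device; one device maps Private / Local / Global to Registers / On-chip / RAM, while the other rows show only Registers and RAM]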

Overheads

• Compilation

  • Online compilation: X% of runtime (without I/O)

• Memory allocation

  • Intel SDK: Y% of Merge Step in Xeon Phi

OpenCL Program

• Host code: offline compilation (gcc)

• Device code: online compilation (SDK)

Results

Future work

1) Finish tuning per device code

2) Test join in FPGA

3) Revisit partitioning strategy

4) Support multi-device execution

  • Develop a cost model that characterizes Join

  • Split the workload at runtime among the available devices
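One simple way to split a workload among devices is in proportion to an estimated per-device throughput. The Python sketch below is a hypothetical illustration of that idea: the `split_workload` function, the device names, and the throughput weights are all assumptions, and a real cost model for Join would have to supply the estimates.

```python
def split_workload(n_rows, throughput):
    """Split n_rows among devices in proportion to estimated throughput.

    throughput: {device_name: relative_speed}, hypothetical values that a
    cost model would provide. Returns {device_name: (start, end)} with
    half-open row ranges that cover [0, n_rows) without gaps.
    """
    total = sum(throughput.values())
    devices = list(throughput)
    ranges = {}
    start = 0
    for idx, dev in enumerate(devices):
        if idx == len(devices) - 1:
            end = n_rows  # last device absorbs the rounding remainder
        else:
            end = start + int(n_rows * throughput[dev] / total)
        ranges[dev] = (start, end)
        start = end
    return ranges

# Hypothetical relative speeds, not measured values.
parts = split_workload(1_000_000, {"cpu": 1.0, "xeon_phi": 2.0, "gpu": 5.0})
```

A static split like this ignores transfer costs and data skew, which is part of what a cost model characterizing Join would need to capture.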

Conclusions

• Performance: device specific code

• Performance portability:

a) Platform specific code

b) Parameterizable code

• High OpenCL SDK dependence

  • Only portable debugging tool: printf

• …but still the only portable framework

  • Future: OpenACC / OpenMP 4.0 ?
