Desktop supercomputing: Harnessing the power of...


Transcript of "Desktop supercomputing: Harnessing the power of accelerators"

Page 1: Desktop supercomputing: Harnessing the power of accelerators (babrodtk.at.ifi.uio.no/files/publications/brodtkorb...) [1 Brodtkorb et al., State-of-the-art in heterogeneous computing, 2010]

Technology for a better society 1

Desktop supercomputing:

Harnessing the power of accelerators

Lessons learned from 10 years with GPUs

Page 2:

Talk outline

• Parallel computing on your desktop

• Motivation for parallelism

• Parallel algorithm design

• Lessons learned from working with multi- and many-core processors

• Ca 2005: OpenGL

• Ca 2007: CUDA and the Cell BE SDK

• Ca 2014: iPython and pyopencl

• Summary

Page 3:

Motivation for going parallel

Page 4:

History lesson: development of the microprocessor 1/2

1942: Digital electronic computer (Atanasoff and Berry)

1947: Transistor (Shockley, Bardeen, and Brattain; Nobel Prize 1956)

1958: Integrated circuit (Kilby; Nobel Prize 2000)

1971: Microprocessor (Hoff, Faggin, Mazor)

1971-: Exponential growth (Moore, 1965)

Page 5:

History lesson: development of the microprocessor 2/2

1971: 4004, 2300 transistors, 740 KHz

1982: 80286, 134 thousand transistors, 8 MHz

1993: Pentium P5, 1.18 million transistors, 66 MHz

2000: Pentium 4, 42 million transistors, 1.5 GHz

2010: Nehalem, 2.3 billion transistors, 8 cores, 2.66 GHz

Page 6:

Today's multi- and many-core processor designs

• Today, we have
• 6-60 processors per chip
• 8- to 32-wide SIMD instructions
• A combination of SISD, SIMD, and MIMD on a single chip
• Heterogeneous cores (e.g., CPU+GPU on a single chip)

Multi-core CPUs: x86, SPARC, Power 7
Accelerators: GPUs, Xeon Phi
Heterogeneous chips: Intel Haswell, AMD APU

Page 7:

Heterogeneous Architectures

• In addition to heterogeneous chips, we can have heterogeneous systems
• Accelerators are connected to the CPU via the PCI Express bus
• Slow: 15.75 GB/s in each direction
• Accelerator memory is limited but fast
• Typically on the order of 10 GB
• Up to 340 GB/s!
• Fixed size: cannot be expanded with new DIMMs (unlike CPU memory)

[Diagram: multi-core CPU with main memory (up to 64 GB, ~60 GB/s) connected over a ~30 GB/s bus to an accelerator with device memory (up to 6 GB, ~340 GB/s)]

Page 8:

• 1970-2004: Frequency doubles every 34 months (Moore’s law for performance)

• 1999-2014: Parallelism doubles every 30 months

End of frequency scaling

[Graph: desktop processor performance (SP), 1970-2015, log scale. 1971-2004: frequency doubles every ~34 months; 2004-2014: frequency constant; 1999-2014: parallelism doubles every ~30 months via SSE (4x), Hyper-Threading (2x), multi-core (2-6x), and AVX (2x)]

Page 9:

• Heat density approaching that of nuclear reactor core: Power wall

• Traditional cooling solutions (heat sink + fan) insufficient

• Industry solution: multi-core and parallelism!

What happened in 2004?

Original graph by G. Taylor, “Energy Efficient Circuit Design and the Future of Power Delivery” EPEPS’09

[Graph: power density (W/cm²) vs. critical dimension (µm)]

Page 10:

Why Parallelism?

The power density of microprocessors is proportional to the clock frequency cubed:1

[Chart: frequency, performance, and power for a single-core design (100% / 100% / 100%) versus a dual-core design at reduced frequency (85% frequency, ~90% performance, ~90% power per core)]

1 Brodtkorb et al. State-of-the-art in heterogeneous computing, 2010
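The cubic relation is not derived on the slide, but it follows from the standard dynamic power model: switching power is proportional to capacitance, voltage squared, and frequency, and since the supply voltage must scale roughly with the frequency, power density grows as the cube of the clock:

```latex
P_{\text{dyn}} = \alpha C V^2 f, \qquad V \propto f
\quad\Longrightarrow\quad
P_{\text{dyn}} \propto f^3
```

This is why a modest frequency reduction frees enough power to add a second core.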

Page 11:

Massive Parallelism: The Graphics Processing Unit

• Up to 5760 floating point operations in parallel!
• 5-10 times as power efficient as CPUs!

[Graphs: GPU memory bandwidth (GB/s, up to ~350) and single-precision gigaflops (up to ~6000), 2000-2015]

Page 12:

Parallel algorithm design

Page 13:

Why care about computer hardware?

• The key to performance is to consider the full interaction between algorithm and architecture.
• Good knowledge of both the algorithm and the computer architecture is required.

Graph from David Keyes, Scientific Discovery through Advanced Computing, Geilo Winter School, 2008

Page 14:

Algorithmic and numerical performance

• Total performance is the product of algorithmic and numerical performance
• Your mileage may vary: algorithmic performance is highly problem dependent
• Many algorithms have low numerical performance
• They are only able to utilize a fraction of the capabilities of processors, and often do worse in parallel
• Maximum performance requires considering both the algorithm and the architecture

[Graph: numerical performance vs. algorithmic performance]

Page 15:

Parallel considerations 1/4

• Most algorithms contain a mixture of workloads:
• Some serial parts
• Some task and/or data parallel parts
• Amdahl's law:
• There is a limit to the speedup offered by parallelism
• Serial parts become the bottleneck on a massively parallel architecture!
• Example: if 5% of the code is serial, the maximum speedup is 20 times!

S: Speedup; P: Parallel portion of code; N: Number of processors
Graph from Wikipedia, user Daniels220, CC-BY-SA 3.0
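In the notation of the legend above, Amdahl's law (the formula behind the graph) can be written out:

```latex
S(N) = \frac{1}{(1 - P) + \frac{P}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - P}.
```

With P = 0.95 (5% serial code), the limit is 1/0.05 = 20, matching the example above.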

Page 16:

Parallel considerations 2/4

• Gustafson's law:
• If you cannot reduce the serial part of the algorithm, make the parallel portion dominate the execution time
• Essentially: solve a bigger problem to get a bigger speedup!

S: Speedup; P: Number of processors; α: Serial portion of code
Graph from Wikipedia, user Peahihawaii, CC-BY-SA 3.0
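With the legend above, Gustafson's law can be written out: the scaled speedup grows linearly with the number of processors, losing only the serial fraction α per added processor:

```latex
S(P) = P - \alpha\,(P - 1).
```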

Page 17:

Parallel considerations 3/4

• A single precision number is four bytes
• You must perform over 60 operations for each float read on a GPU!
• Over 25 operations on a CPU!
• This divides algorithms into two classes:
• Memory bound (example: matrix multiplication)
• Compute bound (example: computing π)
• The third limiting factor is latencies
• Waiting for data
• Waiting for floating point units
• Waiting for ...

[Graph: optimal FLOPs per byte (SP) for CPU and GPU, 2000-2015, ranging from 0 to 18]
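The FLOPs-per-float figures can be recomputed from peak numbers. The sketch below uses illustrative peak values (assumptions roughly matching a ca. 2014 GPU and CPU, not readings from the slide's graph):

```python
# Machine balance: how many floating point operations must each loaded
# float "pay for" before a processor becomes memory bound?

def flops_per_float(peak_gflops, bandwidth_gbs, bytes_per_element=4):
    """FLOPs needed per element read to keep the ALUs busy."""
    flops_per_byte = peak_gflops / bandwidth_gbs  # both in giga-units
    return flops_per_byte * bytes_per_element     # a float is 4 bytes

# Illustrative (assumed) peak numbers:
gpu_ops = flops_per_float(peak_gflops=4600, bandwidth_gbs=290)  # over 60
cpu_ops = flops_per_float(peak_gflops=450, bandwidth_gbs=60)    # 30
```

Any algorithm performing fewer operations per float than this balance point is memory bound on that processor.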

Page 18:

Parallel considerations 4/4

• Moving data has become the major bottleneck in computing.
• Downloading 1 GB from Japan to Switzerland consumes roughly the energy of one charcoal briquette.1
• A FLOP costs less than moving one byte.2
• Key insight: flops are free, moving data is expensive

1 Energy content of charcoal: 10 MJ/kg; kWh per GB: 0.2 (Coroama et al., 2013); weight of a charcoal briquette: ~25 grams
2 Horst Simon, Why we need Exascale, and why we won't get there by 2020, 2014

Page 19:

Lessons learned from working with multi- and many-core processors

Page 20:

Ca 2005: GPUs with OpenGL: GPGPU

Page 21:

Early Programming of GPUs

• GPUs were first programmed using OpenGL and other graphics languages
• Mathematics was written as operations on graphical primitives
• Extremely cumbersome and error prone

[Figure: element-wise matrix multiplication expressed with geometry, inputs A and B, and an output buffer]

Page 22:

Examples of Early GPU Research at SINTEF

• Preparation for FEM (~5x)
• Euler equations (~25x)
• Marine acoustics (~20x)
• Self-intersection (~10x)
• Registration of medical data (~20x)
• Fluid dynamics and FSI (Navier-Stokes)
• Inpainting (~400x Matlab code)
• Water injection in a fluvial reservoir (~20x)
• Matlab interface
• Linear algebra
• SW equations (~25x)

Page 23:

Matrix-matrix multiplication with OpenGL

• Larsen & McAllister demonstrated that using GPUs through non-programmable OpenGL could be faster than using ATLAS [1]
• Their algorithm was "simple":
• Matrix-matrix multiplication dots row i of matrix A with column j of matrix B to produce element (i, j)
• We can formulate this product for a virtual cube of processors
• Processor (m, n, o) computes the product A[m, o]*B[o, n]
• By summing along the o dimension, the matrix product is complete
• L&M used textures and blending to implement the virtual cube of processors algorithm

[1] Fast matrix multiplies using graphics hardware, Larsen and McAllister, 2001
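The virtual-cube formulation can be sketched in a few lines: each pass accumulates one rank-1 slice, which is the role blending played in Larsen & McAllister's OpenGL implementation. Plain Python stands in for the GPU here:

```python
def cube_matmul(A, B):
    """Multiply matrices via the 'virtual cube of processors' view:
    processor (m, n, o) contributes A[m][o] * B[o][n] to C[m][n],
    and summing along the o axis completes the product."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for o in range(K):                 # one "rendering pass" per cube slice
        for m in range(M):
            for n in range(N):
                C[m][n] += A[m][o] * B[o][n]   # accumulation = blending
    return C
```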

Page 24:

PLU factorization with OpenGL

• Programmable OpenGL came in 2004, which enabled more advanced algorithms.
• Algorithms still had to be mapped to operations on graphical primitives, to trick the GPU into rendering mathematics.
• Errors such as for-loops that silently execute only the first 256 iterations were common
• Painstakingly difficult to develop

Page 25:

Easy access to GPU resources

• As part of my master's thesis, I developed a Matlab interface to GPU implementations of linear algebra operations.
• Operations would be enqueued from Matlab onto the GPU
• The GPU would execute the queue of operations in the background
• Matlab would only stall if the GPU was not finished when asked for results. This also enabled heterogeneous computing!
• Thin Matlab scripts enabled transparent use of the GPU: just use gpuMatrix(a) to use the GPU functions!
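The deferred-execution idea can be illustrated in a few lines of Python. This is a sketch of the concept only; the class and method names are hypothetical, not the thesis' actual API:

```python
class GpuMatrix:
    """Operations are enqueued, not executed; only get() blocks."""

    def __init__(self, data, pending=None):
        self._data = data
        self._pending = pending or []   # queued (op, argument) pairs

    def scale(self, factor):            # returns immediately
        return GpuMatrix(self._data, self._pending + [("scale", factor)])

    def add(self, value):               # returns immediately
        return GpuMatrix(self._data, self._pending + [("add", value)])

    def get(self):
        """The only stalling call: flush the queue, then return data."""
        data = list(self._data)
        for op, arg in self._pending:
            if op == "scale":
                data = [x * arg for x in data]
            elif op == "add":
                data = [x + arg for x in data]
        return data

m = GpuMatrix([1.0, 2.0, 3.0]).scale(2.0).add(1.0)  # no work done yet
result = m.get()                                    # queue flushed here
```

Because control returns to the caller immediately, the host can do other work while the device drains the queue, which is what enabled heterogeneous computing.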

Page 26:

Ca 2007: CUDA and the Cell BE SDK

Page 27:

My first encounter with CUDA

• May 1st, 2007: I handed in my master's thesis.

• June 15th, 2007: I held the oral presentation of my thesis and received my grade.

• June 23rd, 2007: CUDA was officially released.

Most of my thesis was officially obsolete.

Page 28:

NVIDIA CUDA

• CUDA solved the major problems with OpenGL

• Unstable drivers and different results on different hardware (it only runs on NVIDIA)

• An uncomfortable programming model

• We could now program in a C-like language

• Rapid development of a whole range of new applications

• A huge interest in GPUs emerged

• The first versions, however, had as many compiler bugs as OpenGL

Page 29:

The benefit of CUDA

OpenGL "kernel launch":
• Render primitives that cover the part of the screen that represents your computational domain
• The shader that colours each pixel performs the wanted calculation

CUDA kernel launch:
• "Pixels" are still the underlying primitive, but now we execute a grid of blocks
• Each block runs independently, but threads within a block can collaborate
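The grid-of-blocks model can be mimicked in plain Python to show how each thread finds its data element. In real CUDA the global index is blockIdx.x * blockDim.x + threadIdx.x, evaluated by thousands of threads in parallel; here the loops are sequential, as a sketch only:

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for a CUDA kernel launch over a 1D grid."""
    for block_idx in range(grid_dim):         # blocks run independently
        for thread_idx in range(block_dim):   # threads within a block
            kernel(block_idx, block_dim, thread_idx, *args)

def double_kernel(block_idx, block_dim, thread_idx, data):
    i = block_idx * block_dim + thread_idx    # global index, CUDA-style
    if i < len(data):                         # guard: grid may overshoot
        data[i] *= 2

data = list(range(10))
launch(double_kernel, 3, 4, data)   # grid of 3 blocks, 4 threads each
```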

Page 30:

CUDA fever

• CUDA became a superstar in the academic camp "over night".

• By 2010, there were over 1000 demos, papers, and commercial applications using CUDA on the CUDA Showcase.

• AMD tried countering with Close-to-the-metal (assembly for AMD GPUs) and Brook+, but neither received any noticeable attention.

[Graph: Google search trends]

Page 31:

Exploring CUDA

• CUDA sparked a whole new range of research articles on GPUs

• Hardware exploration

• How was texture memory laid out?

• How large were the different caches?

• Would it be better to use more registers and less shared memory or not?

• Are more threads always better?

• Massive focus on memory movement

• Coalesced reads and writes

• Cached versus non-cached reads

• Texture reads versus cached reads

• Low-hanging fruit was rapidly picked

• More and more articles solved real-world problems

• Proof-of-concept slowly became less interesting

Page 32:

The Cell BE

• In 2007, Sony, Toshiba and IBM released the Cell BE

• Consisted of one regular low-power CPU and eight low-power SIMD units

• Originally used in the PlayStation console and the Roadrunner supercomputer: the first petaflop supercomputer!

• There was a lot of excitement in the GPGPU community: another low-cost, easily available accelerator was here!

[Diagram: Cell BE architecture with the EIB (Element Interconnect Bus)]

Page 33:

The Cell BE

• We spent a lot of time exploring this new architecture, implementing a range of different algorithms

[Images: heat equation, heat equation inpainting, JPEG compression, Mandelbrot]

Page 34:

The Cell BE

• The Cell BE was a nightmare

• You had to move data between cores manually, and know an infinite amount of hardware details to achieve performance

• If you did anything wrong, you would get the infamous "bus error", the evil twin of "segmentation fault"

• Game developers reported the same for the PlayStation console

• IBM originally had plans for an updated version with more performance, but these plans were scrapped early on

Page 35:

Ca 2014: iPython and pyopencl

Page 36:

OpenCL

• OpenCL is much like CUDA, but slightly more cumbersome to work with.

• The benefit is that the same code can run on Intel CPUs, the Xeon Phi, NVIDIA GPUs, AMD GPUs, etc.

• The amount of code needed to do the exact same thing is larger in OpenCL

• OpenCL is a C API

• CUDA has C++ bindings and supports templates

• CUDA has a much better development community around NVIDIA hardware

• OpenCL has a lot of very nice tools available from AMD

Page 37:

iPython and Pyopencl

• OpenCL is a C API, which requires working in C, and possibly long compilation times

• Even the simplest OpenCL example requires a lot of boilerplate code

• Pyopencl solves this by enabling access to OpenCL through Python

• The IPython notebook gives us an interactive shell to try out OpenCL and prototype!
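A minimal pyopencl prototype of the kind that is pleasant to iterate on in a notebook. The kernel follows pyopencl's standard introductory vector-add example; since an OpenCL driver (and pyopencl/numpy) may not be installed, this sketch falls back to plain Python:

```python
KERNEL_SRC = """
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *res)
{
    int gid = get_global_id(0);
    res[gid] = a[gid] + b[gid];
}
"""

def vec_add(a, b):
    """Element-wise a + b, on an OpenCL device if one is available."""
    try:
        import numpy as np
        import pyopencl as cl
        ctx = cl.create_some_context(interactive=False)
        queue = cl.CommandQueue(ctx)
        mf = cl.mem_flags
        a_np = np.asarray(a, dtype=np.float32)
        b_np = np.asarray(b, dtype=np.float32)
        a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
        b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)
        r_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)
        prg = cl.Program(ctx, KERNEL_SRC).build()
        prg.vec_add(queue, a_np.shape, None, a_g, b_g, r_g)
        res = np.empty_like(a_np)
        cl.enqueue_copy(queue, res, r_g)
        return res.tolist()
    except Exception:
        # No OpenCL platform/driver (or pyopencl) available: CPU fallback
        return [x + y for x, y in zip(a, b)]
```

Note how much of the function is boilerplate (context, queue, buffers, copy-back) compared to the one-line kernel body, which is exactly the point of wrapping it once and prototyping interactively.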

Page 38:

Demo of iPython and Pyopencl

If time permits

Page 39:

The caveat

• Installing can be slightly cumbersome

• You need an OpenCL driver, which complicates things

• If you're on Ubuntu and have an NVIDIA graphics card, you're lucky:

  sudo apt-get install ipython ipython-notebook python-pyopencl

  and you're good to go

• If you're on any other platform or hardware, you need to do a bit of manual labour

• But it still shouldn't take longer than half an hour.

Page 40:

Summary

Page 41:

Summary

• We need to consider parallelism in "everything" we do on computers

• We cannot afford to use 1% of the true potential

• GPU computing used to be a pain, but no more

• Even though the GPGPU crew said it was easy to get started

• Getting started today is easier than ever

• Getting good performance still requires knowledge of the architecture

• The success of GPUs has been unprecedented

• Their widespread availability, low cost, and "easy" programming model have made them a success where other parallel architectures have failed

• It will be exciting to see how the Intel Xeon Phi stands the test of time

Page 42:

Thank you for your attention!

André R. Brodtkorb

Email: [email protected]

Homepage: http://babrodtk.at.ifi.uio.no/

Youtube: http://youtube.com/babrodtk/

SINTEF webpages: http://www.sintef.no/math/