GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT...


Page 1: GPUs in CT Reconstruction - Meetupfiles.meetup.com/1774957/HPC MeetUp - GPUs in CT Reconstruction.pdf3D Ray Tracing! x In-plane Geometry ROTATE Total output elements = rows * channels

GPUs in CT Reconstruction Logan Johnson

Page 2:

Agenda

• Introduction

• CT Essentials

• Forward Projection

• GPU Programming 101

• GPU Optimization of Forward Projector

Page 3:

This Guy

Professional

• BS Bioengineering - Clemson University (2009)

• 5 years at GE Healthcare, CT Recon

• Just started at NeuroLogica, Mobile CT Recon

• Algorithm design and optimization – CPU, GPU, and Xeon Phi architectures

– CUDA and OpenCL

Unprofessional

• Runner, writer, and digital artist

• Lover of “coffeine” and scotch

Glenfiddich distillery in Dufftown, Scotland

Page 4:

CT ESSENTIALS

Page 5:

What is CT?

Great for:

• 3D imaging

• Trauma/ER

• Cardiac

• Perfusion

• Hard tissues

• Guided surgery

Biggest drawback:

• Irradiates patient (and potential use of contrast agent)

Page 6:

What is CT really?

https://www.youtube.com/watch?v=2CWpZKuy-NE

Page 7:

CT Reconstruction in a Nutshell

SCAN → RAW PROJECTIONS → CORRECT → SINOGRAM → RECONSTRUCT (FBP or Iterative) → IMAGES

Page 8:

Filtered Back Projection

Fourier + Radon transform based algorithm

A CT scan is like a Radon transform of a patient. The goal is to apply the inverse Radon transform (FBP) to recover the anatomy.

Page 9:

Core FBP Reconstruction Math-magics

Step | Output Projection Simplified Math

Raw View | vout = raw scanner data

Calibration | vout = vin * gain + offsets

Beer’s Law | vout = -ln(vin/vref)

Filter | vout = conv1D(vin, rampFilter)

Rebinning | vout = interp2D(vin-100, vin, vin+100)

Generally, the core steps are easily parallelizable algorithms, and projections can be processed independently of one another (except rebin).
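A minimal sketch of three of the per-projection steps (calibration, Beer's Law, filter) may help make the "simplified math" concrete. The detector readings, gain, offset, and the three-tap kernel are made-up values, and the kernel is only a toy stand-in for a real ramp filter. Rebinning is left out because it needs neighbouring views.

```python
import math

def calibrate(v_in, gain, offset):
    # vout = vin * gain + offsets
    return [v * gain + offset for v in v_in]

def beers_law(v_in, v_ref):
    # vout = -ln(vin / vref)
    return [-math.log(v / v_ref) for v in v_in]

def conv1d_same(v_in, kernel):
    """Zero-padded 1D convolution, output the same length as v_in."""
    half = len(kernel) // 2
    out = []
    for i in range(len(v_in)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + half - k
            if 0 <= j < len(v_in):
                acc += v_in[j] * w
        out.append(acc)
    return out

raw = [900.0, 500.0, 250.0, 500.0, 900.0]           # hypothetical detector readings
line = beers_law(calibrate(raw, gain=1.0, offset=100.0), v_ref=1000.0)
filtered = conv1d_same(line, [-0.25, 0.5, -0.25])   # toy kernel, not a real ramp filter
```

Each function touches only one projection, which is why these steps map so easily onto independent threads.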

Page 10:

Back Projection

Final step is Back Projection, which is also easily parallelizable but requires many projections.

Page 11:

More Reasons for using GPU

Reasons:

• Off-the-shelf technology = cost savings

• Much better performance than x86/64

• Easier to program/develop than FPGA

• Floating point performance > FPGA

Drawbacks:

• Short GPU life cycle = more cost in V&V, inventory

Full-body scan of 6’ patient ready in < 5 minutes

Page 12:

Iterative Reconstruction

Improvements in HPC technology enable more sophisticated reconstruction algorithms


GE Veo Model Based Iterative Reconstruction (MBIR) on BladeCenter

Page 13:

Iterative Reconstruction

GE Veo, Siemens IRIS, Siemens SAFIRE, Philips iDose

Algorithms are generally much more complex than FBP, and therefore slower.

Page 14:

Iterative Reconstruction


You get what you compute for.

Page 15:

Iterative Reconstruction


Next big challenge in CT imaging for GPUs – Veo quality at SAFIRE/iDose speeds

Page 16:

FORWARD PROJECTION

Page 17:

What is Forward Projection?

SCAN → CORRECTED PROJECTIONS

Forward Project → RE-PROJECTIONS

Forward projection is like simulating a CT scan. The inputs to this simulation are CT images. The reprojections should be similar to the original corrected projections from which those input images were reconstructed.

Page 18:

What is Forward Projection?

Page 19:

Modeling X-Ray Transmission

Real System

Intensity, 𝜆, decreases as the beam passes through the object (from 𝜆_in to 𝜆_out):

$$-\ln\frac{\lambda_{out}}{\lambda_{in}} = \sum_i a_i l_i$$

FP with CT Image Input

Sum of attenuations, Σ, increases as the ray passes through the image (from Σ_in to Σ_out):

$$\Sigma_{out} = \sum_i a_i l_i$$

Beer (-Lambert)’s Law!

Page 20:

Modeling X-Ray Transmission

$$\Sigma_{out} = \sum_i a_i l_i$$

Summing attenuation values

For each row, compute the attenuation by interpolating between the pixels where the ray intersects that row. Add these to an accumulator, then multiply the result by the geometric scaling factor, $l_i$, once, since this value is constant for all rows for this particular ray.

Example pixel rows: (3, 5) and (5, 1)

$$a_n = 5 \cdot 0.5 + 1 \cdot 0.5 = 3$$

$$a_{n+1} = 3 \cdot 0.2 + 5 \cdot 0.8 = 4.6$$

𝑙𝑖 = … / sin(𝜃), where 𝜃 is the ray angle
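The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: `interp_row` is a hypothetical helper, the sampling positions 0.5 and 0.8 are read off the weights on the slide, and the scaling factor assumes unit row spacing (so that it becomes 1 / sin 𝜃) with an arbitrary 60° ray angle.

```python
import math

def interp_row(row, x):
    """Linearly interpolate a pixel row at fractional column x (x >= 0)."""
    i = int(x)                                  # left neighbour
    w = x - i                                   # weight of the right neighbour
    right = row[min(i + 1, len(row) - 1)]
    return row[i] * (1.0 - w) + right * w

row_n = [5.0, 1.0]                 # pixel pair used for a_n on the slide
row_n1 = [3.0, 5.0]                # pixel pair used for a_(n+1) on the slide

a_n = interp_row(row_n, 0.5)       # 5 * .5 + 1 * .5
a_n1 = interp_row(row_n1, 0.8)     # 3 * .2 + 5 * .8

theta = math.radians(60.0)         # assumed ray angle
l = 1.0 / math.sin(theta)          # assumed unit row spacing
total = (a_n + a_n1) * l           # accumulate first, scale once
```

Scaling once after the loop, rather than per row, is the same factorization the deck later uses to remove per-sample work from the inner loop.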

Page 21:

Modeling X-Ray Transmission

“Walking” across just rows

“Walking” across rows OR columns

Choose between the sampling patterns with “if |cos(𝜃)| > |sin(𝜃)|”, where 𝜃 is the ray angle.

Just two samples?

That’s more like it.
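To see why "just two samples" can happen, here is a rough sketch that counts how many rows or columns a ray through the centre of an n × n image crosses. It assumes 𝜃 is measured from the image row direction, so that a shallow ray should walk across columns; the slide does not state which branch of the test maps to which pattern, so treat that mapping as an assumption.

```python
import math

def samples_walking_rows(theta, n):
    """Approximate number of rows crossed by a central ray in an n x n image."""
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    return float(n) if s >= c else n * s / c

def samples_walking_columns(theta, n):
    """Approximate number of columns crossed by the same ray."""
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    return float(n) if c >= s else n * c / s

def choose_walk(theta):
    # The test from the slide: |cos| > |sin| means the ray is closer to horizontal.
    return "columns" if abs(math.cos(theta)) > abs(math.sin(theta)) else "rows"

shallow = math.radians(0.2)
rows_hit = samples_walking_rows(shallow, 512)        # roughly 2 samples
cols_hit = samples_walking_columns(shallow, 512)     # all 512 samples
```

For a nearly horizontal ray, walking rows gives only about two samples across a 512-wide image, while walking columns gives one per column.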

Page 22:

Modeling a CT Scanner

[Figure: CT scanner geometry. Labels: X-Ray Source (Tube); CT Detector; source to isocenter; source to detector; detector channels; detector rows; detector channel radial width; detector row width.]

The X-ray source and detector rotate around the isocenter. Detector channels are equiangularly spaced with respect to the source. Rows are all the same width.

Page 23:

Modeling a CT Scanner

One rotation, 21 equally spaced views (not a realistic scan)

[Figure: View 0, View 1, View 2, ..., View 18, View 19, View 20 around the gantry; angle axis from -180° to 180°.]

Two key parameters: views (exposures) per rotation and rotation speed.

Page 24:

Ray Driven Cone Beam Forward Projection

One rotation per N seconds, M equally spaced views. Want to compute the projection for each ray at each view location.

3D Ray Tracing!

[Figure: in-plane geometry (x, y) showing views 0, 1, 2, ..., M-2, M-1, M as the gantry rotates, with rays fanning out along the ray channel direction and walking across image rows or image columns depending on the view; out-of-plane geometry (-z to +z) with rays fanning out along the ray row direction.]

Total output elements = rows * channels * views
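A small sketch of the bookkeeping behind this slide: where each view sits in angle and in z, and how many outputs the projector must produce. It assumes a constant table speed and a starting angle of zero. The numbers 800 views per rotation, 1 rotation per second, 32 mm/s, and 32 rows come from the experimental setup later in the deck, while the channel count of 1000 is purely hypothetical.

```python
import math

def view_geometry(k, views_per_rotation, seconds_per_rotation,
                  table_speed_mm_s, z0_mm=0.0):
    """Return (gantry angle in radians, z position in mm) of view k."""
    t = k * seconds_per_rotation / views_per_rotation   # acquisition time of view k
    angle = 2.0 * math.pi * k / views_per_rotation
    z = z0_mm + table_speed_mm_s * t
    return angle, z

def total_output_elements(rows, channels, views):
    return rows * channels * views

half_turn = view_geometry(400, 800, 1.0, 32.0)      # about (pi, 16 mm)
outputs = total_output_elements(32, 1000, 800)      # 1000 channels is a made-up value
```

Every one of those output elements is an independent line integral, which is what makes the problem such a natural fit for one thread per output.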

Page 25:

GPU PROGRAMMING 101

Page 26:

Programming GPUs

CUDA

• Compute Unified Device Architecture

• NVidia proprietary

• GPU only

• Block size and grid size

OpenCL

• Open Computing Language

• Khronos Group open standard

• AMD, NVidia, Intel, Altera, Xilinx

• GPU, CPU, Phi, FPGA, others (?)

• Global work size and work group size

Very similar paradigms, and both are C/C++ APIs. Comparing CUDA to OpenCL is like comparing Java to C++.

Page 27:

CUDA Programming Model

Key concepts:

• SIMT – Single Instruction Multiple Threads

• 32 threads / warp

• Threads are grouped into blocks

• One warp’s worth of threads is executed in parallel per compute unit.

• Each warp executes the same instruction at the same time – lock-step execution

• Branch divergence occurs when threads within a warp choose different logical paths
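Lock-step execution and divergence can be illustrated with a toy model. This sketch simply counts one serialized pass per distinct branch taken inside each group of 32 threads. It is a simplification of what the hardware does, not a performance model.

```python
WARP_SIZE = 32

def warp_passes(branch_per_thread):
    """Count serialized passes: one per distinct branch taken inside each warp."""
    passes = 0
    for start in range(0, len(branch_per_thread), WARP_SIZE):
        warp = branch_per_thread[start:start + WARP_SIZE]
        passes += len(set(warp))
    return passes

uniform = ["rows"] * 64                 # both warps agree internally
mixed = ["rows", "columns"] * 32        # every warp contains both paths
grouped = sorted(mixed)                 # same work, but each warp takes one path

print(warp_passes(uniform), warp_passes(mixed), warp_passes(grouped))
```

The mixed case needs twice as many passes as the grouped case even though the threads do the same total work, which is the cost the deck tries to avoid later by keeping loops out of `if` blocks.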

Page 28:

Architecture

NVidia Maxwell (GM204)

32 cores/SM for 1 warp

Page 29:

Architecture

NVidia Maxwell (GM204)

32 cores/SM for 1 warp

398 mm²

Page 30:

Memory Architecture

Access latency (figure annotations): 1 cycle; 1 to 32 cycles; 400 to 600 cycles

Avoid global memory accesses; try to use shared memory.

Page 31:

Performance Optimization

Tools: NVidia NVVP, AMD CodeXL

Knowledge: GPU Gems, AMD/NVidia Programming Guides

Experience

Creativity (borderline madness)

Page 32:

GPU OPTIMIZATION OF FORWARD PROJECTOR

Page 33:

Introduction

Page 34:

Experimental Setup

System Configuration

• CUDA 6.5

• NVidia K20m

• Visual Studio 2012

Projector Configuration

• Joseph et al. 1982 Projection Model

• 32 rows

• 800 views per rotation

• 1 rotation per second

• 32 mm/s movement in Z

• RTK 12 CPU Reference – 1473 seconds

CT Scan Case

• NIH-NLM Visible Human Body Project, Frozen Female

• 512 (x) × 512 (y) × 1784 (z) image matrix size

Reconstruction Toolkit (RTK) by Creatis, MGH, et al. also contains an excellent example of this algorithm in CUDA.

Page 35:

Performance Goal

Acquires 1 rotation per second

So…

Processing at least 1 rotation per second will ensure FP is not the pipeline bottleneck.

Have a performance goal before you begin designing! (Even if it’s roughly 1400x)

Page 36:

Naïve Implementation

Priorities:

• Needs to produce correct results

• Write GPU friendly code

• Avoid big if conditions

Output-driven parallelism (one thread / output element): threads t0, t1, t2, t3 each produce one output element d0, d1, d2, d3.
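With one thread per output element, each thread only needs to turn its flat index into detector coordinates. The sketch below shows one way to do that. The (view, row, channel) layout with channel varying fastest is an assumption, since the transcript does not show the indexing the actual kernel used.

```python
def output_coords(tid, rows, channels):
    """Split a flat thread index into (view, row, channel), channel fastest."""
    channel = tid % channels
    row = (tid // channels) % rows
    view = tid // (channels * rows)
    return view, row, channel

# Hypothetical detector with 32 rows and 100 channels
tid = 3 * (32 * 100) + 5 * 100 + 7
print(output_coords(tid, rows=32, channels=100))
```

Because no two threads write the same element, no synchronization or atomics are needed on the output.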

Page 37:

Kernel Source Code

[Code screenshot. Annotations:]

• The Inner Loop – executed at most 512 times!

• Z-coordinate computation

• Trilinear interpolation

• Weight, write final result to global memory once

• Somewhat redundant, but good for prototyping
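The kernel source itself did not survive in this transcript, so the following is only an illustrative sketch of the trilinear interpolation that the inner loop performs, written in Python on a nested-list volume. The `vol[z][y][x]` layout and the helper names are assumptions.

```python
def lerp(a, b, t):
    return a + (b - a) * t

def trilinear(vol, x, y, z):
    """Sample vol[z][y][x] at a fractional position strictly inside the volume."""
    x0, y0, z0 = int(x), int(y), int(z)
    fx, fy, fz = x - x0, y - y0, z - z0

    def plane(zi):
        # Bilinear interpolation inside one z slice
        near = lerp(vol[zi][y0][x0], vol[zi][y0][x0 + 1], fx)
        far = lerp(vol[zi][y0 + 1][x0], vol[zi][y0 + 1][x0 + 1], fx)
        return lerp(near, far, fy)

    return lerp(plane(z0), plane(z0 + 1), fz)

# 2 x 2 x 2 test volume whose value is x + 10*y + 100*z
vol = [[[x + 10 * y + 100 * z for x in range(2)] for y in range(2)] for z in range(2)]
centre = trilinear(vol, 0.5, 0.5, 0.5)
```

Each sample reads eight voxels and performs seven linear interpolations, so with up to 512 samples per ray this is where most of the time goes.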

Page 38:

Kernel Source Code

[Code screenshot. Annotations:]

• Determine if walking across rows or columns

• Compute ray change in x and y accordingly

• Compute line integral weighting

Projection loops don’t need to be inside the if condition:

1. Avoids unnecessary and costly warp divergence

2. Eliminates duplicate code
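One way to picture this structure is a 2D sketch in Python (not the original kernel): the branch only selects the step (dx, dy) and the line-integral weight, and a single shared loop does the walking. The function names and the unit pixel spacing are assumptions.

```python
import math

def walk_steps(theta):
    """Return (dx, dy, weight) for unit steps along the dominant axis."""
    c, s = math.cos(theta), math.sin(theta)
    if abs(c) > abs(s):              # mostly horizontal: one column per step
        dx, dy = 1.0, s / c
        weight = 1.0 / abs(c)
    else:                            # mostly vertical: one row per step
        dx, dy = c / s, 1.0
        weight = 1.0 / abs(s)
    return dx, dy, weight

def line_integral(sample, x, y, theta, n_steps):
    dx, dy, weight = walk_steps(theta)   # the only branch
    total = 0.0
    for _ in range(n_steps):             # one loop shared by both cases
        total += sample(x, y)
        x += dx
        y += dy
    return total * weight                # scale once, outside the loop

flat = line_integral(lambda x, y: 1.0, 0.0, 0.0, 0.0, 10)
steep = line_integral(lambda x, y: 1.0, 0.0, 0.0, math.radians(60.0), 10)
```

Threads in a warp that disagree about rows versus columns diverge only for the few instructions inside the `if`, not for the whole loop.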

Page 39:

Results of Naïve Implementation

One rotation of data in a blazing 433 seconds!!!

(For this much of the anatomy)

Page 40:

Performance Profiling

NVidia Visual Profiler

Very basic profiling on a 7 minute application took overnight to complete. Try running a smaller but representative case.

Page 41:

Performance Profiling

Complete profiling took 10 minutes for 16 of 800 views. Overflow issues still persist, but there is sufficient information to begin optimizing.

Page 42:

Guided Performance Analysis

Helpful tool to run the most relevant profiling experiments for your kernel. Took five minutes.

Page 43:

Register Usage

Registers/thread mostly driven by number of variables in kernel

Executive summary on kernel performance

Page 44:

Register Usage

This function does ~30 loads and hits peak register usage.

nvdisasm gives some insight into what is using up all the registers.

Page 45:

Register Usage

~30 elements

Perhaps this huge structure is causing a lot of register spilling in the innermost loop?

Page 46:

Register Usage

Remove covertImageCoordinatesToSpace from inner loop with algebraic factorization

Before: 433 s/rotation, 67 registers

After: 27.8 s/rotation, 65 registers

Yet we still need 60+ registers.
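The idea behind the algebraic factorization can be sketched in one dimension: if the coordinate conversion is affine, converting the start point once and adding a constant step gives the same positions as converting every sample. `image_to_space` below is a hypothetical stand-in, not the function from the talk, which works on 3D coordinates and a roughly 30-element structure.

```python
def image_to_space(i, origin, spacing):
    """Hypothetical stand-in for the per-sample coordinate conversion."""
    return origin + spacing * i

def walk_naive(i0, di, n, origin, spacing):
    # Converts every sample: n calls inside the loop
    return [image_to_space(i0 + k * di, origin, spacing) for k in range(n)]

def walk_factored(i0, di, n, origin, spacing):
    pos = image_to_space(i0, origin, spacing)   # converted once per ray
    step = spacing * di                         # constant per ray
    out = []
    for _ in range(n):
        out.append(pos)
        pos += step                             # one add per sample
    return out

same = walk_naive(3.0, 0.25, 5, -128.0, 0.5) == walk_factored(3.0, 0.25, 5, -128.0, 0.5)
```

With general floating-point values the incremental form can drift by a few rounding errors over a long walk, so it is worth comparing outputs after a change like this.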

Page 47:

Register Usage

Since we’re optimizing the inner loop….

27.8 s/rotation 65 registers

16.2 s/rotation 74 registers

Simplified calculations and introduced pitched memory (more on this later). What else changed that could have driven up register usage?

Page 48:

Register Usage

Went from a pointer to a struct

Passing big structs by value, not by reference, to CUDA kernel is apparently a bad idea.

27.8 s/rotation, 65 registers

16.2 s/rotation, 74 registers

12.5 s/rotation, 48 registers

Page 49:

Occupancy

12.5 s/rotation → 10.3 s/rotation

Block sizes compared: 64 threads/block and 128 threads/block

Changing block size (for this algorithm) is simple and can quickly yield improvements in device utilization. Using shared memory might make such tweaks more challenging.

Page 50:

Occupancy

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

What can we do to further improve on 63% occupancy?

(Occupancy = active warps per SM / maximum warps per SM * 100%)

But is it worth it?
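A simplified occupancy estimate can be sketched as follows. The default limits (65536 registers per SM, 64 warps per SM, 16 blocks per SM, registers allocated per warp in units of 256, warps allocated in groups of 4) are my assumptions for a Kepler-class device such as the K20m; shared memory limits are ignored. Treat the official occupancy calculator as the authority.

```python
def occupancy(regs_per_thread, threads_per_block,
              regs_per_sm=65536, max_warps_per_sm=64, max_blocks_per_sm=16,
              warp_size=32, reg_alloc_unit=256, warp_alloc_granularity=4):
    """Fraction of the SM's warp slots that can be resident at once."""
    warps_per_block = -(-threads_per_block // warp_size)      # ceiling division
    # Registers are assumed to be allocated per warp, rounded up to the unit
    regs_per_warp = -(-regs_per_thread * warp_size // reg_alloc_unit) * reg_alloc_unit
    warps_by_regs = regs_per_sm // regs_per_warp
    warps_by_regs -= warps_by_regs % warp_alloc_granularity
    blocks = min(max_blocks_per_sm,
                 max_warps_per_sm // warps_per_block,
                 warps_by_regs // warps_per_block)
    return blocks * warps_per_block / max_warps_per_sm

print(occupancy(48, 128))   # register-limited
print(occupancy(32, 128))   # no longer register-limited
```

Under these assumptions 48 registers per thread at 128 threads per block gives 62.5%, which would round to the 63% quoted here, although the slide does not say which configuration that figure came from. At 32 registers per thread the estimate reaches 100%.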

Page 51:

Removing Expensive Instructions

IPC = Instructions Per Clock. Higher is faster.

Expected: from the CUDA Programming Guide. Measured: from my laptop, Quadro K4100M (3.0).

Operation | Expected IPC | Measured IPC

float32 add/mply | 6 | 5.26

float32 divide | ? | 3.41

float32 rsqrtf() | 1 | 1.2

float32 1.0f/rsqrtf() | ? | 1.1

float32 sqrtf() | ? | 1.08

int32 add | 5 | 4.01

int32 mply | 1 | 1.09

Before: 10.3 s/rotation, 48 registers per thread

After: 9.00 s/rotation, 39 registers per thread

Simple factorization removed 512 sqrt computations / thread, some less expensive multiplications, and some variables

Page 52:

Assessing Our Progress

Forward Projector Performance (time per rotation [s], log scale)

Naïve | 433

removed struct from loop | 27.78

removed clamping | 16.2

struct pointer | 12.5

block size optimization | 10.3

sqrt removal | 8.995

About 50x faster, but still need another 10x
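The "50x" and "another 10x" figures follow directly from the timings in the chart. A short script makes the arithmetic explicit; the only value not taken from the chart is the 1 second per rotation target from the Performance Goal slide.

```python
stages = [
    ("naive", 433.0),
    ("removed struct from loop", 27.78),
    ("removed clamping", 16.2),
    ("struct pointer", 12.5),
    ("block size optimization", 10.3),
    ("sqrt removal", 8.995),
]
target = 1.0  # seconds per rotation

baseline = stages[0][1]
for name, seconds in stages:
    print(f"{name:26s} {seconds:8.3f} s  {baseline / seconds:6.1f}x vs naive")

achieved = baseline / stages[-1][1]    # speedup so far, about 48x
remaining = stages[-1][1] / target     # still needed, about 9x
```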

Page 53:

First Pass Optimization

This might be a good point to profile an entire rotation

Where we started 433 seconds / rotation

Where we arrived 9 seconds / rotation

Page 54:

High-level Profile for Full Rotation

16 Projections (.5 s/rotation)

800 Projections (9 s/rotation)

The 16 projection experiment isn’t representative of the full experiment. Why the 18x difference?

Page 55:

What’s different?

One rotation per N seconds, M equally spaced views. Want to compute the projection for each ray at each view location.

[Figure: ray-driven cone beam geometry. In-plane (x, y): views 0, 1, 2, ..., M-2, M-1, M as the gantry rotates, with rays walking across image rows or image columns depending on the view. Out-of-plane: -z to +z.]

Processing more views means moving further in Z and changing rotation angles.

Page 56:

A Little Design of Experiment

[Plots: “Adjusting Total Number of Views”. Execution time [ms] and normalized execution time [ms], each plotted with % load efficiency against the number of views (0 to 500).]

If table position and gantry angle are held constant, the number of views has an expected linear impact on performance.

Page 57:

A Little Design of Experiment

[Plots: “Adjusting First View Location”, execution time [ms] and % load efficiency against first view location [mm] (0 to 100); “Adjusting Initial View Angle”, execution time [ms] and % load efficiency against first view angle [radians] (0 to 8).]

Adjusting table position or gantry angle with a fixed number of views causes performance loss. Why?

Page 58:

First View Location

[Plot: “Adjusting First View Location”, execution time [ms] and % load efficiency against first view location [mm] (0 to 100), with first-view positions 1 to 6 marked on the plot and on the patient image.]

The original 16 view test case (at position 1) wasn’t projecting much – many of its rays were completely outside of the image volume.

Positions 4-6 are more representative of actual performance.

Page 59:

Initial Rotation Angle

[Plot: “Adjusting Initial View Angle”, execution time [ms] and % load efficiency against first view angle [radians] (0 to 8).]

Load efficiency and execution time vary drastically with rotation angle. nvvp suggests that we check whether our memory accesses are coalesced.

Page 60:

Memory Coalescing 101

Image memory layout (element indices):

0 1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31 32 33 34
35 36 37 38 39 40 41
42 43 44 45 46 47 48

For each row (row 0, row 1, row 2, row 3, ...), each thread will in parallel project a pixel.

• Threads will project each row in parallel

• For row 2, the threads will collectively need to read memory elements 15, 16, 17, 18, and 19 at the same time.

• Since these elements are adjacent, the access is said to be coalesced.

• How coalesced depends on alignment, the total number of bytes read, etc.

• Best case, these elements can be read in one transaction

Page 61:

Memory Not Coalescing 101

Image memory layout (element indices):

0 1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31 32 33 34
35 36 37 38 39 40 41
42 43 44 45 46 47 48

For each column (column 3, column 4, column 5, column 6, ...), each thread will in parallel project a pixel.

• Threads will project each column in parallel

• For column 4, the threads will collectively need to read memory elements 11, 18, 25, 32, and 39 at the same time.

• Since these elements are NOT adjacent, the accesses are likely not coalesced.

• How not coalesced depends on alignment, how far apart the elements are, etc.

• Worst case, these elements will be read in five transactions

The projector rotates 360 degrees, so our accesses will have periodically bad efficiency!
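The two access patterns can be compared by counting how many aligned memory segments one parallel read touches. This is a toy model: the segment sizes are hypothetical (4 elements for the 7 × 7 grid, 32 floats for a more realistic 512-wide image), and real hardware rules depend on the architecture and cache configuration.

```python
def transactions(indices, elements_per_segment):
    """Count distinct aligned memory segments touched by one parallel access."""
    return len({i // elements_per_segment for i in indices})

WIDTH = 7
row_access = [2 * WIDTH + c for c in range(1, 6)]       # 15, 16, 17, 18, 19
column_access = [r * WIDTH + 4 for r in range(1, 6)]    # 11, 18, 25, 32, 39

print(transactions(row_access, 4), transactions(column_access, 4))

# A more realistic case: 32 threads, 512-wide float image, 32-float segments
wide_row = [3 * 512 + 32 + t for t in range(32)]        # aligned run of 32 neighbours
wide_column = [t * 512 + 32 for t in range(32)]         # stride of one full row
print(transactions(wide_row, 32), transactions(wide_column, 32))
```

In the toy grid the row read spans two segments only because it starts at a misaligned element, while the column read needs one segment per thread. In the wider case the gap is 1 transaction versus 32.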

Page 62:

Some Thoughts on Design of Experiments

[Plots repeated: “Adjusting First View Location” and “Adjusting Initial View Angle”, execution time [ms] and % load efficiency.]

Make sure to test all key variables while optimizing to save on embarrassment later on.

But what are we going to do about that coalescing problem?

Page 63:

Revisiting the sampling problem

“Walking” across just rows

“Walking” across rows OR columns

Choose between the sampling patterns with “if |cos(𝜃)| > |sin(𝜃)|”, where 𝜃 is the ray angle.

Just two samples?

That’s more like it.

Page 64:

Transposed Matrix

Instead of walking across columns, walk across rows of a transposed image
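A small sketch of why the transposed copy helps, using a 7 × 7 image stored row-major whose pixel values equal their own linear addresses: the strided column read of the original becomes a unit-stride row read of the transpose. The helper names are illustrative only.

```python
def transpose(image):
    return [list(col) for col in zip(*image)]

WIDTH = 7
# Row-major image whose pixel values equal their own linear addresses
image = [[r * WIDTH + c for c in range(WIDTH)] for r in range(WIDTH)]
transposed = transpose(image)

def flat_index(r, c, width=WIDTH):
    return r * width + c

# Column 4, rows 1..5 of the original == row 4, columns 1..5 of the transpose
original_addresses = [flat_index(r, 4) for r in range(1, 6)]      # stride 7
transposed_addresses = [flat_index(4, r) for r in range(1, 6)]    # stride 1
values_match = ([image[r][4] for r in range(1, 6)]
                == [transposed[4][r] for r in range(1, 6)])
```

The cost is a second copy of the image in device memory, which is a trade the deck accepts for coalesced reads at every gantry angle.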

Page 65:

Improvement using transposed matrix

Was: 9 seconds/rotation

Now: 2.97 seconds/rotation

32 registers – disabled debugging features, now 100% occupancy

Another way to deal with overflow problems is to break up the whole experiment into parts!

Page 66:

Tweaking Block Size

Was: 2.97 seconds/rotation

Now: 1.8 seconds/rotation

32 registers – disabled debugging features

Changed from [16, 8, 1] to [16, 1, 8]

Page 67:

Taking Tally

Forward Projector Performance (time per rotation [s], log scale)

Naïve | 433

removed struct from loop | 27.78

removed clamping | 16.2

struct pointer | 12.5

block size optimization | 10.3

sqrt removal | 8.995

transposed matrix | 2.97

block size optimization | 1.8

Let’s fix the first view location for the original benchmark

Page 68:

Taking Correct Tally

Forward Projector Performance (time per rotation [s], log scale)

Naïve | 733

removed struct from loop | 40.5

removed clamping | 22.5

struct pointer | 16.4

sqrt removal + block size opt. | 15.7

Transposed Matrix | 3.34

block size optimization | 2.02

Off by 2x. What next?

Note on performance linearity:

• 16 views -> 2.1 s/rotation

• 800 views -> 2.0 s/rotation

Page 69:

Guided Profile Analysis

nvvp says latency is the bottleneck

Page 70:

Guided Profile Analysis

Now nvvp is telling us that occupancy is the bottleneck

Page 71:

Guided Profile Analysis

The profiler is giving us the runaround. Guess it doesn’t know how to improve performance.

Page 72:

Unguided Profile Analysis

The innermost loop is essentially 3D interpolation. What can be done to accelerate these computations?

Page 73:

Texture Memory

Hardware-accelerated 8-bit 2D/3D interpolation

Morton-ordering-like schemes are used in texture hardware
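For readers who have not met Morton ordering, here is a minimal 2D sketch that interleaves the bits of x and y into one index. The actual layout used by texture hardware is not described in this deck, so this only illustrates the general idea of keeping 2D neighbours close together in memory.

```python
def morton2d(x, y, bits=16):
    """Interleave the bits of x and y (x in the even bit positions)."""
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code

# A 2 x 2 tile occupies four consecutive addresses
tile = [morton2d(x, y) for y in range(2) for x in range(2)]

# Vertical neighbours in a 512-wide image: 512 apart row-major, 2 apart here
row_major_gap = (1 * 512 + 0) - (0 * 512 + 0)
morton_gap = morton2d(0, 1) - morton2d(0, 0)
```

With a layout like this, rays walking across rows and rays walking across columns both see reasonable locality, which is why the direction-dependent efficiency problem matters less with textures.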

Page 74:

Texture Memory

Texture hardware handles both interpolation computations and boundary checking.
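What the texture unit does for us can be emulated in one dimension. The sketch assumes clamp addressing, texel centres at i + 0.5, and interpolation weights rounded to 8 fractional bits; that weight format is my reading of the CUDA Programming Guide's description of linear filtering, so treat it as an assumption rather than a specification.

```python
import math

def clamp(i, n):
    return max(0, min(n - 1, i))

def tex1d_linear(data, x, frac_bits=8):
    """Emulate a clamped, linearly filtered 1D fetch with quantized weights."""
    xb = x - 0.5                            # texel centres sit at i + 0.5
    i = math.floor(xb)
    scale = 1 << frac_bits
    w = round((xb - i) * scale) / scale     # weight limited to multiples of 1/256
    a = data[clamp(i, len(data))]           # out-of-range reads clamp to the edge
    b = data[clamp(i + 1, len(data))]
    return (1.0 - w) * a + w * b

data = [10.0, 20.0, 40.0]
inside = tex1d_linear(data, 1.0)        # halfway between 10 and 20
outside = tex1d_linear(data, -3.0)      # clamps to the first texel
quantized = tex1d_linear(data, 1.001)   # weight snaps back to 0.5
```

The quantized weights mean results will not match a full-precision software interpolation bit for bit, which is one reason to verify the outputs against the naive version within a tolerance.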

Page 75:

Improvement using textures

0.475 seconds/rotation, 0.429 seconds/rotation with another block size tweak, and 0.64 seconds/rotation including transfer times.

VICTORY

Page 76:

So Sweet

Forward Projector Performance (time per rotation [s], log scale)

Naïve | 733

removed struct from loop | 40.5

removed clamping | 22.5

struct pointer | 16.4

sqrt removal + block size opt. | 15.7

Transposed Matrix | 3.34

block size optimization | 2.02

image textures | 0.64

Page 77:

Verify Outputs

Difference between original and fully optimized sinogram output (reformatted views)

Same results as naïve within +/- 0.7%, but in .6 s instead of 733 seconds. Also ~2800x faster than “reference” CPU implementation! (I think something is wrong with it)

Page 78:

Further GPU Optimization Reading

• Asynchronous compute and transfer

• Shared memory

• Multiple GPUs
