GPU-based data analysis for Synthetic Aperture Microwave Imaging
1st IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis, 1st-3rd June 2015
J.C. Chorley1, K.J. Brunner1, N.A. Dipper1, S.J. Freethy4, R.M. Sharples1
V.F. Shevchenko3, D.A. Thomas2, R.G.L. Vann2
1Durham University 2University of York 3Culham Centre for Fusion Energy 4Max-Planck-Institut für Plasmaphysik
This work is funded by Durham University and EPSRC grant EP/K504178/1
Talk outline
• SAMI overview
• Motivation for GPU acceleration
• GPU code and techniques
• Acceleration results
• Summary and future work
SAMI overview
• SAMI is the Synthetic Aperture Microwave Imaging diagnostic: it reconstructs 2D thermal images of the plasma
• SAMI is a phased array: the phase on each antenna is determined by the geometry and polarisation
• If the antennas do not have perfectly aligned polarisations, there is an additional phase difference between the antennas
• The image is then the sum of the products of the antenna cross-correlations (see the schematic expression below)
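Schematically, writing $E_j$ for the complex signal on antenna $j$ and $\phi_{jk}$ for the relative phase fixed by the geometry and polarisation, the image in a direction $\hat{\mathbf{s}}$ takes the usual aperture-synthesis form (a schematic expression of the idea, not the exact SAMI formula):

$$I(\hat{\mathbf{s}}) \propto \sum_{j \neq k} \langle E_j E_k^{*} \rangle \, e^{i\phi_{jk}(\hat{\mathbf{s}})}$$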
SAMI overview
• The optimised design for SAMI, satisfying the bandwidth and space requirements, consists of 8 antennas [1] [2]
[1] S.J. Freethy et al., IEEE Trans. Antennas Propag. 60, 5442 (2012)
[2] S.J. Freethy et al., Plasma Phys. Control. Fusion 55, 124010 (2013)
SAMI overview
• SAMI is the first diagnostic of its kind: 2D maps of the Electron Bernstein Emission process and of the mode-conversion windows, useful for RF heating and current drive
• SAMI has demonstrated the feasibility of a phased-array microwave imaging system through a successful campaign on MAST, and will be installed on NSTX-U for the next campaign
• In a future reactor environment a microwave imaging diagnostic such as SAMI is essential:
  - SAMI is resilient to high-energy neutron fluxes
  - Antennas can be incorporated into the vessel wall
  - Compact design that doesn't use much wall space
S.J. Freethy et al., Plasma Phys. Control. Fusion 55, 124010 (2013)
[Figure: SAMI image from MAST shot 27022]
SAMI overview
Above: an image of the array of Vivaldi antennas in a 2D configuration
Right: the RF electronics mounted on MAST
V.F. Shevchenko et al., J. Instrum. 7, P10016 (2012)
SAMI overview
• Demanding data acquisition requirements!
• 16 frequency channels
• 14-bit sample depth (to cover the dynamic range of the plasma during ELMs)
• Sampling at 250 Msamples/s
• For a total of 500 ms (the length of a MAST shot)
• Data rate of 8 Gbytes/s
• Meaning we have 4 Gbytes of raw data from SAMI per shot (a consistency check follows below)
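As a quick consistency check on these numbers (assuming each 14-bit sample is stored in a 16-bit word, which the slide does not state):

$$16\ \text{channels} \times 250\times10^{6}\ \text{samples/s} \times 2\ \text{bytes/sample} = 8\ \text{Gbytes/s}$$

$$8\ \text{Gbytes/s} \times 0.5\ \text{s} = 4\ \text{Gbytes per shot}$$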
Motivation for GPU code
• 4 Gbytes of raw data per shot on MAST => a 12 Tbyte RAID system plus backup for the M8 and M9 campaigns
• Data volume ∝ nAnt, while computation (and resolution) ∝ nAnt(nAnt − 1): for 8 antennas, that is 8 × 7 = 56 antenna pairs
• The original IDL data analysis code takes ~30 minutes to process the data for 1 shot on an AMD Phenom II X2 560 processor
• The time between shots on MAST is ~15 minutes => no intershot analysis
• Masses of unanalysed raw data accumulating
• An accelerated GPU data processing code could cycle through the data from previous campaigns in significantly reduced time, and in future campaigns provide the ability to do intershot analysis
• Aim for real-time data analysis, as a multi-megawatt EBW current drive and heating system will require real-time aiming and interlocking diagnostics
GPU architecture
[Figure: schematic comparison of a CPU (4 large cores sharing a cache) and a GPU (many small cores sharing a cache), connected by the main system bus and the PCIe bus (8 GB/s). GPU memory: size = 6 GB, speed = 250 GB/s. System memory: size = 64 GB, speed = 40 GB/s.]
Key hardware features:
• Massive use of long vector units
• Low clock speed
• Very fast memory
• No advanced instruction processing
• Designed for massively parallel computations
SAMI suitability for GPU code
• SAMI acquires nInt data points on all 8 antennas simultaneously and switches frequency every 160 µs => data structure with shape nInt × nAnt × nf × nSweeps, where

$$n_{\mathrm{Sweeps}} = \frac{\text{shot length}}{\text{switching period}}$$
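With the numbers quoted earlier (a 500 ms shot and a 160 µs switching period) this gives:

$$n_{\mathrm{Sweeps}} = \frac{500\ \mathrm{ms}}{160\ \mathrm{\mu s}} = 3125$$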
• SIMD scenario => parallelisation with CUDA
• Each CUDA thread is mapped to 1 element of a vector unit; a full vector unit = 32 consecutive threads = a warp
• A warp is processed at once by the hardware
• On the software level, threads are grouped into thread blocks (a minimal kernel sketch follows the figure below)
[Figure: memory layout of the raw data. The samples for antennas 1-8 are interleaved so that each warp reads a contiguous, 128-byte-aligned segment (0 B, 128 B, 256 B, 384 B, ...); the pattern repeats over the nInt samples for each frequency channel nf = 1...16, and over the nSweeps sweeps.]
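As an illustration of the thread-to-data mapping, a minimal CUDA sketch; the kernel name, the int16 storage, and the launch configuration are illustrative assumptions, not the actual SAMI code:

```cuda
#include <cuda_runtime.h>

// Convert raw 14-bit ADC integers (stored in 16-bit words) to floats.
// One thread per sample: thread i handles raw[i], so a 32-thread warp
// reads one contiguous 64-byte segment of int16 data (coalesced access,
// matching the warp-aligned layout in the figure above).
__global__ void convert_raw(const short *raw, float *out, size_t nSamples)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nSamples)
        out[i] = (float)raw[i];
}

// Launch: one thread per sample, 256 threads (8 warps) per block.
// convert_raw<<<(nSamples + 255) / 256, 256>>>(d_raw, d_out, nSamples);
```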
IDL code
Flow chart of the IDL analysis pipeline:
• get_config.pro reads bootconfig.rfctrl.ini: configuration data specifying which frequencies to read in, and the length and location of the time windows
• read_freq_split.pro and read_bin_raw.pro read the noise data and the TF test shot data file: integer data, giving voltage data for each selected frequency for each frequency sweep
• filter.pro filters the data
• complexify.pro: the 16 real signals are converted to 8 complex signals for the upper and lower sidebands (a minimal sketch of this step follows this list)
• calibration inputs (iqPhaseGradient.dat, upper_lower_complex.dat, sideband_cal_values_upper.dat) correct for: phase drift between the I and Q components; phase offsets and amplitude imbalance between the I and Q components (via matrix inversion); and phase differences between antennas due to the RF electrical lengths
• gpu_correlate_model.pro: cross-correlations are calculated for each antenna pair, each frequency sweep, and the upper and lower sidebands
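For illustration, a minimal CUDA sketch of the complexify step, assuming the 16 real channels are stored as interleaved I/Q pairs; the pairing, layout, and names are assumptions, and the real code also applies the I/Q calibrations:

```cuda
#include <cuComplex.h>

// Pair up 16 real channels into 8 complex signals: channel 2k is taken
// as the in-phase (I) component and channel 2k+1 as quadrature (Q).
// 'in' holds nFrames frames of 16 floats; 'out' holds 8 complex values
// per frame. (Assumed layout, for illustration only.)
__global__ void complexify(const float *in, cuFloatComplex *out,
                           size_t nFrames)
{
    size_t f = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= nFrames) return;
    for (int k = 0; k < 8; ++k) {
        float i_comp = in[f * 16 + 2 * k];      // I component
        float q_comp = in[f * 16 + 2 * k + 1];  // Q component
        out[f * 8 + k] = make_cuFloatComplex(i_comp, q_comp);
    }
}
```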
GPU code
[Flow chart: read_bin_raw_gpu.cu copies the data to the GPU, then runs data conditioning, forward/backward CUFFT stages, filter, IQ correction, IQ_filter, sideband suppression, and RF phase calibration, calculates the cross-correlations, and copies the results from the GPU so they are available on the host]
• Wrote 14 CUDA kernels and made use of the CUFFT library (an illustrative CUFFT round trip is sketched below)
• Limited memory available on the GPU => can't copy all the data to the GPU and process it at once
• Need to carve the problem up and exploit CUDA streams and concurrency
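As an illustration of the forward/backward CUFFT stages, a minimal forward-filter-inverse round trip on complex data; the function name, plan sizes, and batching are assumptions, not SAMI's actual parameters:

```cuda
#include <cufft.h>
#include <cuComplex.h>

// Forward FFT -> (filtering would run here) -> inverse FFT, in place.
// Batched 1D transforms: one FFT of length nInt per antenna/sweep.
// Sizes are illustrative; error handling trimmed for brevity.
void fft_roundtrip(cuFloatComplex *d_signal, int nInt, int nBatch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, nInt, CUFFT_C2C, nBatch);

    // In-place forward transform into the frequency domain.
    cufftExecC2C(plan, (cufftComplex *)d_signal,
                 (cufftComplex *)d_signal, CUFFT_FORWARD);

    // ... a filter kernel would zero unwanted frequency bins here ...

    // Inverse transform back to the time domain. CUFFT leaves the
    // result unnormalised: divide by nInt to recover the original scale.
    cufftExecC2C(plan, (cufftComplex *)d_signal,
                 (cufftComplex *)d_signal, CUFFT_INVERSE);

    cufftDestroy(plan);
}
```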
CUDA streams and concurrency
• Exploit concurrency: overlap copies to the GPU with kernel execution on the GPU
• CUDA exposes concurrency through streams: a sequence of commands that execute in order (commands in different streams may run concurrently)
[Figure: timeline comparison. With a single stream, the copy up, kernel, and copy down for chunks 1-6 execute strictly one after another. With three streams, the chunks are distributed round-robin across streams 1-3, so copies and kernels from different streams overlap in time.]
CUDA streams and concurrency
• GPUs support the following forms of concurrency (a minimal streams sketch follows the figure below):
  - Overlapping copies to or from the device with kernel execution
  - Executing more than one kernel at the same time
  - Overlapping copies to the GPU with copies from the GPU
[Figure: with three streams, the copy up, kernel, and copy down for chunks 1-6 are interleaved in time, so that at any instant an upload, a kernel, and a download from different streams can be in flight simultaneously.]
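A minimal sketch of this chunked, multi-stream pattern; the process kernel, buffer names, and chunk sizes are illustrative assumptions, and copies only overlap with execution when the host buffer is pinned (allocated with cudaMallocHost):

```cuda
#include <cuda_runtime.h>

#define N_STREAMS 3

// Hypothetical kernel standing in for the real processing chain.
__global__ void process(float *chunk, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] *= 2.0f;   // placeholder computation
}

// Round-robin the chunks over three streams so that uploads, kernels,
// and downloads from different streams can overlap in time.
void run_chunked(float *h_data,   // pinned host buffer (cudaMallocHost)
                 float *d_data,   // device buffer sized for all chunks
                 size_t chunkElems, int nChunks)
{
    cudaStream_t streams[N_STREAMS];
    for (int s = 0; s < N_STREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t s = streams[c % N_STREAMS];
        float *h = h_data + (size_t)c * chunkElems;
        float *d = d_data + (size_t)c * chunkElems;
        size_t bytes = chunkElems * sizeof(float);

        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s); // copy up
        process<<<(chunkElems + 255) / 256, 256, 0, s>>>(d, chunkElems);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s); // copy down
    }

    for (int s = 0; s < N_STREAMS; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```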
Acceleration results
• Code development on a machine with:
  - Intel Xeon CPU E5-2670 0 @ 2.60 GHz
  - Tesla K40c with 12 Gbytes GDDR5
• Data for a few shots were processed on the development machine to check correctness against the IDL code
                        IDL        C          CUDA
Total time (s)          1038.38    464.55     17.42
Total time (min:s)      17:18      7:44       0:17
Speed-up                -          2.24x      59.61x (26.67x relative to C)
• Acquired a dedicated GPU card for SAMI: a GeForce GTX 770 with 4 Gbytes GDDR5
• Cycled through 1837 shots in 30 hours => an average of 58 seconds per shot
• Most of this increase is due to CPU time and reading from the hard disk
Summary and future work
• Successfully accelerated the SAMI data analysis code, enabling the processing of 12 Tbytes of raw data from previous MAST campaigns
• Ability to compare cross-correlation data from many shots
• Enables intershot analysis in future campaigns (NSTX-U, MAST-U)
• Reduce the run time of the code, aiming for real time (how the code accesses the raw data / the data shape, FPGA/GPU communication [1])
• Demonstrate the benefit of a multi-GPU system
[1] R. Bittner et al., Cluster Comput., DOI 10.1007/s10586-013-0280-9