GPU-based data analysis for Synthetic Aperture Microwave Imaging
1st IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis, 1st-3rd June 2015
J.C. Chorley1, K.J. Brunner1, N.A. Dipper1, S.J. Freethy4, R.M. Sharples1
V.F. Shevchenko3, D.A. Thomas2, R.G.L. Vann2
1Durham University 2University of York 3Culham Centre for Fusion Energy 4Max-Planck-Institut für Plasmaphysik
This work is funded by Durham University and EPSRC grant EP/K504178/1
Talk outline
• SAMI overview
• Motivation for GPU acceleration
• GPU code and techniques
• Acceleration results
• Summary and future work
SAMI overview
• SAMI is the Synthetic Aperture Microwave Imaging diagnostic: it reconstructs 2D thermal images of the plasma
• SAMI is a phased array: the phase on each antenna is determined by the geometry and polarisation
• If the antennas do not have perfectly aligned polarisations, there is an additional phase difference between the antennas
• The image is then the sum of the products of the antenna cross-correlations (see the schematic expression below)
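Schematically, writing $E_j$ for the complex signal on antenna $j$ and $\phi_{jk}$ for the relative phase fixed by the geometry and polarisation, the image in a direction $\hat{\mathbf{s}}$ takes the usual aperture-synthesis form (a schematic expression of the idea, not the exact SAMI formula):

$$I(\hat{\mathbf{s}}) \propto \sum_{j \neq k} \langle E_j E_k^{*} \rangle \, e^{i\phi_{jk}(\hat{\mathbf{s}})}$$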
SAMI overview
• The optimised design for SAMI, satisfying the bandwidth and space requirements, consists of 8 antennas [1] [2]
[1] S.J. Freethy et al., IEEE Trans. Antennas Propag. 60, 5442 (2012)
[2] S.J. Freethy et al., Plasma Phys. Control. Fusion 55, 124010 (2013)
SAMI overview
• SAMI is the first diagnostic of its kind: 2D maps of the Electron Bernstein Emission process and of the mode-conversion windows, useful for RF heating and current drive
• SAMI has demonstrated the feasibility of a phased-array microwave imaging system through a successful campaign on MAST, and will be installed on NSTX-U for the next campaign
• In a future reactor environment a microwave imaging diagnostic such as SAMI is essential:
  - SAMI is resilient to high-energy neutron fluxes
  - Antennas can be incorporated into the vessel wall
  - Compact design that doesn't use much wall space
S.J. Freethy et al., Plasma Phys. Control. Fusion 55, 124010 (2013)
[Figure: SAMI image from MAST shot 27022]
SAMI overview
Above: an image of the array of Vivaldi antennas in a 2D configuration
Right: the RF electronics mounted on MAST
V.F. Shevchenko et al., J. Instrum. 7, P10016 (2012)
SAMI overview
• Demanding data acquisition requirements!
• 16 frequency channels
• 14-bit sample depth (to cover the dynamic range of the plasma during ELMs)
• Sampling at 250 Msamples/s
• For a total of 500 ms (the length of a MAST shot)
• Data rate of 8 Gbytes/s
• Meaning we have 4 Gbytes of raw data from SAMI per shot (a consistency check follows below)
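As a quick consistency check on these numbers (assuming each 14-bit sample is stored in a 16-bit word, which the slide does not state):

$$16\ \text{channels} \times 250\times10^{6}\ \text{samples/s} \times 2\ \text{bytes/sample} = 8\ \text{Gbytes/s}$$

$$8\ \text{Gbytes/s} \times 0.5\ \text{s} = 4\ \text{Gbytes per shot}$$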
Motivation for GPU code
• 4 Gbytes of raw data per shot on MAST => a 12 Tbyte RAID system plus backup for the M8 and M9 campaigns
• Data volume ∝ nAnt, while computation (and resolution) ∝ nAnt(nAnt − 1): for 8 antennas, that is 8 × 7 = 56 antenna pairs
• The original IDL data analysis code takes ~30 minutes to process the data for 1 shot on an AMD Phenom II X2 560 processor
• The time between shots on MAST is ~15 minutes => no intershot analysis
• Masses of unanalysed raw data accumulating
• An accelerated GPU data processing code could cycle through the data from previous campaigns in significantly reduced time, and in future campaigns provide the ability to do intershot analysis
• Aim for real-time data analysis, as a multi-megawatt EBW current drive and heating system will require real-time aiming and interlocking diagnostics
GPU architecture
[Figure: schematic comparison of a CPU (4 large cores sharing a cache) and a GPU (many small cores sharing a cache), connected by the main system bus and the PCIe bus (8 GB/s). GPU memory: size = 6 GB, speed = 250 GB/s. System memory: size = 64 GB, speed = 40 GB/s.]
Key hardware features:
• Massive use of long vector units
• Low clock speed
• Very fast memory
• No advanced instruction processing
• Designed for massively parallel computations
SAMI suitability for GPU code
• SAMI acquires nInt data points on all 8 antennas simultaneously and switches frequency every 160 µs => data structure with shape nInt × nAnt × nf × nSweeps, where

$$n_{\mathrm{Sweeps}} = \frac{\text{shot length}}{\text{switching period}}$$
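With the numbers quoted earlier (a 500 ms shot and a 160 µs switching period) this gives:

$$n_{\mathrm{Sweeps}} = \frac{500\ \mathrm{ms}}{160\ \mathrm{\mu s}} = 3125$$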
• SIMD scenario => parallelisation with CUDA
• Each CUDA thread is mapped to 1 element of a vector unit; a full vector unit = 32 consecutive threads = a warp
• A warp is processed at once by the hardware
• On the software level, threads are grouped into thread blocks (a minimal kernel sketch follows the figure below)
[Figure: memory layout of the raw data. The samples for antennas 1-8 are interleaved so that each warp reads a contiguous, 128-byte-aligned segment (0 B, 128 B, 256 B, 384 B, ...); the pattern repeats over the nInt samples for each frequency channel nf = 1...16, and over the nSweeps sweeps.]
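As an illustration of the thread-to-data mapping, a minimal CUDA sketch; the kernel name, the int16 storage, and the launch configuration are illustrative assumptions, not the actual SAMI code:

```cuda
#include <cuda_runtime.h>

// Convert raw 14-bit ADC integers (stored in 16-bit words) to floats.
// One thread per sample: thread i handles raw[i], so a 32-thread warp
// reads one contiguous 64-byte segment of int16 data (coalesced access,
// matching the warp-aligned layout in the figure above).
__global__ void convert_raw(const short *raw, float *out, size_t nSamples)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nSamples)
        out[i] = (float)raw[i];
}

// Launch: one thread per sample, 256 threads (8 warps) per block.
// convert_raw<<<(nSamples + 255) / 256, 256>>>(d_raw, d_out, nSamples);
```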
IDL code
Flow chart of the IDL analysis pipeline:
• get_config.pro reads bootconfig.rfctrl.ini: configuration data specifying which frequencies to read in, and the length and location of the time windows
• read_freq_split.pro and read_bin_raw.pro read the noise data and the TF test shot data file: integer data, giving voltage data for each selected frequency for each frequency sweep
• filter.pro filters the data
• complexify.pro: the 16 real signals are converted to 8 complex signals for the upper and lower sidebands (a minimal sketch of this step follows this list)
• calibration inputs (iqPhaseGradient.dat, upper_lower_complex.dat, sideband_cal_values_upper.dat) correct for: phase drift between the I and Q components; phase offsets and amplitude imbalance between the I and Q components (via matrix inversion); and phase differences between antennas due to the RF electrical lengths
• gpu_correlate_model.pro: cross-correlations are calculated for each antenna pair, each frequency sweep, and the upper and lower sidebands
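For illustration, a minimal CUDA sketch of the complexify step, assuming the 16 real channels are stored as interleaved I/Q pairs; the pairing, layout, and names are assumptions, and the real code also applies the I/Q calibrations:

```cuda
#include <cuComplex.h>

// Pair up 16 real channels into 8 complex signals: channel 2k is taken
// as the in-phase (I) component and channel 2k+1 as quadrature (Q).
// 'in' holds nFrames frames of 16 floats; 'out' holds 8 complex values
// per frame. (Assumed layout, for illustration only.)
__global__ void complexify(const float *in, cuFloatComplex *out,
                           size_t nFrames)
{
    size_t f = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= nFrames) return;
    for (int k = 0; k < 8; ++k) {
        float i_comp = in[f * 16 + 2 * k];      // I component
        float q_comp = in[f * 16 + 2 * k + 1];  // Q component
        out[f * 8 + k] = make_cuFloatComplex(i_comp, q_comp);
    }
}
```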
GPU code
[Flow chart: read_bin_raw_gpu.cu copies the data to the GPU, then runs data conditioning, forward/backward CUFFT stages, filter, IQ correction, IQ_filter, sideband suppression, and RF phase calibration, calculates the cross-correlations, and copies the results from the GPU so they are available on the host]
• Wrote 14 CUDA kernels and made use of the CUFFT library (an illustrative CUFFT round trip is sketched below)
• Limited memory available on the GPU => can't copy all the data to the GPU and process it at once
• Need to carve the problem up and exploit CUDA streams and concurrency
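As an illustration of the forward/backward CUFFT stages, a minimal forward-filter-inverse round trip on complex data; the function name, plan sizes, and batching are assumptions, not SAMI's actual parameters:

```cuda
#include <cufft.h>
#include <cuComplex.h>

// Forward FFT -> (filtering would run here) -> inverse FFT, in place.
// Batched 1D transforms: one FFT of length nInt per antenna/sweep.
// Sizes are illustrative; error handling trimmed for brevity.
void fft_roundtrip(cuFloatComplex *d_signal, int nInt, int nBatch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, nInt, CUFFT_C2C, nBatch);

    // In-place forward transform into the frequency domain.
    cufftExecC2C(plan, (cufftComplex *)d_signal,
                 (cufftComplex *)d_signal, CUFFT_FORWARD);

    // ... a filter kernel would zero unwanted frequency bins here ...

    // Inverse transform back to the time domain. CUFFT leaves the
    // result unnormalised: divide by nInt to recover the original scale.
    cufftExecC2C(plan, (cufftComplex *)d_signal,
                 (cufftComplex *)d_signal, CUFFT_INVERSE);

    cufftDestroy(plan);
}
```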
CUDA streams and concurrency
• Exploit concurrency: overlap copies to the GPU with kernel execution on the GPU
• CUDA exposes concurrency through streams: a sequence of commands that execute in order (commands in different streams may run concurrently)
[Figure: timeline comparison. With a single stream, the copy up, kernel, and copy down for chunks 1-6 execute strictly one after another. With three streams, the chunks are distributed round-robin across streams 1-3, so copies and kernels from different streams overlap in time.]
CUDA streams and concurrency
• GPUs support the following forms of concurrency (a minimal streams sketch follows the figure below):
  - Overlapping copies to or from the device with kernel execution
  - Executing more than one kernel at the same time
  - Overlapping copies to the GPU with copies from the GPU
[Figure: with three streams, the copy up, kernel, and copy down for chunks 1-6 are interleaved in time, so that at any instant an upload, a kernel, and a download from different streams can be in flight simultaneously.]
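A minimal sketch of this chunked, multi-stream pattern; the process kernel, buffer names, and chunk sizes are illustrative assumptions, and copies only overlap with execution when the host buffer is pinned (allocated with cudaMallocHost):

```cuda
#include <cuda_runtime.h>

#define N_STREAMS 3

// Hypothetical kernel standing in for the real processing chain.
__global__ void process(float *chunk, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] *= 2.0f;   // placeholder computation
}

// Round-robin the chunks over three streams so that uploads, kernels,
// and downloads from different streams can overlap in time.
void run_chunked(float *h_data,   // pinned host buffer (cudaMallocHost)
                 float *d_data,   // device buffer sized for all chunks
                 size_t chunkElems, int nChunks)
{
    cudaStream_t streams[N_STREAMS];
    for (int s = 0; s < N_STREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t s = streams[c % N_STREAMS];
        float *h = h_data + (size_t)c * chunkElems;
        float *d = d_data + (size_t)c * chunkElems;
        size_t bytes = chunkElems * sizeof(float);

        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s); // copy up
        process<<<(chunkElems + 255) / 256, 256, 0, s>>>(d, chunkElems);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s); // copy down
    }

    for (int s = 0; s < N_STREAMS; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```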
Acceleration results
• Code development on a machine with:
  - Intel Xeon CPU E5-2670 0 @ 2.60 GHz
  - Tesla K40c with 12 Gbytes GDDR5
• Data for a few shots were processed on the development machine to check correctness against the IDL code
                        IDL        C          CUDA
Total time (s)          1038.38    464.55     17.42
Total time (min:s)      17:18      7:44       0:17
Speed-up                -          2.24x      59.61x (26.67x relative to C)
• Acquired a dedicated GPU card for SAMI: a GeForce GTX 770 with 4 Gbytes GDDR5
• Cycled through 1837 shots in 30 hours => an average of 58 seconds per shot
• Most of this increase is due to CPU time and reading from the hard disk
Summary and future work
• Successfully accelerated the SAMI data analysis code, enabling the processing of 12 Tbytes of raw data from previous MAST campaigns
• Ability to compare cross-correlation data from many shots
• Enables intershot analysis in future campaigns (NSTX-U, MAST-U)
• Reduce the run time of the code, aiming for real time (how the code accesses the raw data / the data shape, FPGA/GPU communication [1])
• Demonstrate the benefit of a multi-GPU system
[1] R. Bittner et al., Cluster Comput., DOI 10.1007/s10586-013-0280-9