OFC/NFOEC: GPU-based Parallelization of System Modeling

Stephan Pachnicke, 18.03.2013

GPU-based Parallelization of System Modeling

© 2013 ADVA Optical Networking. All rights reserved.

Outline

• Motivation

• Numerical System Modeling

• GPU-Parallelization

• Comparison of Speedup and Accuracy

• Conclusion


Acknowledgments

The author would like to acknowledge the help and contributions of

Adam Chachaj – Krone Messtechnik

Heinrich Müller – TU Dortmund

Peter Krummrich – TU Dortmund

Markus Roppelt – ADVA Optical Networking

Michael Eiselt – ADVA Optical Networking



Motivation


In Short: Computational Performance

Graphics Processing Unit (GPU) vs. CPU cluster



Increase in GFlop/s

• GPU performance is growing even faster than predicted by Moore's law and is significantly higher than CPU performance

• GPUs are also attractive for general-purpose computing (e.g., complex numerical simulations)



Optical System Modeling

• Simulation of (long-haul) optical transmission systems requires a numerical solution of the nonlinear Schrödinger equation

High computational effort: accurate simulation of nonlinear fiber effects requires small step sizes

• Precise estimation of the bit error ratio with Monte-Carlo simulations for PMD and noise

Requires a high number of simulated bits
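For orientation, a commonly used scalar form of the nonlinear Schrödinger equation is given here. The slides do not state which form was simulated, so the symbols are assumptions: A(z, T) is the slowly varying field envelope, α the attenuation, β₂ the group-velocity dispersion, and γ the nonlinear coefficient. Polarization-multiplexed signals would generally need a coupled (two-polarization) version.

```latex
\frac{\partial A}{\partial z}
  = -\frac{\alpha}{2}\,A
    \;-\; \frac{i\beta_2}{2}\,\frac{\partial^2 A}{\partial T^2}
    \;+\; i\gamma\,|A|^2 A
```

The first two terms on the right are the linear part (loss and dispersion); the last term is the nonlinear (Kerr) part.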



Split-Step Fourier Method (SSFM)

• Splits the nonlinear Schrödinger equation into linear and nonlinear parts

• Separate solution of linear and nonlinear parts

• Solution of the linear part in the frequency domain and of the nonlinear part in the time domain (acceptable for small step sizes)

[Figure: SSFM block diagram with a chain of FFT and IFFT blocks; one split-step is marked]
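The split-step idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation behind the slides: it assumes a lossless, scalar fiber, a first-order (non-symmetrized) splitting, and units of ps, km and W. The function names (`fft`, `ifft`, `ssfm`) and all parameter values are hypothetical.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(x):
    """Inverse FFT, including the 1/n scaling."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def ssfm(a, dt, dz, n_steps, beta2, gamma):
    """Propagate the field samples a over n_steps split-steps (no loss)."""
    n = len(a)
    # Angular frequency grid in FFT order.
    omega = [2.0 * math.pi * (k if k < n // 2 else k - n) / (n * dt)
             for k in range(n)]
    # Linear (dispersion) operator for one step, applied in the frequency domain.
    lin = [cmath.exp(0.5j * beta2 * w * w * dz) for w in omega]
    for _ in range(n_steps):
        spec = fft(a)
        a = ifft([s * h for s, h in zip(spec, lin)])
        # Nonlinear operator (Kerr phase rotation), applied in the time domain.
        a = [v * cmath.exp(1j * gamma * abs(v) ** 2 * dz) for v in a]
    return a


# Usage: a Gaussian pulse with 1 mW peak power on a 64-sample grid.
n, dt = 64, 1.0
pulse = [complex(math.sqrt(1e-3) * math.exp(-0.5 * ((i - n / 2) * dt / 5.0) ** 2))
         for i in range(n)]
out = ssfm(pulse, dt, dz=0.1, n_steps=20, beta2=-21.7, gamma=1.3)
print(sum(abs(v) ** 2 for v in pulse), sum(abs(v) ** 2 for v in out))
```

Each split-step costs one FFT and one IFFT, which is why FFT speed and accuracy dominate the overall simulation. Without loss, both operators only rotate phases, so the two printed energies should agree closely.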


Speedup Factor (GPU vs CPU)

• Single-precision arithmetic has much higher performance on GPUs (because the main target market is computer gaming)

• Longer block lengths allow better parallelization

Single precision implementation desirable

Legend

DP: Nvidia CUDA FFT

SP: FFT using pre-calculated twiddle factors

Single precision (SP)

Double precision (DP)



Accuracy (in single precision)

• Total accuracy of SSFM dominated by FFT accuracy

• Backward error grows linearly with increasing number of FFTs

• CUDA FFT shows considerably higher error than other FFT implementations

Legend

CUFFT: Nvidia CUDA FFT

FFTW: Fastest Fourier Transform in the West

IPP: Intel Integrated Performance Primitives

LUT: FFT using trigonometric functions pre-calculated in DP (look-up table)

Backward error: RMSE(IFFT(FFT(x)), x), where x is the input signal

[Plot annotation: LUT-based FFT]
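The round-trip metric in the legend can be written out as follows. This is one common definition, assuming the signal consists of N complex samples x_n; the symbols are not taken from the original slide.

```latex
\tilde{x} = \mathrm{IFFT}\bigl(\mathrm{FFT}(x)\bigr), \qquad
\mathrm{RMSE}(\tilde{x}, x)
  = \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}\bigl|\tilde{x}_n - x_n\bigr|^2}
```

If every transform pair contributes a roughly equal, small error, the error after k pairs is about k times the single-pair value, which matches the linear growth stated in the bullets.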



Analysis: Accuracy

Why is the accuracy of CUFFT in SP relatively low?

FFT accuracy depends crucially on the accuracy of the "twiddle factors" (trigonometric function values)

The hardware implementation of trigonometric functions in SP on GPUs is optimized for peak performance, not accuracy

What can be done to increase accuracy in single precision?

Implementation of a Taylor series expansion (slow!)

Compute the trigonometric functions in DP on the CPU and store them in a look-up table on the GPU (especially suited to the split-step Fourier method, with its thousands of FFTs of similar length)

J. C. Schatzman, SIAM J. Scientific Comput. (1996).
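The effect of twiddle-factor accuracy can be reproduced on a CPU with a small experiment. The sketch below is an illustration under stated assumptions, not the GPU code from the talk: single precision is emulated by rounding every butterfly result to float32, and a single-precision multiplication recurrence stands in for a fast but less accurate trigonometric evaluation (it is not how CUFFT computes its twiddles). All function names are hypothetical.

```python
import cmath
import math
import random
import struct


def f32(v):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def c32(z):
    """Round both parts of a complex number to single precision."""
    return complex(f32(z.real), f32(z.imag))


def twiddles_lut(n):
    """Twiddles computed in double precision, then stored in single precision."""
    return [c32(cmath.exp(-2j * math.pi * k / n)) for k in range(n // 2)]


def twiddles_recurrence(n):
    """Twiddles generated by repeated single-precision multiplication."""
    w1 = c32(cmath.exp(-2j * math.pi / n))
    table = [1 + 0j]
    for _ in range(n // 2 - 1):
        table.append(c32(table[-1] * w1))
    return table


def fft_radix2(x, table, rnd):
    """Iterative radix-2 FFT (n a power of two, n >= 2); rnd rounds each result."""
    n = len(x)
    bits = n.bit_length() - 1
    # Bit-reversal permutation of the input.
    a = [rnd(x[int(format(i, "0{}b".format(bits))[::-1], 2)]) for i in range(n)]
    half = 1
    while half < n:
        stride = n // (2 * half)
        for start in range(0, n, 2 * half):
            for k in range(half):
                t = rnd(table[k * stride] * a[start + k + half])
                u = a[start + k]
                a[start + k] = rnd(u + t)
                a[start + k + half] = rnd(u - t)
        half *= 2
    return a


def rmse(a, b):
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)) / len(a))


n = 1024
rng = random.Random(0)
x = [c32(complex(rng.gauss(0, 1), rng.gauss(0, 1))) for _ in range(n)]

# Double-precision reference: exact twiddles, no rounding.
table_dp = [cmath.exp(-2j * math.pi * k / n) for k in range(n // 2)]
ref = fft_radix2(x, table_dp, lambda z: z)

err_lut = rmse(fft_radix2(x, twiddles_lut(n), c32), ref)
err_rec = rmse(fft_radix2(x, twiddles_recurrence(n), c32), ref)
print("LUT twiddles:", err_lut, " recurrence twiddles:", err_rec)
```

Both runs use the same emulated single-precision butterflies; only the twiddle table differs. The table is built once and reused, which is the reason the approach suits a simulation with thousands of FFTs of the same length.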



Illustrative Example

[Figure: two panels, CUDA FFT (SP) and LUT-based FFT (SP)]

• A look-up-table-based FFT provides significantly increased accuracy in single-precision arithmetic

• The look-up table holds pre-calculated "twiddle factor" values

Source: S. Pachnicke, et al, OFC 2011.



System Analysis (SSFM Simulation)

• GPU double precision results are (almost) identical to CPU results

• The OSNR penalty of our single-precision implementation remains below 0.1 dB up to approx. 125,000 split-steps

GPU simulation (in SP or DP) vs. CPU simulation (in DP)

Source: S. Pachnicke, IEEE ICTON, 2010.

[Figure: 11 x 112 Gb/s CP-QPSK; y-axis: required OSNR deviation for BER = 10^-3 [dB]]



Combined Simulation in SP & DP

Combined simulation in single and double precision, with automatic (algorithmic) choice of the number of single-precision simulations

Calculate approximate division of the parameter space into strata by fast simulations with single precision.

The ellipses represent parameter combinations for which bit errors occur during transmission.

Execute simulations with double precision accuracy sparsely in the different strata to assess the BER.

P. Serena, et al., IEEE JLT, 2009. S. Pachnicke, et al., OFC 2011.
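The two-stage idea (cheap runs to find the strata, accurate runs to estimate the error rate) can be illustrated with a toy problem. The sketch below is not the algorithm from the cited papers: it assumes a two-dimensional unit-square parameter space, a single ellipse as the region where errors occur, a regular grid as strata, and a slightly enlarged ellipse as a stand-in for the less accurate single-precision runs. All names and numbers are hypothetical.

```python
import math
import random


def in_error_region(u, v, scale=1.0):
    """Toy stand-in for 'bit errors occur': inside an ellipse in the unit square."""
    return ((u - 0.5) / (0.3 * scale)) ** 2 + ((v - 0.5) / (0.2 * scale)) ** 2 < 1.0


def stratified_estimate(rng, grid=10, pilot=10, budget=2000, n_min=5):
    """Return (estimated error probability, number of accurate runs used)."""
    cells = [(i, j) for i in range(grid) for j in range(grid)]

    def draw(cell):
        # Uniform random point inside one grid cell (one stratum).
        return (cell[0] + rng.random()) / grid, (cell[1] + rng.random()) / grid

    # Stage 1: cheap pilot runs with a slightly biased model (stand-in for SP).
    spread = []
    for cell in cells:
        hits = sum(in_error_region(*draw(cell), scale=1.02) for _ in range(pilot))
        p = hits / pilot
        spread.append(math.sqrt(p * (1.0 - p)))
    total_spread = sum(spread)

    # Stage 2: accurate runs (stand-in for DP). Every stratum gets n_min runs;
    # the extra budget goes to strata where the pilot saw mixed outcomes.
    estimate, used = 0.0, 0
    for cell, s in zip(cells, spread):
        extra = round(budget * s / total_spread) if total_spread > 0 else 0
        n = n_min + extra
        hits = sum(in_error_region(*draw(cell)) for _ in range(n))
        estimate += (hits / n) / len(cells)
        used += n
    return estimate, used


rng = random.Random(1)
est, used = stratified_estimate(rng)
true_p = math.pi * 0.3 * 0.2  # area of the ellipse
print("estimate:", est, "true:", true_p, "accurate runs:", used)
```

Only the accurate runs enter the final estimate, so the bias of the cheap model affects where the effort is spent but not the result itself. Strata that lie entirely inside or outside the error region receive only the minimum number of accurate runs.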



Discussion

• Results of combined (SP & DP) GPU simulations match well with results obtained from CPU simulations in DP

• Speedup of up to a factor of 180 possible compared to CPU

Stratified Monte-Carlo sampling allows an algorithmic choice of the number of DP simulations required for a given accuracy

The robustness of the algorithm was checked by deliberately selecting a high number of split-steps (880,000)

Source: S. Pachnicke, et al, OFC 2011.



Design Advantages

• GPU parallelization allows simulation of a long-distance 80-channel WDM system on a PC in reasonable time

• Result: The system performance can be estimated much more precisely than with CPU-based simulations (which typically model only 10-channel WDM systems)

Source: C. Xia, D. van den Borne, OFC, 2011



Conclusion

• GPUs offer a much higher computational peak performance than CPUs

• Full benefit of GPU power only in single precision

• Increase in single-precision accuracy possible by pre-computing the trigonometric function values for the FFTs

• Speedup in simulation time of more than a factor of 100 possible compared to CPU



Further Reading

• N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, J. Manferdelli, “High Performance Discrete Fourier Transforms on Graphics Processors”, Proc. of IEEE conference on Supercomputing (SC), article no. 2 (2008).

• S. Pachnicke, “Fiber-Optic Transmission Networks: Efficient Design and Dynamic Operation”, Springer (2011).

• J. C. Schatzman, “Accuracy of the Discrete Fourier Transform and the Fast Fourier Transform”, SIAM J. Scientific Comput. 17, 1150-1166 (1996).

• G. Falcao, V. Silva, L. Sousa, “How GPUs can outperform ASICs for fast LDPC decoding”, Proc. of ACM International Conference on Supercomputing (ICS), 390-399 (2009).

• J. A. Stratton, S. S. Stone, W.-M. W. Hwu, “MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs”, Lecture Notes in Computer Science 5335, 16-30 (2008).

• R. R. Exposito, G. L. Taboada, S. Ramos, J. Tourino, R. Doallo, “General-purpose computation on GPUs for high performance cloud computing”, Wiley J. Concurrency and Computation 24 (2012).



Thank you

