Simulation of scanning transmission electron microscope images on desktop computers


C. Dwyer

Monash Centre for Electron Microscopy, Department of Materials Engineering, Monash University, Victoria 3800, Australia

Article history: Received 10 September 2009; Received in revised form 10 November 2009; Accepted 17 November 2009

Keywords: STEM; Image simulation; Multislice

doi:10.1016/j.ultramic.2009.11.009

E-mail address: [email protected]

Abstract

Two independent strategies are presented for reducing the computation time of multislice simulations of scanning transmission electron microscope (STEM) images: (1) optimal probe sampling, and (2) the use of desktop graphics processing units. The first strategy is applicable to STEM images generated by elastic and/or inelastic scattering, and requires minimal effort for its implementation. Used together, these two strategies can reduce typical computation times from days to hours, allowing practical simulation of STEM images of general atomic structures on a desktop computer.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

The last decade has witnessed a surge in the popularity of the high-angle annular dark-field (HAADF) scanning transmission electron microscope (STEM) imaging technique for advanced materials characterization. This technique is capable of generating atomic resolution images of a material that are intuitively interpretable at the qualitative level. Going beyond a qualitative interpretation, in order to perform atomic model refinement, for example, necessitates quantification of the images by means of simulations (e.g. Refs. [1–7]). However, the simulation of STEM images presents a challenge due to the time required to generate data for each position of the electron probe. This difficulty is particularly apparent for non-crystalline specimens because of the lack of symmetry. For various practical reasons, most notably its efficiency in handling non-periodic atomic structures, the fast Fourier transform (FFT) based multislice method [8–10] has become widely adopted for electron image simulations. The suitability of the multislice approach for dynamical electron scattering in non-periodic solids makes it applicable to a wide variety of problems in materials science. While the FFT multislice method can easily handle non-crystalline specimens by means of a supercell, a straightforward adaptation to STEM image simulation, whereby a separate multislice calculation is carried out for each pixel of the image [11,12], can lead to impractical computation times of days or weeks, even on a computing cluster [13]. Such long computation times make the task of atomic model refinement cumbersome.

In the case of HAADF-STEM images, the issue of long computation times is exacerbated since the major contribution to the image intensity arises from thermal diffuse scattering (TDS), and the simulations must account for this scattering mechanism. The frozen phonon algorithm for the computation of TDS [14] is a logical choice from the perspective of rigor [15], and its accuracy has been experimentally verified even in the case of thicker specimens [6] and specimens containing heavy atomic species [16]. However, this algorithm requires that the simulation be repeated several times using different frozen phonon configurations, and hence the time required to simulate HAADF-STEM images is increased even further. The large amount of computation involved has spawned recent efforts to parallelize multislice simulations of STEM images to run on multiple CPUs [13,17,18]. Such efforts have resulted in clear and considerable progress, with the caveat that they are currently reliant on the computing power of computing clusters, which tend to be less accessible, less convenient and more expensive than desktop computers for performing image simulations.

In the present work, two independent strategies are described for reducing the computation times of multislice simulations of STEM images. The first is the realization that the maximum number of probe positions required to simulate a STEM image is determined by the convergence angle of the electron probe and the electron wavelength (as opposed to the desired resolution of the image or the resolution of the electron scattering calculations). This number of probe positions is often considerably less than the number used by various authors in the past (including the present author). The second strategy for reducing computation times capitalizes on the floating-point performance of desktop graphics processing units (GPUs) and the availability of a convenient programming language associated with them [19]. Used together, these two strategies can reduce typical computation times from days to hours, allowing practical simulations of STEM images of arbitrary atomic structures on a desktop computer.

2. Optimal probe sampling in STEM

Using a propagator-based formulation of Schrödinger quantum mechanics [20] and adopting the paraxial scattering approximation applicable to fast electrons, the amplitude at a point $\mathbf{k}$ in the diffraction plane of a STEM, arising from an electron probe centered at position $\mathbf{x}_0$ on the entrance surface of an arbitrary specimen, can be written in the form

\[
\psi(\mathbf{k},\mathbf{x}_0) = \int d^2x \; i\,G(\mathbf{k},\mathbf{x})\,\psi_0(\mathbf{x}-\mathbf{x}_0) ,
\tag{1}
\]

where bold symbols denote 2-dimensional vectors transverse to the optic axis, $\psi_0$ is the probe wave function at the entrance surface, and the propagator $G(\mathbf{k},\mathbf{x})$ can be interpreted as ($-i$ times) the amplitude at the point $\mathbf{k}$ in the diffraction plane given that the electron was in a position eigenstate at the point $\mathbf{x}$ on the specimen entrance surface. By expanding the probe wave function in terms of partial plane waves [21], the above expression for the diffracted amplitude takes the form

\[
\begin{aligned}
\psi(\mathbf{k},\mathbf{x}_0) &= \int d^2x \; i\,G(\mathbf{k},\mathbf{x}) \int d^2k' \; \tilde{\psi}_0(\mathbf{k}')\,e^{-2\pi i\,\mathbf{k}'\cdot(\mathbf{x}-\mathbf{x}_0)} \\
&= \int d^2k' \; i\,G(\mathbf{k},\mathbf{k}')\,\tilde{\psi}_0(\mathbf{k}')\,e^{2\pi i\,\mathbf{k}'\cdot\mathbf{x}_0} ,
\end{aligned}
\tag{2}
\]

where $\tilde{\psi}_0$, the Fourier transform of $\psi_0$, vanishes outside of the probe-forming aperture, and the 2-dimensional wave vector $\mathbf{k}'$ labels points in the plane of the probe-forming aperture. The STEM image intensity is a function of the probe position $\mathbf{x}_0$, and is given by the integral of the diffracted intensity, i.e.

\[
\begin{aligned}
I(\mathbf{x}_0) &= \int d^2k \; D(\mathbf{k})\,\lvert\psi(\mathbf{k},\mathbf{x}_0)\rvert^2 \\
&= \int d^2k\,d^2k'\,d^2k'' \; \tilde{\psi}_0^{*}(\mathbf{k}')\,G^{*}(\mathbf{k},\mathbf{k}')\,D(\mathbf{k})\,G(\mathbf{k},\mathbf{k}'')\,\tilde{\psi}_0(\mathbf{k}'')\,e^{-2\pi i\,(\mathbf{k}'-\mathbf{k}'')\cdot\mathbf{x}_0} ,
\end{aligned}
\tag{3}
\]

where $D(\mathbf{k})$ equals unity for points on the detector and zero otherwise. From the second line of Eq. (3) it is apparent that the highest spatial frequency in the STEM image corresponds to the maximum value of $\lvert\mathbf{k}'-\mathbf{k}''\rvert$ for which the integrand is non-zero, and that this value is dictated by the size of the probe-forming aperture. (The same conclusion can be reached by an argument involving the principle of reciprocity.) Hence the STEM image is bandwidth limited and need only be computed at a resolution which incorporates the spatial frequencies corresponding to $\max\lvert\mathbf{k}'-\mathbf{k}''\rvert$. For an orthogonal scan area of dimensions $a \times b$ (where $\times$ denotes scalar multiplication) satisfying periodic boundary conditions (as in a supercell-based simulation), the number of probe positions is given by $M_x \times M_y = 4(a \times b)\alpha/\lambda$, where $\lambda$ is the electron wavelength and $\alpha$ is the probe convergence semi-angle. According to Shannon's theorem, this number is the maximum necessary: computation at higher resolution is superfluous and simply amounts to sinc-function interpolation of the $M_x \times M_y$ array. In fact, sinc-function interpolation can be subsequently applied to the $M_x \times M_y$ array in order to recover the exact value of the image at any desired point. In practice, it is often desirable to carry out the resampling step in order to obtain an image that is more aesthetically pleasing (but nonetheless contains the same information). A zero-padded Fourier transform provides a simple way to accomplish this. If the scan area does not obey periodic boundary conditions, then the above scheme provides only an approximation for recovering the value of the image at intermediate points. However, several alternatives exist for making the approximation sufficiently accurate, including increasing the size of the scan area to make boundary effects less pronounced, or the use of more sophisticated resampling algorithms (see, for example, Refs. [22,23]).
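
To make this sampling rule concrete, the short sketch below (not part of the program described in this work; the function name is illustrative) evaluates the optimal number of probe positions for the example of Section 4, taking λ ≈ 0.0197 Å at 300 keV and α = 17 mrad.

// Sketch: optimal probe sampling per Section 2. Along a scan dimension of
// length a, the Nyquist-limited number of probe positions is 4*a*alpha/lambda,
// i.e. the image is band-limited at a spatial frequency of 2*alpha/lambda.
#include <cmath>
#include <cstdio>

static int probe_positions(double a, double alpha, double lambda)
{
    return static_cast<int>(std::ceil(4.0 * a * alpha / lambda));
}

int main()
{
    const double lambda = 0.0197;   // Angstrom, 300 keV electrons
    const double alpha  = 17.0e-3;  // rad, probe convergence semi-angle
    const int Mx = probe_positions(34.4, alpha, lambda);   // supercell width  (Angstrom)
    const int My = probe_positions(32.7, alpha, lambda);   // supercell height (Angstrom)
    std::printf("optimal sampling: %d x %d probe positions "
                "(cf. the 128 x 128 array used in Section 4)\n", Mx, My);
    return 0;
}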

With a suitable generalization of the propagator to account for excitations of the specimen (see, for example, Ref. [24], Eq. (3)), the reasoning given above can be extended to incorporate inelastic scattering. In particular, the same conclusions regarding the maximum number of probe positions hold in the case of multislice simulations of STEM images which derive from TDS, as in HAADF imaging [1,12–14,25], and atomic ionization, as in STEM core-loss imaging [26–28]. The reasoning above also holds when the wave field of the incident probe is partially coherent. As recently emphasized [6,16,29], the partial spatial coherence of the probe can have a significant effect on image contrast. However, since its effect is well-approximated by a convolution of the ideal STEM image (generated by a coherent probe) with an effective source distribution, it cannot lead to an increase in information with respect to the ideal image. Hence the maximum number of probe positions is not greater than that given above. In the case of partial temporal coherence, the effect on image contrast is well-approximated by a finite focal spread, and similar reasoning can be applied.
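
A minimal sketch of this source-size convolution is given below, assuming, purely for illustration, a Gaussian effective source and a scan obeying periodic boundary conditions; the routine and its arguments are hypothetical rather than part of any code described here, and for large scans the same convolution would normally be evaluated as a product in Fourier space.

// Sketch: approximate partial spatial coherence by convolving the ideal
// (coherent-probe) STEM image with a normalized Gaussian effective source.
// image holds Mx*My intensities in row-major order; dx, dy are the probe
// steps and sigma the source width, all in the same units.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> apply_source_size(const std::vector<float>& image,
                                     int Mx, int My, double dx, double dy,
                                     double sigma)
{
    // Gaussian kernel sampled on the probe grid with wrapped (periodic) distances.
    std::vector<double> kernel(Mx * My);
    double sum = 0.0;
    for (int j = 0; j < My; ++j)
        for (int i = 0; i < Mx; ++i) {
            const double x = std::min(i, Mx - i) * dx;
            const double y = std::min(j, My - j) * dy;
            kernel[j * Mx + i] = std::exp(-(x * x + y * y) / (2.0 * sigma * sigma));
            sum += kernel[j * Mx + i];
        }
    for (double& k : kernel) k /= sum;   // unit total weight preserves intensity

    // Direct periodic convolution (an FFT-based product is faster for large scans).
    std::vector<float> out(Mx * My, 0.0f);
    for (int j = 0; j < My; ++j)
        for (int i = 0; i < Mx; ++i) {
            double acc = 0.0;
            for (int q = 0; q < My; ++q)
                for (int p = 0; p < Mx; ++p)
                    acc += kernel[q * Mx + p] *
                           image[((j - q + My) % My) * Mx + ((i - p + Mx) % Mx)];
            out[j * Mx + i] = static_cast<float>(acc);
        }
    return out;
}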

The optimal probe sampling for STEM image simulations, as given above, is applicable to all methods of simulation. Its implementation in an existing algorithm will most likely require only small changes to the algorithm (if any).

3. Multislice calculations on the GPU

In the present work, desktop GPUs are used as a means of speeding up multislice calculations. The advantage of this approach is the remarkable floating-point performance achieved within the convenient environment of a desktop computer. This performance is achieved through the highly parallel architecture of the GPU. A GPU multislice code was written in the CUDA programming language [19] and makes use of the single-precision FFT routines in the CUFFT library [30]. The GPU program was run on a GeForce GTX 295 GPU (NVIDIA).
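
The CUDA program itself is available from the author on request [32] and is not reproduced here. Purely as an illustration of the core operation being timed, the sketch below shows how one multislice step might be written with CUFFT: the wave function is multiplied by the slice's phase grating in real space, transformed to reciprocal space, multiplied by the Fresnel propagator, and transformed back. All function and variable names are illustrative, and the factor 1/n compensates for CUFFT's unnormalized transforms. Both the elementwise products and the FFTs are data-parallel, which is what the highly parallel GPU architecture exploits.

// Minimal sketch (illustrative; not the program benchmarked here) of one
// multislice step on the GPU using CUFFT. psi, grating and propagator are
// device arrays of n = nx*ny single-precision complex values; grating is the
// real-space phase grating of the current slice and propagator the
// reciprocal-space Fresnel propagator.
#include <cufft.h>
#include <cuComplex.h>

__global__ void multiply_and_scale(cuComplex* a, const cuComplex* b,
                                   float scale, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cuComplex p = cuCmulf(a[i], b[i]);
        a[i] = make_cuComplex(scale * cuCrealf(p), scale * cuCimagf(p));
    }
}

// plan is a 2-D single-precision plan, e.g. cufftPlan2d(&plan, ny, nx, CUFFT_C2C).
void multislice_step(cufftHandle plan, cuComplex* psi,
                     const cuComplex* grating, const cuComplex* propagator,
                     int nx, int ny)
{
    const int n = nx * ny;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    multiply_and_scale<<<blocks, threads>>>(psi, grating, 1.0f, n);        // psi <- t(x) psi(x)
    cufftExecC2C(plan, psi, psi, CUFFT_FORWARD);                           // to reciprocal space
    multiply_and_scale<<<blocks, threads>>>(psi, propagator, 1.0f / n, n); // psi <- P(k) psi(k) / n
    cufftExecC2C(plan, psi, psi, CUFFT_INVERSE);                           // back to real space
}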

The performance of the GPU program was tested against two CPU multislice programs, both of which were written in the C/C++ programming language. The first CPU program uses the single-precision FFT routines written by Kirkland [10], whose suite of multislice-related programs has gained wide popularity in the TEM community. The second CPU code uses the single-precision FFT routines available in the FFTW3 (fastest Fourier transform in the west) library [31], which is a well-known and highly optimized set of FFT routines. The CPU programs were run on a single core of a Core 2 Quad Q9550 (Intel) running at 2.83 GHz. All programs made use of the atomic potentials supplied by Kirkland [10]. Further details are given in the Appendix.

Table 1 compares the GPU and CPU runtimes required to compute the evolution of a fast electron wave function through a single atomic slice. Runtimes are given for array sizes 256², 512², and 1024² (see the Appendix for details). It is observed that the GPU program significantly outperforms both CPU programs in all cases. When comparing the GPU program with the fastest (FFTW-based) CPU program, the speed-up is about 7 times, 11 times and 21 times for array sizes 256², 512² and 1024², respectively. The greater speed-ups observed with increasing array size are due to the greater efficiency of parallel execution on the GPU. The clear advantage of the GPU for the larger array sizes is important because it is precisely in these cases that the computation times can become impractical.

Table 1. Multislice runtimes for a single atomic slice (see text for details). Times in ms for array sizes 256², 512², and 1024².

Device               FFT library   Compiler    256²    512²    1024²
Core 2 Quad Q9550    Kirkland      icc 11.0    6.5     32      150
Core 2 Quad Q9550    FFTW3         icc 11.0    1.7     7.7     47
GeForce GTX 295      CUFFT         nvcc 2.3    0.26    0.69    2.2

Fig. 1. Atomic model (left) and simulated HAADF-STEM image (right) of a T1 (Al2CuLi) precipitate in ⟨1 1 2⟩ Al (see text for details).

The total runtime for a given multislice simulation can be estimated from the runtimes in Table 1 in a straightforward manner. For example, the runtime of a HAADF-STEM image simulation using the frozen phonon algorithm is given by multiplying the runtime in Table 1 by the number of slices, the number of probe positions, and the number of frozen phonon configurations. While such an estimate assumes linear scaling of the total runtime with the number of slices, some non-linearity was observed in the case of the GPU due to the time required to transfer data from GPU memory to CPU memory: simulations using a relatively small number of slices (≲ 10) were observed to take about 20% longer than the estimated runtime.
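
As a rough cross-check, applying this estimate to the example of Section 4 reproduces the reported runtime to within a few per cent. The following is only a sketch: the figures are taken from Table 1 and Section 4, the factor of two assumes both devices of the GTX 295 are used as described in the Appendix, and the data-transfer and phase-grating overheads mentioned in the text are neglected.

// Back-of-envelope runtime estimate for the Fig. 1 simulation (sketch).
#include <cstdio>

int main()
{
    const double t_slice = 2.2e-3;     // s per slice, 1024^2 array, GTX 295 (Table 1)
    const int    slices  = 80;
    const int    probes  = 128 * 128;  // probe positions
    const int    phonons = 4;          // frozen phonon configurations
    const int    devices = 2;          // GTX 295 used as two CUDA devices

    const double seconds = t_slice * slices * probes * phonons / devices;
    std::printf("estimated runtime: %.0f min (97 min reported)\n", seconds / 60.0);
    return 0;
}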

It is emphasized that while the CPU used for the comparison has multiple cores, the CPU programs were run on a single core only. CPU multislice programs written for parallel execution on multiple cores have the potential to run significantly faster than the CPU programs discussed here, especially if those programs are run on a cluster of CPUs [13,18]. On the other hand, the use of a GPU offers significant speed-ups within the convenient environment of a desktop computer. Furthermore, it is possible to write a CUDA program to run on multiple GPUs. For example, the CUDA environment views the GeForce GTX 295 as two separate devices, so that it is possible to run a multislice STEM simulation on this GPU twice as fast as suggested by the runtimes in Table 1.
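
A minimal sketch of how such a two-device split might be organized is given below: the probe positions are divided between two host threads, each bound to one CUDA device. The routine simulate_probe() is a hypothetical placeholder for the per-probe multislice calculation; none of this is taken from the program described here.

// Sketch (illustrative only): splitting the probe positions between the two
// devices of a GeForce GTX 295. Each host thread binds to one device and
// processes its share of probe positions independently.
#include <cuda_runtime.h>
#include <thread>

// Hypothetical placeholder for the full multislice calculation (all slices,
// all frozen phonon configurations) at one probe position.
void simulate_probe(int /*device*/, int /*probe_index*/) {}

void run_on_device(int device, int first, int last)
{
    cudaSetDevice(device);              // bind this host thread to one GPU
    for (int p = first; p < last; ++p)
        simulate_probe(device, p);
}

int main()
{
    const int n_probes = 128 * 128;     // e.g. the scan of Section 4
    std::thread a(run_on_device, 0, 0, n_probes / 2);
    std::thread b(run_on_device, 1, n_probes / 2, n_probes);
    a.join();
    b.join();                           // wall-clock time roughly halved
    return 0;
}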

4. Discussion and conclusion

As an example of a STEM image simulation using the two strategies presented above, Fig. 1 shows a simulated HAADF-STEM image of a proposed model of a T1 (Al2CuLi) precipitate in [1 1 2]-oriented Al. The model exhibits periodicity in the horizontal direction only. The supercell measured 34.4 Å × 32.7 Å in cross-section (the image in Fig. 1 shows half of the supercell) and was 200 Å thick. The simulation was run on a GeForce GTX 295 GPU using an array size of 1024² and 80 slices (see the Appendix for further details). The beam energy was 300 keV and the convergence semi-angle was 17 mrad. The HAADF detector inner and outer semi-angles were 40 and 190 mrad, respectively. Thermal diffuse scattering was incorporated via the frozen phonon algorithm using 4 frozen phonon configurations (subsequent averaging with respect to periodicity along the interface effectively gave 16 configurations). The number of probe positions spanned a 128² array (which results in sampling that is slightly greater than optimum), and the resulting image was subsequently resampled at 512² using sinc-function interpolation, as described above. The simulation took 97 min. In contrast, the same simulation would have taken 1.4 days on a single CPU core, and even longer if the optimal probe sampling described in the present work had not been used.

In summary, it has been demonstrated mathematically that the maximum number of probe positions required for a STEM image simulation is governed by the convergence angle of the electron probe and the electron wavelength. This number, which can be used to simulate STEM images with maximum efficiency while still retaining a high level of accuracy, also applies for STEM images generated by inelastic scattering, and for a partially coherent probe. It has also been demonstrated that the use of GPUs for multislice calculations can produce significant speed-ups (factors of 10 or more) with respect to a single CPU core. Finally, the use of optimal probe sampling in conjunction with GPUs has been shown to result in practical computation times for STEM image simulations of arbitrary atomic structures on a desktop computer [32].

Acknowledgments

The author would like to thank D. Lynch and S.D. Findlay for interesting discussions, and L.Y. Chang and C.J. Rossouw for helpful suggestions regarding the manuscript.

Appendix

The desktop computer used in the present work was equipped with a GA-X48T-DQ6 motherboard (Gigabyte), a Core 2 Quad Q9550 CPU (Intel), and 8 GB of DDR3 1333 MHz RAM (Corsair), and was running the Ubuntu 8.04 Linux operating system.

Regarding the FFTW3-based code, the FFTW3 library (version 3.2.2) was installed using the options --enable-float and --enable-sse, which take advantage of the SIMD (single-instruction, multiple-data) capability of the CPU. The FFTW3 planner function fftwf_plan_dft_2d() was invoked with the flag FFTW_MEASURE, which ensures the most efficient algorithm in the FFTW3 library is used. Both CPU codes were compiled using the Intel C compiler (version 11.0) invoked with the optimization flag -fast. The GPU code was compiled using the NVIDIA CUDA compiler (version 2.3).
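
For reference, single-precision planning along these lines might look as follows; this is a sketch rather than the benchmarked CPU program, and the array size and variable names are illustrative.

// Sketch: creating single-precision FFTW3 plans with FFTW_MEASURE, which
// times several candidate algorithms and picks the fastest. Planning is done
// once, outside the multislice loop (and overwrites the array contents).
#include <fftw3.h>

int main()
{
    const int n = 1024;   // 1024 x 1024 array
    fftwf_complex* psi =
        (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * n * n);

    fftwf_plan fwd = fftwf_plan_dft_2d(n, n, psi, psi, FFTW_FORWARD,  FFTW_MEASURE);
    fftwf_plan bwd = fftwf_plan_dft_2d(n, n, psi, psi, FFTW_BACKWARD, FFTW_MEASURE);

    // ... fill psi, then call fftwf_execute(fwd) and fftwf_execute(bwd)
    //     once per slice inside the multislice loop ...

    fftwf_destroy_plan(fwd);
    fftwf_destroy_plan(bwd);
    fftwf_free(psi);
    return 0;
}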


In order to avoid potential problems with timing calls, the runtimes in Table 1 were determined from computations involving 40 slices and are an average of 4 repeated runs (occasionally longer runtimes were encountered because of significant additional system load from other applications, but these were excluded). The phase grating functions were pre-calculated and stored in the GPU or CPU memory prior to computing the wave function evolution. Hence the time required to compute the phase gratings is omitted. However, for STEM image simulations, pre-calculation of the phase gratings is advantageous and constitutes a relatively small fraction of the computing time. While the CUDA environment views the GeForce GTX 295 as two separate devices, the code was run on one device only. The GPU and CPU codes were observed to produce identical outputs to 4 significant figures.

As might be expected for an algorithm dominated by FFTs, the runtimes of the FFTW3-based program listed in Table 1 are consistent with FFT benchmarks in the literature. FFT benchmarks are typically quoted as the number of floating-point operations per second (FLOPS) achieved during execution. The number of FLOPS is calculated using the following expression (which is not necessarily indicative of the actual number of floating-point operations for a particular FFT algorithm): $\mathrm{FLOPS} = t^{-1}\,5N\log_2 N$, where $N$ is the total number of pixels and $t$ is the time in seconds taken to perform one FFT. For array size 1024², the number of FLOPS achieved by the FFTW3-based program is 4.8 GFLOPS, which compares well with the FFTW3 benchmark of 4.0 GFLOPS for a similar system [31]. For the same array size, the GPU code achieves 103 GFLOPS. In all cases, the number of FLOPS achieved is well below the device's maximum capability, indicating that the computations are memory bound.
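
For concreteness, the per-FFT times implied by these figures follow directly from the expression above. The sketch below uses the GFLOPS values quoted in the text; the operation count is the nominal benchmark convention rather than a measured count.

// Sketch: the benchmark convention used above, FLOPS = 5 N log2(N) / t.
#include <cmath>
#include <cstdio>

int main()
{
    const double N   = 1024.0 * 1024.0;          // pixels in a 1024^2 array
    const double ops = 5.0 * N * std::log2(N);   // nominal FFT operation count

    // Per-FFT times implied by the quoted GFLOPS figures.
    std::printf("FFTW3 at 4.8 GFLOPS: t = %.1f ms per FFT\n", 1e3 * ops / 4.8e9);
    std::printf("GPU   at 103 GFLOPS: t = %.2f ms per FFT\n", 1e3 * ops / 103e9);
    return 0;
}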

The simulation presented in Fig. 1 utilized the GeForce GTX 295 as two separate devices. Since the phase grating functions were stored in GPU memory, the 1.8 GB of available memory places a constraint on the number of slices that can be used. However, this restriction is greatly relaxed in the case of the Tesla C1060 GPU (NVIDIA), which has 4 GB of memory.

References

[1] R.F. Loane, P. Xu, J. Silcox, Ultramicroscopy 40 (1992) 121–138.
[2] S.C. Anderson, C.R. Birkeland, G.R. Anstis, D.J.H. Cockayne, Ultramicroscopy 69 (1997) 83–103.
[3] K. Watanabe, T. Yamazaki, I. Hashimoto, M. Shiojiri, Phys. Rev. B 64 (2001) 115432.
[4] E. Carlino, V. Grillo, Phys. Rev. B 71 (2005) 235303.
[5] V. Grillo, E. Carlino, F. Glas, Phys. Rev. B 77 (2008) 054103.
[6] J.M. LeBeau, S.D. Findlay, L.J. Allen, S. Stemmer, Phys. Rev. Lett. 100 (2008) 206101.
[7] S. Van Aert, J. Verbeeck, R. Erni, S. Bals, M. Luysberg, D. Van Dyck, G. Van Tendeloo, Ultramicroscopy 109 (2009) 1236–1244.
[8] P. Goodman, A.F. Moodie, Acta Cryst. A 30 (1974) 280–290.
[9] K. Ishizuka, N. Uyeda, Acta Cryst. A 33 (1977) 740–749.
[10] E.J. Kirkland, Advanced Computing in Electron Microscopy, Plenum Press, 1998.
[11] E.J. Kirkland, R.F. Loane, J. Silcox, Ultramicroscopy 23 (1987) 77–96.
[12] K. Ishizuka, Ultramicroscopy 90 (2002) 71–83.
[13] J. Pizzaro, P.L. Galindo, E. Guerrero, A. Yanez, M.P. Guerrero, A. Rosenauer, D.L. Sales, S.I. Molina, Appl. Phys. Lett. 93 (2008) 153107.
[14] R.F. Loane, P. Xu, J. Silcox, Acta Cryst. A 47 (1991) 267–278.
[15] D. Van Dyck, Ultramicroscopy 109 (2009) 677–682.
[16] J.M. LeBeau, S.D. Findlay, X. Wang, A.J. Jacobson, L.J. Allen, S. Stemmer, Phys. Rev. B 79 (2009) 214110.
[17] http://people.ccmr.cornell.edu/~kirkland
[18] E. Carlino, V. Grillo, P. Palazzari, Springer Proceedings in Physics 120 (2008) 177–180.
[19] NVIDIA CUDA Programming Guide Version 2.3, 2009.
[20] J.D. Bjorken, S.D. Drell, Relativistic Quantum Mechanics, McGraw-Hill Book Company, 1964.
[21] J.C.H. Spence, J.M. Cowley, Optik 50 (1978) 129–142.
[22] L.P. Yaroslavsky, Appl. Opt. 36 (1997) 460–463.
[23] L. Yaroslavsky, Appl. Opt. 42 (2003) 4166–4175.
[24] C. Dwyer, S.D. Findlay, L.J. Allen, Phys. Rev. B 77 (2008) 184107.
[25] L.J. Allen, S.D. Findlay, M.P. Oxley, C.J. Rossouw, Ultramicroscopy 96 (2003) 47–63.
[26] C. Dwyer, Ultramicroscopy 104 (2005) 141–151.
[27] C. Dwyer, Phys. Rev. B 72 (2005) 144102.
[28] S.D. Findlay, M.P. Oxley, S.J. Pennycook, L.J. Allen, Ultramicroscopy 104 (2005) 126–140.
[29] C. Dwyer, R. Erni, J. Etheridge, Appl. Phys. Lett. 93 (2008) 021115.
[30] CUDA CUFFT Library, 2008.
[31] FFTW Library Version 3 Reference Manual, 2009. Available from http://www.fftw.org.
[32] The CUDA multislice program used in the present work is available from the author by request.