GPU Computing Gems
Emerald Edition

Morgan Kaufmann’s Applications of GPU Computing Series

Computing is quickly becoming the third pillar of scientific research, due in large part to the performance gains achieved through graphics processing units (GPUs), which have become ubiquitous in handhelds, laptops, desktops, and supercomputer clusters. Morgan Kaufmann’s Applications of GPU Computing series offers training, examples, and inspiration for researchers, engineers, students, and supercomputing professionals who want to leverage the power of GPUs incorporated into their simulations or experiments. Each high-quality, peer-reviewed book is written by leading experts uniquely qualified to provide parallel computing insights and guidance.

Each GPU Computing Gems volume offers a snapshot of the state of parallel computing across a carefully selected subset of industry domains, giving you a window into the lead-edge research occurring across the breadth of science, and the opportunity to observe others’ algorithm work that might apply to your own projects. Find out more at http://mkp.com/gpu-computing-gems.

Recommended Parallel Computing Titles

Programming Massively Parallel Processors: A Hands-on Approach
By David B. Kirk and Wen-mei W. Hwu
ISBN: 9780123814722

GPU Computing Gems: Jade Edition
Editor-in-Chief: Wen-mei W. Hwu
ISBN: 9780123859631
Coming Summer 2011

The Art of Multiprocessor Programming
By Maurice Herlihy and Nir Shavit
ISBN: 9780123705914

GPU Computing Gems
Emerald Edition

Wen-mei W. Hwu

AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann Publishers is an imprint of Elsevier

Acquiring Editor: Todd Green
Assistant Editor: Robyn Day
Project Manager: Paul Gottehrer
Designer: Dennis Schaefer

Morgan Kaufmann is an imprint of Elsevier
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA

© 2011 NVIDIA Corporation and Wen-mei W. Hwu. Published by Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods or professional practices may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information or methods described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
GPU computing gems / editor, Wen-mei W. Hwu.
p. cm.
Includes bibliographical references.
ISBN 978-0-12-384988-5
1. Graphics processing units–Programming. 2. Imaging systems. 3. Computer graphics. 4. Image processing–Digital techniques. I. Hwu, Wen-mei.
T385.G6875 2011
006.6–dc22
2010047487

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

For information on all MK publications visit our website at www.mkp.com

Printed in the United States of America
11 12 13 14 15 11 10 9 8 7 6 5 4 3 2 1

Contents

Editors, Reviewers, and Authors

Introduction
Wen-mei W. Hwu

SECTION 1 SCIENTIFIC SIMULATION
Robert M. Farber

CHAPTER 1 GPU-Accelerated Computation and Interactive Display of Molecular Orbitals
John E. Stone, David J. Hardy, Jan Saam, Kirby L. Vandivort, Klaus Schulten

CHAPTER 2 Large-Scale Chemical Informatics on GPUs
Imran S. Haque, Vijay S. Pande

CHAPTER 3 Dynamical Quadrature Grids: Applications in Density Functional Calculations
Nathan Luehr, Ivan Ufimtsev, Todd Martinez

CHAPTER 4 Fast Molecular Electrostatics Algorithms on GPUs
David J. Hardy, John E. Stone, Kirby L. Vandivort, David Gohara, Christopher Rodrigues, Klaus Schulten

CHAPTER 5 Quantum Chemistry: Propagation of Electronic Structure on a GPU
Jacek Jakowski, Stephan Irle, Keiji Morokuma

CHAPTER 6 An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm
Martin Burtscher, Keshav Pingali

CHAPTER 7 Leveraging the Untapped Computation Power of GPUs: Fast Spectral Synthesis Using Texture Interpolation
Richard Townsend, Karthikeyan Sankaralingam, Matthew D. Sinclair

CHAPTER 8 Black Hole Simulations with CUDA
Frank Herrmann, John Silberholz, Manuel Tiglio

CHAPTER 9 Treecode and Fast Multipole Method for N-Body Simulation with CUDA
Rio Yokota, Lorena A. Barba

CHAPTER 10 Wavelet-Based Density Functional Theory Calculation on Massively Parallel Hybrid Architectures
Luigi Genovese, Matthieu Ospici, Brice Videau, Thierry Deutsch, Jean-Francois Mehaut

SECTION 2 LIFE SCIENCES
Bertil Schmidt

CHAPTER 11 Accurate Scanning of Sequence Databases with the Smith-Waterman Algorithm
Łukasz Ligowski, Witold R. Rudnicki, Yongchao Liu, Bertil Schmidt

CHAPTER 12 Massive Parallel Computing to Accelerate Genome-Matching
Ben Weiss, Mike Bailey

CHAPTER 13 GPU-Supercomputer Acceleration of Pattern Matching
Ali Khajeh-Saeed, J. Blair Perot

CHAPTER 14 GPU Accelerated RNA Folding Algorithm
Guillaume Rizk, Dominique Lavenier, Sanjay Rajopadhye

CHAPTER 15 Temporal Data Mining for Neuroscience
Wu-chun Feng, Yong Cao, Debprakash Patnaik, Naren Ramakrishnan

SECTION 3 STATISTICAL MODELING
Mike Giles

CHAPTER 16 Parallelization Techniques for Random Number Generators
Thomas Bradley, Jacques du Toit, Robert Tong, Mike Giles, Paul Woodhams

CHAPTER 17 Monte Carlo Photon Transport on the GPU
Laszlo Szirmay-Kalos, Balazs Toth, Milan Magdics

CHAPTER 18 High-Performance Iterated Function Systems
Christoph Schied, Johannes Hanika, Holger Dammertz, Hendrik P. A. Lensch

SECTION 4 EMERGING DATA-INTENSIVE APPLICATIONS
Volodymyr Kindratenko

CHAPTER 19 Large-Scale Machine Learning
Jerod J. Weinman, Augustus Lidaka, Shitanshu Aggarwal

CHAPTER 20 Multiclass Support Vector Machine
Sergio Herrero-Lopez

CHAPTER 21 Template-Driven Agent-Based Modeling and Simulation with CUDA
Paul Richmond, Daniela Romano

CHAPTER 22 GPU-Accelerated Ant Colony Optimization
Robin M. Weiss

SECTION 5 ELECTRONIC DESIGN AUTOMATION
Sunil P. Khatri

CHAPTER 23 High-Performance Gate-Level Simulation with GP-GPUs
Debapriya Chatterjee, Andrew DeOrio, Valeria Bertacco

CHAPTER 24 GPU-Based Parallel Computing for Fast Circuit Optimization
Yifang Liu, Jiang Hu

SECTION 6 RAY TRACING AND RENDERING
Austin Robison

CHAPTER 25 Lattice Boltzmann Lighting Models
Robert Geist, James Westall

CHAPTER 26 Path Regeneration for Random Walks
Jan Novak, Vlastimil Havran, Carsten Dachsbacher

CHAPTER 27 From Sparse Mocap to Highly Detailed Facial Animation
Bernd Bickel, Manuel Lang

CHAPTER 28 A Programmable Graphics Pipeline in CUDA for Order-Independent Transparency
Mengcheng Huang, Fang Liu, Xuehui Liu, Enhua Wu

SECTION 7 COMPUTER VISION
James Fung

CHAPTER 29 Fast Graph Cuts for Computer Vision
P.J. Narayanan, Vibhav Vineet, Timo Stich

CHAPTER 30 Visual Saliency Model on Multi-GPU
Anis Rahman, Dominique Houzet, Denis Pellerin

CHAPTER 31 Real-Time Stereo on GPGPU Using Progressive Multiresolution Adaptive Windows
Yong Zhao, Gabriel Taubin

CHAPTER 32 Real-Time Speed-Limit-Sign Recognition on an Embedded System Using a GPU
Pinar Muyan-Ozcelik, Vladimir Glavtchev, Jeffrey M. Ota, John D. Owens

CHAPTER 33 Haar Classifiers for Object Detection with CUDA
Anton Obukhov

SECTION 8 VIDEO AND IMAGE PROCESSING
Timo Stich

CHAPTER 34 Experiences on Image and Video Processing with CUDA and OpenCL
Alptekin Temizel, Tugba Halici, Berker Logoglu, Tugba Taskaya Temizel, Fatih Omruuzun, Ersin Karaman

CHAPTER 35 Connected Component Labeling in CUDA
Ondrej Stava, Bedrich Benes

CHAPTER 36 Image De-Mosaicing
Joe Stam, James Fung

SECTION 9 SIGNAL AND AUDIO PROCESSING
John Roberts

CHAPTER 37 Efficient Automatic Speech Recognition on the GPU
Jike Chong, Ekaterina Gonina, Kurt Keutzer

CHAPTER 38 Parallel LDPC Decoding
Gabriel Falcao, Vitor Silva, Leonel Sousa

CHAPTER 39 Large-Scale Fast Fourier Transform
Yifeng Chen, Xiang Cui, Hong Mei

SECTION 10 MEDICAL IMAGING
Lawrence Tarbox

CHAPTER 40 GPU Acceleration of Iterative Digital Breast Tomosynthesis
Dana Schaa, Benjamin Brown, Byunghyun Jang, Perhaad Mistry, Rodrigo Dominguez, David Kaeli, Richard Moore, Daniel B. Kopans

CHAPTER 41 Parallelization of Katsevich CT Image Reconstruction Algorithm on Generic Multi-Core Processors and GPGPU
Abderrahim Benquassmi, Eric Fontaine, Hsien-Hsin S. Lee

CHAPTER 42 3-D Tomographic Image Reconstruction from Randomly Ordered Lines with CUDA
Guillem Pratx, Jing-Yu Cui, Sven Prevrhal, Craig S. Levin

CHAPTER 43 Using GPUs to Learn Effective Parameter Settings for GPU-Accelerated Iterative CT Reconstruction Algorithms
Wei Xu, Klaus Mueller

CHAPTER 44 Using GPUs to Accelerate Advanced MRI Reconstruction with Field Inhomogeneity Compensation
Yue Zhuo, Xiao-Long Wu, Justin P. Haldar, Thibault Marin, Wen-mei W. Hwu, Zhi-Pei Liang, Bradley P. Sutton

CHAPTER 45 ℓ1 Minimization in ℓ1-SPIRiT Compressed Sensing MRI Reconstruction
Mark Murphy, Miki Lustig

CHAPTER 46 Medical Image Processing Using GPU-Accelerated ITK Image Filters
Won-Ki Jeong, Hanspeter Pfister, Massimiliano Fatica

CHAPTER 47 Deformable Volumetric Registration Using B-Splines
James Shackleford, Nagarajan Kandasamy, Gregory Sharp

CHAPTER 48 Multiscale Unbiased Diffeomorphic Atlas Construction on Multi-GPUs
Linh Ha, Jens Kruger, Sarang Joshi, Claudio T. Silva

CHAPTER 49 GPU-Accelerated Brain Connectivity Reconstruction and Visualization in Large-Scale Electron Micrographs
Won-Ki Jeong, Hanspeter Pfister, Johanna Beyer, Markus Hadwiger

CHAPTER 50 Fast Simulation of Radiographic Images Using a Monte Carlo X-Ray Transport Algorithm Implemented in CUDA
Andreu Badal, Aldo Badano

Index

Editors, Reviewers, and Authors

Editor-In-Chief
Wen-mei W. Hwu, University of Illinois at Urbana-Champaign

Managing Editor
Andrew Schuh, University of Illinois at Urbana-Champaign

NVIDIA Editor
Nadeem Mohammad, NVIDIA

Area Editors
Robert M. Farber, Pacific Northwest National Laboratory (Section 1)

James Fung, NVIDIA (Section 7)

Mike Giles, Oxford University (Section 3)

Sunil P. Khatri, Texas A&M University (Section 5)

Volodymyr Kindratenko, University of Illinois at Urbana-Champaign (Section 4)

John Roberts, NVIDIA (Section 9)

Austin Robison, NVIDIA (Section 6)

Bertil Schmidt, Nanyang Technological University (Section 2)

Timo Stich, NVIDIA (Section 8)

Lawrence Tarbox, Washington University in St. Louis (Section 10)

Reviewers
Francois Beaune, /*jupiter jazz*/ visual effects consultants

Jiawen Chen, Massachusetts Institute of Technology

Andrea Di Blas, University of California, Santa Cruz

Roshan Dsouza, University of Wisconsin-Milwaukee

Richard Edgar, Harvard University

Martin Eisemann, Technical University, Braunschweig

John Estabrook, University of Illinois at Urbana-Champaign

Cass Everitt, NVIDIA

Reza Farivar, University of Illinois at Urbana-Champaign

Vladimir Frolov, NVIDIA

Vladimir Glavtchev, BMW Technology Office

Kanupriya Gulati, Intel Corporation

Trym Vegard Haavardsholm, Norwegian Defense Research Establishment

Ken Hawick, University of Auckland, New Zealand

Jared Hoberock, NVIDIA

Tim Kaldewey, Oracle

Vinay Karkala, Advanced Micro Devices

Christian Linz, Technical University, Braunschweig

Christian Lipski, Technical University, Braunschweig

Weiguo Liu, Nanyang Technological University

Dave Luebke, NVIDIA

W. James MacLean, Google

Corey Manders, A*STAR Institute for Infocomm Research

Morgan McGuire, Williams College, Massachusetts

Derek Nowrouzezahrai, Disney Research Zurich

Ming Ouyang, University of Louisville, Kentucky

Steven Parker, NVIDIA

Kalyan Perumalla, Oak Ridge National Laboratory

Nicolas Pinto, Massachusetts Institute of Technology

Tobias Preis, Johannes Gutenberg University

Ramtin Shams, Australian National University

Craig Steffen, University of Illinois at Urbana-Champaign

Andrei Tatarinov, NVIDIA

Cristina Nader Vasconcelos, Instituto de Computacao, Universidade Federal Fluminense, Brazil

Ben Weiss, Shell and Slate Software

Ruediger Westermann, Technical University, Munich

Jan Woetzel, MeVis Medical Solutions, AG

Kesheng Wu, Berkeley Lab, University of California

Ren Wu, HP Labs

Weihang Zhu, Lamar University, Texas

Authors
Shitanshu Aggarwal, Grinnell College, Iowa (Chapter 19)

Mike Bailey, Oregon State University (Chapter 12)

Andreu Badal, US Food and Drug Administration (CDRH/OSEL/DIAM) (Chapter 50)

Aldo Badano, US Food and Drug Administration (CDRH/OSEL/DIAM) (Chapter 50)

Lorena A. Barba, Boston University (Chapter 9)

Bedrich Benes, Purdue University, Indiana (Chapter 35)

Abderrahim Benquassmi, Georgia Institute of Technology (Chapter 41)

Valeria Bertacco, University of Michigan (Chapter 23)

Johanna Beyer, King Abdullah University of Science and Technology (KAUST) (Chapter 49)

Bernd Bickel, Disney Research, Zurich (Chapter 27)

Thomas Bradley, NVIDIA (Chapter 16)

Benjamin Brown, Northeastern University (Chapter 40)

Martin Burtscher, Texas State University, San Marcos (Chapter 6)

Yong Cao, Virginia Tech (Chapter 15)

Debapriya Chatterjee, University of Michigan (Chapter 23)

Yifeng Chen, Peking University (Chapter 39)

Jike Chong, University of California, Berkeley (Chapter 37)

Jing-Yu Cui, Stanford University (Chapter 42)

Xiang Cui, Peking University (Chapter 39)

Carsten Dachsbacher, Karlsruhe Institute of Technology (Chapter 26)

Holger Dammertz, Ulm University (Chapter 18)

Andrew DeOrio, University of Michigan (Chapter 23)

Thierry Deutsch, Laboratoire de Simulation Atomistique (Chapter 10)

Rodrigo Dominguez, Northeastern University (Chapter 40)

Jacques Du Toit, Numerical Algorithms Group (Chapter 16)

Gabriel Falcao, University of Coimbra (Chapter 38)

Massimiliano Fatica, NVIDIA (Chapter 46)

Wu-chun Feng, Virginia Tech and Wake Forest University (Chapter 15)

Eric Fontaine, Georgia Institute of Technology (Chapter 41)

James Fung, NVIDIA (Chapter 36)

Robert Geist, Clemson University (Chapter 25)

Luigi Genovese, European Synchrotron Radiation Facility (Chapter 10)

Mike Giles, Oxford University (Chapter 16)

Vladimir Glavtchev, BMW Group Technology Office (Chapter 32)

David Gohara, Saint Louis University School of Medicine (Chapter 4)

Ekaterina Gonina, University of California, Berkeley (Chapter 37)

Linh Ha, University of Utah (Chapter 48)

Markus Hadwiger, King Abdullah University of Science and Technology (KAUST) (Chapter 49)

Justin P. Haldar, University of Illinois at Urbana-Champaign (Chapter 44)

Tugba Halici, Middle East Technical University (Chapter 34)

Johannes Hanika, Ulm University (Chapter 18)

Imran S. Haque, Stanford University (Chapter 2)

David J. Hardy, University of Illinois at Urbana-Champaign (Chapters 1 and 4)

Vlastimil Havran, Czech Technical University in Prague (Chapter 26)

Sergio Herrero-Lopez, Massachusetts Institute of Technology (Chapter 20)

Frank Herrmann, University of Maryland, College Park (Chapter 8)

Dominique Houzet, GIPSA-lab (Chapter 30)

Jiang Hu, Texas A&M University (Chapter 24)

Mengcheng Huang, Chinese Academy of Sciences (Chapter 28)

Wen-mei W. Hwu, University of Illinois at Urbana-Champaign (Chapter 44)

Stephan Irle, Nagoya University (Chapter 5)

Jacek Jakowski, National Institute for Computational Sciences (Chapter 5)

Byunghyun Jang, Northeastern University (Chapter 40)

Won-Ki Jeong, Harvard University (Chapters 46 and 49)

Sarang Joshi, University of Utah, Salt Lake City (Chapter 48)

David Kaeli, Northeastern University (Chapter 40)

Nagarajan Kandasamy, Drexel University (Chapter 47)

Ersin Karaman, Middle East Technical University (Chapter 34)

Kurt Keutzer, University of California, Berkeley (Chapter 37)

Ali Khajeh-Saeed, University of Massachusetts, Amherst (Chapter 13)

Daniel B. Kopans, Massachusetts General Hospital (Chapter 40)

Jens Kruger, Interactive Visualization and Data Analysis Group, Saarbrucken (Chapter 48)

Manuel Lang, Disney Research, Zurich (Chapter 27)

Dominique Lavenier, Ecole Normale Superieure de Cachan (Chapter 14)

Hsien-Hsin S. Lee, Georgia Institute of Technology (Chapter 41)

Hendrik Lensch, Ulm University (Chapter 18)

Craig S. Levin, Stanford University (Chapter 42)

Zhi-Pei Liang, University of Illinois at Urbana-Champaign (Chapter 44)

Augustus Lidaka, Grinnell College (Chapter 19)

Łukasz Ligowski, University of Warsaw (Chapter 11)

Fang Liu, Chinese Academy of Sciences (Chapter 28)

Xuehui Liu, Chinese Academy of Sciences (Chapter 28)

Yifang Liu, Texas A&M University (Chapter 24)

Yongchao Liu, Nanyang Technological University (Chapter 11)

Berker Logoglu, Middle East Technical University (Chapter 34)

Nathan Luehr, Stanford University and SLAC National Accelerator Laboratory (Chapter 3)

Miki Lustig, University of California, Berkeley (Chapter 45)

Milan Magdics, Budapest University of Technology and Economics (Chapter 17)

Thibault Marin, Illinois Institute of Technology (Chapter 44)

Todd Martinez, Stanford University and SLAC National Accelerator Laboratory (Chapter 3)

Jean-Francois Mehaut, Universite Joseph Fourier (Chapter 10)

Hong Mei, Peking University (Chapter 39)

Perhaad Mistry, Northeastern University (Chapter 40)

Richard Moore, Massachusetts General Hospital (Chapter 40)

Keiji Morokuma, Kyoto University (Chapter 5)

Klaus Mueller, State University of New York, Stony Brook (Chapter 43)

Mark Murphy, University of California, Berkeley (Chapter 45)

Pinar Muyan-Ozcelik, University of California, Davis (Chapter 32)

P. J. Narayanan, International Institute of Information Technology Hyderabad (Chapter 29)

Jan Novak, Karlsruhe Institute of Technology (Chapter 26)

Anton Obukhov, NVIDIA (Chapter 33)

Fatih Omruuzun, Middle East Technical University (Chapter 34)

Matthieu Ospici, Laboratoire d’Informatique de Grenoble (Chapter 10)

Jeffrey M. Ota, BMW Group Technology Office (Chapter 32)

John D. Owens, University of California, Davis (Chapter 32)

Vijay S. Pande, Stanford University (Chapter 2)

Debprakash Patnaik, Virginia Tech (Chapter 15)

Denis Pellerin, GIPSA-lab (Chapter 30)

J. Blair Perot, University of Massachusetts, Amherst (Chapter 13)

Hanspeter Pfister, Harvard University (Chapters 46 and 49)

Keshav Pingali, Texas State University, San Marcos (Chapter 6)

Guillem Pratx, Stanford University (Chapter 42)

Sven Prevrhal, Philips Healthcare (Chapter 42)

Anis Rahman, GIPSA-lab (Chapter 30)

Sanjay Rajopadhye, Colorado State University (Chapter 14)

Naren Ramakrishnan, Virginia Tech (Chapter 15)

Paul Richmond, University of Sheffield (Chapter 21)

Guillaume Rizk, Institut de Recherche en Informatique et Systemes Aleatoires, Universite de Rennes (Chapter 14)

Christopher Rodrigues, University of Illinois at Urbana-Champaign (Chapter 4)

Daniela Romano, University of Sheffield (Chapter 21)

Witold R. Rudnicki, University of Warsaw (Chapter 11)

Jan Saam, University of Illinois at Urbana-Champaign (Chapter 1)

Karthikeyan Sankaralingam, University of Wisconsin-Madison (Chapter 7)

Dana Schaa, Northeastern University (Chapter 40)

Christoph Schied, Ulm University (Chapter 18)

Bertil Schmidt, Nanyang Technological University (Chapter 11)

Klaus Schulten, University of Illinois at Urbana-Champaign (Chapters 1 and 4)

James Shackleford, Drexel University (Chapter 47)

Gregory Sharp, Massachusetts General Hospital (Chapter 47)

John Silberholz, University of Maryland (Chapter 8)

Claudio Silva, University of Utah (Chapter 48)

Vitor Silva, University of Coimbra (Chapter 38)

Matthew D. Sinclair, University of Wisconsin-Madison (Chapter 7)

Leonel Sousa, Technical University of Lisbon (Chapter 38)

Joe Stam, NVIDIA (Chapter 36)

Ondrej Stava, Purdue University (Chapter 35)

Timo Stich, NVIDIA (Chapter 29)

John E. Stone, University of Illinois at Urbana-Champaign (Chapters 1 and 4)

Bradley P. Sutton, University of Illinois at Urbana-Champaign (Chapter 44)

Laszlo Szirmay-Kalos, Budapest University of Technology and Economics (Chapter 17)

Gabriel Taubin, Brown University (Chapter 31)

Alptekin Temizel, Middle East Technical University (Chapter 34)

Tugba Taskaya Temizel, Middle East Technical University (Chapter 34)

Manuel Tiglio, University of Maryland (Chapter 8)

Robert Tong, Numerical Algorithms Group (Chapter 16)

Balazs Toth, Budapest University of Technology and Economics (Chapter 17)

Richard Townsend, University of Wisconsin-Madison (Chapter 7)

Ivan Ufimtsev, Stanford University and SLAC National Accelerator Laboratory (Chapter 3)

Kirby L. Vandivort, University of Illinois at Urbana-Champaign (Chapters 1 and 4)

Brice Videau, Laboratoire de Simulation Atomistique, Grenoble (Chapter 10)

Vibhav Vineet, International Institute of Information Technology, Hyderabad (Chapter 29)

Jerod J. Weinman, Grinnell College (Chapter 19)

Ben Weiss, Oregon State University (Chapter 12)

Robin M. Weiss, Macalester College (Chapter 22)

James Westall, Clemson University (Chapter 25)

Paul Woodhams, Numerical Algorithms Group (Chapter 16)

Enhua Wu, Chinese Academy of Sciences (Chapter 28)

Xiao-Long Wu, University of Illinois at Urbana-Champaign (Chapter 44)

Wei Xu, State University of New York, Stony Brook (Chapter 43)

Rio Yokota, Brown University (Chapter 9)

Yong Zhao, Brown University (Chapter 31)

Yue Zhuo, University of Illinois at Urbana-Champaign (Chapter 44)

Introduction

Wen-mei W. Hwu

STATE OF GPU COMPUTING
We are entering the golden age of GPU computing. Since the introduction of CUDA in 2007, more than 100 million computers with CUDA-capable GPUs have been shipped to end users. Unlike the previous GPGPU shader programming models, CUDA supports parallel programming in C. From my own experience in teaching CUDA programming, C programmers can begin to write basic CUDA programs after attending only one lecture and reading one textbook chapter. With such a low barrier to entry, researchers all over the world have been engaged in developing new algorithms and applications to take advantage of the extreme floating-point execution throughput of these GPUs.

Today, there is a large community of GPU computing practitioners. Many of them have reported a 10 to 100 times speedup of their applications with GPU computing. To put this into perspective, with the historical 2X performance growth every two years, these researchers are experiencing the equivalent of time travel of 8 to 12 years. That is, they are getting performance today that they would otherwise have had to wait 8 to 12 years for, had they relied on the “free-ride” advancement of performance in microprocessors. Interestingly, such “free-ride” advancement is no longer available. Furthermore, once they develop their application in CUDA, they will likely see continued performance growth of 2X every two years from this day forward.
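The time-travel comparison can be checked with a short calculation. The following is a minimal sketch, assuming performance doubles exactly once per period; the function name `equivalent_years` and its parameters are illustrative and do not come from the text.

```python
import math


def equivalent_years(speedup, doubling_period_years=2.0):
    """Years of steady doubling needed to reach a given speedup.

    Assumes (hypothetically) that performance doubles exactly once
    every `doubling_period_years`, so years = period * log2(speedup).
    """
    return doubling_period_years * math.log2(speedup)


# Compare the quoted speedup range with the quoted range of years.
for s in (10, 16, 64, 100):
    print(f"{s:>4}x speedup ~ {equivalent_years(s):.1f} years")
```

Under this idealized model, a wait of 8 to 12 years corresponds to speedups of 16X to 64X, while 10X and 100X correspond to roughly 6.6 and 13.3 years, so the quoted figure is best read as a rounded estimate for the middle of the reported range.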

After discussions with numerous researchers, I have reached the conclusion that many of them are solving similar algorithm problems in their programming efforts. Although they are working on diverse applications, they often end up developing similar algorithmic strategies. The idea of GPU Computing Gems is to provide a convenient means for application developers in diverse application areas to benefit from each other’s experience. In this volume, we have collected 50 gem articles written by researchers in 10 diverse areas. Each gem article reports a successful application experience in GPU computing. These articles describe the techniques or “secret sauce” that contributed to the success. The authors highlight the potential applicability of their techniques to other application areas. In our editorial process, we have emphasized the accessibility of these gems to researchers in other areas.

When we issued the call for proposals for the first GPU Computing Gems, we received more than 280 submissions, an overwhelming response. After careful review, we accepted 110 proposals that have a high likelihood of making valuable contributions to other application developers. Many high-quality proposals were not accepted because of concerns that they may not be accessible to a large audience. With so many accepted proposals, we were forced to divide these gems into two volumes. This volume covers 50 gems in the application areas of scientific simulation, life sciences, statistical modeling, emerging data-intensive applications, electronic design automation, ray tracing and rendering, computer vision, video and image processing, signal and audio processing, and medical imaging. Each gem is first edited by an area editor who is a GPU computing expert in that area. This is followed by my own editing of these articles.

I would like to thank the people who have worked tirelessly on this project. Nadeem Mohammad at NVIDIA and Andrew Schuh at UIUC have done so much heavy lifting for this project. Without them, it would have been impossible for me to coordinate so many authors and area editors. My area editors, whose names are in front of each section of this volume, have volunteered their valuable time and energy to improve the quality of the gems. They worked closely with the authors to make sure that the gems indeed meet high technical standards while remaining accessible to a wide audience. I would like to thank all the authors who have shared their innovative work with the GPU computing community. All authors have worked hard to respond to our requests for improvements. Finally, I would like to acknowledge Manju Hegde, who championed the creation of GPU Computing Gems and pursued me to serve as the editor in chief. It has been a true privilege to work with all of these great people.

Online Resources
Visit http://mkp.com/gpu-computing-gems and click the ONLINE RESOURCES tab to connect to gpucomputing.net, the vibrant official community site for GPU computing, where you can download source code examples for most chapters and join discussions with other readers and GPU developers. You’ll also find links to additional material including chapter walk-through videos and full-color versions of many figures from the book.

SECTION 1
Scientific Simulation
Area Editor’s Introduction
Robert M. Farber

1 GPU-Accelerated Computation and Interactive Display of Molecular Orbitals

2 Large-Scale Chemical Informatics on GPUs

3 Dynamical Quadrature Grids: Applications in Density Functional Calculations

4 Fast Molecular Electrostatics Algorithms on GPUs

5 Quantum Chemistry: Propagation of Electronic Structure on a GPU

6 An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm

7 Leveraging the Untapped Computation Power of GPUs: Fast Spectral Synthesis Using Texture Interpolation

8 Black Hole Simulations with CUDA

9 Treecode and Fast Multipole Method for N-Body Simulation with CUDA

10 Wavelet-Based Density Functional Theory Calculation on Massively Parallel Hybrid Architectures


THE STATE OF GPU COMPUTING IN SCIENTIFIC SIMULATION
GPU computing is revolutionizing scientific simulation by providing one to two orders of magnitude of increased computing performance per GPU at price points even students can afford. Exciting things are happening with this technology in the hands of the masses, as reflected by the applications, CUDA Gems, and the extraordinary number of papers that have appeared in the literature since CUDA was first introduced in February 2007.

Technology that provides two or more orders of magnitude of increased computational capability is disruptive and has the potential to fundamentally affect scientific research by removing time-to-discovery barriers. I cannot help getting excited by the potential as simulations that previously would have taken a year or more to complete can now be finished in days. Better scientific insight also becomes possible because researchers can work with more data and have the ability to utilize more accurate, albeit computationally expensive, approximations and numerical methods. We are now entering the era where hybrid clusters and supercomputers containing large numbers of GPUs are being built and used around the world. As a result, many researchers (and funding agencies) now have to rethink their computational models and invest in software to create scalable, high-performance applications based on this technology. The potential is there, and some lucky researchers may find themselves with a Galilean first opportunity to see, study, and model using exquisitely detailed data from projects utilizing GPU technology and these hybrid systems.

IN THIS SECTION
The chapters in this section provide gems of insight both in thought and CUDA implementation to map challenging scientific simulation problems to GPU technology. Techniques to work with irregular grids, dynamic surfaces, treecodes, and far-field calculations are presented. All of these CUDA gems can be adapted and should provide food for thought in solving challenging computational problems in many areas. Innovative solutions are discussed, including just-in-time (JIT) compilation; appropriate and effective use of fast on-chip GPU memory resources across GPU technology generations; the application of texture unit arithmetic to augment GPU computational and global memory performance; and the creation of solutions that can scale across multiple GPUs in a distributed environment. General kernel optimization principles are also provided in many chapters. Some of the kernels presented require fewer than 200 lines of CUDA code, yet still provide impressive performance.

In Chapter 1: Evaluating molecular orbitals on 3-D lattices is a common problem in molecular visualization. This chapter discusses the design trade-offs in the popular VMD (visual molecular dynamics) software system plus the appropriate and effective use of fast on-chip GPU memory resources across various generations of GPUs. Several kernel optimization principles are provided. To account for varying problem size and GPU performance regimes, an innovative just-in-time (JIT) kernel compilation technique is utilized.

In Chapter 2: The authors discuss the techniques they used to adapt the LINGO string similarity algorithm to run efficiently on GPUs and avoid the memory bandwidth bottlenecks and conditional operations that limit parallelism in the CPU implementation. These techniques, as well as the discussion on minimizing CPU-GPU transfer overhead and exploiting thread-level parallelism, should benefit readers in many areas, not just those interested in large-scale chemical informatics.


In Chapter 3: This chapter discusses a GPU-accelerated dynamic quadrature grid method where the grid points move over the course of the calculation. The merits of several parallelization schemes, mixed precision arithmetic as an optimization technique, and problems arising from branching within a warp are discussed.

In Chapter 4: GPU kernels are presented that calculate electrostatic potential maps on structured grids containing a large amount of fine-grained data parallelism. Approaches to regularize the computation work are discussed along with kernel loop optimizations and implementation notes on how to best use the GPU memory subsystem. All of this is phrased in the context of the popular VMD (visual molecular dynamics) and APBS (Adaptive Poisson-Boltzmann Solver) software packages.

In Chapter 5: Direct molecular dynamics (MD) requires repeated calculation of the potential energy surface obtained from electronic structure calculations. This chapter shows how this calculation can be rethought to propagate the electronic structure without diagonalization, a time-consuming step that is difficult to implement on GPUs. Other topics discussed include efficiently using CUBLAS and the integration of CUDA within a FORTRAN framework.

In Chapter 6: Irregular tree-based data structures are a challenge because the GPU memory subsystem favors coalesced memory accesses. This chapter describes a number of techniques, both novel and conventional, to reduce main memory accesses on an irregular tree-based data structure. All the methods run on the GPU.

In Chapter 7: The GRASSY spectral synthesis platform is described, which utilizes GPUs to address the computational needs of asteroseismology. In particular, this chapter demonstrates an innovative use of interpolation by CUDA texture memory to augment arithmetic performance and reduce memory access overhead. The low precision of texture memory arithmetic is discussed and shown to not affect solution accuracy. Mesh building and rasterization are also covered.
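To make the precision trade-off in the Chapter 7 summary concrete, the following Python sketch mimics linear interpolation into a lookup table with a coarsely quantized blend weight. It is not the GRASSY code. The function names, the sample table, and the choice of 8 fractional bits (the weight precision documented for CUDA linear texture filtering) are assumptions made for this illustration, and the half-texel coordinate offset applied by the hardware is ignored.

```python
import math

FRACTION_BITS = 8  # assumed blend-weight precision: steps of 1/256


def lerp_exact(table, x):
    """Linear interpolation into `table` at fractional index x (0 <= x <= len-1)."""
    i = min(int(math.floor(x)), len(table) - 2)  # clamp so that i + 1 is valid
    a = x - i                                    # full-precision blend weight
    return table[i] + a * (table[i + 1] - table[i])


def lerp_quantized(table, x, bits=FRACTION_BITS):
    """Same lookup, but the blend weight is rounded to `bits` fractional bits."""
    i = min(int(math.floor(x)), len(table) - 2)
    scale = 1 << bits
    a = round((x - i) * scale) / scale           # low-precision blend weight
    return table[i] + a * (table[i + 1] - table[i])


table = [0.0, 1.0, 4.0, 9.0]                     # hypothetical sample data
print(lerp_exact(table, 1.3), lerp_quantized(table, 1.3))
```

In this model the quantized result differs from the exact one by at most the difference between neighboring table entries divided by 2^(bits+1), so a smoothly varying table loses little accuracy.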

In Chapter 8: Exploring the parameter space of a complex dynamical system is an important facet of scientific simulation. Many problems require integration of a coupled set of ordinary differential equations (ODEs). Rather than parallelizing a single integration, the authors use CUDA to turn the GPU into a survey engine that performs many integrations at once. With this technology, scientists can examine more of the phase space of the problem to gain a better understanding of the dynamics of the simulation. In the case of black hole inspirals, GPU technology might have a significant impact in the quest for direct measurement of gravitational waves. Robustness across GPUs in a distributed MPI environment is also discussed.
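The survey-engine idea from the Chapter 8 summary can be sketched in a few lines of Python. This is a CPU illustration under stated assumptions, not the authors' code: the harmonic oscillator, the fixed-step fourth-order Runge-Kutta integrator, and all function names are chosen only for this example. Each call to `integrate_one` is independent of the others, which is why one integration per GPU thread maps well onto the hardware.

```python
import math


def oscillator(t, y, omega):
    """Right-hand side of x'' = -omega^2 x written as a first-order system."""
    x, v = y
    return [v, -omega * omega * x]


def rk4_step(f, t, y, h, params):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = f(t, y, params)
    k2 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)], params)
    k3 = f(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)], params)
    k4 = f(t + h, [yi + h * k for yi, k in zip(y, k3)], params)
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


def integrate_one(omega, t_end=1.0, steps=1000):
    """Integrate one parameter choice from x(0)=1, v(0)=0 and return x(t_end)."""
    h = t_end / steps
    t, y = 0.0, [1.0, 0.0]
    for _ in range(steps):
        y = rk4_step(oscillator, t, y, h, omega)
        t += h
    return y[0]


def survey(omegas):
    """Run one independent integration per parameter value.

    On a GPU each iteration of this loop would be a separate thread.
    """
    return [integrate_one(w) for w in omegas]


results = survey([0.5, 1.0, 2.0])
```

Because the integrations share no state, the survey scales with the number of parameter sets rather than with the cost of any single integration.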

In Chapter 9: As this chapter shows, constructing fast N-body algorithms is far from a formidable task. Basic kernels are discussed that achieve substantial speedups (15x to 150x) in fewer than 200 lines of CUDA code. These same kernels extend previous GPU Gems N-body CUDA mappings to encompass parallel far-field approximations that are useful for astrophysics, acoustics, molecular dynamics, particle simulation, electromagnetics, and boundary integral formulations. Other topics include structuring the data to preserve coalesced memory accesses and balancing parallelism and data reuse through the use of tiles.
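The role of tiles in balancing parallelism and data reuse can be illustrated with a direct-sum N-body calculation. The Python sketch below is an assumption-laden CPU model, not the chapter's kernels: the function names, the softening parameter `eps2`, and the tile size are hypothetical. The outer loop loads one tile of source bodies (on a GPU, a cooperative coalesced load into shared memory), and every target body then reuses that tile before the next one is fetched.

```python
def pair_accel(pi, pj, mj, eps2):
    """Softened gravitational acceleration on body i due to body j (G = 1)."""
    dx, dy, dz = pj[0] - pi[0], pj[1] - pi[1], pj[2] - pi[2]
    s = mj * (dx * dx + dy * dy + dz * dz + eps2) ** -1.5
    return dx * s, dy * s, dz * s


def accel_direct(pos, mass, eps2=1e-6):
    """Reference all-pairs sum with no tiling."""
    acc = []
    for pi in pos:
        ax = ay = az = 0.0
        for pj, mj in zip(pos, mass):
            fx, fy, fz = pair_accel(pi, pj, mj, eps2)
            ax, ay, az = ax + fx, ay + fy, az + fz
        acc.append((ax, ay, az))
    return acc


def accel_tiled(pos, mass, tile=4, eps2=1e-6):
    """All-pairs sum processed one tile of source bodies at a time."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for start in range(0, n, tile):
        # Stand-in for staging a tile in fast on-chip memory.
        tile_pos = pos[start:start + tile]
        tile_mass = mass[start:start + tile]
        for i in range(n):  # one GPU thread per target body i
            for pj, mj in zip(tile_pos, tile_mass):
                fx, fy, fz = pair_accel(pos[i], pj, mj, eps2)
                acc[i][0] += fx
                acc[i][1] += fy
                acc[i][2] += fz
    return [tuple(a) for a in acc]


# Hypothetical test data: ten bodies at distinct positions.
pos = [(float(i), float(i * i % 5), float(i % 3)) for i in range(10)]
mass = [1.0 + 0.1 * i for i in range(10)]
```

Each tile is read from slow memory once and used by all target bodies, so the number of slow-memory reads per interaction falls as the tile grows, at the cost of more on-chip storage per block.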

In Chapter 10: The authors discuss the GPU-specific thought and implementation details for BigDFT, a massively parallel implementation of a full DFT (density functional theory) code for quantum chemistry that runs on hybrid clusters and supercomputers containing many GPUs. From the unconventional use of Daubechies wavelets, which are well suited for GPU-accelerated environments, the authors progress to a discussion of scalability and integration in a distributed runtime environment.
