Adapting Particle Filter Algorithms to the GPU Architecture


Mehdi Chitchian


A thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science

in

Computer Science

Parallel and Distributed Systems Group
Department of Software Technology
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft University of Technology


Thesis Committee:

Chair:      prof. dr. ir. Henk J. Sips, Faculty EEMCS, TU Delft
Supervisor: drs. Alexander S. van Amesfoort, Faculty EEMCS, TU Delft
Member:     dr. ir. Tamas Keviczky, Faculty 3mE, TU Delft
Member:     dr. Alexandru Iosup, Faculty EEMCS, TU Delft

© 2011 Mehdi Chitchian. All rights reserved.


Abstract

The particle filter is a Bayesian estimation technique based on Monte Carlo simulations. The non-parametric nature of particle filters makes them ideal for non-linear non-Gaussian systems. This greater filtering accuracy, however, comes at the price of increased computational complexity, which limits their practical use for real-time applications.

This thesis presents an attempt to enable real-time particle filtering for complex estimation problems using modern GPU hardware. We propose a GPU-based generic particle filtering framework which can be applied to various estimation problems. We implement a real-time estimation application using this particle filtering framework and measure the estimation error with different filter parameters. Furthermore, we present an in-depth performance analysis of our GPU implementation, followed by a number of optimisations in order to increase implementation efficiency.


Contents

Abstract

1 Introduction
  1.1 Objectives
  1.2 Contributions
  1.3 Outline

Part I: Background

2 Recursive Bayesian Estimation
  2.1 Bayesian Filtering
  2.2 Kalman Filter
  2.3 Particle Filter

3 Distributed Particle Filtering
  3.1 Classification
  3.2 Related Work
  3.3 Filter Design
  3.4 Filter Parameters
  3.5 Discussion

4 General-Purpose Computation on GPUs
  4.1 GPU Architecture
  4.2 Programming Model
  4.3 Hardware Overview

Part II: Implementation and Experiments

5 Two Applications: Robot Localisation and Robotic Arm
  5.1 Unicycle Robot Localisation
  5.2 Robotic Arm

6 Particle Filtering on the GPU
  6.1 Filter Design
  6.2 Data Layout
  6.3 Individual Kernels
  6.4 Generic Particle Filtering

7 Performance Analysis and Optimisations
  7.1 Performance Model
  7.2 Baseline Performance
  7.3 Optimisations
  7.4 Scalability
  7.5 Discussion

8 Filter Accuracy
  8.1 Filtering Scenario
  8.2 Centralised Particle Filter Results
  8.3 Distributed Particle Filter Results
  8.4 Discussion

9 Conclusions and Future Work

Bibliography

A Unicycle Robot Localisation Implementation

B Robotic Arm Filtering Results
  B.1 Centralised Particle Filter
  B.2 Distributed Particle Filter


List of Figures

3.1 Possible configurations for a particle filter network
4.1 NVIDIA Streaming Multiprocessor (SM), source: NVIDIA
4.2 NVIDIA CUDA Memory Hierarchy, source: NVIDIA
5.1 Robotic Arm with a Mounted Camera
7.1 Roofline Model
7.2 Performance comparison of two PRNG implementations
7.3 Roofline Model and Baseline Results - GTX480
7.4 Roofline Model and Baseline Results - GTX280
7.5 Roofline Model for Each Kernel - GTX480
7.6 Roofline Model for Each Kernel - GTX280
7.7 Filter performance with varying number of particles per block
7.8 Filter performance with varying number of particle filter blocks
7.9 Filter performance with varying state dimensions
8.1 Estimation Error for a Centralised Particle Filter
8.2 Filtering scenario with a low number of particles
8.3 Filtering scenario with a high number of particles
8.4 Maximum achievable particle filter frequency
8.5 Estimation error with varying particle filter network size and topology
8.6 Estimation error with varying number of exchanged particles
8.7 Estimation error with varying resampling frequency


List of Tables

4.1 NVIDIA GPU Overview
6.1 Particle Filter Parameters
7.1 Attained vs Specified Memory Bandwidth
7.2 Parameters for performance measurements
7.3 Breakdown
7.4 Instruction throughput microbenchmark results for GTX280 and GTX480
7.5 Instruction count for the sampling_importance() kernel
7.6 Instruction count for the block_sort() kernel
7.7 Instruction count for the resampling() kernel
7.8 Memory and instruction throughput and utilisation - GTX480
7.9 Memory and instruction throughput and utilisation - GTX280
7.10 Memory optimisations: Speedup, throughput and utilisation - GTX480
7.11 Memory optimisations: Speedup, throughput and utilisation - GTX280
7.12 Uncoalesced memory optimisations: Speedup, throughput and utilisation - GTX480
7.13 Uncoalesced memory optimisations: Speedup, throughput and utilisation - GTX280
8.1 Robotic Arm Parameters
8.2 Simulation Noise Parameters
8.3 Particle Filter Parameters
B.1 Estimation error for centralised particle filter
B.2 Estimation error for star network
B.3 Estimation error for ring network
B.4 Estimation error for 2D torus network


1 Introduction

Many dynamical systems operate under various degrees of uncertainty. This uncertainty could stem from an imperfect understanding of the underlying dynamics, noisy observations and/or actuation, or numerous other factors. Bayesian estimation, a probabilistic approach, captures this uncertainty in the form of a probability distribution function for the state of the dynamical system.

Tractable closed-form solutions for the Bayes filter exist only for linear Gaussian systems, in the form of the Kalman filter. Non-linear Gaussian systems can still use variants of this filter (e.g. the extended Kalman filter or the unscented Kalman filter), depending on the amount of non-linearity. The particle filter, based on Monte Carlo simulations, is a non-parametric filter which can be used for state estimation in non-linear non-Gaussian systems. Although much more powerful than extended/unscented Kalman filters, particle filters need many particles, and thus computational resources, in order to produce an accurate result for complex estimation problems. These high computational demands limit their practical use for real-time applications.

In recent years, Graphics Processing Units (GPUs) have turned into massively parallel processors with high computational capacity. With the introduction of programmable shaders at the beginning of the last decade, followed by the move towards a unified shading architecture and the arrival of general-purpose programming languages targeting GPUs in the latter half of the decade, GPUs have been increasingly used in many different fields beyond computer graphics.

In contrast to many other multi-core platforms, GPUs are specifically designed for certain workloads typically encountered in computer graphics rendering. These workloads are highly parallel, with overall throughput being favoured over the latency of individual tasks. Memory access latencies are hidden through massive threading and rapid context switching. Therefore, the use of GPUs for a particular application depends on the presence of a significant amount of (fine-grained) parallelism, which might require algorithm changes or even a complete redesign.

This thesis presents a comprehensive study of the issues regarding the design and implementation of particle filters for modern GPUs, the required algorithmic changes and their implications for filter accuracy. A GPU-based generic particle filtering framework is proposed which can be applied to various estimation problems. We will use two real-life applications in order to examine both the generic as well as the model-specific parts of the implementation. Furthermore, we determine the effects of the various filter parameters on the estimation quality and analyse the performance of the implementation on two GPU platforms.

1.1 Objectives

The central research questions addressed in this thesis can be formulated as:

Can particle filters be efficiently implemented on modern GPUs for complex real-time estimation problems? What, if any, modifications are needed to the particle filter algorithm for parallel execution? How do these modifications influence the filter behaviour?


In this context, efficiency refers to the utilisation of the available hardware resources (e.g. memory throughput). In order to answer these questions, we (i) study the particle filter algorithm to identify parallelism opportunities and limitations, (ii) explore the modern GPU architecture and its programming model, and (iii) use a real-life estimation problem to measure the filter accuracy and implementation efficiency.

1.2 Contributions

We identify the following major contributions of the work presented in this thesis:

• Based on [Simonetto and Keviczky, 2009], we present a fully distributed particle filtering scheme. Without making any particular assumptions about the underlying hardware architecture, this scheme divides the "classical" particle filter into a network of smaller particle filters running concurrently. These particle filters exchange only a few representative particles from their respective particle populations.

• Following this distributed scheme, we introduce a GPU-based generic particle filtering framework which can be used to implement particle filters given arbitrary, user-specified models.

• We conduct a performance analysis of our particle filter implementation on two GPU platforms.

– In order to better understand and model certain GPU performance characteristics, we extend an existing microbenchmark suite. This allows us to reason better about the performance of our implementation.

– We calculate the effective utilisation of the hardware resources in terms of instruction throughput as well as memory bandwidth.

– We identify the performance bottlenecks of the various steps of the algorithm.

– We analyse the scalability of the presented implementation in two directions: (i) increasing filter size, and (ii) increasing state dimension.

• With a particle filter implementation for a real-life estimation problem, we present an in-depth analysis of the filter estimation quality under varying filter parameters.

1.3 Outline

The remainder of this thesis is organised as follows. We begin in Chapter 2 with an introduction to the Bayesian estimation framework, with an emphasis on particle filters. Chapter 3 presents our approach to distributing the particle filter over many computation units, based on earlier work. In Chapter 4, we take a look at the modern GPU architecture and review its programming model. We discuss two real-life estimation problems in Chapter 5 which can benefit from the accuracy of particle filters. The actual GPU implementation of the generic particle filtering framework as well as the model-specific code are discussed in Chapter 6. Chapter 7 presents an in-depth performance analysis of the filter implementation on two GPU platforms, followed by a number of optimisation steps. We evaluate the filter accuracy under varying parameters in Chapter 8 using one of the applications discussed earlier. Finally, Chapter 9 concludes this thesis with a summary of the presented work and a look into possible future research directions.


Part I

Background


2 Recursive Bayesian Estimation

The problem of estimating the state of a dynamical system through noisy measurements has many applications in science. Many of the quantities which constitute the state of a system cannot be observed directly, but need to be inferred from noisy sensor data. Bayesian estimation is based on the assumption that this uncertainty should be represented by probabilities [West and Harrison, 1997]. A probabilistic state estimation computes a Probability Density Function (PDF) for the state over the range of possible values.

First, in Section 2.1, we will describe the generic Bayesian filtering framework. Next, we will discuss the Kalman filter in Section 2.2 and the particle filter in Section 2.3. The material for this chapter is based on [Thrun et al., 2005] and [Arulampalam et al., 2002].

2.1 Bayesian Filtering

Suppose 𝑥 is a quantity which we wish to infer from the measurement 𝑧. The probability distribution 𝑝(𝑥) represents all the knowledge we have regarding this quantity prior to the actual measurement. This distribution is therefore called the prior probability distribution. The conditional probability 𝑝(𝑥 | 𝑧), called the posterior probability distribution, represents our knowledge of 𝑥 after having incorporated the measurement data.

This posterior distribution, however, is usually unknown in advance as a result of the complex dynamics involved in most systems. Bayes' rule allows us to calculate a conditional probability based on its inverse:

𝑝(𝑎 | 𝑏) = 𝑝(𝑏 | 𝑎) 𝑝(𝑎) / 𝑝(𝑏)

The inverse probability 𝑝(𝑧 | 𝑥) directly relates to the measurement characteristics.

In order to discuss how the Bayes filter calculates the state estimate, we first need to model the dynamics of the system. Let 𝑥𝑡 denote the state at time 𝑡, and 𝑧𝑡 denote the set of all measurements acquired at time 𝑡. The evolution of the state is governed by the following conditional probability distribution:

𝑝(𝑥𝑡 | 𝑥𝑡−1, … , 𝑥0, 𝑧𝑡−1, … , 𝑧1)

Assuming the system exhibits Markov properties¹, the state 𝑥𝑡 depends only on the previous state 𝑥𝑡−1. Therefore, we can rewrite the above distribution simply as:

𝑝(𝑥𝑡 | 𝑥𝑡−1)

The above probability is also referred to as the state transition probability. The measurements of the state follow the probability distribution 𝑝(𝑧𝑡 | 𝑥𝑡), which is called the measurement probability.

¹ We maintain this assumption throughout this chapter.

The Bayes filter calculates the estimate of the state recursively in two steps:



Predict: In this step, the state estimate from the previous timestep is used to predict the current state. This estimate is known as the a priori estimate, as it does not incorporate any measurements from the current timestep.

𝑝(𝑥𝑡) = ∫ 𝑝(𝑥𝑡 | 𝑥𝑡−1) 𝑝(𝑥𝑡−1 | 𝑧𝑡−1) 𝑑𝑥𝑡−1

Update: The a priori estimate from the prediction step is updated according to the actual measurements of the system. This estimate is therefore referred to as the a posteriori estimate.

𝑝(𝑥𝑡 | 𝑧𝑡) = 𝜂 𝑝(𝑧𝑡 | 𝑥𝑡) 𝑝(𝑥𝑡)

where 𝜂 is a normalisation constant.

In order to implement this filter, three probability distributions need to be known in advance:

i. Initial state 𝑝(𝑥0),

ii. Measurement probability 𝑝(𝑧𝑡 | 𝑥𝑡),

iii. State transition probability 𝑝(𝑥𝑡 | 𝑥𝑡−1).

The Bayes filter, in its basic form, is inapplicable to complex estimation problems. The main problem is that the prediction step requires us to integrate over the state transition probability distribution in closed form, which is only possible for a restricted number of problems. In the following sections, we will examine two derived filters which are more powerful and can be applied to a wider range of estimation problems.

2.2 Kalman Filter

The Kalman filter is a popular Bayesian filtering technique, first presented in the late '50s and early '60s, independently, by Swerling [Swerling, 1958] and Kalman [Kalman, 1960]. This filter assumes the following properties hold for the system:

- The state variable 𝑥𝑡 takes values in a continuous space;

- The state transition probability 𝑝(𝑥𝑡 | 𝑥𝑡−1), as well as the measurement probability 𝑝(𝑧𝑡 | 𝑥𝑡), are linear functions with Gaussian noise;

- The initial state is normally distributed.

The aforementioned properties ensure that the posterior probability remains Gaussian. From the above, we can write the state transition as:

𝑥𝑡 = 𝐴𝑡𝑥𝑡−1 + 𝜀𝑡

where 𝐴𝑡 is a square matrix of the same dimension as the state vector 𝑥𝑡, and 𝜀𝑡 denotes a Gaussian vector with mean 0 and covariance 𝑅𝑡 representing the uncertainty in the state transition. The state transition probability can be written as:

𝑝(𝑥𝑡 | 𝑥𝑡−1) = det(2𝜋𝑅𝑡)^(−1/2) exp(−(1/2) (𝑥𝑡 − 𝐴𝑡𝑥𝑡−1)^T 𝑅𝑡^(−1) (𝑥𝑡 − 𝐴𝑡𝑥𝑡−1))

The measurement model can be written as:

𝑧𝑡 = 𝐵𝑡𝑥𝑡 + 𝛿𝑡


where 𝐵𝑡 is a matrix relating the state 𝑥𝑡 to the measurement vector 𝑧𝑡, and 𝛿𝑡 denotes a Gaussian vector with mean 0 and covariance 𝑄𝑡 representing the uncertainty in the measurements. The measurement probability can, therefore, be written as:

𝑝(𝑧𝑡 | 𝑥𝑡) = det(2𝜋𝑄𝑡)^(−1/2) exp(−(1/2) (𝑧𝑡 − 𝐵𝑡𝑥𝑡)^T 𝑄𝑡^(−1) (𝑧𝑡 − 𝐵𝑡𝑥𝑡))

Finally, the initial state, with mean 𝜇0 and covariance Σ0, can be written as:

𝑝(𝑥0) = det(2𝜋Σ0)^(−1/2) exp(−(1/2) (𝑥0 − 𝜇0)^T Σ0^(−1) (𝑥0 − 𝜇0))

One of the main reasons the Kalman filter has gained so much popularity lies in its computational efficiency. Each iteration of the Kalman filter is lower-bounded by 𝑂(𝑘^2.4 + 𝑛^2), where 𝑘 is the dimension of the measurement vector 𝑧𝑡 and 𝑛 is the dimension of the state space. This computational efficiency results from the assumption that the state and measurement functions are linear and operate on Gaussian variables, which allows them to be computed in closed form.
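As an illustration of this closed-form computation, the following one-dimensional sketch performs a single Kalman filter iteration in host C/CUDA code. It is not taken from the thesis; the names a, r, b and q stand in for the scalar versions of 𝐴𝑡, 𝑅𝑡, 𝐵𝑡 and 𝑄𝑡.

    // Minimal 1D Kalman filter iteration (illustrative sketch, not thesis code).
    // Scalar model: x_t = a * x_{t-1} + noise(var r),  z_t = b * x_t + noise(var q).
    struct Kalman1D { float mean; float var; };  // Gaussian posterior N(mean, var)

    void kalman_step(Kalman1D* k, float a, float r, float b, float q, float z)
    {
        // Predict: propagate the Gaussian through the linear state transition.
        float mean_pred = a * k->mean;
        float var_pred  = a * a * k->var + r;

        // Update: fold in the measurement z via the Kalman gain.
        float gain = var_pred * b / (b * b * var_pred + q);
        k->mean = mean_pred + gain * (z - b * mean_pred);
        k->var  = (1.0f - gain * b) * var_pred;
    }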

2.2.1 Extended and Unscented Kalman Filters

As discussed in the previous section, the Kalman filter imposes two main restrictions on the underlying system: linear state transition and measurement functions, and a Gaussian posterior probability distribution. This limits the applicability of the Kalman filter, as most state transition and measurement functions are non-linear. If we relax the first restriction, the state transition and measurement functions become:

𝑥𝑡 = 𝑔(𝑥𝑡−1) + 𝜀𝑡

𝑧𝑡 = ℎ(𝑥𝑡) + 𝛿𝑡

Assuming non-linear functions 𝑔 and ℎ, the above filter does not have a closed-form solution. One solution is to approximate these functions through linearisation.

The extended Kalman filter (EKF) uses Taylor expansions to linearise the non-linear state transition and measurement functions. The approximation is a linear function tangent to the target function at the mean of the Gaussian. The accuracy of this approximation depends on the degree of non-linearity.

For highly non-linear systems, the extended Kalman filter does not produce an accurate estimate, as it does not preserve the mean and covariance of the Gaussian distribution. The unscented Kalman filter (UKF) uses a deterministic set of sampling points which are propagated through the non-linear functions in order to capture the true mean and covariance of the Gaussian distribution.

Each update step, for both the EKF and the UKF, requires 𝑂(𝑘^2.4 + 𝑛^2) time.

2.3 Particle Filter

Particle filtering [Gordon et al., 1993] is a recursive Bayesian filtering technique using Monte Carlo simulations. Particle filters represent the posterior by a finite set of random samples, drawn from the posterior, with associated weights. Because of their non-parametric nature, particle filters are not bound to a particular distribution form (e.g. Gaussian) and are compatible with arbitrary (i.e. non-linear) state transition functions.

As mentioned above, particle filters represent the posterior by a set of particles. Each particle 𝑥𝑡[𝑚] can be considered an instantiation of the state at time 𝑡. In the prediction step of the particle filter, each particle 𝑥𝑡[𝑚] is generated from the previous state 𝑥𝑡−1[𝑚] by sampling from the state transition probability 𝑝(𝑥𝑡 | 𝑥𝑡−1). In the update step, when measurement 𝑧𝑡 is available, each particle is assigned a weight 𝑤𝑡[𝑚] according to:

𝑤𝑡[𝑚] = 𝑝(𝑧𝑡 | 𝑥𝑡[𝑚])


1:  𝑋𝑡 ← ∅
2:  for 𝑖 in 1 ∶ 𝑀 do
3:    sample 𝑥𝑡[𝑖] ∼ 𝑝(𝑥𝑡 | 𝑥𝑡−1[𝑖])
4:    𝑤𝑡[𝑖] ← 𝑝(𝑧𝑡 | 𝑥𝑡[𝑖]) 𝑤𝑡−1[𝑖]
5:    𝑋𝑡 ← 𝑋𝑡 + {𝑥𝑡[𝑖], 𝑤𝑡[𝑖]}
6:  end for

Algorithm 1: Basic Particle Filter (Sequential Importance Sampling, SIS)

A high-level description of the basic particle filter algorithm is given by Algorithm 1. Given a large enough particle population, the weighted set of particles {𝑥𝑡[𝑖], 𝑤𝑡[𝑖]}, 𝑖 = 1, … , 𝑀, becomes a discrete weighted approximation of the true posterior 𝑝(𝑥𝑡 | 𝑧𝑡).
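As a concrete reference, the following sequential host-code sketch mirrors Algorithm 1 for a generic model. STATE_DIM, sample_transition() and measurement_likelihood() are hypothetical placeholders for the model-specific state size, a sampler for 𝑝(𝑥𝑡 | 𝑥𝑡−1) and the density 𝑝(𝑧𝑡 | 𝑥𝑡); they are assumed to be provided elsewhere.

    #define STATE_DIM 3  // hypothetical state dimension

    struct Particle { float state[STATE_DIM]; float weight; };

    // Model-specific placeholders (assumed, not part of the thesis code).
    void  sample_transition(float* state);                            // x_t ~ p(x_t | x_{t-1})
    float measurement_likelihood(const float* state, const float* z); // p(z_t | x_t)

    // One SIS iteration over M particles (lines 2-6 of Algorithm 1).
    void sis_step(Particle* p, int M, const float* z)
    {
        for (int i = 0; i < M; ++i) {
            sample_transition(p[i].state);                        // line 3: predict
            p[i].weight *= measurement_likelihood(p[i].state, z); // line 4: weight,
                                                                  // keeping the history
        }
    }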

2.3.1 The Degeneracy Problem and Resampling

A common problem with the basic particle filter algorithm mentioned in the previous section is the degeneracy problem. It has been proven that the variance of the weights can only increase over time [Doucet et al., 2000]. This leads to a situation where a single particle holds the majority of the weight, with the rest having negligible weight, wasting computational effort on particles which eventually contribute very little to the filter estimate.

Resampling is a statistical technique which can be used to combat the degeneracy problem. Resampling involves eliminating particles with small weights in favour of those with larger weights. This is achieved by creating a new set of particles by sampling, with replacement, from the original particle set according to the particle weights. Particles with a higher weight will, therefore, have a higher chance of surviving the selection process. One of the implications of the resampling step is the loss of diversity amongst particles, as the new particle set will most likely contain many duplicates.

2.3.2 Algorithm Overview

Algorithm 2 gives an overview of the particle filter algorithm with resampling. The first for-loop (lines 2 through 6) generates, for each particle 𝑖, the state 𝑥𝑡[𝑖] based on 𝑥𝑡−1[𝑖] (line 3) and assigns a weight according to the measurement 𝑧𝑡 (line 4). The second for-loop (lines 8 through 16) transforms the particle set 𝑋̄𝑡 into a new set 𝑋𝑡 by resampling according to the weights. On line 9, a uniformly distributed random number 𝑟 is drawn from the interval [0, ∑ 𝑤𝑡]. By calculating the prefix sum of the weights in the inner while-loop (lines 11 through 14), the randomly drawn weight is mapped to an actual particle, which is subsequently added to the new set. This ensures that the likelihood of selecting each particle, in each round, is proportional to its weight.

Note that, in contrast to the previous version presented in Algorithm 1, no history is maintained for the particle weights, as the resampling step resets the weights for the whole particle population.


1:  𝑋̄𝑡 ← ∅
2:  for 𝑖 in 1 ∶ 𝑀 do
3:    sample 𝑥𝑡[𝑖] ∼ 𝑝(𝑥𝑡 | 𝑥𝑡−1[𝑖])
4:    𝑤𝑡[𝑖] ← 𝑝(𝑧𝑡 | 𝑥𝑡[𝑖])
5:    𝑋̄𝑡 ← 𝑋̄𝑡 + 𝑥𝑡[𝑖]
6:  end for
7:  𝑋𝑡 ← ∅
8:  for 𝑖 in 1 ∶ 𝑀 do
9:    draw 𝑟 ∼ 𝑈[0 ∶ ∑ 𝑤𝑡]
10:   𝑗, 𝑠 ← 0
11:   while 𝑠 < 𝑟 do
12:     𝑠 ← 𝑠 + 𝑤𝑡[𝑗]
13:     𝑗 ← 𝑗 + 1
14:   end while
15:   𝑋𝑡 ← 𝑋𝑡 + 𝑥𝑡[𝑗]
16: end for

Algorithm 2: Particle Filter with Resampling (Sequential Importance Resampling, SIR)
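The resampling loop of Algorithm 2 (lines 8 through 16) can be sketched sequentially as follows. It reuses the Particle type from the SIS sketch above; uniform() is a hypothetical PRNG returning a sample in [0, limit).

    float uniform(float limit);  // hypothetical PRNG: r ~ U[0, limit)

    // Multinomial resampling (lines 8-16 of Algorithm 2): each output slot
    // selects a particle with probability proportional to its weight by
    // inverting the cumulative weight sum.
    void resample(const Particle* in, Particle* out, int M, float total_weight)
    {
        for (int i = 0; i < M; ++i) {
            float r = uniform(total_weight);   // line 9: r ~ U[0, sum(w)]
            int   j = 0;
            float s = in[0].weight;            // lines 10-14: prefix sum of weights
            while (s < r && j < M - 1) {
                ++j;
                s += in[j].weight;
            }
            out[i] = in[j];                    // line 15: select particle j
            out[i].weight = 1.0f;              // resampling resets the weights
        }
    }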


3 Distributed Particle Filtering

We examined a number of Bayesian filtering techniques in Chapter 2, including particle filters. Because of their non-parametric form, particle filters can be applied to non-linear and non-Gaussian systems where (extended/unscented) Kalman filters fall short. However, the required computational effort, especially when a large number of particles is needed for an accurate estimation, limits their practical use in time-constrained applications.

A natural approach would be to distribute the particle filter calculations amongst multiple computation units. This would allow the use of particle filters for complex estimation problems in time-critical applications. In this chapter we will discuss the issues regarding the implementation of a distributed particle filter.

We will start with a classification of distributed particle filtering schemes in Section 3.1. In Section 3.2, we will explore the related work, while a discussion on the design of a distributed particle filter follows in Section 3.3. Next, we will review the parameters which affect the filter behaviour in Section 3.4 and, finally, we conclude this chapter in Section 3.5.

3.1 Classification

There are different approaches to implementing distributed particle filters. We will use the classification from [Simonetto and Keviczky, 2009] in order to categorise these approaches:

Distributed Sensing Particle Filters: This class of particle filters assumes that there are multiple sensors performing measurements. Each sensor has a particle filter operating only on local measurements. These particle filters exchange limited information in order to achieve a global estimate.

Distributed Computation Particle Filters: In contrast to the previous class, the assumption here is that all measurement data is available to all particle filters. Each particle filter is assigned a subset of the total particle population.

We will only focus on the second class of distributed particle filters in this thesis.

3.2 Related Work

Because of the numerous advantages of particle filtering over its parametric counterparts, much research effort has been dedicated to designing a distributed particle filter in order to overcome its inherent computational complexity. In this section we will examine a number of related studies and discuss their main contributions.

The first work we consider is from [Brun et al., 2002], where a data-parallel approach is proposed. The particle population is partitioned into several subsets. Each subset is assigned to a single processor. The sampling as well as the weight calculations are performed independently for each subset. In order to calculate a global estimate, local estimates are calculated for each subset. These estimates are subsequently aggregated into a global estimate. The authors claim that performing local resampling on each subset does not compromise the filter accuracy.

In [Bashi et al., 2003], the authors propose three methods for implementing distributed particle filters: (i) Global Distributed Particle Filter (GDPF), (ii) Local Distributed Particle Filter (LDPF) and (iii) Compressed Distributed Particle Filter (CDPF). With GDPF, only the sampling and weight calculation steps are parallelised, while resampling is performed centrally. LDPF is comparable to the proposed solution of [Brun et al., 2002], in which resampling is performed locally without any communication. CDPF, similar to GDPF, has a centralised resampling implementation. However, only a small representative set of particles is used for global resampling. The results are sent back to each individual node. They conclude from a number of simulations that LDPF provides a better estimation while being faster.

Two distributed resampling algorithms are proposed in [Bolić et al., 2005]: (i) Resampling with Proportional Allocation (RPA) and (ii) Resampling with Non-proportional Allocation (RNA). RPA involves a two-stage resampling step (global and local), while RNA involves local resampling followed by a particle exchange step. These algorithms still involve a certain degree of centralised planning and information exchange. RPA provides a better estimation, while RNA has a simpler design. In a later work [Bolić et al., 2010], they compare an FPGA implementation of a standard particle filter with that of a Gaussian particle filter¹. The presented results indicate that the Gaussian particle filter, while being faster than a standard particle filter, is equally accurate for (near-)Gaussian problems.

¹ Gaussian particle filters [Kotecha and Djuric, 2003] are a special kind of particle filter which, similar to the extended and unscented Kalman filters, approximate the posterior with a normal distribution. They do not require a resampling step.

A number of the previously mentioned algorithms (GDPF, RNA, RPA, Gaussian particle filter) are compared in [Rosén et al., 2010] using a parallel implementation on a multi-core CPU. The comparison goes only as far as experiments with 10,000 particles. Nevertheless, the Gaussian particle filter outperforms all other algorithms in Gaussian estimation problems. RNA can achieve near-linear speedup with respect to the number of cores, which is much better than the other non-Gaussian filters.

A particle filter implementation on a GPU is presented in [Hendeby et al., 2010]. For the resampling step, the particles are divided into multiple subsets. Each subset is resampled independently and the results are redistributed into different sets (similar to the RNA algorithm). Pseudo-random numbers are generated on the host CPU and transferred to the GPU. This severely impacts the performance of the filter, as about 85% of the total runtime is spent on generating pseudo-random numbers and transferring them to the GPU. The GPU implementation is not suitable for real-time estimation for complex problems and is only marginally faster than a CPU implementation.

None of the studies we reviewed in this section has been able to demonstrate a working particle filter implementation for complex real-time estimation problems. The literature we examined (with the exception of [Hendeby et al., 2010]) mentions experiments with particle filters running up to 10,000 particles. This is insufficient for most complex problems. Some of the presented algorithms divide the particle population into smaller subsets without any form of communication or information exchange except for calculating global estimates. The only reference to a large-scale particle filter implementation is [Hendeby et al., 2010], which mentions experiments with up to a million particles. There is, however, no mention of any real-time application for the proposed particle filter implementation.

3.3 Filter Design

The particle filter algorithm, as described in Algorithm 2, consists of three main stages:

i. Sampling (prediction),



ii. Importance weight calculation (update),

iii. Resampling.

The first two steps can be considered trivially parallel, as they involve operations on a single particle and do not require any global knowledge of the whole particle population. A data-parallel approach is applicable in this case. The resampling step, however, does require information from the entire particle population and cannot be trivially executed in parallel.

3.3.1 Particle Filter Network

For a truly distributed design we will use the particle filter scheme from [Simonetto and Keviczky, 2009]. The main idea behind this approach is to construct a network of smaller particle filters. Instead of communicating estimates or other aggregate data, these particle filters exchange only a few representative particles in each round. Such an approach is inherently scalable: instead of increasing the size of individual particle filters, more particle filters can be added to the network. Each particle filter runs the algorithm described in Algorithm 3.

With this distributed scheme, each particle filter can be executed on a different computation unit. Depending on the underlying hardware memory model and architecture, a suitable network scheme can be chosen for efficient (particle) data transfer amongst the computation units. Figure 3.1 depicts a number of possible configurations.

Suppose we have a network of 𝑁 particle filters, each maintaining 𝑚 particles. In every iteration of the estimation procedure, each particle filter sends its 𝑡 likeliest particles to its direct neighbours. The received particles replace the local particles with the smallest weights. The number of neighbouring filters depends on the network topology. Of the configurations presented in Figure 3.1, the star network is a special case. It can be considered a semi-distributed or hybrid approach. In this network, each particle filter writes its 𝑡 likeliest particles to a global data structure (the black circle in the diagram). These particles are subsequently sorted to find the 𝑡 particles with the highest weight. All the particle filters add these 𝑡 particles to their local particle set.
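A sketch of one exchange round, from the point of view of a single filter, is shown below. It assumes, as the text above implies, that the local particle set and each neighbour's set are already sorted by descending weight, so the 𝑡 likeliest particles sit at the front and the worst at the back. All names are illustrative, not from the thesis implementation.

    // One exchange round for a filter with m local particles: the t likeliest
    // particles received from each neighbour overwrite the worst local ones.
    // 'local' and each 'neighbours[k]' are assumed sorted by descending weight.
    void exchange_round(Particle* local, int m, int t,
                        Particle* const* neighbours, int n_neighbours)
    {
        int dst = m - 1;  // index of the current worst local particle
        for (int k = 0; k < n_neighbours; ++k)
            for (int i = 0; i < t && dst >= 0; ++i)
                local[dst--] = neighbours[k][i];  // neighbour k's i-th likeliest
    }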

We know from [Simonetto and Keviczky, 2009] that the accuracy of such a distributed particle filter network approaches that of a centralised particle filter with 𝑚𝑁 particles, even when 𝑡 is much smaller than the number of particles in each node 𝑚.

3.3.2 Resampling

The usual problems with parallel resampling are avoided with this approach. Each particle filter performs local resampling on its own particles. One of the main concerns with resampling is that it can lead to a significant loss of variation amongst the particle population. Therefore, much attention is needed in order to avoid resampling too often. We propose a simple probabilistic resampling scheme with a parameter 𝑟 ∈ [0, 1] indicating the probability of performing the resampling step in each round. Each particle filter independently draws 𝑢 ∼ 𝑈(0, 1) and only performs the resampling step if 𝑢 < 𝑟.
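In code, the proposed scheme amounts to a one-line guard around the resampling step; uniform01() is a hypothetical PRNG returning a sample in [0, 1), and resample() is the sketch given earlier.

    float uniform01(void);  // hypothetical PRNG: u ~ U(0, 1)

    // Probabilistic resampling trigger: each filter resamples independently
    // with probability resample_prob (the parameter r from the text).
    void maybe_resample(const Particle* in, Particle* out, int M,
                        float total_weight, float resample_prob)
    {
        if (uniform01() < resample_prob)
            resample(in, out, M, total_weight);
    }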

3.3.3 Global Estimate

One of the main challenges in the design of a network of particle filters is the calculation of a global estimate. Whether such a global estimate can be calculated efficiently depends highly on the underlying hardware architecture and, specifically, on the cost of communication amongst the computation units running the particle filters. We know from the simulation results presented in [Simonetto and Keviczky, 2009] that the worst local estimate in such a network is still close to the best local estimate. The trade-off between the extra communication cost and the potential estimation improvement is highly dependent on the application as well as the hardware.

Figure 3.1: Possible configurations for a particle filter network: (a) star, (b) ring, (c) 2D mesh, (d) 3D mesh.

1:  𝑋̄𝑡 ← ∅
2:  for 𝑖 in 1 ∶ 𝑀 do
3:    sample 𝑥𝑡[𝑖] ∼ 𝑝(𝑥𝑡 | 𝑥𝑡−1[𝑖])
4:    𝑤𝑡[𝑖] ← 𝑝(𝑧𝑡 | 𝑥𝑡[𝑖])
5:    𝑋̄𝑡 ← 𝑋̄𝑡 + 𝑥𝑡[𝑖]
6:  end for
7:  sort 𝑋̄𝑡 according to weight
8:  send ⟨𝑥𝑡[1], 𝑤𝑡[1]⟩ … ⟨𝑥𝑡[𝑛], 𝑤𝑡[𝑛]⟩ to neighbours
9:  receive ⟨𝑥𝑡[𝑀−𝑛], 𝑤𝑡[𝑀−𝑛]⟩ … ⟨𝑥𝑡[𝑀], 𝑤𝑡[𝑀]⟩ from neighbours
10: 𝑋𝑡 ← ∅
11: for 𝑖 in 1 ∶ 𝑀 do
12:   draw 𝑟 ∼ 𝑈[0 ∶ ∑ 𝑤𝑡]
13:   𝑗, 𝑠 ← 0
14:   while 𝑠 < 𝑟 do
15:     𝑠 ← 𝑠 + 𝑤𝑡[𝑗]
16:     𝑗 ← 𝑗 + 1
17:   end while
18:   𝑋𝑡 ← 𝑋𝑡 + 𝑥𝑡[𝑗]
19: end for

Algorithm 3: Distributed Particle Filter


3.4 Filter Parameters

While traditional centralised particle filters are characterised only by the total number of particles, our distributed particle filter has a number of parameters which influence its behaviour. In this section we summarise these parameters, which were discussed throughout this chapter; a sketch collecting them into a single configuration structure follows the list.

Number of particles: The total number of particles in each particle filter.

Number of particle filters: The number of particle filters in the network.

Particle filter network topology: Defines how the particle filters are connected, and how information propagates through the network.

Number of exchanged particles: How many particles are exchanged between neighbouring particle filters.

Resampling frequency: How often resampling is performed.
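Collected in code, these parameters could form a configuration structure along the following lines (a hypothetical sketch; the names are not from the thesis implementation):

    // Hypothetical configuration for the distributed particle filter.
    enum Topology { STAR, RING, MESH_2D, MESH_3D };

    struct FilterConfig {
        int      particles_per_filter;   // m: particles in each local filter
        int      num_filters;            // N: filters in the network
        Topology topology;               // how the filters are connected
        int      exchange_count;         // t: particles sent to each neighbour
        float    resample_probability;   // r: chance of resampling each round
    };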

3.5 Discussion

In this chapter, we reviewed the existing literature on distributed particle filtering. None of these attempts has been able to demonstrate a large-scale (millions of particles) particle filter implementation for real-time estimation problems. We presented a fully distributed particle filter algorithm, based on the work of [Simonetto and Keviczky, 2009], which can be implemented on different hardware architectures.

The general idea behind this design is to construct a large particle filter by forming a network of smaller particle filters. The smaller particle filters operate independently, with only limited communication amongst neighbouring nodes. Finally, we discussed a number of parameters which influence the behaviour of the particle filter.


4 General-Purpose Computation on GPUs

In recent years, Graphics Processing Units (GPUs) have evolved into highly parallel processors with massive computational capacity. Modern GPUs offer a peak performance of over a thousand GFLOPS, an order of magnitude higher than that of conventional multi-core CPUs. The reason for this massive performance gap lies in the special workloads the GPU is designed to handle. This substantial computational capacity, together with high availability, has led to the wide adoption of GPUs beyond their original purpose of graphics rendering.

We will start with an overview of the GPU architecture in Section 4.1. In Section 4.2, we will discuss the programming model for general-purpose computation on GPUs. Finally, in Section 4.3 we will take a closer look at the GPUs which we use for the experiments described in Chapters 7 and 8.

4.1 GPU Architecture

In order to better understand the GPU architecture, we briefly examine the computer graphics processing pipeline, which has been dictating the design of GPUs. Furthermore, we discuss the main characteristics of throughput-oriented architectures, of which GPUs are a prime example.

4.1.1 Graphics Processing Pipeline

Real-time computer graphics has been the primary target for the development of the GPU architecture. The graphics processing pipeline consists of a number of stages which can be classified as either fixed-function or programmable [Fatahalian and Houston, 2008]. This pipeline converts a set of primitives (e.g. lines, polygons) to actual pixel values. The programmable stages of the pipeline are defined by shader functions, which are usually expressed in a high-level language. The shader code is subsequently compiled into a binary which can run on the GPU. The shader functions are applied to hundreds or thousands of graphical entities (e.g. pixels, vertices), which offers great opportunity for data parallelism.

The fixed-function stages are, however, difficult to parallelise, as they involve interaction between multiple entities. Therefore, these stages are separated from the programmable parts and are (usually) implemented in hardware. These fixed-function stages include texture filtering¹ and rasterisation².

¹ Texture filtering determines the texture value given a specific coordinate for texture-mapped pixels.
² Rasterisation involves transforming primitives into pixel values.

4.1.2 Throughput-Oriented Architectures

The GPU architecture has been classified as a throughput-oriented architecture [Garland and Kirk, 2010]. In contrast to traditional (multi-core) CPUs, throughput-oriented processors sacrifice the latency of individual tasks in order to maximise total throughput. The main assumption for this class of processors is that they will be presented with highly parallel workloads.

With the ever-increasing gap between processor speeds and memory access latencies, often referred to as the Memory Wall [Wulf and McKee, 1995], throughput-oriented architectures employ a different strategy to overcome this problem. While traditional CPUs use extensive caching, sophisticated branch prediction and out-of-order execution to hide memory access latencies, throughput-oriented processors rely on the execution of a vast number of threads which can resume work while data is being fetched from the much slower off-chip memory. This results in a much simpler processing core design, which allows more chip resources to be dedicated to processing elements.

Three main characteristics have been identified for throughput-oriented processors: many simple processing cores, hardware multithreading and SIMD execution [Garland and Kirk, 2010]. We already discussed the concept of having many simple processing cores when dealing with memory access latency. Hardware multithreading refers to a special case of multithreading in which the execution contexts of the threads are maintained on-chip. This allows context switching at no cost, which enables greater exploitation of instruction- and thread-level parallelism. Single Instruction, Multiple Data (SIMD) [Flynn, 1966] refers to a class of computer architectures in which a single instruction stream operates on multiple data streams. The SIMD architecture allows a single control unit to drive multiple arithmetic units, which results in effectively dedicating more chip resources (e.g. transistors) to arithmetic units.

4.1.3 NVIDIA GPU Architecture

We will now examine the CUDA architecture [NVIDIA, 2010a], introduced by NVIDIA in 2006. The G80 microprocessor was the first NVIDIA GPU to move from the traditional pipeline execution model to a unified shader model [NVIDIA, 2006]. The unified shader architecture allows different shaders (e.g. pixel shaders, vertex shaders) to run on the same computation units, marking the end of the traditional pipeline architectures.

CUDA-based GPUs are built around a scalable array of processors, called Streaming Multiprocessors (SMs) [Lindholm et al., 2008]. A streaming multiprocessor is a collection of simple processing cores with integer and floating-point units, combined with on-chip memory. Each multiprocessor can maintain hundreds or even a thousand resident threads, depending on the hardware generation. The SM contains thousands of registers, divided amongst all resident threads, and a few tens of kilobytes of fast shared memory. Figure 4.1 gives a high-level overview of the SM design for two NVIDIA architectures.

SIMT Execution Model

In order to efficiently run thousands of concurrent threads, CUDA employs what is called a Single-Instruction, Multiple-Thread (SIMT) architecture. In this architecture, threads are organised into warps: groups of 32 threads. Each warp is executed independently, with the warp scheduler choosing a warp with active threads which is ready for execution.

Although similar to the SIMD architecture described in the previous section, SIMT has two main differences: the programming model and independent branching. With classic SIMD vector machines, the programmer explicitly uses vector operations on (fixed-width) vectors, whereas SIMT enables the programmer to specify the behaviour of an individual thread. Threads in SIMT can branch independently, but within a warp this leads to the serialisation of the different execution paths.
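A minimal kernel illustrating such serialisation is given below: even and odd lanes of each warp take different paths, so the warp executes both paths one after the other. This is an illustrative example, not code from the thesis.

    // Threads of one warp diverge on the branch below; the warp serialises
    // the two paths, roughly doubling the execution time of this section.
    __global__ void divergent_kernel(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)
            data[i] *= 2.0f;   // taken by even lanes first...
        else
            data[i] += 1.0f;   // ...then by odd lanes
    }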


Figure 4.1: NVIDIA Streaming Multiprocessor (SM) design: (a) GT200, (b) GF100 (Fermi); source: NVIDIA.

4.2 Programming Model

With the introduction of programmable shaders for GPUs, numerous (high-level) programming languages have been developed in order to ease the task of programming these shaders. All these shading languages are designed according to the needs of modern graphics pipelines. The increased popularity of the GPU outside the graphics domain has led to the introduction of general-purpose languages specifically designed for this purpose. In this section we will examine the programming model of CUDA, introduced by NVIDIA in 2006 for its line of GPUs. A competing standard from the Khronos Group, called OpenCL [Khronos Group, 2010], was released in 2008. OpenCL, an open and royalty-free standard, is a programming framework for heterogeneous platforms, including multi-core CPUs as well as GPUs. Its programming and memory model is, however, comparable to that of CUDA.

A CUDA program is divided into a host and a device part. The host program runs on the CPU, and the device program, which runs on the GPU, consists of one or more kernels [NVIDIA, 2010a]. Each kernel typically consists of numerous threads grouped into blocks. Blocks of threads are, furthermore, grouped into a grid. Blocks of threads can be executed concurrently on different multiprocessors.

Threads within each block are organised in a one-, two- or three-dimensional fashion. Each thread has a unique ID within its block, which can be accessed in the kernel code through the built-in variable threadIdx. Similarly, blocks are grouped into one- or two-dimensional grids, with a unique ID accessible through the variable blockIdx. Threads from the same block can synchronise their execution by defining synchronisation points with __syncthreads().
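The following minimal CUDA program illustrates these concepts: the kernel computes a unique global index from blockIdx, blockDim and threadIdx, and the host launches it over a grid of blocks. This is an illustrative example, not code from the thesis.

    #include <cuda_runtime.h>

    // Each thread scales one array element, addressed by its global index.
    __global__ void scale(float* data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        float* d;
        cudaMalloc(&d, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);  // grid covers all n elements
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }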

4.2.1 Memory Model

Memory is managed in a hierarchical way, with each thread having access to private local memory. All threads within a block have access to the same shared memory, which, like thread-local memory, has only the lifetime of a single kernel execution. Global memory is accessible to all threads from all blocks and persists across kernel calls. Figure 4.2 gives an illustration of the CUDA memory hierarchy.

Global memory is stored off-chip and is the slowest available memory on the GPU. It is accessed through 32-, 64- or 128-byte memory transactions. Whenever one or more threads within a warp access global memory, a number of memory transactions are issued. If the memory accesses within a warp meet certain requirements, the requests are coalesced into a single memory transaction. This is the most efficient way to access global memory. The requirements for memory coalescing differ between devices of different generations, but in general aligned and especially contiguous (unit-stride) loads/stores are the most efficient.
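The difference can be illustrated with two copy kernels: with unit stride, the loads of a warp fall into a few contiguous memory transactions, while a large stride forces a separate transaction per thread (an illustrative sketch, not thesis code).

    // Unit-stride access: consecutive threads read consecutive addresses, so
    // the accesses of a warp coalesce into few memory transactions.
    __global__ void copy_coalesced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided access: threads of a warp touch addresses far apart, so each
    // load needs its own transaction and effective bandwidth drops sharply.
    __global__ void copy_strided(const float* in, float* out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(long long)i * stride % n];
    }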

Shared memory is a fast on-chip memory available to all threads within a thread block. It is much faster than the off-chip global memory, but is generally much smaller (a few tens of kilobytes per SM, depending on the hardware generation: 16 KB on the GTX280 and up to 48 KB on the GTX480) and does not persist across kernel calls. The manner in which shared memory is accessed can have a huge impact on performance. Shared memory consists of a number of (16 or 32) banks. A read or write request to shared memory can only reach optimum performance if all the requested locations from a single warp fall into distinct banks. If 𝑘 addresses from a request fall into a single bank, we have a 𝑘-way bank conflict. It is, therefore, important to consider the bank sizes and access strides in order to prevent bank conflicts.
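The kernel below sketches the difference for hardware with 32 banks of 32-bit words, launched with 32 threads per block so that each block is exactly one warp: a stride-1 access maps each lane to a distinct bank, whereas a stride-32 access maps every lane to bank 0 (an illustrative sketch, not thesis code).

    // Launch with <<<blocks, 32>>> so each block is a single warp.
    __global__ void bank_demo(float* out)
    {
        __shared__ float buf[32 * 32];
        int lane = threadIdx.x;            // 0..31

        for (int k = 0; k < 32; ++k)       // fill the buffer, stride 1 per pass
            buf[k * 32 + lane] = (float)(k + lane);
        __syncthreads();

        float a = buf[lane];               // stride 1: 32 distinct banks
        float b = buf[lane * 32];          // stride 32: 32-way conflict (bank 0)
        out[blockIdx.x * 32 + lane] = a + b;
    }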

4.3 Hardware Overview

In this section we examine the two NVIDIA GPUs which we use for the experiments described in Chapters 7 and 8. The first GPU is the GTX280, introduced in 2008, which is based on the GT200 microarchitecture. It contains 240 CUDA cores divided amongst 30 SMs, clocked at 1296 MHz. As depicted in Figure 4.1a, each SM consists of 8 processing cores (SPs) and 2 Special Function Units (SFUs). Each SP can execute one single-precision multiply-add (MAD) instruction, i.e. two floating-point operations, per clock cycle. The SFUs handle transcendental math and interpolation, but can also compute four floating-point operations per cycle. This leads to a theoretical maximum throughput of:

1.296 GHz × 30 × (8 × 2 FLOP + 2 × 4 FLOP) = 933.12 GFLOP/s

This assumes that instructions are issued to both the SPs and the SFUs (dual-issue). If no instructions are issued to the SFUs, this figure drops to 622.08 GFLOP/s. Similarly, if no MAD instructions are issued, it further drops to 311.04 GFLOP/s. The GTX280 has 1 GB of main memory with a throughput of 141.7 GB/s. Each SM has 16 KB of shared memory together with 16K 32-bit registers.

The second NVIDIA GPU we use for the experiments is the GTX480, based on the GF100/Fermi architecture introduced in 2010. It has 480 CUDA cores clocked at 1401 MHz, partitioned into 15 SMs. Each SM has 32 cores and 4 SFUs. The GTX480 delivers a single-precision peak performance of:

1.401 GHz × 15 × (32 × 2 FLOP) = 1344.96 GFLOP/s


Figure: NVIDIA CUDA memory hierarchy, source: NVIDIA. (a) per-thread local memory, (b) per-block shared memory, (c) global memory shared by all blocks of all grids.


          Processor count   SM count   DRAM      Memory bandwidth   Single-precision FLOPS
GTX280    240               30         1024 MB   141.7 GB/s         933.12 GFLOPS
GTX480    480               15         1536 MB   177.4 GB/s         1344.96 GFLOPS

Table: NVIDIA GPU Overview

As with the GTX280, this depends on the presence of MAD instructions. The maximum throughput drops to 672.48 GFLOP/s without MAD instructions. The GTX480 has 1536 MB of memory with a maximum throughput of 177.4 GB/s. Each SM in the GTX480 has 32K 32-bit wide registers and 48 KB of shared memory, which can be dynamically reconfigured as an L1 cache.

The table above gives an overview of the two NVIDIA cards described in this section.


Part II

Implementation and Experiments


5  Two Applications: Robot Localisation and Robotic Arm

In order to test, verify and benchmark our GPU-based distributed particle filter implementation, based on the ideas presented in Chapter 3, we use the following two estimation problems: (i) unicycle robot localisation, and (ii) a robotic arm.

The unicycle localisation problem, discussed in Section 5.1, is a relatively small estimation problem which does not allow us to test the performance and the accuracy of the filter under different conditions. We use this application to illustrate the model-specific aspects of the particle filter implementation in Chapter 6. Section 5.2 presents the robotic arm estimation problem. We use this estimation problem to measure the performance of the filter in Chapter 7 and its accuracy in Chapter 8.

5.1 Unicycle Robot Localisation

In this estimation problem we use noisy range measurements from fixed sensors to localise a mobile robot which moves freely on a 2D plane. The state consists of the tuple (x, y, θ), where (x, y) denotes the coordinates of the robot and θ refers to its orientation. The dynamical state equations are as follows:

x(t) = x(t−1) + (v̂/r̂)(sin(θ(t−1) + r̂Δt) − sin(θ(t−1)))

y(t) = y(t−1) − (v̂/r̂)(cos(θ(t−1) + r̂Δt) − cos(θ(t−1)))

θ(t) = θ(t−1) + r̂Δt + γΔt

Note that v̂ and r̂ represent (noisy) control inputs and γ is a noise term. The weights can be calculated according to a Gaussian probability density function. The sensor data consists of the set of measured distances to the fixed sensors.
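As an illustration of how these equations might map onto model-specific device functions (a sketch only; the actual code is listed in Appendix A, and the state, control and measurement types are introduced in Chapter 6), the sampling step perturbs the control inputs and applies the state equations, while the weight is an unnormalised Gaussian likelihood of the range measurements. The fixed sensor positions sensor_x/sensor_y and the noise scale SIGMA are assumed constants:

__device__ void unicycle_sample(state* s, const control c,
                                const float* rnd, float dt)
{
    float v = c.velocity + rnd[0];           // noisy control inputs
    float r = c.angular_velocity + rnd[1];   // assumes non-zero turn rate
    s->x     += (v / r) * (sinf(s->theta + r * dt) - sinf(s->theta));
    s->y     -= (v / r) * (cosf(s->theta + r * dt) - cosf(s->theta));
    s->theta += r * dt + rnd[2] * dt;        // rnd[2] plays the role of gamma
}

__device__ float unicycle_weight(const state* s, const measurement m)
{
    float w = 1.0f;
    for (int i = 0; i < NUM_SENSORS; ++i) {
        float dx  = s->x - sensor_x[i];
        float dy  = s->y - sensor_y[i];
        float err = sqrtf(dx * dx + dy * dy) - m.sensor_data[i];
        w *= expf(-err * err / (2.0f * SIGMA * SIGMA));  // Gaussian pdf,
    }                                                     // unnormalised
    return w;
}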

5.2 Robotic Arm

The robotic arm in this experiment has a number of joints which can be controlled independently, with one degree of freedom per joint. Each joint has a sensor to measure its angle. A camera mounted at the end of the arm is used for tracking an object which moves on a fixed 2D plane. Figure 5.1 gives an illustration of this robotic arm.

The state equations for both the robotic arm joint movements and the moving object are as follows:

θ_i(t) = θ_i(t−1) + Δt u_i + ω_θ,   ∀i ∈ {0, …, N}

Figure 5.1: Robotic arm with a mounted camera (joint angles θ1, θ2, θ3; link lengths L1, L2, L3).

x(t) = x(t−1) + v_x(t−1)Δt + ω_x

y(t) = y(t−1) + v_y(t−1)Δt + ω_y

v_x(t) = v_x(t−1) + ω_vX

v_y(t) = v_y(t−1) + ω_vY

Here θ_i represents the angle of joint i and u_i represents the control signal for the actuator of joint i. (x, y) and (v_x, v_y) are the position and velocity vectors of the moving object, and ω denotes a noise term.

The camera detects the object in its own frame of reference. Therefore, in order to translate between the global coordinates and the camera reference frame, we need to apply, for each joint, the rotation and shift matrices according to the joint angle and link length:

\begin{bmatrix} u(t) \\ v(t) \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\left( \begin{bmatrix} 0 \\ 0 \\ -L_N \end{bmatrix}
+ \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_N & -\sin\theta_N \\ 0 & \sin\theta_N & \cos\theta_N \end{bmatrix}
\left( \begin{bmatrix} 0 \\ 0 \\ -L_{N-1} \end{bmatrix}
+ \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_{N-1} & -\sin\theta_{N-1} \\ 0 & \sin\theta_{N-1} & \cos\theta_{N-1} \end{bmatrix}
\left( \cdots
\left( \begin{bmatrix} 0 \\ 0 \\ -L_1 \end{bmatrix}
+ \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_1 & -\sin\theta_1 \\ 0 & \sin\theta_1 & \cos\theta_1 \end{bmatrix}
\begin{bmatrix} \cos\theta_0 & \sin\theta_0 & 0 \\ -\sin\theta_0 & \cos\theta_0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ 0 \end{bmatrix}
\right) \cdots \right) \right) \right)
+ \begin{bmatrix} \omega_{uX} \\ \omega_{uY} \end{bmatrix}

θ_i refers to the angle of joint i, L_i refers to the length of link i and ω is the noise term. The angle measurements for the individual joints are according to the following equation:

\begin{bmatrix} \hat{\theta}_0(t) \\ \hat{\theta}_1(t) \\ \vdots \\ \hat{\theta}_N(t) \end{bmatrix} =
\begin{bmatrix} \theta_0(t) \\ \theta_1(t) \\ \vdots \\ \theta_N(t) \end{bmatrix} + \omega_k

The noise term is denoted by ω. For the sake of simplicity, all noise terms are assumed to be Gaussian, although particle filters do not impose such a restriction.
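The nested expression above unrolls naturally into a loop. The following sketch (illustrative only, with the noise terms omitted) starts from the object position rotated about the base joint and, for each subsequent joint, applies the rotation followed by the link shift; the first two components of the result are the camera-frame coordinates (u, v):

__device__ void camera_frame(const float* theta, const float* L, int N,
                             float x, float y, float* u, float* v)
{
    // base rotation about the vertical axis (joint 0)
    float p0 =  cosf(theta[0]) * x + sinf(theta[0]) * y;
    float p1 = -sinf(theta[0]) * x + cosf(theta[0]) * y;
    float p2 = 0.0f;
    for (int i = 1; i <= N; ++i) {     // joints 1..N: rotate, then shift
        float q1 = cosf(theta[i]) * p1 - sinf(theta[i]) * p2;
        float q2 = sinf(theta[i]) * p1 + cosf(theta[i]) * p2;
        p1 = q1;
        p2 = q2 - L[i];                // link offset along -z
    }
    *u = p0;                           // projection onto the first
    *v = p1;                           // two coordinate axes
}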

The main advantage of this robotic arm model is its parametric form: by adjusting the number of joints, the state dimension can be increased or decreased.


6  Particle Filtering on the GPU

In order to have an efficient particle filter implementation on the GPU, we need to modify the standard particle filter algorithm. We presented a distributed particle filtering scheme in Chapter 3 which serves as the basis for our GPU implementation. We use the unicycle robot localisation application, presented in Chapter 5, to illustrate the model-specific aspects of the filter implementation. The code for the unicycle implementation is listed in Appendix A.

In this chapter we discuss how this distributed filter design translates to a concrete implementation on the GPU. Section 6.1 discusses the general aspects of our GPU implementation, while Section 6.2 specifically discusses the memory organisation. Section 6.3 discusses the implementation of each kernel.

6.1 Filter Design

The distributed particle filtering scheme, discussed in Chapter 3, forms the basis of our GPU implementation. The basic idea of this approach is to build a network of small particle filters instead of a single large particle filter. This network of particle filters has three main characteristics:

1. Individual particle filters remain small, which allows for efficient resampling.

2. Excluding the particle exchange step, the remainder of the filter algorithm can be executed independently for each particle filter.

3. Scaling is achieved by adding more particle filters, instead of increasing the size of individual particle filters.

Recall from Chapter 4 that with the CUDA programming model, kernels run on the GPU as blocks of threads. These blocks are executed concurrently on different Streaming Multiprocessors (SMs). Threads within a block can communicate using fast on-chip memory and synchronise their execution using barrier instructions. However, there are a number of limits which we need to consider. We discussed in Chapter 4 that the total number of threads within each block is limited to 512 for 1.x devices and 1024 for 2.x devices. Furthermore, there is also a limit on the available shared memory (16 KB for 1.x devices and 48 KB for 2.x devices).

Based on this, we choose the following mapping:

1. Each particle is represented by a single CUDA thread;

2. Each particle filter is represented by a single CUDA thread block.

This particular mapping has a number of advantages. The sorting and resampling steps, which require a lot of communication, are limited to a single thread block. Therefore, we can utilise the fast shared memory, as well as the synchronisation mechanisms, for an efficient implementation of these steps. A sketch of the corresponding launch configuration is given below.
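A minimal sketch of this mapping, assuming the parameters m and N from Table 6.1 below and previously allocated device buffers:

// Sketch: one block per particle filter, one thread per particle.
dim3 grid(N);    // N particle filters
dim3 block(m);   // m particles per filter
sampling_importance_weights<<<grid, block>>>(
    d_particle_data, d_particle_weights,
    control_data, measurement_data, d_random, dt);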


Parameter                          Symbol
Number of particles                m
Number of particle filters         N
Particle filter network topology   Star / Ring / 2D Mesh / 2D Torus / 3D Mesh
Number of exchanged particles      t
Resampling frequency               r

Table 6.1: Particle Filter Parameters

There are, however, a number of disadvantages to this approach. The amount of available shared memory, as well as the limit on the number of threads within a thread block, restricts the size of an individual particle filter. Nevertheless, this is in line with our design strategy, in which scalability is achieved by increasing the total number of particle filters instead of the size of individual particle filters.

Based on the details of the distributed particle filter algorithm presented in Chapter 3, we identify the following steps for our particle filter implementation:

1. Pseudo-Random Number Generation

2. Sampling

3. Importance weights

4. Local Sorting

5. Global Estimate

6. Particle Exchange

7. Resampling

The generation of pseudo-random numbers is listed here as a separate step. Pseudo-random numbers are used both for the sampling and the resampling steps. We have chosen to generate these numbers separately at the start of each iteration. Table 6.1 lists the available parameters affecting the behaviour of the filter. We will explain these parameters further when we discuss the corresponding kernels.

6.2 Data Layout

There are three main data types for holding the data needed by the particle filter. The state structure is the main data structure containing the data of a single particle. These are stored in an array the length of the whole particle population.¹ These structures are initialised on the GPU at the start of the execution. There are also the control and measurement structures, holding, respectively, the control (actuation) values and the (noisy) measurements. The particle weights are stored separately in a single-precision floating-point array.

Listing 6.1 shows these three data structures for the unicycle model discussed in Chapter 5.

6.3 Individual Kernels

In this section we discuss, in detail, the implementation of each step of the particle filter algorithm.

¹ We actually allocate twice the required memory for the particle data. This is explained in the discussion of the sorting kernel.


Listing 6.1: Particle Filter Data

typedef struct _state {
    float x;
    float y;
    float theta;
} state;

typedef struct _control {
    float velocity;
    float angular_velocity;
} control;

typedef struct _measurement {
    float sensor_data[NUM_SENSORS];
} measurement;

state* particle_data;
float* particle_weights;

6.3.1 Pseudo-Random Number Generation

As with any Monte Carlo simulation technique, particle filters rely heavily on random numbers. Pseudo-random number generators (PRNGs) are, therefore, an essential part of any Monte Carlo simulation. Not all PRNGs are suitable for parallel execution on GPUs. We now examine the two PRNGs which we use for our experiments. For a detailed overview of GPU implementations of various PRNGs, see [Demchik, 2011].

Mersenne Twister

Mersenne Twister [Matsumoto and Nishimura, 1998b] is a highly popular pseudo-random number generator, characterised by its large period. Although not directly suitable for cryptographic purposes, Mersenne Twister is used in a wide range of applications, including Monte Carlo simulations. A distributed scheme is proposed in [Matsumoto and Nishimura, 1998a], which dynamically creates Mersenne Twister parameters based on process IDs. This enables concurrent Mersenne Twister PRNGs to produce uncorrelated number sequences. The NVIDIA CUDA SDK provides a CUDA implementation [Podlozhnyuk, 2007] based on this scheme with a period of 2^607 − 1. We will, however, use an optimised CUDA implementation (MTGP) from [Saito, 2010], which is claimed to be 1.5 times faster and has a period of 2^11213 − 1.

This CUDA implementation produces uniform floating-point numbers in the range [1, 2). We use the Box-Muller transformation [Box and Muller, 1958] to transform these uniformly distributed numbers into a normal distribution.
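A sketch of this transformation (assuming MTGP's [1, 2) output convention):

__device__ void box_muller(float u1, float u2, float* z0, float* z1)
{
    u1 -= 1.0f;                                  // [1,2) -> [0,1)
    u2 -= 1.0f;
    float r   = sqrtf(-2.0f * logf(1.0f - u1));  // 1-u1 in (0,1]: log is safe
    float phi = 2.0f * 3.141592654f * u2;
    *z0 = r * cosf(phi);                         // two independent
    *z1 = r * sinf(phi);                         // standard-normal samples
}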

Xorshift

Xorshift [Marsaglia, 2003] is a family of PRNGs based on the repeated use of a simple construct, namely the exclusive-or (xor) of a number with a shifted version of itself. Compared to Mersenne Twister, these PRNGs are much simpler and faster. Xorshift PRNGs, however, generally have much smaller periods (e.g. 2^64, 2^128, 2^192).

The NVIDIA CUDA Toolkit provides the CURAND library [NVIDIA, 2010b].


Listing 6.2: Sampling/Importance Weights Kernel

__global__ void sampling_importance_weights(
    state* const d_particle_data,
    float* const d_particle_weights,
    const control control_data,
    const measurement measurement_data,
    const float* const d_random,
    const float dt);

CURAND is a CUDA implementation of xorwow, which is a member of the Xorshift PRNG family. It is claimed to have a period greater than 2^190, which should be enough for our experiments. CURAND can produce uniformly distributed numbers in the range (0, 1] and normally distributed numbers with a given mean and standard deviation.

6.3.2 Sampling/Importance Weights

The sampling step, as described in Section 2.3, involves generating new particles x_t^[m] from the previous particles x_{t−1}^[m]. This is achieved by sampling from the state transition distribution, taking into account any possible control values. The pseudo-random numbers generated in the previous step are used here for generating samples from the state transition distribution.

The importance weight calculation, which uses measurement data to assign weights to the particles, is combined with the sampling into a single CUDA kernel.

The measurement and control structures are directly passed to this kernel as parameters. The CUDA compiler places these structures in constant memory, which is cached on each SM. In the applications we have encountered so far, the measurement and control structures have easily fit the constant memory (64 KB). If, for a particular application, these data structures exceed the constant memory size, they need to be placed either in texture memory or global memory.

As mentioned in the previous section, the pseudo-random numbers are generated prior to the execution of this kernel. A pointer to the data structure holding these numbers is passed along to the kernel as well.

6.3.3 Local Sorting

Each particle filter needs to internally sort its particles according to their weights. This serves two purposes. First, for the particle exchange step, each particle filter needs to send its t particles with the highest weights and, for each neighbour, replace t particles with the lowest weights with the particles received. Secondly, in order to determine the global estimate, the local winner from each block needs to be calculated.

Listing 6.3: Particle Sort Kernel

__global__ void block_sort(
    state* const d_particle_data_sorted,
    float* const d_particle_weights,
    const state* const d_particle_data_unsorted,
    const int num_particles);

We use a bitonic sort [Batcher, 1968] implementation from the NVIDIA CUDA SDK, which has a complexity of O(n log²(n)). Bitonic sort is an example of a sorting network, which uses a fixed sequence of comparisons in order to sort the input. This makes sorting networks attractive for parallel sorting.


The particles are generally too large to fit entirely in shared memory. Therefore, we only store the weights in shared memory and sort them separately. We keep track of the sorted indices with an index array (also stored in shared memory). Once the weights are sorted, each thread reads from global memory from the position stored in the index array, and writes back to its own location. Recall from Chapter 4 that in order to achieve high memory throughput, all accesses need to be contiguous; this is only the case for the writes. Nevertheless, this approach allows us to apply our filter implementation to much larger problems.

As a result of not using shared memory for the actual particle data, we cannot sort the particle data in place in global memory. We need to allocate two arrays: one for holding the unsorted particles and another one for the sorted particles. This prevents any conflict between reads and writes from different threads within each block. The gather step is sketched below.
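A sketch of the resulting gather step (in the real block_sort() the index array is produced by the in-block bitonic sort; here it is loaded from global memory for illustration, and MAX_PARTICLES is an assumed compile-time bound equal to the block size):

__global__ void gather_sorted(state* dst, const state* src,
                              const int* d_sorted_idx)
{
    __shared__ int s_idx[MAX_PARTICLES];
    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;        // this filter's particles
    s_idx[tid] = d_sorted_idx[base + tid];     // stand-in for the sort result
    __syncthreads();
    dst[base + tid] = src[base + s_idx[tid]];  // scattered read,
}                                              // contiguous write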

6.3.4 Global Estimate

For this step we need to implement a reduction operation in order to calculate an estimate given a probability density function. Whether to use a weighted average of all the particle values or to choose the single most representative particle as the estimate depends on the particular application. We consider the single particle with the highest weight to be the global estimate of the particle filter. In order to calculate this, we need to find the particle with the highest weight amongst the local winners from each particle filter.

To this end we use an external library called Thrust.² Thrust is an STL-like library providing a high-level interface for CUDA implementations of many useful algorithms. We use thrust::max_element() to find the particle with the highest weight amongst the local winners from each block.

6.3.5 Exchange

Listing 6.4: Particle Exchange Kernels

__global__ void exchange_ring(
    const state_data d_particle_data,
    float* const d_particle_weights,
    const int num_particles,
    const int num_blocks);

__global__ void exchange_2dtorus(
    const state_data d_particle_data,
    float* const d_particle_weights,
    const int num_particles,
    const int num_blocks);

There are two parameters influencing the behaviour of this kernel. The particle filter network topology determines which blocks exchange particles. We have implementations for the following networks: (i) star, (ii) ring and (iii) 2D torus. With the star topology, all blocks write their local winners to a global data structure. These particles are subsequently sorted, which determines the global estimate. Furthermore, all blocks read the same particles back. Other topologies, such as the ring network, are more distributed in that exchanged particles are unique to direct neighbouring blocks.

There is also the parameter t, the number of exchanged particles, which determines how many particles are transferred in a single exchange step.

² http://code.google.com/p/thrust/


6.3.6 Resampling

Listing 6.5: Particle Resampling Kernel

__global__ void resampling(
    const state_data d_particle_data,
    const state_data d_particle_data_tmp,
    const float* const d_particle_weights,
    const float* const d_random,
    const int num_particles,
    const float resampling_freq);

The resampling step generates a new particle set by drawing randomly (with replacement) from the original set according to the particle weights. In order to implement this, we first need to calculate the cumulative sum of all the weights within each thread block. Our implementation is based on the parallel prefix sum from [Harris et al., 2007], which consists of two phases: the up-sweep phase and the down-sweep phase. In the up-sweep phase, the elements of the array are traversed "upwards" in a binary-tree fashion in order to calculate the partial sums. In the down-sweep phase, the tree is traversed back down to calculate the sums in place. Both phases take logarithmic time.

With the cumulative sum calculated, each thread picks a uniformly distributed number in the range [0, 1) (generated separately) and multiplies it with the total sum of the weights. With a custom binary search implementation (also logarithmic), we find the index of the particle whose cumulative weight interval contains the random selection. This index points to the chosen particle.
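A sketch of the selection step (one common formulation; s_cdf is assumed to be the inclusive cumulative sum of the block's weights in shared memory, and u the pre-generated uniform draw):

__device__ int select_particle(const float* s_cdf, int n, float u)
{
    float target = u * s_cdf[n - 1];   // scale the draw by the total weight
    int lo = 0, hi = n - 1;
    while (lo < hi) {                  // logarithmic bisection
        int mid = (lo + hi) / 2;
        if (s_cdf[mid] < target) lo = mid + 1;
        else                     hi = mid;
    }
    return lo;                         // index of the chosen particle
}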

Similar to the reasoning used in Section 6.3.3 for the sorting kernel, the actual particle data is not stored in shared memory, as it is not guaranteed to fit. Therefore, the reads will be irregular and non-contiguous while the writes remain contiguous and aligned.

6.4 Generic Particle Filtering

Listing 6.6: Model-Specific Code

__device__ void sampling(
    state* const particle_data,
    const control control_data,
    const float* const d_random,
    const float dt);

__device__ float importance_weight(
    const state* const particle_data,
    const measurement measurement_data);

In this chapter we have reviewed our implementation of the distributed particle filter algorithm on the GPU. In order to be able to reuse this filter implementation for different applications, we have tried to separate the model-specific parts from the generic filter code.

The data structures needed to store the particle data as well as the other model-specific elements (i.e. measurement, control) are depicted in Listing 6.1. Of the kernels discussed in this chapter, the only model-specific kernel is sampling_importance(). For convenience, this kernel calls two device functions which are responsible for the actual calculations of the particle state (sampling) and the importance weights. Listing 6.6 shows the signatures of these two device functions. The complete implementation of the unicycle model for our particle filtering framework is presented in Appendix A.

There are a number of considerations for the implementation of custom applications with this framework:

• There should be enough global memory available for allocating the data structures holding the pseudo-random numbers, the weights and twice the particle data.

• The measurement and control structures should together fit into the constant memory.

• The particle with the highest weight is considered to be the global estimate. Some additional steps are needed if a particular application needs to extract more information from the particle population (e.g. a weighted average).


7  Performance Analysis and Optimisations

The GPU implementation of the generic particle filtering framework was discussed in Chapter 6. In this chapter, we evaluate the performance of the GPU implementation using the robotic arm application, discussed in Chapter 5, on two CUDA-enabled NVIDIA GPUs. Based on the results, we perform a number of optimisations in order to improve the implementation efficiency.

First, in Section 7.1, we discuss the performance model we use throughout this chapter for visualising the performance of the different kernels. Next, in Section 7.2, we examine the performance of the basic implementation from Chapter 6 and identify the bottlenecks. In Section 7.3, we perform a number of optimisations in order to improve the efficiency of the implementation. We examine the scalability of the filter implementation in a number of directions in Section 7.4, and finally, in Section 7.5, we conclude this chapter.

7.1 Performance Model

In order to visualise the performance of our implementation, we use the roofline model [Williams et al., 2009]. Focusing on floating-point performance, the roofline model sets an upper bound on the performance given a certain operational intensity. In contrast to arithmetic intensity, operational intensity refers to the number of operations per byte of DRAM traffic, ignoring any traffic between the processor and the caches.

The roofline is calculated according to:

Attainable GFLOP/s = min(Peak Floating-Point Performance, Peak Memory Bandwidth × Operational Intensity)

The roofline is presented on a two-dimensional plot with the operational intensity on the x-axis and the attainable GFLOP/s on the y-axis. The intersection of the diagonal line (memory bound) and the horizontal line (computation bound) represents the point of peak computational performance and memory bandwidth. In order to find the upper bound of the floating-point performance for a given kernel, we draw a vertical line from its operational intensity. The intersection point with the roofline represents the upper bound. If this intersection point lies on the diagonal line, the kernel is memory bound on this platform; otherwise it is computation bound.
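As a worked instance of this formula (a hypothetical kernel, using the GTX480 figures from Chapter 4), an operational intensity of 1.5 FLOPs/byte gives

Attainable GFLOP/s = min(1344.96, 177.4 × 1.5) ≈ 266 GFLOP/s

so such a kernel sits on the diagonal part of the roofline and is memory bound.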

7.1.1 Platform Bounds

In order to draw an accurate roofline model for the GPU cards used in our experiments, we need to calculate realistic upper bounds. We use a specific benchmark kernel in order to measure the attainable memory bandwidth. This kernel only issues aligned and contiguous memory accesses. The achieved results relative to the specifications of the hardware are presented in Table 7.1.

Figure 7.1: Roofline model for the GTX280 and GTX480 (attainable throughput [GFLOP/s] versus operational intensity [FLOPs/byte]), with separate computation bounds with (+MAD) and without (−MAD) MAD instructions.

GPU      Attained Bandwidth   Specified Bandwidth   Ratio
GTX480   158.72 GB/s          177.4 GB/s            89.5%
GTX280   125.9 GB/s           141.7 GB/s            88.8%

Table 7.1: Attained vs Specified Memory Bandwidth

In Chapter 4, we calculated different values for the overall instruction throughput of both the GTX480 and the GTX280, depending on whether MAD instructions are executed (or dual-issuing on the SFUs for the GTX280). Figure 7.1 depicts the roofline model for the GTX480 and the GTX280. For both cards we have drawn separate computation bounds depending on whether MAD instructions are being issued. We ignore GTX280 dual-issuing on the SFUs, as it can only happen in limited circumstances which do not apply to our implementation.

In the following section, we identify the kernels contributing most to the total runtime and use the roofline model to better understand their behaviour.

7.2 Baseline Performance

In this section we measure the runtime of the different parts of the particle filter algorithm using the robotic arm application, as described in Chapter 5. The GPU implementation details are examined in Chapter 6. For this experiment we use the filter and model parameters listed in Table 7.2.


Number of particles              512
Number of blocks                 2048
Particle network configuration   Ring
Number of exchanged particles    1
Resampling frequency             1.0
Number of joints                 5
State dimension (#joints + 4)    9

Table 7.2: Parameters for performance measurements

The timing measurements are done using a host CPU timer (based on gettimeofday()). The results were verified against the GPU execution times reported by the NVIDIA CUDA profiler. Table 7.3 presents these results for both the GTX480 and the GTX280. From these results we can conclude that the global reduction step (thrust::max_element) used to determine the actual estimate, as well as the particle exchange kernel, do not contribute significantly to the total runtime of the filter. Although we used a ring network topology for this particular experiment, other network topologies (e.g. star, 2D torus) perform similarly. While individual particle filter blocks only exchange a single particle in this experiment, other experiments have shown that increasing the number of exchanged particles does not significantly increase the runtime of that particular kernel. Therefore, for the remainder of this chapter we focus only on the PRNGs as well as the following three kernels: sampling_importance(), block_sort() and resampling().

(a) GTX480
Kernel                 Runtime (μs)   Contrib.
curand (xorwow)        819            5.7%
sampling_importance    5889           41.3%
block_sort             4984           35.0%
thrust::max_element    181            1.3%
exchange               45             0.3%
resampling             2329           16.3%
total                  14247

(b) GTX280
Kernel                 Runtime (μs)   Contrib.
curand (xorwow)        1310           4.9%
sampling_importance    9659           36.1%
block_sort             8185           30.6%
thrust::max_element    242            0.9%
exchange               151            0.6%
resampling             7238           27.0%
total                  26785

Table 7.3: Breakdown

In the following sections we examine each of these steps in order to better understand these performance figures.


7.2.1 Pseudo-Random Number Generators

Figure 7.2: Performance comparison of two PRNG implementations (samples per second versus batch size for MTGP and xorwow on the GTX280 and GTX480).

One of the most crucial steps in the particle filter algorithm is the generation of pseudo-random numbers. We discussed two PRNGs in Chapter 6: (i) MTGP, an optimised Mersenne Twister CUDA implementation, and (ii) xorwow, a Xorshift-based PRNG from NVIDIA. In this experiment, we use both PRNGs to produce a batch of standard normally distributed random numbers. As mentioned before, MTGP only produces uniformly distributed numbers; therefore, we use a custom Box-Muller transformation to transform these numbers into a normal distribution.

We have noticed that the batch size has a great influence on the performance of both PRNGs. Figure 7.2 presents the results of this experiment with different batch sizes. We expected xorwow to outperform MTGP, as it is a much simpler algorithm. The experiment results confirm this to be the case only for large batch sizes; MTGP outperforms xorwow in all the test cases with smaller batch sizes. Since NVIDIA released their xorwow implementation with the 3.2 release of CUDA in late 2010, future releases might be better optimised for smaller batch sizes.


7.2.2 Individual Kernel Analysis

Now that we have identified the main kernels which contribute to the overall runtime, we examine each kernel in detail. To this purpose we calculate, for each kernel, the total number of bytes transferred from and to global memory, as well as the instruction count. This allows us to calculate the utilisation of the GPU resources and to identify performance bottlenecks.

In contrast to global memory traffic, it is much more difficult to come up with a definitive figure for the instruction count, as many of the architectural features of modern GPUs are relatively unknown. Some features of the GT200 (e.g. GTX280) have been uncovered through microbenchmarks [Wong et al., 2010]. In particular, the authors have calculated the latencies and throughput of various floating-point and integer instructions. The results of these benchmarks are limited to the GTX280. However, as the code for the microbenchmarks is made available online,³ we have been able to run the benchmark code on the GTX480 with minor modifications.

The results of our experiments with the microbenchmark code for measuring the throughput of the various instructions are presented in Table 7.4. The measured throughput on the GTX280 is identical to that presented in [Wong et al., 2010]. We have also extended the benchmark to additionally measure the throughput of integer and floating-point (Boolean) comparison operations.

                   throughput (ops/clock)   throughput (ops/clock)
Instruction type   GTX280                   GTX480
fadd               7.9                      28.1
fmul               11.2                     28.1
iadd               7.9                      16.0
imul               1.7                      16.0
ixor               7.9                      28.1
iand               7.9                      28.1
__sinf()           2.0                      4.0
__cosf()           2.0                      4.0
__expf()           2.0                      4.0
icmp               4.0                      15.7
fcmp               2.7                      15.7

Table 7.4: Instruction throughput microbenchmark results for GTX280 and GTX480

In order to calculate the total FLOP count when we are dealing with mixed integer and floating-point operation types, we use the measured throughput figures. A floating-point addition instruction is counted as a single FLOP, while other instructions are scaled according to the relative difference in throughput; for example, an integer multiplication on the GTX280 counts as 7.9/1.7 ≈ 4.6 FLOPs. Since we observe different behaviour on the GTX480 for certain instruction types, we calculate separate FLOP counts for each platform.

sampling_importance()

The sampling_importance() kernel, as discussed in Chapter 6, calls two model-specific device functions for updating the particles and calculating the weights. However, in order to negate any possible effects of caching, as both functions need to load certain particle variables, we have combined these two device functions into one.

³ http://www.stuffedcow.net/research/cudabmk


Figure 7.3: Roofline model and baseline results for the GTX480 (predicted and measured performance of the block_sort, resampling and sampling kernels).

Each thread reads one instance of the state_data structure and writes back an updated version. With the chosen experiment parameters, this structure contains 9 single-precision floating-point variables (36 B). For each state variable, we have a normally distributed random number (4 B). Each thread also writes the weight of the particle to global memory, which is again a single-precision floating-point number (4 B). This leads to a total of 36 B + 36 B = 72 B of data read per thread and 36 B + 4 B = 40 B of data written. Therefore, we have, in total, 112 B of reads/writes per thread.

                                         FLOP count   FLOP count
Instruction type   count (parametric)    count    GTX280   GTX480
fadd               12 + 6 × #joints      42       42       42
fmul               10 + 8 × #joints      50       ~36      50
__sinf()           #joints               5        ~20      ~36
__cosf()           #joints               5        ~20      ~36
__expf()           1                     1        ~4       ~8
total                                             122      172

Table 7.5: Instruction count for the sampling_importance() kernel

The instruction counts for the sampling_importance() kernel are listed in Table 7.5. According to these figures, the operational intensity is 122/112 = 1.09 FLOPs/byte on the GTX280 and 172/112 = 1.54 FLOPs/byte on the GTX480.


Figure 7.4: Roofline model and baseline results for the GTX280 (predicted and measured performance of the block_sort, resampling and sampling kernels).

Both figures are well below the point where the memory access latencies are hidden by computation. This kernel will ultimately be bound by the available memory throughput.

In order to figure out which roofline applies to this kernel, we examined the compiled binary using the NVIDIA-provided disassembler. We noticed a great number of MAD instructions (FMA instructions on the GTX480), which can, in theory, run at twice the rate of other floating-point instructions. Nevertheless, because this kernel is ultimately memory bound, this does not change the predicted outcome.

The results of our measurements are presented in Table 7.8 and Table 7.9, and on the roofline models in Figure 7.3 and Figure 7.4. The measured memory throughput on both cards is far lower than what we expect to achieve. The reason for this is that none of the memory accesses are coalesced; therefore, all the reads from within a (half-)warp are serialised. In Section 7.3.1, we discuss our modifications to achieve better memory throughput.

7.2.3 Local sort kernel: block_sort()

block_sort() uses a bitonic sorting network in order to sort the particles within each block according to their weights. Because the state_data structure is not guaranteed to fit in shared memory, we have opted to store only the weights, as well as an index, in shared memory. Once the weights are sorted, we apply the index to the actual particle data. In our bitonic sorting network, each thread handles two particles. Therefore, each thread needs to read 2× the state_data structure as well as 2× the weight (float), and eventually write them back to global memory. This leads to a total of 160 B of reads/writes per thread.

Table 7.6 presents the number of FLOPs per thread for this kernel.


                           FLOP count   FLOP count
Instruction type   count   GTX280       GTX480
imul               53      ~247         ~94
iadd               90      90           ~158
ixor               8       8            8
iand               53      53           53
icmp               53      ~105         ~95
fcmp               45      ~132         ~81
total                      635          489

Table 7.6: Instruction count for the block_sort() kernel

From this we can calculate the operational intensity: 635/160 = 3.97 FLOPs/byte for the GTX280 and 489/160 = 3.06 FLOPs/byte for the GTX480. In order to determine the computational bound for this kernel, we examine whether MAD instructions are issued. Unlike the sampling_importance() kernel, we encountered no MAD instructions when we examined the CUDA binary with a disassembler. Therefore, we consider the lower roofline to be a realistic upper bound for the performance of this kernel.

Although these figures suggest that this kernel is compute bound on the GTX280 and memory bound on the GTX480, the irregular read access patterns prevent us from fully utilising the available memory bandwidth.

The experiment results are listed in Table 7.8 and Table 7.9, and visualised on the roofline models in Figure 7.3 and Figure 7.4. These results indicate a very low memory bandwidth utilisation. As mentioned above, the irregular memory access pattern prevents us from fully utilising the available bandwidth. In Section 7.3.2, we discuss the options for reducing the uncoalesced memory transaction overhead.

7.2.4 Resampling kernel: resampling()

Each thread in the resampling() kernel reads 3× float and 1× the state_data structure, and writes 1× the state_data structure to global memory. Therefore, we have a total of 84 B of reads/writes per thread.

                           FLOP count   FLOP count
Instruction type   count   GTX280       GTX480
imul               108     ~502         ~190
iadd               108     108          ~190
icmp               27      ~54          ~97
fmul               1       ~1           1
fadd               18      18           18
fcmp               10      ~30          ~18
total                      713          514

Table 7.7: Instruction count for the resampling() kernel

The number of instructions for this kernel is presented in Table 7.7. From this we can calculate the operational intensity as 713/84 = 8.49 FLOPs/byte for the GTX280 and 514/84 = 6.12 FLOPs/byte for the GTX480.

Similar to the block_sort() kernel, no MAD instructions are used, which results in this kernel being


compute bound on both the GTX480 and the GTX280. However, the memory accesses are highly irregular and thus prevent us from fully utilising the available bandwidth. The experiment results listed in Table 7.8 and Table 7.9, as well as the roofline models in Figure 7.3 and Figure 7.4, confirm this to be the case. We discuss optimising uncoalesced memory transactions in Section 7.3.2.

Kernel                Runtime   Mem throughput   Mem Util   Instr throughput   Instr Util
sampling_importance   5889 μs   18.57 GB/s       11.5%      28.52 GFLOP/s      2.1%
block_sort            4984 μs   15.68 GB/s       9.7%       47.91 GFLOP/s      7.1%
resampling            2329 μs   35.22 GB/s       21.8%      215.52 GFLOP/s     32.0%

Table 7.8: Memory and instruction throughput and utilisation - GTX480

Kernel                Runtime   Mem throughput   Mem Util   Instr throughput   Instr Util
sampling_importance   9659 μs   11.32 GB/s       9.0%       12.33 GFLOP/s      1.3%
block_sort            8185 μs   9.54 GB/s        7.6%       37.88 GFLOP/s      12.2%
resampling            7238 μs   11.33 GB/s       9.0%       96.20 GFLOP/s      30.9%

Table 7.9: Memory and instruction throughput and utilisation - GTX280

7.3 Optimisations

The analysis in the previous section indicates that, even for compute-bound kernels, irregular memory transactions resulted in suboptimal performance. In this section we focus our attention specifically on memory optimisations.

7.3.1 Enabling Memory Coalescing

Recall from Chapter 4, where we discussed the technical details of the GPU architecture, that the optimum memory bandwidth can only be achieved if memory accesses follow certain patterns which allow for coalesced memory loads. In this section, we modify the memory layout of the data structures to allow for better coalescing.

Depending on the particular GPU architecture, different requirements have to be met in order for global memory accesses within a (half-)warp to be coalesced. For devices with compute capability 1.3 (e.g. the GTX280), the word size needs to be either 1, 2, 4, 8 or 16 bytes and all the accessed elements within a half-warp need to lie in a single segment (the segment size is 32-128 bytes, depending on the word size).

Coalescing on 2.x devices is quite different because of the L1 and L2 caches. Global memory requests are broken down into 128-byte L1 cache requests. There are no alignment requirements.

The state_data structure which holds the particle data, even with a single joint, is larger than the maximum word size for coalescing on 1.x devices. Therefore, we need to modify the way we store this data. A common technique is to store the data as a Structure of Arrays (SoA) instead of the more obvious Array of Structures (AoS). In this scheme, each member of the original structure is stored in a separate array. All the variables in our model are of the type float (4 B), which is guaranteed to be aligned and therefore suitable for coalesced reads and writes.


Listing 7.1: Particle data stored in a SoA fashion

typedef struct _state {
    float* angles;
    float* x;
    float* y;
    float* vX;
    float* vY;
} state;

One particular thing we need to consider is that, since the angles array is actually two-dimensional, we need to store the angle values in particle order and not in angle order: when all the threads from a single (half-)warp access a certain angle value, these reads must be contiguous for coalescing, so the particle index has to be the fastest-varying dimension. The same reasoning holds for the array holding the generated random values. An access following this layout is sketched below.
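A sketch of such an access (the helper name is illustrative): angle j of all particles is stored contiguously, so consecutive threads read consecutive addresses:

__device__ float load_angle(const state p, int j,
                            int total_particles, int tid)
{
    // particle index varies fastest: unit-stride loads across the warp
    return p.angles[j * total_particles + tid];
}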

Having discussed the SoA structure, we need to remember that, of the three kernels discussed so far, only the sampling_importance() kernel lends itself particularly well to coalescing. In this step, each thread reads its own particle data and eventually writes it back without any communication with other threads. The other two kernels (block_sort() and resampling()), however, have scattered read patterns which are not suited for coalescing.

The results of the experiments with the particle data stored in a SoA fashion are presented in Table 7.10 for the GTX480 and in Table 7.11 for the GTX280.

             Orig       Opt1                 Mem            Mem     Instr            Instr
Kernel       Runtime    Runtime   Speedup    Throughput     Util    Throughput       Util
sampl_imp    5889 μs    809 μs    7.28×      135.20 GB/s    83.7%   207.63 GFLOP/s   15.4%
block_srt    4984 μs    1928 μs   2.59×      40.52 GB/s     25.1%   123.84 GFLOP/s   18.4%
resampling   2329 μs    1463 μs   1.59×      56.07 GB/s     34.7%   343.10 GFLOP/s   51.0%
total        14247 μs   5296 μs   2.69×

Table 7.10: Memory optimisations: speedup, throughput and utilisation - GTX480

             Orig       Opt1                 Mem            Mem     Instr            Instr
Kernel       Runtime    Runtime   Speedup    Throughput     Util    Throughput       Util
sampl_imp    9659 μs    1038 μs   9.31×      105.37 GB/s    81.9%   114.78 GFLOP/s   18.5%
block_srt    8185 μs    6323 μs   1.29×      12.36 GB/s     9.8%    49.04 GFLOP/s    15.8%
resampling   7238 μs    5941 μs   1.22×      13.81 GB/s     11.0%   117.20 GFLOP/s   37.7%
total        26785 μs   14946 μs  1.79×

Table 7.11: Memory optimisations: speedup, throughput and utilisation - GTX280

As expected, the sampling_importance() kernel benefits the most from this particular optimisation. The results show that, in this kernel, we are now utilising most of the available memory bandwidth. In the following section, we examine the ways in which we can optimise the other two kernels.


Listing 7.2: Particle data packed in float4s

typedef struct _state {
    float4* angles1;
    float*  angles2;
    float4* pos;
} state;

7.3.2 Improving Uncoalesced Memory Accesses

The SoA structure discussed in the previous section had only a limited impact on the overall performance of the implementation, as two important kernels have scattered read access patterns.

Recall from the earlier discussion that global memory transactions have a minimum size of 32 bytes. Whenever we are reading scattered floats, we are effectively only using 1/4 of the available bandwidth. One way to reduce this overhead is to store the data in 16-byte chunks, which reduces the overhead to 1/2 of the bandwidth. To this purpose we use the built-in CUDA float4 vector type, although the same could also be achieved with a custom structure and explicit alignment specifiers.

We modify the state_data structure to use float4 for storing the attributes (Listing 7.2). As mentioned in Table 7.2, all the experiments in this chapter have been run with 5 joints for the robotic arm; therefore, we still end up with a single float variable.

             Opt1      Opt2                 Mem            Mem     Instr            Instr
Kernel       Runtime   Runtime   Speedup    Throughput     Util    Throughput       Util
sampl_imp    809 μs    799 μs    1.01×      136.89 GB/s    84.7%   210.22 GFLOP/s   15.6%
block_srt    1928 μs   1924 μs   1.00×      40.61 GB/s     25.1%   124.10 GFLOP/s   18.5%
resampling   1463 μs   1536 μs   0.95×      53.41 GB/s     33.1%   326.79 GFLOP/s   48.6%
total        5296 μs   5312 μs   1.00×

Table 7.12: Uncoalesced memory optimisations: speedup, throughput and utilisation - GTX480

             Opt1       Opt2                 Mem            Mem     Instr            Instr
Kernel       Runtime    Runtime   Speedup    Throughput     Util    Throughput       Util
sampl_imp    1038 μs    1466 μs   0.71×      74.61 GB/s     59.4%   81.27 GFLOP/s    13.1%
block_srt    6323 μs    4172 μs   1.52×      18.73 GB/s     14.9%   74.32 GFLOP/s    23.9%
resampling   5941 μs    2927 μs   2.03×      28.03 GB/s     22.3%   237.88 GFLOP/s   76.5%
total        14946 μs   10161 μs  1.47×

Table 7.13: Uncoalesced memory optimisations: speedup, throughput and utilisation - GTX280

The results of these modifications are presented in Table 7.12 and Table 7.13. These results, together with the earlier optimisations, are also visualised on the roofline models in Figure 7.5 and Figure 7.6. With these modifications, we do not see any improvement on the GTX480 but, as expected, on the GTX280 the block_sort() and resampling() kernels benefit the most from this memory layout.


Figure 7.5: Roofline model for each kernel on the GTX480 (predicted performance together with the baseline, Opt1 and Opt2 measurements for block_sort, resampling and sampling).

However, on the GTX280 we observe a drop in performance for sampling(), where all memory accesses are coalesced. We were able to reproduce this drop in performance on the GTX280 with a simple float4_copy() kernel which only copies float4 values from one array to another: it could only achieve 76% of the bandwidth of a similar kernel which copied 4× floats. This is consistent with the measured bandwidth for the sampling() kernel. This effect, however, occurs neither with the float2 data type nor on the GTX480.

In order to confirm whether irregular memory access patterns are actually causing any performance degradation for block_sort() and resampling(), we modified both kernels to remove all irregular access patterns. We were careful that none of the computations for sorting in the block_sort() kernel, nor the cumulative sum and binary search in the resampling() kernel, were optimised away by the compiler. On the GTX480 this reduces the runtime of block_sort() to 1427 μs, and on the GTX280 to 3440 μs. The runtime of resampling() is reduced to 1051 μs on the GTX480 and to 1634 μs on the GTX280.

These results, first of all, indicate that the performance hit of irregular global memory reads is far greater on the GTX280 than on the GTX480; the sophisticated two-level global memory caching of the Fermi architecture clearly reduces the impact of scattered reads. More importantly, we can conclude from these results that there are other factors contributing to the gap between the theoretical maximum and the achieved throughput for these two kernels. Considering that both these kernels extensively use shared memory for their sorting and parallel scan implementations, it is the most likely place to find potential inefficiencies.

In Chapter 4, we discussed that one of the main issues regarding the use of shared memory is bank conflicts. If threads within a warp access memory locations on the same physical bank, a bank conflict occurs and these requests are serialised. The CUDA profiler from NVIDIA can measure the number of bank conflicts that occur. Running the application through this profiler confirms a large number of shared memory bank conflicts.


Figure 7.6: Roofline model for each kernel on the GTX280 (predicted performance together with the baseline, Opt1 and Opt2 measurements for block_sort, resampling and sampling).

Both the block_sort() and resampling() kernels are affected. Solving this issue is much easier for the parallel scan part, where the access strides are much more regular. We do not, however, investigate this any further.

7.4 Scalability

In the previous sections, we analysed the performance of the various parts of the particle filter algorithm with a fixed particle filter configuration. We now explore the performance of our implementation under varying conditions. We consider scaling in the following three directions: (i) the number of particles per filter, (ii) the number of particle filter blocks, and (iii) the state dimensions. The results presented in this section are obtained on the GTX480, although the same scaling trends can be observed on the GTX280 as well.

7.4.1 Scaling the Number of Particles per Filter

The number of particles in a particle filter plays a crucial role in its estimation quality. In our distributed particle filter design, discussed in Chapter 3, we introduced a two-level hierarchy for the particle population: (i) groups of particles forming a particle filter block, and (ii) a network of smaller particle filters. This design was introduced in order to overcome the inherent limitations of scaling individual particle filters.

The limited amount of shared memory and registers available to the SMs, as well as the limit on the number of threads within each thread block, are architectural constraints restricting the number of threads for each particle filter block. Nevertheless, we examine the performance of our filter implementation with the number of particles ranging from 4 to 512. The number of blocks is 2048.


Figure 7.7: Filter performance with a varying number of particles per block: (a) runtime for each kernel, (b) relative runtimes.

The results presented in Figure 7.7 indicate that the compute-heavy sorting and resampling stages clearly dominate the runtime when the number of particles increases. The particle exchange and global search stages are, as expected, unaffected by the number of particles.

7.4.2 Scaling the Number of Particle Filter Blocks

In contrast to the number of particles per block, scaling the number of blocks is not bound by the same limitations. The usual boundary in this case is the amount of particle data we can store in global memory. For this experiment, we vary the number of blocks from 4 to 2048. The number of particles per block remains 512 for the whole experiment. The total number of particles, therefore, ranges from 2048 to 1048576.

Figure 7.8: Filter performance with a varying number of particle filter blocks: (a) runtime for each kernel, (b) relative runtimes.


Figure 7.8 depicts the results of this experiment. As in the previous section, the compute-heavy sorting and resampling stages dominate the runtime. The results confirm that the runtime scales linearly with respect to the number of particle filter blocks.

7.4.3 Scaling the State Dimensions

The scaling directions discussed in the previous two sections involved filter parameters. There is, however, another important aspect, which is the model-specific part of the filter (the sampling_importance() kernel). How efficiently this can be implemented on the GPU architecture depends on the model itself. The robotic arm application, used throughout this chapter, is particularly well suited for this purpose: by adjusting the number of joints of the robotic arm, we can simulate a smaller or larger problem. In this experiment, the state dimension ranges from 8 (4 joints) to 48 (44 joints).

Figure 7.9: Filter performance with varying state dimensions: (a) runtime for each kernel, (b) relative runtimes.

From the experiment results, presented in Figure 7.9, we can conclude that with an increasing state dimension the sampling_importance() kernel becomes the dominant factor in the overall performance. As the problem becomes more complex, the "filtering" aspect becomes less relevant and the efficiency of the model implementation determines the overall runtime.

7.5 Discussion

In this chapter we examined the performance of our particle filter implementation on two GPU platforms. Using the roofline model, we determined performance upper bounds for each kernel. This enabled us to identify the major performance bottlenecks.

Furthermore, we applied two memory optimisations in order to increase the implementation efficiency. The result of these two optimisations is a speedup of 2.69× for the GTX480 and 2.63× for the GTX280. We analysed the scaling behaviour in three directions: (i) increasing the size of the individual filters in the network, (ii) increasing the number of filters in the network and (iii) increasing the problem size (state dimensions). The size of each particle filter can only increase up to a certain point as a result of platform limitations. The number of particle filters, however, does not suffer from the same limitations; the amount of global memory


is the first platform limit we encounter when scaling up the number of particle filters. The runtime scales linearly with respect to the number of particle filters. With an increasing problem size, the model-specific sampling and weight calculations become the dominant factor in the total runtime. From this we can conclude that as the problem becomes more complex, the efficiency of the model implementation determines the overall runtime.


8  Filter Accuracy

In this chapter we use the robotic arm application described in Chapter 5 in order to examine the behaviour of our particle filter implementation under different scenarios. We are mainly interested in the filter estimation quality, and in the effect of the various model and filter parameters.

First, in Section 8.1, we discuss the overall experiment setup used throughout this chapter. Section 8.2 presents the experiment results for a standard centralised particle filter implementation as a baseline for the other results, while Section 8.3 discusses those of our distributed particle filter implementation on the GPU. We conclude this chapter in Section 8.4 with an analysis of the presented results. Appendix B presents a detailed overview of the results of the experiments of this chapter.

8.1 Filtering Scenario

We use the robotic arm application, described in Chapter 5, for the experiments throughout this chapter in order to examine the filter behaviour under different conditions. The chosen parameters for this model are listed in Table 8.1.

Number of joints                5
State dimension (#joints + 4)   9
Arm length                      0.5 m

Table 8.1: Robotic Arm Parameters

The simulated noise parameters are listed in Table 8.2. All noise terms are chosen to be Gaussian; this, however, is not a restriction imposed by the particle filter.

ω_θ                      (0, 0.1) rad/s
ω_k                      (0, 0.1) rad
ω_uX, ω_uY, ω_x, ω_y     (0, 0.1) m
ω_vX, ω_vY               (0, 0.1) m/s

Table 8.2: Simulation Noise Parameters

The object tracked by the camera follows a lemniscate path. The filter is run at 25 Hz. In order to compare different filter results, we calculate the estimation error 𝑒 according to:

𝑒 = (𝑥 − 𝜇)ᵀ 𝐶⁻¹ (𝑥 − 𝜇)


where 𝑥 is the state estimation vector, 𝜇 the actual state vector and 𝐶 a diagonal matrix with the expected standard deviations of the measurement errors. We consider 𝑒 < 1.0 to be an acceptable estimation error for this experiment.

Figure 8.1: Estimation Error for a Centralised Particle Filter (estimation error [-] versus number of particles, 2 to 16384).
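As a concrete reading of this metric, the sketch below computes 𝑒 for a diagonal 𝐶; the variable names are illustrative and the loop is not taken from the thesis code.

    #include <cstddef>

    // Normalised squared error e = (x - mu)^T C^{-1} (x - mu) for a
    // diagonal C whose entries are the expected measurement standard
    // deviations, so C^{-1} simply divides each squared difference.
    double estimation_error(const double* x, const double* mu,
                            const double* stddev, std::size_t dim)
    {
        double e = 0.0;
        for (std::size_t i = 0; i < dim; ++i) {
            double d = x[i] - mu[i];
            e += d * d / stddev[i];
        }
        return e;
    }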

8.2 Centralised Particle Filter Results

In order to have a reference for the filter performance, we start with a standard Sequential Importance Resampling (SIR) particle filter implementation as depicted in Algorithm . We run this experiment with the number of particles varying between 2 and 16384.
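For readers who want the shape of this baseline in code, here is a minimal SIR step over a toy one-dimensional model. It is a schematic of the algorithm only; the motion and measurement models, noise levels and multinomial resampling shown here are placeholders, not the thesis implementation.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    struct Particle { double x; double w; };

    // One Sequential Importance Resampling iteration: propagate every
    // particle, weight it against the measurement, then resample
    // proportionally to the importance weights.
    void sir_step(std::vector<Particle>& p, double control, double z,
                  std::mt19937& rng)
    {
        std::normal_distribution<double> noise(0.0, 0.1);
        double wsum = 0.0;
        for (auto& pi : p) {
            pi.x += control + noise(rng);           // sampling (prediction)
            double d = z - pi.x;                    // innovation
            pi.w = std::exp(-0.5 * d * d / 0.01);   // importance weight
            wsum += pi.w;
        }
        // Multinomial resampling from the normalised weight CDF.
        std::vector<double> cdf(p.size());
        double acc = 0.0;
        for (std::size_t i = 0; i < p.size(); ++i) {
            acc += p[i].w / wsum;
            cdf[i] = acc;
        }
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::vector<Particle> next(p.size());
        for (auto& ni : next) {
            std::size_t j = std::lower_bound(cdf.begin(), cdf.end(), u(rng))
                            - cdf.begin();
            ni = { p[std::min(j, p.size() - 1)].x, 1.0 };
        }
        p.swap(next);
    }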

Figure 8.1 presents the mean estimation error ± the standard deviation of 100 runs for each particle filter configuration. From these results it is clear that we need at least 4096 particles for an acceptable estimation. Although this sequential particle filter implementation is not optimised, even with 8192 particles it is, on a modern desktop processor, a factor of 100 slower than what is needed for real-time estimation.

8.3 Distributed Particle Filter Results

In this section we will focus on the GPU particle filter implementation discussed in Chapter and Chapter . The following filter parameters, explained in Section 3.4, each influence the filter estimation results in different ways:

• Filter network topology

• Filter network dimension

• Number of exchanged particles

• Resampling frequency


Figure 8.2: Filtering scenario with a low number of particles. (a) Object trajectory (actual versus estimate); (b) overall estimation error over time.

With the exception of the first two parameters, we consider each parameter in isolation in order to gain a better understanding of its implications for the filtering process.
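Gathered into one record, the parameter space explored in the remainder of this section looks as follows; the type and field names are illustrative, not those of the framework.

    // The four filter parameters varied in this chapter.
    enum class Topology { Star, Ring, Torus2D };

    struct FilterConfig {
        Topology topology;       // filter network topology
        int      num_filters;    // filter network dimension
        int      particles;      // particles per filter (m)
        int      exchanged;      // particles exchanged per round (t)
        double   resample_freq;  // resampling frequency (r, from 0.0 to 1.0)
    };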

8.3.1 Filter Convergence

In this section we discuss an experiment designed to examine the filter convergence. As discussed in the experiment setup, the tracked object follows a lemniscate path. We initialise the filter to an incorrect state (one of the foci of the lemniscate). This experiment tests whether the filter converges to the actual path and can follow the object through the curves. The simulated noise terms are as described in Table 8.2. The filter parameters are listed in Table 8.3.
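For reference, a figure-eight test trajectory of this kind can be generated as below. The lemniscate of Gerono and the chosen amplitudes are assumptions for illustration; the thesis does not spell out which lemniscate parameterisation the simulator uses.

    #include <cmath>
    #include <cstdio>

    // Emit a sampled lemniscate (figure-eight) path at the 25 Hz filter
    // rate; the amplitudes roughly match the plotted trajectory extents.
    int main()
    {
        const double a      = 10.0;           // half-width of the figure eight
        const double dt     = 1.0 / 25.0;     // filter period at 25 Hz
        const double two_pi = 6.28318530718;
        for (double t = 0.0; t < two_pi; t += dt) {
            double x = a * std::cos(t);               // lemniscate of Gerono
            double y = 0.5 * a * std::sin(2.0 * t);
            std::printf("%.3f %.3f\n", x, y);
        }
        return 0;
    }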

Figure 8.2 depicts the results of this experiment in a low particle count setting (16 filters each running 16 particles), while Figure 8.3 depicts those for a high particle count (256 filters each running 256 particles). These figures include a visual illustration of the actual and estimated object trajectory as well as the total estimation error (including joint angle measurements). This confirms our earlier results that this particular problem requires a large number of particles for an accurate estimation and, furthermore, validates the filter correctness.


Figure 8.3: Filtering scenario with a high number of particles. (a) Object trajectory (actual versus estimate); (b) overall estimation error over time.


8.3.2 Filtering Frequency

One of our original goals was to enable real-time particle filtering for complex estimation problems. The runtime performance of the particle filter depends not only on the application-specific model update and importance weight calculations, but also on the filter configuration (e.g. number of concurrent filters, network topology). Chapter contains a detailed analysis of the performance of the various parts of the algorithm.

The filter parameters are listed in Table 8.3. The number of particles for each particle filter is fixed at 512, while the number of filters ranges from 4 to 4096.

Figure 8.4 presents the results in the form of the maximum achievable frequency for two GPU platforms. The achievable frequency is well beyond the requirements for this experiment: we can reach 1 kHz frequencies with around 100,000 particles in total.


Filter network topology         Ring
Number of exchanged particles   1
Resampling frequency            1.0

Table 8.3: Particle Filter Parameters

Figure 8.4: Maximum achievable particle filter frequency [Hz] versus total number of particles (4×512 up to 4096×512) on the GTX280 and GTX480.

8.3.3 Filter Network Size and Topology

With centralised particle filters, the total number of particles is the most defining parameter for the estimation quality. This is equally true for distributed particle filters. Complex estimation problems with a large state vector require a huge number of particles in order for the random sampling process to produce particles near the correct state. In this section we perform a number of experiments in order to determine the behaviour of our distributed filter implementation with various filter dimensions.

Recall from Chapter where we proposed a number of possible configurations for particle exchange within the particle filter network. We have observed that the way particles are exchanged strongly affects the filter scaling behaviour. Therefore, we will study the effects of these two parameters together. We have chosen the following network topologies:

1. Star

2. Ring

3. 2D Torus


Figure 8.5: Estimation error with varying particle filter network size and topology. (a) Star network; (b) ring network; (c) 2D torus network. Each panel plots the estimation error [-] against the number of filters (4 to 2048), with one curve per particles-per-filter setting (4 to 512).



The experiment setup is discussed in Section 8.1. The remaining filter parameters are chosen as follows: number of exchanged particles 𝑡 = 1 and resampling frequency 𝑟 = 1.0.

The results of this experiment are presented in Figure 8.5. The presented estimation error is the average error of 100 runs for each configuration from the same simulation trace. From this we can clearly conclude that the star network topology performs the worst. The star network configuration results in a loss of diversity amongst the whole particle population, as the same particles are fed into all filters. This loss of diversity results in a decreased filter performance.

The most interesting observation from both the ring and 2D torus networks is that in all cases, a low number of particles can be compensated for by adding more filters. This validates our strategy of dividing larger particle filters into a network of smaller filters. We also observe that with a low number of filters the ring network outperforms the 2D torus, while a large number of filters favours the 2D torus network. The extra connections present in the 2D torus network allow a faster propagation of more likely particles in larger networks, while resulting in duplicate particles (and the loss of diversity) in smaller networks.
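The connectivity difference is easiest to see in the neighbour sets of the two better-performing topologies. The sketch below is an indexing illustration under the assumption of periodic boundaries and a square filter grid; it is not the exchange kernel from the implementation.

    #include <vector>

    // Neighbour indices a filter exchanges particles with: two in a ring,
    // four in a 2D torus, which is why likely particles spread faster
    // through larger torus networks.
    std::vector<int> ring_neighbours(int i, int n)
    {
        return { (i + n - 1) % n, (i + 1) % n };
    }

    std::vector<int> torus_neighbours(int i, int side)
    {
        const int r = i / side, c = i % side;  // position in the filter grid
        auto wrap = [side](int rr, int cc) {
            return ((rr + side) % side) * side + ((cc + side) % side);
        };
        return { wrap(r - 1, c), wrap(r + 1, c),
                 wrap(r, c - 1), wrap(r, c + 1) };
    }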

8.3.4 Particle Exchange

In this section we will examine the effects of the number of particles exchanged in each round, 𝑡, on the filter behaviour. To this end, we use the same robotic arm experiment with a ring network. The estimation error is calculated in the same fashion. We run this experiment for a number of different values of 𝑡, the number of particles each filter exchanges with its neighbours. In a ring network, each filter receives 2𝑡 particles and sends back 𝑡 particles. In order to prevent neighbouring filters from overriding each other's particles, the number of particles for each filter, 𝑚, should be greater than the total number of particles sent and received: 𝑚 > 3𝑡. Therefore, certain parameter configurations are invalid; for 𝑡 = 4, for example, each filter must hold more than 12 particles, so the 𝑚 = 4 and 𝑚 = 8 configurations are excluded.

Figure 8.6 illustrates the results of this experiment. The benefits of particle exchange are evident when we compare the case where no particles are exchanged with the other results. Exchanging many particles does offer some minor improvements, but exchanging a single particle is generally sufficient for the likely particles to spread across the filter network.

8.3.5 Resampling Frequency

We discussed resampling in Section . , where we mentioned the loss of diversity amongst the particle population as one of its major implications. In order to determine the effects of resampling too often or too little, we run the same experiment from the two previous sections with different resampling frequencies 𝑟, ranging from 𝑟 = 0 (no resampling) to 𝑟 = 1 (always resampling). As with the previous two experiments, we calculate the average estimation error over 100 runs from the same trace for each configuration.
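One plausible mechanisation of 𝑟, assumed here purely for illustration, is a per-iteration random draw; whether the implementation uses such a stochastic trigger or a fixed schedule of one resampling step every 1/𝑟 iterations is not restated in this chapter.

    #include <random>

    // Trigger the resampling kernel with probability r each iteration:
    // r = 0.0 never resamples, r = 1.0 resamples every time (an assumed
    // reading of the "resampling frequency" parameter).
    bool should_resample(double r, std::mt19937& rng)
    {
        return std::bernoulli_distribution(r)(rng);
    }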

The results of this experiment are presented in Figure 8.7. It is quite evident from these results that resampling is an essential step of the whole particle filtering process. When no resampling takes place, adding more particles or filters does not improve the estimation significantly. Even infrequent resampling greatly improves the filter performance. The optimum resampling frequency, however, depends on the filter network size. Lower resampling frequencies favour low particle and low filter count settings (top left corner in the plots), while other combinations favour frequent resampling.


Figure 8.6: Estimation error with varying number of exchanged particles. Panels: (a) 𝑡 = 0, (b) 𝑡 = 1, (c) 𝑡 = 2, (d) 𝑡 = 4, (e) 𝑡 = 8, (f) 𝑡 = 16. Each panel plots the estimation error [-] against the number of filters (4 to 2048), with one curve per particles-per-filter setting; configurations violating 𝑚 > 3𝑡 are omitted.


Figure 8.7: Estimation error with varying resampling frequency. Panels: (a) 𝑟 = 0.0, (b) 𝑟 = 0.2, (c) 𝑟 = 0.4, (d) 𝑟 = 0.6, (e) 𝑟 = 0.8, (f) 𝑟 = 1.0. Each panel plots the estimation error [-] against the number of filters (4 to 2048), with one curve per particles-per-filter setting (4 to 512).


8.4 Discussion

In this chapter we examined various aspects of our particle filter implementation using the robotic arm application discussed in Chapter . Although unusable for any real-time purposes, the centralised particle filter implementation serves as a baseline for the distributed particle filter results.

The presented results confirm our GPU implementation to be suitable for real-time estimation, even for larger problem sizes. Perhaps the most important result of this chapter is that, given a proper filter network topology, a network of particle filters can match (or even outperform) a single large centralised particle filter. Even minimal communication amongst the individual filters is sufficient to spread likely particles throughout the network.

There is, however, no clear optimal configuration. Nevertheless, the general trend we have observed is that in low particle settings infrequent resampling, limited communication and a low connectivity network (e.g. ring) give the best results. High particle settings tend to perform better with a more connected network (e.g. 2D torus), frequent resampling and increased communication.


9 Conclusions and Future Work

Real-time particle filtering, even for estimation problems which require millions of particles, is possible with current-generation consumer-grade GPUs. The results presented in this thesis attest to this. The suitability of the GPU for particular estimation problems also depends on problem-specific characteristics which are orthogonal to the choice of filter type.

In this thesis we have analysed the particle filter, a powerful Monte Carlo based Bayesian estimation technique, in order to find the right amount of parallelism for an efficient implementation on modern GPUs. Based on earlier work of [Simonetto and Keviczky, 2009], we introduced algorithmic changes to the "classical" particle filter which allow the computation to be distributed over separate execution units. Our modifications are not specific to GPUs, and can thus be used for efficient implementations on other multi-core platforms or even clusters.

With our changes to the particle filter algorithm, we introduced a number of parameters which affect the behaviour of the filter. Through a number of experiments, we have been able to quantify the effects of these parameters, both on the estimation quality and on the runtime.

We have analysed the performance of our GPU implementation, both analytically and empirically. Our analytical models are based on calculating performance upper bounds according to the operational intensity of the different kernels in our application. Empirically, we measured the effective utilisation of the GPU resources (i.e. memory bandwidth and instruction throughput) on two GPUs from different hardware generations. Using these results as a guideline for optimisation targets, we managed to increase the efficiency of our implementation through memory optimisations.
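The upper-bound calculation follows the roofline model [Williams et al., 2009]: a kernel's attainable throughput is capped by the lesser of the peak compute rate and its operational intensity times the peak memory bandwidth. A minimal sketch with approximate GTX480-class peak figures follows; the per-kernel intensities themselves are derived in the performance chapter, and the example intensity here is illustrative.

    #include <algorithm>
    #include <cstdio>

    // Roofline bound: attainable GFLOP/s = min(peak compute,
    // operational intensity * peak DRAM bandwidth).
    int main()
    {
        const double peak_flops = 1345e9;  // ~1.34 TFLOP/s single precision
        const double peak_bw    = 177e9;   // ~177 GB/s
        const double oi         = 2.0;     // example kernel: 2 flops per byte
        const double bound = std::min(peak_flops, oi * peak_bw);
        std::printf("attainable: %.0f GFLOP/s\n", bound / 1e9);
        return 0;
    }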

We believe this work can be extended in three directions. The first direction relates to the realisation of our particle filter design on other hardware platforms. These platforms could range from conventional multi-core processors to embedded platforms to grids. Each platform will present new challenges with regard to performance portability.

Another direction for future work is case studies with different types of estimation problems. We expect to gain a better understanding of the particle filter design with data available from more experiments.

Yet another direction for further research is focusing on the implementation efficiency, as there are still a number of kernels underutilising the available resources. This requires a better understanding of the GPU hardware.


Bibliography

[Arulampalam et al., 2002] Arulampalam, M. S., Maskell, S., Gordon, N., and Clapp, T. (2002). A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking. IEEE Transactions on Signal Processing, 50(2):174–188.

[Bashi et al., 2003] Bashi, A. S., Jilkov, V. P., Li, X. R., and Chen, H. (2003). Distributed Implementations of Particle Filters. In Proceedings of the Sixth International Conference of Information Fusion, pages 1164–1171.

[Batcher, 1968] Batcher, K. E. (1968). Sorting Networks and Their Applications. In Proceedings of the 1968 Spring Joint Computer Conference, AFIPS '68 (Spring), pages 307–314, New York, NY, USA. ACM.

[Bolić et al., 2010] Bolić, M., Athalye, A., Hong, S., and Djurić, P. (2010). Study of Algorithmic and Architectural Characteristics of Gaussian Particle Filters. Journal of Signal Processing Systems, 61(2):205–218.

[Bolić et al., 2005] Bolić, M., Djurić, P. M., and Hong, S. (2005). Resampling Algorithms and Architectures for Distributed Particle Filters. IEEE Transactions on Signal Processing, 53(7):2442–2450.

[Box and Muller, 1958] Box, G. E. and Muller, M. E. (1958). A Note on the Generation of Random Normal Deviates. The Annals of Mathematical Statistics, 29(2):610–611.

[Brun et al., 2002] Brun, O., Teuliere, V., and Garcia, J. M. (2002). Parallel Particle Filtering. Journal of Parallel and Distributed Computing, 62(7):1186–1202.

[Demchik, 2011] Demchik, V. (2011). Pseudo-random Number Generators for Monte Carlo Simulations on ATI Graphics Processing Units. Computer Physics Communications, 182(3):692–705.

[Doucet et al., 2000] Doucet, A., Godsill, S., and Andrieu, C. (2000). On Sequential Monte Carlo Sampling Methods for Bayesian Filtering. Statistics and Computing, 10:197–208.

[Fatahalian and Houston, 2008] Fatahalian, K. and Houston, M. (2008). A Closer Look at GPUs. Communications of the ACM, 51(10):50–57.

[Flynn, 1966] Flynn, M. J. (1966). Very High-Speed Computing Systems. Proceedings of the IEEE, 54(12):1901–1909.

[Garland and Kirk, 2010] Garland, M. and Kirk, D. B. (2010). Understanding Throughput-Oriented Architectures. Communications of the ACM, 53(11):58–66.

[Gordon et al., 1993] Gordon, N. J., Salmond, D. J., and Smith, A. F. M. (1993). Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation. Radar and Signal Processing, IEE Proceedings F, 140(2):107–113.

[Harris et al., 2007] Harris, M., Sengupta, S., and Owens, J. D. (2007). Parallel Prefix Sum (Scan) with CUDA. In Nguyen, H., editor, GPU Gems 3, chapter 39. Addison-Wesley Professional.

[Hendeby et al., 2010] Hendeby, G., Karlsson, R., and Gustafsson, F. (2010). Particle Filtering: The Need for Speed. EURASIP Journal on Advances in Signal Processing, 2010.

[Kalman, 1960] Kalman, R. E. (1960). A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME - Journal of Basic Engineering, 82:35–45.

[Khronos Group, 2010] Khronos Group (2010). The OpenCL Specification.

[Kotecha and Djuric, 2003] Kotecha, J. H. and Djuric, P. M. (2003). Gaussian Particle Filtering. IEEE Transactions on Signal Processing, 51(10):2592–2601.

[Lindholm et al., 2008] Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 28(2):39–55.

[Marsaglia, 2003] Marsaglia, G. (2003). Xorshift RNGs. Journal of Statistical Software, 8(14).

[Matsumoto and Nishimura, 1998a] Matsumoto, M. and Nishimura, T. (1998a). Dynamic Creation of Pseudorandom Number Generators. In Proceedings of the Third International Conference on Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, pages 56–69.

[Matsumoto and Nishimura, 1998b] Matsumoto, M. and Nishimura, T. (1998b). Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudo-Random Number Generator. ACM Transactions on Modeling and Computer Simulation, 8(1):3–30.

[NVIDIA, 2006] NVIDIA (2006). NVIDIA GeForce 8800 GPU Architecture Overview. NVIDIA Corporation, Santa Clara, CA, USA.

[NVIDIA, 2010a] NVIDIA (2010a). CUDA C Programming Guide. NVIDIA Corporation, Santa Clara, CA, USA.

[NVIDIA, 2010b] NVIDIA (2010b). CUDA CURAND Library. NVIDIA Corporation, Santa Clara, CA, USA.

[Podlozhnyuk, 2007] Podlozhnyuk, V. (2007). Parallel Mersenne Twister. NVIDIA Corporation, Santa Clara, CA, USA.

[Rosén et al., 2010] Rosén, O., Medvedev, A., and Ekman, M. (2010). Speedup and Tracking Accuracy Evaluation of Parallel Particle Filter Algorithms Implemented on a Multicore Architecture. In 2010 IEEE International Conference on Control Applications (CCA), pages 440–445.

[Saito, 2010] Saito, M. (2010). A Variant of Mersenne Twister Suitable for Graphic Processors. http://arxiv.org/abs/1005.4973.

[Simonetto and Keviczky, 2009] Simonetto, A. and Keviczky, T. (2009). Recent Developments in Distributed Particle Filtering: Towards Fast and Accurate Algorithms. In 1st IFAC Workshop on Estimation and Control of Networked Systems.

[Swerling, 1958] Swerling, P. (1958). A Proposed Stagewise Differential Correction Procedure for Satellite Tracking and Prediction. Technical report, RAND Corporation, Boston, MA, USA.

[Thrun et al., 2005] Thrun, S., Burgard, W., and Fox, D. (2005). Probabilistic Robotics. Intelligent Robotics and Autonomous Agents. The MIT Press.

[West and Harrison, 1997] West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models. Springer Series in Statistics. Springer-Verlag, New York, second edition.

[Williams et al., 2009] Williams, S., Waterman, A., and Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4):65–76.

[Wong et al., 2010] Wong, H., Papadopoulou, M.-M., Sadooghi-Alvandi, M., and Moshovos, A. (2010). Demystifying GPU Microarchitecture through Microbenchmarking. In 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pages 235–246.

[Wulf and McKee, 1995] Wulf, W. A. and McKee, S. A. (1995). Hitting the Memory Wall: Implications of the Obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24.


A Unicycle Robot Localisation Implementation

The unicycle robot localisation problem was introduced in Chapter . The complete implementation of this model for our CUDA-based particle filtering framework is presented below.

Listing A.1: Unicycle Localisation Application

const int NUM_SENSORS = 16;

const float NOISE_VELOCITY         = 0.2;
const float NOISE_ANGULAR_VELOCITY = 0.1;
const float NOISE_GAMMA            = 0.01;

typedef struct _state
{
    float x;
    float y;
    float theta;
} state;

typedef struct _control
{
    float velocity;
    float angular_velocity;
} control;

typedef struct _measurement
{
    float distance[NUM_SENSORS];
} measurement;

// Propagate a single particle through the unicycle motion model.
// d_random points to three pre-generated standard-normal samples.
__device__ void sampling(state* const particle_data,
                         const control control_data,
                         const float* const d_random,
                         const float dt)
{
    const float x     = particle_data->x;
    const float y     = particle_data->y;
    const float theta = particle_data->theta;

    // Perturb the control inputs with the simulated actuation noise.
    const float velocity = control_data.velocity +
        NOISE_VELOCITY * d_random[0];

    const float angular_velocity =
        control_data.angular_velocity +
        NOISE_ANGULAR_VELOCITY * d_random[1];

    const float gamma = NOISE_GAMMA * d_random[2];

    // Exact unicycle kinematics (assumes angular_velocity != 0).
    particle_data->x = x +
        velocity / angular_velocity *
        (sinf(theta + angular_velocity*dt) - sinf(theta));

    particle_data->y = y -
        velocity / angular_velocity *
        (cosf(theta + angular_velocity*dt) - cosf(theta));

    particle_data->theta = theta + angular_velocity*dt + gamma*dt;
}

// Importance weight: Gaussian likelihood of the measured distances to
// the range sensors. d_sensor_position_{x,y,z} are constant-memory
// arrays holding the sensor positions, defined elsewhere in the
// framework.
__device__ float importance_weight(const state* const particle_data,
                                   const measurement measurement_data)
{
    const float x = particle_data->x;
    const float y = particle_data->y;

    const float norm_factor =
        1.0f / powf(2.0f * ((float)M_PI), NUM_SENSORS/2);

    float value = 0;

    for (int i = 0; i < NUM_SENSORS; ++i)
    {
        // Expected distance from the particle position to sensor i,
        // minus the measured distance.
        const float d1 = x - d_sensor_position_x[i];
        const float d2 = y - d_sensor_position_y[i];
        const float d3 =     d_sensor_position_z[i];

        const float vect = sqrtf(d1*d1 + d2*d2 + d3*d3) -
            measurement_data.distance[i];

        value += vect*vect;
    }

    return norm_factor * expf(-value);
}


B Robotic Arm Filtering Results

The results presented here are obtained with the robotic arm application discussed in Chapter . The experiment setup and parameters are presented in Chapter .

B.1 Centralised Particle Filter


Number of particles:   2          4          8          16         32         64         128
μ                      242.4884   229.8312   222.7339   139.313    46.21315   16.99719   5.2109
σ                      217.788    207.014    247.090    177.815    57.767     27.148     4.287

Number of particles:   256        512        1024       2048       4096       8192       16384
μ                      2.486124   1.749702   1.320456   1.017577   0.8061942  0.681618   0.5773173
σ                      1.226      0.602      0.360      0.218      0.149      0.096      0.075

Table B.1: Estimation error for centralised particle filter


B.2 Distributed Particle Filter


Number of particles per filter:
Filters          4          8          16         32         64         128        256        512
   4    μ    227.9524   173.0753   74.24901   27.85484   4.471158   2.110396   1.353138   0.9899719
        σ    207.918    175.547    108.088    69.770     4.331      0.884      0.380      0.274
   8    μ    222.0341   165.2306   60.2252    13.35894   3.02162    1.486802   1.032647   0.7919986
        σ    223.819    144.396    83.377     20.534     2.514      0.556      0.317      0.164
  16    μ    224.4179   172.2834   63.76622   14.04086   2.225712   1.255043   0.8117802  0.6481496
        σ    193.451    178.332    83.083     55.207     1.289      0.459      0.213      0.104
  32    μ    214.5085   165.6956   64.13282   5.006643   1.681571   0.9575123  0.6675079  0.5603728
        σ    167.475    151.601    98.222     5.906      0.889      0.338      0.159      0.094
  64    μ    209.0884   169.2819   42.90776   3.295621   1.29523    0.7818636  0.5961169  0.498006
        σ    224.764    158.555    56.475     2.933      0.462      0.232      0.123      0.072
 128    μ    214.903    146.8859   35.76919   2.110745   1.023817   0.692682   0.5128466  0.4293393
        σ    195.308    146.550    56.555     1.475      0.471      0.188      0.102      0.063
 256    μ    204.1097   130.7894   24.03496   1.921983   0.8180299  0.5975755  0.4474109  0.3841442
        σ    197.087    149.786    41.596     1.343      0.257      0.155      0.086      0.053
 512    μ    200.337    127.8728   19.45182   1.280827   0.7182555  0.4910907  0.403475   0.3469468
        σ    176.888    191.216    42.733     0.558      0.255      0.103      0.074      0.043
1024    μ    199.3808   121.1254   8.217067   1.015018   0.582461   0.4532942  0.3862672  0.3208877
        σ    173.191    129.980    23.422     0.577      0.155      0.098      0.068      0.044
2048    μ    190.4131   115.5985   5.947686   0.8089744  0.5169247  0.391295   0.34602    0.2990087
        σ    202.827    173.571    15.493     0.421      0.127      0.076      0.053      0.037

Table B.2: Estimation error for star network


Number of particles per filter:
Filters          4          8          16         32         64         128        256        512
   4    μ    102.5398   62.50196   34.0523    10.53582   3.00443    1.727579   1.280384   0.957506
        σ    133.782    86.901     76.424     14.836     1.839      0.612      0.385      0.195
   8    μ    38.71661   20.66024   8.169698   3.37399    1.926268   1.384282   0.9588313  0.7770571
        σ    46.364     21.509     6.709      2.257      0.864      0.475      0.195      0.138
  16    μ    17.8298    10.62569   4.550879   2.167783   1.433354   1.03543    0.7995505  0.6614142
        σ    11.624     8.902      4.080      0.977      0.557      0.237      0.137      0.109
  32    μ    11.31487   5.714578   2.939995   1.584928   1.163994   0.8065948  0.6794544  0.5659443
        σ    6.711      4.024      1.650      0.497      0.342      0.137      0.095      0.072
  64    μ    7.539837   3.657734   2.013334   1.189128   0.8561569  0.6632572  0.5694271  0.4966007
        σ    4.148      1.616      0.685      0.322      0.169      0.110      0.067      0.053
 128    μ    5.502683   2.841592   1.509556   0.9354803  0.686675   0.5727136  0.4951787  0.4326917
        σ    2.637      1.238      0.560      0.217      0.138      0.071      0.057      0.043
 256    μ    3.805812   1.958428   1.088706   0.7570641  0.5775182  0.5029114  0.4397668  0.3819917
        σ    1.595      0.818      0.283      0.163      0.075      0.066      0.039      0.039
 512    μ    2.943974   1.510945   0.8394746  0.6115893  0.5046428  0.4367984  0.3818363  0.3439244
        σ    0.980      0.493      0.205      0.100      0.060      0.043      0.029      0.029
1024    μ    2.142178   1.132082   0.7044835  0.5165899  0.4399009  0.3983444  0.3543179  0.3133846
        σ    0.717      0.342      0.148      0.063      0.041      0.039      0.030      0.026
2048    μ    1.601057   0.8341214  0.5756661  0.4530789  0.3974585  0.3564218  0.3246974  0.2911823
        σ    0.561      0.204      0.086      0.049      0.034      0.031      0.026      0.023

Table B.3: Estimation error for ring network


Number of particles per filter:
Filters          4          8          16         32         64         128        256        512
   4    μ    154.575    124.0169   70.97812   22.24618   6.950467   2.149334   1.371315   1.036454
        σ    192.636    159.728    111.622    34.030     12.642     1.136      0.515      0.249
   8    μ    70.73076   62.1337    32.89054   8.205879   2.445901   1.471483   0.9980664  0.7907979
        σ    102.110    145.064    52.159     7.703      1.618      0.617      0.310      0.177
  16    μ    20.97834   16.71037   8.571457   3.577263   1.395995   1.003476   0.7978615  0.6503318
        σ    16.383     21.835     8.565      4.152      0.539      0.313      0.228      0.108
  32    μ    9.07093    5.217487   3.675169   1.520823   1.016102   0.7578095  0.6224237  0.5358646
        σ    7.748      5.410      4.443      0.731      0.380      0.169      0.103      0.073
  64    μ    3.462152   2.687894   1.392487   1.008158   0.7404839  0.6173912  0.5190756  0.4568045
        σ    3.460      3.535      1.324      0.324      0.197      0.110      0.065      0.052
 128    μ    1.737855   1.262515   0.9536437  0.7441259  0.588911   0.5301779  0.4602782  0.4103854
        σ    0.855      0.495      0.315      0.194      0.128      0.075      0.055      0.045
 256    μ    1.144598   0.8779246  0.7322191  0.5869851  0.5037322  0.4387583  0.4047316  0.3638851
        σ    0.391      0.244      0.159      0.096      0.069      0.059      0.042      0.028
 512    μ    0.885039   0.7029075  0.5769665  0.502267   0.4334683  0.3967512  0.3625083  0.3282409
        σ    0.239      0.125      0.082      0.071      0.050      0.038      0.030      0.030
1024    μ    0.7553266  0.6104205  0.505615   0.4353883  0.3897054  0.3634693  0.3317855  0.3007419
        σ    0.177      0.096      0.069      0.049      0.039      0.034      0.027      0.027
2048    μ    0.6287862  0.5112716  0.4342013  0.3893534  0.3559662  0.332184   0.3072191  0.282078
        σ    0.116      0.066      0.046      0.043      0.030      0.031      0.025      0.019

Table B.4: Estimation error for 2D torus network
