
IMPERIAL COLLEGE LONDON

DEPARTMENT OF COMPUTING

Analog Vision - Neural Network Inference Acceleration using Analog

SIMD Computation in the Focal Plane

Author: Matthew Wong

Supervisors: Prof. Paul Kelly

Dr. Sajad Saeedi

Submitted in partial fulfillment of the requirements for the MSc degree in Computing Science of Imperial College London

September 2018


Abstract

Deep Convolutional Neural Networks (CNNs) have revolutionised the field of computer vision in recent years, achieving and even exceeding human-level performance on key vision tasks. Yet current deep learning implementations are often computationally and energy intensive, requiring powerful systems to support their deployment. This places computer vision capabilities out of reach of a vast range of potential applications, particularly in the fields of robotics, embedded systems, and always-on devices.

This thesis presents a high-speed, energy-efficient CNN architecture utilising the capabilities of a unique class of devices known as Focal Plane Sensor Processors (FPSPs). We introduce novel techniques to convert standard neural network layers to FPSP code, and demonstrate a method of training networks to increase their robustness to the effects of hardware noise. We then showcase a successful implementation of a CNN on the SCAMP-5 FPSP hardware, and demonstrate an estimated 85% reduction in inference time and 84% improvement in energy efficiency over existing state-of-the-art implementations, achieving handwritten digit recognition at >90% accuracy at 3000 fps, using only 0.5 mJ per recognised digit.


Acknowledgments

I would like to thank the following people, without whom this project would not have been possible:

• Prof Paul Kelly at Imperial College London for introducing me to this project and for inspiring me with his limitless enthusiasm.

• Dr Sajad Saeedi at Imperial College London for always believing in the potential of the project and for providing me with such committed and dedicated supervision.

• Dr Stephen Carey and Dr Jianing Chen at the University of Manchester for graciously hosting me in their lab and for answering my numerous questions about the workings of the SCAMP-5.

• Thomas Debrunner at Imperial College London, for sharing access to his FPSP code generator, which was an indispensable element in the success of this project.

• Ong Wai Hong at Imperial College London, for always being willing to offer advice on mathematical questions.


A Note on Spelling

In general, British English spelling has been used in this thesis. However, for technical terms where American English spelling is widespread in the literature, we have adopted that spelling rather than the British English equivalent. The most prominent cases, which occur repeatedly over the course of this thesis, are "Analog", "Regularization", and "Binarization".


Contents

1 Introduction

2 Background
  2.1 Neural Networks
    2.1.1 Early Work
    2.1.2 Deep Learning
    2.1.3 Convolutional Neural Networks
    2.1.4 Network Training
  2.2 Focal-Plane Sensor-Processors
    2.2.1 Hardware Architecture - SCAMP5 Vision Chip
    2.2.2 Performance
    2.2.3 Vision Algorithms
    2.2.4 Code Generation
  2.3 Related Work
    2.3.1 Software
    2.3.2 Hardware
    2.3.3 Analog Computing

3 Focal Plane Vision
  3.1 FPSP Simulator
    3.1.1 Motivation
    3.1.2 Approach
    3.1.3 Implementation
  3.2 Custom Regularization
    3.2.1 Motivation
    3.2.2 Approach
    3.2.3 Implementation
    3.2.4 Results
  3.3 Focal Plane Net
    3.3.1 Network Design
    3.3.2 Code Generation Approach
    3.3.3 On-Simulator Implementation
    3.3.4 Results
  3.4 Noise-In-The-Loop Training
    3.4.1 Motivation
    3.4.2 Approach
    3.4.3 Implementation
    3.4.4 Results
    3.4.5 Hardware Recommendations

4 AnalogNet
  4.1 Contribution
  4.2 Network Design
    4.2.1 Input Binarization
    4.2.2 Filter Execution
    4.2.3 Filter Result Binarization
    4.2.4 Readout and Pooling
    4.2.5 Dense Layer Computation
  4.3 Noise-in-the-loop Training
    4.3.1 Direct Noise Incorporation
  4.4 Network Implementation
    4.4.1 Experimental Set-Up
    4.4.2 Augmented MNIST Training
  4.5 Performance Evaluation
    4.5.1 Evaluation Methodology
    4.5.2 MNIST Test Set Performance

5 SCAMP-5 Hardware Analysis
  5.1 Contribution
  5.2 SCAMP-5 Error Analysis
    5.2.1 Systematic Error Types
  5.3 Systematic Noise Model
    5.3.1 Objective
    5.3.2 Model Formulation
    5.3.3 Implementation
    5.3.4 Results
  5.4 Random Noise Modelling
    5.4.1 Approach
    5.4.2 Implementation
    5.4.3 Results

6 Conclusion
  6.1 Contribution
  6.2 Future Work
    6.2.1 Software Research
    6.2.2 Hardware Research

Appendices

A Ethics Checklist

B Ethical and Professional Considerations

Bibliography



Chapter 1

Introduction

The successes of neural networks and deep learning have become increasingly prominent in recent years, both within the literature and in the broader media. These technologies have been at the heart of key innovations in fields ranging from machine translation (Cho et al., 2014) to music synthesis (Mor et al., 2018) and human-computer interaction (Goode, 2018). Perhaps nowhere has this been more apparent than in the field of computer vision, which has been revolutionised by the arrival of deep learning (Rawat and Wang, 2017). Problems that had been the subject of decades of research have been solved within the span of a few short years, and every day that passes seems to bring a range of potential new applications.

Yet, for all their successes, today's deep neural networks suffer from what has been termed the 'inference efficiency' problem (Google, 2018). Essentially, while these networks perform extremely well when running on specialised hardware such as GPUs, they are in many cases not lightweight, fast, or energy efficient enough to be effectively deployed for real-time applications on less powerful hardware (Nvidia, 2017). This has limited the range of potential applications, and has put many key innovations out of the reach of areas where their availability could be of great value.

The challenge today is to find new ways of allowing for the deployment of neural networks wherever they may be needed, ranging from autonomous vehicles on city streets to environmental monitoring systems in tropical rainforests. This will require a combination of hardware and software innovations. On the hardware front, computing chips capable of running neural networks at low power and/or high frame rates could open up a myriad of new use cases, for example in embedded and robotic systems. Similarly, software innovations could allow neural networks to be adapted for and implemented on a range of both new and existing hardware, paving the way for a vast array of potential new applications.

The overall aim of this project was to develop and implement an ultra-fast, energy-efficient Convolutional Neural Network architecture on a class of devices known as Focal-Plane Sensor-Processors (FPSPs). Unlike regular imaging sensors, FPSPs are capable of conducting a significant amount of image processing directly in the focal plane, thereby reducing the need for image data to be transferred to a separate processing device. The particular focus of this project's development work was the SCAMP-5, an experimental FPSP capable of processing images at very low power while sustaining extremely high frame rates (1,000-100,000 fps) by making use of analog, rather than digital, computation.

The work carried out for this thesis comprised two major areas of inquiry. The first concerned research into the methods and techniques required to successfully implement a neural network on a generic FPSP, given the unique characteristics of these devices and the difficulties inherent in programming them (such as pixel-parallel programming and non-trivial amounts of hardware noise). To our knowledge, there had not previously been a successful effort to carry out neural network inference on an FPSP, necessitating the independent development of methods suitable for achieving this goal. To this end, several novel methods are proposed that together make the implementation of neural networks on an FPSP possible.

The second area of inquiry focused on applying those techniques towards the implementation of a CNN on the SCAMP-5 hardware. While the techniques developed formed a valuable base to work from, the challenges of working with actual hardware in a physical environment necessitated additional innovations. A SCAMP-5 CNN implementation, AnalogNet, is introduced which demonstrates large improvements in inference speed and energy efficiency as compared to present state-of-the-art inference solutions. To our knowledge, this marks the first time a CNN has been successfully implemented on the SCAMP-5, as well as the first time neural network inference has been implemented using analog computation in general.

As a result of the successful conclusion of those lines of inquiry, we are able to present a high-speed, energy-efficient CNN architecture utilising the unique capabilities of FPSPs. We introduce novel techniques to convert standard neural network layers to FPSP code, and demonstrate a method of training networks to increase their robustness to the effects of hardware noise. We then showcase a successful implementation of a CNN on the SCAMP-5 FPSP hardware, and demonstrate an estimated 85% reduction in inference time and 84% improvement in energy efficiency over existing state-of-the-art implementations, achieving handwritten digit recognition at >90% accuracy at 3000 fps, using only 0.5 mJ per recognised digit.

This thesis is organised as follows:

• Chapter 1, Introduction describes the motivation for this project, as well as an overview of the research ultimately carried out.

• Chapter 2, Background provides a primer on Neural Networks and Focal-Plane Sensor-Processors (FPSPs), and a review of related work being carried out towards the goal of inference acceleration for neural networks.

• Chapter 3, Focal Plane Vision introduces the methods developed to facilitate the successful implementation of neural networks on FPSPs, and demonstrates the feasibility of doing so using a simulated FPSP implementation.

• Chapter 4, AnalogNet: An On-Device Implementation describes the successful implementation of a neural network on the SCAMP-5 hardware, and provides details of key performance statistics.

• Chapter 5, SCAMP-5 Hardware Analysis describes an analysis of the hardware characteristics of the SCAMP-5, and presents a comprehensive noise model for the device comprising a Systematic Error Model and a Random Error Model.

• Chapter 6, Conclusion considers the contributions made by this project to the broader literature, and outlines promising avenues of future research.


Chapter 2

Background

This section surveys two strands of literature that are at the heart of this project: the first being the literature on neural networks, and the second being the literature on Focal-Plane Sensor-Processors (in particular the literature on the SCAMP-5 Vision System).

It also provides an overview of related work in the field of Neural Network Inference Acceleration, detailing work being carried out towards this goal using software, hardware, and analog computing approaches.

2.1 Neural Networks

Neural Networks were first developed in the 1950s and early 1960s by researchers drawing inspiration from the workings of the human brain. These early neural networks had limited successes, and it has only been in recent years that great advances have been made in the field.

2.1.1 Early Work

One of the first neural network implementations was known as the Perceptron (Rosenblatt, 1957). In a perceptron network, each neuron had the following activation function:

f(x) = 1 if ∑_{i=0}^{n} w_i x_i > 0, and −1 otherwise    (2.1)

In other words, if the sum of all incoming connections (weights × inputs + bias) to a neuron was positive, the neuron output would be 1; otherwise, it would be −1 (Mitchell, 1997).
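To make this concrete, here is a minimal sketch of the perceptron activation of Eq. 2.1 in Python; the weight and bias values are illustrative choices of our own (not from the original perceptron literature), picked so the unit realises the logical AND function:

```python
import numpy as np

def perceptron_output(weights, bias, x):
    """Eq. 2.1: output +1 if the weighted sum (plus bias) is positive, else -1."""
    return 1 if np.dot(weights, x) + bias > 0 else -1

# Hypothetical weights realising logical AND over inputs in {0, 1}.
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(w, b, np.array(x)))  # fires (+1) only for (1, 1)
```

Only the input (1, 1) drives the weighted sum (2.0 − 1.5 = 0.5) above zero, so only that case outputs 1.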


The perceptron had some success, as it was demonstrated that it was able to successfully classify linearly separable examples, including those represented by the logical functions AND, OR and NAND (Petridis, 2018).

However, at the time, there were no known techniques for training multi-layer networks; this, together with the insufficient computing power then available to researchers, limited the practical applicability of perceptrons and other early neural network implementations. It has only been in recent years that neural networks have undergone a dramatic revival.

2.1.2 Deep Learning

Deep learning simply involves the use of neural networks with a very large number of hidden layers. Today's state-of-the-art deep neural networks usually have hundreds of hidden layers, thousands of neurons, and millions of weights and parameters.

Deep neural networks rose to prominence in 2012 with the release of AlexNet, which successfully outperformed every other entry in that year's ImageNet challenge (Krizhevsky et al., 2012). This marked the start of a renaissance in neural networks, with advances in deep learning having come at a rapid pace since then.

AlexNet was a Convolutional Neural Network, a type of neural network especially well-suited to image processing tasks. Given that this project is primarily concerned with computer vision, it is expected that our research will make extensive use of CNNs.

There is also a range of other deep neural networks, such as Generative Adversarial Networks and Recurrent Neural Networks, each of which is particularly well-suited to different types of applications. These networks have been used in everything from artistic style transfer (Zhu et al., 2017) to speech recognition (Graves et al., 2013). It is, however, beyond the scope of this project to consider these architectures in greater detail.

2.1.3 Convolutional Neural Networks

As noted above, Convolutional Neural Networks (CNNs) are a class of Deep Neural Networks that are particularly well-suited to working with images.¹ Like all DNNs, CNNs are comprised of multiple layers of neurons, each of which has a collection of trained weights corresponding to its respective inputs. The distinguishing feature of CNNs is their use of one or more convolutional layers, which are themselves comprised of multiple convolution filters. Unlike in a standard feedforward network, where each pixel is processed in isolation, convolutional layers give CNNs the ability to identify spatial features such as lines, edges, and corners, giving them a significant advantage over standard feedforward networks in image processing tasks.

¹ Information in this section primarily from Karpathy (2018)

CNNs are generally composed of three types of layers: Convolutional layers, Pooling layers, and Fully-Connected layers (also known as Dense layers). Most CNN architectures tend to alternate between convolutional and pooling layers, before ending with one or more fully-connected layers. In this section, we will consider the operation of each of these layers in some depth.

Convolutional Layers

Convolutional layers form the core of CNNs; they are responsible for most of the computational heavy lifting and form the bulk of the layers in most CNNs.

The convolution filter forms the basis of the convolutional layer, with each convolutional layer containing multiple convolution filters. These convolution filters are based on the earlier computer vision concept of convolutions.

Convolutions have previously been used in computer vision in the form of convolution kernels, which are matrices used to apply effects to images. For example, the following kernel can be used to sharpen an image:

 0 −1  0
−1  5 −1
 0 −1  0

Figure 2.1 shows the result of the application of a convolution filter to an input image. As the convolution filter is moved over the input image, an elementwise multiplication is carried out between the filter and the values in each 3 × 3 window, with the sum of those products forming the corresponding output of the convolution operation.
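As a sketch of this operation (the function name and the toy 5 × 5 input are our own illustrative choices), a 'valid' convolution of a 5 × 5 image with the 3 × 3 sharpen kernel shown above produces a 3 × 3 output, as in Figure 2.1:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a k x k kernel over the image ('valid' padding): elementwise
    multiply each window by the kernel and sum the products."""
    k = kernel.shape[0]
    h, w = image.shape
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

sharpen = np.array([[0, -1, 0],
                    [-1, 5, -1],
                    [0, -1, 0]])
image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 input
result = conv2d_valid(image, sharpen)
print(result.shape)  # (3, 3)
```

Because the sharpen kernel's weights sum to 1 and the toy input is a linear ramp, each output value equals the centre pixel of its window here; on a real image the kernel amplifies local contrast.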


Figure 2.1: Effect of a 3 × 3 Convolution Filter applied to a 5 × 5 input

Unlike the fixed values in a convolution kernel, the values in the convolution filters are variable and can be adjusted during training. As network training proceeds, the network learns suitable values for each convolution filter using the principles of gradient descent. Once training is completed, the network has essentially produced a series of convolution kernels which can be used at run time to compute the results of the network.

Figure 2.2: Sample convolution filters learnt in the model trained by Krizhevsky et al. (2012)

From empirical observations, it appears that convolution filters are often used by the network to learn basic spatial features such as lines, edges, and visual patterns (see Figure 2.2), which can then be further processed by subsequent layers of the network.

Pooling Layers

Pooling layers are a means of reducing the size of the inputs to subsequent layers of the network. This has the advantage of reducing the number of parameters that need to be maintained and calculated at subsequent network layers. Consider, for example, a 28 × 28 image. Were an input of this size to be connected directly to a fully-connected layer, each neuron in the FC layer would have 784 inputs with 784 distinct weights (+1 bias term). If, however, pooling with a 2 × 2 window size is employed, the size of the input would be reduced to 14 × 14; each neuron would then have only 196 inputs with 196 distinct weights.

There are two main types of pooling, max pooling and average pooling:

When using max pooling, a pool window is iterated over the input, with the greatest value in the pool window taken as the post-pooling value (Figure 2.3).

Figure 2.3: Max pooling using a 2 × 2 pool window

When using average pooling, rather than taking the greatest value, the average of the values in the pooling window is taken as the post-pooling value (Figure 2.4).

Figure 2.4: Average pooling using a 2 × 2 pool window
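Both pooling variants can be sketched in a few lines of NumPy (a hedged illustration; the function name and the reshape trick are our own). The same code also reproduces the 28 × 28 → 14 × 14 reduction discussed above:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping size x size pooling with stride = size,
    as in Figures 2.3 (max) and 2.4 (average)."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]  # trim any ragged edge
    h, w = x.shape
    # Split into (rows of blocks, block rows, cols of blocks, block cols).
    blocks = x.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

a = np.array([[1., 3., 2., 4.],
              [5., 7., 6., 8.],
              [9., 2., 1., 0.],
              [3., 4., 5., 6.]])
print(pool2d(a, mode="max"))            # [[7. 8.] [9. 6.]]
print(pool2d(a, mode="avg"))            # [[4.  5. ] [4.5 3. ]]
print(pool2d(np.zeros((28, 28))).shape)  # (14, 14)
```

Each 2 × 2 block of the input collapses to a single value, which is why pooling a 28 × 28 input yields 14 × 14, i.e. 196 values, as in the parameter count above.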

Fully-Connected Layers

Fully-connected layers are normally used as the final layers of a CNN; unlike CNNs, regular neural networks are comprised solely of fully-connected layers. In a fully-connected layer, every neuron is connected to all activations from the previous layer (hence the term fully-connected); we see this in Figure 2.5, where every neuron in Layer B is connected to all the activations emanating from Layer A.


Figure 2.5: Schematic of a Fully-Connected Layer (Layer B)
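A fully-connected layer reduces to a single matrix-vector product plus a bias vector. A minimal sketch (the sizes are our own illustrative choices, matching the 14 × 14 pooled input discussed in the pooling example):

```python
import numpy as np

def dense(x, W, b):
    """Fully-connected layer: every output neuron sees every input activation."""
    return W @ x + b

rng = np.random.default_rng(0)
x = rng.standard_normal(196)         # e.g. a flattened 14x14 pooled feature map
W = rng.standard_normal((10, 196))   # 10 neurons, each with 196 distinct weights
b = rng.standard_normal(10)          # plus one bias term per neuron
print(dense(x, W, b).shape)          # (10,)
```

The weight matrix W makes the "fully-connected" structure explicit: row j holds the 196 weights connecting neuron j to every activation of the previous layer.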

2.1.4 Network Training

Before neural networks can be used at run time, they need to be trained; this is the process of setting all the weights and biases to values that ultimately minimise the overall error of the neural network.

Forward and Backward Passes

The forward and backward passes are key concepts in the training of neural networks.

During the forward pass, the input is presented to the neural network, and the corresponding network output is then computed (N.B. this is the same process that occurs at run time, where the neural network returns a prediction based on the given input). The network's output is compared to the ground truth value for that input, and a corresponding network loss is calculated using a chosen loss function, based on the difference (if any) between the output and the ground truth.

During the backward pass, weights are adjusted with the objective of minimizing network loss. This is done using the principle of Gradient Descent and is implemented in multilayer networks using the Backpropagation algorithm.

This cycle is repeated until it is determined that an optimal amount of training has been conducted.


Gradient Descent²

Gradient descent is a search algorithm that seeks to find a minimum of a given function. When applied to a neural network, the aim is to find a local (or, ideally, global) minimum across a high-dimensional search space.

Intuitively, gradient descent operates by computing the slope of the error function with respect to each weight in the network, and updating the weight in a direction such that the error 'rolls' down the slope. In other words, for each weight w_i (for a particular input), given a corresponding error E, we compute:

∆w_i = −n (dE/dw_i)

where n is a constant chosen as the learning rate. We can then update w_i according to the rule w_i = w_i + ∆w_i, and the weight will then have been updated in a direction which helps reduce the overall network error.
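A one-dimensional sketch of this update rule (the quadratic error function and learning rate are our own illustrative choices): minimising E(w) = (w − 3)², whose gradient is dE/dw = 2(w − 3):

```python
eta = 0.1  # learning rate (the constant n in the text)
w = 0.0    # arbitrary starting point

for _ in range(100):
    grad = 2 * (w - 3)   # dE/dw for E(w) = (w - 3)^2
    w += -eta * grad     # the update rule: w = w + (-n * dE/dw)

print(round(w, 4))  # converges to the minimum at w = 3.0
```

Each step moves w opposite to the gradient, so the error 'rolls down' the parabola towards its minimum.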

Backpropagation³

Backpropagation is the main way in which modern multilayer neural networks are trained. Intuitively, backpropagation trains a neural network by determining the degree to which each particular weight contributed to the total error, and using this metric to decide how much (and in what direction) each weight should be adjusted during each backward pass. The backpropagation algorithm is specified as follows:

Forward Pass: propagate the input forward through the network

Step 1: For the current input, compute the output o_u of every unit u

Backward Pass: propagate the errors backward through the network

Step 2: For each network output unit k, calculate its error term δ_k, where o_k represents the output of the unit and t_k represents the target output for that unit:

δ_k = o_k (1 − o_k)(t_k − o_k)    (2.2)

Step 3: For each hidden unit h, calculate its error term δ_h:

δ_h = o_h (1 − o_h) ∑_{k ∈ outputs} w_{kh} δ_k    (2.3)

Step 4: Update each network weight w_ji:

w_ji = w_ji + ∆w_ji    (2.4)

where ∆w_ji = n δ_j x_ji, and n is the learning rate    (2.5)
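The four steps above can be sketched for a tiny sigmoid network; the architecture (2 inputs, 2 hidden units, 1 output), the logical-OR training data, and the learning rate are our own illustrative choices, not from the thesis:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_h = rng.uniform(-0.5, 0.5, (2, 2)); b_h = np.zeros(2)  # hidden layer
W_o = rng.uniform(-0.5, 0.5, 2);      b_o = 0.0          # output unit
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 1.0])  # targets for logical OR
n = 0.5                             # learning rate

for _ in range(5000):
    for x, t in zip(X, T):
        # Step 1 (forward pass): compute every unit's output.
        o_h = sigmoid(W_h @ x + b_h)
        o_k = sigmoid(W_o @ o_h + b_o)
        # Step 2 (Eq. 2.2): error term of the output unit.
        d_k = o_k * (1 - o_k) * (t - o_k)
        # Step 3 (Eq. 2.3): error terms of the hidden units.
        d_h = o_h * (1 - o_h) * W_o * d_k
        # Step 4 (Eqs. 2.4/2.5): each weight changes by n * delta * input.
        W_o += n * d_k * o_h;        b_o += n * d_k
        W_h += n * np.outer(d_h, x); b_h += n * d_h

preds = [round(float(sigmoid(W_o @ sigmoid(W_h @ x + b_h) + b_o))) for x in X]
print(preds)
```

After training, the rounded network outputs match the OR targets [0, 1, 1, 1]; the biases are trained with the same delta terms, treating them as weights on a constant input of 1.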

² Information in this section primarily from Mitchell (1997)
³ Information used and equations shown in this section primarily from Mitchell (1997)


Regularization⁴

Regularization is a technique usually used to ameliorate the problem of overfitting during network training. Overfitting is a phenomenon that occurs when a machine learning algorithm 'learns' its training data too closely but is unable to generalise to unseen examples, resulting in a situation where it has extremely high training accuracy but low validation and test accuracy.

Regularization usually comes in two forms: L1 regularization and L2 regularization. Both of these techniques work by adding an additional term to the network's loss function which penalises large weights and incentivises smaller ones; it is generally thought that networks with smaller weights are less likely to be overfitted.

L1 regularization modifies the loss function in the following way, where E_0 is the original error function, λ is an arbitrary constant, and w is the value of each weight in the network:

E = E_0 + λ ∑_{all weights} |w|    (2.6)

As can be seen, L1 regularization adds an additional error proportional to the sum of the absolute values of all network weights. Essentially, the larger the weights in the network, the greater the additional error.

L2 regularization modifies the loss function slightly differently. As before, E_0 is the original error function, λ is an arbitrary constant, and w is the value of each weight in the network:

E = E_0 + 0.5 λ ∑_{all weights} w²    (2.7)

Similar to L1 regularization, L2 regularization also adds an additional error; this error, however, is proportional to the sum of the squares of all network weights. As with L1 regularization, the larger the weights, the greater the additional error.

The effect of introducing this additional error is to incentivise the training process tofavour smaller, rather than larger, network weights. To see how this occurs, considerthe derivative of the new error function for L2 regularization with respect to eachindividual weight, where E is the new error function and w is the particular weightthat the error function is being differentiated with respect to:

dE/dw = dE0/dw + λw    (2.8)

Employing the principle of gradient descent, the current weight will be adjusted in the following way, where n is the learning rate:

∆w = −n dE0/dw − nλw    (2.9)
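As a concrete illustration of equation (2.9), the following toy sketch performs a single gradient-descent step on one weight under L2 regularization. All numbers here are made up for the example; `n` is the learning rate and `lam` the regularization strength:

```python
# One gradient-descent step on a single weight with L2 regularization,
# following eq. (2.9). Values are illustrative only.
n, lam = 0.1, 0.01
w = 2.0
dE0_dw = 0.5                      # gradient of the original loss E0 at w
dw = -n * dE0_dw - n * lam * w    # eq. (2.9)
w = w + dw
print(w)                          # ~1.948: the -n*lam*w term pulls w toward 0
```

Note that the extra `-n*lam*w` term shrinks the weight towards zero on every step, which is exactly the "preference for smaller weights" described above.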

⁴ Information used and equations shown in this section are primarily from Petridis (2018).


As we can see, as training proceeds, both E0 and w will be minimised; which of the two will be minimised to a greater degree is determined by the size of λ. If λ is small, we prefer to minimise E0, while if λ is large, we prefer small weights over minimising E0.

In general, by modifying the loss function, regularization provides a means by which the network training process can be influenced. This implication is of particular importance to this project, as shall be further elaborated upon below.

2.2 Focal-Plane Sensor-Processors

Focal-Plane Sensor-Processor (FPSP) chips are a special class of imaging devices in which the sensor arrays and processor arrays are embedded together on the same silicon chip (Zarandy, 2011). Unlike traditional vision systems, in which sensor arrays send collected data to a separate processor for processing, FPSPs allow data to be processed in place on the imaging device itself. This unique architecture enables ultra-fast image processing even on small, low-power devices, because costly transfers of large amounts of data are no longer necessary. As this project will focus primarily on implementing a Convolutional Neural Network on the SCAMP-5 Vision Chip, the rest of this discussion will be centred on the particular architecture and capabilities of the SCAMP-5.

2.2.1 Hardware Architecture - SCAMP5 Vision Chip

The SCAMP-5 Vision Chip is a Focal-Plane Sensor-Processor (FPSP) developed at the University of Manchester (Carey et al., 2013a). The chip comprises 65,536 Processing Elements (PEs) integrated in a 256 × 256 imager array (see Figure 2.6). Each individual PE includes a photodetector (pixel) and a processor (ALU, registers, control, and I/O circuits).

Processor instructions are common across the device, with each individual PE executing the common instructions on its own local data. Each PE also has an activity flag, which can be set as required, allowing for some degree of local autonomy by specifying instructions to be carried out only by selected PEs. These flags can therefore be used to implement conditional operations when necessary. Instructions are received from a microcontroller attached to the chip, which sends a sequence of 79-bit instruction words determining the algorithm to be executed. Instructions are executed simultaneously across the PE array, allowing them to be rapidly completed in parallel.

Each PE comprises 6 analog registers (A-F) and 13 digital registers (R0-R12). A key distinguishing feature of the SCAMP-5 is that, unlike almost all mainstream processors today, arithmetic operations are carried out by the analog registers. These operations, including summation, subtraction, division and squaring, are implemented using analog current-mode circuits and are able to operate directly on the analog pixel values without a need for analog-to-digital conversion.

Figure 2.6: Schematic of the architecture of the SCAMP-5 Vision Chip (Carey et al., 2013a)

Unique Considerations

While the use of analog registers allows the SCAMP-5 to achieve levels of performance that would not normally be possible, it also introduces various errors to computations that would not be present in digital architectures (Carey et al., 2013a).

To illustrate the errors introduced by working with analog registers, consider the operation of copying a value from register B to register A at position i, j:

Ai,j = Bi,j + k1 Bi,j + k2 + εi,j(t) + δi,j

k2 is a fixed error that can be corrected with a constant error correction operation. However, k1, the signal-dependent component of the error, cannot be corrected. Given a clock of 10 MHz, and a nominal register range of 0 to 100, k1 is 0.07. εi,j(t) represents the random error associated with a register transfer (RMS value of 0.09 averaged across the array), and δi,j is the error due to fixed pattern noise, i.e. a constant error specific to the location and registers being copied (0.05 averaged across the array).
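The copy-error model above can be sketched numerically as follows. This is our own illustration, not the SCAMP-5's internal circuitry: we plug in the reported constants (k1 = 0.07; ε with RMS 0.09 per transfer; δ drawn once as fixed-pattern noise with RMS 0.05), and assume k2 has already been corrected away:

```python
import numpy as np

# Sketch of the register-copy error model A = B + k1*B + k2 + eps + delta.
# delta is drawn once and reused for every copy (fixed-pattern noise),
# while eps is resampled on each transfer (random noise).
k1, k2 = 0.07, 0.0          # k2 assumed corrected by a constant offset
rng = np.random.default_rng(0)
delta = rng.normal(0.0, 0.05, size=(256, 256))

def copy(B):
    eps = rng.normal(0.0, 0.09, size=B.shape)
    return B + k1 * B + k2 + eps + delta

B = np.full((256, 256), 50.0)
A = copy(B)
print(A.mean())             # ~53.5: the uncorrectable 7% gain error dominates
```

The mean of the copied array lands near 50 × 1.07 rather than 50, showing why the signal-dependent term k1 is the troublesome one.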

Table 2.1 gives the results of performing some example computations on the SCAMP-5. As can be seen, some operations can be performed reasonably accurately, while others incur a significant error.


Operation        Expected Result   SCAMP-5 Result   Error (%)
neg(-80)         80                85               6.25
div2(20)         10                14               40
add(30, 50)      80                81               1.25
sub(-80, 100)    -180              -128             28.8

Table 2.1: Example computations performed on the SCAMP-5

2.2.2 Performance

The fully-parallel interface coupled with the use of analog registers for arithmetic operations has allowed the SCAMP-5 to achieve superior outcomes on key performance metrics, particularly in terms of frame rate and power consumption.

The SCAMP-5 architecture allows for the transfer of a complete image frame from the image sensor array to the processor array in one clock cycle (100 ns), which equates to a sensor processing bandwidth of 655 GB/s (Martel and Dudek, 2016). This allows for the implementation of vision algorithms at extremely high frame rates which are simply unattainable with traditional architectures. For example, Carey et al. (2013a) demonstrated an object-tracking algorithm running at 100,000 fps. On the other hand, when operating at lower frame rates, the SCAMP-5 can function at ultra-low power consumption rates. Carey et al. (2013b) demonstrated a vision system capable of carrying out loiterer detection, which operated continuously at 8 fps for 10 days powered by three standard AAA batteries.
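The quoted bandwidth figure can be sanity-checked from the numbers in the paragraph, assuming one byte per pixel value for the estimate:

```python
# 256 x 256 pixel values moved in one 100 ns clock cycle
# (one byte per pixel value assumed for this back-of-envelope estimate).
pixels = 256 * 256
bytes_per_transfer = pixels * 1          # 65,536 bytes per frame transfer
bandwidth = bytes_per_transfer / 100e-9  # bytes per second
print(bandwidth / 1e9)                   # ~655 GB/s, matching the cited figure
```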

These superior performance characteristics have positioned the SCAMP-5 as an ideal device for implementing vision algorithms in low-power embedded computing systems (Martel and Dudek, 2016).

2.2.3 Vision Algorithms

A variety of high-level vision algorithms serving different functions have been successfully implemented on the SCAMP-5. J. Chen et al. (2017) implemented a parallelized FAST16 corner detection algorithm on the SCAMP-5; their implementation was able to extract relevant features at 2300 fps. Debrunner (2017) implemented the Viola-Jones face detection algorithm on the SCAMP-5, demonstrating significant energy savings as compared to running the algorithm on contemporary CPUs. Martel et al. (2018) introduced a method for obtaining depth information from a scene in real-time using a focus-tunable liquid lens in conjunction with an algorithm running on the SCAMP-5; they noted that their results would not have been possible using a conventional camera and a central processor, as doing so would have incurred a prohibitive communication overhead.

2.2.4 Code Generation

Debrunner (2017) introduced a code generator which could automatically generate SCAMP-5 instructions computing the result of convolution kernels populated with arbitrary values.

For example, given the following convolution kernel:

−1 −1 −1
−1  8 −1
−1 −1 −1

the code generator produced the following sequence of SCAMP-5 instructions:

Listing 2.1: Sample SCAMP-5 instructions produced by the Code Generator

add(B, A, A);
neg(A, A);
add(B, B, B);
add(B, B, B);
east(C, A);
add(B, B, C);
west(C, A);
add(B, B, C);
south(C, A);
north(A, A);
add(A, A, C);
east(C, A);
add(B, B, C);
add(B, A, B);
west(A, A);
add(A, B, A);

Prior to this, SCAMP-5 instructions for the computation of convolution kernels had to be written manually. The development of the code generator was thus an important step in paving the way for the implementation of high-level vision algorithms on the SCAMP-5, allowing future researchers to build upon this ability in order to implement more sophisticated algorithms.
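To see that the generated instruction sequence really computes the kernel, Listing 2.1 can be replayed in numpy, modelling each register as a whole array and each neighbour operation as a zero-filled shift (the shift helpers and the neighbour-direction convention here are our own sketch, not the SCAMP-5 API; the kernel is symmetric, so either direction convention gives the same result):

```python
import numpy as np

# Zero-filled neighbour reads: each PE takes the value of its neighbour,
# and PEs on the array edge receive 0.
def east(x):  out = np.zeros_like(x); out[:, :-1] = x[:, 1:];  return out
def west(x):  out = np.zeros_like(x); out[:, 1:]  = x[:, :-1]; return out
def north(x): out = np.zeros_like(x); out[:-1, :] = x[1:, :];  return out
def south(x): out = np.zeros_like(x); out[1:, :]  = x[:-1, :]; return out

rng = np.random.default_rng(0)
A = rng.integers(-10, 10, size=(6, 6)).astype(float)
A0 = A.copy()

# Replay Listing 2.1 step by step (A, B, C are whole-array registers).
B = A + A          # add(B, A, A)
A = -A             # neg(A, A)
B = B + B          # add(B, B, B)
B = B + B          # add(B, B, B)
C = east(A)        # east(C, A)
B = B + C          # add(B, B, C)
C = west(A)
B = B + C
C = south(A)
A = north(A)
A = A + C          # add(A, A, C)
C = east(A)
B = B + C
B = A + B          # add(B, A, B)
A = west(A)
A = B + A          # add(A, B, A)

# Direct evaluation of the 8*centre-minus-neighbours kernel, zero padded.
expected = 8 * A0 - (east(A0) + west(A0) + north(A0) + south(A0)
                     + east(north(A0)) + east(south(A0))
                     + west(north(A0)) + west(south(A0)))
assert np.array_equal(A, expected)
```

Working the sequence through by hand gives the same picture: B accumulates 8 copies of the original image while the negated neighbour reads subtract each of the 8 surrounding pixels once.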

2.3 Related Work

The effort to create new light-weight, low-latency, energy-efficient neural network implementations has led to the development of various software and hardware solutions to the problem of Neural Network Inference Acceleration. In this section, we will consider the various approaches being pursued, and examine some representative examples of these approaches.

2.3.1 Software

The first broad approach taken to Neural Network Inference Acceleration has been to search for software solutions that allow state-of-the-art networks to be run on existing, less-powerful hardware (e.g. mobile devices).

Light-weight Architectures

There has been much recent interest in the development of smaller and more efficient neural network architectures, as such architectures would vastly expand the reach of neural networks, allowing for their use in applications that were previously out of reach due to resource constraints.

A particularly prominent example of this approach has been MobileNet (Howard et al., 2017). MobileNet introduced a class of light-weight neural network architectures specifically designed for use on smartphones and other mobile devices.

MobileNet's key innovation was the development of two global hyperparameters that allowed model developers to make trade-offs between latency (how long the network takes to return a result) and accuracy by altering the size and complexity of the resulting architecture. This allowed network architectures to be customised to the resource constraints of a particular application, allowing model developers to achieve optimal performance across a range of applications.

Simplified Computations

An alternative approach to creating lightweight neural networks has been to find ways to simplify the computations necessary to evaluate the result of the neural network at run-time. Standard neural network implementations represent their inputs and weights using 32-bit floating point numbers. While using 32-bit floating point arithmetic is an easy way to preserve accuracy at train-time (especially when training is conducted on GPUs optimised for such calculations), it also results in poor inference efficiency at run-time when these models are run on less powerful devices (Google, 2018). Simpler and more efficient computations would therefore allow for an increase in the computation rate at run-time and thus an increase in inference efficiency.

One example of this approach was the training of Quantized Neural Networks (QNNs), as described by Hubara et al. (2016). QNNs are neural networks with weights and activations represented at extremely low precisions (e.g. using only 1 to 4 bits). By using low-precision representations, QNNs were able to replace most arithmetic operations with bit-wise operations, significantly increasing computation speed while decreasing power consumption. Even so, Hubara et al. were able to achieve accuracy results similar to those of 32-bit networks.
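The core idea of restricting weights to a few representable levels can be sketched with a simple uniform quantizer. This is a toy illustration only, not Hubara et al.'s training procedure, which quantizes during the forward pass while keeping full-precision weights for the gradient update:

```python
import numpy as np

def quantize(w, bits=2):
    """Uniformly quantize weights in [-1, 1] to 2**bits levels (toy sketch)."""
    levels = 2 ** bits - 1
    w = np.clip(w, -1.0, 1.0)
    return np.round((w + 1) / 2 * levels) / levels * 2 - 1

w = np.array([-0.83, -0.1, 0.07, 0.52, 0.99])
print(quantize(w, bits=2))   # every weight snaps to one of 4 levels
```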

2.3.2 Hardware

While software innovations have allowed for the use of existing hardware in network inference, there is a limit to the level of efficiency that can be reached by using such hardware. Consequently, there has also been a parallel drive towards hardware innovation, particularly to develop hardware devices specifically customised for implementing computer vision and neural network algorithms.

Vision Processing Units

The recent breakthroughs in computer vision have led to a proliferation of specialised hardware devices specifically designed to run Convolutional Neural Networks and other related computer vision algorithms. These include the Intel Movidius Neural Compute Stick, the Microsoft Holographic Processing Unit and the Mobileye EyeQ. These devices are collectively referred to as Vision Processing Units, although they encompass a wide range of different approaches and ideas. At their core, however, they are all working towards the same goal: the design of hardware that is capable of efficiently (in terms of both speed and power consumption) implementing run-time neural networks. Other hardware devices, such as Google Tensor Processing Units, are aimed at optimising network training; such devices are beyond the scope of this report.

One such hardware device is Eyeriss, an accelerator chip for deep CNNs developed at the Massachusetts Institute of Technology by Y.-H. Chen et al. (2016), providing support for a large number of convolutional layers, millions of filter weights, and varying shapes (filter sizes, number of filters and channels). Eyeriss achieves this at a hardware level by minimizing data movement through the exploitation of data reuse, as well as by using data statistics to minimize energy use through zero skipping/gating to avoid unnecessary reads and computations. Eyeriss was able to run the convolutions in AlexNet at 35 fps with 278 mW power consumption, a level of energy efficiency comparable to that which has been achieved on the SCAMP-5.

Dynamic Vision Sensors

An alternative hardware approach has concentrated on entirely rethinking the input upon which computer vision algorithms operate. As seen above, CNNs generally operate on entire images, taking an image as input and producing relevant output such as classification or detection data. This is in line with the approach taken by traditional cameras and vision sensors, which output full image frames at fixed intervals.

In contrast, Dynamic Vision Sensors (DVS), such as the event cameras developed by Tobi Delbruck's group (Lichtsteiner et al., 2008), output asynchronous events at microsecond resolution, with an event generated every time a pixel value changes beyond a specified threshold (Scaramuzza, 2015). Unlike in a traditional vision sensor, no intensity information is provided beyond the binary data that an event has occurred. A DVS can operate with extremely low latency and at extremely low power. As with the SCAMP-5, this raises the possibility of exciting new applications. For example, Maqueda et al. (2018) demonstrated how event cameras could be used in conjunction with an adapted CNN architecture to predict a vehicle's steering angle.

However, while the SCAMP-5 requires new ways of thinking about computational problems, the overall goal is still to implement pre-existing vision algorithms (e.g. image classification CNNs) more efficiently. In contrast, the lack of pixel intensity information means that DVSs simply cannot implement pre-existing vision algorithms, and instead require entirely new vision algorithms (including specially adapted neural networks) which could perhaps be used in entirely new applications.

2.3.3 Analog Computing

While analog computing remains an often ignored field, it has been enjoying a revival in recent years, with researchers applying it profitably to problems such as biological simulations and the solving of differential equations. This has also been true in the deep learning space, where a handful of researchers have been exploring the potential for using analog computing techniques to accelerate both inference and training.

Analog for Inference

The most notable contribution made to utilising analog techniques in neural network inference has been the RedEye image sensor developed at Rice University (Likamwa et al., 2016). Likamwa et al. developed a design for an image sensor architecture that moved early sections of vision processing to the analog domain, resulting in a sensor with the capability to compute convolutional layers in the analog domain. They demonstrated that such a design would result in significant reductions in energy consumption by reducing the energy burden of analog readout to the host system. However, while they were able to validate their claims using a simulated circuit, they did not provide an actual hardware implementation. This project therefore seeks to build upon their work by implementing a similar analog computation paradigm on an actual hardware device.

Analog for Training

Apart from using analog computation to optimise neural network inference, work has also been carried out on using analog techniques to accelerate network training. Ambrogio et al. (2018) at IBM demonstrated a method for the training of deep neural networks using analog memory that was able to achieve accuracies equivalent to GPU-based systems. They are currently exploring the development of prototype chips employing this technology, and have calculated that their implementation exceeds today's GPUs in speed and energy efficiency by two orders of magnitude.


Chapter 3

Focal Plane Vision

In this chapter, we present Focal Plane Vision, a set of techniques allowing us to harness the unique capabilities of a class of devices known as Focal-Plane Sensor-Processors (FPSP). By doing so, we are able to lay the foundations for hardware implementations providing vast improvements in inference speed, power consumption and data bandwidth during the execution of key computer vision tasks.

This chapter details our efforts to design and build a Convolutional Neural Network ready for implementation on a Focal-Plane Sensor-Processor:

• Section 3.1 describes the development of an FPSP simulator which we could use in the process of developing our CNN.

• Section 3.2 explains a method for training CNN weights in a specific way allowing for seamless implementation on an FPSP.

• Section 3.3 outlines the design of our Focal Plane CNN and demonstrates its simulated performance.

• Section 3.4 presents a method for increasing the resilience of a CNN to the computational noise inherent in most FPSPs.

3.1 FPSP Simulator

In this section, we describe the development of a Python-based custom simulator replicating the core functionality of a Focal-Plane Sensor-Processor. Such a simulator provided us with a key tool needed in the development of new FPSP algorithms, by allowing for the testing and debugging of algorithms before they were implemented on the actual analog device.


3.1.1 Motivation

While the developers of the SCAMP-5 created an official simulator for that device (J. Chen, 2018), we found that their implementation was not suited to the particular needs of our development process. Most significantly, while their simulator attempted to model some (but not all) hardware effects, they did not provide a way for these effects to be disabled. Such a feature was particularly important for our purposes because accurate simulated computation (i.e. without noise) would allow us to verify the correctness of our algorithms by comparing our computed results against those produced by standard neural network implementations (e.g. TensorFlow).

3.1.2 Approach

The general approach taken was to build a simulator that would replicate the core capabilities of FPSPs in general, rather than developing a complete replica of the SCAMP-5. This was an important distinction to make: while the SCAMP device family has a large number of unique features, our overall aim was to develop a vision paradigm that could ideally be run on any FPSP.

Core Instruction Set

To this end, we identified two sets of four instructions each that we saw as representing the fundamental capabilities of an FPSP:

• Neighbour Operations: Neighbour operations are fundamental to the FPSP design philosophy and are what ultimately set FPSPs apart from any other computation device. Specifically, we included support for all four cardinal directions.

• Basic Arithmetic: These comprised fundamental arithmetic operations, in particular addition, subtraction, negation and division-by-2. Notably excluded from this instruction set was multiplication, which analog FPSPs (including the SCAMP-5) tend to lack support for.

The core instruction set is listed in Table 3.1. These instructions provide a fully functional FPSP simulator, replicating the core functionality that one would expect to find in any FPSP.

Expanded Instruction Set

Beyond the core instruction set, we also developed an expanded instruction set that implemented selected SCAMP-5-specific features. These primarily focused on instructions for working with simulations of the SCAMP-5's digital registers and FLAG register. The FLAG register is a digital register used to determine whether a processing element will be active; if a processing element's FLAG register is set to 1, it will execute instructions, otherwise it will simply ignore instructions issued.


Neighbour Operations    Basic Arithmetic
north(x)                add(x, y)
south(x)                sub(x, y)
east(x)                 div2(x, y)
west(x)                 sneg(x, y)

Table 3.1: FPSP Simulator Core Instruction Set

Activity Instructions   Logic Operations
WHERE(A)                AND(A)
ALL()                   OR(A)
                        NOT(A)

Table 3.2: FPSP Simulator Expanded Instruction Set

We identified another two sets of instructions which were required for the expanded instruction set:

• Activity Instructions: These instructions controlled the value of the FLAG register, thereby determining which processing elements were active.

• Logic Operations: These comprised fundamental logic operations for use with digital registers, in particular AND, OR and NOT.

The expanded instruction set is listed in Table 3.2.

Noise Model Support

In addition to replicating the core functionality of FPSPs, we also needed the simulator to provide an option for replicating one of their key drawbacks: noise and other hardware effects.


Figure 3.1: Example of an FPSP edge case

We designed our simulator to allow hardware effects either to be represented by a complete noise model, or for individual effects to be separately investigated as required. For example, the accuracy of division for small values could be varied to determine the potential benefit of developing hardware with more accurate division capabilities. This allowed us to compare results obtained when computation was performed with or without varying degrees of noise.
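The per-effect toggling described above can be sketched as follows. The function name, parameters and thresholds are illustrative only, not the simulator's actual API; the point is that each effect is controlled by a parameter that defaults to "off", so the noiseless path is exact:

```python
import numpy as np

# A div2 whose error for small inputs can be dialled up or down,
# independently of any other modelled hardware effect.
def div2(x, small_value_error=0.0, threshold=20.0, rng=None):
    rng = rng or np.random.default_rng(0)
    result = x / 2.0
    if small_value_error > 0:
        small = np.abs(x) < threshold    # division degrades for small inputs
        noise = rng.normal(0, small_value_error, size=x.shape)
        result = result + np.where(small, noise, 0.0)
    return result

x = np.array([5.0, 50.0])
assert np.array_equal(div2(x), x / 2)    # noise disabled: exact halving
```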

Edge Cases

A particular point of note when working with FPSPs is the handling of (literal) edge cases, in which one has to decide how processing elements on the edge of the focal plane handle certain operations.

Consider, for example, a processing element (PE) X located on the left-most edge of the focal plane (Fig. 3.1). If a west() instruction is issued, the value of X will be moved to Y. The issue, of course, is that there is no PE to the west of X. Different FPSPs may handle such situations differently.

In the case of our simulator, we decided to adopt the approach taken on the SCAMP-5, in which such edge cases simply result in the affected PE obtaining a value of 0. This behaviour is also particularly convenient with regard to the implementation of convolutions, because it is functionally equivalent to the 'zero-padding' option available in standard convolution implementations.
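The zero-fill convention is easy to express in numpy; the sketch below (the function name and neighbour-direction convention are ours) shows a west() whose edge column receives 0, exactly mirroring zero-padding:

```python
import numpy as np

# Each PE takes its western neighbour's value; PEs on the western edge
# have no neighbour and receive 0, as on the SCAMP-5.
def west(reg):
    out = np.zeros_like(reg)
    out[:, 1:] = reg[:, :-1]
    return out

r = np.arange(9.0).reshape(3, 3)
print(west(r))    # column 0 is zero-filled
```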

3.1.3 Implementation

The simulator itself was implemented in Python. The FPSP processor array was initially represented using Python lists, and in later versions of the simulator as a numpy array for more efficient computation.

The simulator was configured to accept a set of test images as input, allowing for the evaluation of standardized benchmarks such as the MNIST test set. This, for example, allowed us to demonstrate that (without noise) our algorithms running on an FPSP produced the same results as a Keras/TensorFlow implementation running on a CPU/GPU. Given that the simulator was created primarily for internal research purposes, the primary mode of user interaction was through the terminal and the source code.

Furthermore, our implementation was developed with debugging purposes in mind. As each analog and digital 'register' was implemented simply as a numpy array, a user could choose to examine the value of any coordinate point on any register at any point during the computation. This proved extremely valuable when investigating the behaviour of newly developed algorithms.

3.2 Custom Regularization

In this section, we introduce a novel technique for training Convolutional Neural Networks that facilitates their implementation on Focal-Plane Sensor-Processors. We do this by developing a custom regularization method which limits the expression of weight values to an arbitrary set of numerical values, and show that this can be achieved with a negligible impact on overall network accuracy.

Due to the inherent nature of FPSPs, a crucial limitation is the limited processing capability of each individual processing element. This severely limits the scope of mathematical operations that they are capable of executing, and poses particular difficulties for the computation of neural networks. While alternative computation techniques exist, there has been no method for training a neural network capable of making use of these techniques. Our method bridges that gap, providing a way to train a CNN such that it can utilise the method developed by Debrunner (2017).

3.2.1 Motivation

Implementing a neural network on the SCAMP-5 posed a set of unique computational challenges due to the chip's unconventional design philosophy.

Neural networks are normally implemented using 32-bit floating point numbers, which are used to represent weights and intermediate values; in particular, a neural network implementation necessarily makes extensive use of floating-point multiplication in computing the final result of the network. However, the SCAMP-5 has no support for floating-point operations. In fact, as noted in Chapter 2, a SCAMP-5 programmer does not even have access to a multiplication operation, let alone the high-precision multiplication neural networks expect.

Faced with this problem, Debrunner (2017) introduced a method that allowed one to approximate the result of a floating-point multiplication operation to arbitrary precision using a sequence of repeated division and addition operations. While Debrunner's method may theoretically allow multiplication to be conducted to whatever level of precision might be necessary, the inaccuracy of SCAMP-5 computations means that multiplication capabilities remain severely limited. The key limitation is due to the fact that a result of greater precision requires a greater number of division operations to be carried out, and as input values get smaller, division on the SCAMP-5 becomes less accurate (refer to the noise analysis carried out in Chapter 5 for more details). This constraint means that in an actual implementation the maximum approximation depth that can reasonably be sustained is 3, corresponding to multipliers in the set M = {0, 0.125, −0.125, 0.25, −0.25, ...}. In other words, available multiplication options are limited to b × m where b ∈ Z, b ∈ [−128, 127] and m ∈ M.
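A depth-3 multiplication of this kind can be sketched as follows. This is our own illustration of the idea, not Debrunner's exact algorithm: three halvings (div2) produce b/2, b/4 and b/8, and the multiplier, expressed as a signed number of eighths, selects which terms to add:

```python
# Approximate b * m, for m a signed multiple of 1/8, using only the
# halving, addition and negation operations an analog FPSP supports.
def mul_by_eighths(b, eighths):
    """Return b * (eighths / 8), e.g. eighths=3 -> b * 0.375."""
    half = b / 2            # div2
    quarter = half / 2      # div2
    eighth = quarter / 2    # div2 (approximation depth 3)
    terms = {4: half, 2: quarter, 1: eighth}
    result = 0.0
    n = abs(eighths)
    for bit, value in terms.items():
        if n & bit:
            result += value                      # add
    return -result if eighths < 0 else result    # neg

assert mul_by_eighths(64, 3) == 64 * 0.375   # 24.0
assert mul_by_eighths(64, -2) == -16.0
```

Each extra bit of precision would require one more division, which is exactly why the noise of small-value division caps the usable depth at 3.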

Given that a neural network's weights routinely vary across a wide range of representable numbers, the computational limitations outlined above posed a serious and non-trivial challenge standing in the way of the successful implementation of a neural network on an FPSP.

3.2.2 Approach

The key question to be solved was therefore that of how a neural network could be trained such that its weights only took on values at specific intervals.

Our approach to solving this problem drew inspiration from the concept of regularization, a commonly employed technique used to prevent over-fitting during the training of neural networks.

As described in Chapter 2, regularization reduces the chance of over-fitting by limiting the size of individual network weights. This is done by adding an additional term to the original loss function, which increases the total loss based on the size of the network's individual weights. This effectively incentivises the selection of smaller weights during training, as the larger the size of an individual weight, the more it contributes to the overall loss.

We noticed that a similar approach could be taken to incentivise the selection of an arbitrary set of weight values during training. The key was to find the right regularization function. The appropriate regularization function for the task could be formulated in the following terms:

• Let V be the arbitrary set of weight values which we seek to select during training, e.g. V = {0, 0.25, −0.25, 0.5, −0.5, ...}.

• Let R(x) = |f(x)| be the candidate regularization function, and let M be the set of minimum points of R such that M = {m : R(m) = 0}.

• The appropriate regularization function would be any R for which M = V, such that using R in training results in a set of trained weights W = {w : w ∈ V}.


In other words, the appropriate regularization function was one in which the locations of the function's minimum points would coincide with our desired set of weight values.

3.2.3 Implementation

The following section describes the implementation of our approach to custom regularization. We first explain how we formulated the appropriate regularization function, and then detail how that function was used within an overall network training procedure.

Custom Regularization Function

We observed that a function with minimum points at specific intervals could be best represented by a periodic function. To this end, we decided that the cosine function would be an appropriate choice for our custom regularizer.

The cosine function is given by:

f(x) = cos(x) (3.1)

To create our target regularization function, we needed to apply a sequence of suitable transformations.

We first observed that the function took on minimum values of −1 rather than 0. This could be easily changed by performing a vertical translation:

f(x) = cos(x) + 1 (3.2)

The next important observation was that the function had a period ω = 2π, which also meant that its minimum points occurred at intervals of 2π. For the function f(x) = cos(Ax), the period ω is given by the following formula:

ω = 2π/A (3.3)

By modifying the value of A and scaling the function accordingly, we were able to set the period (and therefore the intervals of the minimum points) to any arbitrary interval. To obtain a period ω = 0.25, we therefore used a value of A = 8π ≈ 25.132, obtaining the following function:

f(x) = cos(25.132x) + 1 (3.4)

Finally, we observed that while the intervals were correct, the sequence was not centred on 0. We therefore needed to apply a final horizontal translation, which gave us our regularization function:

f(x) = cos(25.132(x − 0.125)) + 1 (3.5)


Figure 3.2: Graphical representation of f(x) = cos(25.132(x − 0.125)) + 1

Fig. 3.2 gives the graphical representation of our regularization function.

A general form of the regularization function for any given interval u can be expressed with the following formula:

f(x) = cos((2π/u)(x − u/2)) + 1 (3.6)
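A quick numerical check of Equation 3.6 can be sketched in plain Python (the function name is ours):

```python
import math

def interval_regularizer(x, u=0.25):
    # R(x) = cos((2*pi/u) * (x - u/2)) + 1: zero at every multiple of u,
    # and maximal (R = 2) midway between intervals
    return math.cos((2 * math.pi / u) * (x - u / 2)) + 1
```

For u = 0.25 this reduces to Equation 3.5, with minima at 0, ±0.25, ±0.5, and so on.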

Regularized Re-Training

Once the appropriate regularization function R for the required interval had been determined as per the steps above, the new loss function to be used in the training loop could be defined as follows, where L0 was the original loss function:

L = L0 +R (3.7)

We found that the most effective way of utilising our regularizer was to employ it in a re-training loop, carried out on a network which had first been trained without a regularizer. Doing so allowed us to first train a network using the full range of numerical expression available, and thereafter fine-tune that training by re-training with our regularization function.
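The loss of Equation 3.7 can be sketched as follows, with the penalty of Equation 3.6 summed over the convolutional weights (`lam` is a hypothetical weighting coefficient not specified in the text):

```python
import numpy as np

def regularized_loss(original_loss, conv_weights, u=0.25, lam=1.0):
    # L = L0 + R: the periodic penalty of Equation 3.6, summed over all
    # convolutional weights, added to the original loss
    w = np.asarray(conv_weights, dtype=float)
    penalty = np.sum(np.cos((2.0 * np.pi / u) * (w - u / 2.0)) + 1.0)
    return original_loss + lam * penalty
```

Weights that sit exactly on a multiple of u contribute nothing to the loss, so gradient descent is pulled towards those values.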

Note that in line with the approach described below in Section 3.4 and Section 4.2, in which convolutional layers were computed on the FPSP while dense layers were computed on a microcontroller/CPU, only convolutional layers were subjected to custom regularization during the re-training loop in the experiments in this section. However, if required, as in Section 3.3, the custom regularizer could also be applied to dense layers in addition to the convolutional layers.

The final step in the procedure was to round the convolutional layer weights to the nearest appropriate interval. While the regularizer would generally succeed in training weights that were extremely close to the required intervals (see Fig. 3.4), it could not produce exact values. We therefore needed to perform a final rounding operation so that we had an exact value suitable for use on an FPSP.

Algorithm 1 Custom Regularization

 1: procedure STANDARD TRAINING(weights)
 2:     loop
 3:         train network using standard loss function L0
 4:     end loop
 5:     return weights
 6: end procedure
 7: procedure RE-TRAINING(weights)
 8:     loop
 9:         train convolutional layers using regularized loss function L = L0 + R
10:         train dense layers using standard loss function L0
11:     end loop
12:     return weights
13: end procedure
14: procedure ROUNDING(weights)
15:     round all convolutional layer weights to nearest interval u
16:     return weights
17: end procedure

The full training procedure is outlined in Algorithm 1.
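The rounding step can be sketched in one line (function name ours):

```python
def round_to_interval(w, u=0.25):
    # Snap a trained weight to the nearest multiple of the interval u,
    # giving an exact value suitable for FPSP code generation
    return round(w / u) * u
```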

3.2.4 Results

The following section presents the results of the custom regularization procedure when applied to a CNN trained on the MNIST dataset. The targeted regularization interval in this training instance was u = 0.25.

Weight Distributions

Figure 3.3 shows the distribution of the values of convolutional layer weights after the completion of standard training. As is clear from the histogram, the weights took on a wide range of different values, making use of the full range of numerical expression available.

Figure 3.4 shows the distribution of weight values in the network's convolutional layer after the completion of re-training using the custom regularization function. Unlike in the previous histogram, we see that the weights were no longer evenly distributed. Instead, all weights were clustered around the key numerical intervals as desired, indicating the success of the regularization procedure.


Figure 3.3: Histogram of conv layer weights after standard training

Figure 3.4: Histogram of conv layer weights after regularized re-training


Figure 3.5: Histogram of conv layer weights after final rounding

Procedure Stage                             | Test Accuracy
Standard Training                           | 92.82%
Direct Rounding                             | 74.84%
Custom Regularization followed by Rounding  | 92.90%

Table 3.3: Network test accuracy when rounding convolutional layer weights to an interval of u = 0.25, with and without custom regularization

Figure 3.5 shows the distribution of values after the weights from Figure 3.4 were rounded to the nearest interval u. We see that there is little difference between the two histograms, except that after rounding, weights no longer fall on both sides of each interval step but rather fall precisely on the interval.

Network Test Accuracy

Table 3.3 compares the effectiveness of custom regularization and direct rounding (i.e. where convolutional filter weights are simply rounded to the nearest interval u) against a standard baseline. We see that when direct rounding was used, the network demonstrated a significant reduction in test accuracy. When using our custom regularization technique, however, test accuracy remained almost exactly the same as the baseline value.

The set of convolutional layer weights successfully trained using this procedure would later be used in the development of the SCAMP-5 hardware implementation described in Chapter 4.


Figure 3.6: Standard Architecture and Workflow of a Basic CNN

3.3 Focal Plane Net

This section outlines the work undertaken towards the development of Focal Plane Net, a Convolutional Neural Network designed for implementation on an FPSP. Due to their unique design, standard CNNs cannot simply be run on FPSPs; instead, a customised solution is required.

3.3.1 Network Design

We sought to design a network that could execute the full forward pass of a Convolutional Neural Network entirely in the focal plane, with the only data transferred off-plane being the result of the classification process.

While this may not be the best network design for a specific implementation (e.g. in Chapter 4 we design a network that performs convolutions in the focal plane and subsequently transfers those results off-plane), the aim was to demonstrate the feasibility of performing the entire computation in the focal plane, with the corollary being that differing amounts of computation could be performed in the focal plane as appropriate for specific implementations.

Fig. 3.6 shows the architecture of a standard CNN in its most basic form. As can be seen, image data is first transferred from the focal plane to an off-plane processing device, at which point the result of each layer is calculated in sequence. The architecture comprises a single convolutional layer with multiple convolutional filters, followed by a pooling layer, and finally a dense layer.


Figure 3.7: Architecture of Focal Plane Net

Fig. 3.7 shows the architecture of Focal Plane Net. There are a number of key observations to be made vis-à-vis that of a standard CNN:

• Data Processing Location: The image is no longer directly transferred to a processing device. Instead, the capabilities of FPSPs are leveraged to conduct processing directly on the focal plane. It is only when focal plane processing is complete that relevant data is transmitted to an off-plane processing device.

• Layer Weights: There are no longer network layers which take weights as input. Instead, the weights in each layer have been used to generate specialised blocks of code which are designed to perform the requisite task on an FPSP.

In the following section, the approach taken in generating each of the relevant code blocks is explained.

3.3.2 Code Generation Approach

The following paragraphs explain the approach taken in converting each of the layers in the standard architecture into executable FPSP code blocks. The general approach was to express each layer's functionality in terms of convolutions, following which FPSP code could be generated using Debrunner's code generator.

Convolutional Layers

The conversion of convolutional layers was perhaps the most straightforward. After the training of a network using the custom regularization technique described in the previous section, the weights matrix of each convolution filter was extracted.


Debrunner’s automatic code generator (Debrunner, 2017) was then used to generate a sequence of FPSP instructions which could accurately compute the result of that particular convolution. Finally, each convolution filter was assigned the use of its own unique set of registers for computation and the code updated accordingly. The conversion procedure is shown in Algorithm 2.

Algorithm 2 Convolutional Layer Conversion

1: procedure CONVERT
2:     for filter in conv layer do
3:         extract weights matrix
4:         generate corresponding FPSP code block using code generator
5:         update code block with unique base register
6:     end for
7:     return all code blocks
8: end procedure
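Off-plane, the extraction loop of this conversion can be sketched against the Keras API; `generate_fpsp_code` stands in for Debrunner's code generator and is a hypothetical name:

```python
def convert_conv_layers(model, generate_fpsp_code):
    # For each filter of each convolutional layer: extract its weights
    # matrix and generate a corresponding FPSP code block
    blocks = []
    for layer in model.layers:
        if 'conv' not in layer.name:
            continue
        kernel = layer.get_weights()[0]  # shape: (kh, kw, in_channels, n_filters)
        for f in range(kernel.shape[-1]):
            blocks.append(generate_fpsp_code(kernel[..., f]))
    return blocks
```

Assigning each block its own base register would then be a simple post-processing pass over the returned code blocks.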

Pooling Layer

The conversion of the pooling layer to FPSP code was carried out in two sequential steps as detailed below. For the purposes of illustration, the use of average pooling in the original network was assumed, although a procedure may also be derived for max pooling if required.

Calculation

2D average pooling involves computing the average of n × n data points. While this computation is normally carried out directly, it can also be thought of as a convolution represented by a convolution kernel with uniform values.

For example, the following convolution filter would compute the average of each pixel and its surrounding eight pixels:

[1/9 1/9 1/9]
[1/9 1/9 1/9]
[1/9 1/9 1/9]

For ease of approximation, sum pooling was used during the conversion process in place of average pooling. This was done with the consideration that it would have no impact on future layers, given that it was simply a linear scaling of the original function. The following convolution filter was therefore employed to carry out the calculation:

[1 1 1]
[1 1 1]
[1 1 1]


Figure 3.8: Diagram demonstrating the process of compaction

Compaction

At this point, all necessary calculations had been completed. However, the results were not located at the right locations on the focal plane.

In a standard CNN pooling layer, the completion of pooling leads to a reduction in image size. For example, a 9 × 9 image becomes a 3 × 3 image after pooling.

In the case of an FPSP, however, given that the convolution was carried out in place and at a pixel level, the calculated values were scattered across the focal plane, rather than compacted into a smaller-sized image. An example of this situation is shown in the left half of Fig. 3.8.

Actions were thus taken to compact the image by shifting the calculated values to appropriate positions on the focal plane, as shown in Fig. 3.8. This could be achieved using each processing element's FLAG register, allowing groups of pixels to be selected as necessary. Suitable neighbour operations could then be applied, moving calculated values inwards towards their appropriate positions.
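Off-plane, the net effect of the in-place sum-pooling convolution followed by compaction is equivalent to strided sum pooling, which can be sketched as follows (a conceptual NumPy sketch of the combined result, not FPSP code):

```python
import numpy as np

def pool_and_compact(img, u=3):
    # Net effect of applying the all-ones u x u kernel in place and then
    # compacting: each output pixel holds the sum of one u x u block
    h, w = img.shape
    out = np.zeros((h // u, w // u))
    for i in range(h // u):
        for j in range(w // u):
            out[i, j] = img[i * u:(i + 1) * u, j * u:(j + 1) * u].sum()
    return out
```

For a 9 × 9 input this yields the 3 × 3 output described above.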

Dense Layer

The computation of dense layers is often thought of in terms of matrix multiplications. For example, the computation of a sample dense layer with 9 inputs and 2 neurons can be represented as follows:

                      [0.1 0.2]
                      [0.3 0.4]
                      [0.5 0.6]
                      [0.7 0.8]
[1 2 3 4 5 6 7 8 9] × [0.9 1.0] = [52.5 57]
                      [1.1 1.2]
                      [1.3 1.4]
                      [1.5 1.6]
                      [1.7 1.8]


However, as Karpathy (2018) notes, one can also think of the computation of a dense layer as the result of a series of successive 1-D convolutions applied to the input.

Consider that, in the example above, we could take [0.1 0.3 0.5 ...] as our first convolution kernel, and [0.2 0.4 0.6 ...] as our second convolution kernel. This would give us the following, with ∗ representing a convolution operation:

[1 2 3 4 5 6 7 8 9] ∗ [0.1 0.3 0.5 0.7 0.9 1.1 1.3 1.5 1.7] → [52.5]

[1 2 3 4 5 6 7 8 9] ∗ [0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8] → [57]

In the case of the FPSP, however, the input after the compaction step above would still be two-dimensional, given that no flattening operation was carried out (unlike in the standard implementation).

We could therefore convert each convolution filter into a 3 × 3 matrix and apply the convolutions as follows, with each filter corresponding to the weight values of a specific neuron:

[1 2 3]   [0.1 0.3 0.5]
[4 5 6] ∗ [0.7 0.9 1.1] → [52.5]
[7 8 9]   [1.3 1.5 1.7]

[1 2 3]   [0.2 0.4 0.6]
[4 5 6] ∗ [0.8 1.0 1.2] → [57]
[7 8 9]   [1.4 1.6 1.8]

In this way, the task of computing a dense layer was successfully reduced to the task of computing a series of convolution filters, code for which could be generated in the same way as before.
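The worked example can be checked numerically in plain NumPy (variable names are ours):

```python
import numpy as np

# Dense layer with 9 inputs and 2 neurons, computed as two 3x3 filters
# applied to the un-flattened 3x3 input from the example above
x = np.arange(1, 10).reshape(3, 3)
w0 = np.array([[0.1, 0.3, 0.5], [0.7, 0.9, 1.1], [1.3, 1.5, 1.7]])
w1 = np.array([[0.2, 0.4, 0.6], [0.8, 1.0, 1.2], [1.4, 1.6, 1.8]])
outputs = [float((x * w).sum()) for w in (w0, w1)]  # [52.5, 57.0]
```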

3.3.3 On-Simulator Implementation

A regularised network was first trained using the procedure described in Section 3.2. Thereafter, using the methods described in the previous section, each layer of this trained network was converted to FPSP code. The converted FPSP code was then run on our FPSP simulator, thereby allowing for the simulation of the inference procedure of the network when provided with test images.


3.3.4 Results

We verified the accuracy of our simulated implementation by comparing the intermediate and final results produced by the simulated implementation and the standard Keras/Tensorflow implementation. In all test cases using images drawn from the MNIST test set, the network computed on the simulated FPSP exactly reproduced the results produced by the standard implementation, thereby confirming the accuracy of our implementation.

Intermediate Values Example

For example, these were the results from the simulated FPSP for the computed values of the first convolution filter after pooling:

[0.27469135802469136, 0.0, 0.0]

[7.430555555555555, 16.850308641975307, 0.8518518518518519]

[0.0, 6.373456790123457, 0.0]

The corresponding results of the Keras/Tensorflow network at the same stage of computation were as follows:

[[[ 0.27469137 ]

[ 0. ]

[ 0. ]]

[[ 7.4305553 ]

[16.85031 ]

[ 0.8518518 ]]

[[ 0. ]

[ 6.373457 ]

[ 0. ]]]

As can be seen, apart from minor differences in precision and number representation, the computed values are exactly the same.

Final Values Example

Similarly, the final computed outputs for the test image were also exactly the same across the two implementations:

Simulated Implementation:

[-11.946952160493826, -6.577739197530862, -3.338927469135803,

1.049382716049382, -6.252700617283953, -7.754436728395057,

-14.915509259259258, 8.994405864197532, -1.1778549382716053,

1.1730324074074079]

Keras/Tensorflow Implementation:

[-11.946953 -6.5777407 -3.3389273 1.0493841 -6.252701 -7.754436
-14.915509 8.994406 -1.177855 1.173033 ]

The accuracy of these results therefore demonstrated that a complete neural network could indeed be successfully implemented on an FPSP by converting each standard neural network layer to FPSP code using the methods developed above.

3.4 Noise-In-The-Loop Training

This section demonstrates a method for training neural networks to take hardware-level noise into account during the training loop. While FPSPs hold great promise for inference acceleration, their benefits come at the cost of significant noise and other hardware effects which prevent the effective deployment of neural networks. Our method provides a means of training a neural network for use on a device with arbitrary amounts of noise while incurring minimal loss in performance. We thus provide a means of opening the potential of FPSPs to developers of inference acceleration techniques within the computer vision community.

3.4.1 Motivation

Using the digital simulator, we observed that including noise for FPSP operations substantially reduced test accuracy, with an increasing amount of noise leading to a corresponding decrease in accuracy (see Fig. 3.12). This phenomenon was entirely expected, given that introducing noise effectively leads to a divergence between the training and test environments, thus resulting in the model learnt being an inaccurate representation of the test set 'reality'. For a neural network to be successfully implemented on an FPSP, a solution would thus be needed that would allow for the deployment of a network even in the face of substantial noise.

3.4.2 Approach

The key idea was to train a network on the environment it was actually likely to face at test time. To do that, the effects of noise would need to be introduced into the training process so that the network could learn to take them into account.

However, this would not be possible with a network exclusively implemented on an FPSP, such as the previous section's Focal Plane Net. The reason for this was that once layers were converted to instruction sequences, they were transformed from a series of continuous weights (which could be trained using backpropagation) to a set of non-differentiable discrete elements. Converting all layers to FPSP operations meant that the network no longer had any values that could be trained to take noise into account.


Computational Platform | Advantages                    | Drawbacks                      | Most Suitable For
FPSP                   | Massively Parallel Processing | Limited Computational Accuracy | Convolutional Layers
CPU                    | Accurate Computation          | Iterative Processing           | Dense Layers

Table 3.4: Comparison between CPU and FPSP Computation

The solution developed in response to this roadblock took advantage of the fact that FPSPs usually also have a micro-controller operating in tandem, which is used to perform crucial functions such as managing the FPSP's clock cycle and handling output. The micro-controller could of course also be used to carry out a limited amount of computation. Crucially, while less efficient, this data processing would be accurate and free from the noise associated with FPSP computation; used strategically, this computational capability could prove extremely valuable.

We therefore decided to implement a division of labour between the FPSP and the micro-controller, with each device performing computations suited to its particular advantages, working together to achieve an optimal result. The key question was thus how this division of labour should be carried out, and in particular, which computations should be carried out on which device. To answer this question, we first needed to analyse the advantages and disadvantages of each computation option.

FPSP and Micro-Controller Comparison

A typical CNN comprises a number of convolutional and pooling layers, as well as a number of dense (fully-connected) final layers. The nature of FPSPs means that they are particularly well-suited to the rapid execution of convolution operations. While a CPU would have to proceed iteratively over an image frame, computing the result of the convolution filter at each successive pixel, the massively parallel nature of the FPSP processor array means that it can simultaneously compute the result of a convolution filter at all points in an image within a single frame loop. A CPU can thus apply a particular convolution filter to all pixels in an image in O(n) time (i.e. proportional to the number of pixels in an image), while an FPSP can compute the same result in O(1) time.¹

At the same time, the vastly more precise computation offered by a CPU would be of great benefit in computing the final dense layers, where numerical accuracy is of greater importance. This is simply because the dense layers perform the computations that lead directly to the result of the neural network. As long as the computation of the final dense layers remained accurate, our hypothesis was that those dense layer weights could be trained to take into account inaccuracies introduced by previous, possibly less accurately computed layers.

¹The overall time complexity of a convolution operation also depends on the size of the convolution kernel. However, for the purposes of this comparison we limit the discussion to the time complexity of applying a convolution kernel of fixed size.


Figure 3.9: Comparison between the architecture of a standard CNN and our proposed division of labour

Given the particular advantages and disadvantages of each computational platform (see Table 3.4), we concluded that the best division of work would be to rapidly compute the result of the convolutional layers using the FPSP, followed by accurately computing the results of the final dense layers using the micro-controller. Fig. 3.9 overlays the architecture of a standard network with the proposed division of labour.

3.4.3 Implementation

The idea which we sought to implement was essentially that later layers in a neural network could successfully adapt to the introduction of noise in earlier layers if this noise was itself included in the training loop. Specifically, we hypothesised that an accurately computed dense layer could learn to account for noise introduced in the computation of previous convolutional layers.

Our implementation of this idea is detailed in the following sections. The first section details the development of a custom Keras layer to simulate the noise introduced in computation, and the second details the training and re-training procedure that was employed to implement this technique.

Simulator-In-The-Loop

The key to taking hardware noise into account during the training process was to develop a method to add the effects of the noise to the values used in training.


Figure 3.10: Comparison between original network and Simulator-In-The-Loop Network

This was done by implementing a variant of our FPSP simulator directly in the training loop, using a custom Keras layer to replace the standard Keras convolution layer. Fig. 3.10 compares the architecture of the original network to that of the network after the inclusion of the custom Keras layer.

This custom layer expressed FPSP operations in terms of Tensorflow functions accessed through the Keras backend. For example, the FPSP operation north() could be rewritten as the following set of Tensorflow functions:

# K refers to the Keras backend, i.e. Tensorflow

# Get the first 27 rows of each image in the batch
sliced = K.slice(input_tensor, (0, 0, 0, 0), (-1, 27, -1, -1))

# Create a row of zeros for each image in the batch
zeros = K.zeros((32, 1, 28, 1), dtype='float32')

# Concatenate the zeros with the original 27 rows
result = K.concatenate([zeros, sliced], axis=1)

# In effect, the top row now comprises zeros, while all other rows
# have moved downwards by one place. The original last row is no
# longer represented.

Once each FPSP operation had been rewritten in terms of Tensorflow code, the operations could be put together in sequence, just as had been done on the simulator. A convolution could therefore be represented in the custom layer by a series of FPSP operations, in the same way as had been done for the simulated implementation.

Figure 3.11: Noise-In-The-Loop Overall Training Procedure

The function for each operation was then wrapped in a general function representing the error model that should be taken into account during the training process. This allowed the amount and type of noise accrued by each operation to be varied as required. The full north() operation shown above was therefore represented in a function as follows:

def north(input_tensor):
    sliced = K.slice(input_tensor, (0, 0, 0, 0), (-1, 27, -1, -1))
    zeros = K.zeros((32, 1, 28, 1), dtype='float32')
    result = K.concatenate([zeros, sliced], axis=1)
    return error_model(result)
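The error_model wrapper itself is not shown; a minimal NumPy sketch of a multiplicative-noise version matching Equation 3.8 might look like this (not the exact implementation used here):

```python
import numpy as np

def error_model(result, d=0.10, rng=None):
    # Multiply each computed value by N ~ U([1 - d, 1 + 0.5d]);
    # d controls the degree of noise per simulated FPSP operation
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.uniform(1.0 - d, 1.0 + 0.5 * d, size=np.shape(result))
    return np.asarray(result) * noise
```

In the actual training loop the equivalent would be expressed with Tensorflow ops so that gradients can flow through the noisy layers.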

Training Details

As before, a network was first trained using the custom regularization method described above. The convolution layer weights were then converted into FPSP code blocks using the technique described in Section 3.3.2.

Finally, the modified network was put through a new round of training, where noise could be represented using the error model. It was during this last round of training that the weights in the dense layer were re-trained to take noise into account. Figure 3.11 gives a diagrammatic representation of the overall training process.

3.4.4 Results

Our hypothesis was verified by experimental results, which convincingly demonstrated that noise-in-the-loop training was a very powerful method for ameliorating the effects of noise on the implementation of neural networks.

Our experiments were divided into two stages. We first considered the effectiveness of our technique in dealing with the effects of random noise, following which we considered its effectiveness when deployed against systematic noise (with some degree of randomness).


Figure 3.12: Effect of noise on test accuracy for a network trained with standard training (red) and noise-in-the-loop training (green), with the zero-noise baseline in blue

Random Noise

Noise was represented in the form of a uniform random variable N, where d ∈ ℝ represented the degree of noise applied; the greater the value of d, the greater the potential distortion to the calculated values:

N ∼ U([1 − d, 1 + 0.5d]) (3.8)

We then introduced noise to the inference process by multiplying the result R of each simulated FPSP operation by N, such that:

R → R × N (3.9)

The network was first trained and tested using the normal training procedure, and thereafter trained using the noise-in-the-loop training technique. The results of both procedures in response to increasing amounts of noise are shown in Fig. 3.12.

We see that at every level of noise, noise-in-the-loop training consistently produced results that outperformed standard training. Furthermore, we see that even at moderately large amounts of noise per operation, such as d = 10%, noise-in-the-loop training actually produced test accuracies that were higher than the initial baseline! This is not entirely unexpected, given that noise can be thought of as a means of reducing overfitting, thereby making it more likely that the trained network would generalise better to examples in the test set. It was only at increasingly absurd amounts of noise that noise-in-the-loop training began to struggle, but even so it still consistently provided an advantage over standard training.

Baseline (No Noise)                                | 91.41%
Standard Training, Noise applied during Inference  | 81.37%
Noise-in-the-loop Training                         | 93.01%

Table 3.5: Test results when 10% systematic error and ±1% random error is applied

Systematic Noise

While the performance of noise-in-the-loop training against random noise was significant, the most impressive results came when it was applied against systematic noise.

While some degree of random noise is usually present in FPSP computations, it is usually not of the form depicted in the previous section. Instead, FPSP noise normally takes the form of a (relatively) fixed amount of systematic noise, coupled with a smaller degree of random noise. For example, it is known that each neighbour operation on the SCAMP-5 results in a small value loss due to the loss of charge when charges are transferred between analogue registers. However, this loss value is not exact, and varies around an average, with the variation determined by a range of physical factors. This is the type of error we sought to replicate in the following experiments.

As before, we introduced a noise multiplier to the result of each operation, with a 10% systematic error and a ±1% random error, as follows:

N ∼ U([0.89, 0.91]) (3.10)

The results with and without noise-in-the-loop training are shown in Table 3.5. As can be seen, the introduction of noise alone led to a significant fall in the test accuracy of the network. However, when noise-in-the-loop training was used, not only did the network not suffer a drop in performance, it even outperformed the original baseline!
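The noise-injection step itself is simple to express. The following NumPy snippet is an illustrative sketch, not the thesis's actual training code (the function name `noisy` is our own): it draws a fresh multiplicative noise sample for each simulated operation result.

```python
import numpy as np

def noisy(result, sys_err=0.10, rand_err=0.01, rng=None):
    """Multiply a simulated FPSP operation result by a noise sample
    N ~ U([1 - sys_err - rand_err, 1 - sys_err + rand_err]).
    sys_err=0.10, rand_err=0.01 reproduces Eq. (3.10): U([0.89, 0.91])."""
    rng = np.random.default_rng() if rng is None else rng
    n = rng.uniform(1.0 - sys_err - rand_err,
                    1.0 - sys_err + rand_err,
                    size=np.shape(result))
    return np.asarray(result) * n
```

During noise-in-the-loop training, a call like this would wrap the output of every simulated analogue operation in the forward pass, so that the retrained dense layer learns weights that tolerate the perturbation.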

The results were even more dramatic when slightly more systematic noise was introduced. In this case, a noise multiplier with 25% systematic error and a ±2.5% random error was used as follows:

N ∼ U([0.725, 0.775]) (3.11)

The results for this experiment are shown in Table 3.6. It can be seen that training without noise-in-the-loop led to a complete collapse in test accuracy, with the result being almost no better than random guessing. When noise-in-the-loop training was employed, the network only experienced a slight drop in test accuracy from the baseline, a massive improvement over the result produced by standard training.

Baseline (No Noise)                                   91.41%
Standard Training, Noise applied during Inference     11.42%
Noise-in-the-loop Training                            87.57%

Table 3.6: Test results when 25% systematic error and ±2.5% random error is applied

3.4.5 Hardware Recommendations

As we will see in the subsequent chapters, Noise-In-The-Loop training is almost entirely effective in alleviating the noise introduced by the SCAMP-5. This is because the amount of error introduced by the SCAMP-5 is generally well within the noise levels successfully tested above.

However, it is conceivable that developers of future FPSPs might wish to tolerate higher levels of systematic and/or random noise, especially if such noise is the trade-off incurred in introducing other important features, such as additional registers or a wider range of numerical representation.

While Noise-In-The-Loop training clearly provides an improvement over the status quo for all levels of noise, as seen above, it is particularly intriguing that at small to medium levels of noise, networks trained using Noise-In-The-Loop training can even outperform their zero-noise benchmark. It is thus of interest to determine more precisely how much systematic and/or random noise can theoretically be alleviated by Noise-In-The-Loop training with no loss in network inference accuracy, so that FPSP hardware developers can use this information to their advantage in designing future devices.

Random Noise

Fig. 3.13 presents the effects of gradually increasing the amount of Random Error applied to a network trained with Noise-In-The-Loop training. As before, we introduced a noise multiplier to the result of each operation, as follows:

N ∼ U([1 − d, 1 + d])    (3.12)

As can be seen, up to around 12% random error, Noise-In-The-Loop training in fact results in performance better than the standard baseline.


Figure 3.13: Test accuracy after Noise-In-The-Loop Training when no Systematic Error and an increasing amount of Random Error is applied.

Systematic Noise

Fig. 3.14 presents results obtained when testing with increasing amounts of systematic error and a small fixed amount of random error. The noise multiplier used in this set of experiments was as follows:

N ∼ U([(1 − d) − 0.025, (1 − d) + 0.025])    (3.13)

We see that at up to 15% systematic error per operation, Noise-In-The-Loop training in fact continually outperforms its benchmark. It is only as the error increases beyond that point that Noise-In-The-Loop training cannot fully compensate for the additional systematic noise being introduced.

Combination of Systematic Noise and Increasing Random Noise

Having established that up to 15% systematic error can be tolerated, we then proceed to determine how much additional random noise can be included before Noise-In-The-Loop training is overwhelmed. Using a value of 10% for systematic error (well within the acceptable range), we then add an increasing amount of random error. The noise multiplier used was as follows:

N ∼ U([0.9 − d, 0.9 + d])    (3.14)

We see that just before the random error reaches 10%, the network's test accuracy slips below the benchmark.

Figure 3.14: Test accuracy after Noise-In-The-Loop Training when no Random Error and an increasing amount of Systematic Error is applied.

Figure 3.15: Test accuracy after Noise-In-The-Loop Training when a Systematic Error of 10% and an increasing amount of Random Error is applied.

In general, it appears that up to 10% of both random error and systematic error can be comfortably alleviated by using Noise-In-The-Loop training. This is a significant finding, because (as shall be explained in Chapter 5) many errors encountered on the SCAMP-5 tend to be smaller than these limits. This would then suggest that it might be beneficial to build future FPSPs with increased noise, if such noise is incurred in the provision of other functionality. However, it should be noted that this analysis is a preliminary one, using a simplified conception of FPSP noise. The noise characteristics of actual devices such as the SCAMP-5 tend to be somewhat more complex, and may therefore cause Noise-In-The-Loop training to respond differently to increasing amounts of noise. It would therefore be beneficial for future prototype devices to have their noise profiles tested with Noise-In-The-Loop training, in order to establish exactly how much noise could potentially be incurred with no loss in neural network inference accuracy.


Chapter 4

AnalogNet

This chapter details the development of AnalogNet, a Focal-Plane CNN designed specifically for implementation on the SCAMP-5 Vision System.

After the successful development of a simulated network implementation, as described in Chapter 3, the next step was to implement a similar network on an actual Focal-Plane Sensor-Processor, in this case the SCAMP-5. The final implementation of AnalogNet was based on the general design features introduced in the previous chapter, although the particular constraints of implementation on a specific hardware device meant that new elements had to be introduced and different techniques developed in order to achieve a successful implementation.

4.1 Contribution

In this section, we showcase a method to implement a convolutional neural network on the SCAMP-5 analog FPSP. We demonstrate that such an approach can carry out image classification tasks to a high degree of accuracy, while achieving large improvements in per-frame computation time and energy efficiency over existing state-of-the-art computer vision solutions. To our knowledge, this is the first time that a convolutional neural network has been successfully implemented on analog hardware in general, and on the SCAMP chip family in particular. While previous work such as LiKamWa et al. (2016) has used simulated devices to demonstrate the benefits of such an approach, our study is the first to demonstrate an implementation running on an actual hardware device.

4.2 Network Design

The design of the AnalogNet architecture was generally based on the noise-in-the-loop network introduced in the previous section, which was designed such that convolutions would be carried out on an FPSP and subsequent layers computed on a digital micro-controller. Given that the SCAMP-5 was known to introduce significant amounts of noise during computation, it was anticipated that noise-in-the-loop training would be required, and that the network developed for that purpose would thus be a useful starting point. Fig. 4.1 gives a high-level overview of the architecture we sought to implement on the SCAMP-5.

Figure 4.1: High-level overview of the network architecture to be implemented on the SCAMP-5

Nonetheless, in seeking to implement that network on the SCAMP-5 hardware, important adaptations were required. Fig. 4.2 gives an outline of AnalogNet's eventual design, with each component explained in detail in the sections to follow.

Figure 4.2: Architecture of AnalogNet


Figure 4.3: Effect of environmental light levels on image captured

4.2.1 Input Binarization

One major effect of moving from a simulated implementation to the physical hardware of the SCAMP-5 was the impact of environmental conditions on experimental outcomes.

When working with a simulated environment, training and test images from the MNIST dataset were simply loaded from system memory. However, when working with the SCAMP-5, images had to be sourced from the physical environment. In other words, in order to process an image of a digit, a representation of that digit had to be physically placed in front of the vision system.

This introduced two particular issues in relation to light levels, image exposures and overall lighting conditions:

Lack of Autoexposure

Unlike most modern digital cameras, the SCAMP-5, being a prototype research device, does not have automatic exposure control. This introduced particular concerns with regard to light levels, image exposures and overall lighting conditions. For example, it was observed that results of experiments conducted in the morning would differ from those conducted in the afternoon, simply because the orientation of the lab's windows meant that the lab was better illuminated with natural light at particular times. Fig. 4.3 shows the result of applying different lighting conditions to the exact same scene.

Frame Exposure Time

A key feature of the SCAMP-5 is its ability to carry out vision processing at extremely high frame rates. However, dramatically different frame rates result in vast differences in the time available for the absorption of photons. As the frame rate increases, the amount of time available for light absorption decreases, resulting in the recorded image appearing progressively darker. Fig. 4.4 illustrates the result of using different frame rates to process a scene.

Figure 4.4: Effect of frame rate on image captured

The following methods were available for controlling image illumination:

• Ambient Light Levels: The most straightforward and obvious method of controlling the illumination in the captured SCAMP-5 images was simply to vary the amount of ambient light in the physical environment, e.g. by adding more lamps or increasing their brightness.

• Lens Exposure Setting: By adjusting the aperture of the lens, the amount of light captured could be increased or decreased.

• Frame Gain Parameter: This was a software parameter which adjusted the light sensitivity of each pixel's PIX register.

Fig. 4.5 shows the effect of adjusting each of these settings independently. Used in conjunction, they could significantly adjust illumination levels.

Nonetheless, these procedures alone were limited in their effectiveness, even when used in unison. Fig. 4.6 compares an analog image taken at 2000 fps with one taken at 10 fps. Despite using all the methods detailed above to capture the 2000 fps image, it is clear that it remains considerably underexposed as compared to the 10 fps image.

This situation was therefore not considered optimal for the purposes of this project, given that it would make performing consistent experiments across varying frame rates and lighting conditions nearly impossible.

Instead, it was decided that the input image should be thresholded and binarized, before being converted back to standardised analog values based on the result of the binarization procedure.


Figure 4.5: Effect of adjusting each light enhancement method independently. All images taken at 55 fps.


Figure 4.6: Methods to increase image illumination are ineffective against very high frame rates

Binarization Procedure

The aim of the binarization procedure was to transform the original analog image, located on an analog register, into a binary image located on a digital register. Any value greater than or equal to a threshold t would be represented as a 1, while all other values would be represented as a 0:

f(n) = { 0 if n < t;  1 if n ≥ t }    (4.1)

The binarization procedure was implemented using the following SCAMP-5 analogue assembly code:

scamp5_in(E, threshold); // Load threshold value into AREG E
get_image(C, D);         // Load image into AREG C
sub(A, C, E);            // A = image - threshold
where(A);                // Set FLAG to 1 only if A > 0
CLR(R6);                 // Set DREG R6 to 0
OR(R5, R6, FLAG);        // Copy FLAG to R5
ALL();                   // Set all FLAG back to 1
// Binarized image is now stored in DREG R5

Binarized Analog Image

After the binary images were produced, it was necessary to convert them back to standardized analog values so that further analog computation could be carried out.

All 1s on the binary image were now assigned a standard analog value of 120, while all 0s were assigned a value of 0, as follows:

f(n) = { 120 if n = 1;  0 if n = 0 }

The standardisation procedure was implemented using the following SCAMP-5 analogue assembly code:

scamp5_in(E, 120); // AREG E = 120
scamp5_in(F, 0);   // AREG F = 0
WHERE(R5);         // Copy DREG R5 to FLAG
mov(A, E);         // A = 120 where FLAG is 1
ALL();             // Set all FLAG back to 1
NOT(R6, R5);       // Set DREG R6 to inverse of DREG R5
WHERE(R6);         // Copy DREG R6 to FLAG
mov(A, F);         // A = 0 where FLAG is 1
ALL();             // Set all FLAG back to 1
// Binarized analogue image is now stored in AREG A

By adjusting the threshold value based on the specific experimental conditions encountered, this procedure allowed for the production of near-consistent binarized analog images across a wide range of different lighting conditions.
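Taken together, the two assembly stages amount to a threshold-and-rescale. The following NumPy sketch shows the same net effect on a whole array at once (illustrative only; `binarise_and_restandardise` is our own name, and the on-device code of course operates per processing element):

```python
import numpy as np

def binarise_and_restandardise(analog_img, threshold, high=120):
    """Threshold an analog image to a binary map (n >= t -> 1, else 0,
    as in Eq. (4.1)), then map 1 -> 120 and 0 -> 0 so that analog
    computation can continue from consistent values."""
    binary = (np.asarray(analog_img) >= threshold).astype(np.uint8)
    return binary * high
```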


Figure 4.7: MNIST digit taken at 55 fps and 2000 fps

Figure 4.8: Binarized MNIST digit taken at 55 fps and 2000 fps

Fig. 4.7 compares an image of an MNIST digit produced at 2000 fps with another produced at 55 fps. As expected, the image taken at 2000 fps is very underexposed as compared to the image taken at 55 fps. Fig. 4.8 compares the same two images after input binarization is applied. It is notable that, unlike before, there is now close to no difference between the two images of the digit, thereby demonstrating the success of the input binarization procedure.

4.2.2 Filter Execution

At this stage, the binarized analog image had been stored in Analog Register (AREG) A, and the task at hand was to perform n convolutions on the image in AREG A, corresponding to the n convolutional filters in the network. Before doing so, however, it was imperative to consider the proper utilisation of the available registers.

Register Allocations

Each processing element (PE) in the SCAMP-5's pixel-parallel array contains six analog registers, AREG A-F. It was necessary to decide how best to allocate these registers to fulfill the goal of executing the computation of n convolutional filters. The following sections detail the decisions made and the basis of these decisions.

SCAMP-5 Hardware Workaround Register

The design of the SCAMP-5's hardware means that there are two rules that must be followed when writing SCAMP-5 analog assembly code:

• The first and last arguments to an operation cannot be the same. Therefore, for the computation of A = B + A, add(A, A, B) is valid code, while add(A, B, A) is invalid. Similarly, neg(A, A) is not valid, while neg(B, A) is valid.

• The two inputs to an operation cannot be the same. For our purposes, this condition primarily relates to the addition and subtraction operators add(_, X, Y) and sub(_, X, Y), since they are the only major operators with two inputs.
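The two rules above are easy to check mechanically. A hypothetical helper (our own sketch, not part of any SCAMP-5 toolchain) might look like:

```python
def violates_scamp5_rules(op, args):
    """Return True if an analogue instruction breaks either rule:
    (1) the first and last arguments are the same;
    (2) for add/sub, the two inputs are the same.
    `op` is the mnemonic, `args` the list of register names."""
    if len(args) >= 2 and args[0] == args[-1]:
        return True
    if op in ("add", "sub") and len(args) == 3 and args[1] == args[2]:
        return True
    return False
```

A pass like this over generated code would flag statements such as add(A, B, A) or neg(A, A) before they reach the device.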

In seeking to produce a general solution for FPSP code generation, Debrunner's code generator (Debrunner, 2017) produces code which is not compatible with those rules and thus cannot actually be executed on the SCAMP-5 hardware. Consider the following statement, which is often produced by the code generator:

add(E, E, E); // AREG E = E + E
              // Undefined behaviour when executed on the SCAMP-5

To execute the statement above on the SCAMP-5, an alternative method must be used:

mov(F, E);    // AREG F = E
add(E, E, F); // AREG E = E + F
              // Achieves intended computation of E = E + E

Filter Computation Registers    Number of Instructions in Generated Code
1                               No Solution
2                               35
3                               23
4                               26
5                               26

Table 4.1: Relationship between the number of registers allocated for computation and the number of instructions found in the generated code

While the alternative method produces the expected result, it also requires the use of an additional register. This means that one register must always be kept free during this stage of the process, such that computations of this form can be accommodated; should there be no free registers (e.g. if AREG F above were being used to store the value of an intermediate computation), vital data being used for computation would end up being lost.

In line with this requirement, AREG F was therefore reserved for the purpose of accommodating this particular hardware workaround.
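The workaround itself can be expressed as a small rewriting pass over generated instructions. The sketch below is our own illustration (the `legalise` name and the tuple format are assumptions): it routes an offending operand through the reserved register F.

```python
SCRATCH = "F"  # register reserved for the hardware workaround

def legalise(op, dst, x, y):
    """Rewrite an add/sub instruction that breaks the SCAMP-5 rules,
    copying an offending input into the scratch register F first.
    Returns a list of (op, args) tuples."""
    rewritten = []
    if x == y:                       # rule 2: identical inputs
        rewritten.append(("mov", (SCRATCH, y)))
        y = SCRATCH
    if dst == y:                     # rule 1: destination equals last input
        rewritten.append(("mov", (SCRATCH, y)))
        y = SCRATCH
    rewritten.append((op, (dst, x, y)))
    return rewritten
```

Applied to the generator's add(E, E, E), this produces exactly the mov(F, E); add(E, E, F) sequence shown above, while leaving already-legal statements such as add(A, A, B) untouched.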

Filter Computation Registers

Debrunner's code generator generates code with the assumption that a number of analogue registers are available for storing intermediate values during the computation process. The number of registers available for this purpose can be specified as a parameter for the code generator, which will then determine the optimal solution given the registers available.

In general, allocating additional registers for filter computation tends to result in shorter instruction sequences. For particularly difficult filters, allocating too few registers might even result in no solution being found. Table 4.1 illustrates these considerations, showing the statistics for the computation of a sample convolutional filter.

However, as there were now only five remaining analog registers, any increase in the number of registers used for filter computation would result in fewer registers being available for use as base registers (see below).

Balancing these two competing trade-offs, it was decided that two analog registers, AREG D and AREG E, would be reserved for the purposes of filter computation. Experience working with the code generator indicated that two registers were usually sufficient to compute a solution for the weight matrices that were trained, even if a solution with fewer instruction steps might be available if more registers were allocated.

Figure 4.9: Illustration of the base register concept: 'live' registers are shown in green.

Base Registers

The decisions made above left three registers available for use as base registers.

A key limitation of the current SCAMP architecture is the lack of any form of memory within each processing element. Programming in analog assembly for the SCAMP-5 is thus unlike writing assembly code for a conventional processor, where one always has access to a block of memory in which intermediate computation results can be stored.

This posed a particular challenge because it meant that the only place where intermediate results could be stored was in the analog registers themselves, which also meant that analog registers holding intermediate values had to be reserved and prevented from being used for any other purpose. This was the basis for the base register concept.

Under the base register concept, each convolutional filter was allocated an analog register to be used as its base register. When computations were being carried out to calculate the result of the nth filter, the base register allocated to filter n would be 'live', together with the hardware workaround register (AREG F) and the filter computation registers (AREG D and AREG E). These four registers would then be used to calculate the result of applying filter n, with the final result found in the base register. The concept is illustrated graphically in Fig. 4.9. The final result of each filter would therefore remain unaffected by the computation of the other filters. At the end of the process, the results of filter 1 would be stored in AREG A, those of filter 2 in AREG B, and those of filter 3 in AREG C.

Given that each base register corresponded to the capacity for one convolutional filter, and that there were three remaining registers available for use as base registers, this meant that the network being designed could accommodate three convolutional filters in its first layer.1

Figure 4.10: Result of applying each convolution filter to the input image

Filter Computation

Once the appropriate register allocation was decided, each convolution filter was then executed using its relevant set of registers, with the result of each convolution stored in that set's base register. SCAMP-5 code to execute each trained convolution filter was generated using Debrunner's code generator, as was done with the simulated implementation.

Fig. 4.10 shows the result of applying a sample of three trained convolution filters to images of MNIST digits. Notice that, as expected, each trained filter performs a different transformation on the original image and thereby identifies a distinct set of features. Looking at the results of applying each convolution filter, it appears that filter 1 identifies underside horizontal edges, while filter 2 identifies vertical edges and filter 3 identifies parallel diagonal edges.

Activation Function

Finally, after the computation of the filter results, the activation function used during training was applied. In this case, the Rectified Linear Unit (ReLU) activation function had been used, which applied the following function to the convolutional filter results:

f(x) = max(0, x)    (4.2)

1 We had this constraint in mind when designing the networks in the previous chapter; it is not a coincidence that the networks used in the previous chapter's simulated implementations had a convolutional layer with three filters.

Figure 4.11: Effect of different thresholds on the outcome of filter binarization

This meant that any base register in any processing element with a post-computation value less than 0 was reset to zero, while all other registers were left unaltered.

4.2.3 Filter Result Binarization

At this stage, the results of applying the three filters in the convolutional layer can be found in the three base registers.

As noted above, it was decided that this was the point at which readout to the micro-controller was to occur, so that further computation could be carried out there. This was to be done using the SCAMP-5's rapid readout capability, which is limited to reading from digital registers.

The same binarization process described in Section 4.2.1 could again be used to convert the convolution results to binary images stored on digital registers. Unlike at the Input Binarization stage, however, the selection of threshold values no longer depended on the illumination of the original image. Instead, the appropriate threshold value depended on the characteristics of the specific convolutional filter under consideration.

The selection of an appropriate threshold was important because an incorrect threshold could lead to unnecessary inaccuracy in the data readout. A threshold that was set too high would lead to the loss of data, while a threshold that was set too low would lead to the inclusion of excessive noise. Fig. 4.11 shows the binary images produced at different thresholds for the same convolution result.

Once each convolution result was converted into a binary image and stored on a digital register, the data was ready to be read out to the micro-controller.


4.2.4 Readout and Pooling

Readout was carried out using the following C++ function, executed on the NXP micro-controller:

scamp5_scan_events(R5, buffer, n);

This specialised function from the SCAMP-5 API scans the coordinates of (at most) n 1s on a digital register, storing the results in a C++ buffer. For most convolutional filters, a value of n = 100 was usually sufficient to capture all points in the binary image, due to the general tendency of trained post-ReLU convolution filter results to be sparse in nature.

These coordinates were then marked as 1s in a 28 × 28 two-dimensional integer array, with all other points marked as 0s by default. At this point, the binary image stored in the digital register had been replicated in the micro-controller's memory, despite only the coordinates of fewer than 100 points having been transferred.

Sum-pooling was then conducted using the information in the two-dimensional array; since all values in the array were either 1 or 0, this essentially amounted to counting the number of marked points in each 9 × 9 square.
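The readout-and-pool step can be sketched as follows (illustrative Python; `sum_pool_from_events` is our own name, and note that with a 3 × 3 grid of 9 × 9 squares the final row and column of the 28 × 28 array fall outside the pooled region):

```python
import numpy as np

def sum_pool_from_events(coords, img_size=28, grid=3, window=9):
    """Rebuild the binary image from (x, y) event coordinates read out
    by scamp5_scan_events, then sum-pool it into a grid x grid layout
    of window x window squares, giving 9 values for one filter."""
    img = np.zeros((img_size, img_size), dtype=np.int32)
    for (x, y) in coords:
        img[y, x] = 1
    pooled = []
    for gy in range(grid):
        for gx in range(grid):
            block = img[gy * window:(gy + 1) * window,
                        gx * window:(gx + 1) * window]
            pooled.append(int(block.sum()))
    return pooled
```

Running this once per filter and concatenating the three results yields the 27-value feature vector used by the dense layer.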

This produced a sequence of 9 values per convolution filter, and a total of 27 values across all three filters. A sample, for the digits 0, 1 and 2, is as follows:

Values for 0:
[0, 0, 0, 3, 11, 0, 36, 6, 0,
 0, 0, 0, 1, 0, 0, 0, 32, 11,
 0, 1, 5, 0, 12, 12, 3, 2, 14, 4]

Values for 1:
[0, 0, 0, 2, 1, 0, 21, 3, 0,
 0, 0, 0, 8, 0, 0, 20, 10, 0,
 0, 5, 0, 6, 2, 4, 5, 8, 2]

Values for 2:
[9, 0, 0, 0, 0, 0, 29, 0, 9,
 4, 0, 0, 0, 0, 0, 21, 0, 4,
 4, 0, 0, 1, 1, 1, 8, 3, 4]

It is clear that each of these digits produces a very different pattern of values, and it is this difference that formed the basis for the classification subsequently carried out by the dense layer.


4.2.5 Dense Layer Computation

The micro-controller uses these 27 results as the input to the final dense layers of the network, with the outcome determining the network's classification result.

An important limitation to note at this point is that the NXP micro-controller used in the Vision System does not support floating point multiplication. In response to this limitation, we used integer multiplication in place of floating point multiplication. To this end, all dense layer weights were multiplied by 10^x, x ∈ Z≥0, with only the integer component of the result retained. A large enough value of x had to be selected such that each weight was reasonably approximated in integer form (given that an insufficiently large value of x would simply result in some weights being represented as 0). Upon examination of the values of the relevant weights, we decided that x = 5 was a reasonable choice.
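The quantisation and the resulting integer-only dense layer can be sketched as follows (illustrative only; the function names and assumed shapes are ours, not the thesis's actual micro-controller code):

```python
import numpy as np

SCALE = 10 ** 5  # x = 5, as selected in the text

def quantise_weights(w_float):
    """Scale dense-layer weights by 10^5 and keep only the integer
    component, so the layer needs integer multiplies only."""
    return np.trunc(np.asarray(w_float, dtype=np.float64) * SCALE).astype(np.int64)

def dense_argmax(features, w_int, b_int):
    """Integer dense layer followed by max-index; softmax is unnecessary
    at inference time since it does not change the argmax.
    Assumed shapes: w_int (n_classes, 27), b_int (n_classes,)."""
    logits = w_int @ np.asarray(features, dtype=np.int64) + b_int
    return int(np.argmax(logits))
```

Because every class score is scaled by the same constant 10^5, the argmax of the integer logits matches that of the floating point layer up to the truncation error.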

Final Result

As this network was only required to run inference, the softmax activation function used for training was unnecessary. Instead, we simply use a standard max-index function to determine the final output of the network.

4.3 Noise-in-the-loop Training

Having designed a suitable neural network for implementation on the SCAMP-5, a key challenge was determining how the effects of hardware noise on network accuracy could be minimized.

This section presents Direct Noise Incorporation, a method for training a neural network to be implemented on the SCAMP-5 that is robust to the hardware noise introduced during computation.

4.3.1 Direct Noise Incorporation

Direct Noise Incorporation was developed as a variant of the noise-in-the-loop training method described in Section 3.4. While the noise-in-the-loop training method was the most convenient way of training a neural network for use on an FPSP, it required access to a comprehensive noise model for the device on which the network was to be implemented, which at the time did not yet exist for the SCAMP-5.

Noise-in-the-loop training essentially assumed that convolutional layer weights were fixed upon the completion of initial training; they had to be, otherwise it would not be possible to generate the corresponding FPSP code blocks. Noise-in-the-loop training then added arbitrary noise to the result of each FPSP operation before retraining the dense layer, at which point the noise would be taken into account during the training loop. The challenge was therefore how the specific noise introduced by each SCAMP-5 operation could be quantified and taken into account without access to a comprehensive noise model.

Figure 4.12: The values produced after the pooling procedure incorporate the cumulative effects of all noise introduced in previous stages; these values can be used to train new dense layer weights.

The solution presented itself upon examination of the structure of the network that had been designed. It was observed that the cumulative effects of all noise introduced were accounted for by the 27 values produced after the pooling procedure (see Fig. 4.12).

By collecting the 27 values produced for a single training image, we could learn the result of the network's computation up to that point, inclusive of all noise. By collecting that data for all 60000 training images, we would have access to a full training set that directly incorporated all noise introduced by the SCAMP-5. We could then use this data to train a dense layer which would, as before, be robust to noise (Fig. 4.13).

Essentially, instead of using a noise model to estimate the values of the 27 post-pooling neurons, Direct Noise Incorporation obtained those values directly from the hardware computation, thereby obtaining actual values inclusive of all noise. Although time-consuming, logistically inconvenient, and in general more cumbersome than using a pre-computed noise model, this method in fact resulted in a training process that most faithfully replicated the hardware's characteristics, given that it took into account any and all hardware effects, including any that a noise model may have inadvertently overlooked.
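The retraining step can be sketched as plain softmax regression over the collected readouts. The sketch below stands in for the real pipeline: the 27-value rows here are randomly generated surrogates for the SCAMP-5 pooling readouts, and the labels are derived synthetically, so only the training mechanics (not the data) reflect the method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the values read out from the SCAMP-5: one row of 27
# post-pooling activations (noise included) per training image.
X = rng.normal(size=(600, 27))
W_true = rng.normal(size=(27, 10))
y = np.argmax(X @ W_true, axis=1)          # surrogate digit labels 0-9

# Train a 27 -> 10 dense layer by softmax regression on the readouts.
W = np.zeros((27, 10)); b = np.zeros(10)
Y = np.eye(10)[y]                          # one-hot targets
for _ in range(300):
    logits = X @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = (p - Y) / len(X)                # mean cross-entropy gradient
    W -= 0.5 * X.T @ grad
    b -= 0.5 * grad.sum(axis=0)

preds = np.argmax(X @ W + b, axis=1)
print("train accuracy:", (preds == y).mean())
```

Because the dense layer is trained directly on post-hardware values, any systematic or random noise in the readouts is folded into the learned weights.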


Figure 4.13: The network now has dense layer weights trained to take noise introduced by SCAMP-5 computation into account.

4.4 Network Implementation

After developing a method for training AnalogNet using noise-in-the-loop training, the next task was to implement the entire network experimentally. This required an appropriate experimental set-up, as well as the development of software necessary to support that set-up.

4.4.1 Experimental Set-Up

As noted in Section 4.2.1, working with standard machine learning datasets on the SCAMP-5 posed unique logistical challenges, as the device is a vision system designed to process live focal plane data. This meant that, unlike with a standard CPU/GPU implementation, one could not simply read in images as data and transfer them to the device for processing. Instead, vision processing could only take place if the SCAMP-5 was shown images that could be detected in the focal plane.

Our solution was to implement an image display system using Python and OpenCV, which allowed us to read in images from the MNIST dataset and display them on a computer screen for the SCAMP-5 to detect.
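A minimal sketch of the display-side preparation is shown below. The function name, scale factor, and screen resolution are our own illustrative choices; the actual display call would use OpenCV (e.g. cv2.imshow), commented out here so the sketch stays self-contained:

```python
import numpy as np

def prepare_for_display(digit: np.ndarray, scale: int = 16,
                        screen_hw: tuple = (1080, 1920)) -> np.ndarray:
    """Upscale a 28x28 MNIST digit (nearest-neighbour, via np.kron)
    and centre it on a black frame for full-screen display."""
    big = np.kron(digit, np.ones((scale, scale), dtype=digit.dtype))
    frame = np.zeros(screen_hw, dtype=digit.dtype)
    h, w = big.shape
    top = (screen_hw[0] - h) // 2
    left = (screen_hw[1] - w) // 2
    frame[top:top + h, left:left + w] = big
    return frame

# Each prepared frame would then be shown with something like:
#   cv2.imshow("mnist", frame); cv2.waitKey(delay_ms)
frame = prepare_for_display(np.full((28, 28), 255, dtype=np.uint8))
print(frame.shape)   # (1080, 1920)
```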

The Vision System could then be mounted on a tripod and positioned in front of the screen such that the displayed image occupied the appropriate position on the focal plane. Figure 4.14 shows the experimental set-up with the screen and Vision System in place, as well as the image detected in the focal plane by the Vision System.


Figure 4.14: Experimental setup and corresponding image captured in focal plane

4.4.2 Augmented MNIST Training

The standard MNIST dataset is important for benchmarking purposes, because it gives us a clear indication of the performance of our network and device vis-à-vis other state-of-the-art implementations. However, from a practical perspective, using only the standard MNIST dataset is insufficient.

This is particularly important due to the physical nature of the Vision System, where one will never face identical experimental conditions. For example, each time an experiment is run using the MNIST test set, the Vision System will be placed in a slightly different position, as will the image window displayed on-screen; it is simply not possible to perfectly replicate all physical variables when experiments are run over different sessions. The effect is that all images in the training/test set will be effectively translated by an arbitrary number of pixels along the x or y-axis, or slightly magnified or diminished (if the camera is slightly closer to or further away from the screen).

Of course, there are ways to control many of these variables (e.g. the Vision System could be physically fixed in position). However, it is more productive to focus on training a more robust neural network. After all, if the Vision System is to be used for real-world object recognition, we would want it to successfully recognise objects without first needing those objects to be in precisely the right position.

Augmented Live Image Display

Data augmentation is a well-known technique for increasing the robustness and generalisability of neural networks. Common data augmentation procedures include translations, magnifications, rotations and so on.

We added a procedure for data augmentation to our Live Image Display system, with each displayed image being subject to a random translation and magnification before being displayed. As before, Direct Noise Incorporation could then be used to generate a new set of training data, which could be used to train a new, more robust, network.
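A random translation plus magnification can be sketched in pure numpy with nearest-neighbour resampling; the function, parameter ranges, and seed below are illustrative choices, not the thesis's actual augmentation code:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img: np.ndarray, max_shift: int = 3,
            scale_range: tuple = (0.9, 1.1)) -> np.ndarray:
    """Apply a random magnification and translation (nearest-neighbour
    resampling about the image centre); output keeps the input shape."""
    h, w = img.shape
    s = rng.uniform(*scale_range)                       # magnification
    dy, dx = rng.integers(-max_shift, max_shift + 1, 2) # translation
    ys = ((np.arange(h) - h / 2) / s + h / 2 - dy).round().astype(int)
    xs = ((np.arange(w) - w / 2) / s + w / 2 - dx).round().astype(int)
    out = np.zeros_like(img)
    ok_y = (ys >= 0) & (ys < h)
    ok_x = (xs >= 0) & (xs < w)
    out[np.ix_(ok_y, ok_x)] = img[np.ix_(ys[ok_y], xs[ok_x])]
    return out

img = np.zeros((28, 28)); img[10:18, 10:18] = 1.0
aug = augment(img)
print(aug.shape)   # (28, 28)
```

In practice each augmented frame would then be passed to the display routine before being captured by the SCAMP-5.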

4.5 Performance Evaluation

4.5.1 Evaluation Methodology

Evaluation of our SCAMP-5 implementation was carried out by comparing the performance of the SCAMP-5 to a range of other device classes in terms of test accuracy, inference speed and energy consumption. The device classes, and the corresponding device models used, were as follows:

• CPU computation was tested using a PC running an Intel Core i7-4930K 3.40 GHz processor.

• GPU computation was tested using a PC running an Intel Xeon E5-1630 v3 3.70 GHz processor and fitted with an NVIDIA GTX 1080 GPU.

• VPU computation was tested using an Intel Movidius Myriad 2 Neural Com-pute Stick.

The Intel Movidius Myriad 2 is also currently one of the leading solutions available for energy-efficient always-on computer vision applications, and was therefore considered to be the state-of-the-art benchmark implementation for the purposes of performance comparison.

Test Accuracy

Digital Devices

As the CPU, GPU and VPU were all digital devices, they were all expected to compute the network in exactly the same way. Accuracy was therefore measured once across all three classes of devices, and represented the accuracy that should be achieved by all digital devices.

A network was first trained on a CPU and then simply run on the MNIST test set, providing the relevant test score.

SCAMP-5 Vision System

Training and testing on the SCAMP-5 was less straightforward, as some measures were necessary to ensure that tests were as fair as possible. This was an important consideration: because the SCAMP-5 received MNIST data as images in the focal plane, minor differences in environmental conditions could have adverse effects on test numbers. This, of course, was unlike the case with testing on a CPU, where an image read from memory is always represented by the exact same set of data points.

The most straightforward way to conduct testing would have been as follows:

1. Show each set of training images to the device (i.e. images of 0s, 1s, etc.).

2. As per Direct Noise Incorporation, read out and record the 27 pooling values for each training image.

3. Use the pooling values to train weights for the dense layer.

4. Program the trained dense layer weights into the micro-controller.

5. Display test images and record predictions.

However, the issue with the method above was that subtle environmental changes might have taken place between steps 1 and 5. For example, movement of the sun might have changed light levels in the room, or the camera might have accidentally been moved ever so slightly. In our experience, such issues tended to result in an accuracy loss of about 1-2%. This is not a large number, but it is still significant when conducting benchmarking. To alleviate this issue, we collected the test data for each digit immediately after collecting the training data for that digit. Therefore, while environmental conditions for digit 9 may have differed from those for digit 1, this is not an issue in testing, since the environmental conditions for both the training and test sets of digit 1 were largely similar.

The new testing procedure was thus as follows:

1. Show the training images for only one digit (e.g. images of 0s).

2. As per Direct Noise Incorporation, read out and record the 27 pooling values for each training image.

3. Show the test images for the same digit.

4. Read out and record the 27 pooling values for each test image.

5. Repeat steps 1-4 for all digits.

6. Train dense layer weights using the pooling values from the training data.

7. Apply the trained dense layer weights to the pooling values from the test data and record predictions.

Inference Speed

The procedure for measuring inference speed was consistent across all devices. Each device was tasked with conducting inference on 10 cycles of images from the MNIST test set, with the total computation time of each cycle being measured and used to calculate an average per-cycle duration. This was then used to calculate an average per-frame computation time for each device.
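The measurement loop can be sketched as follows; per_frame_time and the dummy inference function are our own illustrative names, and a real run would substitute the actual per-device inference call:

```python
import time

def per_frame_time(infer, frames, n_cycles: int = 10) -> float:
    """Run `infer` over the frame set n_cycles times and return the
    average computation time per frame, in seconds."""
    cycle_times = []
    for _ in range(n_cycles):
        start = time.perf_counter()
        for f in frames:
            infer(f)
        cycle_times.append(time.perf_counter() - start)
    avg_cycle = sum(cycle_times) / n_cycles   # average per-cycle duration
    return avg_cycle / len(frames)            # average per-frame time

# Example with a trivial stand-in inference function:
t = per_frame_time(lambda f: f * 2, frames=list(range(100)))
print(t > 0)   # True
```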


For the CPU, GPU and VPU implementations, it was also necessary to estimate the approximate inference time that would have been incurred in performing analog-to-digital conversion on an imaging sensor and transferring that data to the device. As we were working with 28x28 images, it would not be fair to measure the time taken to transfer images by commercially available webcams, as these webcams are usually meant to transfer high-resolution images and therefore take large amounts of time to do so (typically >50,000 µs). We decided that the fairest approach was to measure how long the SCAMP-5 (which can, to some degree, also be used as a standard webcam) took to read out 784 pixels' worth of image data, and use that as our estimate for data transfer time.

Power and Energy Consumption

Because of the variety of different devices on which tests were performed, a number of different methods were used to measure power and energy consumption statistics.

CPU

Energy statistics for CPU computation were measured using Intel's RAPL (Running Average Power Limit) driver available in the Linux kernel. RAPL provides energy consumption information using a software power model which estimates energy usage (Pandruvada, 2014). It has been shown that these power measurements are generally accurate (Rotem et al., 2012). The estimated energy usage was divided by the number of frames in each test cycle to give the per-frame energy consumption. This could then be combined with statistics on per-frame inference time to determine the level of power consumption.
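On Linux, RAPL counters are exposed as microjoule counters under the powercap sysfs interface (e.g. /sys/class/powercap/intel-rapl:0/energy_uj). The arithmetic for turning two counter readings into a per-frame figure is sketched below; the function name is ours, and a real implementation would read the counter range from max_energy_range_uj rather than assume 2**32:

```python
def rapl_energy_per_frame(energy_uj_start: int, energy_uj_end: int,
                          n_frames: int,
                          max_energy_uj: int = 2**32) -> float:
    """Per-frame energy in millijoules from two RAPL energy_uj counter
    readings, allowing for at most one counter wraparound."""
    delta = energy_uj_end - energy_uj_start
    if delta < 0:                      # counter wrapped during the run
        delta += max_energy_uj
    return delta / n_frames / 1000.0   # microjoules -> millijoules

# 5 J consumed over 1000 frames -> 5 mJ per frame
print(rapl_energy_per_frame(1_000_000, 6_000_000, 1000))   # 5.0
```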

GPU

Energy statistics for GPU computation were measured using the NVIDIA System Management Interface, a command-line utility which allows a user to query information on the GPU's state. We used the Power Draw metric for our purposes, which provides the GPU's last measured power draw to within an accuracy of +/- 5 watts. This value was then multiplied by the measured per-frame computation time to give an estimate of the per-frame energy consumption.

VPU and SCAMP-5

As both the VPU and SCAMP-5 are USB devices, power consumption could be measured using a USB power meter set up as an intermediate connection between the USB device and the USB port. Measurements were conducted using the PortaPow Dual USB V3 Power Monitor, which provided current draw and voltage and could therefore be used to calculate power consumption in watts. As before, this could then be multiplied by the per-frame computation time to give an indication of the per-frame energy consumption.

However, the power consumption measured for the VPU only represented the power consumed by the VPU chip. For the VPU chip to actually be used for inference in a practical setting, it must be connected to a host computer or micro-controller (e.g. a Raspberry Pi) and a camera. The estimated power consumed by these additional components was therefore added to the original power measurements in order to give a fair estimate of the total power consumed by a VPU-based inference system. Our estimate was based on data collected by RasPi.TV (2016).

Figure 4.15: MNIST Test Set Accuracy

4.5.2 MNIST Test Set Performance

This section details key performance metrics, including test accuracy, inference speed, power consumption and energy consumption, recorded when AnalogNet conducted inference on the images in the MNIST test set.

Test Accuracy

Fig. 4.15 shows the MNIST test set accuracy achieved by the SCAMP-5 at 15 fps and 3000 fps, and by the digital devices (i.e. CPU, GPU and VPU). As can be seen, at 15 fps the performance of the SCAMP-5, at 92.65%, is nearly indistinguishable from that of the digital devices at 93.16%. At 3000 fps, the performance of the SCAMP-5 drops to a slightly lower, but still extremely respectable, figure of 90.2%.

The performance of the SCAMP-5 at 15 fps demonstrates that CNN inference can be run on an analog computing device at comparable performance to that of digital devices, thus showing that the noise introduced during analog computation has essentially been successfully taken into account during training.

Figure 4.16: Per-Frame Inference Time

While the performance of the SCAMP-5 drops slightly at 3000 fps, this is likely to be indicative of difficulties inherent in carrying out imaging at extremely high frame rates, rather than of any particular issue with computation. In particular, one possible difficulty arises from the refresh rate of the LCD screens used to display test images to the SCAMP-5, which is generally around 60 Hz. At 3000 fps, the SCAMP-5 might end up capturing multiple images mid-transition, potentially introducing additional input noise that could not be accounted for.

Inference Speed

Fig. 4.16 shows the time taken by each of our test devices to perform CNN inference on a single image frame. The total inference time for the CPU, GPU and VPU has been split into two components, namely the recorded computation time and the estimated data transfer time. The SCAMP-5 does not have a separate data transfer time component because all computation occurs on the Vision System, and the computation time recorded includes all necessary data readouts.

As can be seen, the SCAMP-5 is clearly the superior device in terms of per-frame inference time. Even before adding the estimated data transfer time, the SCAMP-5 already significantly outperforms every other test device, with the margin increasing further once the data transfer time is factored in.


Figure 4.17: Power Consumption

Component                                      Power Consumption (W)
VPU Chip                                       0.64
Raspberry Pi Zero and camera in operation      1.20
Total                                          1.84

Table 4.2: Breakdown of VPU power consumption estimates.

Power Consumption

Fig. 4.17 shows the estimated power consumption values for each of our test devices. As expected, both the CPU and GPU consume a vast amount of power compared to the VPU and the SCAMP-5. Interestingly, the VPU and the SCAMP-5, both of which had low power consumption as a design objective, were estimated to use almost the same amount of power.

As explained in the previous section, power consumption for the VPU could only be directly measured for the VPU chip itself, while the power used by a micro-controller and camera needed to be added. Table 4.2 gives a breakdown of the power consumption estimates for the VPU.

Energy Consumption

Fig. 4.18 gives the per-frame energy consumption for each of our test devices. While power consumption gives an indication of the energy consumed per unit time, per-frame energy consumption figures also take inference speeds into account; devices which are able to conduct inference faster will thus use less energy per frame.
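The relationship can be written out explicitly; the numbers below are illustrative, chosen to be consistent with the headline figures quoted in the abstract (0.5 mJ per recognised digit at 3000 fps, implying roughly 1.5 W of draw):

```latex
E_{\text{frame}} = P \cdot t_{\text{frame}},
\qquad \text{e.g.}\quad
1.5\,\text{W} \times \tfrac{1}{3000}\,\text{s} = 0.5\,\text{mJ per frame}
```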


Figure 4.18: Per-Frame Energy Consumption

As can be seen, both the VPU and the SCAMP-5 use far less energy than the CPU and GPU. Furthermore, while the SCAMP-5 and the VPU have similar levels of power consumption, as seen in the previous section, the SCAMP-5's superiority in inference speed translates into a large reduction in per-frame energy consumption compared to the VPU, making the SCAMP-5 the clear leader in both inference speed and energy consumption.


Chapter 5

SCAMP-5 Hardware Analysis

In this chapter, we conduct an analysis of the hardware characteristics of the SCAMP-5 FPSP. The error and noise introduced by the analog nature of computation on the SCAMP-5 has been a central theme over the course of this thesis. While it was clear from the beginning of the project that the SCAMP-5 introduces a significant amount of noise during computation, it was not known what characteristics that noise had, as a comprehensive noise model had never been created for the SCAMP-5.

We therefore aimed to understand the nature of the noise and other hardware errors experienced during computation on the SCAMP-5, and sought to quantify these effects through the development of a set of mathematical models.

5.1 Contribution

We present in the following sections the first comprehensive high-level noise model of a SCAMP device. While some previous noise modeling had been conducted (e.g. Carey et al., 2013a), this work focused on circuit-level noise analysis, which is insufficient for a programmer seeking to implement high-level algorithms, who would want to know what impact noise would have on a given algorithm. In doing so, we also demonstrate methods for developing a noise model that can be applied to new iterations of SCAMP chips as and when they are released.

5.2 SCAMP-5 Error Analysis

In this section, we will examine the error characteristics of the SCAMP-5 by considering the output produced by a SCAMP operation in detail. In particular, we will consider the range of outputs produced when the add(X, 50, 20) operation is performed repeatedly.


Figure 5.1: Computation results when performing add(50,20) over 100 iterations

On a digital chip, an operation performing the addition 50 + 20 will always yield an accurate result of 70, no matter how many times the operation is carried out. On an analog chip such as the SCAMP-5, however, the result each time the operation is performed can differ significantly from the expected result. Consider the plot in Fig. 5.1, which shows the results of performing 100 calculations of 50 + 20 on the SCAMP-5.

The presence of both random error and systematic error can be observed. Each time the operation is performed, the result is slightly different. At the same time, the results all differ from the expected accurate result in a similar way, albeit by different degrees (due to the random error). These effects are even more obvious when considering a range of inputs. Figs. 5.2 and 5.3 plot the mean, standard deviation, and minimum and maximum outcomes when 30 is added to the value on the x-axis. The red diagonal line represents the expected accurate result. As can be seen, there is a significant amount of variation around each mean, with some inputs leading to larger variations than others. There is also a varying amount of systematic error, which reverses its direction as input values get larger.

5.2.1 Systematic Error Types

In addition to the distinction between systematic error and random error, one must also make a distinction between different types of systematic error when working with the SCAMP-5. In particular, we identified three main categories of systematic error, namely:

• Computation Error

• Boundary Error

• Division Error


Figure 5.2: Computation results when performing add(x, 30) over a range of different x values. The red line shows the expected result from accurate computation.

Figure 5.3: Magnified portion of Fig. 5.2


Each of these error categories will be explained in further detail in the following sections.

Computation Error

Computation error is the most obvious form of systematic error, and is what was observed in Fig. 5.1 in the previous section. Essentially, computation error is a systematic error that results in the distribution of results being centred above or below the accurate value. This error affects all the major SCAMP operations, and is generally the most common systematic error that a SCAMP programmer will encounter.

Boundary Error

Boundary error arises from the way values are stored in the SCAMP-5 circuitry, as electric currents of differing magnitudes. Because the range over which the current in each register is allowed to vary is finite, the number of values each SCAMP-5 register can represent is also finite. In practice, this means that the SCAMP-5's analog registers are designed to represent only values in the range [−128, 127]. This is necessarily of concern for operations which might attempt to significantly increase or decrease the value of a register beyond these limits, in particular addition and subtraction.

This therefore leads to the concept of boundary error, which is specific to the add() and sub() operations. The error occurs when an operation is attempted which would lead to a result beyond the range of representable values. The boundary is a hard limit, and the result of any operation that would return a value beyond the boundary is simply the value of that boundary.

Division Error

Division error, as the name suggests, is a systematic error specifically associated with the division operation. It arises from the way in which division is implemented in the SCAMP circuitry. As mentioned in the background review, analog division is carried out simply by channeling the current stored in one register into two separate registers, thereby splitting the current in half and performing a division by two.

The systematic error arises when one tries to perform division on values at the lower end of the [−128, 127] range. While the division operation is reasonably accurate for large values, it becomes less accurate as the value to be divided gets smaller. Fig. 5.4 demonstrates the result of applying division to a range of different values. As can be seen, the smaller the value being divided, the less accurate the division operation becomes.

We have therefore seen that the analog nature of the SCAMP-5 chip results in multiple unique types of systematic error (not to mention the random error discussed in the previous section), all of which a SCAMP-5 programmer must be aware of in order to successfully implement vision algorithms on the device.


Figure 5.4: Computation results when performing div2(x) over a range of different values for x. The red line shows the expected result from accurate computation.

5.3 Systematic Noise Model

The following two sections present two approaches to the development of noise models for the SCAMP-5. We decided to produce two noise models, a Systematic Noise Model and a Random Noise Model, which together provide a comprehensive picture of the SCAMP-5's noise landscape.

5.3.1 Objective

The aim of the Systematic Noise Model was to approximately model the systematic error for each of the eight major FPSP operations, specifically:

• Neighbour Operations: north( ), south( ), east( ), west( )

• Arithmetic Operations: neg( ), div2( ), add( , ), sub( , )

5.3.2 Model Formulation

It was decided that polynomial regression would be used to formulate the noise model for each operation. This would allow us to produce a deterministic model representing the predicted result for each input as a polynomial function.


Order of M    Coefficients Calculated
0             3.39
1             0.48, 3.39
2             −8.45 × 10^−5, 0.48, 3.83
3             −6.61 × 10^−7, −8.46 × 10^−5, 0.48, 3.83
4             2.63 × 10^−9, −6.61 × 10^−7, −1.20 × 10^−4, 0.48, 3.89

Table 5.1: Polynomial coefficients when polynomial regression of order M is performed on the div2 computation results dataset.

To do so, we fit the data using a polynomial of the following form:

    y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j        (5.1)

where M is the order of the polynomial, and \mathbf{w} is the vector of polynomial coefficients w_0, ..., w_M (Bishop, 2006).

For a given polynomial of order M, the values of \mathbf{w} can be found by minimizing an error function which measures the divergence between the function y(x, \mathbf{w}) and our recorded data points. In this task, we used the squared error as the error function, which is the sum of the squares of the differences between the predicted value y(x_n, \mathbf{w}) and the corresponding target value t_n for each training example n. The function we sought to minimize, E(\mathbf{w}), was therefore defined as follows:

    E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left[ y(x_n, \mathbf{w}) - t_n \right]^2        (5.2)

The next issue is therefore which order M to choose. Empirically, running experiments on the data collected for each of the operations quickly indicated that an order of 1 was appropriate, as increasing the order of M simply resulted in the additional coefficients taking on near-zero values (see Table 5.1). While more sophisticated solutions, such as cross-validation on a separate test set, could have been used, as per Bishop (2006), the experimental results indicated that such methods were not necessary for the present task.
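The model-order check can be reproduced on synthetic data. The sketch below uses numpy.polyfit, borrowing the order-1 coefficients from Table 5.1 as illustrative ground truth and adding noise of our own choosing; it is not the actual measurement data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data shaped like the div2 measurements: a linear response
# (slope/intercept taken from Table 5.1) plus random noise.
x = np.arange(-120, 121, 10, dtype=float)
t = 0.48 * x + 3.39 + rng.normal(scale=0.5, size=x.size)

# Least-squares polynomial fits of increasing order: the extra
# high-order coefficients come out near zero, so order 1 suffices.
for M in range(3):
    w = np.polyfit(x, t, deg=M)   # coefficients, highest order first
    print(M, np.round(w, 4))
```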

Finally, a slightly different representation was needed to fit the data for the dual-input operations, i.e. addition and subtraction. For these operations, multiple regression was used. As before, examination of the data indicated that an order of 1 was appropriate. We could therefore fit the data using an equation of the following form:

    y(a, b, \mathbf{w}) = w_0 + w_1 a + w_2 b        (5.3)

where a and b are the two operation arguments and \mathbf{w} is the vector of coefficients w_0, w_1, w_2.


5.3.3 Implementation

Samples were drawn by repeatedly performing each operation on the SCAMP-5 for a range of different input argument values and reading the results produced. Three processing elements were randomly selected as the sources of the readings to be taken.

For single-argument operations, experiments for r = operation(x) were conducted such that x ∈ {−120, −110, ..., 110, 120}. For each value of x, 1000 values of r were read from each of the randomly selected processing elements.

For dual-argument operations, experiments for r = operation(a, b) were conducted such that a ∈ {−120, −110, ..., 110, 120} and b ∈ {20, 50, 100}. For each unique combination of a and b, 1000 values of r were read from each of the selected processing elements.

5.3.4 Results

Following the approach described above for each of the operations, we were able to produce a set of equations which can be used to estimate the computation results of operations after systematic error is taken into account.

The Systematic Error Model of the negation and division operations and the four neighbour operations can be represented with the following set of equations, where x is the input argument:

neg(x) = −0.954x + 7.84        (5.4)

div2(x) = 0.482x + 3.39        (5.5)

north(x) = 0.982x + 1.07        (5.6)

south(x) = 0.977x + 0.631        (5.7)

east(x) = 0.979x + 0.0471        (5.8)

west(x) = 0.979x − 2.08        (5.9)

Graphical depictions of each of these models, in comparison to the models for accurate computation, are provided in Figs. 5.5 - 5.10.


Figure 5.5: Negation operation: Estimated result with systematic error (orange) and expected result for accurate computation (green).

The following equations were obtained for the addition and subtraction operations using multiple regression, with the max() and min() functions added to represent the boundary error introduced by the limited range of numerical representation available on the SCAMP-5:

add(a, b) = min(0.958a + 0.930b + 6.86, 127)        (5.10)

sub(a, b) = max(0.948a − 0.945b − 2.74, −128)        (5.11)

While it is difficult to provide a meaningful graphical representation of the dual-argument operations, Fig. 5.11 and 5.12 depict add(a, 50) and sub(a, 50) respectively, using the same techniques used with the single-argument operations, to give the reader a general idea of the overall shape of the systematic error involved.

The equations above therefore describe the systematic error likely to be encountered when the eight basic FPSP operations are run on the SCAMP-5.
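Collected into code, the model reads as a small lookup table. The sketch below is our own rendering of Eqs. 5.4–5.11 (the thesis does not prescribe an implementation), with the saturation of add and sub expressed via min() and max():

```python
# Systematic Error Model (Eqs. 5.4-5.11) as fitted linear predictors,
# with add/sub saturating at the limits of the numerical representation.
SYS_MODEL = {
    "neg":   lambda x: -0.954 * x + 7.84,
    "div2":  lambda x:  0.482 * x + 3.39,
    "north": lambda x:  0.982 * x + 1.07,
    "south": lambda x:  0.977 * x + 0.631,
    "east":  lambda x:  0.979 * x + 0.0471,
    "west":  lambda x:  0.979 * x - 2.08,
    "add":   lambda a, b: min(0.958 * a + 0.930 * b + 6.86, 127),
    "sub":   lambda a, b: max(0.948 * a - 0.945 * b - 2.74, -128),
}

def predict(op, *args):
    """Estimated SCAMP-5 result for `op` once systematic error is accounted for."""
    return SYS_MODEL[op](*args)

print(predict("neg", 10))       # about -1.70: scaled and offset, not exactly -10
print(predict("add", 100, 100)) # 127: the result saturates at the register limit
```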


Figure 5.6: Division operation: Estimated result with systematic error (orange) and expected result for accurate computation (green).

Figure 5.7: North operation: Estimated result with systematic error (orange) and expected result for accurate computation (green).


Figure 5.8: South operation: Estimated result with systematic error (orange) and expected result for accurate computation (green).

Figure 5.9: East operation: Estimated result with systematic error (orange) and expected result for accurate computation (green).


Figure 5.10: West operation: Estimated result with systematic error (orange) and expected result for accurate computation (green).

Figure 5.11: Sample addition operation: Estimated result with systematic error (orange), expected result for accurate computation (green), and boundary limit (red).


Figure 5.12: Sample subtraction operation: Estimated result with systematic error (orange), expected result for accurate computation (green), and boundary limit (red).

5.4 Random Noise Modelling

This section considers how best to develop a model representing the random noise inherent in SCAMP-5 operations. This complements the work described in the previous section, which produced a model accounting for the systematic error observed on the SCAMP-5. Being deterministic, however, the Systematic Error Model cannot account for the random error introduced by the SCAMP-5. An additional model is therefore needed in order to provide a comprehensive view of the SCAMP-5's noise characteristics.

5.4.1 Approach

The presence of a random element in the values produced by the SCAMP-5 indicates that a stochastic model is needed. The idea is to infer a probability distribution of possible outcomes for a given set of inputs, using the data that we are able to collect. This is a task well suited to a technique known as Kernel Density Estimation (KDE).

Kernel Density Estimation

The aim of Kernel Density Estimation is to infer the probability density function of a random variable, given a finite set of data drawn from that variable. This is exactly the problem we would like to solve: given a set of outcomes observed for a SCAMP-5 operation, what is the underlying probability distribution producing those outcomes?

Figure 5.13: Two histograms produced using the same set of data points. Observe that the histogram on the left appears to depict a bimodal distribution, while the histogram on the right appears to depict a unimodal distribution (VanderPlas, 2016).

The use of Kernel Density Estimation was motivated by the difficulties associated with using histograms, which are also a means of estimating an underlying probability distribution. Histograms suffer from a key deficiency: the size of the 'bins' chosen can significantly alter the histogram produced. Consider the two histograms shown in Fig. 5.13. The first appears to depict a bimodal distribution, while the second appears to show a unimodal distribution. Both were in fact produced using the exact same set of data points, with the only change being the choice of bin size. Effects such as this can severely limit the usefulness of the intuition seemingly provided by histograms, and it is for this reason that a more sophisticated solution was needed.
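The effect is easy to reproduce. In the sketch below (our illustration, using synthetic data rather than SCAMP-5 readings), the same bimodal sample is binned two ways; every point is counted in both cases, yet the two histograms can suggest very different shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
# 1000 points drawn from two overlapping normal distributions.
data = np.concatenate([rng.normal(-1.5, 1.0, 500), rng.normal(1.5, 1.0, 500)])

# The same data, binned two ways: many narrow bins vs. a few wide ones.
narrow_counts, _ = np.histogram(data, bins=40)
wide_counts, _ = np.histogram(data, bins=5)

# Every point lands in exactly one bin in each case; only the apparent
# shape of the distribution changes with the bin choice.
print(narrow_counts.sum(), wide_counts.sum())  # 1000 1000
```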

Kernel Density Estimation helps to alleviate this problem by essentially creating a histogram with each histogram block centred on a data point, rather than fixed in line with the histogram interval. The blocks are then 'stacked' as and where they overlap, producing a distribution of the kind shown in Fig. 5.14. This is a far more accurate representation of the original data set, which was a set of points drawn from two normal distributions.

Finally, to smooth out the distribution produced, we can replace the 'blocks' used at each data point with a smooth function such as a Gaussian distribution, which when stacked produces the smooth estimate shown in Fig. 5.15.

This 'stacking' of functions is the essence of Kernel Density Estimation, where a kernel is applied at each data point in order to estimate an overall probability density function. The 'block' used in Fig. 5.14 is referred to as the tophat kernel, while the Gaussian function used in Fig. 5.15 is known as the Gaussian kernel.

Figure 5.14: 'Stacked' histogram blocks centred on individual data points (VanderPlas, 2016).

Figure 5.15: Blocks in Fig. 5.14 replaced with Gaussian distributions (VanderPlas, 2016).

A mathematical definition of the kernel density estimator for an unknown probability density p(x) is provided as follows:

p(x) = (1/N) Σ_{n=1}^{N} K_h(x − x_n) (5.12)

where {x_1, x_2, ..., x_N} is a set of N samples drawn from p(x) and K_h is a kernel function with bandwidth h.
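Eq. 5.12 translates directly into code. The sketch below is our own minimal NumPy implementation (not the thesis's tooling), supporting both the tophat and Gaussian kernels discussed above:

```python
import numpy as np

def kde(samples, h, kernel="gaussian"):
    """Return p(x) per Eq. 5.12: the mean of a bandwidth-h kernel
    centred on each sample."""
    samples = np.asarray(samples, dtype=float)

    def p(x):
        u = (np.asarray(x, dtype=float)[..., None] - samples) / h
        if kernel == "gaussian":
            k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
        else:  # "tophat": the stacked blocks of Fig. 5.14
            k = 0.5 * (np.abs(u) <= 1.0)
        return k.mean(axis=-1) / h

    return p

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, 2000)
p = kde(data, h=0.5)

# A density estimate must integrate to ~1; check with a Riemann sum.
grid = np.linspace(-6.0, 6.0, 1201)
print((p(grid) * (grid[1] - grid[0])).sum())  # close to 1.0
```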

KDE Bandwidth Selection

A key consideration in the implementation of KDE is the selection of the kernel size, otherwise known as the bandwidth.

Choosing a bandwidth that is too narrow will result in a KDE that over-fits the data, capturing quirks of the sample not found in the underlying distribution. On the other hand, choosing a bandwidth that is too wide will under-fit the data, smoothing out important features of the underlying distribution.

K-fold Cross-Validation

Bandwidth selection (especially for multi-dimensional KDE) is an active area of research, and various methods have been proposed (e.g. Silverman, Botev et al.).

For the purposes of this project, K-fold cross-validation in conjunction with a grid search across potential bandwidth parameter values was employed, using the process outlined by VanderPlas (2016).
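The procedure can be sketched in plain NumPy (our illustration with synthetic data; the project itself followed the scikit-learn based recipe in VanderPlas (2016)): each candidate bandwidth is scored by the held-out log-likelihood of a Gaussian KDE, averaged over K folds, and the best scorer is kept.

```python
import numpy as np

def heldout_loglik(train, test, h):
    """Mean log-likelihood of held-out points under a Gaussian KDE on `train`."""
    u = (test[:, None] - train[None, :]) / h
    dens = np.exp(-0.5 * u**2).sum(axis=1) / (len(train) * h * np.sqrt(2.0 * np.pi))
    return np.log(dens + 1e-300).mean()  # guard against log(0)

def select_bandwidth(data, bandwidths, k=5, seed=0):
    """Grid search over candidate bandwidths, scored by k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(data))
    folds = np.array_split(idx, k)
    scores = [
        np.mean([heldout_loglik(np.delete(data, f), data[f], h) for f in folds])
        for h in bandwidths
    ]
    return bandwidths[int(np.argmax(scores))]

rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.6, 1000)  # synthetic stand-in for the error samples
best_h = select_bandwidth(data, np.linspace(0.1, 2.0, 20))
print(best_h)  # a moderate bandwidth, neither over- nor under-smoothing
```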

5.4.2 Implementation

Using the same data as in the previous section, we had access to a dataset containing information on the running of 753,000 SCAMP-5 instructions. Combined with the Systematic Error Model developed in the previous section, this dataset could be used as the basis for developing a Random Error Model, expressing how much a computation result might differ from the prediction given by the Systematic Error Model.

Before beginning the KDE process, a question to consider was whether the random error across operations and input values was best represented in terms of absolute numerical values or percentage values. This was investigated by comparing the numerical error range and percentage error range for a given operation over a range of input values, the error range being the difference between the maximum and minimum results recorded for an input value. If the error is best represented in percentage terms, we would expect the numerical error range to shrink for small input values and grow for large input values, while the percentage error range remains fairly constant. If the error is best represented in absolute numerical terms, we would instead see the exact opposite.

Figure 5.16: Comparison between numerical error range (blue) and percentage error range (orange) for operation north(x) over a range of input arguments.

Fig. 5.16 shows a comparison between the numerical error range and percentage error range for the north(x) operation over a range of input arguments. While the error range expressed as a percentage fluctuates wildly, growing especially large at input values close to zero (where even a small absolute error is a large fraction of the expected result), the numerical error range remains fairly constant throughout the whole range of input values. Similar results were observed for the other operations. This clearly indicated that the random error was best expressed in absolute numerical terms, rather than in percentage terms.

Once it had been established that the random error was experienced in absolute terms, we were able to calculate the random error attributable to a given computation. For each instruction in the dataset, we therefore calculated the value of the error E as follows:

E = R − sys(op, x) (5.13)

where R refers to the hardware's recorded computation result, op is the relevant operation, x is a vector representing the input arguments for that operation, and sys() is a function representing the Systematic Error Model.
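In code, Eq. 5.13 is a single subtraction once the Systematic Error Model is callable. The record layout and names below are hypothetical (the thesis does not specify its data format); only the neg and add coefficients from Eqs. 5.4 and 5.10 are carried over:

```python
import numpy as np

# Eqs. 5.4 and 5.10, reproduced here so the sketch is self-contained.
SYS_MODEL = {
    "neg": lambda x: -0.954 * x + 7.84,
    "add": lambda a, b: min(0.958 * a + 0.930 * b + 6.86, 127),
}

def random_errors(records):
    """E = R - sys(op, x) (Eq. 5.13) for each (op, args, recorded_result) record."""
    return np.array([r - SYS_MODEL[op](*args) for op, args, r in records])

# Hypothetical recorded results for two instructions.
records = [("neg", (10,), -0.9), ("add", (20, 50), 72.5)]
print(random_errors(records))  # roughly [ 0.8  -0.02]
```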

5.4.3 Results

Having calculated the random error attributable to every instruction in the dataset, we were able to produce a Kernel Density Estimation expressing the likely probability density function governing the distribution of random error values, as shown in Fig. 5.17. The KDE used a Gaussian kernel and a bandwidth value of 0.68, which was found to be optimal during the grid-search/cross-validation process.

Figure 5.17: Kernel Density Estimation for random error, modelled on the 753,000-instruction dataset collected on the SCAMP-5.

Figure 5.18: KDE from Fig. 5.17 overlaid with an approximated Gaussian distribution.

As can be seen, the probability density function produced closely resembles a Gaussian distribution. By identifying appropriate values of µ and σ, namely µ = 0.05 and σ = 1.60, we were able to closely approximate the KDE with a Gaussian distribution, as shown in Fig. 5.18.

The Random Error Model for the SCAMP-5 can therefore be specified by the following Gaussian distribution:

X ∼ N(0.05, 1.60²) (5.14)
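Fitting and then sampling such a model is straightforward by moment matching. The sketch below uses synthetic stand-in data drawn from the reported distribution, not the actual 753,000 error values:

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic stand-in for the per-instruction error values E of Eq. 5.13.
errors = rng.normal(0.05, 1.60, 753_000)

# Moment-matched Gaussian fit: the mu and sigma of the Random Error Model.
mu, sigma = errors.mean(), errors.std()
print(f"X ~ N({mu:.2f}, {sigma:.2f}^2)")

# The fitted model can then be sampled, e.g. to inject noise into a simulator.
simulated_noise = rng.normal(mu, sigma, size=8)
```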


Chapter 6

Conclusion

This thesis contains contributions to the fields of Neural Network Inference Acceleration, Focal Plane Sensor Processors, and Analog Computing.

Using our customised neural network architecture implemented on the SCAMP-5, we were able to demonstrate handwritten digit recognition at >90% accuracy at 3000 fps, using only 0.5 mJ per recognised digit. To our knowledge, this project marks the first time a neural network has been successfully implemented on an FPSP (either analog or digital). It is also the first time that the neural network inference pipeline has been successfully implemented on analog hardware of any kind.

It is our hope that with further research, an analog FPSP solution can soon be developed and implemented for use in a significant real-world application, and that FPSPs will eventually come to be seen as a standard component in any computer vision inference pipeline.

6.1 Contribution

The work carried out over the course of this thesis presents the following contributions:

• Custom Regularization: A technique for optimising CNN network weights such that network layers can be accurately converted to FPSP code.

• Noise-In-The-Loop Training: A method for training neural networks such that hardware noise is taken into account at the training stage, allowing for the training of neural networks that are robust to the effects of hardware noise.

• AnalogNet Architecture: A neural network architecture specifically designed for use on the SCAMP-5 FPSP.

• AnalogNet Implementation: A neural network implementation on the SCAMP-5 capable of performing digit recognition at >90% accuracy at 3000 fps, achieving an estimated 85% improvement in inference time and 84% improvement in energy efficiency over the state-of-the-art.

• SCAMP-5 Noise Model: A comprehensive noise model for the SCAMP-5, modelling the occurrence of both systematic and random noise on the device.

This work makes important contributions to three distinct strands of research.

First, this thesis represents a contribution to the growing field of research focused on developing methods for neural network inference acceleration. Work in this diverse field has recently been carried out with increasing urgency, with researchers developing a broad range of potential methods encompassing everything from specialised Vision Processing Units to 3D-printed light filters. Our research contributes to the diversity of this field, introducing a novel solution using a class of devices never before used for this purpose. Our approach demonstrates significant promise for further development, having already achieved large gains in speed and energy efficiency over existing state-of-the-art methods.

Furthermore, our work represents a significant advance in Focal-Plane Sensor Processor research. Over the course of the project, numerous parties we interacted with expressed skepticism that such a complex task could be carried out by the current generation of FPSPs in general, and by the SCAMP-5 in particular. Experts in the field pointed out that all previous work on the SCAMP device family had been limited to far simpler tasks, and that nothing this ambitious had ever been attempted on these devices. Our research has demonstrated not only that this is possible, but that even more ambitious applications are now within reach. It is our hope that our results will spur greater interest in the field, having demonstrated the immense potential of such devices.

Finally, this thesis has contributed to the revival currently occurring in the analog computing literature. Our work builds on projects such as RedEye, which demonstrated using simulated implementations that analog computing techniques can lead to great speed and efficiency gains in neural network computation. Our work now provides an analog-based on-device implementation, concretely demonstrating the potential gains that can be achieved through the use of analog computing, whether in computer vision or in any other field of computing.

6.2 Future Work

Now that the immense potential of implementing neural networks on FPSPs to facilitate inference acceleration and energy efficiency has been demonstrated, the next step is to work towards the widespread deployment of FPSP-based neural networks across a wide range of potential applications.

We believe that our work opens up a number of promising new lines of future research, in terms of developing both better software and better hardware.


6.2.1 Software Research

Further work will need to be carried out in the area of FPSP software development to consolidate and extend the contributions made by this project. In this section, we highlight the two areas of research which we feel will yield the most significant contributions: the first eventually allowing for the use of state-of-the-art network architectures, and the second allowing for deployment in ultra-long-term power-restricted applications.

Greater Network Complexity

Further work can profitably be carried out towards the implementation of broader and deeper neural networks on the SCAMP-5 in particular and on FPSPs in general. Analog Vision presented an architecture that, while capable of classifying MNIST digits with high accuracy, would not be complex enough to handle vision tasks likely to be encountered in deployment environments. Research effort should therefore be directed towards implementing networks with more layers (both convolutional and dense) and more filters per convolutional layer. A particular research concern will likely be finding novel ways of doing so while still keeping within the major computational constraints imposed by FPSP development. Such work will pave the way towards the successful implementation of deployment-standard networks such as Inception-V3 on FPSPs, eventually producing solutions ready for immediate use in real-world applications.

SCAMP-5 Sleep Mode

An area of application where FPSP-based CNN solutions hold great promise is their deployment on always-on devices. With their extremely low power consumption per recognised frame, they could potentially be deployed for weeks or months on a single battery charge when run at low frame rates, by sending the vision system into "sleep" mode in between frames. As mentioned above, this capability was demonstrated on the SCAMP-3, when a device drawing power from only three AAA batteries conducted loiterer detection at 8 fps continuously for 10 days (Carey et al., 2013b).

However, the SCAMP-5 uses a new, more complicated microprocessor. The implementation of sleep mode would need to be developed from scratch, and is considered a non-trivial task requiring significant development work and a deep understanding of the NXP micro-controller and its associated software.

6.2.2 Hardware Research

While we strongly recommend that further work be carried out in the areas of software research outlined above, we are nonetheless of the opinion that deployable solutions are unlikely without significant improvements in FPSP hardware. In this section, we detail the two areas of hardware development which we have identified as likely to have the greatest positive impact on the prospects of FPSP-based neural networks. We focus our suggestions mainly on the SCAMP-5, given that it is the FPSP we are most familiar with and one of the leading FPSPs in the field.

Analog Memory

Perhaps the most significant contribution that can be made is the introduction of some form of analog memory at each processing element (PE).

Unlike a standard processor, each SCAMP-5 processing element has no access to memory. As discussed in Section 3.3, this means that the only way to store intermediate results and other relevant values is to hold them in an analog register. Given that the SCAMP-5 has only six analog registers per PE, this places a severe limit on the complexity of computation that can be carried out. In particular, this constraint means that any CNN on the SCAMP-5 with more than one convolutional layer can have at most three convolutional filters.

The introduction of analog memory would thus greatly facilitate the implementation of complex neural networks, with an increase in the amount of memory available directly corresponding to an increase in the ability to handle network complexity.

Colour Vision

Another significant hardware contribution would be the development of a full-colour SCAMP device. The SCAMP-5 currently supports only greyscale images. While this is sufficient for tasks such as digit recognition, colour is an important source of information for more complex real-world applications, and the limitation to greyscale results in the loss of a significant amount of that information. It is likely that any widespread deployment of FPSP-based solutions would first require support for colour vision in order to adequately deal with the complexity inherent in many real-world applications.


Appendices


Appendix A

Ethics Checklist

Yes No

Section 1: HUMAN EMBRYOS/FOETUSES

Does your project involve Human Embryonic Stem Cells? X

Does your project involve the use of human embryos? X

Does your project involve the use of human foetal tissues / cells? X

Section 2: HUMANS

Does your project involve human participants? X

Section 3: HUMAN CELLS / TISSUES

Does your project involve human cells or tissues? (Other than from Human Embryos/Foetuses, i.e. Section 1)

X

Section 4: PROTECTION OF PERSONAL DATA

Does your project involve personal data collection and/or processing? X

Does it involve the collection and/or processing of sensitive personal data (e.g. health, sexual lifestyle, ethnicity, political opinion, religious or philosophical conviction)?

X

Does it involve processing of genetic information? X

Does it involve tracking or observation of participants? It should be noted that this issue is not limited to surveillance or localization data. It also applies to WAN data such as IP addresses, MACs, cookies, etc.

X

Does your project involve further processing of previously collected personal data (secondary use)? For example, does your project involve merging existing data sets?

X

Section 5: ANIMALS

Does your project involve animals? X

Section 6: DEVELOPING COUNTRIES


Does your project involve developing countries? X

If your project involves low and/or lower-middle income countries, are any benefit-sharing actions planned?

NA

Could the situation in the country put the individuals taking part in the project at risk?

NA

Section 7: ENVIRONMENTAL PROTECTION AND SAFETY

Does your project involve the use of elements that may cause harm to the environment, animals or plants?

X

Does your project deal with endangered fauna and/or flora / protected areas?

X

Does your project involve the use of elements that may cause harm to humans, including project staff?

X

Does your project involve other harmful materials or equipment, e.g. high-powered laser systems?

X

Section 8: DUAL USE

Does your project have the potential for military applications? X

Does your project have an exclusive civilian application focus? X

Will your project use or produce goods or information that will require export licenses in accordance with legislation on dual use items?

X

Does your project affect current standards in military ethics, e.g. global ban on weapons of mass destruction, issues of proportionality, discrimination of combatants and accountability in drone and autonomous robotics developments, incendiary or laser weapons?

X

Section 9: MISUSE

Does your project have the potential for malevolent/criminal/terrorist abuse?

X

Does your project involve information on/or the use of biological-, chemical-, nuclear/radiological-security sensitive materials and explosives, and means of their delivery?

X

Does your project involve the development of technologies or the creation of information that could have severe negative impacts on human rights standards (e.g. privacy, stigmatization, discrimination), if misapplied?

X

Does your project have the potential for terrorist or criminal abuse, e.g. infrastructural vulnerability studies, cybersecurity-related projects?

X

Section 10: LEGAL ISSUES

Will your project use or produce software for which there are copyright licensing implications?

X


Will your project use or produce goods or information for which there are data protection, or other legal implications?

X

Section 11: OTHER ETHICS ISSUES

Are there any other ethics issues that should be taken into consideration? X


Appendix B

Ethical and Professional Considerations

The main ethical considerations for this project revolve around the dual-use potential of the research. As a low-power hardware device well suited to embedded computing and robotics applications, given sufficient software capabilities, the SCAMP-5 might well come to be seen as having potential military applications, especially for devices operating in the field away from convenient sources of power. Nonetheless, while future innovations might introduce sufficient capabilities for military use, the current work has little possibility of direct military application, given that any real-world application would need to be far more sophisticated than simply classifying handwritten digits from the MNIST dataset. Furthermore, while this research may eventually pave the way for potential military uses, it also paves the way for a great number of potential civilian uses. Numerous beneficial applications could result from the ability to use low-power neural networks throughout the general environment, and in that regard we conclude that on balance the case for proceeding with the research is compelling.


Bibliography

Ambrogio, Stefano et al. (2018). "Equivalent-accuracy accelerated neural-network training using analogue memory". In: Nature 558.7708, pp. 60–67. ISSN: 1476-4687. DOI: 10.1038/s41586-018-0180-5. URL: http://dx.doi.org/10.1038/s41586-018-0180-5.

Bishop, Christopher (2006). Pattern Recognition and Machine Learning. 1st ed. Springer-Verlag New York. ISBN: 978-0-387-31073-2.

Carey, Stephen J et al. (2013a). "A 100,000 fps Vision Sensor with Embedded 535 GOPS/W 256x256 SIMD Processor Array". In: Proc. of the VLSI Circuits Symposium 2013, pp. 182–183. ISBN: 9784863483484. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.699.9147&rep=rep1&type=pdf.

Carey, Stephen J et al. (2013b). "Low power high-performance smart camera system based on SCAMP vision sensor". In: Journal of Systems Architecture 59.10, pp. 889–899. ISSN: 1383-7621. DOI: 10.1016/J.SYSARC.2013.03.016. URL: https://www.sciencedirect.com/science/article/pii/S1383762113000490?via%3Dihub.

Chen, Yu-Hsin et al. (2016). Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. URL: http://www.rle.mit.edu/eems/wp-content/uploads/2016/02/eyeriss_isscc_2016.pdf.

Chen, Jianing (2018). Scamp5d Vision System 1.1.0 Device Development Library Documentation. URL: https://personalpages.manchester.ac.uk/staff/jianing.chen/scamp5d_lib_doc_html/index.html.

Chen, Jianing et al. (2017). Feature Extraction using a Portable Vision System. URL: http://personalpages.manchester.ac.uk/staff/p.dudek/papers/chen-iros2017.pdf.

Cho, Kyunghyun et al. (2014). “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”. In: URL: http://arxiv.org/abs/1406.1078.

Debrunner, Thomas (2017). Automatic Code Generation and Pose Estimation on Cellular Processor Arrays (CPAs). URL: https://www.imperial.ac.uk/media/imperial-college/faculty-of-engineering/computing/public/student-projects-pg-2016-17/DebrunnerT-Automatic-Code-Generation-and-Pose-Estimation-on-Cellular-Processor-Arrays.pdf.

Goode, L (2018). Google I/O 2018: How Google’s Duplex Demo Stole the Show —WIRED. URL: https://www.wired.com/story/google-duplex-phone-calls-ai-future/.

Google (2018). Fixed Point Quantization — TensorFlow. URL: https://www.tensorflow.org/performance/quantization.

Graves, Alex et al. (2013). "Speech Recognition with Deep Recurrent Neural Networks". In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649. ISBN: 978-1-4799-0356-6. DOI: 10.1109/ICASSP.2013.6638947. URL: http://arxiv.org/abs/1303.5778.

Howard, Andrew G. et al. (2017). "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications". In: URL: http://arxiv.org/abs/1704.04861.

Hubara, Itay et al. (2016). "Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations". In: URL: https://arxiv.org/pdf/1609.07061.pdf.

Karpathy, Andrej (2018). CS231n Convolutional Neural Networks for Visual Recognition. URL: http://cs231n.github.io/convolutional-networks/.

Krizhevsky, Alex et al. (2012). "ImageNet Classification with Deep Convolutional Neural Networks". In: Advances in Neural Information Processing Systems 25. Ed. by F Pereira et al. Curran Associates, Inc., pp. 1097–1105. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

Lichtsteiner, Patrick et al. (2008). “A 128 x 128 120 dB 15 µs latency asynchronous temporal contrast vision sensor”. In: IEEE Journal of Solid-State Circuits 43.2, pp. 566–576. ISSN: 00189200. DOI: 10.1109/JSSC.2007.914337.

Likamwa, Robert et al. (2016). “RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision”. In: Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016, pp. 255–266. ISSN: 01635964. DOI: 10.1109/ISCA.2016.31.

Maqueda, Ana I et al. (2018). “Event-based Vision meets Deep Learning on Steering Prediction for Self-driving Cars”. In: URL: http://arxiv.org/abs/1804.01310.


Martel, Julien and Piotr Dudek (2016). “Vision Chips with In-pixel Processors for High-performance Low-power Embedded Vision Systems”. In: ASR-MOV Workshop, CGO’16.

Martel, Julien et al. (2018). “Real-Time Depth from Focus on a Programmable Focal Plane Processor”. In: IEEE Transactions on Circuits and Systems I: Regular Papers 65.3, pp. 925–934. ISSN: 15498328. DOI: 10.1109/TCSI.2017.2753878. URL: https://www.research.manchester.ac.uk/portal/files/66318412/martel_a_tcas_2018.pdf.

Mitchell, Tom M. (1997). Machine Learning. McGraw-Hill. ISBN: 0-07-115467-1.

Mor, Noam et al. (2018). “A Universal Music Translation Network”. In: URL: https://arxiv.org/pdf/1805.07848.pdf.

Nvidia (2017). Accelerating AI Inference Performance in the Data Center and Beyond — The Official NVIDIA Blog. URL: https://blogs.nvidia.com/blog/2017/09/25/ai-inference/.

Pandruvada, Srinivas (2014). Running Average Power Limit. URL: https://01.org/blogs/2014/running-average-power-limit-–-rapl.

Petridis, Stavros (2018). Course 395: Machine Learning - Lectures. URL: https://ibug.doc.ic.ac.uk/courses.

RasPi.TV (2016). Raspberry Pi Zero 1.3 Power Usage with camera. URL: https://raspi.tv/2016/raspberry-pi-zero-1-3-power-usage-with-camera.

Rawat, Waseem and Zenghui Wang (2017). “Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review”. In: Neural Computation 29.9, pp. 2352–2449. ISSN: 0899-7667. URL: https://www.mitpressjournals.org/doi/pdf/10.1162/neco_a_00990.

Rosenblatt, F (1957). The Perceptron, a Perceiving and Recognizing Automaton. URL: https://books.google.co.uk/books/about/The_Perceptron_a_Perceiving_and_Recogniz.html?id=P_XGPgAACAAJ&redir_esc=y.

Rotem, Efraim et al. (2012). “Power-Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge”. In: IEEE Micro 32.2, pp. 20–27. ISSN: 0272-1732. DOI: 10.1109/MM.2012.12. URL: http://ieeexplore.ieee.org/document/6148200/.


Scaramuzza, Davide (2015). “Tutorial on Event-based Vision for High-Speed Robotics”. In: URL: http://rpg.ifi.uzh.ch.

VanderPlas, Jake (2016). Python Data Science Handbook. 1st ed. O’Reilly Media. ISBN: 1491912057.

Zarandy, Akos (2011). Focal-plane sensor-processor chips. Springer, p. 305. ISBN: 1441964746.

Zhu, Jun Yan et al. (2017). “Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks”. In: Proceedings of the IEEE International Conference on Computer Vision. Vol. 2017-Octob, pp. 2242–2251. ISBN: 9781538610329. DOI: 10.1109/ICCV.2017.244. URL: http://arxiv.org/abs/1703.10593.
