arXiv:1611.01421v3 [cs.CV] 25 Dec 2017

This manuscript is published in Neural Networks. Please cite it as: Kheradpisheh, S.R., Ganjtabesh, M., Thorpe, S.J., Masquelier, T., STDP-based spiking deep convolutional neural networks for object recognition. Neural Networks (2017). https://doi.org/10.1016/j.neunet.2017.12.005

STDP-based spiking deep convolutional neural networksfor object recognition

Saeed Reza Kheradpisheh1,2,∗, Mohammad Ganjtabesh1, Simon J. Thorpe2, and Timothée Masquelier2

1 Department of Computer Science, School of Mathematics, Statistics, and Computer Science, University of Tehran, Tehran, Iran

2 CerCo UMR 5549, CNRS – Université Toulouse 3, France

Abstract

Previous studies have shown that spike-timing-dependent plasticity (STDP) can be used in spiking neural networks (SNN) to extract visual features of low or intermediate complexity in an unsupervised manner. These studies, however, used relatively shallow architectures, and only one layer was trainable. Another line of research has demonstrated – using rate-based neural networks trained with back-propagation – that having many layers increases the recognition robustness, an approach known as deep learning. We thus designed a deep SNN, comprising several convolutional (trainable with STDP) and pooling layers. We used a temporal coding scheme where the most strongly activated neurons fire first, and less activated neurons fire later or not at all. The network was exposed to natural images. Thanks to STDP, neurons progressively learned features corresponding to prototypical patterns that were both salient and frequent. Only a few tens of examples per category were required and no label was needed. After learning, the complexity of the extracted features increased along the hierarchy, from edge detectors in the first layer to object prototypes in the last layer. Coding was very sparse, with only a few thousand spikes per image, and in some cases the object category could be reasonably well inferred from the activity of a single higher-order neuron. More generally, the activity of a few hundred such neurons contained robust category information, as demonstrated using a classifier on the Caltech 101, ETH-80, and MNIST databases. We also demonstrate the superiority of STDP over other unsupervised techniques such as random crops (HMAX) or auto-encoders. Taken together, our results suggest that the combination of STDP with latency coding may be a key to understanding the way that the primate visual system learns, its remarkable processing speed, and its low energy consumption. These mechanisms are also interesting for artificial vision systems, particularly for hardware solutions.

Keywords: Spiking Neural Network, STDP, Deep Learning, Object Recognition, Temporal Coding

∗Corresponding author. Email addresses: [email protected] (SRK), [email protected] (MG), [email protected] (ST), [email protected] (TM).

Introduction

The primate visual system solves the object recognition task through hierarchical processing along the ventral pathway of the visual cortex [10]. Through this hierarchy, the visual preference of neurons gradually increases from oriented bars in primary visual cortex (V1) to complex objects in inferotemporal cortex (IT), where neural activity provides a robust, invariant, and linearly-separable object representation [10, 9]. Despite the extensive feedback connections in the visual cortex, the first feed-forward wave of spikes in IT (∼100–150 ms post-stimulus presentation) appears to be sufficient for crude object recognition [54, 19, 33].

During the last decades, various computational models have been proposed to mimic this hierarchical feed-forward processing [15, 28, 48, 36, 31]. Despite the limited successes of the early models [42, 16], recent advances in deep convolutional neural networks (DCNN) have led to high-performing models [27, 58, 51]. Beyond their high precision, DCNNs can tolerate object variations as humans do [24, 25], use IT-like object representations [5, 22], and match the spatio-temporal dynamics of the ventral visual pathway [7].

Although the architecture of DCNNs is loosely inspired by the primate visual system [29] (a hierarchy of computational layers with gradually increasing receptive fields), they totally neglect the actual neural processing and learning mechanisms in the cortex.

The computing units of DCNNs send floating-point values to each other corresponding to their activation level, whereas biological neurons communicate with each other by sending electrical impulses (i.e., spikes). The amplitude and duration of all spikes are almost the same, so they are fully characterized by their emission time. Interestingly, mean spike rates are very low in the primate visual system (perhaps only a few hertz [50]). Hence, neurons appear to fire a spike only when they have to send an important message, and some information can be encoded in their spike times. Such spike-time coding leads to fast and extremely energy-efficient neural computation in the brain (the whole human brain consumes only about 10-20 watts [34]).

The current top-performing DCNNs are trained with the supervised back-propagation algorithm, which has no biological root. Although it works well in terms of accuracy, convergence is rather slow because of the credit assignment problem [45]. Furthermore, given that DCNNs typically have millions of free parameters, millions of labeled examples are needed to avoid over-fitting. However, primates, especially humans, can learn from far fewer examples, while most of the time no label is available. They may be able to do so thanks to spike-timing-dependent plasticity (STDP), an unsupervised learning mechanism which occurs in the mammalian visual cortex [38, 18, 37]. According to STDP, synapses through which a presynaptic spike arrived before (respectively after) a postsynaptic one are reinforced (respectively depressed).

To date, various spiking neural networks (SNN) have been proposed to solve object recognition tasks. One group of these networks are converted versions of traditional DCNNs [6, 20, 13]. The main idea is to replace each DCNN computing unit with a spiking neuron whose firing rate is correlated with the output of that unit. The aim of these networks is to reduce the energy consumption of DCNNs. However, the inevitable drawbacks of such spike-rate coding are the need for many spikes per image and the long processing time. Besides, the use of the back-propagation learning algorithm, and having both positive (excitatory) and negative (inhibitory) output synapses from a single neuron, are not biologically plausible. On the other hand, there are SNNs which are spiking networks from the outset and learn spike patterns. The first group of these networks exploits learning methods such as auto-encoders [40, 4] and back-propagation [1], which are not biologically plausible. The second group consists of SNNs with bio-inspired learning rules, which have shallow architectures [3, 17, 44, 59, 11] or only one trainable layer [36, 2, 23].

In this paper, we propose an STDP-based spiking deep neural network (SDNN) with spike-time neural coding. The network comprises a temporal-coding layer followed by a cascade of consecutive convolutional (feature extractor) and pooling layers. The first layer converts the input image into an asynchronous spike train, where the visual information is encoded in the temporal order of the spikes. Neurons in convolutional layers integrate input spikes and emit a spike right after reaching their threshold. These layers are equipped with STDP to learn visual features. Pooling layers provide translation invariance and also compact the visual information [48]. Through the network, visual features get larger and more complex, and neurons in the last convolutional layer learn and detect object prototypes. At the end, a classifier detects the category of the input image based on the activity of neurons in the last pooling layer, which have global receptive fields.

We evaluated the proposed SDNN on the Caltech face/motorbike and ETH-80 datasets, with large-scale images of various objects taken from different viewpoints. The proposed SDNN reached accuracies of 99.1% on the face/motorbike task and 82.8% on ETH-80, which indicates its capability to recognize several natural objects even under severe variations. To the best of our knowledge, there is no other spiking deep network which can recognize large-scale natural objects. We also examined the proposed SDNN on the MNIST dataset, which is a benchmark for spiking neural networks, and interestingly, it reached 98.4% recognition accuracy. In addition to its high performance, the proposed SDNN is highly energy-efficient and works with a small number of spikes per image, which makes it suitable for neuromorphic hardware implementation. Although current state-of-the-art DCNNs have achieved stunning results on various recognition tasks, continued work on brain-inspired models could lead to strong intelligent systems in the future, and may even help us improve our understanding of the brain itself.

Proposed Spiking Deep Neural Network

A sample architecture of the proposed SDNN with three convolutional and three pooling layers is shown in Fig. 1. Note that the architectural properties (e.g., the number of layers and receptive field sizes) and learning parameters should be optimized for the desired recognition task.

The first layer of the network uses Difference of Gaussians (DoG) filters to detect contrasts in the input image. It encodes the strength of these contrasts in the latencies of its output spikes (the higher the contrast, the shorter the latency). Neurons in convolutional layers detect more complex features by integrating input spikes from the previous layer, which detects simpler visual features. Convolutional neurons emit a spike as soon as they detect their preferred visual feature, which depends on their input synaptic weights. During learning, neurons that fire earlier perform STDP and prevent the others from firing via a winner-take-all mechanism. In this way, more salient and frequent features tend to be learned by the network. Pooling layers provide translation invariance using a maximum operation, and also help the network to compress the flow of visual data. Neurons in pooling layers propagate the first spike received from neighboring neurons in the previous layer which are selective to the same feature. Convolutional and pooling layers are arranged in consecutive order. Receptive fields gradually increase through the network, and neurons in higher layers become selective to complex objects or object parts.

It should be noted that the internal potentials of all neurons are reset to zero before processing the next image. Also, learning only happens in convolutional layers, and it is done layer by layer. Since the calculations of each neuron are independent of those of adjacent neurons, the convolution, pooling, and STDP operations are each performed in parallel on a GPU to speed up the computations.

DoG and temporal coding

The important role of the first stage in SNNs is to encode the input signal into discrete spike events in the temporal domain. This temporal coding determines the content and the amount of information carried by each spike, which deeply affects the neural computations in the network. Hence, using an efficient coding scheme in SNNs can lead to fast and accurate responses. Various temporal coding schemes can be used in visual processing (see ref. [53]). Among them, rank-order coding has been shown to be efficient for rapid processing (even possibly in retinal ganglion cells) [55, 43].

Cells in the first layer of the network apply a DoG filter over their receptive fields to detect positive or negative contrasts in the input image. The DoG filter well approximates the center-surround properties of the ganglion cells of the retina. When presented with an image, these DoG cells detect the contrasts and


Figure 1: A sample architecture of the proposed SDNN with three convolutional and three pooling layers. The first layer applies ON- and OFF-center DoG filters of size wD1 × wD2 on the input image and encodes the image contrasts in the timing of the output spikes. The ith convolutional layer, Conv i, learns combinations of features extracted in the previous layer. The ith pooling layer, Pool i, provides translation invariance for features extracted in the previous layer and compresses the visual information using a local maximum operation. Finally, the classifier detects the object category based on the feature values computed by the global pooling layer. The window sizes of the ith convolutional and pooling layers are indicated by wci1,2 and wpi1,2, respectively. The number of neuronal maps of the ith convolutional and pooling layers is indicated by ni below each layer.

emit a spike; the more strongly a cell is activated (the higher the contrast), the earlier it fires. In other words, the order of the spikes depends on the order of the contrasts. This rank-order coding has been shown to be efficient for obtaining V1-like edge detectors [8] as well as complex visual features [36, 23] in higher cortical areas.

DoG cells are retinotopically organized in two ON-center and OFF-center maps, which are respectively sensitive to positive and negative contrasts. A DoG cell is allowed to fire if its activation is above a certain threshold. Note that this scheme guarantees that at most one of the two cells (positive or negative) can fire at each location. As mentioned above, the firing time of a DoG cell is inversely proportional to its activation value. In other words, if the output of the DoG filter at a certain location is r, the firing time of the corresponding cell is t = 1/r. For efficient GPU-based parallel computing, the input spikes are grouped into equal-size sequential packets. At each time step, the spikes of one packet are propagated simultaneously. In this way, a packet of spikes with nearby ranks (carrying similar visual information) is propagated in parallel, while the next spike packet will be processed in the next time step.
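The DoG filtering, intensity-to-latency conversion (t = 1/r), and packet grouping described above can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; the filter sizes, threshold, and function names are illustrative assumptions.

```python
import numpy as np

def _gauss_blur(img, sigma):
    """Separable Gaussian blur with edge padding (NumPy only, illustrative)."""
    radius = max(int(3 * sigma), 1)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    pad = np.pad(img, radius, mode='edge')
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode='valid'), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='valid'), 0, tmp)

def dog_latency_encode(image, sigma1=1.0, sigma2=2.0, threshold=10.0, n_steps=30):
    """DoG filtering + rank-order latency coding.

    ON- and OFF-center maps respond to positive and negative contrasts;
    at most one of the two can be positive at a location. A cell fires only
    if its activation exceeds `threshold`, with latency t = 1/r, and spikes
    are then grouped by rank into `n_steps` equal-size sequential packets.
    """
    dog = _gauss_blur(image.astype(float), sigma1) - _gauss_blur(image.astype(float), sigma2)
    on, off = np.maximum(dog, 0.0), np.maximum(-dog, 0.0)
    spikes = []  # (latency 1/r, map index, row, col)
    for m, act in enumerate((on, off)):
        for i, j in zip(*np.nonzero(act > threshold)):
            spikes.append((1.0 / act[i, j], m, i, j))
    spikes.sort()  # rank order: higher contrast -> shorter latency -> earlier
    packets = [[] for _ in range(n_steps)]
    for rank, (_, m, i, j) in enumerate(spikes):
        packets[min(rank * n_steps // max(len(spikes), 1), n_steps - 1)].append((m, i, j))
    return packets
```

At each simulated time step, one packet of near-rank spikes would then be propagated to the next layer in parallel.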

Convolutional layers

A convolutional layer contains several neuronal maps. Each neuron is selective to a visual feature determined by its input synaptic weights. Neurons in a specific map detect the same visual feature but at different locations. To this end, synaptic weights of neurons belonging to the same map should always be the same (i.e., weight sharing). Within a map, neurons are retinotopically arranged. Each neuron receives input spikes from the neurons located in a determined window in all neuronal maps of the previous layer. Hence, a visual feature in a convolutional layer is a combination of several simpler features extracted in the previous layer. Note that the input windows of two adjacent neurons are highly overlapping; hence, the network can detect the appearance of a visual feature at any location.

Neurons in all convolutional layers are non-leaky integrate-and-fire neurons, which gather input spikes from presynaptic neurons and emit a spike when their internal potentials reach a prespecified threshold. Each presynaptic spike increases the neuron's potential by its synaptic weight. At each time step, the internal potential of the ith neuron is updated as follows:

Vi(t) = Vi(t − 1) + Σj Wj,i Sj(t − 1),   (1)

where Vi(t) is the internal potential of the ith convolutional neuron at time step t, Wj,i is the synaptic weight between the jth presynaptic neuron and the ith convolutional neuron, and Sj is the spike train of the jth presynaptic neuron (Sj(t − 1) = 1 if the neuron has fired at time t − 1, and Sj(t − 1) = 0 otherwise). If Vi exceeds its threshold, Vthr, then the neuron emits a spike and Vi is reset:

Vi(t) = 0 and Si(t) = 1,   if Vi(t) ≥ Vthr.   (2)

There is also a lateral inhibition mechanism in all convolutional layers. When a neuron fires at a specific location, it inhibits the neurons at that location belonging to the other neuronal maps (i.e., resets their potentials to zero) and does not allow them to fire until the next image is shown. In addition, neurons are not allowed to fire more than once. Together, these provide a sparse but highly informative coding, because there can be at most one spike at each location, which indicates the existence of a particular visual feature at that location.
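One time step of the membrane update in Eqs. (1)–(2), together with the lateral inhibition and one-spike-per-neuron constraints, can be sketched as below. This is our own minimal NumPy sketch under simplifying assumptions (the input spike vector is shared across locations, and ties are broken by highest potential); it is not the authors' GPU implementation.

```python
import numpy as np

def step_conv_layer(V, fired, input_spikes, W, v_thr):
    """One time step of non-leaky integrate-and-fire convolutional neurons.

    V:            (n_maps, H, W) membrane potentials
    fired:        (n_maps, H, W) bool, neurons that already spiked (one spike max)
    input_spikes: (n_inputs,) 0/1 presynaptic spikes S_j(t-1)
    W:            (n_maps, n_inputs) shared weights W_{j,i} per map
    """
    # Eq. (1): V_i(t) = V_i(t-1) + sum_j W_{j,i} S_j(t-1)
    V += (W @ input_spikes)[:, None, None]
    V[fired] = 0.0  # neurons that already fired stay silent until the next image
    # Eq. (2): spike where the threshold is reached (and not yet fired)
    spikes = (V >= v_thr) & ~fired
    # Lateral inhibition: at each location only one map may fire; the
    # highest-potential map wins, the others are reset and blocked.
    winner = np.argmax(np.where(spikes, V, -np.inf), axis=0)
    any_spike = spikes.any(axis=0)
    out = np.zeros_like(spikes)
    h, w = np.nonzero(any_spike)
    out[winner[h, w], h, w] = True
    V[:, h, w] = 0.0        # reset fired and inhibited neurons
    fired[:, h, w] = True   # block this location until the next image
    return out
```

Resetting `V` and `fired` for a whole image then simply means zeroing both arrays before the next spike train is presented.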

Local pooling layers

Pooling layers help the network gain invariance by performing a nonlinear max pooling operation over a set of neighboring neurons with the same preferred feature. Some evidence suggests that such a max operation occurs in complex cells in the visual cortex [48]. Thanks to the rank-order coding used in the proposed network, the maximum operation of pooling layers simply consists of propagating the first spike emitted by the afferents [46].

A neuron in a neuronal map of a pooling layer performs the maximum operation over a window in the corresponding neuronal map of the previous layer. Pooling neurons are integrate-and-fire neurons whose input synaptic weights and threshold are all set to one. Hence, the first input spike activates them and leads to an output spike. In keeping with the rank-order coding, each pooling neuron is allowed to fire at most once. It should be noted that no learning occurs in pooling layers.

Another important role of pooling layers is to compress the visual information. Because of the maximum operation performed in pooling layers, adjacent neurons with overlapping inputs would carry redundant information (each spike is sent to many neighboring pooling neurons). Hence, in the proposed network, the overlap between the input windows of two adjacent pooling neurons (belonging to the same map) is set to be very small. This helps to compress the visual information by eliminating the redundancies, and also to reduce the size of subsequent layers.
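Under rank-order coding, max pooling reduces to propagating the earliest afferent spike, as the following sketch illustrates (window and stride values are illustrative, not the paper's):

```python
import numpy as np

def first_spike_pool(spike_times, window=2, stride=2):
    """Max pooling in the temporal domain: each pooling neuron fires at the
    time of the FIRST spike in its input window (earliest spike = strongest
    activation under rank-order coding), and fires at most once.

    spike_times: (H, W) array of spike times, np.inf for silent neurons.
    """
    H, W = spike_times.shape
    out_h = (H - window) // stride + 1
    out_w = (W - window) // stride + 1
    out = np.full((out_h, out_w), np.inf)
    for i in range(out_h):
        for j in range(out_w):
            patch = spike_times[i*stride:i*stride+window, j*stride:j*stride+window]
            out[i, j] = patch.min()  # earliest afferent spike wins
    return out
```

With stride close to the window size, adjacent pooling neurons see nearly disjoint inputs, which removes the redundancy discussed above.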

STDP-based learning

As mentioned above, learning occurs only in convolutional layers, which should learn to detect visual features by combining simpler features extracted in the previous layer. The learning is done layer by layer, i.e., the learning in a convolutional layer starts when the learning in the previous convolutional layer is finalized. When a new image is presented, neurons of the convolutional layer compete with each other, and those which fire earlier trigger STDP and learn the input pattern.

A simplified version of STDP [36] is used:

∆wij = a+ wij (1 − wij),   if tj − ti ≤ 0,
∆wij = a− wij (1 − wij),   if tj − ti > 0,   (3)

where i and j respectively refer to the index of the post- and presynaptic neurons, ti and tj are the corresponding spike times, ∆wij is the synaptic weight modification, and a+ and a− are two parameters specifying the learning rate. Note that the exact time difference between two spikes does not affect the weight change; only its sign is considered. Also, it is assumed that if a presynaptic neuron does not fire before the postsynaptic one, it will fire later. These simplifications are equivalent to assuming that the intensity-latency conversion of DoG cells compresses the whole spike wave into a relatively short time interval (say, 20−30 ms), so that all presynaptic spikes necessarily fall close to the postsynaptic spike time, and the time lags are negligible. The multiplicative term wij(1 − wij) ensures the weights remain in the range [0, 1] and thus maintains all synapses in an excitatory mode, in addition to implementing a soft-bound effect.
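The simplified STDP rule of Eq. (3) can be sketched as follows. Note a sign-convention assumption on our part: the sketch treats a− as a positive magnitude and subtracts it for the depression branch, so that weights decay when the presynaptic spike comes after (or never before) the postsynaptic one.

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.004, a_minus=0.003):
    """Simplified STDP of Eq. (3): only the SIGN of t_j - t_i matters.

    Presynaptic neurons that fired at or before the postsynaptic spike are
    potentiated; the others (silent ones are treated as firing later) are
    depressed. The factor w*(1-w) keeps weights in [0, 1] (soft bound).

    w:      (n,) synaptic weights of the winner neuron
    t_pre:  (n,) presynaptic spike times (np.inf if silent)
    t_post: scalar postsynaptic spike time
    """
    ltp = t_pre <= t_post  # t_j - t_i <= 0  -> potentiate
    dw = np.where(ltp, a_plus, -a_minus) * w * (1.0 - w)
    return w + dw
```

Because `np.inf <= t_post` is false, silent presynaptic neurons automatically fall into the depression branch, matching the assumption that they would have fired later.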

Note that choosing large values for the learning parameters (i.e., a+ and a−) will shorten the learning memory, so neurons would learn the last presented images and unlearn previously seen ones, while choosing tiny values would slow down the learning process. At the beginning of learning, when synaptic weights are random, neurons are not yet selective to any specific pattern and respond to many different patterns; therefore, the probability for a synapse to get depressed is higher than that of being potentiated. Hence, by setting a− to be greater than a+, synaptic weights gradually decay until neurons can no longer reach their threshold to fire. Therefore, a+ should be greater than a−; however, by setting a+ to be much greater than a−, neurons will tend to learn more than one pattern and respond to all of them. All in all, a+ and a− should be neither too big nor too small, with a+ a bit greater than a−.

During the learning of a convolutional layer, neurons in the same map, detecting the same feature at different locations, integrate input spikes and compete with each other to do the STDP. The first neuron which reaches the threshold and fires, if any, is the winner (global intra-map competition). The winner triggers STDP and updates its synaptic weights. As mentioned before, neurons at different locations of the same map have the same input synaptic weights (i.e., weight sharing), so as to be selective to the same feature. Hence, the winner neuron prevents the other neurons in its own map from doing STDP, and copies its updated synaptic weights to them. There is also a local inter-map competition for STDP: when a neuron is allowed to do STDP, it prevents the neurons in other maps within a small neighborhood around its location from doing STDP. This competition is crucial to encourage neurons of different maps to learn different features.

Because of the discretized time variable in the proposed model, it is probable that some competitor neurons fire at the same time step. One possible scenario is to pick one randomly and allow it to do STDP, but a better alternative is to pick the one with the highest potential, indicating a higher similarity between its learned feature and the input pattern.

Synaptic weights of convolutional neurons are initialized with random values drawn from a normal distribution with mean µ = 0.8 and standard deviation σ = 0.05. Note that with a small µ, neurons would not reach their threshold to fire and would not learn anything. Also, with a large σ, some initial synaptic weights will be smaller (larger) than others and contribute less (more) to the neuron's activity, and given the STDP rule, they have a higher tendency to converge to zero (one). In other words, the dependency on the initial weights is higher for a large σ.

As the learning of a specific layer progresses, its neurons gradually converge to different visual features which are frequent in the input images. As mentioned before, learning in the subsequent convolutional layer starts whenever the learning in the current convolutional layer is finalized. Here we measure the learning convergence of the lth convolutional layer as

Cl = Σf Σi wf,i (1 − wf,i) / nw,   (4)

where wf,i is the ith synaptic weight of the fth feature and nw is the total number of synaptic weights (independent of the features) in that layer. Cl tends to zero as each of the synaptic weights converges towards zero or one. Therefore, we stop the learning of the lth convolutional layer whenever Cl is sufficiently close to zero (i.e., Cl < 0.01).
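The convergence measure of Eq. (4) is a one-liner; a minimal sketch:

```python
import numpy as np

def learning_convergence(weights):
    """Eq. (4): C_l = sum_f sum_i w_{f,i} (1 - w_{f,i}) / n_w.

    weights: (n_features, n_synapses) array for one convolutional layer.
    C_l approaches 0 as every weight converges to 0 or 1; training of the
    layer stops once C_l < 0.01.
    """
    return float(np.sum(weights * (1.0 - weights)) / weights.size)
```

Each term w(1 − w) peaks at 0.25 for w = 0.5 and vanishes at w ∈ {0, 1}, so Cl directly tracks how bimodal the weight distribution has become.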

Global pooling and classification

The global pooling layer is only used in the classification phase. Neurons of the last layer perform a global max pooling over their corresponding neuronal maps in the last convolutional layer. Such a pooling operation provides global translation invariance for the prototypical features extracted in the last convolutional layer. Hence, there is only one output value for each feature, which indicates the presence of that feature in the input image. The output of the global pooling layer over the training images is used to train a linear SVM classifier.


In the testing phase, the test object image is processed by the network, and the output of the global pooling layer is fed to the classifier to determine its category.

To compute the output of the global pooling layer, first, the threshold of the neurons in the last convolutional layer is set to infinity, and then, their final potentials (after propagating the whole spike train generated by the input image) are measured. These final potentials can be seen as the number of early spikes in common between the current input and the stored prototypes in the last convolutional layer. Finally, the global pooling neurons compute the maximum potential in their corresponding neuronal maps as their output value.
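Given the accumulated final potentials of the last convolutional layer (threshold set to infinity), the classifier's feature vector is just a per-map maximum; a minimal sketch, with the array layout as our assumption:

```python
import numpy as np

def global_pooling_output(final_potentials):
    """Readout for the classifier: each global pooling neuron reports the
    maximum accumulated potential of its neuronal map, yielding one value
    per prototypical feature.

    final_potentials: (n_maps, H, W) potentials accumulated over the whole
    input spike train. Returns an (n_maps,) vector fed to the linear SVM.
    """
    return final_potentials.reshape(final_potentials.shape[0], -1).max(axis=1)
```

The resulting vectors over the training images would then be passed to a standard linear SVM trainer.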

Results

Caltech face/motorbike dataset

We evaluated our SDNN on the face and motorbike categories of the Caltech 101 dataset available at http://www.vision.caltech.edu (see Fig. 4 for sample pictures). The training set contains 200 randomly selected images per category, and the remaining images constitute the test set. The test images are not seen during the learning phase but are used afterward to evaluate the performance on novel images. This standard cross-validation procedure allows measuring the system's ability to generalize, as opposed to learning the specific training examples. All images were converted to grayscale and rescaled to be 160 pixels in height (preserving the aspect ratio). In all the experiments, we used linear SVM classifiers with penalty parameter C = 1.0 (optimized by a grid search in the range (0, 10]).

Here, we used a network similar to Fig. 1, with three convolutional layers, each followed by a pooling layer. For the first layer, only ON-center DoG filters of size 7 × 7 and standard deviations of 1 and 2 pixels are used. The first, second, and third convolutional layers consist of 4, 20, and 10 neuronal maps with conv-window sizes of 5 × 5, 16 × 16 × 4, and 5 × 5 × 20 and firing thresholds of 10, 60, and 2, respectively. The pooling window sizes of the first and second pooling layers are 7 × 7 and 2 × 2 with strides of 6 and 2, respectively. The third pooling layer performs a global max pooling operation. The learning rates of all convolutional layers are set to a+ = 0.004 and a− = 0.003. In addition, each image is processed for 30 time steps.

Fig. 2 shows the preferred visual features of some neuronal maps in the first, second, and third convolutional layers through the learning process. To visualize the visual feature learned by a neuron, a backward reconstruction technique is used: the visual features in the current layer are reconstructed as weighted combinations of the visual features in the previous layer. This backward process continues down to the first layer, whose preferred visual features are computed by DoG functions. As shown in Fig. 2A, interestingly, each of the four neuronal maps of the first convolutional layer converges to one of four orientations: π/4, π/2, 3π/4, and π. This shows how efficiently the combination of the proposed temporal coding in DoG cells and the unsupervised learning method (STDP with the learning competition) leads to highly diverse edge detectors which can represent the input image with edges at different orientations. These edge detectors are similar to the simple cells in primary visual cortex (i.e., area V1) [8].

Fig. 2B shows the learning progress for the neuronal maps of the second convolutional layer. As mentioned, the first convolutional layer detects edges with different orientations all over the image, and due to the temporal coding used, neurons corresponding to edges with higher contrast (i.e., salient edges) fire earlier. On the other hand, STDP naturally tends to learn those combinations of edges that consistently repeat in the training images (i.e., common features of the target objects). Besides, the learning competition tends to prevent the neuronal maps from learning similar visual features. Consequently, neurons in the second convolutional layer learn the most salient, common, and diverse visual features of the target objects, and do not learn the backgrounds that drastically change between images. As seen in Fig. 2B, each of the maps gradually learns a different visual feature (a combination of oriented edges) representing a face or motorbike feature.


Figure 2: The synaptic changes of some neuronal maps in different layers through learning with the Caltech face/motorbike dataset. A) The first convolutional layer becomes selective to oriented edges. B) The second convolutional layer converges to object parts. C) The third convolutional layer learns object prototypes and responds to whole objects.

The learning progress for two neuronal maps of the third convolutional layer is shown in Fig. 2C. As seen, one of them gradually becomes selective to a complete motorbike prototype, a combination of motorbike features detected in the second layer such as the back wheel, middle body, handle, and front wheel. The other map learns a whole face prototype as a combination of facial features. Indeed, the third convolutional layer learns whole object prototypes using the intermediate-complexity features detected in the previous layer. Neurons in the second layer compete with each other and send spikes toward the third layer as they detect their preferred visual features. Since different combinations of these features are detected for each object category, neuronal maps of the third layer learn different prototypes of different categories. Therefore, STDP and the learning competition mechanism direct neuronal maps of the third convolutional layer to learn highly category-specific prototypes.

Figure 3: Each curve shows the variation of the convergence index through the learning of a convolutional layer. The weight histograms of the first convolutional layer at some critical points during the learning are shown next to its convergence curve.

As mentioned in the previous section, learning at the lth convolutional layer stops whenever the learning convergence index Cl tends to zero. In Fig. 3, we present the evolution of Cl during the learning of each convolutional layer, along with the weight histogram of the first convolutional layer at some critical points during the learning process. Since the initial synaptic weights are drawn from a normal distribution with µ = 0.8 and σ = 0.05, at the beginning of learning the convergence index of the lth layer starts from Cl ≈ 0.16. Since features are not formed in the early iterations, neurons respond to almost every pattern; therefore, many of the synapses are depressed most of the time and gradually move towards 0.5, and Cl peaks. As a consequence, the neurons' activity decreases, and they start responding to a few of the patterns and not to others. From then on, synapses contributing to these patterns are repeatedly potentiated while the others are depressed. Due to the nature of the employed soft-bound STDP, learning is fastest when the weights are around 0.5, and Cl rapidly decreases. Finally, at the end of learning, as features are formed and synaptic weights converge to zero or one, Cl tends to zero.
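Taking the convergence index as the mean of w(1 − w) over all synapses of a layer (an assumption on our part, but one consistent with the stated starting value: for weights drawn from N(0.8, 0.05²), E[w(1 − w)] = µ − (µ² + σ²) = 0.8 − 0.6425 ≈ 0.16), the quantity can be computed as:

```python
import numpy as np

def convergence_index(weights):
    """Mean of w*(1-w) over all synapses: ~0.25 when weights sit at 0.5,
    ~0 once the weight distribution becomes bimodal at {0, 1}."""
    w = np.asarray(weights, dtype=float)
    return float(np.mean(w * (1.0 - w)))

# Initial weights as described in the text: N(mu=0.8, sigma=0.05)
rng = np.random.default_rng(0)
w_init = rng.normal(0.8, 0.05, size=100_000)
# convergence_index(w_init) is then close to 0.1575, i.e. ~0.16
```

The peak of the curve corresponds to weights crowding around 0.5, and the final descent to the bimodal 0/1 distribution.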

As seen in Fig. 3, the weight changes in lower layers are faster than in higher layers. This is mainly due to the stronger competition among the feature maps in higher layers. For instance, in the first layer, there are only four feature maps with small receptive fields. If a neuron from one feature map loses the competition at one location, another neuron can still win the competition at some other location and perform STDP. In the third layer, by contrast, there are more feature maps with larger receptive fields; if a neuron wins the competition, it prevents neurons of other feature maps from performing STDP over a much wider area. Hence, for each image, only a few of the feature maps update their weights. Therefore, the competition is stronger in higher layers and learning takes a much longer time. Besides the competition factor, neurons in higher layers become selective to more complex and less frequent features, so they do not reach their threshold as often as neurons in lower layers (e.g., all images contain edges).
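The soft-bound behavior invoked above can be illustrated with a minimal sketch of the simplified STDP update (our reading of the rule, using the learning rates a+ = 0.004 and a− = 0.003 from the text): the w(1 − w) factor is maximal at w = 0.5 and vanishes at 0 and 1, which is why learning is fastest mid-way and freezes once weights saturate.

```python
def stdp_update(w, pre_fired_first, a_plus=0.004, a_minus=0.003):
    """Simplified soft-bound STDP update for one synapse.

    The w*(1-w) factor makes the step largest near w = 0.5 and
    negligible once w approaches 0 or 1."""
    soft_bound = w * (1.0 - w)
    if pre_fired_first:                  # pre spike at/before post spike
        return w + a_plus * soft_bound   # potentiation
    return w - a_minus * soft_bound      # depression
```

Note that weights can never cross 0 or 1, so no explicit clipping is needed under this update.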

Fig. 4 shows the accumulated spiking activity of the DoG layer and the following three convolutional layers over all time steps, for two sample face and motorbike images. For each layer, the preferred features of some neuronal maps with color-coded


Figure 4: The spiking activity of the convolutional layers for the face and motorbike images. The preferred features of the neuronal maps in each convolutional layer are shown on the right. Each feature is coded by a specific border color. The spiking activity of the convolutional layers, accumulated over all the time steps, is shown in the corresponding panels. Each point in a panel indicates that a neuron at that location has fired at some time step, and the color of the point indicates the preferred feature of the activated neuron.

borders are shown on top, and their corresponding spiking activity is shown in the panels below them. Each colored point inside a panel indicates the neuronal map of the neuron that fired at that location at some time step. As seen, neurons of the DoG layer detect image contrasts, and edge detectors in the first convolutional layer detect the orientations of edges. Neurons in the second convolutional layer, which are selective to intermediate-complexity features, detect their preferred visual features by combining input spikes from the edge-detector cells in the first layer. Finally, the coincidence of these features activates neurons in the third convolutional layer, which are selective to object prototypes. As seen, when a face (motorbike) image is presented, neurons in the face (motorbike) maps fire. To better illustrate the learning progress of all the layers, as well as their spiking activity in the temporal domain, we prepared a short video (see Video 1).

Figure 5: Recognition accuracies (mean ± std) of the proposed SDNN for different numbers of training images per category used in the STDP-based feature learning and/or training the classifier. The red curve presents the model's accuracy when different numbers of images are used to train both the network (by unsupervised STDP) and the classifier. The blue curve shows the model's accuracy when the layers of the network are trained using STDP with 400 images (200 from each category), and the classifier is trained with different numbers of labeled images per category.

As mentioned in the previous section, the output of the global pooling layer is used by a linear SVM classifier to specify the object category of the input images. We trained the proposed SDNN on the training images and evaluated it over the test images, where the model reached a categorization accuracy of 99.1 ± 0.2%. This shows how well the object prototypes learned in the highest layer represent the object categories. Furthermore, we also calculated the single-neuron accuracy; in more detail, we separately computed the recognition accuracy of each neuron in the global pooling layer. Surprisingly, some single neurons reached an accuracy of 93%, and the mean accuracy was 89.8%. Hence, single neurons in the highest layer are highly class-specific, and different neurons carry complementary information which altogether provides robust object representations.
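The text does not specify how single-neuron accuracy was computed; one natural reading is the best two-class accuracy achievable by thresholding a single global-pooling neuron's output. A sketch under that assumption:

```python
import numpy as np

def single_neuron_accuracy(values, labels):
    """Best 2-class accuracy obtainable by thresholding one neuron's
    scalar output (trying every threshold and both sign conventions).
    This decision rule is our assumption, not the authors' procedure."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best = 0.0
    for t in np.unique(values):
        pred = values >= t
        acc = max(np.mean(pred == labels), np.mean(pred != labels))
        best = max(best, float(acc))
    return best
```

Applying this to each neuron of the global pooling layer would yield the per-neuron accuracies discussed above.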

To better demonstrate the role of the STDP learning rule in the proposed SDNN, we used random features in different convolutional layers and assessed the final accuracy. To this end, we first trained all three convolutional layers of the network (using STDP) until they all converged (synaptic weights had a bimodal distribution centered on 0 and 1). Then, for a given convolutional layer, we counted the number of active synapses (i.e., those close to one) in each of its feature maps. Corresponding to each learned feature in the first convolutional layer, we generated a random feature with the same number of active synapses. For the second and third convolutional layers, the number of active synapses in the random features was doubled. Note that with fewer active synapses, neurons in the second and third convolutional layers could not reach their threshold, and with more active synapses, many of the neurons tended to fire together. We then evaluated the network using these random features. Table 1 presents the SDNN accuracy when we had random features in the third layer, in the second and third layers, or in all three convolutional layers. As seen, replacing the learned features of the lower layers with random ones decreases the accuracy further. In particular, with random features in the second layer, the accuracy drops severely. This shows the critical role of intermediate-complexity features for object categorization.

We also evaluated how robust the proposed SDNN is to noise. To this end, we added white noise to the neurons' thresholds during both the training and testing phases: for each image, we added to the threshold of each neuron, Vthr, a random value drawn from a uniform distribution in the range ±α% of Vthr. We evaluated the proposed SDNN for different amounts of noise (from α = 5% to 50%) and the results are provided in Table 2. Up to 20% noise, the accuracy is still reasonable, but as the noise level increases further, the accuracy dramatically drops and reaches chance level at 50% noise. In other words, the network is more or less able to tolerate the instability caused by noise below 20%, but beyond that the neurons' behavior changes drastically during learning, and STDP cannot extract informative features from the input images.
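The threshold jitter used in this robustness test can be sketched as follows (our illustration of the stated procedure, with alpha expressed as a fraction rather than a percentage):

```python
import random

def noisy_threshold(v_thr, alpha, rng=random):
    """Return V_thr jittered by a uniform value in +/- alpha * V_thr.

    alpha = 0.05 ... 0.5 corresponds to the 5%-50% noise levels
    evaluated in Table 2."""
    return v_thr + rng.uniform(-alpha, alpha) * v_thr
```

A fresh jitter is drawn per neuron and per image, so the effective thresholds fluctuate throughout both learning and testing.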

Table 1: Recognition accuracies of the proposed SDNN with random features in different convolutional layers.

Conv layers with random features    None    3rd     2nd & 3rd    1st & 2nd & 3rd
Accuracy (%)                        99.1    80.2    67.8         66.3

Table 2: Recognition accuracies of the proposed SDNN for different amounts of noise.

Noise level     α = 0%   α = 5%   α = 10%   α = 20%   α = 30%   α = 40%   α = 50%
Accuracy (%)    99.1     95.4     91.6      84.3      63.7      57.6      54.2

Figure 6: Some sample images of different object categories of ETH-80 from different viewpoints. Below each image, the preferred feature of an activated neuron in the third convolutional layer is shown.

In another experiment, we varied the number of training samples and calculated the recognition accuracy of the proposed SDNN. For instance, we trained the network and the classifier with 5 samples from each category, and then evaluated the final system on the test samples. As shown by the red curve in Fig. 5, with 5 images per category the model reached an accuracy of 78.2%, and only 40 images from each category are sufficient to reach 95.1% recognition accuracy. Although having more training samples leads to higher accuracies, the proposed SDNN can extract diagnostic features and reach reasonable accuracies even using a few tens of training images. Due to the unsupervised nature of STDP, the proposed SDNN does not suffer much from the overfitting challenge caused by small training set sizes in supervised learning algorithms such as back-propagation. In the real world, the number of labeled samples is very low; accordingly, learning in humans and other primates is mainly unsupervised. Here, in another experiment, we trained the network using STDP over 200 images per category, and then used different portions of these images as labeled samples (from 1 to 200 images per category) to train the classifier. This lets us see whether the visual features that the network has learned from the unlabeled images (using unsupervised STDP) are sufficient for solving the categorization task with only a few labeled training samples. As shown by the blue curve in Fig. 5, with only one sample per category the model could reach an average accuracy of 93.8%. This suggests that unsupervised STDP can provide a rich feature space that explains the object space well and reduces the

need for labeled examples.

ETH-80 dataset

The ETH-80 dataset contains eight different object categories: apple, car, toy cow, cup, toy dog, toy horse, pear, and tomato (10 instances per category). Each object is photographed from 41 viewpoints with different view angles and tilts. Some examples of objects in this dataset are shown in Fig. 6. ETH-80 is a good benchmark to show how the proposed SDNN can handle multi-object categorization tasks with high inter-instance variability, and how it can tolerate large viewpoint variations. In all the experiments, we used linear SVM classifiers with penalty parameter C = 1.2 (optimized by a grid search in the range (0, 10]).

Five randomly chosen instances of each object category are selected for the training set used in the learning phase. The remaining instances constitute the test set and are not seen during the learning phase. All the object images were converted to grayscale values. To evaluate the proposed SDNN on ETH-80, we used a network architecturally similar to the one used for the Caltech face/motorbike dataset. The other parameters are also similar, except for the number of neuronal maps in the second and third convolutional and pooling layers; here we used 400 neuronal maps in each of these layers.

Table 3: Recognition accuracies of the proposed SDNN and some other methods over the ETH-80 dataset. Note that all the models are trained on 5 object instances of each category and tested on the other 5 instances.

Method                   Accuracy (%)
HMAX [23]                69.0
Convolutional SNN [23]   81.1
Pre-trained AlexNet      79.5
Fine-tuned AlexNet       94.2
Supervised DCNN          81.9
Unsupervised DCA         80.7
Proposed SDNN            82.8

Similar to the Caltech dataset, neuronal maps of the first convolutional layer converged to the four oriented edges. Neurons in the second and third convolutional layers also became selective to intermediate features and object prototypes, respectively. Fig. 6 shows sample images from the ETH-80 dataset and the preferred features of some neuronal maps in the third convolutional layer which are activated by those images. As seen, neurons in the highest layer respond to different views of different objects and, altogether, provide an invariant object representation. Thus, the network learns 2D, view-dependent prototypes of each object category to achieve 3D representations.

As mentioned before, we evaluated the proposed SDNN over the test instances of each object category, which are not shown to the network during training. The recognition accuracies of the proposed SDNN and some other models on the ETH-80 dataset are presented in Table 3. HMAX [49] is one of the classic computational models of the object recognition process in visual cortex; it has 5000 features and uses a linear SVM as the classifier (see [23] for more details). Convolutional SNN [23] is an extension of Masquelier et al. 2007 [36] which had one trainable layer with 1200

visual features and used a linear SVM for classification. AlexNet is the first DCNN that significantly improved recognition accuracy on the ImageNet dataset. Here, we compared our proposed SDNN with both the ImageNet pre-trained and ETH-80 fine-tuned versions of AlexNet. To obtain the accuracy of the pre-trained AlexNet, images were shown to the model and the feature vectors of the last layer were used to train and test a linear SVM classifier. To fine-tune the ImageNet pre-trained AlexNet on ETH-80, its decision layer was replaced by an eight-neuron decision layer and trained by the stochastic gradient descent learning algorithm.

Given that the pre-trained AlexNet has not seen ETH-80 images and the fine-tuned AlexNet has already been trained on millions of images from the ImageNet dataset, the comparison may not be entirely fair. Therefore, we compared the proposed SDNN to a supervised DCNN with the same structure but with two extra dense and decision layers on top, with 70 and 8 neurons, respectively. The DCNN was trained on the grayscale images of ETH-80 using the back-propagation algorithm. The ReLU and softmax activation functions were employed for the intermediate and decision layers, respectively. We used a cross-entropy loss function with L2 kernel regularization and L1 activity regularization terms. We also applied 50% dropout regularization to the dense layer. The hyper-parameters, such as learning rates, momentums, and regularization factors, were optimized using a grid search. We also used an early-stopping strategy to prevent the network from overfitting. Eventually, the supervised DCNN reached an average accuracy of 81.9% (see Table 3). Note that we tried to evaluate the DCNN with a dense layer of more than 70 neurons, but the network always quickly overfitted the training data with no accuracy improvement on the test set. It seems that the supervised DCNN suffers from the lack of sufficient training data.

In addition, we compared the proposed SDNN to a deep convolutional autoencoder (DCA), one of the best unsupervised learning algorithms in machine learning. We developed a DCA with an encoder network having the same architecture as our SDNN, followed by a decoder network with the reversed architecture. The ReLU activation function was used in the convolutional layers of both the encoder and decoder networks. We used the cross-entropy loss function and the stochastic gradient descent learning algorithm to train the DCA. The learning parameters (i.e., learning rate and momentum) were optimized through a grid search. When the learning had converged, we eliminated the decoder part and used the encoder's output representations to train and test a linear SVM classifier. Note that the DCA was trained on grayscale images of ETH-80. The DCA reached an average accuracy of 80.7% (see Table 3), so the proposed SDNN outperformed a DCA network with the same structure.

We also evaluated a version of the proposed SDNN with only two convolutional layers: we removed the third convolutional layer, applied the global pooling over the second convolutional layer, and trained a new SVM classifier on its output. The model's accuracy dropped by 5%, reaching 77.4%. Although the features in the second layer represent object parts of intermediate complexity, it is the combination of these features in the last layer that completes the object representations and makes classification easier. In other words, the similarity between the smaller parts of objects from different categories but with similar shapes (e.g., apple and tomato) is higher than between the whole objects; hence, the visual features in the higher layers can provide better object representations.

In a subsequent analysis, we computed the confusion matrix to see which categories are most often confused with each other. Fig. 7 illustrates the confusion matrix of the proposed SDNN over the ETH-80 dataset. As seen, most of the errors are due to the miscategorization of dogs, horses, and cows. We checked whether these errors belong to some specific viewpoints, and found that the errors are uniformly distributed across the different viewpoints. Some misclassified samples are shown in Fig. 8. Overall, it can be concluded that these categorization errors are due to the overall shape similarity between these object categories.

Another important aspect of the proposed SDNN is its computational efficiency. For each ETH-80 image, on average, about 9100 spikes are emitted across all the layers, i.e., about

Figure 7: The confusion matrix of the proposed SDNN over the ETH-80 dataset.

Figure 8: Some samples of the ETH-80 dataset misclassified by the proposed SDNN.

0.02 spikes per neuron per image. Note that the number of inhibitory events is equal to the number of spikes. Together, these point to the fact that the proposed SDNN can recognize objects with high precision but low computational cost. This efficiency comes from the association of the proposed temporal coding and the STDP learning rule, which leads to a sparse but informative visual code.
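As a sanity check on these sparsity figures, 9100 spikes at roughly 0.02 spikes per neuron implies a network on the order of 455,000 neurons. The inferred neuron count is our back-of-the-envelope estimate, not a number reported in the text:

```python
# Reported sparsity on ETH-80: ~9100 spikes per image across all layers,
# at ~0.02 spikes per neuron per image.
spikes_per_image = 9100
spikes_per_neuron = 0.02

# Implied network size (our inference, not a figure from the paper):
approx_neurons = spikes_per_image / spikes_per_neuron  # about 455,000 neurons

# The text states one inhibitory event per spike:
inhibitory_events_per_image = spikes_per_image
```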

MNIST dataset

MNIST [30] is a benchmark dataset for SNNs and has been widely used [21, 59, 44, 39, 11, 12]. We also evaluated our SDNN on the MNIST dataset, which contains 60,000 training and 10,000 test handwritten single-digit images. Each image is 28 × 28 pixels and contains one of the digits 0–9. For the first layer, ON- and OFF-center DoG filters with standard deviations of 1 and 2 pixels are used. The first and second convolutional layers consist of 30 and 100 neuronal maps, respectively, with 5 × 5 convolution windows and firing thresholds of 15 and 10. The pooling window of the first pooling layer was of size 2 × 2 with a stride of 2. The second pooling layer performs a global max operation. The learning rates of all convolutional layers were set to a+ = 0.004 and a− = 0.003. In all the experiments, we used linear SVM classifiers with penalty parameter C = 2.4 (optimized by a grid search in the range (0, 10]).

Figure 9: The Gabor-like features learned by the neuronal maps of the first convolutional layer from the MNIST images. The red and green colors respectively indicate the strength of input synapses from ON- and OFF-center DoG cells.

Fig. 9 shows the preferred features of some neuronal maps in the first convolutional layer. The green and red colors correspond to ON- and OFF-center DoG filters. Interestingly, this layer converged to Gabor-like edge detectors with different orientations, phases, and polarities. These edge features are combined in the next layer to provide easily separable digit representations.

The recognition performance of the proposed method and some recent SNNs on the MNIST dataset is provided in Table 4. As seen, the proposed SDNN outperforms the unsupervised SNNs, reaching 98.4% recognition accuracy. Besides, the accuracy of the proposed SDNN is close to the 99.1% accuracy of the fully supervised rate-based SDNN [12], which is in fact a converted version of a traditional DCNN trained by back-propagation.

An important advantage of the proposed SDNN is its use of far fewer spikes. Our SDNN uses only about 600 spikes per MNIST image in total, across all the layers, while the supervised rate-based SDNN uses thousands of spikes per layer [12]. Also, because such networks use rate-based neural coding, they need to process images for hundreds of time steps, while our network processes the

MNIST images in 30 time steps only. Notably, whenever a neuron in our network fires, it inhibits other neurons at the same position; therefore, the total number of inhibitory events per MNIST image equals the number of spikes (i.e., 600). As stated in Section 2, the proposed SDNN uses a temporal code which encodes the information of the input image in the spike times, and each neuron in all layers is allowed to fire at most once. This temporal code, associated with the unsupervised STDP rule, leads to fast, accurate, and efficient processing.
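The first-spike temporal code can be sketched as a rank-order intensity-to-latency conversion: stronger responses are assigned earlier time bins, each neuron fires at most once, and sub-threshold cells never fire. This is our illustrative reading, not the authors' exact encoding:

```python
import numpy as np

def to_spike_times(contrasts, n_steps=30):
    """Rank-order intensity-to-latency coding: the strongest responses
    fire in the earliest of n_steps time bins; cells with contrast <= 0
    never fire (returned as -1)."""
    c = np.asarray(contrasts, dtype=float).ravel()
    times = np.full(c.shape, -1, dtype=int)
    active = c > 0
    if active.any():
        ranks = np.argsort(np.argsort(-c[active]))      # 0 = strongest response
        times[active] = ranks * n_steps // active.sum()  # spread over the bins
    return times
```

For example, contrasts [0.9, 0.1, 0.0, 0.5] with 30 steps map to spike times [0, 20, -1, 10]: the strongest cell fires first and the zero-contrast cell never fires.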

Discussion

Recent supervised DCNNs have reached high accuracies on the most challenging object recognition datasets, such as ImageNet. The architecture of these networks is largely inspired by the deep hierarchical processing in the visual cortex. For instance, DCNNs use retinotopically arranged neurons with restricted receptive fields, and the receptive field size and feature complexity of the neurons gradually increase through the layers. However, the learning and neural processing mechanisms applied in DCNNs are inconsistent with the visual cortex, where neurons communicate using spikes and learn the input spike patterns in a mainly unsupervised manner. Employing such mechanisms in DCNNs could improve their energy consumption and decrease their need for expensive supervised learning with millions of labeled images.

A popular approach in previous research is to convert pre-trained supervised DCNNs into equivalent spiking networks. To simulate the floating-point calculations in DCNNs, they have to use the firing rate as the neural code, which in turn increases the number of required spikes and the processing time. For instance, the converted version of a simple two-layer DCNN for the MNIST dataset, with images of 28 × 28 pixels, requires thousands of spikes and hundreds of time steps per image. On the other hand, there are some SDNNs [40, 4, 1] which are originally spiking networks but employ firing-rate coding and biologically implausible learning rules such as autoencoders and back-propagation.


Table 4: Recognition accuracies of the proposed SDNN and some other SNNs over the MNIST dataset.

Architecture             Neural coding   Learning type   Learning rule            Accuracy (%)
Dendritic neurons [21]   Rate-based      Supervised      Morphology learning      90.3
Convolutional SNN [59]   Spike-based     Supervised      Tempotron rule           91.3
Two-layer network [44]   Spike-based     Unsupervised    STDP                     93.5
Spiking RBM [39]         Rate-based      Supervised      Contrastive divergence   94.1
Two-layer network [11]   Spike-based     Unsupervised    STDP                     95.0
Convolutional SNN [12]   Rate-based      Supervised      Back-propagation         99.1
Proposed SDNN            Spike-based     Unsupervised    STDP                     98.4

Here, we proposed an STDP-based SDNN with spike-time coding. Each neuron was allowed to fire at most once, where its spike time indicates the significance of its visual input. Therefore, neurons that fire earlier carry more salient visual information, and hence they were allowed to perform STDP and learn the input patterns. As the learning progressed, each layer converged to a set of diverse but informative features, and the feature complexity gradually increased through the layers, from simple edge features to object prototypes. The proposed SDNN was evaluated on several image datasets and reached high recognition accuracies. This shows how the proposed temporal coding and learning mechanism (STDP with the learning competition) lead to discriminative object representations.

The proposed SDNN has several advantages over its counterparts. First, it is the first spiking neural network with more than one learnable layer that can process large-scale natural object images. Second, due to its efficient temporal coding, which encodes the visual information in the times of the first spikes, it can process the input images with a low number of spikes and in a few processing time steps. Third, the proposed SDNN exploits the bio-inspired and totally unsupervised STDP learning rule, which can learn the diagnostic object features and neglect the irrelevant backgrounds.

We compared the proposed SDNN to several other networks, including unsupervised methods such as HMAX and a convolutional autoencoder network, and supervised methods such as DCNNs. The proposed SDNN outperformed the unsupervised methods, which shows its advantage in extracting more informative features from training

images. It was also better than the supervised deep network, which largely suffered from overfitting and a lack of sufficient training data. Although state-of-the-art supervised DCNNs have stunning performance on large datasets like ImageNet, contrary to the proposed SDNN, they run into trouble with small and medium-sized datasets.

Our SDNN could be efficiently implemented in parallel hardware (e.g., FPGA [57]) using the address event representation (AER) protocol [52]. With AER, spike events are represented by the addresses of the sending and receiving neurons, and time is represented by the asynchronous occurrence of the spike events. Since such hardware is much faster than biological hardware, simulations could run several orders of magnitude faster than real time [47]. The primate visual system extracts the rough content of an image in about 100 ms [54, 19, 26, 33]. We thus speculate that dedicated hardware will be able to do the same in the order of a millisecond or less.
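As a minimal illustration of the AER idea, a spike train reduces to a time-ordered stream of (address, timestamp) records. The field names and helper below are hypothetical and do not follow any specific AER standard:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class AEREvent:
    """One AER packet: only the sender's address is transmitted; time
    is carried implicitly by the asynchronous occurrence of the event."""
    address: int        # flattened index of the spiking neuron
    timestamp_us: int   # event time in microseconds

def encode_spikes(spikes: List[Tuple[int, int]]) -> List[AEREvent]:
    """Turn (neuron_index, time) pairs into a time-ordered AER stream."""
    return sorted((AEREvent(a, t) for a, t in spikes),
                  key=lambda e: e.timestamp_us)
```

Because each neuron in the proposed SDNN fires at most once per image, such a stream would contain at most one event per neuron.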

Also, the proposed SDNN could be modified to use spiking retinal models [56, 35] as the input layer. These models mimic the spatiotemporal filtering of the retinal ganglion cells with center/surround receptive fields. Alternatively, we could use neuromorphic asynchronous event-based cameras, such as the dynamic vision sensor (DVS), which generate output events when they capture transients in the scene [32]. Finally, due to the DoG filtering in the input layer of the proposed SDNN, some visual information, such as texture and color, is lost. Hence, future studies should focus on encoding these additional pieces of information in the input layer.

Biological evidence indicates that, in addition to the unsupervised learning mechanisms (e.g.,


STDP), there are also dopamine-based reinforcement learning strategies in the brain [41]. Besides, although it is still unclear how supervised learning is implemented in biological neural networks, it seems that for some tasks (e.g., motor control and sensory input prediction) the brain must constantly learn temporal dynamics based on error feedback [14]. Employing such reinforcement and supervised learning strategies could improve the proposed SDNN in ways that are not achievable with unsupervised learning methods alone; in particular, they could help reduce the number of required features and extract optimized, task-dependent features.

Supporting Information

Video 1. We prepared a video, available at https://youtu.be/u32Xnz2hDkE, showing the learning progress and neural activity over the Caltech face and motorbike task. Here we presented the face and motorbike training examples, propagated the corresponding spike waves, and applied the STDP rule. The input image is presented at the top-left corner of the screen. The output spikes of the input layer (i.e., the DoG layer) at each time step are presented in the top-middle panel, and the accumulation of these spikes is shown in the top-right panel. For each of the subsequent convolutional layers, the preferred features, the output spikes at each time step, and the accumulation of the output spikes are presented in the corresponding panels. Note that 4, 8, and 2 features from the first, second, and third convolutional layers, respectively, are selected and shown. As mentioned, the learning occurs layer by layer; thus, the label of the layer which is currently learning is shown in red. As seen, the first layer learns to detect edges, the second layer learns intermediate features, and finally the third layer learns face and motorbike prototype features.

Acknowledgments

This research received funding from the European Research Council under the European Union's 7th Framework Program (FP/2007-2013) / ERC Grant Agreement n. 323711 (M4 project). The authors thank the NVIDIA Academic Programs team for donating GPU hardware.

References

[1] Yoshua Bengio, Dong-Hyun Lee, Jorg Bornschein, and Zhouhan Lin. Towards biologically plausible deep learning. arXiv:1502.04156, 2015.

[2] Michael Beyeler, Nikil D Dutt, and Jeffrey L Krichmar. Categorization and decision-making in a neurobiologically plausible spiking network using an STDP-like learning rule. Neural Networks, 48:109–124, 2013.

[3] Joseph M Brader, Walter Senn, and Stefano Fusi. Learning real-world stimuli in a neural network with spike-driven synaptic dynamics. Neural Computation, 19(11):2881–2912, 2007.

[4] Kendra S Burbank. Mirrored STDP implements autoencoder learning in a network of spiking neurons. PLoS Computational Biology, 11(12):e1004566, 2015.

[5] Charles F Cadieu, Ha Hong, Daniel LK Yamins, Nicolas Pinto, Diego Ardila, Ethan A Solomon, Najib J Majaj, and James J DiCarlo. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12):e1003963, 2014.

[6] Yongqiang Cao, Yang Chen, and Deepak Khosla. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113(1):54–66, 2015.

[7] Radoslaw Martin Cichy, Aditya Khosla, Dimitrios Pantazis, Antonio Torralba, and Aude Oliva. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports, 6, 2016.



[8] Arnaud Delorme, Laurent Perrinet, and Simon J Thorpe. Networks of integrate-and-fire neurons using rank order coding B: Spike timing dependent plasticity and emergence of orientation selectivity. Neurocomputing, 38:539–545, 2001.

[9] James J DiCarlo and David D Cox. Untangling invariant object recognition. Trends in Cognitive Sciences, 11(8):333–341, 2007.

[10] James J DiCarlo, Davide Zoccolan, and Nicole C Rust. How does the brain solve visual object recognition? Neuron, 73(3):415–434, 2012.

[11] Peter U Diehl and Matthew Cook. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in Computational Neuroscience, 9:99, 2015.

[12] Peter U Diehl, Daniel Neil, Jonathan Binas, Matthew Cook, Shih-Chii Liu, and Michael Pfeiffer. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In IEEE International Joint Conference on Neural Networks (IJCNN), pages 1–8, Killarney, Ireland, July 2015.

[13] Peter U Diehl, Guido Zarrella, Andrew Cassidy, Bruno U Pedroni, and Emre Neftci. Conversion of artificial recurrent neural networks to spiking neural networks for low-power neuromorphic hardware. In IEEE International Conference on Rebooting Computing, pages 1–8, San Diego, California, USA, October 2016.

[14] Kenji Doya. Complementary roles of basal ganglia and cerebellum in learning and motor control. Current Opinion in Neurobiology, 10(6):732–739, 2000.

[15] K Fukushima. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.

[16] Masoud Ghodrati, Amirhossein Farzmahdi, Karim Rajaei, Reza Ebrahimpour, and Seyed-Mahdi Khaligh-Razavi. Feedforward object-vision models only tolerate small image variations compared to human. Frontiers in Computational Neuroscience, 8(74):1–17, 2014.

[17] Stefan Habenschuss, Johannes Bill, and Bernhard Nessler. Homeostatic plasticity in Bayesian spiking networks as expectation maximization with posterior constraints. In Advances in Neural Information Processing Systems, pages 773–781, Lake Tahoe, Nevada, USA, December 2012.

[18] Shiyong Huang, Carlos Rozas, Mario Trevino, Jessica Contreras, Sunggu Yang, Lihua Song, Takashi Yoshioka, Hey-Kyoung Lee, and Alfredo Kirkwood. Associative Hebbian synaptic plasticity in primate visual cortex. The Journal of Neuroscience, 34(22):7575–7579, 2014.

[19] Chou P Hung, Gabriel Kreiman, Tomaso Poggio, and James J DiCarlo. Fast readout of object identity from macaque inferior temporal cortex. Science, 310(5749):863–866, 2005.

[20] Eric Hunsberger and Chris Eliasmith. Spiking deep networks with LIF neurons. arXiv:1510.08829, 2015.

[21] Shaista Hussain, Shih-Chii Liu, and Arindam Basu. Improved margin multi-class classification using dendritic neurons with morphological learning. In IEEE International Symposium on Circuits and Systems (ISCAS), pages 2640–2643, Melbourne, VIC, Australia, 2014.

[22] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10(11):e1003915, 2014.

[23] Saeed Reza Kheradpisheh, Mohammad Ganjtabesh, and Timothee Masquelier. Bio-inspired unsupervised learning of visual features leads to robust invariant object recognition. Neurocomputing, 205:382–392, 2016.



[24] Saeed Reza Kheradpisheh, Masoud Ghodrati, Mohammad Ganjtabesh, and Timothee Masquelier. Deep networks resemble human feed-forward vision in invariant object recognition. Scientific Reports, 6:32672, 2016.

[25] Saeed Reza Kheradpisheh, Masoud Ghodrati, Mohammad Ganjtabesh, and Timothee Masquelier. Humans and deep networks largely agree on which kinds of variation make object recognition harder. Frontiers in Computational Neuroscience, 10:92, 2016.

[26] H Kirchner and S J Thorpe. Ultra-rapid object detection with saccadic eye movements: Visual processing speed revisited. Vision Research, 46(11):1762–1776, 2006.

[27] Alex Krizhevsky, I Sutskever, and GE Hinton. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), pages 1–9, Lake Tahoe, Nevada, USA, 2012.

[28] Y LeCun and Y Bengio. Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks, pages 255–258. Cambridge, MA: MIT Press, 1998.

[29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[30] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[31] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning (ICML), pages 1–8, New York, New York, USA, 2009. ACM Press.

[32] P Lichtsteiner, C Posch, and T Delbruck. A 128×128 120 dB 15 µs-latency temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, 2007.

[33] Hesheng Liu, Yigal Agam, Joseph R Madsen, and Gabriel Kreiman. Timing, timing, timing: fast decoding of object information from intracranial field potentials in human visual cortex. Neuron, 62(2):281–290, 2009.

[34] Wolfgang Maass. Computing with spikes. Special Issue on Foundations of Information Processing of TELEMATIK, 8(1):32–36, 2002.

[35] Pablo Martínez-Cañada, Christian Morillas, Begoña Pino, Eduardo Ros, and Francisco Pelayo. A computational framework for realistic retina modeling. International Journal of Neural Systems, 26(7):1650030, 2016.

[36] Timothee Masquelier and Simon J Thorpe. Unsupervised learning of visual features through spike timing dependent plasticity. PLoS Computational Biology, 3(2):e31, 2007.

[37] David BT McMahon and David A Leopold. Stimulus timing-dependent plasticity in high-level vision. Current Biology, 22(4):332–337, 2012.

[38] C Daniel Meliza and Yang Dan. Receptive-field modification in rat visual cortex induced by paired visual stimulation and single-cell spiking. Neuron, 49(2):183–189, 2006.

[39] Peter O'Connor, Daniel Neil, Shih-Chii Liu, Tobi Delbruck, and Michael Pfeiffer. Real-time classification and sensor fusion with a spiking deep belief network. Frontiers in Neuroscience, 7:178, 2013.

[40] Priyadarshini Panda and Kaushik Roy. Unsupervised regenerative learning of hierarchical features in spiking deep networks for object recognition. In IEEE International Joint Conference on Neural Networks (IJCNN), pages 1–8, Vancouver, Canada, July 2016.

[41] Marco Pignatelli and Antonello Bonci. Role of dopamine neurons in reward and aversion: a synaptic plasticity perspective. Neuron, 86(5):1145–1157, 2015.



[42] Nicolas Pinto, Youssef Barhomi, David D Cox, and James J DiCarlo. Comparing state-of-the-art visual features on invariant object recognition tasks. In IEEE Workshop on Applications of Computer Vision (WACV), pages 463–470, Kona, Hawaii, USA, 2011.

[43] Geoffrey Portelli, John M Barrett, Gerrit Hilgen, Timothee Masquelier, Alessandro Maccione, Stefano Di Marco, Luca Berdondini, Pierre Kornprobst, and Evelyne Sernagor. Rank order coding: a retinal information decoding strategy revealed by large-scale multielectrode array retinal recordings. eNeuro, 3(3):ENEURO-0134, 2016.

[44] Damien Querlioz, Olivier Bichler, Philippe Dollfus, and Christian Gamrat. Immunity to device variations in a spiking neural network with memristive nanodevices. IEEE Transactions on Nanotechnology, 12(3):288–295, 2013.

[45] Edmund T Rolls and Gustavo Deco. Computational Neuroscience of Vision. Oxford University Press, Oxford, UK, 2002.

[46] Guillaume A Rousselet, Simon J Thorpe, and Michele Fabre-Thorpe. Taking the max from neuronal responses. Trends in Cognitive Sciences, 7(3):99–102, 2003.

[47] T Serrano-Gotarredona, T Masquelier, T Prodromakis, G Indiveri, and B Linares-Barranco. STDP and STDP variations with memristors for spiking neuromorphic learning systems. Frontiers in Neuroscience, 7:2, 2013.

[48] T Serre, L Wolf, S Bileschi, M Riesenhuber, and T Poggio. Robust object recognition with cortex-like mechanisms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3):411–426, 2007.

[49] Thomas Serre, Aude Oliva, and Tomaso Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences, 104(15):6424–6429, 2007.

[50] Shy Shoham, Daniel H O'Connor, and Ronen Segev. How silent is the brain: is there a dark matter problem in neuroscience? Journal of Comparative Physiology A, 192(8):777–784, 2006.

[51] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

[52] M Sivilotti. Wiring considerations in analog VLSI systems with application to field-programmable networks. PhD thesis, Computer Science Division, California Institute of Technology, Pasadena, CA, 1991.

[53] Simon Thorpe, Arnaud Delorme, and Rufin Van Rullen. Spike-based strategies for rapid processing. Neural Networks, 14(6):715–725, 2001.

[54] Simon Thorpe, Denis Fize, and Catherine Marlot. Speed of processing in the human visual system. Nature, 381(6582):520–522, 1996.

[55] Rufin Van Rullen and Simon J Thorpe. Rate coding versus temporal order coding: what the retinal ganglion cells tell the visual cortex. Neural Computation, 13(6):1255–1283, 2001.

[56] Adrien Wohrer and Pierre Kornprobst. Virtual Retina: a biological retina model and simulator, with contrast gain control. Journal of Computational Neuroscience, 26(2):219–249, 2009.

[57] A Yousefzadeh, T Serrano-Gotarredona, and B Linares-Barranco. Fast pipeline 128×128 pixel spiking convolution core for event-driven vision processing in FPGAs. In Proceedings of the 1st International Conference on Event-Based Control, Communication and Signal Processing (EBCCSP), 2015.

[58] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), pages 818–833, Zurich, Switzerland, September 2014.



[59] Bo Zhao, Ruoxi Ding, Shoushun Chen, Bernabe Linares-Barranco, and Huajin Tang. Feedforward categorization on AER motion events using cortex-like features in a spiking neural network. IEEE Transactions on Neural Networks and Learning Systems, 26(9):1963–1978, 2015.
