NeuSpike-Net: High Speed Video Reconstruction via Bio-inspired Neuromorphic Cameras

Lin Zhu 1,2   Jianing Li 1,2   Xiao Wang 2   Tiejun Huang 1   Yonghong Tian 1,2,∗

1 Department of Computer Science and Technology, Peking University, Beijing, China   2 Peng Cheng Laboratory, Shenzhen, China

{linzhu, lijianing, tjhuang, yhtian}@pku.edu.cn [email protected]

Abstract

The neuromorphic vision sensor is a new bio-inspired imaging paradigm that has emerged in recent years; it continuously senses luminance intensity and fires asynchronous spikes (events) with high temporal resolution. Typically, there are two types of neuromorphic vision sensors, namely the dynamic vision sensor (DVS) and the spike camera. From the perspective of bio-inspired sampling, DVS only perceives movement by imitating the retinal periphery, while the spike camera was developed to perceive fine textures by simulating the fovea. It is meaningful to explore how to combine the two types of neuromorphic cameras to reconstruct high quality images as human vision does. In this paper, we propose NeuSpike-Net to learn both the high dynamic range and high motion sensitivity of DVS and the full texture sampling of the spike camera to achieve high-speed and high dynamic range image reconstruction. We propose a novel representation to effectively extract the temporal information of spike and event data. By introducing a feature fusion module, the two types of neuromorphic data complement each other. The experimental results on simulated and real datasets demonstrate that the proposed approach is effective in reconstructing high-speed and high dynamic range images via the combination of spike and event data.

1. Introduction

In recent years, bio-inspired vision sensors have become

very attractive in the field of self-driving cars, unmanned aerial vehicles, and autonomous mobile robots [23], due to their significant advantages over conventional frame-based cameras, such as high dynamic range and high temporal resolution [13, 24].

Generally speaking, there are two bio-inspired visual sampling manners: temporal contrast sampling and integral sampling. Among them, the dynamic vision sensor

∗Corresponding author.

(DVS) [9, 1], a.k.a. the event camera, is the most well-recognized bio-inspired vision sensor based on temporal contrast sampling; it measures the change of light intensity and outputs high dynamic range events. From the biological point of view, DVS imitates the periphery of the retina [36], which is sensitive to motion. However, it is very difficult to reconstruct texture from DVS. To solve this problem, some event-based sensors were developed subsequently by combining DVS and a frame-based active-pixel sensor (APS), such as DAVIS [4], or by adding an extra photo-measurement circuit, such as ATIS [30] and CeleX [15]. However, a mismatch exists due to the difference in the sampling time resolution between the two kinds of heterogeneous circuits. Recently, many algorithms were designed to reconstruct images using DVS [33, 31, 25, 34, 38, 44, 6]. There are also some algorithms that combine images and events to reconstruct texture images [28, 29, 27, 33], which can obtain more texture information than using the DVS signal alone.

Different from DVS, there are a number of spiking image sensors following the basis of the integrate-and-fire neuron model [43, 19, 7, 35]. Some variants of spiking image sensors, such as an asynchronous pixel event tricolor vision sensor [22] and a near infrared spiking image sensor [3], were proposed. Recently, Dong et al. [10, 47] proposed a spike camera based on a fovea-like sampling method, which has high spatial (250×400) and temporal (40,000 Hz) resolutions. Moreover, there is a portable spike camera, a.k.a. Vidar, with a sampling rate of 20,000 Hz. For the spike camera, the spike firing frequency can be used to estimate light intensity [47]. Recently, a fovea-like texture reconstruction framework was proposed to reconstruct images [49]. In addition, some methods based on the spike camera were developed for spike coding [11, 48], tone mapping [16], and motion deblurring [45].

In the human vision system [36], peripheral and foveal vision are not independent, but are directly connected [37]. It is biologically plausible that the periphery and fovea are complementary. This motivates us to explore a question: how to combine the two neuromorphic cameras to reconstruct high-quality visual images as human vision does?


Figure 1. The motivation of our approach. Neuromorphic cameras are inspired by the retina. DVS perceives movement by imitating the retinal periphery, while the spike camera was developed to sense fine textures by simulating the fovea. In this work, we combine the spike and event data to achieve effective information complementarity and get better reconstruction quality.

In fact, DVS has the ability of high speed and high dynamic range sensing, but it is difficult for it to perceive texture information. In contrast, the spike camera has the ability of full texture sampling like a conventional camera, but its dynamic range is greatly affected by noise. Meanwhile, its sampling ability depends on sufficient light intensity in the scene. In application, it is meaningful to combine the high dynamic range of event data and the full texture of spike data to reconstruct high quality images.

In this paper, we combine the two types of neuromorphic data to reconstruct high quality texture images, especially in complex scenes such as high speed and low light. The contributions of this paper are summarized as follows.

1) We first propose a reconstruction network combining spike and event cameras (NeuSpike-Net). According to the characteristics of the neuromorphic data, we explore a learning-based joint reconstruction strategy, which can achieve high quality full texture reconstruction in complex scenes with different light intensities and motion speeds.

2) We propose a neuromorphic data representation to extract the useful temporal information hidden in spike and event data. With the help of the neuromorphic data representation, the proposed network can effectively learn the features of spike and event data.

3) We propose to simulate multi-scale spike data, which considers the various noises existing in the spike camera. The simulated dataset is generated for network training and testing by simulating different light intensities and motion speeds. Moreover, we build a hybrid camera system to collect real-world datasets to test the effectiveness of the model.

2. The Motivation of Our Approach

In this section, we analyze the interaction of the fovea and periphery in the retina (Section 2.1) and the sampling principles of the two types of neuromorphic cameras (Section 2.2). The relationship between the spike data and the event data is further analyzed (Section 2.3), and the noise distribution of the spike camera is discussed in Section 2.4.

Figure 2. The sampling mechanism of DVS and spike camera.

2.1. Sampling Mechanisms of Human Retina

In the human vision system, the retina is an important part that receives light and perceives the scene. The center of the retina, a.k.a. the fovea, is used for scrutinizing highly detailed objects, while peripheral vision is optimized for perceiving coarser motion information [39]. Research in the last decade has shown that peripheral and foveal vision are not independent in human vision, but are directly connected [37]. Humans use peripheral vision to select regions of interest and foveate them by saccadic eye movements for further scrutiny [41]. In fact, as stated in [37], peripheral and foveal inputs interact and influence each other to better perceive the scene. There is an integration process that combines the foveal and peripheral information, which is beneficial to perception [37]. Recently, researchers in the neuromorphic field developed several bio-inspired vision sensors to simulate the properties of the human retina. In this paper, motivated by the perception mechanism of the retina, we explore how to combine two kinds of bio-inspired cameras (the fovea-like spike camera and the peripheral-like DVS) to reconstruct high quality images (see Figure 1).

2.2. Bio-inspired Neuromorphic Cameras

For the spike camera, each pixel independently accumulates the luminance intensity, which is read out through an analog-to-digital converter (ADC), and generates a spike if the ADC value exceeds the dispatch threshold φ [47]:

$$ \int_0^{T} I \, dt \geq \phi, \quad (1) $$

where I refers to the luminance intensity (usually measured by photocurrent in the circuit). Then the accumulator is reset and all the charges on it are drained. At different pixels, the accumulation speed of the luminance intensity is different. As shown in Figure 2, the greater the light intensity, the more frequently spikes are emitted. For a spike camera, a pixel continuously measures the light intensity and emits a spike train at 40,000 Hz. At a certain sampling time, the states ("1" or "0") of all pixels form a spike plane.
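To make this integrate-and-fire sampling concrete, the following minimal Python sketch simulates Eq. (1) on a sequence of intensity frames; the frame-based input, the function name, and the threshold value are illustrative assumptions rather than the camera's actual circuit model.

```python
import numpy as np

def simulate_spike_planes(intensity_frames, phi=255.0):
    """Toy integrate-and-fire sampler following Eq. (1).

    intensity_frames: iterable of HxW arrays, one per sampling step (assumption).
    phi: dispatch threshold (illustrative value).
    Returns a list of binary spike planes, one per input frame.
    """
    accumulator = np.zeros_like(intensity_frames[0], dtype=np.float64)
    spike_planes = []
    for frame in intensity_frames:
        accumulator += frame              # accumulate luminance intensity
        fired = accumulator >= phi        # pixels whose integral reached the threshold
        spike_planes.append(fired.astype(np.uint8))
        accumulator[fired] = 0.0          # reset: all charges on fired pixels are drained
    return spike_planes
```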

The dynamic vision sensor (DVS) [9, 1] tracks the light intensity changes at each pixel, and fires asynchronous events whenever the log intensity changes by more than a dispatch threshold θ (see Figure 2):

$$ | \log(I_{t+1}) - \log(I_t) | \geq \theta. \quad (2) $$

Since each pixel individually responds to the light intensity changes, DVS does not have a fixed sampling rate. For a pixel (x, y), if an event occurs at time t, the event is represented as a four-dimensional tuple e = (t, x, y, p), where p denotes the polarity of the event ("+1" for a light intensity increase and "−1" for a decrease). This representation is called the Address Event Representation (AER) and is the standard format used by event-based sensors.
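As a rough illustration of Eq. (2) and the AER format, the sketch below emits (t, x, y, p) tuples whenever the per-pixel log intensity has drifted by more than θ since the last event at that pixel; sampling from discrete frames and the chosen θ are simplifying assumptions, since a real DVS operates asynchronously in continuous time.

```python
import numpy as np

def frames_to_events(frames, timestamps, theta=0.2, eps=1e-6):
    """Toy event generator per Eq. (2); returns AER tuples (t, x, y, p)."""
    log_ref = np.log(frames[0] + eps)           # reference log intensity per pixel
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_cur = np.log(frame + eps)
        diff = log_cur - log_ref
        fired = np.abs(diff) >= theta           # Eq. (2): log-intensity change over threshold
        ys, xs = np.where(fired)
        for x, y in zip(xs, ys):
            p = 1 if diff[y, x] > 0 else -1     # polarity of the event
            events.append((t, x, y, p))
        log_ref[fired] = log_cur[fired]         # update the reference only where an event fired
    return events
```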

Generally speaking, the event camera has a high dynamic sensing ability for moving objects, but it cannot record texture. The spike camera has the ability of full texture sampling, but its dynamic range is not as high as that of an event camera. Therefore, this work explores how to effectively combine the two cameras to achieve complementarity.

2.3. Relationship between Spike and Event Data

Although the sampling mechanisms are different, spike and event cameras both record the change of light intensity. Based on the light intensity information hidden in the data, the relationship between the spike and the event data is analyzed to guide the development of our model. Considering that Eq. (1) can be simplified as I · t ≥ φ, for the spike camera the average intensity of the pixel in this period can be estimated by

$$ I = \frac{\phi}{t_{\mathrm{ISI}}}, \quad (3) $$

where φ denotes the dispatch threshold and t_ISI corresponds exactly to the inter-spike interval (ISI).
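A minimal sketch of Eq. (3): the average intensity at a query instant is the threshold divided by the inter-spike interval that encloses it. The function name, inputs, and threshold value are illustrative.

```python
def intensity_from_isi(spike_times, t_query, phi=255.0):
    """Estimate the average intensity at t_query from the enclosing ISI (Eq. (3))."""
    before = [t for t in spike_times if t <= t_query]
    after = [t for t in spike_times if t > t_query]
    if not before or not after:
        return None                       # no enclosing interval available
    t_isi = after[0] - before[-1]         # inter-spike interval around t_query
    return phi / t_isi
```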

For the event camera, we first map the event sequence into a continuous-time function which incorporates the statistical description. An event sequence with N firing times {t_i ∈ T | i = 1, 2, ..., N} can be described by a sum of Dirac delta functions

$$ e(t) = \sum_i p_i \, \delta(t - t_i), \quad (4) $$

where p_i refers to the polarity of the i-th event.

Figure 3. The noise analysis of the spike camera on bright and dark scenes. Top: the bright scene. Bottom: the dark scene. From the ISI distribution, we can see that the bright scene suffers from noise type 1 (analyzed in Section 2.4), while the dark scene is mainly influenced by the fixed pattern noise.

The Dirac delta function has the property δ(t) = 0 for t ≠ 0 and δ(t) = ∞ for t = 0, with ∫ δ(t) dt = 1.

Assuming that the positive ("+1") and negative ("−1") events are triggered according to the same threshold θ, and that θ does not change between t and t + 1, according to Eq. (2) we have the following expression:

$$ \log(I_{t+1}) - \log(I_t) = \theta \int_t^{t+1} e(s) \, ds. \quad (5) $$

Considering that the light intensities I_{t+1} and I_t can be represented by the ISI of the spike train as described in Eq. (3), the trigger threshold θ is expressed as

$$ \theta = \frac{\log(t_{\mathrm{ISI1}} / t_{\mathrm{ISI2}})}{\int_t^{t+1} e(s) \, ds}, \quad (6) $$

where t_ISI1 and t_ISI2 denote the ISI at t and t + 1 in the spike train, respectively. Therefore, in ideal conditions, we can obtain the dispatch threshold θ according to Eq. (6). For any time t_i > t, the light intensity L_{t_i} can be estimated by

$$ L_{t_i} = \exp\!\left( \log(I_t) + \theta \int_t^{t_i} e(s) \, ds \right). \quad (7) $$

However, there are large temporal noises in both spike and event cameras, which have a great impact on image reconstruction. We analyze this in Section 2.4. Despite the influence of noise, Eq. (7) can help us to better design the representations of spike data and event data for our network.
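Under the ideal, noise-free assumptions above, Eqs. (6) and (7) translate into a small estimation routine: the event integral between two spike times fixes θ, and the same kind of integral then extrapolates the intensity forward. The sketch below is only an illustration of this reasoning; the discrete sums standing in for the event integral and the function names are assumptions.

```python
import numpy as np

def estimate_threshold(t_isi1, t_isi2, event_polarities):
    """Eq. (6): theta from the ISIs at t and t+1 and the event integral between them."""
    integral = float(np.sum(event_polarities))    # discrete stand-in for the integral of e(s)
    if integral == 0:
        return None                               # no events: theta is unobservable
    return np.log(t_isi1 / t_isi2) / integral

def estimate_intensity(intensity_t, theta, event_polarities_up_to_ti):
    """Eq. (7): extrapolate the light intensity at t_i from I_t and the events in (t, t_i]."""
    integral = float(np.sum(event_polarities_up_to_ti))
    return np.exp(np.log(intensity_t) + theta * integral)
```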

2.4. Noise Analysis of Neuromorphic Data

The performance of image reconstruction from neuromorphic cameras is heavily affected by noise. For event cameras, the image quality is directly affected by the dispatch threshold, which varies per pixel (due to manufacturing mismatch) and also with dynamic effects (incident light, time, etc.) [12, 2]. Moreover, the temporal noise becomes significant under the conditions of a low event threshold, high bandwidth, and low light intensity [8].


Figure 4. Spike and event data representations. For the spike flow, we use a temporal ISI map and Nc − 1 spike planes as the input of the texture path. Meanwhile, the event flow is transformed into an event integration according to the timestamp of the spike plane as the motion path input (see Eq. (8)). The temporal information can be effectively explored by our spike and event representation.

For the spike camera, if no noise exists, the texture image can be accurately and quickly reconstructed by using the TFI method [47]. However, the existence of temporal noise also has a great impact on image reconstruction, mainly of the following two types: 1) For a constant light intensity, the ISI may be inconsistent due to the readout and reset time delay in the circuit. For example, the accurate ISI of the spike train at a certain time should be 2, but the readout ISI may fluctuate between 1 and 2 due to noise in the temporal domain. This is especially harmful to image reconstruction under high light intensity (see Figure 3). 2) Under low light intensity, the influence of 1) becomes smaller due to the long interval between emissions. The main noise is the fixed pattern noise (e.g., dark current noise). In this case, the noise actively triggers spikes, which leads to a mismatch between the ISI and the real light intensity and also limits the maximum ISI range. An analysis of the fixed pattern noise can be found in our supplementary material.

3. Methodology

In this section, we first design neuromorphic data

representations for spike and event data (Section 3.1). The principle is to extract more useful temporal information. The image reconstruction network is introduced in Section 3.2 and the feature fusion module for spike and event data is detailed in Section 3.3. Finally, a multi-scale neuromorphic data simulation method is introduced in Section 3.4.

3.1. Adaptive Neuromorphic Data Representation

Eq. (7) represents the relationship between the two types of neuromorphic data. The light intensity at a subsequent time can be estimated from the ISI at the initial time and the integration of the event data. According to the sampling principle of the spike camera, the ISI represents the integration time for a pixel to reach one emission, which contains the temporal information of this period. On the other hand, since the spike camera outputs in the form of spike planes, the

spikes can naturally be input to the network, where each spike plane is used as a channel. For the representation of spike data, as shown in Figure 4, we propose to use one channel of ISI and N − 1 channels of spike planes as the input of the texture path, with a total of N channels.
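As an illustration of this layout, the sketch below stacks one ISI map with N − 1 raw spike planes into an N-channel array; it assumes the spike planes arrive as a T×H×W binary array and computes a simple ISI around the centre plane. The exact ISI definition and windowing used in the paper may differ.

```python
import numpy as np

def build_spike_representation(spike_planes, n_channels=32):
    """Stack one ISI map with (n_channels - 1) spike planes (illustrative layout)."""
    T, H, W = spike_planes.shape
    centre = T // 2
    isi_map = np.full((H, W), T, dtype=np.float32)     # default when no enclosing spikes exist
    for y in range(H):
        for x in range(W):
            times = np.flatnonzero(spike_planes[:, y, x])
            before = times[times <= centre]
            after = times[times > centre]
            if before.size and after.size:
                isi_map[y, x] = after[0] - before[-1]  # inter-spike interval around the centre
    start = centre - (n_channels - 1) // 2             # assumes enough planes around the centre
    planes = spike_planes[start:start + n_channels - 1]
    return np.concatenate([isi_map[None], planes.astype(np.float32)], axis=0)
```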

The event data is discrete in the spatio-temporal domain because of its asynchronous nature. Eq. (7) can guide us to design the representation of event data. Based on the timestamps of the spike planes, the asynchronous events need to be transformed into effective two-dimensional features suitable for network input. In our work, inspired by the integrate-and-fire (IF) model [21], we design an event integration model to extract temporal information from the event flow and transform it into a 2-D feature. The membrane potential V(t) is defined as

$$ V(t) = \int_0^{t} \left( e(s) + \exp\!\left( -\frac{t - s}{\tau} \right) e(s) \right) ds, \quad (8) $$

where τ is the time constant to control the decay rate. As shown in Figure 4, we calculate the accumulation of positive and negative events respectively, and then add them together to get the final feature map as the motion path input.
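A discrete sketch of the event integration in Eq. (8): each event contributes its raw value plus an exponentially decayed copy, positive and negative events are accumulated separately, and the two maps are then added. The event format, the decay constant, and the treatment of negative events as unsigned magnitudes are assumptions.

```python
import numpy as np

def event_integration(events, t_end, height, width, tau=0.1):
    """Discrete stand-in for Eq. (8): one 2-D feature map per polarity, then summed."""
    v_pos = np.zeros((height, width), dtype=np.float32)
    v_neg = np.zeros((height, width), dtype=np.float32)
    for t, x, y, p in events:                  # assumes events are sorted by timestamp
        if t > t_end:                          # integrate up to the spike-plane timestamp
            break
        decay = np.exp(-(t_end - t) / tau)
        contrib = 1.0 + decay                  # e(s) + exp(-(t - s)/tau) e(s)
        if p > 0:
            v_pos[y, x] += contrib
        else:
            v_neg[y, x] += contrib             # negative events accumulated as magnitudes (assumption)
    return v_pos + v_neg                       # final motion-path input
```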

3.2. Network Architecture

Our neural network is a fully convolutional network that accommodates both the event flow and the spike flow. Figure 5 clarifies the architecture of the network. NeuSpike-Net has two encoders and a decoder, so it is a variant of the U-shaped model [32]. The event flow and spike flow are first transformed to the size of Nc × W × H and followed by Ne encoder layers, Nr residual blocks, Nd decoder layers, and a final image prediction layer. The number of channels is doubled after each encoder layer.

In the encoder, there are two input paths: the motion path and the texture path. The two paths have the same encoder structure, and we use skip connections between symmetric encoder and decoder layers. The motion path is designed to extract more useful features from the event data because the event flow responds to moving objects. The texture path captures the texture information hidden in the spike flow. The motion and texture features are fused by a feature fusion module in the encoder layers (see Section 3.3). The prediction layer performs a depthwise convolution followed by a sigmoid layer to produce an image prediction. We use Nc = 32, Ne = Nd = 3, and Nr = 2 in our model.
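The following PyTorch-style sketch mirrors the described layout at a coarse level: two identical encoder paths, fusion at each encoder stage, residual blocks at the bottleneck, and a decoder with skip connections ending in a sigmoid prediction. The class names, the element-wise sum standing in for the FFM of Section 3.3, the plain convolutions standing in for residual blocks, and all layer details are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.conv(x)

class NeuSpikeNetSketch(nn.Module):
    """Two-encoder, one-decoder U-shaped sketch (Nc=32, Ne=Nd=3, Nr=2 as in the text)."""
    def __init__(self, n_c=32, n_e=3, n_r=2):
        super().__init__()
        chans = [n_c * (2 ** i) for i in range(n_e + 1)]
        self.head_evt = nn.Conv2d(n_c, chans[0], 3, padding=1)   # motion path input
        self.head_spk = nn.Conv2d(n_c, chans[0], 3, padding=1)   # texture path input
        self.enc_evt = nn.ModuleList([EncoderBlock(chans[i], chans[i + 1]) for i in range(n_e)])
        self.enc_spk = nn.ModuleList([EncoderBlock(chans[i], chans[i + 1]) for i in range(n_e)])
        # plain convolutions standing in for the residual blocks
        self.res = nn.Sequential(*[nn.Conv2d(chans[-1], chans[-1], 3, padding=1) for _ in range(n_r)])
        self.dec = nn.ModuleList([
            nn.ConvTranspose2d(chans[i + 1], chans[i], 4, stride=2, padding=1)
            for i in reversed(range(n_e))])
        # the paper uses a depthwise convolution here; a plain conv is used for simplicity
        self.pred = nn.Sequential(nn.Conv2d(chans[0], 1, 3, padding=1), nn.Sigmoid())
    def forward(self, event_flow, spike_flow):
        e, s = self.head_evt(event_flow), self.head_spk(spike_flow)
        skips = []
        for enc_e, enc_s in zip(self.enc_evt, self.enc_spk):
            skips.append(e + s)                 # placeholder for the FFM (Section 3.3)
            e, s = enc_e(e), enc_s(s)
        x = self.res(e + s)
        for dec, skip in zip(self.dec, reversed(skips)):
            x = dec(x) + skip                   # skip connections between encoder and decoder
        return self.pred(x)
```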

In our network, the loss function is defined as follows:

$$ \mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\ell_2} + \lambda \mathcal{L}_{\mathrm{PL}}, \quad (9) $$

where $\mathcal{L}_{\ell_2}$ is the $\ell_2$ loss $\frac{1}{T} \sum_{i=1}^{T} \lVert I_i^{*} - I_i^{g} \rVert_2$, and $I_i^{*}$ and $I_i^{g}$ denote the generated texture image and the ground truth image, respectively.


Figure 5. NeuSpike-Net architecture. Our network contains two encoder paths corresponding to the event and spike flow, respectively. The two paths have the same encoder structure, and we use skip connections between symmetric encoder and decoder layers. Besides, FFMs are applied to fuse the features of the spike and event flow in each encoder layer. The prediction layer performs a depthwise convolution followed by a sigmoid layer to produce an image prediction.

Figure 6. The feature fusion module. For better viewing, the feature maps are obtained by adding up all channels. The feature map is based on the driving scene in Fig. 10. The motion path highlights the motion and HDR parts while the texture path pays more attention to the texture information of the scene.

$\mathcal{L}_{\mathrm{PL}}$ is the perceptual loss [18]:

$$ \mathcal{L}_{\mathrm{PL}} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{*})_{x,y} - \phi_{i,j}(I^{g})_{x,y} \right)^2, \quad (10) $$

where $\phi_{i,j}$ is the feature map obtained by the $j$-th convolution (after activation) before the $i$-th max-pooling layer, and $W_{i,j}$ and $H_{i,j}$ are the dimensions of the feature maps.
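A sketch of the objective in Eqs. (9)-(10), assuming a single-channel prediction and using an ImageNet-pretrained VGG-16 feature block as the perceptual feature extractor φ; the particular feature layer, the mean-squared form of the ℓ2 term, and the torchvision API are assumptions, while λ = 0.01 follows the text.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class ReconstructionLoss(nn.Module):
    """L_total = L_l2 + lambda * L_PL (Eqs. (9)-(10)), with a VGG-16 feature block as phi."""
    def __init__(self, lam=0.01):
        super().__init__()
        self.lam = lam
        # up to relu3_3; an assumed layer choice (torchvision >= 0.13 weights API)
        vgg = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.phi = vgg
    def forward(self, pred, target):
        l2 = torch.mean((pred - target) ** 2)
        # replicate the single-channel prediction to 3 channels for the VGG input
        f_pred = self.phi(pred.repeat(1, 3, 1, 1))
        f_tgt = self.phi(target.repeat(1, 3, 1, 1))
        pl = torch.mean((f_pred - f_tgt) ** 2)
        return l2 + self.lam * pl
```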

3.3. Feature Fusion Module

Humans use peripheral vision to select regions of interest and foveate them for further scrutiny [41]. In our model, the motion path (peripheral) captures more high dynamic range and motion information, and the texture path (fovea) preserves more detailed texture information. Inspired by human vision, we propose an attention-based feature fusion module (FFM) to extract more useful features from the two encoder paths. Figure 6 shows the visualized intermediate feature maps of the FFM. Since each channel of the input spike

and event data represents the temporal information of the neuromorphic data, the channels of the feature contain the temporal information to some extent. Thus, channel attention is applied to the texture and motion features to highlight more useful temporal information of the two paths. Given the intermediate feature maps of the two paths, F_m ∈ R^{C×H×W} and F_t ∈ R^{C×H×W}, as input, the FFM first infers a 1-D channel attention map M_c ∈ R^{C×1×1}:

$$ F'_m = M_c(F_m) \otimes F_m \quad \text{and} \quad F'_t = M_c(F_t) \otimes F_t, \quad (11) $$

where ⊗ denotes element-wise multiplication and M_c(·) denotes the channel attention module [42]. By introducing spatial attention, the final fused feature is obtained by

$$ F_{\mathrm{fuse}} = M_s\big( (F'_m \oplus F'_t) \oplus F_d \big) \otimes (F'_m \oplus F'_t), \quad (12) $$

where ⊕ denotes element-wise sum, M_s(·) denotes the spatial attention module [26], and F_d denotes the feature map of the decoder.
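A compact sketch of the FFM defined by Eqs. (11)-(12), with a CBAM-style channel attention standing in for M_c [42] and a simple spatial attention gate standing in for M_s [26]; whether M_c is shared between the two paths and the internal attention details are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c: a CBAM-style 1-D channel attention map (assumed implementation)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))           # average-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))            # max-pooled channel descriptor
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """M_s: a simple spatial attention map over channel statistics (assumed)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
    def forward(self, x):
        stats = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(stats))

class FeatureFusionModule(nn.Module):
    """F_fuse = M_s((F'_m + F'_t) + F_d) * (F'_m + F'_t), Eqs. (11)-(12)."""
    def __init__(self, channels):
        super().__init__()
        self.mc = ChannelAttention(channels)
        self.ms = SpatialAttention()
    def forward(self, f_m, f_t, f_d):
        f_m = self.mc(f_m) * f_m                     # Eq. (11), motion path
        f_t = self.mc(f_t) * f_t                     # Eq. (11), texture path
        summed = f_m + f_t
        return self.ms(summed + f_d) * summed        # Eq. (12)
```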

3.4. Multi-scale Training Neuromorphic Data

Our network requires training data including event flow, spike flow, and corresponding ground truth images. However, the ground truth images are usually difficult to obtain. Thus, we propose to train the network on the simulated multi-scale neuromorphic data¹.

High Frame Rate Video Preparation. Inspired by the DVS simulator V2E [8], the videos are first converted into luma

¹ Please see our supplementary material for more details about the simulated data.


Figure 7. Effect of different input modalities. Column 1: the input spike and event data. Columns 2-4: the reconstructed images from event data, spike data, and event+spike data, respectively.

frames, and then we adopt the Super-SLoMo video interpolation network [17] to increase the frame rate of the video. We use the videos in the Object Tracking Evaluation of the KITTI dataset [14] to generate the simulated data. The average upsampling ratio is 750. The original 30 FPS videos are upsampled to about 22,500 FPS, which is similar to the sampling frequency of a spike camera. We use the images in the original videos as the ground truth to ensure they are clear. The upsampled video is used to generate spike and event data according to the timestamps of the ground truth images.

Table 1. Effect of different event/spike representations.

Representation         PSNR    SSIM
ES + SR w/o TISI       26.31   0.8134
Voxel + SR w/o TISI    26.74   0.8159
ML + SR w/o TISI       27.86   0.8204
IF + SR w/o TISI       27.65   0.8224
ES + SR                29.35   0.9129
Voxel + SR             29.44   0.9021
ML + SR                30.16   0.9110
IF + SR (Ours)         30.31   0.9234

Table 2. Effect of different input modalities.

Modality               PSNR    SSIM
Event                  12.85   0.5198
Spike                  26.84   0.8389
Event + spike (Ours)   27.40   0.8510

Multi-scale Spike and Event Data. The way to generate basic spike data is intuitive: first, we set up an accumulator for each pixel. Each input image contributes to the accumulator according to the pixel gray value multiplied by a light intensity scale. If the accumulated value exceeds the emission threshold, spike "1" is fired; otherwise "0" is output.

To better simulate real-world scenarios, we generalize the simulated data to a multi-scale form, including different noises, light conditions, and motion speeds. Different light conditions are simulated by adjusting the light intensity scale to control the density of the generated spikes. Meanwhile, different motion speeds are simulated by adjusting the integration times of each frame. The corresponding event data is generated by V2E using the same upsampled frames to simulate different motions. Also, the contrast threshold is adjusted to simulate different light intensities.
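A toy version of this multi-scale simulation, under stated assumptions (hypothetical function and parameter names): the light-intensity scale controls how densely spikes are emitted, and the number of integration steps per frame stands in for the motion speed.

```python
import numpy as np

def simulate_spikes(upsampled_frames, light_scale=1.0, steps_per_frame=4, phi=255.0):
    """Generate binary spike planes from upsampled gray frames (multi-scale toy simulator).

    light_scale: larger values emit denser spikes (simulated light condition).
    steps_per_frame: integration steps per frame (simulated motion speed).
    """
    accumulator = np.zeros_like(upsampled_frames[0], dtype=np.float64)
    planes = []
    for frame in upsampled_frames:
        for _ in range(steps_per_frame):
            accumulator += frame * light_scale
            fired = accumulator >= phi
            planes.append(fired.astype(np.uint8))
            accumulator[fired] = 0.0            # drain fired accumulators
    return planes
```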

Temporal Noise Simulation. We consider simulating the noise distribution (see Section 2.4) of real spike data. To simulate the noise under high light intensity, we add a random matrix to the initial accumulator. To simulate the fixed pattern noise under low light conditions, we first generate a matrix N ∈ R^{m×n} that follows a Gaussian distribution N(µ, σ²), where m × n denotes the spatial resolution of the data. Then the length of the spike interval is constrained by

$$ T_{x,y} = \begin{cases} Y_{x,y} & \text{if } Y_{x,y} \le N_{x,y} \\ N_{x,y} & \text{if } Y_{x,y} > N_{x,y} \end{cases}, \quad (13) $$

where Y_{x,y} is the original simulated spike interval at pixel (x, y), and T_{x,y} denotes the constrained spike interval. According to statistics on real spike data, the mean µ and standard deviation σ are set to 180 and 50, respectively.
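A short sketch of the constraint in Eq. (13): the simulated ISI map is clipped element-wise by a per-pixel Gaussian noise matrix, which reduces to an element-wise minimum; the function name and the use of a fixed random seed are illustrative.

```python
import numpy as np

def constrain_spike_intervals(simulated_isi, mu=180.0, sigma=50.0, seed=0):
    """Apply the fixed-pattern-noise constraint of Eq. (13) to a simulated ISI map."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(mu, sigma, size=simulated_isi.shape)   # one value per pixel
    # T = Y where Y <= N, and T = N where Y > N, i.e. an element-wise minimum
    return np.minimum(simulated_isi, noise)
```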

4. Experiment

4.1. Training Details

Our network² is trained on an NVIDIA 2080 Ti GPU. We adopt a batch size of 8 and the Adam optimizer [20] in the training process. The network is trained for 60 epochs with a learning rate of 10^{-4}. The weight λ of the perceptual loss is set to 0.01. During training, input images are randomly flipped horizontally and vertically (with probability 0.5) and cropped to 400×256. As stated in Section 3.4, the simulated data (spike data, event data, and ground truth images) covers 5 different light scales and 5 different motion speeds. We use all the original frames in video "0000" to generate training data, and random frames in videos "0000"-"0019" to generate test data. In total, we use 1,183 samples to train the network and 745 samples to test the performance.

4.2. Effect of Different Neuromorphic Inputs

Different Event/Spike Representations. To evaluate the effect of different event and spike representations, we conduct experiments on the simulated data. For the event representation, we test event stacking (ES) [40], voxel grid (Voxel) [46], Matrix-LSTM (ML) [5], and the proposed representation (Eq. (8)). For the spike data, we compare the effect of two representations: the proposed spike representation without the temporal ISI (SR w/o TISI) and the spike flow with the temporal ISI (SR). ML is a learning-based representation for event data, and the voxel grid is widely used in DVS reconstruction. As the quantitative results in Table 1 show, IF + SR performs better than the other representations because it is designed according to the sampling mechanism described in Eq. (7), which is more suitable for our framework. By introducing the proposed representation, the temporal components of the event and spike data are effectively explored by our framework.

² Project page: https://sites.google.com/view/retina-recon/


Figure 8. Qualitative results on the simulated dataset. Top: the results under a simulated low light scene. Bottom: the results under a simulated high speed scene. TFP (win=n) means the reconstruction window size is set to n spike planes. Other methods use their default configurations. The results show that our method can reconstruct the car and person clearly while eliminating the noise completely.

Figure 9. Qualitative results of ultra high speed scenes in the real-world dataset. The scene depicts a high speed fan with speeds from 500 to 2600 RPM; we compare our method with four methods: TFP (win=128), SNM, FireNet, and E2VID. The event-based methods have difficulty reconstructing images in this scene because the events are too dense. For spike-based methods, there are artifacts in the results of SNM and TFP.

Table 3. Quantitative results on simulated data.

Scene     Metric   *TFP (128)   *TFP (256)   *TFI     *SNM     †FireNet   †E2VID   *†Ours
Normal    PSNR     28.08        28.81        27.06    29.60    10.91      12.82    30.31
          SSIM     0.7038       0.7887       0.7750   0.8416   0.4640     0.5181   0.9234
Complex   PSNR     22.70        22.58        23.38    26.44    11.78      12.85    27.40
          SSIM     0.5648       0.6104       0.7063   0.7722   0.5247     0.5198   0.8510

* Spike-based methods. † Event-based methods.

Different Input Modalities. To validate the effect of the two types of neuromorphic data, we conduct an ablation experiment on the complex scenes in the simulated data. We design a network using the texture path of our framework to reconstruct images that only uses spike data. Meanwhile, the event-based approach E2VID is used to test the performance with only event input. The results are shown in Table 2 and Figure 7. Moreover, the results on the real dataset also prove the effectiveness of combining spike and event data (see Figure 10 and Table 5). The event-based methods have difficulty estimating the texture, while the spike-based methods rely on light intensity and thus struggle to reconstruct the HDR details. By combining spike and event data, the texture and HDR information can be reconstructed effectively.

4.3. Effect of Different Network Structures

Network Structure. We compare different network architectures to find the best hyperparameters. Table 4 reports the result of replacing the FFM (our default architecture) by an element-wise sum. Moreover, rows 3-6 demonstrate that our model performs best with 3 encoders and 2 residual blocks.

Loss functions. We use the L_l2 + L_PL loss by default, but have evaluated many alternative loss functions. As shown

in Table 4 (rows 7-9), L_l1 + L_PL has a larger SSIM than our default model, but the PSNR is lower. Other loss functions do not show advantages in the results.

4.4. Evaluation on the Simulated Dataset

Experiment Settings. To evaluate the proposed network, we conduct experiments on both simulated and real-world datasets. We compare our method with eight other state-of-the-art methods, including three spike-based reconstruction methods (TFP [47], TFI [47], and SNM [49]) and five event-based reconstruction methods (MF [25], HF [33], CF [33], FireNet [34], E2VID [31]). Among them, FireNet and E2VID are learning-based methods, and CF uses both events and frames to reconstruct images.

Results. Figure 8 shows the qualitative results of two typical scenes, including low light and high light scenes. The results show that the proposed model handles these scenes effectively. Table 3 shows the quantitative evaluation of our method on normal scenes and complex scenes (e.g., high speed and low light), respectively. The results show that our method outperforms the other methods, especially in SSIM. In summary, the qualitative and quantitative results demonstrate that our method can reconstruct high quality images by fusing spike and event data.


Figure 10. Qualitative results of the outdoor scenes in the real data. The scenes have different light intensities, which can be estimated from the firing density of the raw spike data. Top: a tank under high light intensity. Middle: a car under medium light intensity. Bottom: a driving scene under low light intensity. Our method performs better by combining event and spike data.

Table 4. Effect of different network structures and losses.

Condition                              PSNR    SSIM
1. Our default model                   30.31   0.9234
2. FFM → Element-wise sum              29.74   0.9159
3. Encoders: 3 → 2                     30.26   0.9104
4. Encoders: 3 → 4                     29.59   0.9229
5. Residual blocks: 2 → 1              29.79   0.9243
6. Residual blocks: 2 → 3              29.80   0.9076
7. Loss: L_l2 + L_PL → L_l1            29.68   0.9104
8. Loss: L_l2 + L_PL → L_l2            29.57   0.9084
9. Loss: L_l2 + L_PL → L_l1 + L_PL     28.79   0.9297

4.5. Evaluation on the Real World Dataset

Experiment Settings. Inspired by [16], we build a hybrid camera system (see Figure 11) consisting of a spike camera (Vidar), an event camera (DAVIS 346), and a beam splitter. The two cameras can record the same scene through the beam splitter. The details of this system can be found in our supplementary material. We construct a real dataset including 15 sequences with different light conditions, which consists of 5 outdoor scenes and 10 ultra high speed fan scenes (the fan speed ranges from 500 RPM to 2600 RPM).

Results. Figure 9 shows the results on ultra high speed scenes. In this scenario, because the event data are too dense, the event-based methods struggle to reconstruct satisfactory results. With the help of spike data, our method can reconstruct clear letters under different light intensities and motion speeds. Figure 10 shows the results on outdoor scenes with different light conditions. Compared with SNM (spike-based) and CF (event+frame), our method can reconstruct HDR scenes clearly with less noise, which shows the advantages of the combination of event and spike data. In "Tank", the noise analyzed in Section 2.4 is eliminated completely. For the medium and low light scenes "Car" and "Driving", our method takes advantage of the combination of the two types of data to reconstruct more HDR details. The quantitative evaluation is shown in Table 5. We use APS images as the ground truth to evaluate MSE and SSIM. Moreover, since APS cannot record high-speed scenes, we introduce a non-reference metric, 2D-entropy, to evaluate all real-world data. Note that CF uses the APS frames as the initial state, so it achieves better MSE

Table 5. Quantitative results on real-world data.

Method               MSE (Outdoor)   SSIM (Outdoor)   2D-entropy (High speed)   2D-entropy (Outdoor)
TFI (spike)          0.0842          0.3396           8.8037                    7.4797
SNM (spike)          0.0785          0.4362           8.8594                    8.5884
E2VID (event)        0.1014          0.4167           8.4820                    8.9180
MF (event)           0.1281          0.4062           8.8895                    5.5215
HF (event)           0.1347          0.3845           8.9459                    7.5335
CF (event + frame)   0.0526          0.4787           9.7556                    10.5473
Ours (spike)         0.0810          0.4682           9.4246                    10.1517
Ours (spike+event)   0.0741          0.5046           9.8747                    10.4897

Figure 11. The hybrid neuromorphic camera system.

and 2D-entropy on outdoor scenes. Our method performs well in all quantitative results (the introduction of events greatly improves SSIM). For more experimental results, please refer to our supplementary material.

5. Conclusion

In this work, we propose to combine the high dynamic

range of event data and the full texture sampling of spike data to reconstruct high quality visual images, especially in high speed scenes and under different light intensities. To this end, NeuSpike-Net is proposed to handle these two types of neuromorphic data. An effective representation of spike and event data is proposed to extract temporal information. Furthermore, a feature fusion module is designed to effectively fuse the two types of neuromorphic data. Our network is trained on a multi-scale simulated neuromorphic dataset. To test the performance, we also build a hybrid neuromorphic camera system to record a real-world dataset. Extensive evaluations on simulated and real datasets show that the proposed approach achieves superior performance compared with various existing spike-based and event-based methods.

Acknowledgments. This work was supported in part by the National Natural Science Foundation of China under Contract 62027804, Contract 61825101, and Contract 62088102.


References

[1] Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128 × 128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits, 43(2):566–576, 2008. 1, 3

[2] R. Baldwin, Mohammed Almatrafi, Vijayan Asari, and Keigo Hirakawa. Event probability mask (EPM) and event denoising convolutional neural network (EDnCNN) for neuromorphic cameras. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1701–1710, 2020. 3

[3] Juan Antonio Lenero Bardallo, Jose-Maria Guerrero-Rodriguez, Ricardo Carmona-Galan, and Angel Rodriguez-Vazquez. On the analysis and detection of flames with an asynchronous spiking image sensor. IEEE Sensors Journal, 18(16):6588–6595, Aug 2018. 1

[4] Christian Brandli, Raphael Berner, Minhao Yang, Shih-Chii Liu, and Tobi Delbruck. A 240 × 180 130 dB 3 µs latency global shutter spatiotemporal vision sensor. IEEE Journal of Solid-State Circuits, 49(10):2333–2341, 2014. 1

[5] Marco Cannici, Marco Ciccone, Andrea Romanoni, and Matteo Matteucci. A differentiable recurrent surface for asynchronous event-based data. In European Conference on Computer Vision (ECCV), pages 136–152. Springer, 2020. 6

[6] Jonghyun Choi, Kuk-Jin Yoon, et al. Learning to super resolve intensity images from events. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2768–2776, 2020. 1

[7] Eugenio Culurciello, Ralph Etienne-Cummings, and Kwabena A. Boahen. A biomorphic digital image sensor. IEEE Journal of Solid-State Circuits, 38(2):281–294, 2003. 1

[8] Tobi Delbruck, Yuhuang Hu, and Zhe He. V2E: From video frames to realistic DVS event camera streams, 2020. 3, 5

[9] Tobi Delbruck, Bernabe Linares-Barranco, Eugenio Culurciello, and Christoph Posch. Activity-driven, event-based vision sensors. In IEEE International Symposium on Circuits and Systems (ISCAS), pages 2426–2429, 2010. 1, 3

[10] Siwei Dong, Tiejun Huang, and Yonghong Tian. Spike camera and its coding methods. In Data Compression Conference (DCC), pages 437–437, 2017. 1

[11] Siwei Dong, Lin Zhu, Daoyuan Xu, Yonghong Tian, and Tiejun Huang. An efficient coding method for spike camera using inter-spike intervals. In Data Compression Conference (DCC), pages 568–568. IEEE, 2019. 1

[12] Guillermo Gallego, Tobi Delbruck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew Davison, Joerg Conradt, Kostas Daniilidis, and Davide Scaramuzza. Event-based vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2020. 3

[13] Daniel Gehrig, Henri Rebecq, Guillermo Gallego, and Davide Scaramuzza. Asynchronous, photometric feature tracking using events and frames. In European Conference on Computer Vision (ECCV), pages 750–765, 2018. 1

[14] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012. 6

[15] Menghan Guo, Jing Huang, and Shoushun Chen. Live demonstration: A 768 × 640 pixels 200Meps dynamic vision sensor. In International Symposium on Circuits and Systems (ISCAS), pages 1–1. IEEE, 2017. 1

[16] Jin Han, Chu Zhou, Peiqi Duan, Yehui Tang, Chang Xu, Chao Xu, Tiejun Huang, and Boxin Shi. Neuromorphic camera guided high dynamic range imaging. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. 1, 8

[17] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9000–9008, 2018. 6

[18] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), pages 694–711. Springer, 2016. 5

[19] Zaven Kalayjian and Andreas G. Andreou. Asynchronous communication of 2D motion information using winner-takes-all arbitration. Analog Integrated Circuits and Signal Processing, 13(1):103–109, 1997. 1

[20] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 6

[21] Christof Koch and Idan Segev. Methods in Neuronal Modeling: From Ions to Networks. MIT Press, 1998. 4

[22] Juan Antonio Lenero-Bardallo, D. H. Bryn, and Philipp Hafliger. Bio-inspired asynchronous pixel event tricolor vision sensor. IEEE Transactions on Biomedical Circuits and Systems, 8(3):345–357, June 2014. 1

[23] Martin Litzenberger, Christoph Posch, D. Bauer, Ahmed Nabil Belbachir, P. Schon, B. Kohn, and H. Garn. Embedded vision system for real-time object tracking using an asynchronous transient vision sensor. In Digital Signal Processing Workshop and Signal Processing Education Workshop, pages 173–178, 2006. 1

[24] Ana I. Maqueda, Antonio Loquercio, Guillermo Gallego, Narciso Garcia, and Davide Scaramuzza. Event-based vision meets deep learning on steering prediction for self-driving cars. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5419–5427, 2018. 1

[25] Gottfried Munda, Christian Reinbacher, and Thomas Pock. Real-time intensity-image reconstruction for event cameras using manifold regularisation. International Journal of Computer Vision, 126(12):1381–1393, 2018. 1, 7

[26] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y. Hammerla, Bernhard Kainz, et al. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018. 5

[27] Liyuan Pan, Cedric Scheerlinck, Xin Yu, Richard Hartley, Miaomiao Liu, and Yuchao Dai. Bringing a blurry frame alive at high frame-rate with an event camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6820–6829, 2019. 1


[28] Stefano Pini, Guido Borghi, and Roberto Vezzani. Learn to see by events: Color frame synthesis from event and RGB cameras. In International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2019. 1

[29] Stefano Pini, Guido Borghi, Roberto Vezzani, and Rita Cucchiara. Video synthesis from intensity and event frames. In International Conference on Image Analysis and Processing, pages 313–323. Springer, 2019. 1

[30] Christoph Posch, Daniel Matolin, and Rainer Wohlgenannt. An asynchronous time-based image sensor. IEEE International Symposium on Circuits and Systems (ISCAS), pages 2130–2133, 2008. 1

[31] Henri Rebecq, Rene Ranftl, Vladlen Koltun, and Davide Scaramuzza. Events-to-video: Bringing modern computer vision to event cameras. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3857–3866, 2019. 1, 7

[32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015. 4

[33] Cedric Scheerlinck, Nick Barnes, and Robert Mahony. Continuous-time intensity estimation using event cameras. In IEEE Asian Conference on Computer Vision (ACCV), pages 308–324. Springer, 2018. 1, 7

[34] Cedric Scheerlinck, Henri Rebecq, Daniel Gehrig, Nick Barnes, Robert Mahony, and Davide Scaramuzza. Fast image reconstruction with an event camera. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 156–163, 2020. 1, 7

[35] Chen Shoushun and Amine Bermak. Arbitrated time-to-first spike CMOS image sensor with on-chip histogram equalization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15(3):346–357, 2007. 1

[36] Raunak Sinha, Mrinalini Hoon, Jacob Baudin, Haruhisa Okawa, Rachel O. L. Wong, and Fred Rieke. Cellular and circuit mechanisms shaping the perceptual properties of the primate fovea. Cell, 168(3):413–426, 2017. 1

[37] Emma E. M. Stewart, Matteo Valsecchi, and Alexander C. Schutz. A review of interactions between peripheral and foveal vision. Journal of Vision, 20(12):2–2, 2020. 1, 2

[38] Timo Stoffregen, Cedric Scheerlinck, Davide Scaramuzza, Tom Drummond, Nick Barnes, Lindsay Kleeman, and Robert Mahony. Reducing the sim-to-real gap for event cameras. In European Conference on Computer Vision (ECCV), pages 534–549. Springer, 2020. 1

[39] Matteo Toscani, Karl R. Gegenfurtner, and Matteo Valsecchi. Foveal to peripheral extrapolation of brightness within objects. Journal of Vision, 17(9):14–14, 2017. 2

[40] Lin Wang, Yo-Sung Ho, Kuk-Jin Yoon, et al. Event-based high dynamic range image and very high frame rate video generation using conditional generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10081–10090, 2019. 6

[41] Christian Wolf and Alexander C. Schutz. Trans-saccadic integration of peripheral and foveal feature information is close to optimal. Journal of Vision, 15(16):1–1, 2015. 2, 5

[42] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In European Conference on Computer Vision (ECCV), pages 3–19, 2018. 5

[43] Woodward Yang. A wide-dynamic-range, low-power photosensor array. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 230–231, 1994. 1

[44] Song Zhang, Yu Zhang, Zhe Jiang, Dongqing Zou, Jimmy Ren, and Bin Zhou. Learning to see in the dark with events. In European Conference on Computer Vision (ECCV), pages 666–682. Springer, 2020. 1

[45] Jing Zhao, Ruiqin Xiong, and Tiejun Huang. High-speed motion scene reconstruction for spike camera via motion aligned filtering. In International Symposium on Circuits and Systems (ISCAS), pages 1–5, 2020. 1

[46] Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, and Kostas Daniilidis. Unsupervised event-based optical flow using motion compensation. In European Conference on Computer Vision (ECCV), pages 711–714. Springer, 2018. 6

[47] Lin Zhu, Siwei Dong, Tiejun Huang, and Yonghong Tian. A retina-inspired sampling method for visual texture reconstruction. In International Conference on Multimedia and Expo (ICME), pages 1432–1437, 2019. 1, 2, 4, 7

[48] Lin Zhu, Siwei Dong, Tiejun Huang, and Yonghong Tian. Hybrid coding of spatiotemporal spike data for a bio-inspired camera. IEEE Transactions on Circuits and Systems for Video Technology, 31(7):2837–2851, 2021. 1

[49] Lin Zhu, Siwei Dong, Jianing Li, Tiejun Huang, and Yonghong Tian. Retina-like visual image reconstruction via spiking neural model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1438–1446, 2020. 1, 7
