
Cognitively Inspired Real-Time Vision Core

Ozgur Yilmaz1

Department of Computer Engineering, Turgut Ozal University, Ankara, Turkey

Ismail Ozsarac

Department of Electronics Design, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,

Turkey.

Omer Gunay

Department of Electronics Design, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,

Turkey.

Huseyin Ozkan

Department of Image Processing, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011, Turkey.

Abstract

We introduce a cognitively inspired novel binary image representation and utilize it for a real-time operating computer vision core, which is capable of simultaneously detecting a specific object in an image, classifying an image region provided by an algorithm such as motion detection, and tracking multiple objects in a video. In this framework, hidden layer representations of binary receptive field neural networks are utilized to generate compact image representations for various classification based functionalities. Object detection is implemented by learning a classifier on hidden layer activities and performing a sliding window based search on the image. Moreover, a classification based object tracking algorithm is introduced that uses the proposed framework, whose tracking performance on standard datasets is shown to be comparable to state-of-the-art

Email addresses: [email protected] (Ozgur Yilmaz), [email protected] (Ismail Ozsarac), [email protected] (Omer Gunay), [email protected] (Huseyin Ozkan)

1 Corresponding author


techniques. We further propose several additional functionalities on the same core, such as sparse interest point extraction, salient motion detection, and scene recognition. With this set of capabilities in its arsenal, artificial vision is expected to perform the necessary fundamental operations in real time, paving the way for more complex inferences, such as geometric computations and cross-modal information fusion.

Keywords: Artificial Neural Networks, Cognitive Architecture, Computer

Vision, Local Binary Pattern, Hardware Implementation, Object

Classification, Object Tracking

1. Introduction

Neural network approaches in computer vision have become increasingly popular due to their remarkably higher performance in complex tasks such as large-scale classification [1] or multi-modal fusion [2]. The strength of neural networks is attributed to several mechanisms, such as unsupervised feature learning from unlabeled data (as opposed to hand-designed features) [3], hierarchical processing via deep architectures that discover higher-order relationships [4, 5], and the exploitation of long-range statistical dependencies through recurrent processing [3]. Neural networks take an approach similar to kernel methods [6]: the input is projected onto a high dimensional space of hidden units instead of basis functions, wherein a hyperplane is able to partition the data [7]. Since this lifting to high dimension yields a powerful representation of the visual data, it can naturally be utilized for several tasks, such as classification, detection, tracking, clustering, and interest point detection. Thus, after an image or a video block is "analyzed" by a neural network via multi-layer processing, the hidden layer activities that represent the visual input can be demultiplexed to many tasks, in line with the processing carried out in the human brain as theorized in cognitive neuroscience [8].

Neural networks also target real-time embedded visual processing needs [9], which have been continuously growing with the increasing demands of intelligent robotic platforms, such as Unmanned Aerial Vehicles (UAVs). These systems are expected to navigate and operate in an autonomous fashion, which entails successful implementations of image and video understanding functions. Scene recognition, detection of specific objects, classification of moving objects and object tracking are some of the essential visual functionalities in an autonomous robotic system. Weight and energy specifications of such systems restrict both the number of available functionalities and the computational complexity of the algorithms, hence diminishing the operational capacity. We hypothesize that an embedded implementation of a visual processing core serving as the common computational block for at least a subset of these functionalities can relax the aforementioned restrictions (Fig. 1). In this paper, we show that a Field Programmable Gate Array (FPGA) implementation (see [10] for VLSI architectures of artificial cognitive systems) of a neural network based vision block is an efficient and effective method for embedded visual processing. The sparse and overcomplete image representation formed in the neural network hidden layers provides versatility and discriminative power [11, 12] that is harnessed across a set of distinct processing needs. Our approach combines several computer vision modes on a common processing core, very similar to recent advances in core video processing functions (optical flow, disparity, orientation computation) [13]. Specifically, we show that object detection, classification and tracking functions can be executed on the same FPGA core in real time, which can be embedded in a robotic platform for surveillance and reconnaissance missions.

To build this common architecture, we propose a novel image representation, which is the most important contribution of the paper. We introduce a single layer neural network architecture with binary receptive fields and sparse binary responses, which greatly simplifies the computations performed on the FPGA compared to real valued neurons. The architecture resembles Local Binary Patterns (LBP) [14] but with important differences in feature computation. The neural network approach we adopt extends the LBP concept, since it additionally has the capability to generate convoluted binary patterns (convolutional multi-layer networks) and to explore long-range statistical relationships (recurrent processing).

In Section 2, we discuss the related work on neural network based classification and detection algorithms, classification based object tracking approaches, local binary patterns and FPGA implementations of neural networks. The binary receptive field based image representation is introduced in Section 3. We report our experimental findings in Section 4 and highlight possible applications of the framework by giving results on datasets. The FPGA implementation of the processing core is given in Section 5. The paper concludes with final remarks and a discussion of several future research directions in Section 6.

[Figure 1 diagram: Image → Analysis → Sparse Image Representation → Classification (scene, moving object), Detection (object, texture), Tracking, and Sparse Features (interest point and descriptor).]

Figure 1: A multi-purpose image analysis core creates a sparse and overcomplete image representation that can be utilized for various applications.

2. Related Work

2.1. Object Classification and Neural Networks

Neural network algorithms have been successfully applied to object classification problems for more than two decades [4, 15], and their superiority in performance becomes more visible as computational resources become abundant [1]. An alternative approach to neural networks is the design of discriminative visual features, and some of the most prominent coding based methods apply vector quantization using a learned dictionary of visual words [16, 17]. In a similar fashion, discriminative image representations are learned in neural network studies using RBMs [3], auto-encoders [18] and convolutional networks [19, 1]. Sparsity of the representation is also emphasized in neural network studies [20]. The main difference between neural network and classical computer vision approaches lies in the intermediate representation: a deep, hierarchical and distributed representation learned through both unsupervised pre-training and supervised backpropagation vs. manual or semi-automatic design of visual features. Recently, the efficient fusion of many different representations has been investigated [21], which improves upon individual hand-designed features. Despite the existence of successful hand designed visual features, the exploration of useful features through a neural network framework shows superior performance in recent benchmark studies [1].

Even though deep architectures are shown to prevail in a wide range of tasks, a single layer neural network with a clustering based unsupervised learning approach can show state-of-the-art classification performance [7]. In this approach, the hidden layer receptive fields of the neural network are learned offline via the k-means clustering algorithm (Fig. 2a). As an efficient vector quantization method, k-means has been widely used for codebook generation in the computer vision literature [16], and it is also shown to be a viable alternative for learning receptive fields in a neural network [7, 22]. Learning receptive fields in a single layer neural network shares similarities with codebook based computer vision approaches. However, the neural network framework holds the potential to be extended to a hierarchical distributed representation with additional layers, as well as to perform recurrent processing. Overall, neural network learning provides a rich set of mechanisms for the automatic discovery of a discriminative feature space that is not available in classical computer vision feature design frameworks.

2.2. Local Binary Patterns and Neural Networks

Local Binary Pattern (LBP) [23, 14] is an image operator that constructs a feature descriptor based on the texture patterns in the image, defined by relative pixel values. LBP is shown to be successful in many texture [24] and object classification tasks [25]. It is argued that signed differences between pixels convey most of the texture information in images and have useful invariance properties to intensity changes [26]. Selecting a subset of the available LBP patterns is shown [27] to be a good strategy, since a small subset of the patterns holds most of the discriminative power. An LBP pattern is equivalent to a specially constructed fixed binary decision tree [28], and this has been exploited for learning discriminative LBP-like patterns from the data in a supervised framework using decision tree induction algorithms. Vector quantization methods are used in [29, 26] to learn LBP patterns from the data in an unsupervised manner.

2.3. FPGA Implementation of Object Classification, Tracking and Neural Networks

Field Programmable Gate Arrays (FPGAs) are widely popular in image processing and computer vision applications due to their inherent parallelism and low power consumption [13]. Specifically, image feature extraction algorithms are well suited for FPGA hardware, and recent advances show great potential for embedded computer vision [30]. Object detection tasks [31] and primary vision operations such as binary processing [32] have been implemented successfully on FPGA chips. Recently, there has been an ongoing effort to provide multi-functional vision architectures implemented in FPGAs [33].

FPGAs are also widely used in implementations of neural network approaches (see [9, 34] for a detailed review). In general, neural network hardware implementations can be classified into three broad categories: DSP, ASIC and FPGA based. DSP-based implementations are sequential overall, although the architecture of the neurons is made as parallel as possible. ASIC implementations are not reconfigurable, even though the hardware usage is optimized. On the other hand, FPGA implementations preserve the parallel architecture of the neurons and offer flexibility in reconfiguration [34]. In [35] and [36], the benefits and obstacles of implementing a neural network in an FPGA are discussed. The implementation of a multi-input neuron with linear/nonlinear excitation functions using an FPGA is analyzed in [37]. A recent study focuses on the implementation of convolutional neural networks on FPGAs [38]. In addition to recognition and detection, object tracking algorithms have also been realized on FPGA chips [39]. However, to the best of our knowledge, a general neural network based FPGA framework that serves as a common building block for many visual functions such as classification, detection, tracking and interest point extraction has not been proposed before.

2.4. Object Detection

Modern object detection algorithms [40] originate from sliding window foundations [41] enhanced with efficient integral image representations [42] (cf. [43] for a recent survey on pedestrian detection). In principle, any classification algorithm can be utilized for object detection. In this respect, neural networks of many different types are strong candidates. Specifically, convolutional neural networks are successful for face [44], hand [45] and text [46] detection applications.

2.5. Classification based Object Tracking

Tracking can be treated as a learning/classification problem (see [47] for a recent survey), such that images that belong to the target can be discriminated from images that belong to the background [48, 49]. In this framework, the target object and the background are projected onto a discriminative feature space, and a classifier is trained to segment target and background pixels in the subsequent frames. The new location of the classified target pixels reveals the motion of the target object. This approach is advantageous over other tracking methods mainly because:

i. Interest point based tracking algorithms suffer when the target is so small that it is hard to compute sufficiently many strong features [50, 51].

ii. The classifier can be updated during tracking for plasticity, as in template matching. However, unlike other approaches, the appearance update is able to keep multiple instances of the appearance at the same time in an effective and memory efficient way, i.e., in the classifier. This makes appearance adaptation smooth and natural.

iii. Image regions are represented in a discriminative high dimensional space, where the tracked object can be vividly represented.

Furthermore, using a classification framework for tracking unifies the target detection/classification and tracking computations. The same hardware/software resources can easily be tailored to scan and detect a specific target (vehicle, person, etc.) in the whole image, or to classify a region of interest (possibly marked via motion detection) using a pre-trained classifier, and this commonalization is the main proposition of this study.

2.6. Contributions of the Study

In this study:

1. We generalize the Local Binary Patterns (LBP) framework using a neural network perspective.

2. Using the novel image representation, we introduce a real-time operating computer vision core, which is capable of detecting a specific object in an image, classifying an image region provided by another algorithm (e.g. motion detection), and tracking a specified object in a video.

3. The classification capability of the algorithm is utilized for real-time object detection purposes.

4. A classification based object tracking algorithm is introduced that uses the neural network framework, whose tracking performance on standard datasets is shown to be comparable to state-of-the-art techniques.

5. Two new aerial thermal image datasets are presented: an object tracking video set from a medium-high altitude UAV, and a ship classification image set from a low altitude UAV.2

6. An FPGA design framework for a single layer neural network is developed that can be extended to multi-layer architectures.

2 Please contact the corresponding author for access to these datasets.

Figure 2: (a) Dictionary of K (200) visual receptive fields, of size w (6), learned through k-means. Receptive fields resemble oriented Gabor filters. (b) Image patches used for the computation of hidden layer activities. Patches have size w, and the sampling period is s (also called stride), which determines the number of hidden layer neurons. (c) Image representation in the hidden neural network layer, which has K distinct maps corresponding to the different receptive fields. yijk is the kth receptive field response at location (i, j). The representation is sparse, illustrated by different sized black squares spread inside the first map.

3. Methods

A recent neural network framework [7] shows state-of-the-art classification performance despite the simplicity of its unsupervised pre-training (k-means) and supervised learning (linear SVM) phases. It is also suitable for parallel implementation in embedded hardware, i.e., an FPGA chip, after simplifications of the receptive fields and neural responses.

In the single layer neural algorithm [7], hidden layer activities are computed densely at every pixel location (the stride parameter s is equal to 1) in an image, using the pre-computed receptive field dictionary (Fig. 2b). For a patch xij at pixel location (i, j), the hidden layer activities yij are computed using the Euclidean distance function f such that

yijk = f(xij, Dk), ∀(i, j), k ∈ K,

where Dk is the kth receptive field in the dictionary, and yijk is the hidden neural activity at location (i, j) for the kth receptive field type (Fig. 2c). The result is a high dimensional representation of the image, ready to be utilized in classification tasks. For an image region, spatial pooling is applied to arrive at a feature vector representation of the region of interest with reduced dimension. In the algorithm, the output layer of the neural network is replaced with a linear support vector machine (SVM) to speed up both the training and test phases. Our analyses have shown that current FPGA circuits are insufficient for implementing this successful algorithm, hence simplifications are necessary.
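For concreteness, the following is a minimal NumPy sketch of this dense single-layer feature extraction; the function and variable names are ours, and the dictionary D is assumed to have been learned offline with k-means as described above.

```python
import numpy as np

def dense_hidden_activities(image, D, w):
    """Minimal sketch: Euclidean-distance responses y_ijk = f(x_ij, D_k) of all
    w x w patches of `image` against a K x (w*w) receptive field dictionary D."""
    H, W = image.shape
    K = D.shape[0]
    Y = np.zeros((H - w + 1, W - w + 1, K))
    for i in range(H - w + 1):
        for j in range(W - w + 1):
            patch = image[i:i + w, j:j + w].reshape(-1)
            Y[i, j] = np.linalg.norm(D - patch, axis=1)  # distance to every D_k
    return Y
```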

3.1. Binary Receptive Fields and Neural Responses

In this study, we propose to use binary receptive fields and binary response neural units in the hidden layer of a single layer neural network. Binary response neural units are commonly used in artificial neural network studies, rooted in the McCulloch and Pitts model [52], whereas binary receptive fields are uncommon in visual processing. Analysis of the FPGA resources of available chips suggests that this simplification is necessary for the FPGA implementation. This type of processing resembles LBP analyses of images. LBPs are nonlinear image filters that are equivalent to receptive field processing in neural networks [53]. There is an essential difference between the neural network and LBP approaches: distributed representation in neural networks vs. binary coding based local representation in LBPs. An image patch (say 4×4) is assigned to a specific pattern (out of 2^16 different patterns) in LBP methods, whereas the patch is represented distributedly in the hidden layer activities (K units) in neural network methods. The connection between LBPs and neural network receptive fields was not established previously. There are several advantages of this novel perspective:

1. Binary patterns are represented in a distributed manner as opposed to a local representation, which provides robustness to noise and greater representational capacity [54].

2. Binary receptive fields in a neural network enable convoluted binary pattern analyses that capture higher level image statistics via the usage of multi-layer network architectures as well as recurrent processing. A candidate architecture is proposed in [55] and a fast recurrent processing framework is given in [56].

Therefore, using binary receptive fields and units in a neural network generalizes LBP based pattern analysis approaches and enables nonlinear, multi-layered, distributed and recurrent binary pattern analyses. Additionally, this approach allows for a real-time implementation of the neural network, which would otherwise not be possible.

Figure 3: Binary receptive fields (K = 100) learned through k-means clustering. The receptive fields capture edge, corner, line segment and some other complex patterns in images.

Binary receptive fields are learned through k-means clustering on a sufficiently large number (on the order of millions) of randomly cropped image patches (Fig. 3). The image patches are first binarized by subtracting the mean from the pixel values and applying a thresholding function; k-means is then performed on the binarized patches. The cluster centers generated by k-means are also binarized to obtain the binary receptive fields. The binary receptive fields capture edge, corner, line segment and some other complex patterns in images.
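A minimal sketch of this learning step is given below, assuming scikit-learn's KMeans; the 0.5 threshold used to binarize the cluster centers is our assumption, since the text only states that the centers are binarized.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_binary_receptive_fields(patches, K=100):
    """Sketch of Fig. 3: binarize mean-subtracted patches, cluster them with
    k-means, and binarize the cluster centers to obtain binary receptive fields.
    `patches` is an (N, w*w) array of randomly cropped image patches."""
    # Binarize: subtract each patch's mean, then threshold at zero
    binary_patches = (patches - patches.mean(axis=1, keepdims=True)) > 0
    km = KMeans(n_clusters=K, n_init=10).fit(binary_patches.astype(np.float32))
    # Cluster centers are real valued; binarize them as well (assumed threshold)
    D = (km.cluster_centers_ > 0.5).astype(np.uint8)
    return D
```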

The hidden layer activities yij at pixel location (i, j) are computed from the binarized image patch xij of size w × w (see Fig. 2) and the binary receptive fields Dk (Fig. 3), using the Hamming distance h:

yijk = h(xij, Dk), ∀(i, j) and k ∈ K,

where the Hamming distance function is defined as

h(a, b) = Σ_{k=1}^{K} ‖ak − bk‖.

We also binarize the hidden layer activities with a sparsity enforcing soft assignment:

yij = max[0, sign(µ(yij) − yij)],

where µ is the corresponding averaging function over the dimension k:

µ(yij) = (1/K) Σ_{k=1}^{K} yijk.

The binarization of receptive fields and image patches enables the use of the Hamming distance instead of the Euclidean distance for computing the activations of the hidden layer neurons, which reduces the computational complexity. Additionally, the hidden layer activities are further binarized to reduce the memory requirements. The hidden layer activity yij at pixel location (i, j) is a binary feature vector of size K. The pattern of an image patch is represented distributedly in the network with this sparse binary feature vector. The sparsification sets roughly half of the hidden neurons to zero and represents a simple form of competition.
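The following sketch ties the two equations above together; it assumes the image has already been binarized patch-wise as in the receptive field learning step, and the names are ours.

```python
import numpy as np

def binary_hidden_activities(binary_image, D, w):
    """Sketch of the binary hidden layer: Hamming distance of every binarized
    w x w patch to each binary receptive field D_k, followed by the sparsity
    enforcing soft assignment (units with below-average distance fire)."""
    H, W = binary_image.shape
    K = D.shape[0]
    Y = np.zeros((H - w + 1, W - w + 1, K), dtype=np.uint8)
    for i in range(H - w + 1):
        for j in range(W - w + 1):
            patch = binary_image[i:i + w, j:j + w].reshape(-1)
            dist = np.count_nonzero(D != patch, axis=1)       # h(x_ij, D_k)
            Y[i, j] = (dist < dist.mean()).astype(np.uint8)   # max[0, sign(mu - y)]
    return Y
```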

The binary hidden layer activities yij are computed for all pixels (i, j) in the acquired image (Fig. 4). Classification of an image region requires feature computation. To compute a feature representation of a specific image region, spatial pooling is applied for dimensionality reduction. Suppose the rectangular image region S (of size M × N) is centered at (u, v) (Fig. 4); then the feature vector v of this region is

v = Q(yij), ∀(i, j) ∈ S,

where Q is the spatial pooling function that sums the hidden layer activities over each quadrant (Q1, Q2, Q3 and Q4) and concatenates the sums to obtain the feature vector v of size 4K:

Q1 = Σ_{i=1}^{M/2} Σ_{j=1}^{N/2} yij,      Q2 = Σ_{i=M/2+1}^{M} Σ_{j=1}^{N/2} yij,

Q3 = Σ_{i=1}^{M/2} Σ_{j=N/2+1}^{N} yij,      Q4 = Σ_{i=M/2+1}^{M} Σ_{j=N/2+1}^{N} yij,

Q = [Q1; Q2; Q3; Q4].

Figure 4: The spatial pooling operation applied on a region of interest S of size M × N in the image. The hidden layer representation is computed for the whole image of size X × Y. The hidden layer activities are summed over each quadrant (Q1 to Q4), and the resulting sums are concatenated to obtain a feature vector representation of the region.
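A short sketch of the pooling function Q, operating on the hidden layer volume Y computed above; the convention that (u, v) marks the center of the region is our reading of Fig. 4.

```python
import numpy as np

def quadrant_pool(Y, u, v, M, N):
    """Sum the binary hidden activities over the four quadrants of the M x N
    region of interest centered at (u, v) and concatenate the four K-dimensional
    sums into the 4K-dimensional feature vector v = Q(y_ij)."""
    top, left = u - M // 2, v - N // 2            # region origin (assumed convention)
    S = Y[top:top + M, left:left + N, :]          # region of interest, shape (M, N, K)
    mi, nj = M // 2, N // 2
    Q1 = S[:mi, :nj].sum(axis=(0, 1))
    Q2 = S[mi:, :nj].sum(axis=(0, 1))
    Q3 = S[:mi, nj:].sum(axis=(0, 1))
    Q4 = S[mi:, nj:].sum(axis=(0, 1))
    return np.concatenate([Q1, Q2, Q3, Q4])
```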

A linear SVM classifier (L2 norm) is trained (offline for object detection, online for tracking) using the feature vectors v of images. In object detection tasks, a multi-scale sliding window based search is performed. Images are downsampled for the search at larger scales to speed up processing.
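Building on the two sketches above, a hedged illustration of the sliding-window search at a single scale follows; the stride, threshold and linear scoring (w·v + b) are assumptions consistent with the parameters reported later, and multi-scale search would simply repeat this on downsampled images.

```python
import numpy as np

def sliding_window_detect(Y, svm_w, svm_b, M, N, stride=8, threshold=0.0):
    """Score every M x N window of the hidden layer volume Y with a pre-trained
    linear SVM (weights svm_w, bias svm_b) and keep the windows whose score
    exceeds the threshold. Larger scales are handled by downsampling the image."""
    H, W, _ = Y.shape
    detections = []
    for top in range(0, H - M + 1, stride):
        for left in range(0, W - N + 1, stride):
            feat = quadrant_pool(Y, top + M // 2, left + N // 2, M, N)
            score = float(np.dot(svm_w, feat) + svm_b)
            if score > threshold:
                detections.append((top, left, score))
    return detections
```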

4. Experiments

In the experiments we examined:

1. the drop in object classification and detection performance due to the algorithmic simplification,

2. the performance difference between the single layer binary neural network algorithm and local binary patterns,

3. the usage of the framework for different applications, the details of which are given in a separate supplementary document.

4.1. Classification with Binary Receptive Fields and Responses

The receptive fields and hidden layer neural responses are binarized to reduce the computational and memory costs of the algorithm.3 This simplification enables the implementation of a real-time operating classification algorithm on a currently available FPGA chip. We observe that the classification performance of the simplified algorithm does not change dramatically (Fig. 5), even though the computational complexity and memory load are greatly reduced. In Fig. 5, we show the classification accuracy of the original, binary receptive field, and binary receptive field + binary neural response algorithms on the CIFAR 10 dataset [57]. We obtain a significant computational improvement, and hence real-time operation, for only a 10% accuracy drop (full binarization, i.e., Binary 2 in Fig. 5). The reason this drop in classification performance is moderate follows from previous observations in local binary pattern studies [26]: signed differences between pixels convey most of the texture information in images.

[Figure 5 plot: classification accuracy (%) vs. number of RFs (K = 100, 200, 400) for Real, Binary 1 and Binary 2.]

Figure 5: Classification performance on the CIFAR 10 dataset as a function of cluster size, for real neurons (original algorithm) and for the two types of binarization in the network.

3 The parameters used in all experiments are given in the Appendix.


Figure 6: Classification performance comparison of the LBP and Binary RF Neural Network approaches on the CIFAR 10 and Flicker Material Database (FMD) datasets. The numbers in parentheses after the dataset names are the number of spatial blocks in the image used for summation (histogram calculation for LBP).

4.2. Comparison with Local Binary Patterns

We propose that the utilization of binary receptive fields introduces binary pattern analyses using neural networks. This perspective provides a generalization of LBP, enabling distributed, multi-layered, recurrent computation, which is non-existent in LBP studies. In this section we compare the performance of the single layer neural network with the rotation invariant LBP descriptor [14, 58], and show that they are comparable. The size of the LBP filter is 3 × 3, there are a total of 58 filters, and the feature vector is formed by taking a histogram over the four image quadrants, similar to the neural network algorithm. The neural network receptive field size and the parameter K are chosen accordingly; the receptive fields are binary but the neural responses are real. Thus the comparison between the algorithms is completely fair. The difference mainly lies in the sparse distributed vs. local representation of the binary pattern, and in the rotation invariance property of LBPs, which does not exist in the neural network hidden layer activities. The classification performance of the two algorithms is compared on the CIFAR 10, Flicker Material Database [59]4 and PASCAL VOC (person subset) datasets.

4 Texture classification is the strong suit of LBP algorithms, and the Flicker Material Database is considered one of the hardest classification datasets. 75 percent of the data is used for training. Image resolution is 128 by 128.

The results are given in Figure 6. It is observed that LBP and the binary receptive field single layer neural network give similar performance on the three datasets. However, the neural network algorithm has a vast amount of room for improvement, such as the addition of layers, the introduction of recurrent connections, etc. This difference in potential for improvement is crucial for the comparison of the two frameworks. Overall, the classification performance is close to the state of the art for the FMD and VOC datasets, but there is a larger gap for the CIFAR 10 dataset; this underfitting can be attributed to the use of a very small number of receptive fields (i.e., 58) compared to other algorithms.
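For reference, one way to reproduce the LBP side of this comparison is sketched below using scikit-image's uniform LBP (59 bins for P = 8, of which 58 are uniform patterns), histogrammed over the four quadrants as in the neural network descriptor; the exact LBP variant and normalization used in the experiments may differ.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_quadrant_descriptor(gray):
    """Approximate comparison descriptor: 3x3 (P=8, R=1) uniform LBP codes,
    histogrammed over the four image quadrants and concatenated."""
    codes = local_binary_pattern(gray, P=8, R=1, method='nri_uniform')
    n_bins = 59                                   # 58 uniform patterns + 1 catch-all bin
    H, W = gray.shape
    quads = [(slice(0, H // 2), slice(0, W // 2)), (slice(H // 2, H), slice(0, W // 2)),
             (slice(0, H // 2), slice(W // 2, W)), (slice(H // 2, H), slice(W // 2, W))]
    hists = [np.histogram(codes[q], bins=n_bins, range=(0, n_bins))[0] for q in quads]
    return np.concatenate(hists)
```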

4.3. Applications of the Framework

We applied the framework for classification, detection and tracking purposes. The details of the experiments are given in a supplementary material document. We examine the tracking performance on standard datasets and report state-of-the-art results. The ship classification, detection from aerial images and thermal tracking applications are specifically designed for the RECONSURVE project (ITEA 2), and the datasets are provided to other researchers.

5. FPGA Implementation

5.1. IP Core

We have performed a preliminary FPGA design5 for the proposed algorithm. The implementation in the FPGA is realized as an IP core (Fig. 7). Due to the inherent parallelism:

1. The feature volume (X × Y × D) (Fig. 2c) can be constructed for the whole video frame.

5 The amount of detail we provide in this paper is enough to have an FPGA implementation. For our case, VHDL coding and a more detailed design will be performed after the circuit board is finalized.


[Figure 7 block diagram: the IP core receives VIDEO, FEATURE DICTIONARY, CLASS MATRIX and FEATURE CALCULATION REQUESTS; the IMAGE ANALYZER (FEATURE EXTRACTION, FEATURE SUMMATION, CLASSIFICATION) and the MEMORY INTERFACE produce FEATURE VECTORS and CLASS LABELS.]

Figure 7: FPGA IP core structure. The memory interface handles data read/write operations with the external memories. There are three main stages in the Image Analyzer block. The feature dictionary and the classification matrix are pre-uploaded into the FPGA memory. Video frames and feature calculation requests are constantly fed into the core during operation.

2. The feature vector v can be calculated for many image regions (e.g. sliding windows for object detection) simultaneously.

3. The calculated feature vectors can be classified in parallel.

The parallelism supplied by the FPGA core is exploited for search based object detection and tracking applications. After computing the feature volume (Fig. 2c), the FPGA receives the coordinates of image regions from a CPU; these are queries that need to be classified according to the pre-uploaded classifier (matrix). The FPGA outputs the class labels of the regions of interest (e.g. sliding window object detection queries or object tracking queries), and also the feature vectors of the query regions. The core consists of two main sub-blocks: the Image Analyzer and the Memory Interface. The Memory Interface is responsible for data transfer between the Analyzer and the external memories. The Image Analyzer block consists of three sub-blocks (Fig. 7): Feature Extraction, Feature Summation and Classification. The Image Analyzer block receives four types of inputs: video, feature dictionary, class matrix and feature calculation requests. The feature dictionary and the classification matrix are pre-uploaded into the FPGA memory. Video frames and feature calculation requests are constantly fed into the core during operation. The resolution of the video is X (rows) by Y (columns) and the frame rate is considered to be higher than 10 Hz.6
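Purely as an illustration of this data flow (not the actual driver API), the host-side interaction with the IP core could look like the following sketch; all function names here are hypothetical.

```python
def analyze_frame(core, frame, queries):
    """Hypothetical host-side view of one frame: the feature dictionary and the
    class matrix are assumed to have been uploaded once at start-up.
    `queries` is a list of region coordinates (u, v, M, N) to be classified,
    e.g. sliding windows, tracker windows or motion-detection hypotheses."""
    core.write_frame(frame)                 # feature volume is built on the chip
    core.write_requests(queries)            # feature calculation requests
    labels = core.read_class_labels()       # one class label per query region
    features = core.read_feature_vectors()  # 4K-dim descriptors, if needed on the CPU
    return labels, features
```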

[Figure 8 diagram: dedicated hardware connecting the STRATIX V (5SGXA7) FPGA to DDR3 SDRAM (64 bit @ 600 MHz) and to the CPU over a 2.5 Gbps PCIe link.]

Figure 8: The algorithm is implemented on dedicated hardware using an Altera Stratix V (5SGXA7) FPGA chip. The board consists of memory, a CPU and the FPGA chip, providing fast communication among the components.

5.2. Hardware Usage and Timing

The preliminary FPGA design of the algorithm is done on a Stratix V (5SGXA7) Altera based dedicated hardware (Fig. 8). Table 1 shows the FPGA and hardware resource usage7 of the object detection implementation for a 640×512 @ 50 Hz video rate, an RF size of 4 pixels, and 20,000 sliding windows of 64 × 64 pixels, which corresponds to at least 5 different scales (1.3, 1.2, 1, 0.8, 0.6) of exhaustive object search with an 8 pixel shift between windows. Therefore, an effective and fast (> 10 Hz) embedded object detection framework can be constructed using a fraction of an FPGA chip, which can be deployed on UAVs. Notice that object detection and object tracking (> 20 objects) can be simultaneously executed using less than 20% of the FPGA resources, the rest of which can be utilized for other tasks such as salient motion detection, sparse feature extraction and visual odometry.

6 See the supplementary materials for the details.

7 See the supplementary materials for the detailed timing and hardware usage analysis.

The saliency of the detected motion/change [60] can be determined using the same classification framework. Moving pixels can be analyzed for saliency (detection of a pre-determined object class) in a multi-scale manner via appropriate CPU-FPGA communication, for less than 10% of the FPGA resources. Sparse feature extraction requires further computation on the hidden layer activities of individual pixels; however, the computational complexity of this additional stage is predicted to be low.

The timing and delay analysis shows that the system operates in real time without any frame delay. More importantly, for less than 30% of the FPGA resources, object detection, object tracking (of multiple objects), salient motion detection and scene recognition can be executed with satisfactory accuracy (see Fig. 5). Sparse feature extraction needs to be worked out and analyzed to understand its resource usage.

It should be noted that the overall performance of the FPGA friendly algorithm is low compared to the state-of-the-art [61, 62] performance on the CIFAR 10 dataset (see Fig. 5). However, the computational load of these algorithms is very high due to the sparsity enforcing regularization techniques that they utilize. Real-time operation of these algorithms on embedded hardware does not seem realistic. Our analyses show that, even with one of the latest and most powerful chips (Stratix V), it is not possible to use real valued neurons for the high performing single layer convolutional network [7]. There are two major reasons for this restriction: 1. The number of multipliers needed to compute the hidden layer neuron activity of a real valued neuron cannot be supplied by the FPGA; in contrast, the bitwise OR that we utilize is an abundant resource. 2. The external memory bandwidth is almost fully occupied when a real valued neuron is used, whereas we decrease the usage to one eighth by binarization. Additionally, in commercial products a high end FPGA such as the Stratix V is not an economical choice, and then the restrictions become even harsher.
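The first point can be made concrete with a small software illustration: the text above refers to bitwise logic (OR); here the Hamming distance between packed binary vectors is shown as a bitwise XOR followed by a popcount, single-bit operations that replace per-pixel multiplications. The exact gate-level mapping used in our design may of course differ from this sketch.

```python
def packed_hamming(patch_bits, rf_bits):
    """Hamming distance between two packed binary patterns held in Python ints:
    a bitwise XOR marks the differing positions and a popcount counts them."""
    return bin(patch_bits ^ rf_bits).count("1")

# Example with two 16-bit (4 x 4) binary patches
a = int("1010110010110001", 2)
b = int("1010010010111001", 2)
print(packed_hamming(a, b))  # -> 2 differing bits
```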

Nevertheless, the performance of our system can be improved by using a larger number of receptive fields, and hence more FPGA resources; however, it should be noted that the FPGA resources will most likely be shared with other processing threads in an embedded vision system. Therefore, the performance-resource trade-off should be considered when choosing the number of receptive fields. Finally, even though there are other parallel processing options such as ASIC or DSP, we are unable to compare our design for accuracy and speed with these options, mainly because we could not find studies that reported performance on challenging datasets such as CIFAR 10. GPU-CPU performance is reported in [61], but that is not embedded hardware suitable for robotic applications. In summary, a performance-wise comparison of our FPGA design with existing hardware architectures does not seem possible; however, we compare the performance of our algorithms with the algorithm design literature.

Table 1: FPGA and Hardware Resource Usage Summary

Property                 Used          Available     Occupied (%)
Logic                    30,000        622,000       5
Internal RAM             250 × M20K    2560 × M20K   10
Multiplier (18 × 18)     40            512           8
Ext. Memory Bandw.       6 Gbps        75 Gbps       8
CPU Interface Bandw.     0.1 Gbps      2.5 Gbps      4

6. Discussion and Future Work

In this paper, we provided an embedded visual processing core that is able to serve as the main machinery for the various applications a robotic platform needs for artificial vision. The commonalization of the different functions is achieved by an analysis stage, as is done in the primary visual cortex. The overcomplete and sparse representation of small image patches due to the neural network receptive fields allows for an efficient and discriminative description of larger regions of interest. We show that classification, detection and tracking can be done on the same core in real time, using a fraction of the FPGA chip resources. We further propose several additional functionalities on the same core as future work, such as sparse interest point extraction, salient motion detection and scene recognition. With this set of capabilities in the arsenal, artificial vision is expected to perform the necessary fundamental operations on the image in real time on a low power chip, paving the way for more complex inferences, such as geometric computations (3D reconstruction, stitching, visual odometry) and cross-modal information fusion (scene-object recognition, segmentation-recognition, scene geometry-texture/reflectance, dorsal-ventral [63]).

Furthermore, we show that the binary receptive field approach we propose generalizes the successful Local Binary Pattern (LBP) features and enables nonlinear, multi-layered, distributed and recurrent binary pattern analyses. This novel perspective on binary patterns needs further investigation.

We introduced a classification based tracking algorithm and its FPGA implementation, which shows good performance on standard datasets. There are several advantages of this approach over the others:

1. In classification based tracking, a high level object search can be utilized in case of occlusion if the tracked object can be labeled within an a priori known class tree (vehicle → truck → brand, or animal → dog → terrier). Therefore, in this tracking framework it is possible to arrive at a high level description of the tracked object and use it to recover tracking in case of failures (see the online update in multi-task tracking [64]).

2. This tracking framework can easily be tailored for active visual learning [65], in which a learning system becomes able to sample multiple views of the same object via tracking.

Why did we focus on a specific single layer neural network algorithm [7]? It is simple to train and successful in object classification, hence it fares well under Occam's razor. However, it is still a starting point, after which there are several avenues of improvement. We plan to enhance the capability with sparse feature extraction, scene recognition and salient motion detection add-ons, none of which are predicted to occupy large amounts of valuable embedded space or processing time. Some other future directions are:

1. Having more convolutions: the single layer architecture of our system can be extended to have multiple convolutional layers [22, 15, 38]. In fact, the true power of neural network approaches is not tapped if only a single layer network is used, even though it is shown to perform very well in classification [7]. However, we propose that the raison d'etre of convolutional layers in the human visual cortex is the diversity of tasks that need to be performed for ecological vision. Some of these tasks require precise localization and low level feature analysis (e.g. simultaneous classification and pose estimation), whereas others require generalization power and higher order statistical dependence (e.g. emotion recognition). Therefore, the "output layer" is not always the last layer of the network, but should vary with the requirements of the task. For the first class of tasks, lower layer network activities are appropriate, while the second class of tasks requires the activities of invariant detectors residing in higher layers. Some tasks might even require a much more complicated cascade, one layer after another. Therefore, we envision a demultiplexing approach that probes and routes the activities of a layer for a specific task. A theoretical and empirical exploration of this matching between layers and tasks is important for building a multi-purpose neural network architecture. It is also essential to note that the holy grail of this kind of perspective is recurrent processing, in which high level hypotheses are fed back into lower layers for consistency in a theoretically tractable manner. The authors are currently working on recurrent neural networks that are suited for the FPGA design paradigm we adopted.

2. Using a volume of images to obtain motion sensitive receptive fields and video processing (activity recognition, motion detection, classification based egomotion estimation, etc.). This second block will serve as the motion processing pathway of our system, and the famous cortical duality (Magno-Parvo, Dorsal-Ventral [66]) will be completed, ready to enrich the system with inter-pathway interactions [63].

Good features to track and good descriptors to match are two fundamental problems in computer vision [67, 68]. A sparse feature detection, description and matching framework enables fast and accurate algorithms for several applications such as image alignment, 3D reconstruction, visual odometry and object recognition. There are efficient FPGA implementations of successful features [30]; however, the overcomplete and discriminative neural network representation should be sufficient to define keypoints and descriptors. There is ongoing research on cognitively inspired feature extraction methods [69]. Indeed, the authors are conducting ongoing research on performing keypoint detection and description on the hidden layer representation of images (Fig. 1 and 2c), which will provide the capability to extract sparse features to be used for homography estimation as well as visual odometry. We predict that the detection and extraction computations will occupy a small amount of additional space and time.

7. Conclusion

We predict that in the near future, FPGA implementations of neural network algorithms will enable high performance, multi-functional and modular processing architectures. In this undertaking, it is important to seek the most compact and powerful neural representation of the visual data, as well as its versatility across several different tasks. Moreover, the dual pathway theory of the visual cortex should be considered when designing artificial systems, since the cortical architecture has been an inspiration for many successful algorithms.

Appendix A. Parameters

The common parameters in all experiments are as follows. The patch (also called receptive field) size w is set to 4 pixels. The number of random patches extracted for unsupervised receptive field learning is one million. The maximum number of iterations in SVM training is one thousand.

In the crossroad detection experiments, the number of RFs (K) is 800. The scale space consists of 4 images with scale factors 1.2x, 1x, 0.8x and 0.6x. The sliding window sampling step is 8 pixels at all scales. The dataset images are all resized to 32 × 32 pixels.

The ship classification experiments use 100 RFs with w = 4, and the dataset images are all resized to 64 × 64 pixels.


In the object tracking experiments, K is 100 and w is 4. All the target and background images are resized to 16 × 16 pixels. Versions of the target shifted by plus/minus 2 pixels are also considered as the target, summing to a total of 25 images for the target class. The background images are cropped from the background region according to a sampling pattern defined by the following shift vector (plus and minus): [3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 24 26 28 30 32 35 38 41 45]. Therefore, there are 3025 images for the background class. The maximum number of iterations in SVM training is 10, except in the first frame, where it is 100. If the projection of the newly trained hyperplane onto the previous hyperplane is less than 0.9, the training is rejected. During detection, the target is searched within a 30 pixel neighborhood of the previously known location. A very simple filtering is applied to tighten the target matching: if the projection of the feature vector onto the hyperplane is less than a threshold (0.1), the detection is rejected. Furthermore, 2 target detections are required in order to resume tracking, otherwise track loss is declared.
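To make the protocol above concrete, a minimal sketch of one tracking iteration is given below; it reuses the quadrant_pool sketch from Section 3, assumes an sklearn-style linear classifier exposing decision_function, samples candidate windows on a coarse grid rather than exhaustively, and leaves the online update, hyperplane-projection check and track-loss logic to the surrounding loop.

```python
def track_step(Y, clf, prev_uv, M=16, N=16, search_radius=30, stride=2, score_thr=0.1):
    """One tracking iteration: score candidate M x N windows within a 30 pixel
    neighborhood of the previous target location and return the best location,
    or None if no window clears the decision threshold."""
    H, W, _ = Y.shape
    (u0, v0), best, best_score = prev_uv, None, score_thr
    for du in range(-search_radius, search_radius + 1, stride):
        for dv in range(-search_radius, search_radius + 1, stride):
            u, v = u0 + du, v0 + dv
            if not (M // 2 <= u <= H - M // 2 and N // 2 <= v <= W - N // 2):
                continue                      # window would fall outside the volume
            feat = quadrant_pool(Y, u, v, M, N)
            score = clf.decision_function(feat.reshape(1, -1))[0]
            if score > best_score:
                best, best_score = (u, v), score
    return best  # caller declares track loss after repeated detection failures
```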

In experiments where we compare LBP and neural network performances,

the parameters are given in the text.

Acknowledgment

This research is supported by The Scientific and Technological Research

Council of Turkey (TUBITAK) Career Grant, No: 114E554. We would like

to thank RECONSURVE project (ITEA 2) for providing the ship dataset, and

also ASELSAN Inc. for providing the thermal tracking and crossroad detection

datasets.

References

[1] A. Krizhevsky, I. Sutskever, G. Hinton, Imagenet classification with deep

convolutional neural networks, in: Advances in Neural Information Pro-

cessing Systems 25, 2012, pp. 1106–1114.


[2] N. Srivastava, R. Salakhutdinov, Multimodal learning with deep boltzmann

machines, in: Advances in Neural Information Processing Systems 25, 2012,

pp. 2231–2239.

[3] G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep

belief nets, Neural computation 18 (7) (2006) 1527–1554.

[4] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning ap-

plied to document recognition, Proceedings of the IEEE 86 (11) (1998)

2278–2324.

[5] M. Ranzato, F. J. Huang, Y.-L. Boureau, Y. Lecun, Unsupervised learning

of invariant feature hierarchies with applications to object recognition, in:

Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Confer-

ence on, IEEE, 2007, pp. 1–8.

[6] B. Scholkopf, A. J. Smola, Learning with kernels, The MIT Press, 2002.

[7] A. Coates, A. Y. Ng, H. Lee, An analysis of single-layer networks in unsu-

pervised feature learning, in: International Conference on Artificial Intelli-

gence and Statistics, 2011, pp. 215–223.

[8] T. S. Lee, D. Mumford, R. Romero, V. A. Lamme, The role of the primary

visual cortex in higher level vision, Vision research 38 (15) (1998) 2429–

2454.

[9] J. Misra, I. Saha, Artificial neural networks in hardware: A survey of two

decades of progress, Neurocomputing 74 (1) (2010) 239–255.

[10] G. Indiveri, E. Chicca, R. J. Douglas, Artificial cognitive systems: from vlsi

networks of spiking neurons to neuromorphic cognition, Cognitive Compu-

tation 1 (2) (2009) 119–127.

[11] Y.-L. Boureau, F. Bach, Y. LeCun, J. Ponce, Learning mid-level features

for recognition, in: Computer Vision and Pattern Recognition (CVPR),

2010 IEEE Conference on, IEEE, 2010, pp. 2559–2566.


[12] N. W. Tay, C. K. Loo, M. Perus, Face recognition with quantum associative

networks using overcomplete gabor wavelet, Cognitive Computation 2 (4)

(2010) 297–302.

[13] M. Tomasi, M. Vanegas, F. Barranco, J. Daz, E. Ros, Massive parallel-

hardware architecture for multiscale stereo, optical flow and image-

structure computation, Circuits and Systems for Video Technology, IEEE

Transactions on 22 (2) (2012) 282–294.

[14] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and ro-

tation invariant texture classification with local binary patterns, Pattern

Analysis and Machine Intelligence, IEEE Transactions on 24 (7) (2002)

971–987.

[15] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, T. Poggio, Robust object

recognition with cortex-like mechanisms, Pattern Analysis and Machine

Intelligence, IEEE Transactions on 29 (3) (2007) 411–426.

[16] L. Fei-Fei, P. Perona, A bayesian hierarchical model for learning natu-

ral scene categories, in: Computer Vision and Pattern Recognition, 2005.

CVPR 2005. IEEE Computer Society Conference on, Vol. 2, IEEE, 2005,

pp. 524–531.

[17] M. Jiu, C. Wolf, C. Garcia, A. Baskurt, Supervised learning and codebook

optimization for bag-of-words models, Cognitive Computation 4 (4) (2012)

409–419.

[18] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and com-

posing robust features with denoising autoencoders, in: Proceedings of the

25th international conference on Machine learning, ACM, 2008, pp. 1096–

1103.

[19] K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun, What is the best multi-

stage architecture for object recognition?, in: Computer Vision, 2009 IEEE

12th International Conference on, IEEE, 2009, pp. 2146–2153.


[20] H. Lee, C. Ekanadham, A. Ng, Sparse deep belief net model for visual

area v2, in: Advances in neural information processing systems, 2007, pp.

873–880.

[21] J. Yu, Y. Rui, Y. Tang, D. Tao, High-order distance-based multiview

stochastic learning in image classification., IEEE transactions on cyber-

netics 44 (12) (2014) 2431.

[22] A. Coates, A. Ng, Selecting receptive fields in deep networks, in: Advances

in Neural Information Processing Systems, 2011, pp. 2528–2536.

[23] T. Ojala, M. Pietikainen, D. Harwood, A comparative study of texture

measures with classification based on featured distributions, Pattern recog-

nition 29 (1) (1996) 51–59.

[24] Z. Guo, L. Zhang, D. Zhang, A completed modeling of local binary pattern

operator for texture classification, Image Processing, IEEE Transactions

on 19 (6) (2010) 1657–1663.

[25] A. Suruliandi, K. Meena, R. Reena Rose, Local binary pattern and its

derivatives for face recognition, Computer Vision, IET 6 (5) (2012) 480–

488.

[26] T. Ojala, K. Valkealahti, E. Oja, M. Pietikainen, Texture discrimination

with multidimensional distributions of signed gray-level differences, Pattern

Recognition 34 (3) (2001) 727–739.

[27] M. Topi, O. Timo, P. Matti, S. Maricor, Robust texture classification by

subsets of local binary patterns, in: Pattern Recognition, 2000. Proceed-

ings. 15th International Conference on, Vol. 3, IEEE, 2000, pp. 935–938.

[28] D. Maturana, D. Mery, A. Soto, Face recognition with decision tree-based

local binary patterns, in: Computer Vision–ACCV 2010, Springer, 2011,

pp. 618–629.


[29] T. Ahonen, M. Pietikainen, Image description using joint distribution of

filter bank responses, Pattern Recognition Letters 30 (4) (2009) 368–376.

[30] T. Chang, L. Chiu, J. Chen, N. Chang, Fast sift design for real-time visual

feature extraction, Image Processing, IEEE Transactions on.

[31] N. Farrugia, F. Mamalet, S. Roux, F. Yang, M. Paindavoine, Fast and

robust face detection on a parallel optimized architecture implemented on

fpga, Circuits and Systems for Video Technology, IEEE Transactions on

19 (4) (2009) 597–602.

[32] B. Zhang, K. Mei, N. Zheng, Reconfigurable processor for binary image

processing, Circuits and Systems for Video Technology, IEEE Transactions

on 23 (5) (2013) 823–831.

[33] C. Desmouliers, E. Oruklu, S. Aslan, J. Saniie, F. Vallina, Image and video

processing platform for field programmable gate arrays using a high-level

synthesis, IET Computers & Digital Techniques 6 (6) (2012) 414–425.

[34] A. R. Omondi, J. C. Rajapakse, FPGA implementations of neural networks,

Vol. 365, Springer New York, NY, USA:, 2006.

[35] B. Girau, Neural networks on fpgas: a survey.

[36] L. P. Maguire, T. M. McGinnity, B. Glackin, A. Ghani, A. Belatreche,

J. Harkin, Challenges for large-scale implementations of spiking neural net-

works on fpgas, Neurocomputing 71 (1) (2007) 13–29.

[37] A. Muthuramalingam, S. Himavathi, E. Srinivasan, Neural network im-

plementation using fpga: issues and application, International journal of

information technology 4 (2) (2008) 86–92.

[38] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini, P. Ak-

selrod, S. Talay, Large-scale FPGA-based convolutional networks, Cam-

bridge, UK: Cambridge University Press, 2011.


[39] J. Schlessman, C.-Y. Chen, W. Wolf, B. Ozer, K. Fujino, K. Itoh, Hard-

ware/software co-design of an fpga-based embedded tracking system, in:

Computer Vision and Pattern Recognition Workshop, 2006. CVPRW’06.

Conference on, IEEE, 2006, pp. 123–123.

[40] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object de-

tection with discriminatively trained part-based models, Pattern Analysis

and Machine Intelligence, IEEE Transactions on 32 (9) (2010) 1627–1645.

[41] C. Papageorgiou, T. Poggio, A trainable system for object detection, In-

ternational Journal of Computer Vision 38 (1) (2000) 15–33.

[42] P. Viola, M. J. Jones, Robust real-time face detection, International journal

of computer vision 57 (2) (2004) 137–154.

[43] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: An eval-

uation of the state of the art, Pattern Analysis and Machine Intelligence,

IEEE Transactions on 34 (4) (2012) 743–761.

[44] C. Garcia, M. Delakis, Convolutional face finder: A neural architecture for

fast and robust face detection, Pattern Analysis and Machine Intelligence,

IEEE Transactions on 26 (11) (2004) 1408–1423.

[45] S. J. Nowlan, J. C. Platt, A convolutional neural network hand tracker,

Advances in Neural Information Processing Systems (1995) 901–908.

[46] M. Delakis, C. Garcia, Text detection with convolutional neural networks, in: VISAPP (2), 2008, pp. 290–294.

[47] Q. Liu, X. Zhao, Z. Hou, Survey of single-target visual tracking methods

based on online learning, IET Computer Vision.

[48] S. Avidan, Ensemble tracking, Pattern Analysis and Machine Intelligence,

IEEE Transactions on 29 (2) (2007) 261–271.


[49] B. Babenko, M.-H. Yang, S. Belongie, Robust object tracking with on-

line multiple instance learning, Pattern Analysis and Machine Intelligence,

IEEE Transactions on 33 (8) (2011) 1619–1632.

[50] D. Serby, E. Meier, L. Van Gool, Probabilistic object tracking using multi-

ple features, in: Pattern Recognition, 2004. ICPR 2004. Proceedings of the

17th International Conference on, Vol. 2, IEEE, 2004, pp. 184–187.

[51] O. Yilmaz, Oscillatory synchronization model of attention to moving ob-

jects, Neural Networks 29 (2012) 20–36.

[52] W. S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in

nervous activity, The Bulletin of Mathematical Biophysics 5 (4) (1943)

115–133.

[53] T. S. Lee, Image representation using 2d gabor wavelets, Pattern Analysis

and Machine Intelligence, IEEE Transactions on 18 (10) (1996) 959–971.

[54] G. Hinton, J. McClelland, D. Rumelhart, Distributed representations, in:

Parallel distributed processing: explorations in the microstructure of cog-

nition, vol. 1, MIT Press, 1986, pp. 77–109.

[55] O. Yilmaz, Connectionist-symbolic machine intelligence using cellular

automata based reservoir-hyperdimensional computing, arXiv preprint

arXiv:1503.00851.

[56] O. Yilmaz, Classification of occluded objects using fast recurrent process-

ing, under review, Neural Computing and Applications.

[57] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny

images, Master’s thesis, Department of Computer Science, University of

Toronto.

[58] A. Vedaldi, B. Fulkerson, VLFeat: An open and portable library of com-

puter vision algorithms, http://www.vlfeat.org/ (2008).670

30

[59] L. Sharan, R. Rosenholtz, E. Adelson, Material perception: What can you

see in a brief glance?, Journal of Vision 9 (8) (2009) 784–784.

[60] N. Goyette, P. Jodoin, F. Porikli, J. Konrad, P. Ishwar, Changedetection.

net: A new change detection benchmark dataset, in: Computer Vision and

Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society675

Conference on, IEEE, 2012, pp. 1–8.

[61] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, R. Fergus, Regularization of neu-

ral networks using dropconnect, in: Proceedings of the 30th International

Conference on Machine Learning (ICML-13), 2013, pp. 1058–1066.

[62] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio,680

Maxout networks, arXiv preprint arXiv:1302.4389.

[63] T. Schenk, R. D. McIntosh, Do we have independent visual streams for

perception and action?, Cognitive Neuroscience 1 (1) (2010) 52–62.

[64] H. Liu, F. Sun, Y. Yu, Multitask extreme learning machine for visual track-

ing, Cognitive Computation 6 (3) (2014) 391–404.685

[65] B. Settles, Active learning literature survey, University of Wisconsin, Madi-

son.

[66] M. A. Goodale, A. D. Milner, Separate visual pathways for perception and

action, Trends in neurosciences 15 (1) (1992) 20–25.

[67] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors,690

Pattern Analysis and Machine Intelligence, IEEE Transactions on 27 (10)

(2005) 1615–1630.

[68] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas,

F. Schaffalitzky, T. Kadir, L. Van Gool, A comparison of affine region

detectors, International journal of computer vision 65 (1-2) (2005) 43–72.695

31

[69] S. Kim, S. Kwon, I. S. Kweon, A perceptual visual feature extraction

method achieved by imitating v1 and v4 of the human visual system, Cog-

nitive Computation 5 (4) (2013) 610–628.

32

Supplementary Material: Applications of the Framework

Ozgur Yilmaz1

Department of Computer Engineering, Turgut Ozal University, Ankara, Turkey

Ismail Ozsarac

Department of Electronics Design, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,

Turkey.

Omer Gunay

Department of Electronics Design, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,

Turkey.

Huseyin Ozkan

Department of Image Processing, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011, Turkey.

Abstract

In this supplementary material, we provide the details of the experiments performed to demonstrate different applications of the proposed framework.

1. Detection of Crossroads from Aerial Images

We have tested the ability of the proposed system to detect objects in high altitude aerial images. This is an essential capability for an autonomous UAV, in tandem with tracking of the detected objects, and the loss in detection capability due to the simplification of the FPGA neural network algorithm needs to be quantified. Crossroad objects are used for the tests; their detection is a challenging task due to appearance variability, clutter and occlusion. Even though crossroads are an important source of information for road detection/tracking purposes, they are not exploited often enough in these studies (see [1]). 600 annotated crossroad images were used in the training, and the positive image set was further populated by including artificially rotated, scaled and translated versions of the crossroad images [2]. There are about 120,000 crossroad class instances and 250,000 background class instances randomly cropped from the images (please contact the corresponding author for access to the dataset). Every image in the training set was first resized to a 32 x 32 pixel square. The receptive fields are learned using k-means on this dataset of resized images; feature vectors, v, are then extracted with the learned receptive fields, and a linear SVM classifier is trained on the set of labeled feature vectors. The tests were performed on 13 separate images of size 1000 x 1000, using a multi-scale sliding window method (4 scales: 1.2, 1, 0.8 and 0.6 scale factors). We use conservative criteria for the precision-recall measures: a detection declared by the algorithm is accepted as correct if its center falls inside a crossroad object region, and a crossroad object is counted as detected if its center falls inside a declared detection. The precision-recall curves for the real and binary neural networks are given in Fig. 1a, and a sample image from the test set is shown in Fig. 1b and Fig. 1c for the real and binary networks, respectively. As mentioned in Section 5.2 of the main text, the multi-scale detection algorithm can run on a fraction of the FPGA chip. The loss in detection performance due to binarization (from 50 percent precision to 40 percent precision) follows what is observed in the classification experiments (Fig. 5 in the main text).
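As an illustration of the multi-scale sliding window search used here, the following is a minimal CPU-side sketch (not the FPGA implementation); the extract_feature callback is a placeholder for the hidden-layer feature computation, and the window size, stride and scale values are illustrative.

import cv2  # used only for rescaling the image

def detect_multiscale(image, clf, extract_feature, win=32, stride=8,
                      scales=(1.2, 1.0, 0.8, 0.6)):
    """Slide a win x win window over the image at several scales, classify each
    window with the trained linear SVM clf, and return detection centers in the
    original image coordinates."""
    detections = []
    for s in scales:
        scaled = cv2.resize(image, None, fx=s, fy=s)
        for y in range(0, scaled.shape[0] - win + 1, stride):
            for x in range(0, scaled.shape[1] - win + 1, stride):
                v = extract_feature(scaled[y:y + win, x:x + win])
                # keep windows falling on the positive side of the SVM hyperplane
                if clf.decision_function([v])[0] > 0:
                    detections.append(((x + win / 2) / s, (y + win / 2) / s))
    return detections

Each returned center can then be matched against the ground-truth crossroad regions using the criterion stated above.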

2. Classification of Ships from Aerial Thermal Images

We wanted to test the binary algorithm on a less challenging object classification task. Ship classification is essential for an autonomous UAV system deployed over the sea, for general surveillance purposes and specifically for the detection of illegal immigration activity on fishing boats (RECONSURVE Project, ITEA 2).


Figure 1: Crossroad detection performance of the classification algorithm in a sliding window search framework. (a) Precision-recall curves for real and binary neurons. (b) A sample crossroad detection test image for the real neurons. Green rectangles are the ground truth, red rectangles are the declared detections, and yellow pixels represent the true detections. (c) A sample crossroad detection test image for the binary neurons.

We introduce a new ship thermal image dataset (Fig. 2), in which 15 different ships are imaged from a low altitude UAV with an ASELSAN thermal camera, covering 360-degree views (at least 70 images per ship). The dataset is available from the corresponding author on request; see the supplementary materials for images of all classes. The ship types are chosen to cover a wide spectrum of ship appearances. Segmentation of the ships from the sea background is straightforward in thermal images (for both day and night operation), which reduces the problem from search-based detection to classification, and the classification performance is 99% for both real and binary neural units. Our findings indicate that the classification performance of the binary algorithm converges to that of the real neurons for less challenging classification tasks. They also show that ship identification and detection can be performed successfully from a low altitude UAV platform.

Figure 2: Some of the ships used in the ship classification task. There are 15 different ship classes, each of which contains around 70 images. The images are acquired using an ASELSAN thermal camera mounted on a low altitude UAV.

3. Object Tracking

3.1. Object Tracking Algorithm

A neural network based classification framework is utilized for frame-by-frame detection of the tracked object. Naturally, this tracking algorithm has two phases: training and detection. In the training phase (Fig. 3b), the tracked object and its immediate background are used to create a database of feature vectors (v) with two classes, target and background. Using only one image for the target class is an option, but multiple target images are used in order to enforce robustness and to compute the sub-pixel location of the target. The target images are cropped from the frame as left-right and top-bottom shifted versions of the target (Fig. 3a, left). This set of images has similar feature representations, since the crops are only a few pixels apart from each other, but it enables multiple target detections for robustness and localization precision. For the background class, multiple images that represent the background of the target are cropped in the same manner (Fig. 3a, right). Regardless of the target object size, these sets of images are resized to a constant square region to keep the computational load fixed.
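A minimal CPU-side sketch of how such a two-class training set could be assembled is given below; the helper, the shift and margin values and the output size are illustrative assumptions, not the exact procedure of the implementation.

import cv2  # used only for resizing the crops to a constant square

def build_training_patches(frame, box, shift=2, margin=16, out_size=32):
    """frame: grayscale image; box: (x, y, w, h) target bounding box.
    Returns (targets, backgrounds): slightly shifted crops of the target and
    crops of its immediate surroundings, all resized to out_size x out_size."""
    x, y, w, h = box

    def crop(cx, cy):
        patch = frame[max(cy, 0):cy + h, max(cx, 0):cx + w]
        return cv2.resize(patch, (out_size, out_size))

    # Target class: the box shifted by a few pixels left/right and up/down.
    targets = [crop(x + dx, y + dy)
               for dx in (-shift, 0, shift) for dy in (-shift, 0, shift)]
    # Background class: crops taken around the target at a fixed margin.
    backgrounds = [crop(x + dx, y + dy)
                   for dx in (-w - margin, 0, w + margin)
                   for dy in (-h - margin, 0, h + margin)
                   if (dx, dy) != (0, 0)]
    return targets, backgrounds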


Figure 3: (a) The image patches used in training are shown in boxes. Left: target image patches that are a few pixels apart from the actual target image. Right: background image patches that represent the background class. (b) Training phase flowchart of the object tracking algorithm. (c) Detection phase flowchart of the object tracking algorithm.

The 4K-dimensional feature vector v is extracted for each of these target and background images. These labeled feature vectors are then used to train a linear SVM classifier. The classifier is periodically updated during tracking to provide plasticity against appearance changes. Appearance changes in the scene are caused not only by the target itself but also by occlusion and clutter, and spurious changes in the classifier due to such external noise factors cause drift in tracking algorithms. To prevent this, a sanity check is applied during training: if the change in the SVM classifier (the angle change of the separating hyperplane) exceeds a certain threshold, the update is rejected.
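A minimal sketch of this sanity check, assuming a linear SVM whose weight vector is accessible (e.g. the coef_ attribute of a scikit-learn LinearSVC); the angle threshold is illustrative.

import numpy as np

def update_is_sane(w_old, w_new, max_angle_deg=20.0):
    """Accept the classifier update only if the SVM hyperplane normal rotates
    by less than max_angle_deg between the old and the candidate classifier."""
    cos_sim = np.dot(w_old, w_new) / (np.linalg.norm(w_old) * np.linalg.norm(w_new))
    angle = np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))
    return angle < max_angle_deg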

Periodically, the SVM classifier is retrained, and in the following frames it is used to label image regions in the neighborhood of the target using the standard object detection procedure: a sliding window based search (Fig. 3c). Target image patches are detected with the SVM classifier, and the centers of these patches are labeled as target pixels. The average location of these target pixels gives the target location in the current frame; if the number of target pixels falls below a certain threshold, the target is declared lost. It is, however, possible to misclassify background patches as target due to appearance similarities, and such false detections cause tracking errors. These misclassifications are on average expected to be spatially separated from the correct classifications, because they most likely originate from an object that is similar to the target but spatially distinct within the search region. Multiple spatial clusters of putative target pixels will then emerge. It is essential to reject the misclassified pixels, since failure to do so causes jumps and drifts. To reject incorrect target pixels, spatial clustering is performed on the target pixel locations. To determine the number of clusters, we use the Akaike information criterion (AIC, [3]) by fitting a mixture of Gaussians to the target pixel locations. Once the most plausible number of mixture components is determined by AIC, the Gaussian whose center is closest to the previously known target center is assigned as the "correct" cluster and the rest are rejected. This procedure rejects spurious detections caused by clutter.
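A minimal CPU-side sketch of this rejection step, using scikit-learn's GaussianMixture (which provides an AIC score); the range of component counts and the names are illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

def select_target_pixels(points, prev_center, max_components=3):
    """points: (N, 2) putative target pixel locations; prev_center: (2,) previous
    target center. Fit mixtures with 1..max_components Gaussians, keep the model
    with the lowest AIC, and return only the pixels of the cluster whose mean is
    closest to the previously known target center."""
    ks = range(1, min(max_components, len(points)) + 1)
    models = [GaussianMixture(n_components=k).fit(points) for k in ks]
    best = min(models, key=lambda m: m.aic(points))
    keep = int(np.argmin(np.linalg.norm(best.means_ - prev_center, axis=1)))
    return points[best.predict(points) == keep]

The mean of the surviving pixels then gives the new target center.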


3.2. Standard Tracking Dataset Experiments

The tracking function in an unmanned system is essential for carrying out complex tasks requiring temporal continuity. The proposed tracking algorithm is tested on standard sequences [4, 5] commonly used in the literature (Table 1). The success metrics in the literature vary, but three metrics are frequently used: center location error, 20-pixel precision and coverage. The center location error of MILTrack [5], the proposed algorithm (non-simplified, original), the Bolme tracker (a.k.a. MOSSE [6]) and the Circulant tracker [7] is given in Table 2, and the 20-pixel precision is given in Table 3. The best results are highlighted in red.
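For reference, a minimal sketch of how these two metrics can be computed from per-frame object centers (our own illustration, not the evaluation code used for the tables):

import numpy as np

def center_error_and_precision(pred_centers, gt_centers, thresh=20.0):
    """pred_centers, gt_centers: (N, 2) arrays of per-frame (x, y) centers.
    Returns the mean center location error in pixels and the fraction of frames
    with error below thresh (the 20-pixel precision of Table 3)."""
    err = np.linalg.norm(np.asarray(pred_centers) - np.asarray(gt_centers), axis=1)
    return err.mean(), (err < thresh).mean()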

Table 1: Tracking sequences [5]

Track Difficulty

Sylvester Illumination, pose, scale change, 3D camera motion

David Indoor Illumination, pose, scale change, 3D camera motion

Cola Can Specular object, pose change, occlusion

Occluded Face Occlusion, moving camera

Occluded Face 2 Heavy appearance change and occlusion

Surfer Low contrast and appearance change

Tiger 1 Fast motion, frequent occlusions, motion blur

Tiger 2 Fast motion, frequent occlusions, motion blur

Coupon Book Heavy appearance change, serious clutter

Apart from these, more recent studies report marginal performance improvements for some of the sequences: [8] reported center location errors of 3.8 and 3.6 pixels for the Occluded Face 2 and David sequences, respectively, and [9] reported 8 and 12 pixels for the Tiger 1 and Cola Can sequences. Tracking algorithms designed for one specific difficulty (e.g. occlusion) or one specific object (e.g. faces) exist and yield superior performance for one or two sequences, but fail for most of the rest; the errors for these algorithms are therefore not considered in the comparison.

Table 2: Tracking center location error (red: best, green:2nd best)

Sequence MILTrack Bolme Proposed

Sylvester 11 36 8

David Indoor 23 9 10

Cola Can 20 24 13

Occluded Face 27 89 15

Occluded Face 2 20 7 30

Surfer 11 93 7

Tiger 1 16 49 11

Tiger 2 18 34 9

Coupon Book 15 5 7

Table 3: Tracking precision for 20 pixel error (red: best, green:2nd best)

Sequence MILTrack Bolme Circulant Proposed

Sylvester 0.9 0.53 1.0 0.94

David Indoor 0.52 1.0 0.49 0.98

Cola Can 0.55 0.34 1.0 0.95

Occluded Face 0.43 0.07 1.0 0.73

Occluded Face 2 0.6 1.0 1.0 0.44

Surfer 0.93 0.04 0.99 0.96

Tiger 1 0.81 0.18 0.61 0.9

Tiger 2 0.83 0.26 0.63 0.86

Coupon Book 0.69 1.0 1.0 1.0

The algorithm successfully tracks all objects in the nine sequences, with the best performance in six of them. Although the proposed algorithm seems to fail on the "Occluded Face 2" sequence for the reported metrics, closer analysis shows that the tracker drifts to the edge of the actual object (the face) at the beginning and keeps tracking it throughout the sequence. The coverage metric (at a 0.33 F-measure threshold) is 1.0, which shows that the track was never lost; the center location error was simply larger than 20 pixels, causing a lower precision for this sequence. The same observation holds for the "Occluded Face" sequence, for which the coverage is also 1.0. The MOSSE tracker fails on six of the sequences but achieves top results on the other three. The reason for this inconsistency is hard to pinpoint; it is not due to poor parameter selection, because an exhaustive search over its parameter space was performed to maximize its performance. The Circulant tracker outperforms MOSSE, showing the best performance on six of the sequences, but performs poorly on the remaining three. We emphasize, however, that the overall performance of the proposed algorithm is more balanced, since it gives good precision on almost all of the sequences.

3.3. Thermal Aerial Images

The intended platform of the designed system uses thermal imagery. For this purpose, we introduce a comprehensive thermal video set and test our algorithm on it (the dataset is available from the corresponding author on request; see the supplementary materials for a detailed description). There are a total of 33 tracks, annotated every 10-20 frames. The videos are acquired using ASELSAN thermal imaging products, some of which are mounted on UAVs for aerial surveillance. Vehicles, people and apartments/regions (Fig. 4) are tracked in a wide variety of scenarios: object size and type, viewing angle, occlusion, clutter and noise, feature intensity, appearance change and abrupt motion. The proposed tracking algorithm is tested on this dataset with both real and binary neural units (Table 4), investigating the effect of the algorithmic simplification (binarization) on tracking performance. In the analysis, individual track results are grouped according to tracking difficulties (a single track can contribute to multiple groups), since it is hard to analyze all 33 tracks individually.

Figure 4: The thermal object tracking dataset introduced in this paper contains 33 tracks, 4 of which are shown as examples. See the supplementary materials for a thorough description. (a) Person. (b) Vehicle, oblique angle, large size. (c) Vehicle, orthographic, small size. (d) Vehicle, medium size, high clutter due to traffic.

The results show that the performance drop due to binarization is not significant overall; on the contrary, the binary neural network performs significantly better in some categories (e.g. occlusion). There is, however, a significant drop in performance for long-run vehicle tracks and abrupt vehicle turns, which suggests that the simplified algorithm is weaker at handling sudden appearance changes.

Table 4: Tracking coverage, thermal video set (red: best)
Sequences Proposed Proposed Binary
People 0.64 0.87
Vehicle Medium 0.70 0.65
Vehicle Small 0.80 0.71
Vehicle Traffic 0.42 0.56
Vehicle Long Run 0.75 0.43
Apartment and Region 1.0 1.0
Occlusion 0.69 0.92
Clutter 0.58 0.65
Poor Feature 0.71 0.65
Abrupt Turn 0.52 0.33
Sudden Camera Motion 0.75 0.75

References

[1] J. Cheng, T. Jin, X. Ku, J. Sun, Road junction extraction in high-resolution SAR images via morphological detection and shape identification, Remote Sensing Letters 4 (3) (2013) 296-305.

[2] D. Ciresan, U. Meier, J. Schmidhuber, Multi-column deep neural networks for image classification, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 3642-3649.

[3] H. Akaike, A new look at the statistical model identification, Automatic Control, IEEE Transactions on 19 (6) (1974) 716-723.

[4] A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using the integral histogram, in: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, Vol. 1, IEEE, 2006, pp. 798-805.

[5] B. Babenko, M.-H. Yang, S. Belongie, Robust object tracking with online multiple instance learning, Pattern Analysis and Machine Intelligence, IEEE Transactions on 33 (8) (2011) 1619-1632.

[6] D. S. Bolme, J. R. Beveridge, B. A. Draper, Y. M. Lui, Visual object tracking using adaptive correlation filters, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 2544-2550.

[7] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, in: Computer Vision-ECCV 2012, Springer, 2012, pp. 702-715.

[8] X. Jia, H. Lu, M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1822-1829.

[9] Y. Bai, M. Tang, Robust tracking via weakly supervised ranking SVM, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1854-1861.

Supplementary Material: FPGA Design

Ozgur Yilmaz1

Department of Computer Engineering, Turgut Ozal University, Ankara, Turkey

Ismail Ozsarac

Department of Electronics Design, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,

Turkey.

Omer Gunay

Department of Electronics Design, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,

Turkey.

Huseyin Ozkan

Department of Image Processing, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011, Turkey.

Abstract

In this supplementary material we provide the details of our FPGA design.

The implementation in FPGA is realized as an IP core (Fig. 1). Due to the

inherent parallelism:

1. The feature volume (X x Y x D) can be constructed for the whole video frame.

2. The feature vector v can be calculated for many image regions (i.e. sliding windows for object detection) simultaneously.

3. The calculated feature vectors can be classified in parallel.

The parallelism supplied by the FPGA core is exploited for search-based object detection and tracking applications. After computing the feature volume, the FPGA receives from the CPU the coordinates of image regions; these queries need to be classified according to the pre-uploaded classifier matrix.


Figure 1: FPGA IP core structure. The memory interface handles data read/write operations with the external memories. There are three main stages in the Image Analyzer block. The feature dictionary and the classification matrix are pre-uploaded into the FPGA memory. Video frames and feature calculation requests are constantly fed into the core during operation.

The FPGA outputs the class labels of the regions of interest (e.g. sliding window object detection queries or object tracking queries), as well as the feature vectors of the query regions. The core consists of two main sub-blocks: Image Analyzer and Memory Interface. The Memory Interface is responsible for data transfer between the Analyzer and the external memories. The Image Analyzer block consists of three sub-blocks (Fig. 1): Feature Extraction, Feature Summation and Classification. It receives four types of input: video, the feature dictionary (Fig. 3a), the class matrix (Fig. 5a) and feature calculation requests. The feature dictionary and the classification matrix are pre-uploaded into the FPGA memory, whereas video frames and feature calculation requests are constantly fed into the core during operation. The resolution of the video is X (rows) by Y (columns) and the frame rate is assumed to be higher than 10 Hz.


Figure 2: Feature extraction block. It consists of many sub-blocks. The video frame is fed into this block, and the output is the hidden layer volume representation of the frame, which consists of a K-length feature vector for each pixel in the frame.

1. Feature Extraction

The Feature Extraction block preemptively builds the hidden layer representation of the whole video frame (i.e. for every pixel). This block (Fig. 2) starts with the "take patch" process (Fig. 3b). A patch is a small W x W image region, the same size as the receptive fields (RF). To capture the related pixels, each incoming video line is written to a LINE FIFO; according to the patch dimension W, the "take patch" process uses W LINE FIFOs. Each incoming video line is first written to LINE FIFO-W, and when the next video line arrives, the previous one is read from LINE FIFO-W and written to LINE FIFO-(W-1). These steps continue until all LINE FIFOs are filled with the lines necessary to construct the patch. When all lines are available, pixel values are read from the FIFOs as the next line arrives; after W read operations the patch is ready for the subsequent operations, and the (W+1)-th read from the LINE FIFOs gives the patch of the next pixel. These steps continue until all patches along a line are captured. During the patch reads from the FIFOs, new lines continue to move into the upper FIFOs; this movement slides the patch downward through the frame.


Figure 3: Sub-routines and data structures of the feature extraction block. (a) The X by Y frame and a single patch of size W by W are shown on top; the dictionary D, which consists of K receptive fields, is shown at the bottom. (b) Take patch sub-block, which captures a patch from the image. (c) Construct P vector sub-block, which vectorizes the patch. (d) PB vector construction sub-block, which binarizes the vector. (e) Distance vector calculation sub-block, which computes the Hamming distance between the binary vector PB and the binary receptive fields in the dictionary. (f) Pixel feature vector calculation sub-block, which computes the sparse hidden layer representation for each pixel.

The patch is vectorized and the P vector is constructed from the captured patch pixel values (Fig. 3c). This construction is a register assignment: there are W x W registers, from L1P1 to LWPW, and every register keeps the related pixel value. The bit width of the registers is determined by the maximum possible pixel value (typically 8 bits).

The mean value of the P vector is needed for binarization. To calculate the mean patch value Pµ, every pixel value in the patch is added and the sum is divided by the total number of pixels. The additions are realized by adders; the number of adder inputs may differ according to the FPGA capability and affects both the pipeline clock latency and the number of adders used. After all pixel values are added, the total is divided by W x W.

After calculating Pµ, each entry of the P vector is compared with Pµ and binarized to construct the vector PB (Fig. 3d). This binarization step is essential for realizing the neural network algorithm on currently available FPGAs. Values less than Pµ are assigned '0'; otherwise '1' is assigned. After all values are compared with the mean, the binary vector PB is obtained; PB is a T-by-1 bit vector, where T = W x W.

Every binary vector constructed from the patches of an image is transformed into a feature vector (hidden layer representation) using a pre-computed dictionary of K visual words. The dictionary D is a T-by-K binary matrix. The columns of D (DC1 to DCK) are stored in internal FPGA registers DCx (Fig. 3a). The dictionary is loaded to the FPGA through communication interfaces such as PCI or VME, and its entries can be updated at any time since they are stored in internal registers.

The bit-flipping (Hamming) distance calculation compares two binary vectors of the same size: PB and every column DC of D (Fig. 3e). If two corresponding entries of PB and DC are the same, '0' is assigned; otherwise '1' is assigned. This operation is realized by XOR blocks, and the total number of 1s after the XOR operation measures how different the two binary vectors are. DV contains the Hamming distance of a single PB vector to all the visual words in the dictionary; its entries hold the counts of 1s, so they are integer values. DV is a K-element vector of H-bit entries, where H is the minimum number of bits that can represent the scalar value T.

The vector DV represents the inverse activity of each neuron in the dictionary for a patch, and it needs to be sparsified and binarized. The mean value of DV is computed similarly to that of P. To calculate the standard deviation of DV, DVµ is subtracted from each entry of DV, the differences are squared and summed, the total is divided by K, and finally the square root is taken to obtain DVσ. The activation threshold AT is calculated by subtracting SPARSITY_MULTIPLIER x DVσ from DVµ. This adaptive threshold is used to construct a sparse representation from DV by nullifying the distance values larger than the threshold.

To construct the pixel feature vector (PFV in Fig. 3f), each entry of DV is compared with AT: if the entry is greater than AT, '0' is assigned to the corresponding entry of PFV; otherwise '1' is assigned. The result is a 1-by-K pixel feature vector PFV. A 1-by-K bit vector is obtained for each pixel in a video frame, and thus the hidden layer volume is constructed. The computed PFVs are sent to the memory interface to be written to the external memories.
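The following is a minimal software reference sketch of the per-patch computation described above (binarize the patch against its mean, XOR and popcount against the binary dictionary, then sparsify with the adaptive threshold); the names and the sparsity multiplier value are illustrative, and the FPGA realizes these steps with fixed-point logic rather than floating-point arithmetic.

import numpy as np

def pixel_feature_vector(patch, D_binary, sparsity_multiplier=1.0):
    """patch: (W, W) pixel block; D_binary: (T, K) binary dictionary with 0/1
    entries, T = W*W. Returns the 1-by-K binary PFV described above."""
    P = patch.flatten()                          # P vector (T entries)
    PB = (P >= P.mean()).astype(np.uint8)        # binarize against the patch mean
    DV = np.count_nonzero(PB[:, None] ^ D_binary, axis=0)  # Hamming distance to each of the K words
    AT = DV.mean() - sparsity_multiplier * DV.std()        # adaptive activation threshold
    return (DV < AT).astype(np.uint8)            # '1' only for words closer than the threshold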

2. Feature Summation

Once the sparse hidden layer activity of each pixel is computed in the Feature Extraction block (Fig. 2), the feature vector of an image region can be calculated by spatial-summation based dimensionality reduction. This procedure divides the region into four quadrants, sums the pixel feature vectors (PFV) inside each quadrant and concatenates the four sums to obtain the feature vector (FV) of the region. This feature vector is then forwarded to the Classification block or to the CPU (Fig. 1). Many of these feature vectors can be computed in parallel for many image regions of interest, for object detection and tracking purposes, and these requests are communicated to the FPGA through a specific interface.

The feature calculation requests are written to the Feature Calculation Request FIFO (Fig. 4a) as pixel coordinates of the region of interest. The CPU sends the coordinates of two border pixels (upper-left and lower-right, black dots) and the FPGA calculates the remaining coordinates of the sub-regions (white dots in Fig. 4d). According to the pixel coordinates, the internal RAM addresses are calculated by the Address Calculator block (Fig. 4a); this block knows the content of the RAM, namely which lines are stored. To speed up the calculations, the PFV values are read from external memory and written to the internal RAM, which can store R x X x K bits of data, where R is the maximum number of lines that can be processed at a time (Fig. 4b).

An integral image is used to speed up feature summation requests that are likely to cover multiple overlapping regions (Fig. 4c); in that case, the integral image avoids duplicate summation operations. The Integral Vector Calculator reads the necessary PFVs from the RAM to calculate the integral vector. Note that the PFV volume is three dimensional: two spatial dimensions and one feature dimension. An integral vector (IV) entry is the sum of all entries of the preceding PFVs along both the horizontal and vertical dimensions, so the integration forms an integral image for each feature dimension. The sum for each quadrant can then be computed with four additions (QIV in Fig. 4d). Since there are four quadrants (Q1, Q2, Q3 and Q4), the quadrant results are concatenated to obtain the final feature vector FV. The FV is a G x S bit vector, where S is the minimum number of bits that can store the largest possible count of 1s in a quadrant and G = 4 x K. The vector is stored in the internal RAM of the FPGA. This feature vector (FV) is a discriminative and efficient representation of the image region defined by the border coordinates (black dots in Fig. 4d), and it can further be used for classification and clustering, executed either on the FPGA or on the CPU via memory transfer.
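A minimal software sketch of this pooling step, assuming the PFV volume is available as an array; in the design above the integral image is built and read inside the FPGA, and the names and the quadrant split are illustrative.

import numpy as np

def region_feature_vector(pfv_volume, top_left, bottom_right):
    """pfv_volume: (rows, cols, K) array of binary PFVs; top_left and
    bottom_right: inclusive (row, col) corners of the region. Returns the
    G = 4*K feature vector FV (per-quadrant sums via an integral image)."""
    integral = pfv_volume.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

    def box_sum(ra, ca, rb, cb):
        # Sum of PFVs over rows ra..rb and cols ca..cb (inclusive), 4 lookups.
        s = integral[rb, cb].copy()
        if ra > 0:
            s -= integral[ra - 1, cb]
        if ca > 0:
            s -= integral[rb, ca - 1]
        if ra > 0 and ca > 0:
            s += integral[ra - 1, ca - 1]
        return s

    (r0, c0), (r1, c1) = top_left, bottom_right
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2        # split the region into 4 quadrants
    quadrants = [box_sum(r0, c0, rm, cm),          # Q1: top-left
                 box_sum(r0, cm + 1, rm, c1),      # Q2: top-right
                 box_sum(rm + 1, c0, r1, cm),      # Q3: bottom-left
                 box_sum(rm + 1, cm + 1, r1, c1)]  # Q4: bottom-right
    return np.concatenate(quadrants)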


Figure 4: Sub-routines and data structures of the feature summation. (a) Overall diagram of

the feature summation. After receiving feature calculation requests, feature summation block

computes feature vector of an image region via spatial pooling based dimensionality reduction.

(b) Internal RAM structure. (c) Integral image is formed for efficient spatial pooling operation

on multiple image region feature vector calculation requests. (d) The spatial pooling reduces

to addition operations once the integral image is formed, and this is performed separately in

parallel for the 4 quadrants (QIV ). Then the summations are concatenated to form feature

vector (FV ) representation of the image region defined by the corner coordinates.


Figure 5: Sub-routines and data structures of the Classification block. (a) The classification matrix that is used for linear classification. (b) The classification matrix is multiplied with the feature vector FV to arrive at the class likelihood vector CL.

When the pooling operations for the requested coordinates are finished, the RAM is updated with new lines and new pooling calculations are started. These processes are controlled by the Integral Vector Calculator with the aid of the Address Calculator.

3. Classification

The Classification block generates a class label likelihood vector using a linear classification method: it performs a matrix-vector multiplication of the class matrix C (Fig. 5a) with FV. The class matrix C is loaded to the FPGA just like the feature dictionary D, and the Row Arbiter manages the rows of C for the multiplication with FV. The C matrix is a J x G matrix with S-bit entries, where J is the number of trained classes, G is the feature dimension and S is the bit precision. The result is the class label likelihood vector CL; each entry CLx of CL is the sum of the products of FV with the corresponding row of C. CL can be sent to the CPU for further processing, e.g. a classification/detection decision, or a max operation can be applied to assign a class label, which is the index of the maximum entry of CL.
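A minimal sketch of the operation this block performs (one dot product per trained class followed by an argmax); types and names are illustrative.

import numpy as np

def classify(FV, C):
    """FV: length-G feature vector; C: (J, G) class matrix of S-bit weights.
    Returns the class likelihood vector CL and the predicted class label."""
    CL = C @ FV                    # one dot product per trained class
    return CL, int(np.argmax(CL))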


Figure 6: The algorithm is implemented on dedicated hardware using an ALTERA Stratix V (5SGXA7) FPGA chip. The hardware consists of memory, a CPU and the FPGA chip, providing fast communication among the components.

4. Hardware Usage and Timing

The algorithm is implemented on dedicated hardware based on an ALTERA Stratix V (5SGXA7) FPGA (Fig. 6). Table 1 shows the FPGA and hardware resource usage of the object detection implementation for a 640 x 512 @ 50 Hz video rate, an RF size of 4 pixels and 20,000 sliding windows of 64 x 64 pixels, which corresponds to at least 5 different scales (1.3, 1.2, 1, 0.8, 0.6) of exhaustive object search with an 8-pixel shift between windows (see the supplementary materials for the detailed timing and hardware usage analysis). Therefore, an effective and fast (> 10 Hz) embedded object detection framework can be constructed using a fraction of an FPGA chip, to be deployed on UAVs. Note that object detection and object tracking (> 20 objects) can be executed simultaneously using less than 20% of the FPGA resources, and the rest can be utilized for other tasks such as salient motion detection, sparse feature extraction and visual odometry. The saliency of detected motion/change can be determined using the same classification framework: moving pixels can be analyzed for saliency (detection of pre-determined object classes) in a multi-scale manner via appropriate CPU-FPGA communication, for less than 10% of the FPGA resources. Sparse feature extraction requires further computation on the hidden layer activities of individual pixels; however, the computational complexity of this additional stage is predicted to be low.
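As a rough sanity check of the quoted window count (our own estimate, not a figure from the implementation), assume the frame is rescaled by each scale factor and a 64 x 64 window slides over it with an 8-pixel stride:

W, H, win, stride = 640, 512, 64, 8
total = 0
for s in (1.3, 1.2, 1.0, 0.8, 0.6):
    w, h = int(W * s), int(H * s)                  # rescaled frame size
    total += ((w - win) // stride + 1) * ((h - win) // stride + 1)
print(total)  # ~21,000 windows, consistent with the ~20,000 quoted above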

The timing and delay analysis is as follows. The hidden layer activity calculations can be realized in pipelined order until the end of the Feature Extraction block; there is only pipeline latency, and a PFV can be calculated in less than 1 µs after the patch is available. The Feature Summation block (calculation of FV) needs to read PFVs from external memory and store them in internal memory. The external and internal memory read operations introduce a lag; however, a frame delay due to this lag is avoided by using multiple buffers and an optimal order of requests calculated and communicated by the CPU. That is, the FPGA fetches portions of the frame data from external into internal memory in a fixed order, and the CPU's feature summation requests are synchronized with this order. The Classification block requires the FVs to be computed; once they are ready, the class labels CL can be calculated in less than 1 µs. Therefore, the system operates in real time without any frame delay.

In summary, for less than 30% of the FPGA resources, object detection, (multiple) object tracking, salient motion detection and scene recognition can be executed with satisfactory accuracy. Sparse feature extraction still needs to be worked out and analyzed to understand its resource usage.

Table 1: FPGA and Hardware Resource Usage Summary

Property Used Available Occupied (%)

Logic 30,000 622,000 5

Internal RAM 250 × M20K 2560 × M20K 10

Multiplier (18 × 18) 40 512 8

Ext. Memory Bandw. 6 Gbps 75 Gbps 8

CPU Interface Bandw. 0.1 Gbps 2.5 Gbps 4


FPGA Resource Usage and Timing

The FPGA implementation is analyzed for the 640 x 512 @ 50 Hz (20 ms frame period) video on the dedicated hardware.

Fig.1 : Frame and patch dimensions (640 x 512 frame at 50 Hz, 4 x 4 patch, Pixel CLK = 20 MHz).


Fig.2 : Structure of the dedicated hardware.

FPGA properties are detailed below in TABLE 1.

TABLE.1 : Selected FPGA Properties
Logic: 622,000 LE
Internal RAM: 2,560 * M20K
Multiplier: 512 * 18x18

TABLE.2 : Selected Hardware Properties
External Memory Bandwidth: 64 x 600 x 2 = 75 Gbps
CPU Interface Bandwidth: 2.5 Gbps

Feature Extraction

TABLE.3 : FPGA Resource Usage and Timing

TAKE PATCH (X = 512, Y = 640, W = 4, Pixel Depth = 8 bit; CLK = Pixel CLK, 20 MHz)
Logic: 120 LUT / 256 reg
Internal Memory: 4 * M20K
Latency: ~128.5 us (occurs only at the beginning of the frame)

Pµ, patch mean (CLK = Pixel CLK, 20 MHz)
Logic: 5 * 4-input adder, 1 * 12-bit divider
Latency: ~0.2 us (pipeline latency)

PB (T = 16; CLK = Pixel CLK, 20 MHz)
Logic: 16 * IF check
Latency: ~0.05 us (pipeline latency)

DICTIONARY D (K = 100; CLK = Pixel CLK, 20 MHz)
Logic: 100 * 16-bit register

DV (T = 16, H = 5; CLK = 5 * Pixel CLK = 100 MHz)
Logic: 20 * 16-bit XOR, 20*16 * IF check, 20*16 * 4-bit counter, 100 * 5-bit register
Latency: ~0.2 us (pipeline latency)

DVµ (CLK = Pixel CLK, 20 MHz)
Logic: 30 * 4-input adder, 1 * 12-bit divider
Latency: ~0.3 us (pipeline latency)

DVσ (CLK = 5 * Pixel CLK = 100 MHz)
Logic: 10 * 5-bit subtractor, 6 * 4-input adder, 1 * 13-bit divider, 1 * 8-bit square root operation
Multiplier: 10 * 9x9 mult
Latency: ~0.05 us (pipeline latency)

AT (CLK = Pixel CLK, 20 MHz)
Logic: 1 * 5-bit subtractor
Multiplier: 1 * 9x9 mult
Latency: ~0.01 us (pipeline latency)

PFV (100 bits per pixel; CLK = 5 * Pixel CLK = 100 MHz)
Logic: 20 * IF check
External Memory: 100 x 20 = 2000 Mbps = 1.95 Gbps
Latency: ~0.1 us (pipeline latency)

Feature Summation

INTERNAL RAM (R = 64; CLK = 10 * Pixel CLK = 200 MHz)
External Memory: 100 x 20 = 2000 Mbps = 1.95 Gbps
Internal Memory: 200 * M20K

REQUEST FIFO (5000 requests at a time; can be updated with new requests when the previous requests are fulfilled. Each request is the 10-bit row and 10-bit column coordinates of two pixels, 40 bits in total.)
Internal Memory: 10 * M20K

ADDRESS CALCULATOR (CLK = 10 * Pixel CLK = 200 MHz)
Logic: RAM address management logic
Multiplier: 4 * 18x18 mult
Latency: ~0.05 us (pipeline latency)

IV CALCULATOR (CLK = 15 * Pixel CLK = 300 MHz; maximum quadrant size = 32 x 32, S = 10 bit, G = 400)
Logic: RAM read operations, 100 * 2-input adder
Internal Memory: 2 * M20K
Latency: ~6.8 us for the calculation of a 4 * 32x32 quadrant; for other dimensions use the formula 4 x QSize x QSize x 3.3 ns (300 MHz). The internal memory has two separate read ports, so two quadrants can be calculated at the same time. Since all PFVs are stored in external memory, the frame latency (20 ms) can be used to calculate the requests.

Classification

CLASS C (J = 10, G = 400; each entry CXX is 10 bits)
Internal Memory: 2 * M20K

CLASSIFICATION (CLK = 15 * Pixel CLK = 300 MHz)
Logic: 20 * 2-input adder
Multiplier: 20 * 18x18 mult
Latency: ~0.66 us for the calculation of one CL

The calculations can be realized in pipelined order until the end of the Feature Extraction block; there is only pipeline latency, and the PFVs can be calculated in less than 1 us after the patch is available. The Feature Summation block needs to read the PFVs from external memory and store them in the internal memories. Due to the synchronized design of the requests from the CPU, the lag caused by the memory transfer does not introduce a frame delay. Classification is similar to Feature Summation in that it requires the FVs; after an FV is ready, the class label CL can be calculated in less than 1 us. It is possible to achieve a frame rate higher than 10 Hz with more than 20k windows (32 x 32 quadrant size). The resource usage for this configuration is given below.

TABLE.4 : FPGA Usage Summary
Logic: ~30,000
Internal RAM: ~250 * M20K
Multiplier: ~40 * 18x18

TABLE.5 : Hardware Usage Summary
External Memory Bandwidth: ~6 Gbps
CPU Interface Bandwidth: ~0.1 Gbps

Thermal Object Tracking Dataset

The images are acquired using an image grabber attached to different ASELSAN thermal imaging products. The images in the dataset are not raw; they are quantized, histogram equalized and enhanced. More detail on the image sources can be provided on demand.

Description of Tracks

A short description of each track is given in this section, together with a representative image below the description. The description includes the object type, object size, scene characteristics, viewing angle and the difficulty of the sequence (clutter, occlusion, etc.). In the images, the bounding box (black) and the center of the track (white dot) are overlaid.

Track 1: Person, large object, urban scene, partial occlusion, clutter, non-rigid motion.

Track 2: Person, large object, urban scene, abrupt orientation change, non-rigid motion.

Track 3: Vehicle, large object, urban scene, full occlusion, oblique view.

Track 4: Vehicle, large object, urban scene, full occlusion, oblique view.

Track 5: Vehicle, large object, urban scene, full occlusion, oblique view.

Track 6: Vehicle, medium object, urban scene, full occlusion, oblique view.

Track 7: Vehicle, large object, urban scene, full occlusion, intensity change, oblique view.

Track 8: Vehicle, tiny object, urban scene, abrupt orientation change, clutter, oblique view.

Track 9: Vehicle, tiny object, urban scene, low contrast, high clutter, oblique view.

Track 10: Vehicle, tiny object, urban scene, low contrast, high clutter, oblique view.

Track 11: Vehicle, medium object, urban scene, clutter, partial occlusion, oblique view.

Track 12: Vehicle, small object, rural scene, low contrast, clutter, partial occlusion, oblique view.

Track 13: Vehicle, tiny object, urban scene, air view.

Track 14: Vehicle, small object, urban scene, long run, clutter, partial occlusion, sudden jump, air

view.

Track 15: Vehicle, small object, urban scene, sudden jump, air view.

Track 16: Vehicle, small object, urban scene, smooth orientation change, air view.

Track 17: Vehicle, small object, urban scene, clutter, air view.

Track 18: Vehicle, tiny object, urban scene, high contrast, no clutter, air view.

Track 19: Vehicle, tiny object, urban scene, high contrast, high clutter: similar object, air view.

Track 20: Vehicle, tiny object, urban scene, high contrast, sudden jump, air view.

Track 21: Apartment, medium size, urban scene, texture high, no clutter, air view.

Track 22: Apartment, medium size, urban scene, texture high, high clutter: similar object, air view.

Track 23: Region, large size, urban scene, texture low, smooth orientation change, air view.

Track 24: Region, medium size, urban scene, texture high, air view.

Track 25: Vehicle, small object, urban scene, long run, clutter, sudden jump, orientation and scale

change, air view.

Track 26: Vehicle, tiny object, urban scene, clutter, orientation change, low contrast, air view.

Track 27: Vehicle, small object, urban scene, abrupt orientation change, air view.

Track 28: Vehicle, small object, urban scene, smooth orientation change, clutter, air view.

Track 29: Vehicle, small object, urban scene, smooth orientation change, clutter, air view.

Track 30: Vehicle, small object, urban scene, full occlusion, high clutter, air view.

Track 31: Vehicle, tiny object, urban scene, low contrast, high clutter, air view.

Track 32: Vehicle, medium object, urban scene, high clutter, air view.

Track 33: Vehicle, medium object, urban scene, high clutter, smooth orientation change, contrast

change, air view.

SCENARIO VARIETY

In this section, different scenarios in the dataset are highlighted. A subset of the dataset can be selected for algorithmic studies according to imaging system or operational needs.

Oblique View

The scene is viewed from the side in some of the tracks, which is always the case for ASIR systems and sometimes for 300T systems, depending on the gimbal's orientation. An oblique view may show perspective effects if there are nearby objects, and this viewing angle is more prone to occlusion.

Air view

The scene is viewed from above in some of the sequences. This case generally generates affine image motion and occlusion is rare; however, the object size may become very small due to distance, and clutter/noise can cause problems.

Object Size and Type

Vehicles, people, apartments or regions are tracked in the sequences. The variation in size is largest for vehicles among the different object types.

Feature Intensity

The amount of image features (edges, corners, blobs, etc.) determines the richness of the target: the higher the feature intensity, the better the target can be discriminated from the background and the better target tracking performs. The feature intensity varies in the dataset; there are very poor targets, especially at very small sizes, and very rich targets, specifically regions and apartments.

Occlusion

Both partial and full occlusions are observed in the tracks. Recovering from full occlusion is a challenge for tracking algorithms, and it can be tested on the dataset. Occlusion happens for objects of different sizes, so there is variety in the occlusion scenarios.

Clutter and Noise

Clutter can be defined as structured noise, and it generally stems from other objects near the tracked object. The strength of the clutter and its similarity to the target determine the difficulty of the scenario. Image noise also varies in the dataset according to zoom level, time of day the image is acquired and many other factors. The SNR is not provided with the dataset, but it can be computed using standard techniques from the literature.

Appearance Change

Sometimes the appearance of the tracked object changes due to orientation changes or intensity changes, the latter probably caused by image enhancement. The orientation change happens smoothly or abruptly in different tracks, and this is specified in the track descriptions.

Abrupt Motion

In some of the tracks the camera makes a sudden orientation change, which causes abrupt image motion. It is important to handle these situations, since they may happen during operation due to stabilization errors or user intervention.

Thermal Ship Classification Dataset

15 different ships are imaged using an ASELSAN thermal camera mounted on a low altitude UAV. 360-degree views of the ships are captured, and a sample image for each ship is shown below.