Pedestrian Detection with Multichannel Convolutional Neural Networks
David José Lopes de Brito Duarte Ribeiro
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Professor Jacinto Carlos Marques Peixoto do Nascimento
Professor Alexandre José Malheiro Bernardino
Examination Committee
Chairperson: Professor João Fernando Cardoso Silva Sequeira
Supervisor: Professor Jacinto Carlos Marques Peixoto do Nascimento
Member of the Committee: Professor Mário Alexandre Teles de Figueiredo
December, 2015
Acknowledgments

I would like to thank my supervisors, Professor Jacinto Nascimento and Professor Alexandre Bernardino, for giving me the opportunity to work with them; I can say with certainty that this experience was extremely enriching for me. More specifically, I thank them for all the guidance, suggestions, comments, time and attention they devoted, which contributed valuably to this Thesis.

I also thank my family, in particular my parents and my sister, for all the support they gave me and for all the experiences and opportunities they provided. I am also grateful for the education and upbringing they gave me, which undoubtedly helped make me a better person.

Finally, I thank Matteo Taiana for his explanations regarding pedestrian detection methods and the evaluation of their performance.
Resumo

The detection of people in general, and of pedestrians (PD) in particular, are important and challenging tasks in the context of human-machine interaction, with applications in surveillance, robotics and Advanced Driver Assistance Systems. Their main challenges stem from the variability in pedestrian appearance (e.g. regarding clothing) and the similarities with other objects (e.g. traffic signs).

This work proposes an innovative method to approach the PD problem, based on Convolutional Neural Networks (CNN). More concretely, a CNN model was trained for each single input channel (e.g. RGB or LUV) and high-level representations were extracted from the penultimate layer. Finally, a multichannel input CNN model was trained (partially or fully) with these representations. During testing, the images were pre-processed with the Aggregated Channel Features (ACF) detector to generate pedestrian candidate windows. These windows were then fed to the multichannel CNN model, being effectively classified as pedestrians or non-pedestrians.

The developed method is competitive with the state of the art when evaluated on the INRIA dataset, yielding improvements over the baseline method (ACF). Two experiments were carried out, namely, using the full INRIA dataset at high resolution and part of the INRIA dataset at low resolution. Additionally, the devised methodology can be successfully applied to the low resolution PD problem, with the potential to be extended to other areas through the integration of information from several inputs.

Keywords: Pedestrian Detection, Convolutional Neural Networks, single input channel, multichannel CNN model, high-level representations
Abstract

The detection of people in general, and Pedestrian Detection (PD) in particular, are important and challenging tasks in human-machine interaction, with applications in surveillance, robotics and Advanced Driver Assistance Systems. Their main challenges are due to the high variability in pedestrian appearance (e.g. concerning clothing), and the similarities with other classes resembling pedestrians (e.g. traffic signs).

This work proposes an innovative method to approach the PD problem, based on the combination of heterogeneous input channel features obtained with Convolutional Neural Networks (CNN). More specifically, a CNN model is trained for each single input channel (e.g. RGB or LUV) and high level features are extracted from the penultimate layer (right before the classification layer). Finally, a multichannel input CNN model is trained (partially or fully) with these high level features. During testing, the full images are pre-processed with the Aggregated Channel Features (ACF) detector in order to generate pedestrian candidate windows (i.e., windows potentially containing pedestrians). Next, these candidate windows are fed to the multichannel input CNN model, being effectively classified as pedestrians or non-pedestrians.

The developed method is competitive with other state-of-the-art approaches when evaluated on the INRIA dataset, achieving improvements over the baseline ACF method. Two experimental setups were adopted, namely, the full INRIA dataset at higher resolution and the partial INRIA dataset at lower resolution. Furthermore, the devised methodology can be successfully applied to the low resolution PD problem, and promisingly extended to other areas by integrating information from several inputs.

Keywords: Pedestrian Detection, Convolutional Neural Networks, single input channel, CNN multichannel input model, high level features
Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
Glossary

1 Introduction
1.1 Motivation and Problem Formulation
1.2 State of the art in PD and related work
1.2.1 PD methods and related approaches
1.2.2 PD Datasets
1.2.3 PD Evaluation metrics
1.3 Objectives
1.4 Main contributions
1.5 Dissertation outline

2 Convolutional Neural Networks
2.1 Introduction
2.2 Training and testing problem formulation
2.3 Operations and layers
2.3.1 Convolutional layer
2.3.2 Fully connected layer
2.3.3 Pooling
2.3.4 Activation function
2.3.5 Normalization
2.3.6 Classification
2.3.7 Loss function
2.4 Transfer Learning
2.4.1 Pre-training
2.4.2 Fine-tuning
2.4.3 Feature extraction

3 Proposed method
3.1 Description of the Proposed Method
3.1.1 Datasets and Input channels
3.1.2 Pre-trained CNN model
3.1.3 CNN single input model
3.1.4 CNN multichannel input combination model
3.1.5 Overall detection methodology
3.2 Implementation

4 Experimental Results
4.1 Performance evaluation methodology
4.2 Experimental setup
4.3 Training methodology
4.4 Results
4.4.1 PD using the full INRIA dataset with 7 input channels
4.4.2 PD using the partial INRIA dataset with 4 input channels
4.4.3 Discussion

5 Conclusions
5.1 Thesis overview
5.2 Achievements
5.3 Future Work

Bibliography
List of Tables

4.1 Miss rate (%) using single channels as input and without feature combinations for the full INRIA dataset
4.2 Miss rate (%) using 4 feature combinations for the full INRIA dataset
4.3 Miss rate (%) using 7 feature combinations for the full INRIA dataset
4.4 Miss rate (%) using single channels as input and without feature combinations for the partial INRIA dataset
4.5 Miss rate (%) using 4 feature combinations for the partial INRIA dataset
List of Figures

1.1 Some examples of PD challenges
1.2 Overview of the pedestrian datasets
2.1 Diagram of the expanded CNN model
2.2 Diagram illustrating the methodology used to feasibly perform the backpropagation procedure
2.3 Example of the backpropagation procedure
2.4 Example of the convolution operation
2.5 Example of the max pooling operation
2.6 Examples of activation functions
2.7 Cross-channel normalization scheme
3.1 Example images of the input channels
3.2 Pre-trained model architecture
3.3 Scheme of the single input and multiple input combination CNN models
3.4 Scheme of the overall detection methodology
4.1 Comparison of the developed method's best result, denoted Multichannel CNN, with other PD benchmarks for the full INRIA dataset
4.2 Comparison of the results obtained with the developed method for the combination of different input channels for the full INRIA dataset
Glossary

ACF Aggregated Channel Features
BBdt Detected bounding box
BBgt Ground truth bounding box
CNN Convolutional Neural Networks
CSS Color self-similarity
DCT Discrete cosine transform
DF Decision Forests
DL Deep Learning
DN Deep Network
DPM Deformable Part Models
FHOG Felzenszwalb Histogram of Oriented Gradients
FPPI False positives per image
FPS Frames per second
GradHist6 Gradient histogram in the orientation range from 150 degrees to 180 degrees
GradMag Gradient Magnitude
Gx Horizontal derivative across all RGB channels
Gy Vertical derivative across all RGB channels
HOG Histogram of Oriented Gradients
MR Miss rate
NMS Non Maximal Suppression
NN Neural Networks
PD Pedestrian Detection
RBM Restricted Boltzmann Machine
ReLU Rectified Linear Unit
SGD Stochastic Gradient Descent
SVM Support Vector Machine
Chapter 1
Introduction
1.1 Motivation and Problem Formulation
The ability to detect people is an important and challenging component of human-machine interaction.
This has been an active area of research in recent years due to its wide range of applications, e.g. automotive safety, surveillance, entertainment, robotics, aiding systems for the visually impaired, and Advanced Driver Assistance Systems, to name a few.
Pedestrian Detection (PD) consists in the detection of people in typical standing poses. Its main challenges are due to the high variability in pedestrian appearance and the similarities with other classes of objects resembling pedestrians. In Advanced Driver Assistance Systems and surveillance applications, the PD task might include walkers, skateboard and bicycle users. The appearance of the pedestrians is influenced by the pose, clothing, lighting conditions, backgrounds, scales and even by the atmospheric conditions. Besides that, occlusions (among pedestrians and between pedestrians and other objects), background clutter, deformations, viewpoint variations, low resolution and the resemblances between pedestrians and other classes of objects (interclass resemblances, such as the resemblances between pedestrians and mannequins, pedestrian traffic signs and pedestrian traffic lights) are other complexities inherent to this task. The referred challenges and complexities are illustrated in Figure 1.1.
As a result, the previously mentioned challenges motivate the adoption of methods that simultaneously have the representative power to capture general pedestrian traits at multiple scales, and are robust to pedestrian intra-class variability and inter-class similarities. Moreover, the ability to combine various sources of information about the pedestrians may lead to improved results.
Recently, Convolutional Neural Networks (CNN) have successfully been applied to several vision tasks such as general object classification [14, 31], general object detection [52, 50] and even PD [41, 49, 29], which demonstrates their potential to further explore the PD problem and deal with the PD challenges.
This thesis reviews the current pedestrian detection state of the art and uses a combination of the Aggregated Channel Features (ACF) [21] detector and CNNs to perform the PD task. Besides the regular CNN use (single channel input), an innovative method is implemented based on the combination of features extracted from CNN models obtained with different inputs (multichannel combination). In particular, this method is applied to the challenging low resolution PD problem.
Figure 1.1: Some examples of PD challenges: a) reflections of pedestrians in windows; b) drawings of people; c) mannequins; d) a pedestrian-sized poster; e) a pedestrian traffic sign and crowd-related occlusions; f) pedestrian traffic lights; and g) a gumball machine resembling a pedestrian. a), b), c) and d) were adapted from [53], e) was obtained from [3], f) was obtained from [2] and g) was obtained from the INRIA dataset [13].
1.2 State of the art in PD and related work
The PD problem has been studied for more than a decade, with several datasets, benchmarks, evaluation metrics and over 40 methods being developed and established in the process. The following subsection provides an extended overview of the most relevant methodologies proposed in the field, in conjunction with other potentially useful approaches.
1.2.1 PD methods and related approaches
The literature is rich and provides a great diversity of methods devoted to the PD problem. These methodologies can be categorized into three main families according to [6]: Decision Forests (DF) (possibly using boosting techniques, and the most common approach), Deformable Part Models (DPM) variants, and Deep Networks (DN). Additionally, some works consider the combination of diverse families, such as [39], which integrates DPM and DN to tackle the PD occlusion problem.
A survey of PD detectors (DF and DPM families) from 2000 to 2012 is presented in [20], a discussion
about the PD problem from around 2004 to 2014 is shown in [6] and a brief survey of Deep Learning
(DL) based PD methods (i.e., the DN family) is presented in [29]. The most relevant methodologies and
concepts contained in these surveys are summarized herein, complemented by other techniques outside
of their scope. Relevant approaches in other fields of study, but with the potential to be successfully applied to PD, are included as well.
In general terms, most detectors comprise three main components: feature representation, classification schemes and training methodology. However, this distinction is less evident in approaches that jointly learn several modules of the detector (e.g. [41]). In fact, [5] provides an experimentally supported overview of the detector's design options, comprising: feature pooling, normalization and selection, pre-processing, weak classifiers, and training methods and sets. The features can be obtained from different input channels, transformations or pre-processing (e.g. HOG, RGB and LUV are some of the features and channels considered in [16, 21]). Regarding the classifiers, it is possible to differentiate between two main frameworks: monolithic and part-based. While in the monolithic case the entire image is taken into account to detect the pedestrians (DF family, e.g. [13]), in the part-based case, deformation models and partial detections contribute to determine the probability that a pedestrian is present (DPM family, e.g. [24]). Training can be performed with a linear (more frequently) or non-linear Support Vector Machine (SVM) (e.g. [13]), boosting, or more complex methodologies.
Accordingly, the work of [68] aims to unify the framework of several PD methods (such as Integral Channel Features [16], Aggregated Channel Features [21], Squares Channel Features [5], Informed Haar [67] and Locally Decorrelated Channel Features [36]) by proposing a detector model in which a boosted decision forest uses filtered low-level features. Furthermore, a new method is devised, based on the particular choice of the filters used, and different filter bank design options are explored and supported with experimental results.
In recent years, the improvement in the performance of PD methodologies has relied mainly on the use of better features (concerning dimensionality and complexity) [6], as noticed when comparing the prominent benchmarks of Viola and Jones [58], HOG [13], Integral Channel Features [16], Squares Channel Features [5] and Spatial Pooling [44]. Indeed, in [13] the feature quality was improved by the introduction of the Histogram of Oriented Gradients (HOG), followed by a linear SVM classifier. The HOG features were widely adopted and extended in the literature (e.g. [69] increased the detection speed of [13]), even in current approaches (e.g. [21, 68]).
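As a concrete illustration of the HOG idea, the following is a minimal NumPy sketch, not the full pipeline of [13] (gradient interpolation and block normalization are omitted): each cell accumulates a histogram of unsigned gradient orientations weighted by gradient magnitude, and the concatenated, normalized histograms form the descriptor that would be fed to a linear SVM.

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Minimal HOG-style descriptor: per-cell histograms of unsigned
    gradient orientation, weighted by gradient magnitude."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned, in [0, 180)
    h, w = image.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            idx = (a / (180.0 / bins)).astype(int) % bins
            for b in range(bins):
                hist[i, j, b] = m[idx == b].sum()
    v = hist.ravel()
    return v / (np.linalg.norm(v) + 1e-6)  # global L2 normalization
```

For a 128x64 window with 8x8 cells and 9 bins this yields a 16x8x9 = 1152-dimensional vector, comparable in spirit (though not in detail) to the descriptor dimensionality used in sliding window detectors.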
Sliding window based PD methodologies are the most successful and the most commonly found in the literature, in contrast to keypoint [48] or segmentation [28] based approaches. Among the pioneers of sliding window PD methods is [45], which resorted to the Haar wavelet transform in conjunction with SVMs to build an overcomplete dictionary of features at multiple scales. This work was extended by [58], which developed a faster technique (named Integral Images) to compute the features, applied AdaBoost to sparsely extract the most important visual attributes, and proposed a cascade of classifiers to constrain the computational effort to the most promising regions. These core principles are still applied, to some degree or variation, in most current detectors (mainly in the DF and DPM families, but also in combination with DN approaches).
In [16], the Integral Images technique proposed in [58] was applied to different image channels (namely, gradient histograms, grayscale, RGB, HSV, LUV, and gradient magnitude) in order to extract features (in the form of local sums and Haar feature variants, expanding the set of image channels used to compute the features in [58]), which were subsequently processed by a boosted classifier. This method was improved in [21] by changing the features from local sums in image channels to pixel lookups in aggregated channels (resulting in the Aggregated Channel Features method). A variant of [16] was introduced in [5] (named Squares Channel Features), by limiting the image channels to HOG (with gradient magnitude) and LUV, and choosing candidate features according to square feature pooling subwindows (contained within the model window). In the same work [5], a final detector (named Roerei) was proposed, considering different scales, global normalization and an enhanced feature pool.
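The Integral Images technique underlying these channel-feature detectors can be sketched in a few lines of NumPy (a sketch of the general idea, not the implementation of [58] or [16]): after one cumulative-sum pass over a channel, the sum of any rectangular region is obtained with four lookups, regardless of the rectangle's size.

```python
import numpy as np

def integral_image(channel):
    """Integral image with a zero row/column prepended, so the sum over
    any rectangle is obtained with four lookups."""
    ii = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1))
    ii[1:, 1:] = channel.cumsum(0).cumsum(1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of channel[r0:r1, c0:c1] in O(1) from the integral image."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]
```

This constant-time rectangle sum is what makes evaluating thousands of local-sum features per detection window computationally feasible.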
The complementarity of different types of features has been shown to yield improved results, as discussed in [61] for the combination of HOG features, Haar wavelets and shape context using different classifiers. Expanding this approach to include motion features (obtained from optical flow) and a novel feature named color self-similarity (CSS), as proposed by [60], allows further gains to be attained. More recently, in [6], optical flow information, context information (i.e., interactions between two people) and channel augmentation using a discrete cosine transform (DCT) were added to the Squares Channel Features detector [5], resulting in an improvement of over 10% relative to the baseline Squares Channel Features method [5] on the Caltech-USA reasonable dataset (the described method is named Katamari-v1, with a total of 40 channels, 30 of which originate from the convolution of 10 channels, i.e. HOG plus gradient magnitude plus LUV, with 3 DCT basis functions).
Recent PD detectors include [67], which explores prior knowledge of the upright human body shape. This is accomplished by computing Haar-like features according to a model of the head, upper body and lower body parts. Another approach was proposed in [36], which builds on the ACF framework [21] by creating a methodology to locally decorrelate the channel features. As a result, orthogonal decision trees can be adequately and efficiently used instead of resorting to the more computationally expensive oblique decision trees. The approaches proposed in [43, 44] contemplate the use of spatially pooled low-level features in conjunction with the optimization of the area under the receiver operating characteristic curve only in the range of the detector evaluation (i.e., from 10^-2 to 10^0 false positives per image (FPPI)).
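The evaluation range mentioned above is the one over which the standard log-average miss rate metric is computed: the miss rate is sampled at reference FPPI points log-spaced in [10^-2, 10^0] and averaged in log space. A minimal NumPy sketch follows; the reference points and the geometric mean mirror the common Caltech-style protocol, while the interpolation of the curve is a simplification assumed here for illustration.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, lo=1e-2, hi=1e0, n=9):
    """Average miss rate at n reference FPPI points log-spaced in
    [lo, hi]. fppi must be strictly increasing."""
    refs = np.logspace(np.log10(lo), np.log10(hi), n)
    # sample the miss rate curve (in log-FPPI) at the reference points
    mr = np.interp(np.log10(refs), np.log10(fppi), miss_rate)
    # geometric mean of the sampled miss rates
    return np.exp(np.mean(np.log(np.maximum(mr, 1e-10))))
```

A detector whose curve is flat at 20% miss rate over the whole range scores exactly 0.2, which is why this single number is a convenient summary for ranking detectors.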
Initially, PD methods were mainly focused on enhancing accuracy (measured in miss rate versus false positives per image). Nevertheless, the emergence of real time applications, such as surveillance and robotics, demanding higher detection rates (measured in frames per second (FPS)), raised attention to the importance of combining detection accuracy with adequate detection speed. Sliding window based detectors require the computation of image pyramids and corresponding features at each scale, which can be computationally expensive and incur unsatisfactory runtimes. As a result, new methodologies were proposed (e.g. [18, 21]) to approximate the feature computation (including gradient histograms) at multiple scales, resulting in substantial runtime improvements with slight accuracy reductions. In fact, the method described in [18] reaches almost real time PD (i.e., approximately 5 FPS) using 640x480 images, whereas the detector in [57, 58, 59], devised more than ten years earlier, achieves detection rates of 15 FPS using 384x288 video. Additionally, [21] achieves detection rates above 30 FPS, [19] reaches performances in the range of 35 to 65 FPS (on 640x480 images) and [4] achieves 50 FPS for monocular images (using a GPU) and 135 FPS for stereo images (using CPU and GPU) (a more detailed review of detector speeds is presented in [21]).
In the scope of deformable part models, an unsupervised method was devised in [25] to learn scale invariant part constellations, considering object attributes such as shape, appearance, occlusion and relative scale. More recent DPM methodologies are proposed in [24, 23], where deformable part models comprising a root filter, part filters and respective deformation models (named star models) were applied to images at multiple scales and positions. A score dependent on each element of the star model is computed. Training is performed using the latent SVM approach, with partially labeled data. Various star models are combined and a score is assigned to each position and scale of the images. As a result, this methodology addresses the variability existing in the object and pedestrian classes.
An emerging and challenging problem consists in performing PD with very low resolution images. The additional challenges are noticeable in many surveillance systems, because high resolution images with a detailed shape description may not be available. In automotive systems, the detection of pedestrians far away from the vehicle (i.e., at low and medium resolutions) guarantees a safety margin, essential to perform adequate driving maneuvers. Therefore, it is desirable to achieve satisfactory performance levels at medium and low resolutions, but without employing expensive high resolution cameras, since that would constitute an obstacle to the deployment of PD methods in the automotive systems framework [20]. However, detection performance at low resolution remains disappointing, as mentioned in [20]. To address the PD problem at low resolutions and multiple scales, the works of [46] and [65] introduced improvements to the DPM approach proposed in [24]. The work of [46] created an adaptive model that works as a rigid model at low resolutions and changes to a deformable part model at high resolutions, while in [65], a multi-task model was applied to pedestrians at distinct resolutions in order to capture their similarities and dissimilarities.
Further considerations on part, shape, pose and articulation representations are present in the literature. In [62], body part detectors contemplating edgelet features (which correspond to short segments belonging to a line or curve) were devised to tackle PD in crowded scenes. Learning was achieved by a boosting technique. A joint likelihood model was designed to include occlusions among pedestrians (besides the regular PD task), by considering the output of the aforementioned detectors. Further development followed in [63], where this method was applied to the multiview paradigm. The method described in [47] introduces shapelet mid-level features, which result from applying AdaBoost to oriented gradients over local image windows (denoted low-level features). These shapelet features are then processed with AdaBoost, originating the final classifier. In [10], poselets were developed to represent and estimate parts of people's poses. PD and the localization of body elements (e.g. torso or nose) are possible by applying a linear SVM classifier to the poselet features, followed by further classification or regression. Additionally, the work of [33] proposed an HOG based pose invariant descriptor, resulting from the application of a part template model, which extracts pedestrian shapes and poses. In [32], sketch tokens were introduced, a mid-level feature representation aimed at capturing the edges and contours existing in images, subsequently applying this information to pedestrian and object detection.
The Deep Learning (DL) framework has recently been applied to the PD problem (e.g. [39, 42, 29]),
contemplating the use of distinct DN models such as Restricted Boltzmann Machines (RBMs) and CNNs.
The DL based methods belong to the DN family and were utilized to tackle occlusion, background clutter,
pedestrian intra-class variability, among other PD challenges.
The occlusion problem is addressed in [39, 42], which build on the deformable part model framework [24]. The work of [39] learns the relationship among the visibilities of the detected pedestrian parts with a deep model (comprising RBMs). In [42], a deep model is developed to predict the mutual visibility relationship among pedestrians. Additionally, the work of [40] is specifically focused on modelling pedestrian interactions and pedestrian group scenes (including single pedestrian detection as well). Finally, [41] devised a methodology to capture the interaction among feature extraction, deformation handling, occlusion handling and classification, by integrating and jointly learning these modules with a DN. Another approach to mitigate occlusion, but outside the scope of the DN family, is presented in [35], which proposes an efficient training procedure that enables the application of several classifiers particularly devoted to the occlusion task.
With the objective of dealing with background clutter and the variability within the pedestrian class, [34] developed a DN comprising: a convolutional layer to capture lower level features, Switchable Restricted Boltzmann Machine based layers to obtain higher level feature combinations and body salience maps, and a logistic regression layer for classification.
In [64], the authors propose a DN inspired by cascaded classifiers with joint training. Each classifier in the network's hierarchy refines the previous classifications as the top of the network is progressively reached. In fact, the contextual information about the features and detection scores from each network stage contributes to the decision made in the following stage.
The network pre-training subject is approached in [49], where a CNN is initialized with a unsupervised
learning methodology (consisting in convolutional sparse coding) in order to overcome the scarcity of
data available when training on the INRIA dataset [13] (which contains 614 positive images and 1218
negative images). Afterwards, the CNN is fine-tuned with the INRIA dataset. The features from multiple
layers are merged in the classifier, to obtain features that capture low level and high level pedestrian
traits (such as silhouettes and facial details). Besides that, [11] and [29] pre-train a CNN by transferring
the parameters of another network, already trained with a subset of the Imagenet [14] dataset, followed
by fine-tuning (i.e., training) with the dataset pertaining to the task of interest (mammogram datasets in
the case of [11], and pedestrian datasets in the case of [29]).
Regarding the type of inputs entered in the networks used in PD, they vary according to the adopted
methodology and can be: the RGB color space [29], the YUV color space [49], HOG [39, 40], a com-
bination of HOG and CSS [64], and a combination of YUV and gradients [34, 41]. Further experiments
were performed in [29], by using as input the LUV color space and other combinations of LUV and HOG.
The detection task is more complex than the classification task, since it is intended to assign a
class to each object (in the PD case, decide if a certain detection window contains a pedestrian or not)
and output the location of the object (in the PD case, the bounding box only surrounds pedestrians,
because the rest of the classes and the background are considered irrelevant). As mentioned before in
this subsection, the most common architecture of the detectors in PD is based on the sliding window
approach. In this technique a classifier evaluates if a detection window of a specific size contains a
pedestrian of the corresponding height. The detection window is slid over a grid of locations throughout
an entire image (outputting the confidence in the presence of a pedestrian in that window), at multiple
scales, followed by non-maximum suppression (NMS) in order to merge overlapping bounding boxes.
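The NMS step can be sketched as a greedy procedure (a minimal NumPy illustration, not the exact variant used by any particular detector): boxes are visited in order of decreasing score, and any remaining box that overlaps the accepted one beyond a threshold is discarded.

```python
import numpy as np

def nms(boxes, scores, overlap_thresh=0.5):
    """Greedy non-maximum suppression on [x1, y1, x2, y2] boxes."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # indices by decreasing score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of the accepted box with the remaining ones
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= overlap_thresh]   # drop heavily overlapping boxes
    return keep
```
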
Some methods (mainly in the DN family) might not comprise the entire detector architecture, lacking
the scanning of the image with detection windows at multiple scales and the NMS, or were originally
designed to perform as classifiers. As a result, they require the use of other methods (faster although
still accurate) to output candidate windows (also known as regions of interest or detection proposals),
instead of fully scanning the images themselves. Moreover, this procedure reduces the computational
effort spent by the more expensive methods, leading to runtime speed-ups.
Accordingly, the detection can be performed by first using a method to select pedestrian candidate
windows and then applying another methodology to determine if the selected regions correspond, in
fact, to pedestrians or not (i.e., the detection quality is refined by a more powerful classifier). The works
published in [41, 64, 34] adopt this procedure by applying HOG, CSS and a linear SVM detector to
generate candidate windows, which are only afterwards entered into the deep network.
Additionally, [29] uses the Aggregated Channel Features detector and the Squares Channel Features
detector as the region of interest selection methods.
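This two-stage arrangement (a fast proposal method followed by a stronger classifier) can be sketched generically; `proposal_fn` and `cnn_score_fn` below are hypothetical placeholders for, e.g., an ACF-style detector and a trained CNN, not actual implementations of any cited method.

```python
import numpy as np

def detect(image, proposal_fn, cnn_score_fn, threshold=0.0):
    """Two-stage detection: a fast method proposes candidate windows,
    then a (more expensive) classifier rescores and filters them."""
    detections = []
    for (x1, y1, x2, y2) in proposal_fn(image):
        crop = image[y1:y2, x1:x2]            # candidate window contents
        score = cnn_score_fn(crop)            # refined confidence
        if score > threshold:
            detections.append(((x1, y1, x2, y2), score))
    return detections
```

Since the refinement stage only sees the proposed windows, the overall runtime stays manageable even with an expensive second classifier.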
Conversely, although applied to the object detection task, [50] convolves a network (deeper than the
one in [49]) with full images, being able not only to classify but also to locate objects, without requiring the
use of other methods to select candidate windows.
Although not present in the PD Deep Learning literature, the combination of multiple inputs
is a promising idea with the potential to be successful in PD. It is explored in [30] and in [51],
where features for images and text are jointly learned. In [30], a Convolutional Network was used to
jointly train neural language models conditioned on images. In [51], a Deep Boltzmann Machine model
is employed to jointly learn text and images, and multimodal (text and images) or unimodal queries
can be submitted, being possible to generate the missing modality in the unimodal case. Besides
that, [38] uses deep networks (including a Deep Autoencoder model and RBMs) to learn audio and
video information, resorting to the multimodal fusion, cross modality learning and shared representation
learning settings. Furthermore, the work of [11] addresses multiview mammogram analysis, by using
individual deep CNN models for each view (namely, craniocaudal and mediolateral oblique views for the
standard mammogram, micro-calcifications and masses), extracting the features for each view from the
corresponding CNN and then combining them in a joint CNN model. The methodology proposed in [11],
leads to significant improvements for the multiview combination case, when compared with the single
view results, which shows the relevance of combining multiple input channels.
1.2.2 PD Datasets
The datasets are designed to provide a substantial and representative sample of the aforementioned
PD challenges, requiring continuous updates and reformulations, in order to foster progress in the
PD field of study.
An overview of the most relevant pedestrian datasets is presented in Figure 1.2, adapted from
[20]. Although INRIA is still extensively utilized (mainly for training), recently, the most commonly used
datasets for benchmarking [6] are Caltech-USA [17] and KITTI [26].
The most recent benchmarks, corresponding to several methods and to the datasets: Caltech-USA
(training and testing), Caltech-Japan, INRIA, ETH, TUD-Brussels and Daimler (Daimler-DB), are re-
ported online at [1]. Besides that, the website in [1] allows the continuous submission and comparison
of new methods' benchmarks. Additionally, the benchmarks for more than 40 methods predominantly
resorting to the Caltech-USA [17] dataset (although residually including INRIA and KITTI datasets) are
discussed in [6]. Older benchmarks, established for 16 detectors using the aforementioned datasets,
are presented in [20].
The detection can be performed in static images (photographs), surveillance videos or images re-
sulting from video (not from surveillance, but from continuous filming in other contexts) [20]. The need
to manually select images in the photograph case incurs selection bias, which is unlikely to be
experienced in the video case [20]. Moreover, according to [53], using only pure data samples (i.e.,
where the pedestrian can be unambiguously identified) can prevent the detection of partially occluded
or smaller samples, and cause the unfair benchmarking of different detectors (since the impure samples
may be considered false positives). Furthermore, using video instead of static images makes it possible
to exploit optical flow, leading to improved detection results [6].
Figure 1.2: Overview of the pedestrian datasets (adapted from [20]).
1.2.3 PD Evaluation metrics
The adequate definition of evaluation metrics is of crucial importance in order to ensure a fair comparison
among detection methods, to establish benchmarks, and to correctly approximate the real detector
performance levels when applied to a specific task of interest.
The existing PD performance evaluation methodologies are per-window and per-image based [20].
In the per-window methodology, the detector classifies image windows containing pedestrians against
windows without pedestrians. Despite being suited to assess the performance of classifiers and frame-
works proposing automatic regions of interest, this methodology can have shortcomings when evaluating
detectors [20], because it assesses the performance of the classifier instead of considering the entire
detection system.
As a result, the per-image evaluation methodology superseded per-window metrics in 2009 (as proposed
in [17] and later in [20]). In the per-image evaluation metric, a detection window is slid over a grid
of locations throughout a full image and a bounding box and a confidence score are outputted for each
multiscale detection, followed by NMS and the application of the PASCAL measure, to determine the
existing matches between the groundtruth and the detected bounding boxes (the detailed methodology
is described in Section 4.1 of Chapter 4).
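The PASCAL measure referred to above matches a detection to a ground-truth box when their intersection-over-union exceeds 0.5; a minimal sketch:

```python
def pascal_match(det, gt, threshold=0.5):
    """PASCAL overlap criterion for [x1, y1, x2, y2] boxes: match when
    area(intersection) / area(union) exceeds the threshold (usually 0.5)."""
    iw = max(0, min(det[2], gt[2]) - max(det[0], gt[0]))  # intersection width
    ih = max(0, min(det[3], gt[3]) - max(det[1], gt[1]))  # intersection height
    inter = iw * ih
    area_det = (det[2] - det[0]) * (det[3] - det[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_det + area_gt - inter) > threshold
```
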
In practice, per-window performance does not predict per-image performance (only a weak correlation
exists between the two) [20], disproving the conjecture that better per-window results lead to better
performance on full images.
In the per-window evaluation methodology, each image in the selected dataset only needs to be
labelled binarily, as containing a pedestrian or not. However, if the per-image evaluation methodology is
used, the location of each pedestrian, in each image of the dataset, must be provided (constituting the
groundtruth). Consequently, the referred metric demands the effort of manually annotating each image.
This process of annotating pedestrians in each image of a dataset might depend on the application,
particularly, in situations where the appearance of pedestrians and other objects are identical. This
occurs, for example, in the case of images of mannequins, posters, photographs, drawings or reflections
on mirrors and windows [53] (these examples are depicted in images a), b), c) and d) of Figure 1.1).
1.3 Objectives
This thesis proposes an innovative methodology to address the PD problem, based on the Deep Learn-
ing framework. More specifically, Convolutional Neural Networks with multichannel combination are
used (including a pre-processing with the ACF detector). This approach aims to deal with the various
PD challenges and to explore additional information existent in heterogeneous channels.
Moreover, this work intends to demonstrate that, when presented with several input channels (applied
to a CNN), a useful and promising hypothesis to improve performance consists in combining (entirely
or partially) the features extracted from top-level CNN layers. Particularly, the more challenging subject,
regarding the state of the art, of low resolution PD is expected to be successfully tackled by the proposed
methodology.
1.4 Main contributions
The main contributions of this master thesis are four-fold:
• A review of the state of the art in pedestrian detection and an outline of the CNNs background.
• An innovative method to approach the PD problem, based on multiple input channels combination
with CNNs. The obtained results support the conjecture that combining multiple sources of
input can improve the overall performance, when compared to the single input cases.
• Application of the proposed method to the low resolution PD problem. This subject is more chal-
lenging than the standard PD problem and is scarcely explored in the literature, constituting a
novel contribution from this thesis. According to the obtained results, the combination of several
low resolution inputs leads to a significant improvement, when compared with the results of each
single input.
• Competitive results achieved with the developed method and performance benchmarking against
some of the best performing pedestrian detection approaches.
1.5 Dissertation outline
This thesis is structured in five chapters. In Chapter 2 the CNNs background is discussed.
The CNNs are introduced and the training and testing problems are formulated. The main operations and
layers comprised in these networks are detailed and the pre-training, fine-tuning and feature extraction
procedures are mentioned.
Chapter 3 describes the proposed method, consisting of the combination of several input channels
using CNNs. The selected datasets and input channels are specified in conjunction with the pre-training
and fine-tuning techniques for the CNN single input model. The multichannel input combination pro-
cedure is discussed, where the features extracted from each CNN input model are combined in a joint
CNN model. The overall detection methodology is outlined and the implementation details (software and
toolboxes) are mentioned.
The experimental results achieved with the proposed CNN input multichannel combination method
are depicted in Chapter 4. First, the performance evaluation methodology, the experimental setups
behind the attained results and the training methodology are introduced. Afterwards, the results for the
two experimental setups (namely, full INRIA dataset with better resolution and partial INRIA dataset with
lower resolution) are presented in the form of tables and miss rate versus false positives per image plots.
Finally, a comparison with the state of the art is performed and the obtained results are discussed.
Chapter 5 provides an overview of the thesis, describes the most significant achievements and pro-
poses possible extensions to the developed work.
Chapter 2
Convolutional Neural Networks
This chapter provides a brief outline of the Convolutional Neural Networks background. The CNNs are
defined and the contextualization in the Deep Learning framework is mentioned. Afterwards, the CNN
training and testing problems are formulated and the CNN operations and layers are discussed. Finally,
the transfer learning subject is approached by analyzing the tasks of pre-training, fine-tuning and feature
extraction.
2.1 Introduction
CNNs are variants of Neural Networks (NN), specially designed to exploit structured input data with
lattice-like topology [9], such as: sound, which can be interpreted as a one dimensional time series
data; images, which can be interpreted as a two or three dimensional grid of pixels; and video, which
can be interpreted as a four dimensional input composed by a temporal sequence of (three dimensional)
images. Moreover, these networks derive from the Deep Learning framework, assuming deep architec-
tures that comprise various layers of non-linear operations, and with more representational power than
more shallow ones (i.e., with 1, 2 or 3 layers, for example) [7]. As a result, when compared with NN, the
CNN architecture handles the high dimensionality of inputs, attains invariance to slight changes, reduces
the number of connections and parameters, and is more robust to overfitting.
More formally, a CNN consists in a multi-layer processing architecture that can be regarded as a
function f assigning input data, denoted by x0 ∈ RH0×W0×D0, to an output, denoted by y ∈ {1, . . . , C}
(where C is the number of classes), and where the subscript 0 denotes the raw input data before any
layer computation [55]. Alternatively, the output can be a vector y ∈ RC with the probabilities (or scores)
of x0 belonging to each of the C classes. For instance, a CNN can be used to classify an image (in this
case, H0, W0 and D0 denote the height, width and depth, respectively) according to a defined set of
classes, by outputting a probability for each class (for RGB images, the depth equals 3 and for grayscale
images the depth equals 1). This function f results from composing diverse functions f1, . . . , fl, . . . , fL,
typically assuming a sequential topology (which is the one considered herein), and where the subscript
denotes the network module (or layer). However, a more complex disposition in the form of a directed
acyclic graph is also possible [56]. In this thesis the former topology is adopted as follows [55]:
f(x0,w) = fL(. . . f2(f1(x0,w1,b1),w2,b2) . . . ,wL,bL), (2.1)
where w = [w,b] = [w1, . . . ,wL] = [w1,b1, . . . ,wL,bL] represents the parameters, which include the
weights (denoted by w) and the biases (denoted by b).
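Equation 2.1 is a plain function composition, which can be mirrored directly in code; a toy sketch with two placeholder modules (an affine map and a ReLU), not an actual CNN:

```python
import numpy as np

def forward(x0, layers):
    """Sequential CNN model f(x_0) = f_L(... f_2(f_1(x_0)) ...):
    each module maps the previous feature map to the next."""
    x = x0
    for f in layers:
        x = f(x)
    return x

# Toy modules: f_1 has parameters (w_1, b_1); f_2 (ReLU) has none.
w1 = np.array([[2.0, 0.0], [0.0, 2.0]])
b1 = np.array([1.0, -5.0])
layers = [lambda x: x @ w1 + b1,
          lambda x: np.maximum(0.0, x)]
y = forward(np.array([1.0, 1.0]), layers)   # -> array([3., 0.])
```
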
In fact, each function (e.g. fl) represents a module that generates feature maps as output (e.g.
xl) [55]. Starting from the input x0 ∈ RH0×W0×D0 , the output of each network module is respectively
denoted by x1,x2, . . . ,xL, where xl ∈ RHl×Wl×Dl , wl ∈ RH′l×W ′l×Dl×Ql and bl ∈ RQl , for an arbitrary
module (or layer) l (as shown in Figure 2.1). Hl,Wl, Dl denote the height, width and depth of the feature
map (respectively), H ′l and W ′l denote the height and width of the weights (or filters), respectively, and
Ql denotes the number of weights (or filters) and biases (further details are provided in Section 2.3). For
instance, the feature map produced at module l depends on the results of the preceding modules and
is denoted by xl = fl(xl−1; wl). Some modules do not contain weights and biases, as is the case of the
activation functions (e.g. the rectified linear unit (ReLU)).
The most common CNN modules (to be detailed in Section 2.3) are as follows: (i) convolution layers,
(ii) fully connected layers, (iii) pooling layers (e.g. max or average pooling), (iv) activation functions (e.g.
hyperbolic tangent, sigmoid function and rectified linear unit (ReLU)), (v) different types of normaliza-
tion (e.g. contrast normalization or cross-channel normalization), (vi) classification layers (e.g. softmax
classification layer) and (vii) loss functions (which are appended to the core architecture of the CNN in
order to train the network, e.g. cross entropy loss or hinge loss). Two main structural organizations of
the network components are possible [9]: several simple layers or only a few complex layers (e.g. only
convolutional and fully connected layers), where each one has stages that correspond to simpler oper-
ations (e.g. the convolution, the activation function and the pooling are comprised in the convolutional
layer).
2.2 Training and testing problem formulation
The process of training a CNN aims to learn the parameters w = [w1,b1, . . . ,wL,bL], given a set of
training data D = {(xi0, yi) : i = 1, . . . , N}, in order to obtain a representative model, capable of generalizing the
information obtained during training, and correctly classifying new unseen inputs during the test phase.
The subscript 0 denotes the raw input data before any layer computation and the superscript denotes
the i-th data sample. When there is only one data sample or mentioning the dataset is superfluous, the
superscript i is suppressed (as was done in Section 2.1). The data is contained in xi0 ∈ RH×W×D and
the corresponding class is indicated by yi ∈ Y = {1, ..., C}, where C is the number of classes and N is
the number of training data samples.
Since the focus is on the classification task, the objective function is chosen to be discriminative [56],
allowing to achieve the distribution of the labels given the input data, i.e., p(y|x0), instead of being gen-
erative, which provides a representation of the joint distribution of the input data and the corresponding
class, i.e., p(x0, y).
Consequently, the training problem can be cast as the following optimization problem:
\[
\arg\min_{\mathbf{w}} J(\mathbf{x}_0,\mathbf{y},\mathbf{w}) = \arg\min_{\mathbf{w}_1,\mathbf{b}_1,\ldots,\mathbf{w}_L,\mathbf{b}_L} \frac{1}{N}\sum_{i=1}^{N} l\big(f(\mathbf{x}_0^i;\mathbf{w}_1,\mathbf{b}_1,\ldots,\mathbf{w}_L,\mathbf{b}_L), y^i\big) + \frac{\lambda}{2}\big(\|\mathbf{w}\|^2 + \|\mathbf{b}\|^2\big), \tag{2.2}
\]
where l is a loss function¹ (e.g. cross entropy loss), l : f(xi0,w) → R, which penalizes the errors
associated with wrong class predictions (i.e., for each input xi0, the CNN model produces an incorrect
prediction f(xi0; w) if it is different from its associated label yi, and a correct one if it is equal to its label
yi) [55]. The squared Euclidean norm of the weights and biases acts as a regularization penalty, in order
to reduce their size and contribute to mitigate overfitting. The parameter λ controls the relevance of the
regularization approach, when compared with the loss term [22].
For training purposes, the loss function (i.e., l) can be appended to the end of the network (as de-
picted in Figure 2.1) by composing it with the model of the CNN (i.e., f(x0,w)), resulting in the expanded
CNN model, denoted by z = l ◦ f(x0,w) [55] (the superscript notation was suppressed for simplicity,
since the results apply to each data sample i).
Figure 2.1: Diagram of the expanded CNN model (obtained from [55]).
The optimization problem described in Equation 2.2 can be solved by using backpropagation with
stochastic gradient descent (SGD), or its variant mini-batch SGD with momentum, since it typically pro-
cesses substantial amounts of training data [55]. While the standard gradient descent (also known as
batch gradient descent) uses the complete training set in each iteration, SGD acts only on one training
example in each iteration and mini-batch SGD utilizes only a subset of the complete training set in each
iteration (i.e., a batch with some training examples) [8]. Although this distinction assumes a pure view
of the SGD method, this assumption can be relaxed to allow batches in SGD, in which case SGD
and mini-batch SGD correspond to the same approach. The SGD approaches mentioned are less
computationally expensive than standard gradient descent, but require approximately independent sampling
of the batches or examples [8]. Adding momentum to the original SGD, smooths the gradient and, in
a physical interpretation, allows the parameters to acquire velocity (or momentum), which aggregates
previous gradient values [22, 9].
Indeed, resorting to mini-batch SGD with momentum [54], the parameters of the objective function J
are updated according to:
\[
\mathbf{m}^{\mathbf{w}}_{t+1} = \mu_t \mathbf{m}^{\mathbf{w}}_t + \eta_t \frac{\partial J}{\partial \mathbf{w}_t}, \tag{2.3a}
\]
\[
\mathbf{w}_{t+1} = \mathbf{w}_t - \mathbf{m}^{\mathbf{w}}_{t+1}, \tag{2.3b}
\]
\[
\mathbf{m}^{\mathbf{b}}_{t+1} = \mu_t \mathbf{m}^{\mathbf{b}}_t + \eta_t \frac{\partial J}{\partial \mathbf{b}_t}, \tag{2.3c}
\]
\[
\mathbf{b}_{t+1} = \mathbf{b}_t - \mathbf{m}^{\mathbf{b}}_{t+1}, \tag{2.3d}
\]
¹The notation regarding the letter l was previously used to index an arbitrary network layer. Whether l refers to the loss function or a network layer should be clear from the context.
where µt ∈ ]0, 1] is the momentum value (a hyperparameter) at the t-th iteration (because it can be
increased during training iterations), ηt is the learning rate (a hyperparameter) at the t-th iteration (since
it can be decreased during training iterations), m^w_t and m^b_t are the momentum terms at the t-th iteration
for the weights and the biases (respectively), wt are the weights at the t-th iteration, and bt are the
biases at the t-th iteration [37, 54]. Regarding the weights and biases notation, the module (or layer)
subscript was temporarily replaced by the iteration subscript for simplicity, since the equations
are valid for every layer².
Without momentum, the parameter update for mini-batch SGD consists in [54]:

\[
\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \frac{\partial J}{\partial \mathbf{w}_t}, \tag{2.4a}
\]
\[
\mathbf{b}_{t+1} = \mathbf{b}_t - \eta_t \frac{\partial J}{\partial \mathbf{b}_t}. \tag{2.4b}
\]
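The momentum update can be written in a few lines; a sketch assuming scalar hyperparameters and NumPy arrays for the parameters and gradients:

```python
import numpy as np

def momentum_step(w, m, grad, lr, mu):
    """One mini-batch SGD update with momentum: the momentum term
    accumulates the scaled gradients and is then subtracted from the
    parameters.  With mu = 0 this reduces to the plain SGD update."""
    m = mu * m + lr * grad     # m_{t+1} = mu_t m_t + eta_t dJ/dw_t
    w = w - m                  # w_{t+1} = w_t - m_{t+1}
    return w, m

w, m = np.array([1.0]), np.array([0.0])
w, m = momentum_step(w, m, np.array([1.0]), lr=0.1, mu=0.9)  # w = [0.9]
w, m = momentum_step(w, m, np.array([1.0]), lr=0.1, mu=0.9)  # w = [0.71]
```

The second step illustrates the velocity effect: the accumulated momentum (0.19) exceeds the plain gradient step (0.1), since previous gradient values are aggregated.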
Other hyperparameters and refinements introduced by the mini-batch SGD with momentum are: the
number of epochs (one epoch corresponds to a complete sweep across the training set) and the batch
size (the number of training examples used at a certain iteration occurring in each epoch) [37, 54].
The parameters are initialized randomly or transferred from a learned representation (as discussed
in Section 2.4). The hyperparameters can be obtained through cross-validation or from reference values
and procedures [22].
The partial derivative of the objective function J, present in Equations 2.3a and 2.4a, comprises two
parts: the derivative of the loss function term, i.e.:

\[
\frac{\partial L}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} = \frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} = \frac{\partial}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} \frac{1}{N}\sum_{i=1}^{N} l\big(f(\mathbf{x}_0^i; \mathbf{w}_1,\ldots,\mathbf{w}_L), y^i\big), \tag{2.5}
\]
and the derivative of the regularization term, i.e.:

\[
\frac{\partial R}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} = \frac{\partial}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} \frac{\lambda}{2}\|\mathbf{w}_l\|^2, \tag{2.6}
\]

where an arbitrary layer l is considered (the subscript now indexes the layer rather than the iteration);
this derivative is simple to calculate.
To compute the partial derivative of the loss function term of the objective function J with respect to
the weights in the l-th layer, it is necessary to perform a forward pass in the expanded CNN model (i.e.,
from x0 to z as illustrated in Figure 2.1) and apply the chain rule in conjunction with backpropagation
2The operator ∂ denotes the partial derivative. Computing the partial derivative with respect to a considered multidimensionalvariable, corresponds to the application of this operator to each component of the matrix.
from the end of the network to the beginning, as follows [55]:

\[
\frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} = \frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{x}_L)^T}\,\frac{\partial \mathbf{x}_L}{\partial (\mathrm{vec}\,\mathbf{x}_{L-1})^T}\cdots\frac{\partial \mathbf{x}_{l+1}}{\partial (\mathrm{vec}\,\mathbf{x}_l)^T}\,\frac{\partial \mathbf{x}_l}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T}, \tag{2.7}
\]
where the vec function reshapes its argument to a column vector and the derivatives are computed at
the working point, set when the input was forwarded through the network [55].
The intermediate matrices involved in these calculations have high dimensionality, leading to expensive
and even unfeasible computations [55]. To solve this problem, the explicit computation of these
intermediate matrices must be circumvented. As shown in Figure 2.2, the network modules from an
arbitrary intermediate layer (or module) l to the end of the network z can be aggregated in the function
h = l ◦ fL(·; wL) ◦ fL−1(·; wL−1) ◦ · · · ◦ fl+1(·; wl+1) [55]. The derivatives of the composition h ◦ fl can be expressed as [55]:
\[
\frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{x}_{l-1})^T} = \frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{x}_l)^T}\,\frac{\partial\,\mathrm{vec}\,\mathbf{x}_l}{\partial (\mathrm{vec}\,\mathbf{x}_{l-1})^T}, \tag{2.8a}
\]
\[
\frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} = \frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{x}_l)^T}\,\frac{\partial\,\mathrm{vec}\,\mathbf{x}_l}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T}, \tag{2.8b}
\]
which are now feasible to compute, since ∂z/∂(vec xl−1)T has the same dimensions as xl−1 and
∂z/∂(vec xl)T has the same dimensions as xl [55]. By applying this methodology recursively, it is possible to backpropagate
the derivative ∂z/∂(vec xl)T, computed at an arbitrary layer l, to the preceding layer l − 1. Only the derivatives
∂z/∂(vec wl)T and ∂z/∂(vec xl−1)T need to be computed, avoiding the explicit computation of the high dimensional
intermediate matrices [55].
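This recursion can be illustrated with element-wise toy layers, where every Jacobian is diagonal and the backward pass reduces to pointwise products; this is a didactic sketch (with an identity loss), not the general matrix form.

```python
import numpy as np

# Each layer is a pair (f, df): the module and its element-wise derivative.
layers = [(lambda x: 3.0 * x, lambda x: 3.0 * np.ones_like(x)),
          (lambda x: x ** 2,  lambda x: 2.0 * x)]

def backward(x0, layers):
    """Forward pass storing each working point x_l, then backward pass
    propagating dz/dx_l from the output back to the input."""
    xs = [x0]
    for f, _ in layers:
        xs.append(f(xs[-1]))
    grad = np.ones_like(xs[-1])               # dz/dx_L for the identity loss
    for (_, df), x in zip(reversed(layers), reversed(xs[:-1])):
        grad = grad * df(x)   # dz/dx_{l-1} = dz/dx_l * dx_l/dx_{l-1}
    return grad

g = backward(np.array([1.0]), layers)   # d/dx of (3x)^2 = 18x -> [18.]
```

Note that each local derivative is evaluated at the working point stored during the forward pass, exactly as required in the text.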
The overall backpropagation procedure (including its feasible variant described previously) can be
applied to the biases b in a similar way.
Figure 2.2: Diagram illustrating the methodology used to feasibly perform the backpropagation proce-dure (adapted from [54]).
Figure 2.3 provides an abstract illustration of the backpropagation procedure.
In the test phase, the input to test is forwarded through the previously trained original CNN model
(without the loss function), outputting a prediction, which is compared with the corresponding label to
determine the test error.
Figure 2.3: Example of the backpropagation procedure (obtained from [54]).
2.3 Operations and layers
2.3.1 Convolutional layer
As mentioned in Section 2.1, the CNN explores the properties of data with lattice-like topology, to obtain
representative deep models in a computationally viable manner, to which task the convolutional layer
contributes significantly.
Since the statistics across natural images (and similarly in other inputs) repeat themselves, patterns
emerge locally, which can be represented by features [37]. These features are learned in certain regions
of the image but, due to their meaningfulness, can then be used throughout the entire image [37]. In
fact, the feature learning process is based on the activations obtained from convolving a set of filters or
kernels (with more reduced dimensions than the input) with the input image [37, 9]. For example, edge
representations can be found in the first layer of the CNN model of [31], as they are useful features at
every region of an image [9].
Departing from this reusability of features perspective, the parameter sharing property is employed,
where the parameters are shared for each location of the input (in the width and height dimensions,
denoted by H and W , respectively) in which the filters are applied, although different filters can be
learned (its number corresponds to the depth dimension of the layer output and is denoted by D′′) [22].
Supporting full connections between neurons in adjacent layers can be computationally demanding,
specially for inputs of substantial size [37]. This problem is addressed by establishing the local con-
nectivity property (or sparse connectivity), which limits the number of connections in the width (W ) and
height (H) dimensions, and has complete connectivity in the depth dimension (denoted by D) [22]. In-
deed, only a spatial region (in terms of width and height), named receptive field, composed by various
input layer adjacent neurons is connected to each neuron in the next layer (as depicted in Figure 2.4)
[22]. The receptive field (of size H ′ ×W ′ ×D) corresponds to the height, width and depth dimensions
of the filters (of size H ′ ×W ′ ×D ×D′′), where the height and width are usually equal (i.e., H ′ = W ′),
and the fourth dimension corresponds to the number of filters (i.e., D′′ neurons seek to learn different
filters from the same positions of the image) [22]. Other hyperparameters existent in the convolutional
layer are the stride, which is the quantity by which the filters are displaced among contiguous image
positions while applying convolution, and the zero padding, which is the number of zeros to append to
the boundaries of the input (allowing to guarantee certain output dimensions) [22].
Resulting from the parameter sharing framework described above, equivariance to translation is
achieved, implying that translating an input and applying convolution yields the same output as applying
convolution to the input first and translating the resulting output afterwards [9].
Given an input x ∈ RH×W×D, multidimensional filters (also known as weights) w ∈ RH′×W′×D×D′′
and biases b ∈ RD′′, the convolution of x with the filters w (and biases b) is given by [55] (without zero
padding and with stride equal to 1):

\[
y_{i''j''d''} = b_{d''} + \sum_{i'=1}^{H'} \sum_{j'=1}^{W'} \sum_{d'=1}^{D} w_{i'j'd'd''} \, x_{i''+i'-1,\; j''+j'-1,\; d'}, \tag{2.9}
\]
where the output is y ∈ RH′′×W′′×D′′, with H′′ = 1 + floor((H − H′ + Pt + Pb)/Sh) and W′′ =
1 + floor((W − W′ + Pl + Pr)/Sw); Sh and Sw correspond to the stride for the height and the width,
respectively, and Pt, Pb, Pr and Pl correspond to the zero padding of the top, bottom, right and left of
the input, respectively [55]. Usually, the inputs have equal height and width dimensions, i.e., H = W,
the strides are equal, i.e., Sh = Sw = S, the padding values are equal, i.e., Pt = Pb = Pr = Pl = P,
and the receptive field height and width are equal, i.e., H′ = W′. With parameter sharing, there are
H′ × W′ × D × D′′ weights and D′′ biases.
For example, as illustrated in Figure 2.4, considering a CNN input layer of size 227x227x3 (i.e.,
H = 227×W = 227×D = 3), followed by a Convolutional Layer with receptive field size 11x11x3 (i.e.,
H ′ = 11×W ′ = 11×D = 3), number of filters D′′ = 96, stride S = 4 and no zero padding (i.e., P = 0),
the height and width output size are equal to: H′′ = W′′ = 1 + floor((227 − 11)/4) = 55. Consequently,
this layer has 11×11×3×96 weights and 96 biases (assuming parameter sharing), and its output size is:
H′′ × W′′ × D′′ = 55 × 55 × 96, meaning that the same input layer patch of size 11×11×3 is connected
to 96 neurons (with distinct parameters) [22].
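The output size and parameter count used in this example follow directly from the formulas above; a small sketch:

```python
import math

def conv_output_size(H, Hp, P, S):
    """H'' = 1 + floor((H - H' + 2P) / S), assuming symmetric padding P."""
    return 1 + math.floor((H - Hp + 2 * P) / S)

def conv_param_count(Hp, Wp, D, Dpp):
    """With parameter sharing: H' * W' * D * D'' weights plus D'' biases."""
    return Hp * Wp * D * Dpp + Dpp

# The first-layer configuration from the example above:
assert conv_output_size(227, 11, 0, 4) == 55
assert conv_param_count(11, 11, 3, 96) == 34944   # 11*11*3*96 + 96
```
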
2.3.2 Fully connected layer
In fully connected layers, the neurons in adjacent layers are fully connected, similarly to the case of
Neural Networks layers [22] (the procedure to convert from convolutional layers to fully connected layers
and vice-versa, is explained in [22]).
2.3.3 Pooling
The pooling layer collects statistics about the feature representations, typically in non-overlapping input
patches [22]. In the image case, applying pooling subsamples the image by a factor of the stride,
using patches of the receptive field size. Consequently, the feature dimensionality, the parameters and
Figure 2.4: Example of the convolution operation (adapted from [22]).
the computational power demands are decreased, mitigating overfitting [22] and attaining invariance
to slight translations. Although multiple types of statistics exist (e.g. maximum, average or L2 norm),
maximum (or max) pooling is the most common.
Considering an input x ∈ RH×W×D, the max pooling (in patches of size H′ × W′ and stride S) [55]
is given by:

\[
y_{i''j''d} = \max_{1 \le i' \le H',\; 1 \le j' \le W'} x_{i''+i'-1,\; j''+j'-1,\; d}, \tag{2.10}
\]
where the output is y ∈ RH′′×W ′′×D. An example of the max pooling operation application is depicted in
Figure 2.5.
Figure 2.5: Example of the max pooling operation (obtained from [22]).
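Equation 2.10 can be implemented directly; a minimal single-channel sketch with non-overlapping patches (stride equal to the patch size, the common configuration):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling over a 2-D single-channel input: each output value is
    the maximum of a size x size patch."""
    H, W = x.shape
    Hpp = 1 + (H - size) // stride            # output height
    Wpp = 1 + (W - size) // stride            # output width
    y = np.empty((Hpp, Wpp))
    for i in range(Hpp):
        for j in range(Wpp):
            y[i, j] = x[i*stride:i*stride + size,
                        j*stride:j*stride + size].max()
    return y

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 10, 13, 14],
              [11, 12, 15, 16]])
# max_pool(x) -> [[4, 8], [12, 16]]
```
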
2.3.4 Activation function
The activation functions enrich the network representational power by adding a non-linearity and are
typically placed after the convolutional layers (similar to the layers in the NN case) [22]. One of the most
popular activation functions is the Rectified Linear Unit (ReLU), mainly because it leads to faster training
times than the sigmoid and hyperbolic tangent activation functions [31], can be easily computed and
does not saturate [22]. Examples of several activation functions are depicted in Figure 2.6.
The application of ReLU to an input x ∈ RH×W×D is given by [55]:

\[
y_{ijd} = \max\{0, x_{ijd}\}. \tag{2.11}
\]
Figure 2.6: Examples of activation functions (obtained from [54]).
2.3.5 Normalization
Normalization is another module of the CNNs which contemplates various methodologies, such as:
local contrast normalization and cross-channel normalization. Given an input x ∈ R^(H×W×D), the cross-
channel normalization is performed according to [55]:
y_{ijd} = x_{ijd} (κ + α Σ_{t ∈ G(d)} x²_{ijt})^{−β},  (2.12)
where the output y has the same size as the input x and G(d) ⊂ {1, . . . , D} represents the subset of
input channels used to normalize each output channel d. A scheme of the cross-channel normalization operation is
shown in Figure 2.7.
Figure 2.7: Cross-channel normalization scheme (obtained from [54]).
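A minimal NumPy sketch of Eq. (2.12) follows; the channel grouping G(d) is taken here as a window of n channels centred on channel d (an AlexNet-style choice, which is an assumption not fixed by the text):

```python
import numpy as np

def cross_channel_norm(x, n=5, kappa=2.0, alpha=1e-4, beta=0.75):
    """Cross-channel normalization (Eq. (2.12)). The grouping G(d) is assumed
    to be a window of n channels centred on channel d."""
    H, W, D = x.shape
    y = np.empty_like(x, dtype=float)
    half = n // 2
    for d in range(D):
        lo, hi = max(0, d - half), min(D, d + half + 1)
        s = (x[:, :, lo:hi] ** 2).sum(axis=2)   # sum of squares over G(d)
        y[:, :, d] = x[:, :, d] * (kappa + alpha * s) ** (-beta)
    return y

y = cross_channel_norm(np.ones((2, 2, 3)))  # output has the same size as the input
```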
2.3.6 Classification
The classification module assigns a class to the inputs. A popular approach is based on SVMs (including
all their variants), which provide class scores. In the context of CNNs, the most common choice is the
softmax function given by [55]:
y_{ijk} = e^{x_{ijk}} / Σ_{t=1}^{T} e^{x_{ijt}},  (2.13)
where x ∈ R^(H×W×T) is the input (typically with H = W = 1), k = 1, . . . , T , T represents the total number
of classes and y_{ijk} denotes the probability of the input belonging to class k, out of all the T possible
classes.
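Eq. (2.13), for the typical case H = W = 1, can be sketched as follows (the max subtraction is a standard numerical-stability device and does not change the result):

```python
import numpy as np

def softmax(x):
    """Softmax over the class dimension (Eq. (2.13))."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([1.0, 2.0, 3.0])  # x with H = W = 1 and T = 3
p = softmax(scores)                 # probabilities: sum to 1, largest for class 3
```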
2.3.7 Loss function
The loss function is applied during the training process for optimization purposes (as mentioned in
Section 2.2) and the two main types are: the hinge loss (possibly squared) and the softmax logarithmic
loss (also known as the cross entropy loss), which merges the softmax function and the logarithmic loss.
The cross entropy loss is given by [55]:
y = − Σ_{ij} (x_{ijc} − log Σ_{t=1}^{T} e^{x_{ijt}}),  (2.14)
where x ∈ R^(H×W×T) is the input (typically with H = W = 1), x_{ijc} denotes the score of the ground
truth class c and T is the total number of classes.
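For H = W = 1, Eq. (2.14) reduces to the negative log-probability of the ground truth class; an illustrative sketch:

```python
import numpy as np

def softmax_log_loss(x, c):
    """Softmax log (cross entropy) loss for H = W = 1 (Eq. (2.14)).
    x: vector of T class scores; c: index of the ground truth class."""
    m = x.max()
    log_sum = m + np.log(np.exp(x - m).sum())  # stable log-sum-exp
    return -(x[c] - log_sum)

x = np.array([1.0, 2.0, 3.0])
loss = softmax_log_loss(x, c=2)  # equals -log of the softmax probability of class c
```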
2.4 Transfer Learning
In image trained CNNs, the features are organized according to a hierarchy: the features in the initial
(lower-level) layers are more general (e.g. edge representations and color blobs), and their specificity to
the classes of the task of interest gradually increases towards the top (high-level) layers of the network
[66, 22]. For instance, while the lower layers of the ImageNet CNN model [31] may contain edge
representations, the top layers are more focused on differentiating between specific types of felines and,
even more specifically, between cat breeds. As a result, this idea inspires the possibility of transferring
learned features from one CNN to another [66].
2.4.1 Pre-training
CNNs can be trained only with the dataset pertaining to the chosen task (also known as base task),
starting from random initialization (e.g. according to a known probability distribution such as the normal
distribution). However, the size of the dataset may not be enough to prevent overfitting (due to the large
number of parameters to learn), motivating the use of pre-trained models [22]. These pre-trained models
result from training a CNN (possibly with random initialization) with a dataset of more substantial size
(e.g. a subset of ImageNet [14], which contains 1.2 million images in 1000 categories) and different
from the one of the task of interest, also known as target task (although some resemblances may exist)
[22].
2.4.2 Fine-tuning
The classification layer of the pre-trained CNN model can be adapted to the number of classes of the
target task dataset, and the entire (or some of the layers of the) network can be trained (as in [27]) with
the target task dataset (which is a process called fine-tuning), originating the target task CNN model.
Determining which layers are fine-tuned (all, lower-level, higher-level or only the classification layer)
depends essentially on the size of the target task dataset and on its resemblances with the base task
dataset [22]. Qualitatively, when resemblances exist and the dataset has a substantial size, the entire
network can be fine-tuned [22]. Nevertheless, when the data is more scarce (but with resemblances as
well), a better approach consists in extracting the penultimate layer's features and using them to train a
linear classification layer [22]. Although the dataset's size may be enough to train a CNN from random initialization,
fine-tuning a pre-trained model (trained with a different type of data) can be more advantageous [66].
Finally, if the dataset is small and the data type distinct, a more adequate choice may consist in
training a classifier from lower-level layers of the pre-trained network [22].
2.4.3 Feature extraction
CNNs are able to learn rich feature representations from raw inputs (such as image pixels). These
features can be extracted from the top-level layers (typically the penultimate layer, i.e., the one before the
classification layer) of a (pre-trained and fine-tuned) network and used to train a classifier (e.g. softmax
classifier or SVM). Images from the task specific dataset can be forwarded through the same network,
followed by the extraction of features (from the same layer as before) and testing on the previously
trained classifier [27].
Chapter 3
Proposed method
This chapter describes the innovative method developed to solve the Pedestrian Detection problem by
combining features extracted from several CNN input models. First, the used datasets, input channels
and pre-trained CNN models are specified. Subsequently, the pre-training and fine-tuning processes
are discussed for the single channel input case. An analysis of the final CNN model, resulting from the
combination of the input channels (multichannel combination), is performed and the overall detection
methodology is mentioned. Finally, the implementation is outlined, including the software and toolboxes
used.
3.1 Description of the Proposed Method
3.1.1 Datasets and Input channels
Two types of datasets are considered: the dataset containing pedestrian images (which is the INRIA
dataset [13]) and the dataset used to pre-train the CNN model (which is a subset of the Imagenet dataset
[14], comprising 1000 object categories with approximately 1.2 million training images, 50 thousand
validation images and 100 thousand test images, and whose images range from mammals and birds to
vehicles and fruit).
The chosen pedestrian dataset underwent six transformations, resulting in seven distinct datasets
(i.e., the original RGB image and six more). These seven different datasets are used as input channels
to fine-tune the pre-trained CNN model and are enumerated as follows: 1) RGB color space, denoted
by D_RGB = {(x_RGB, y)_n}_{n=1}^N; 2) gradient magnitude, denoted by D_GM = {(x_GM, y)_n}_{n=1}^N; 3) horizontal
derivative (along all 3 dimensions of the depth), denoted by D_Gx = {(x_Gx, y)_n}_{n=1}^N; 4) vertical derivative
(along all 3 dimensions of the depth), denoted by D_Gy = {(x_Gy, y)_n}_{n=1}^N; 5) grayscale, denoted by
D_Gs = {(x_Gs, y)_n}_{n=1}^N; 6) YUV color space, denoted by D_YUV = {(x_YUV, y)_n}_{n=1}^N; and 7) LUV color
space, denoted by D_LUV = {(x_LUV, y)_n}_{n=1}^N.
The images in each dataset are represented by x_RGB, x_Gx, x_Gy, x_YUV, x_LUV ∈ R^(H×W×D) and x_GM, x_Gs ∈
R^(H×W) (with x_RGB, x_GM, x_Gx, x_Gy, x_Gs, x_YUV, x_LUV : Ω → R, where Ω is the image lattice), in which H
and W denote the height and width of the images (respectively), D their depth (D is equal to 1 in the 2D
image case, i.e., for xGM and xGs), and N the total number of images. Example images of each input
channel are depicted in Figure 3.1.
The corresponding class of each of the N images is denoted by y ∈ Y = {1, 2}, where y = 1
represents the absence of pedestrians in the corresponding image and y = 2 represents their presence.
The complete dataset of pedestrian images is denoted by
D = {D_RGB, D_GM, D_Gx, D_Gy, D_Gs, D_YUV, D_LUV}.
Similar notation is adopted to denote the Imagenet dataset [14] used to pre-train the CNN model,
namely: D = {(x, y)_n}_{n=1}^N, with x ∈ R^(H×W×D) and y ∈ Y = {1, . . . , 1000}.
Figure 3.1: Example images of the input channels: 1) RGB color space (x_RGB); 2) gradient magnitude (x_GM); 3) horizontal derivative (x_Gx); 4) vertical derivative (x_Gy); 5) grayscale (x_Gs); 6) YUV color space (x_YUV); 7) LUV color space (x_LUV).
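For illustration, four of the seven channels can be computed from an RGB image as sketched below (NumPy; the grayscale weights, here ITU-R BT.601, and the central-difference derivative filter are assumptions, since the text does not fix these choices):

```python
import numpy as np

def input_channels(rgb):
    """Illustrative computation of four of the seven input channels from an
    RGB image (float array of shape H x W x 3)."""
    gs = rgb @ np.array([0.299, 0.587, 0.114])     # grayscale, H x W
    gx = np.gradient(rgb, axis=1)                  # horizontal derivative, all 3 depths
    gy = np.gradient(rgb, axis=0)                  # vertical derivative, all 3 depths
    gm = np.sqrt((gx ** 2 + gy ** 2).sum(axis=2))  # gradient magnitude, H x W
    return gs, gx, gy, gm

rgb = np.random.rand(100, 41, 3)  # an image with the dimensions used in Chapter 4
gs, gx, gy, gm = input_channels(rgb)
```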
3.1.2 Pre-trained CNN model
According to [66], initializing a CNN with features from a pre-trained network can be more advantageous
than random initialization and can lead to enhancements in the generalization capability, even if the
tasks assigned to the two models are distant.
Therefore, the publicly available CNN-F pre-trained model [12], represented by f(x, w) with param-
eters w = [wcn, wfc, wcl], was selected. The mentioned pre-trained model architecture is depicted in
Figure 3.2 and contains a total of 8 layers: 5 being convolutional (including non-linear sub-sampling and
parameters wcn), 2 being fully connected (with parameters wfc) and the last 1 being a classification
layer (also fully connected with parameters wcl). This CNN has an architecture similar to the one used
in the ILSVRC-2012 competition (ImageNet Large Scale Visual Recognition Challenge 2012) [31] and
was pre-trained with a subset of the ImageNet dataset [14] referred to in Subsection 3.1.1.
More specifically, the CNN input size is 224x224x3 (which requires resizing or pre-processing the
input images in order to achieve these dimensions) and the first convolutional layer comprises: 64
filters with an 11x11 receptive field, convolutional stride 4 and no zero spatial padding, followed by local
response normalization and max-pooling with a downsampling factor of 2 and zero padding on the
bottom and on the right. The second convolutional layer contains: 256 filters with a 5x5 receptive field,
convolutional stride 1 and two levels of zero spatial padding, followed by local response normalization
and max-pooling with a downsampling factor of 2 (and no zero padding). The third convolutional layer
comprises: 256 filters with a 3x3 receptive field, convolutional stride 1 and one level of zero spatial
padding. The fourth convolutional layer contains: 256 filters with a 3x3 receptive field, convolutional
stride 1 and one level of zero spatial padding. The fifth convolutional layer comprises: 256 filters with a
3x3 receptive field, convolutional stride 1 and one level of zero spatial padding, followed by max-pooling
with a downsampling factor of 2 (and no zero padding). The sixth and seventh layers are fully
connected with size 4096 (and support dropout regularization, although it is not used herein). The
eighth and last layer is a softmax classifier of size 1000. All layers containing parameters (except for the
last one) use the Rectified Linear Unit as the activation function.
3.1.3 CNN single input model
To construct the CNN model for the target task of pedestrian detection, the parameters wcn and wfc,
pertaining to the CNN model obtained in the ILSVRC-2012 task (from layer 1 to 7), were transferred to
the pedestrian detection task CNN model.
The architecture of the pedestrian detection task CNN model is equal to the pre-trained model’s
architecture, with the exception of the last layer (i.e., the eighth), which was replaced by a new softmax
classification layer (having parameters wcl and randomly initialized using a Gaussian distribution with
zero mean and variance equal to 0.01), adapted to the new number of classes, which is now two (i.e.,
pedestrian and non-pedestrian) instead of 1000.
Each of the seven previously mentioned input channels (datasets contained in the complete dataset
D) was used to fine-tune a CNN (similar to the procedure in [27] and [11], and resorting to the logarithmic
loss function), resulting in a total of seven single input CNN models (1 per distinct input channel), namely:
1) for RGB color space, f(xRGB ,wRGB); 2) for Gradient Magnitude, f(xGM ,wGM ); 3) for the Horizontal
derivative, f(xGx,wGx); 4) for the Vertical derivative, f(xGy,wGy); 5) for grayscale, f(xGs,wGs); 6) for
YUV color space, f(xY UV ,wY UV ); 7) for LUV color space, f(xLUV ,wLUV ). This process is illustrated
in Figure 3.3 a).
As mentioned previously, the expected CNN input image size is 224x224x3. Consequently, for the
considered input channels (datasets) that contain two dimensional images, the third dimension (depth)
was constructed by stacking 3 replicas of the image. Then, before entering the CNN, all images were
resized to the size 224x224x3 using cubic interpolation. Prior to these two operations, the mean of the
training images was computed and subtracted from all training, validation and test images, as a normalization
approach.
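These preparation steps can be sketched as follows (illustrative NumPy; the cubic-interpolation resize to 224x224x3, performed with MatConvNet/Matlab in the thesis, is omitted here):

```python
import numpy as np

def preprocess(images, train_mean=None):
    """Sketch of the two steps above: 2-D channels are stacked to depth 3 and
    the training-set mean image is subtracted from every image."""
    out = []
    for x in images:
        if x.ndim == 2:                       # 2-D channel (e.g. x_GM, x_Gs)
            x = np.stack([x, x, x], axis=2)   # replicate 3 times to build the depth
        out.append(x.astype(float))
    out = np.stack(out)                       # N x H x W x 3
    if train_mean is None:                    # computed on the training set only
        train_mean = out.mean(axis=0)
    return out - train_mean, train_mean

imgs = [np.ones((4, 3)), np.zeros((4, 3, 3))]  # tiny 2-D and 3-D examples
norm, mean_img = preprocess(imgs)
```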
Figure 3.2: Pre-trained model architecture (obtained from [54]).
3.1.4 CNN multichannel input combination model
To build the multichannel combination CNN model, the features from the seventh layer (designated by
the (L − 1)th layer) of each of the seven single input CNN models are extracted and combined among
themselves (entirely or partially). These features are then utilized to train (by minimizing the logarith-
mic loss function) a softmax classification layer (randomly initialized using a Gaussian distribution with
zero mean and variance equal to 0.01), originating the multichannel input combination CNN model rep-
resented by: f(xRGB,L−1,xGM,L−1,xGx,L−1,xGy,L−1,xGs,L−1,xY UV,L−1,xLUV,L−1; wcl) (similarly to the
procedure in [11]) and depicted in Figure 3.3 b).
Figure 3.3: Scheme of the a) single input and b) multiple input combination CNN models.
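This combination can be sketched as below; the penultimate-layer features are replaced by synthetic data of reduced dimensionality (a hypothetical stand-in: the real features are 4096-dimensional per channel), and the new softmax layer is trained by plain gradient descent on the log loss (a simplification of the mini-batch SGD used in the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (L-1)-layer features of three single input CNN models,
# replaced here by synthetic data of reduced dimensionality.
N, F = 200, 64
feats = [rng.normal(size=(N, F)) for _ in range(3)]
X = np.concatenate(feats, axis=1)    # multichannel combination of the features
y = rng.integers(0, 2, size=N)       # non-pedestrian (0) / pedestrian (1)

# New softmax classification layer, Gaussian-initialized with zero mean and
# variance 0.01 as in the text, trained by minimizing the logarithmic loss.
W = rng.normal(0.0, 0.1, size=(X.shape[1], 2))

def forward(W):
    z = X @ W
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return p, -np.log(p[np.arange(N), y] + 1e-12).mean()

_, loss_before = forward(W)
for _ in range(200):                 # plain gradient descent
    p, _ = forward(W)
    W -= 0.1 * (X.T @ (p - np.eye(2)[y]) / N)
_, loss_after = forward(W)           # the log loss decreases during training
```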
3.1.5 Overall detection methodology
The overall detection process comprises two parts, as shown in Figure 3.4. First, the Aggregate
Channel Features (ACF) detector [21] is applied to the test images in order to obtain the pedestrian candidate
windows (i.e., windows potentially containing pedestrians), output in the form of bounding boxes
that surround each detected person (containing the height, width and coordinates), each with a
corresponding confidence score.
Next, each of the candidate windows is extracted from the full image and passed through the multiple
input combination CNN model previously described, being classified as pedestrian or non pedestrian.
The candidate windows classified as non pedestrians are discarded and the ones classified as pedes-
trians maintain the bounding box and confidence score obtained with the ACF detector. Then, the
bounding box and the confidence score are utilized to perform the per-image evaluation of the overall
detector (as described in Section 4.1 of Chapter 4).
Figure 3.4: Scheme of the overall detection methodology.
3.2 Implementation
The developed method was implemented in Matlab and the MatConvNet Matlab toolbox [55] was used
to implement the CNN framework. Piotr's Computer Vision Matlab Toolbox [15] (2014, version
3.40 and including the channels and detector packages) was used to perform the detection with the
Aggregated Channel Features [21] method and to assess the performance of the proposed method,
including benchmarking against other methods and plotting the miss rate versus false positives per
image graph.
Chapter 4
Experimental Results
The results obtained with the CNN multichannel combination method proposed in Chapter 3 are pre-
sented in this chapter. The CNN single input channel results are also shown, in order to analyze the dif-
ference between the multichannel and single channel approaches. The performance evaluation method-
ology, the training methodology and the experimental setups are described and the results discussed.
For benchmarking purposes, the performances of several other state of the art PD methods are depicted
(including the performance of the ACF detector alone, without using the CNN) and compared with the
developed methodology.
4.1 Performance evaluation methodology
Regarding evaluation metrics, the most suitable methodology to evaluate pedestrian detectors was
shown to be based on per-image performance evaluation, instead of the per-window counterpart [20].
Accordingly, the most recent benchmarks in PD were obtained with full image evaluation metrics and,
therefore, this was the methodology used to evaluate the performance of the proposed method. A de-
tailed description of the full image evaluation metric is presented in [20] (and summarized in this section).
In the per-image evaluation metric, a detection window is slid over a grid of locations throughout
a full image, at multiple scales, outputting a bounding box and a confidence score per detection [20].
Afterwards, the detections placed in the same neighborhood are merged using non-maximal suppres-
sion, originating the definitive bounding boxes and confidence scores [20]. To determine if a detected
bounding box (BB_dt) and a ground truth bounding box (BB_gt) constitute a match, the PASCAL measure
is used, expressed as:
a_o = area(BB_dt ∩ BB_gt) / area(BB_dt ∪ BB_gt) > 0.5,  (4.1)
meaning that the overlap ratio (a_o) between the two bounding boxes must be greater than 50% [20].
Only one match between each BB_dt and BB_gt is allowed [20]. The first matches occur between the
detection bounding boxes having the highest confidence scores and the corresponding ground truth
bounding boxes (according to the PASCAL measure described previously) [20]. After these matches,
a single BB_dt might have been matched to several BB_gt. This ambiguity is solved by selecting the
highest overlapping match (ties are broken arbitrarily) [20]. False positives correspond to unmatched
BB_dt and false negatives to unmatched BB_gt [20].
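The PASCAL measure of Eq. (4.1) can be sketched as follows (boxes taken as (x, y, width, height) tuples, an assumed convention):

```python
def pascal_overlap(bb_dt, bb_gt):
    """PASCAL overlap measure a_o of Eq. (4.1); boxes as (x, y, width, height)."""
    x1 = max(bb_dt[0], bb_gt[0])
    y1 = max(bb_dt[1], bb_gt[1])
    x2 = min(bb_dt[0] + bb_dt[2], bb_gt[0] + bb_gt[2])
    y2 = min(bb_dt[1] + bb_dt[3], bb_gt[1] + bb_gt[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)      # area of the intersection
    union = bb_dt[2] * bb_dt[3] + bb_gt[2] * bb_gt[3] - inter
    return inter / union

full = pascal_overlap((0, 0, 10, 20), (0, 0, 10, 20))  # 1.0: a match
part = pascal_overlap((0, 0, 10, 20), (5, 0, 10, 20))  # 1/3: not a match (< 0.5)
```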
The detectors' performance can be illustrated with a curve, portraying the variation of the miss rate
(MR) with the false positives per image (FPPI), when the threshold established for the detection con-
fidence score is changed [20]. These curves have logarithmic scales and are particularly useful for
detector benchmarking purposes [20]. In order to concisely express the information contained in the
MR against FPPI performance curve, the log-average miss rate was introduced (henceforth, referred to
simply as miss rate, except in this section) [20], which is the average of nine miss rate values selected
for nine FPPI rates belonging to the interval [10^−2, 10^0] [20]. More specifically, these FPPI rates
correspond to points equally spaced (in the logarithmic domain) over the mentioned interval [20]. In
the special case where a certain miss rate value cannot be computed for a certain FPPI rate, because
the curve ended before sweeping through the entire FPPI interval (i.e., [10^−2, 10^0]), it is replaced
by the minimum miss rate obtained [20].
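Following the description above, the log-average miss rate can be sketched as below (note that reference implementations average the sampled miss rates in the logarithmic domain; the plain average is used here to match the text):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Miss rates sampled at nine FPPI points equally spaced in log space over
    [1e-2, 1e0] and then averaged. The curve is given as arrays sorted by
    ascending FPPI; where the curve ends before a reference point, the minimum
    miss rate is used."""
    refs = np.logspace(-2, 0, 9)
    samples = []
    for r in refs:
        below = miss_rate[fppi <= r]          # curve points up to this FPPI rate
        samples.append(below[-1] if below.size else miss_rate.min())
    return float(np.mean(samples))

fppi = np.logspace(-2, 0, 9)
lamr = log_average_miss_rate(fppi, np.full(9, 0.2))  # a constant curve gives 0.2
```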
4.2 Experimental setup
The experiments were performed on the INRIA dataset [13]. The methodology used to obtain the pos-
itive and negative training images is discussed in Section 4.3. Two main experimental setups were
considered: one including the entire INRIA training set with higher image resolution and the second one
using only a small portion of the INRIA training set with lower image resolution. Although the training
sets have a different number of images, the test sets have the same number of images, but with different
resolutions, i.e., for the higher resolution case the test set resolution is higher and for the lower resolution
case its resolution is lower.
In the first experimental setup, designated by full INRIA dataset, 17128 images were used for training
(90%) and validation (10%). In this set of 17128 images, 12180 are negative (i.e., without containing
pedestrians) and the remaining 4948 images include: 1237 positive images (i.e., containing pedestri-
ans), their horizontally flipped (mirrored) versions (another 1237 images) and random deformations applied to the
previous 2474 images (positive images and their flipped versions). The deformations consist in performing
cubic interpolation between randomly chosen start and end values for the width and height of the
image (drawn from a uniform distribution on the interval ]0, 15[), while preserving the original size. The
size of the images is 100 for the height and 41 for the width, for all input channels, corresponding to the
higher resolution images.
In the second experimental setup, designated by partial INRIA dataset, 400 images were used for
training (75%) and validation (25%). In this set of 400 images, 300 are negative and the remaining 100
images are positive, without data augmentation (no horizontal flipping and no random deformations),
which were randomly chosen from the entire set of positive images. The RGB images result from resizing
the ACF detections from 100x41x3 to 25x10x3 (height, width and depth dimensions, respectively; the
details about the ACF pre-processing and the training methodology are mentioned in Section 4.3). The
gradient magnitude and the gradient histogram result from the ACF features computation applied to the
100x41x3 RGB images and then shrunk by a factor of 4 (i.e., with size 25x10). The Felzenszwalb's
histogram of oriented gradients [24] was applied to the 100x41x3 RGB images, having size 12x5x32
and was then reshaped to the size 32x20x3. These are the lower resolution images.
The test set contains 1835 images which correspond to candidate windows obtained by running the
ACF detector on the 288 positive full test images of INRIA (each containing at least one pedestrian,
possibly more). This is the INRIA test set used to establish the benchmarks shown in [1] and must be
considered for comparison purposes.
4.3 Training methodology
Two training methodologies were considered to obtain the positive training images for the full INRIA
dataset experimental setup, namely: 1) cropping and resizing the INRIA positive training images and 2)
pre-processing the INRIA positive training dataset with the ACF detector. For the partial INRIA dataset,
the second training methodology was adopted.
For both cases, the negative images (i.e., without pedestrians) result from randomly extracting win-
dows of size 100x41x3 from the INRIA negative training images (i.e., 10 windows per negative full
image).
Concerning the positive training and validation images, their acquisition procedure differs depending
on the adopted methodology. In the first case, the positive train and validation images result from
cropping and resizing the INRIA positive training dataset to achieve the size 100x41x3 (corresponding
to the height, width and depth, respectively), keeping the pedestrians centered and fully present. The
cropping and resizing operations reduce the size of the images and decrease the computational
cost. Moreover, the dimensions 100x41x3 were selected to match those of the second methodology,
allowing a fair comparison between the performance of the two methodologies.
In the second case, to obtain the positive images (for training and validation), the ACF detector is
applied to the INRIA positive training dataset and the bounding boxes corresponding to the true positives
are selected (with size 100x41x3, which is the ACF detection window size). The original bounding
boxes, corresponding to the false negatives, are extracted as well (by comparing the true positives with
the ground truth). The true positives and the false negatives constitute the total set of positive images,
containing only true pedestrian bounding boxes. This methodology yields a more robust and
diversified training, since the method learns deformations, translations and occlusions that would not
have been experienced with a centered and adequately formatted version of the dataset. Furthermore,
the candidate images of the test set are obtained by applying the ACF detector, which indicates that
pre-processing the train images with the same detector might be advantageous.
For the previously stated reasons, the training methodology used to obtain the results in Section
4.4 was the second one, where the ACF was applied to the INRIA positive training dataset to obtain the
positive training images. The difference between the results reached with the two training methodologies
is analyzed in Subsection 4.4.3.
4.4 Results
Training and testing were performed using the Matlab-based MatConvNet toolbox [55] running on CPU
mode (no GPU was used) on a 2.50 GHz Intel Core i7-4710HQ with 12 GB of RAM and a 64 bit architecture.
No cross validation technique was used.
4.4.1 PD using the full INRIA dataset with 7 input channels
Each single input CNN model took approximately 4.5 hours to train and 3 minutes to test. The training
process only concerns the fine-tuning of the entire network (adapted to the pedestrian and non pedes-
trian classes and comprising 8 layers) with the INRIA [13] pedestrian dataset. The time spent pre-training
the initial CNN-F model [12] with a subset of the Imagenet [14] dataset was not taken into account. In
the multichannel combination CNN model, the feature extraction per input channel took approximately
from 2 hours to 3 hours and the test time was under 1 minute. However, estimating the runtime of the
entire method (in frames per second) requires the inclusion of the feature extraction time, besides the
classification time (calculated previously for the test phase).
The optimization algorithm used for training (included in the MatConvNet toolbox [55] ) was the mini-
batch stochastic gradient descent with momentum, having the following parameters: batch size equal to
100, number of epochs equal to 10, learning rate equal to 0.001 and momentum equal to 0.9.
Table 4.1 presents the results for the single input channels: RGB color space (denoted by RGB),
gradient magnitude (denoted by GradMag), horizontal derivative across all RGB channels (denoted
by Gx), vertical derivative across all RGB channels (denoted by Gy), grayscale (denoted by Grayscale),
YUV color space (denoted by YUV) and LUV color space (denoted by LUV). As mentioned in Subsection
3.1.3 of Chapter 3, the input channels with original size 100x41, which are the gradient magnitude and
grayscale, are replicated in order to achieve the size 100x41x3. The other input channels, having
original size 100x41x3, do not undergo this operation. Before entering the CNN, all inputs are resized
to 224x224x3 using cubic interpolation (since these are the network's expected input dimensions).
Table 4.2 depicts the performance when the following 4 input channels are combined: gradient mag-
nitude (GradMag), horizontal derivative (Gx), vertical derivative (Gy) and LUV color space (LUV) (ac-
cording to the methodology described in Figure 3.3 of Chapter 3). Table 4.3 shows the performance
for the combination of all the 7 input channels (according to the methodology described in Figure 3.3
of Chapter 3). The combination of the inputs: gradient magnitude, horizontal derivative and LUV color
space provides the best result, which is further improved by changing the training parameters (only
for the multichannel combination) as follows: the learning rate was changed to 0.0001, the number of
epochs to 80 and the batch size to 2000, resulting in a miss rate of 14.64%
(not shown in Tables 4.2 and 4.3, but presented in Figure 4.1).
The comparison of the results obtained with the developed method for the combination of different in-
put channels, using the full INRIA dataset, are shown in Figure 4.2. In order to perform a fair comparison
with the best result, the training parameters of each combination presented in Figure 4.2 were changed
to have the same values as the best method training parameters in the multichannel combination, i.e.:
the learning rate was changed to 0.0001, the number of epochs was changed to 80 and the batch size
was changed to 2000.
The comparison of the developed method best result, which occurs for the combination of the inputs
(using the full INRIA dataset): gradient magnitude (GradMag), horizontal derivative (Gx) and LUV color
space (LUV), with other PD benchmarks is presented in Figure 4.1 (denoted by Multichannel CNN in the
box).
Table 4.1: Miss rate % using single channels as input and without feature combinations for the full INRIA dataset.

Channel     Miss Rate %
RGB         16.27
GradMag     15.24
Gx          16.42
Gy          16.31
Grayscale   15.75
YUV         16.03
LUV         15.97
Table 4.2: Miss rate % using 4-feature combinations for the full INRIA dataset (a one in the table means that the feature of that channel is present in the combination and a zero represents its absence).

GradMag  Gx  Gy  LUV  Miss Rate %
   0      0   1   1     15.97
   0      1   0   1     15.94
   1      0   0   1     15.09
   0      1   1   0     16.12
   1      0   1   0     16.04
   1      1   0   0     15.62
   0      1   1   1     16.02
   1      0   1   1     14.91
   1      1   0   1     14.82
   1      1   1   0     15.99
   1      1   1   1     15.77
4.4.2 PD using the partial INRIA dataset with 4 input channels
Each single input CNN model took approximately 14 minutes to train and 1.5 minutes to test. The
training process only concerns the fine-tuning of the entire network (adapted to the pedestrian and non
pedestrian classes and comprising 8 layers) with the INRIA [13] pedestrian dataset. The time spent
pre-training the initial CNN-F model [12] with a subset of the Imagenet [14] dataset was not taken into
account. In the multichannel combination CNN model, the feature extraction per input channel took
approximately 10 minutes and the test time was under 1 minute. However, estimating the runtime of the
entire method (in frames per second) requires the inclusion of the feature extraction time, besides the
classification time (calculated previously for the test phase).
Table 4.3: Miss rate % using 7-feature combinations for the full INRIA dataset (a one in the table means that the feature of that channel is present in the combination and a zero represents its absence).

RGB  GradMag  Grayscale  Gx  Gy  YUV  LUV  Miss Rate %
 1      1         1       1   1   1    1     16.70
 1      1         0       1   0   0    1     16.55
 1      1         0       1   1   0    1     16.70
 0      0         1       1   1   1    1     15.95
 1      1         0       1   1   1    1     16.00
 1      0         1       0   1   1    1     16.75
 0      1         0       1   0   0    1     14.82
 0      1         0       1   1   0    1     15.77
 0      1         1       1   1   0    1     15.92
 0      1         0       1   1   1    1     15.89
 0      1         0       0   1   0    1     14.91
 0      1         0       1   0   0    0     15.62
 0      0         0       1   0   0    1     15.94
 1      1         0       1   0   0    0     16.46
 0      1         1       1   0   0    1     16.22
Figure 4.1: Comparison of the developed method's best result, denoted by Multichannel CNN, with other PD benchmarks for the full INRIA dataset. The box contains the log-average miss rate % for each method.
The optimization algorithm used for training was the mini-batch stochastic gradient descent with
momentum (included in the MatConvNet toolbox [55]), having the following parameters: batch size equal
to 10, number of epochs equal to 10, learning rate equal to 0.001 and momentum equal to 0.9.
Figure 4.2: Comparison of the results obtained with the developed method for the combination of different input channels for the full INRIA dataset. The box contains the log-average miss rate % for each method.
Table 4.4 presents the results for the single input channels: RGB color space with size 25x10x3
(denoted by RGB), gradient magnitude with size 25x10 (denoted by GradMag), gradient histogram in
the orientation range from 150 degrees to 180 degrees with size 25x10 (denoted by GradHist6) and the
reshaped Felzenszwalb’s histogram of oriented gradients [24] with size 32x20x3 (denoted by FHOG).
The performance for the combination of these input channels is depicted in Table 4.5. As mentioned in
Subsection 3.1.3 of Chapter 3, for two dimensional input channels, the image is replicated in order to fill
the third dimension. Before entering the CNN, all inputs are resized to 224x224x3 using cubic
interpolation (since these are the network's expected input dimensions).
Table 4.4: Miss rate % using single channels as input and without feature combinations for the partial INRIA dataset.

Channel     Miss Rate %
RGB         21.05
GradMag     24.23
GradHist6   21.83
FHOG        22.34
4.4.3 Discussion
By analyzing Table 4.1, Table 4.2 and Table 4.3, it is possible to observe that the combination of multiple
input channels is advantageous and can lead to improved results. In fact, the best result is a 14.64%
miss rate, obtained for the combination of the GradMag, Gx and LUV input channels (individually,
these channels perform worse: the miss rate of GradMag is 15.24%, the miss rate of Gx is 16.42%,
and the miss rate of LUV is 15.97%).

Table 4.5: Miss rate % using 4-feature combinations for the partial INRIA dataset (a one in the table means that the feature of that channel is present in the combination and a zero represents its absence).

RGB  GradMag  GradHist6  FHOG  Miss Rate %
 0      0        1         1     25.62
 0      1        0         1     25.17
 1      0        0         1     21.04
 0      1        1         0     23.52
 1      0        1         0     23.46
 1      1        0         0     23.82
 0      1        1         1     23.40
 1      0        1         1     18.68
 1      1        0         1     18.44
 1      1        1         0     21.76
 1      1        1         1     19.95
However, some combinations of input channels lead to miss rates that are worse than those of
their individual input channels alone (the interaction among some input channels seems to deteriorate
the overall performance). For example, the Gx miss rate (16.42%) is better than that of the 7 input
channel combination in Table 4.3 (16.70%), but worse than that of the 4 input channel
combination in Table 4.2 (15.77%).
Comparing the results from Tables 4.1, 4.2, 4.3, 4.4 and 4.5, it is reasonable to conjecture that, when
combining several input channels, a higher performance improvement can be obtained if each single
channel has reduced quality (e.g. in terms of resolution or dimensionality), whereas higher quality single
input channels seem to lead to a smaller (although still existent) improvement when combined. Accordingly,
the maximum improvement from the multichannel combination in the lower resolution single input channel case
(shown in Tables 4.4 and 4.5, where the images have lower dimensions than in Tables 4.1, 4.2 and
4.3) is 5.79% (the difference between the GradMag miss rate and the RGB, GradMag, FHOG
combination miss rate), whereas in the higher resolution single input channel case (shown in Tables 4.1,
4.2 and 4.3) it is 1.78% (the difference between the Gx miss rate and the GradMag, Gx, LUV
combination miss rate). Despite the resizing to 224x224x3 undergone by all images before entering
the CNN, the scarce image resolution affects the resize operation and causes a loss of image quality (from the
higher resolution to the lower resolution case, the height and width were reduced to approximately one
fourth of the initial size, except for FHOG, in which the height was reduced to approximately
one third and the width to one half of the initial RGB image size, before being transformed into FHOG).
According to the previous analysis, the proposed PD method is suited for the low resolution PD
problem. However, the low resolution images were only used in the CNN based approach. The ACF
method, which generated the candidate images, used the original INRIA test images at high resolution;
the candidate images were only resized (or reshaped) afterwards in order to obtain the lower
resolution. These low resolution candidate images were then fed into the CNN. Consequently, the low
resolution results only comprise the CNN methodology, not the entire detection system (which is composed
of the ACF and the developed CNN based method). The performance for the CNN multichannel
combination case cannot be compared with the performance of the ACF method alone (i.e., the baseline),
because the image resolution was not reduced before the generation of the candidate windows.
Nevertheless, within the scope of the CNN framework, it is still possible to compare the performance
of the single input channels with that of the multiple input channel combinations (as done
previously).
The performance improvements from channel combination are not more significant, possibly
due to the lack of heterogeneity among the channels (or, at least, they could be substantially
more heterogeneous). For instance, if the input channels represented different views of the pedestrians,
more noticeable differences would be expected. As a result, a better and more heterogeneous selection
of the input channels may enhance the performance of their combination. Indeed, the combination of
the RGB, GradMag, Gx and LUV input channels produces a miss rate (16.55%) that is 1.91% worse
than the miss rate of the combination of GradMag, Gx and LUV (14.64%), possibly due to
the redundancy between the RGB and LUV color spaces (the combination of RGB and LUV
may be incompatible as well, since the miss rates of the combination of RGB, GradMag and Gx, and
of the combination of GradMag, Gx and LUV, are better than the miss rates obtained after adding LUV and RGB,
respectively).
Regarding the sensitivity of the model, when the batch size hyperparameter increases, the performance
tends to improve in the single channel case (mainly when the training dataset is small),
while showing no significant differences in the multichannel combination case. Increasing the number
of epochs beyond 10 produces no substantial changes in performance (especially when the training
dataset is large), since the training and validation errors are not able to decrease substantially further. If
the training and validation errors are constant after some number of epochs (or vary only slightly, e.g. by less
than 0.5%), the training has stabilized. Indeed, increasing the number of epochs has no substantial effect
in the multichannel combination case.
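The stabilization criterion above can be expressed as a simple check on the recent error history (an illustrative sketch: the function name and the 3-epoch window are assumptions, while the 0.5% threshold comes from the text):

```python
def has_stabilized(errors, window=3, tol=0.5):
    """Training is considered stabilized when the last `window` epoch
    errors (in %) vary by less than `tol` percentage points."""
    if len(errors) < window:
        return False
    recent = errors[-window:]
    return max(recent) - min(recent) < tol

# e.g. validation error (%) per epoch: still decreasing vs. plateaued
still_training = has_stabilized([30.0, 22.0, 18.0])   # False
plateaued = has_stabilized([19.0, 16.2, 16.0, 15.9])  # True
```

A check of this kind justifies stopping at 10 epochs in the experiments, since further epochs leave the errors essentially unchanged.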
Concerning the adopted training methodology, the results reached with the ACF pre-processing training
methodology (the one in use) were superior, by less than approximately 1%, to the ones obtained
with the cropped and resized INRIA positive training images (as discussed in Section 4.3). A relevant
remark is that the developed PD method is robust with respect to the adopted training methodology,
with maximum variations in the results of approximately 1%.
The selected experimental setups, namely the full INRIA dataset with higher resolution images and
the partial INRIA dataset with lower resolution images, are intended to test two extreme cases by varying the
resolution of the images and the amount of data used during training. When more data and higher
resolution images are available, the network undergoes a better training procedure, leading to the best
result (e.g. a 14.64% miss rate for the combination of the GradMag, Gx and LUV inputs). Conversely,
when the images have lower resolution and the amount of data used is substantially scarcer, the
training of the network is not as good, leading to the worst situation regarding resolution and data
quantity (e.g. an 18.44% miss rate for the combination of the RGB, GradMag and FHOG inputs).
The best result achieved with the developed method occurs for the combination of the inputs
gradient magnitude (GradMag), horizontal derivative (Gx) and LUV color space (LUV), using the full
INRIA dataset (as depicted in Figures 4.1 and 4.2). When compared with other PD benchmarks, as shown
in Figure 4.1 (denoted by Multichannel CNN in the box), it is possible to conclude that the proposed
approach is competitive with the state of the art (it is among the top 7 methods for the INRIA dataset, according
to the benchmarks acquired in [1]) and introduces an improvement of 2.64% over the
ACF method alone (displaying a better performance curve).
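For reference, the log-average miss rate reported in the boxes of Figures 4.1 and 4.2 follows the standard Caltech evaluation protocol [20]: the miss rate is sampled at nine FPPI values evenly spaced in log-space between 10^-2 and 10^0 and averaged in log space. A minimal sketch of this computation (illustrative only; the evaluation in this Thesis relies on the standard benchmark toolbox [15]):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Log-average miss rate: miss rate sampled at nine FPPI values
    log-spaced in [1e-2, 1e0], averaged in log space (geometric mean).
    `fppi` and `miss_rate` describe the detector's miss-rate-vs-FPPI curve."""
    ref = np.logspace(-2, 0, 9)
    # interpolate the curve at the reference points, in log-FPPI space
    sampled = np.interp(np.log10(ref), np.log10(fppi), miss_rate)
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))

# e.g. a detector with a constant 20% miss rate over the FPPI range
fppi = np.logspace(-3, 1, 50)
lamr = log_average_miss_rate(fppi, np.full(50, 0.20))
```

Averaging in log space keeps the summary consistent with the log-log miss-rate-vs-FPPI curves plotted in the benchmark figures.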
Chapter 5
Conclusions
5.1 Thesis overview
Throughout this master thesis, the PD problem was formulated and its main challenges identified.
The existing datasets, benchmarks and evaluation methodologies were reviewed. The most relevant
PD approaches were discussed, including an analysis of Deep Learning based PD methods and
other relevant techniques. The background on CNNs was outlined. The proposed method, comprising the
multichannel input combination, was described. Specifically, the chosen pre-trained CNN model
was fine-tuned with each single input channel, and the final CNN model was built by combining
the features of each input. The experimental results, obtained by applying the proposed method to the
INRIA dataset, were presented in conjunction with a benchmark against other state of the art PD
methods.
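The combination step summarized above can be sketched schematically (a toy Python illustration with stand-in feature extractors; the function names and feature dimensions are hypothetical, not the Thesis implementation):

```python
import numpy as np

def combine_channel_features(extractors, channels):
    """Schematic multichannel combination: each per-channel fine-tuned CNN
    maps its input to a feature vector, and the vectors are concatenated
    into a single descriptor for the final classifier."""
    return np.concatenate([f(c) for f, c in zip(extractors, channels)])

# Toy stand-ins for three fine-tuned CNNs (e.g. GradMag, Gx, LUV), each
# producing a 4-dimensional feature vector from a 224x224x3 input
extractors = [lambda x, i=i: np.full(4, float(i)) for i in range(3)]
channels = [np.zeros((224, 224, 3)) for _ in range(3)]
descriptor = combine_channel_features(extractors, channels)  # 12-D descriptor
```

The concatenated descriptor is what the final stage classifies, which is why adding a redundant channel can dilute rather than enrich the representation.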
5.2 Achievements
An innovative method to solve the PD problem was proposed, based on the combination of
different input channels using CNNs. In particular, the performed experiments motivate the application of
this method to low resolution PD, since it is possible to create synergies between low resolution inputs,
leading to enhanced detection rates in the input combination case. The experimental results obtained
using the full INRIA dataset are competitive with state of the art PD approaches. Moreover, when
multiple input channels are available, the developed method can be applied in other areas
beyond the scope of PD, allowing inputs to be integrated to achieve improved performance.
5.3 Future Work
This work can be expanded by selecting more heterogeneous input channels, such as pedestrian body
parts (which can be obtained with [10], for example) or different views of the pedestrians (although these
are difficult to obtain from the current pedestrian datasets).
Furthermore, pre-trained CNN models with more layers (deeper) and diversified architectures could
be fine-tuned in order to extract more representative features for each input channel.
The low resolution results obtained in this Thesis can be extended to the entire detection system
(which is composed of the ACF and the developed CNN based method). In Chapter 4, the experimental
setup was not adequate for evaluating the ACF method's performance at low resolution. To
solve this problem, the image resolution can be reduced before applying the ACF method to
generate candidate windows. Afterwards, the low resolution candidate images could be fed into the
CNN. As a result, the performances for the CNN single channel and multichannel combination cases
could then be compared with the performance of the ACF method alone (i.e., the baseline).
Regarding the detection of pedestrian candidate windows for the test set, a distinct detector, such
as the Square Channels Feature or the Roerei detectors [5], could be used. Another possibility is to
integrate the multiscale sliding window task and the pedestrian bounding box prediction task into the
CNN, similarly to [50], instead of using an external detector to generate pedestrian candidate windows
(i.e., the ACF detector would not be needed).
Bibliography
[1] Caltech pedestrian detection benchmark. URL www.vision.caltech.edu/Image_Datasets/CaltechPedestrians. Last Access: July, 2015.
[2] Pedestrian traffic light image. URL http://novuslight.com/transportation-solution-for-pedestrians-in-cologne_N1232.html. Last Access: July, 2015.
[3] Pedestrian traffic sign image. URL http://iica.de/pd/index.py. Last Access: July, 2015.
[4] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool. Pedestrian detection at 100 frames per
second. In CVPR, 2012.
[5] R. Benenson, M. Mathias, T. Tuytelaars, and L. Van Gool. Seeking the strongest rigid detector. In
CVPR, 2013.
[6] R. Benenson, M. Omran, J. H. Hosang, and B. Schiele. Ten Years of Pedestrian Detection, What
Have We Learned? CoRR, abs/1411.4304, 2014.
[7] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):
1–127, 2009. doi: 10.1561/2200000006. Also published as a book. Now Publishers, 2009.
[8] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. CoRR,
abs/1206.5533, 2012.
[9] Y. Bengio, I. J. Goodfellow, and A. Courville. Deep learning. Book in preparation for MIT Press,
2015. URL http://www.iro.umontreal.ca/~bengioy/dlbook.
[10] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human pose annotations.
In International Conference on Computer Vision (ICCV), 2009.
[11] G. Carneiro, J. Nascimento, and A. Bradley. Unregistered multiview mammogram analysis with
pre-trained deep learning models. To appear in MICCAI, 2015.
[12] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the Devil in the Details: Delving
Deep into Convolutional Nets. In British Machine Vision Conference, 2014.
[13] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In International
Conference on Computer Vision & Pattern Recognition, volume 2, pages 886–893, 2005.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical
Image Database. In CVPR, 2009.
[15] P. Dollar. Piotr's Computer Vision Matlab Toolbox (PMT). http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.
[16] P. Dollar, Z. Tu, P. Perona, and S. Belongie. Integral Channel Features. In BMVC, 2009.
[17] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian Detection: A Benchmark. In CVPR, June
2009.
[18] P. Dollar, S. Belongie, and P. Perona. The fastest pedestrian detector in the west. In BMVC, 2010.
[19] P. Dollar, R. Appel, and W. Kienzle. Crosstalk Cascades for Frame-Rate Pedestrian Detection. In
ECCV, 2012.
[20] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian Detection: An Evaluation of the State of
the Art. PAMI, 34, 2012.
[21] P. Dollar, R. Appel, S. Belongie, and P. Perona. Fast Feature Pyramids for Object Detection. PAMI,
2014.
[22] L. Fei-Fei and A. Karpathy. Notes on Convolutional Neural Networks from the course CS231n: Convolutional Neural Networks for Visual Recognition, taught at Stanford University, Winter quarter, 2015. URL http://cs231n.stanford.edu/. Last Access: July, 2015.
[23] P. F. Felzenszwalb, D. A. McAllester, and D. Ramanan. A discriminatively trained, multiscale, de-
formable part model. In CVPR, 2008.
[24] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discrimina-
tively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence,
32(9):1627–1645, 2010.
[25] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant
learning. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
CVPR, pages 264–271, 2003.
[26] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark
suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[27] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detec-
tion and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2014.
[28] C. Gu, J. J. Lim, P. Arbelaez, and J. Malik. Recognition using regions. In IEEE Computer Society
Conference on Computer Vision and Pattern Recognition CVPR, pages 1030–1037, 2009.
[29] J. H. Hosang, M. Omran, R. Benenson, and B. Schiele. Taking a Deeper Look at Pedestrians.
CoRR, abs/1501.05790, 2015.
[30] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Multimodal neural language models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pages 595–603, 2014.
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional
Neural Networks. In NIPS, pages 1106–1114, 2012.
[32] J. J. Lim, C. L. Zitnick, and P. Dollar. Sketch tokens: A learned mid-level representation for contour
and object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages
3158–3165, 2013.
[33] Z. Lin and L. S. Davis. A pose-invariant descriptor for human detection and segmentation. In ECCV,
pages 423–436, 2008.
[34] P. Luo, Y. Tian, X. Wang, and X. Tang. Switchable deep network for pedestrian detection. CVPR,
2014.
[35] M. Mathias, R. Benenson, R. Timofte, and L. J. V. Gool. Handling occlusions with franken-
classifiers. In IEEE International Conference on Computer Vision ICCV, pages 1505–1512, 2013.
[36] W. Nam, P. Dollar, and J. H. Han. Local decorrelation for improved pedestrian detection. In Ad-
vances in Neural Information Processing Systems 27: Annual Conference on Neural Information
Processing Systems, pages 424–432, 2014.
[37] A. Ng, J. Ngiam, C. Y. Foo, Y. Mai, C. Suen, A. Coates, A. Maas, A. Hannun, B. Huval, T. Wang, and
S. Tandon. Deep Learning Tutorial. URL http://ufldl.stanford.edu/tutorial. Last Access:
July, 2015.
[38] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceed-
ings of the 28th International Conference on Machine Learning, ICML, pages 689–696, 2011.
[39] W. Ouyang and X. Wang. A discriminative deep model for pedestrian detection with occlusion
handling. CVPR, 2012.
[40] W. Ouyang and X. Wang. Single-pedestrian detection aided by multi-pedestrian detection. CVPR,
2013.
[41] W. Ouyang and X. Wang. Joint deep learning for pedestrian detection. ICCV, 2013.
[42] W. Ouyang, X. Zeng, and X. Wang. Modeling mutual visibility relationship in pedestrian detection.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 3222–3229, 2013.
[43] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Strengthening the effectiveness of pedestrian
detection with spatially pooled features. CoRR, abs/1407.0786, 2014.
[44] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Pedestrian detection with spatially pooled
features and structured ensemble learning. CoRR, abs/1409.5209, 2014.
[45] C. Papageorgiou and T. Poggio. A trainable system for object detection. International Journal of
Computer Vision, 38(1):15–33, 2000.
[46] D. Park, D. Ramanan, and C. Fowlkes. Multiresolution models for object detection. In ECCV, pages
241–254, 2010.
[47] P. Sabzmeydani and G. Mori. Detecting pedestrians by learning shapelet features. In IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition CVPR, 2007.
[48] E. Seemann, B. Leibe, K. Mikolajczyk, and B. Schiele. An evaluation of local shape-based features
for pedestrian detection. In Proceedings of the British Machine Vision Conference, 2005.
[49] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised
multi-stage feature learning. CVPR, 2013.
[50] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated
recognition, localization and detection using convolutional networks. In International Conference
on Learning Representations (ICLR). CBLS, April 2014.
[51] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep boltzmann machines. In
F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Pro-
cessing Systems 25, pages 2222–2230. Curran Associates, Inc., 2012.
[52] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[53] M. Taiana, J. C. Nascimento, and A. Bernardino. On the purity of training and testing data for
learning: The case of Pedestrian Detection. Neurocomputing, vol. 150, Part A:214–226, 2015.
URL http://www.sciencedirect.com/science/article/pii/S0925231214012636.
[54] A. Vedaldi. AIMS Big Data, Lecture 3: Deep Learning, 2015. URL http://www.robots.ox.ac.uk/~vedaldi/assets/teach/2015/vedaldi15aims-bigdata-lecture-4-deep-learning-handout.pdf. Last Access: July, 2015.
[55] A. Vedaldi and K. Lenc. MatConvNet – Convolutional Neural Networks for MATLAB (including the
manual). CoRR, abs/1412.4564, 2014.
[56] A. Vedaldi and A. Zisserman. VGG Convolutional Neural Networks Practical. URL http://www.robots.ox.ac.uk/~vgg/practicals/cnn. Last Access: July, 2015.
[57] M. Viola, M. J. Jones, and P. Viola. Fast multi-view face detection. CVPR, 2001.
[58] P. A. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer
Vision, 57(2):137–154, 2004.
[59] P. A. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appear-
ance. In ICCV, pages 734–741, 2003.
[60] S. Walk, N. Majer, K. Schindler, and B. Schiele. New features and insights for pedestrian detection.
In IEEE Conference on Computer Vision and Pattern Recognition CVPR, pages 1030–1037, 2010.
[61] C. Wojek and B. Schiele. A performance evaluation of single and multi-feature people detection. In
Pattern Recognition (DAGM), 2008.
[62] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by
bayesian combination of edgelet part detectors. In IEEE International Conference on Computer
Vision ICCV, pages 90–97, 2005.
[63] B. Wu and R. Nevatia. Cluster boosted tree classifier for multi-view, multi-pose object detection. In
IEEE International Conference on Computer Vision ICCV, pages 1–8, 2007.
[64] X. Zeng, W. Ouyang, and X. Wang. Multi-stage contextual deep learning for pedestrian detection. ICCV, 2013.
[65] J. Yan, X. Zhang, Z. Lei, S. Liao, and S. Z. Li. Robust multi-resolution pedestrian detection in traffic
scenes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3033–3040,
2013.
[66] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural
networks? CoRR, abs/1411.1792, 2014.
[67] S. Zhang, C. Bauckhage, and A. B. Cremers. Informed haar-like features improve pedestrian de-
tection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 947–954,
2014.
[68] S. Zhang, R. Benenson, and B. Schiele. Filtered channel features for pedestrian detection. CoRR,
abs/1501.05759, 2015.
[69] Q. Zhu, M. Yeh, K. Cheng, and S. Avidan. Fast human detection using a cascade of histograms
of oriented gradients. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition CVPR, pages 1491–1498, 2006.