Pedestrian Detection with Multichannel Convolutional Neural Networks
David José Lopes de Brito Duarte Ribeiro
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Professor Jacinto Carlos Marques Peixoto do Nascimento
Professor Alexandre José Malheiro Bernardino
Examination Committee
Chairperson: Professor João Fernando Cardoso Silva Sequeira
Supervisor: Professor Jacinto Carlos Marques Peixoto do Nascimento
Member of the Committee: Professor Mário Alexandre Teles de Figueiredo
December, 2015
Acknowledgments

I would like to thank my supervisors, Professor Jacinto Nascimento and Professor Alexandre Bernardino, for giving me the opportunity to work with them; I can say with certainty that this experience was extremely enriching for me. More specifically, I thank them for all the guidance, suggestions, comments, time and attention they devoted, which contributed valuably to this Thesis.

I also thank my family, in particular my parents and my sister, for all the support they gave me and for all the experiences and opportunities they provided. I am also grateful for the education and upbringing they gave me, which undoubtedly helped make me a better person.

Finally, I thank Matteo Taiana for his explanations regarding pedestrian detection methods and the evaluation of their performance.
Resumo

The detection of people in general, and of pedestrians (PD) in particular, are important and challenging tasks in the context of human-machine interaction, with applications in surveillance, robotics and Advanced Driver Assistance Systems. Their main challenges stem from the variability in pedestrian appearance (e.g. regarding clothing) and the similarities with other objects (e.g. traffic signs).

This work proposes an innovative method to approach the PD problem, based on Convolutional Neural Networks (CNN). More concretely, a CNN model was trained for each single input channel (e.g. RGB or LUV) and high-level representations were extracted from the penultimate layer. Finally, a multichannel input CNN model was trained (partially or fully) with these representations. During testing, the images were pre-processed with the Aggregated Channel Features (ACF) detector to generate pedestrian candidate windows. These windows were then fed to the multichannel CNN model, being effectively classified as pedestrians or non-pedestrians.

The developed method is competitive with the state of the art when evaluated on the INRIA dataset, yielding improvements over the baseline method (ACF). Two experiments were carried out, namely, using the full INRIA dataset at high resolution and part of the INRIA dataset at low resolution. Additionally, the devised methodology can be successfully applied to the low resolution PD problem, with the potential to be extended to other areas through the integration of information from several inputs.

Keywords: Pedestrian Detection, Convolutional Neural Networks, single input channel, multichannel CNN model, high-level representations
Abstract

The detection of people in general, and Pedestrian Detection (PD) in particular, are important and challenging tasks in human-machine interaction, with applications in surveillance, robotics and Advanced Driver Assistance Systems. Their main challenges are due to the high variability in pedestrian appearance (e.g. concerning clothing), and the similarities with other classes resembling pedestrians (e.g. traffic signs).

This work proposes an innovative method to approach the PD problem, based on the combination of heterogeneous input channel features obtained with Convolutional Neural Networks (CNN). More specifically, a CNN model is trained for each single input channel (e.g. RGB or LUV) and high level features are extracted from the penultimate layer (right before the classification layer). Finally, a multichannel input CNN model is trained (partially or fully) with these high level features. During testing, the full images are pre-processed with the Aggregated Channel Features (ACF) detector in order to generate pedestrian candidate windows (i.e., windows potentially containing pedestrians). Next, these candidate windows are fed to the multichannel input CNN model, being effectively classified as pedestrians or non-pedestrians.

The developed method is competitive with other state-of-the-art approaches when evaluated on the INRIA dataset, achieving improvements over the baseline ACF method. Two experimental setups were adopted, namely, the full INRIA dataset at higher resolution and the partial INRIA dataset at lower resolution. Furthermore, the devised methodology can be successfully applied to the low resolution PD problem, and promisingly extended to other areas by integrating information from several inputs.

Keywords: Pedestrian Detection, Convolutional Neural Networks, single input channel, CNN multichannel input model, high level features
Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
Glossary

1 Introduction
1.1 Motivation and Problem Formulation
1.2 State of the art in PD and related work
1.2.1 PD methods and related approaches
1.2.2 PD Datasets
1.2.3 PD Evaluation metrics
1.3 Objectives
1.4 Main contributions
1.5 Dissertation outline

2 Convolutional Neural Networks
2.1 Introduction
2.2 Training and testing problem formulation
2.3 Operations and layers
2.3.1 Convolutional layer
2.3.2 Fully connected layer
2.3.3 Pooling
2.3.4 Activation function
2.3.5 Normalization
2.3.6 Classification
2.3.7 Loss function
2.4 Transfer Learning
2.4.1 Pre-training
2.4.2 Fine-tuning
2.4.3 Feature extraction

3 Proposed method
3.1 Description of the Proposed Method
3.1.1 Datasets and Input channels
3.1.2 Pre-trained CNN model
3.1.3 CNN single input model
3.1.4 CNN multichannel input combination model
3.1.5 Overall detection methodology
3.2 Implementation

4 Experimental Results
4.1 Performance evaluation methodology
4.2 Experimental setup
4.3 Training methodology
4.4 Results
4.4.1 PD using the full INRIA dataset with 7 input channels
4.4.2 PD using the partial INRIA dataset with 4 input channels
4.4.3 Discussion

5 Conclusions
5.1 Thesis overview
5.2 Achievements
5.3 Future Work

Bibliography
List of Tables

4.1 Miss rate (%) using single channels as input and without feature combinations for the full INRIA dataset
4.2 Miss rate (%) using 4 feature combinations for the full INRIA dataset
4.3 Miss rate (%) using 7 feature combinations for the full INRIA dataset
4.4 Miss rate (%) using single channels as input and without feature combinations for the partial INRIA dataset
4.5 Miss rate (%) using 4 feature combinations for the partial INRIA dataset
List of Figures

1.1 Some examples of PD challenges
1.2 Overview of the pedestrian datasets
2.1 Diagram of the expanded CNN model
2.2 Diagram illustrating the methodology used to feasibly perform the backpropagation procedure
2.3 Example of the backpropagation procedure
2.4 Example of the convolution operation
2.5 Example of the max pooling operation
2.6 Examples of activation functions
2.7 Cross-channel normalization scheme
3.1 Example images of the input channels
3.2 Pre-trained model architecture
3.3 Scheme of the single input and multiple input combination CNN models
3.4 Scheme of the overall detection methodology
4.1 Comparison of the developed method's best result, denoted Multichannel CNN, with other PD benchmarks for the full INRIA dataset
4.2 Comparison of the results obtained with the developed method for the combination of different input channels for the full INRIA dataset
Glossary

ACF Aggregated Channel Features
BBdt Detected bounding box
BBgt Ground truth bounding box
CNN Convolutional Neural Networks
CSS Color self-similarity
DCT Discrete cosine transform
DF Decision Forests
DL Deep Learning
DN Deep Network
DPM Deformable Part Models
FHOG Felzenszwalb Histogram of Oriented Gradients
FPPI False positives per image
FPS Frames per second
GradHist6 Gradient histogram in the orientation range from 150 degrees to 180 degrees
GradMag Gradient Magnitude
Gx Horizontal derivative across all RGB channels
Gy Vertical derivative across all RGB channels
HOG Histogram of Oriented Gradients
MR Miss rate
NMS Non Maximal Suppression
NN Neural Networks
PD Pedestrian Detection
RBM Restricted Boltzmann Machine
ReLU Rectified Linear Unit
SGD Stochastic Gradient Descent
SVM Support Vector Machine
Chapter 1
Introduction
1.1 Motivation and Problem Formulation
The ability to detect people is an important and challenging component of human-machine interaction.
This has been an active area of research in recent years due to its wide range of applications, e.g. automotive safety, surveillance, entertainment, robotics, aiding systems for the visually impaired, and Advanced Driver Assistance Systems, to name a few.
Pedestrian Detection (PD) consists in the detection of people in typical standing poses. Its main challenges are due to the high variability in pedestrian appearance and the similarities with other classes of objects resembling pedestrians. In Advanced Driver Assistance Systems and surveillance applications, the PD task might include walkers, skateboard and bicycle users. The appearance of the pedestrians is influenced by the pose, clothing, lighting conditions, backgrounds, scales and even by the atmospheric conditions. Besides that, occlusions (among pedestrians and between pedestrians and other objects), background clutter, deformations, viewpoint variations, low resolution and the resemblances between pedestrians and other classes of objects (interclass resemblances, such as the resemblances between pedestrians and mannequins, pedestrian traffic signs and pedestrian traffic lights) are other complexities inherent to this task. The referred challenges and complexities are illustrated in Figure 1.1.
As a result, the previously mentioned challenges motivate the adoption of methods that simultaneously have the representative power to capture general pedestrian traits at multiple scales, and are robust to pedestrian intra-class variability and inter-class similarities. Moreover, the ability to combine various sources of information about the pedestrians may lead to improved results.
Recently, Convolutional Neural Networks (CNN) have successfully been applied to several vision tasks such as general object classification [14, 31], general object detection [52, 50] and even PD [41, 49, 29], which demonstrates their potential to further explore the PD problem and deal with the PD challenges.
This thesis reviews the current pedestrian detection state of the art and uses a combination of the Aggregated Channel Features (ACF) [21] detector and CNNs to perform the PD task. Besides the regular CNN use (single channel input), an innovative method is implemented based on the combination of features extracted from CNN models obtained with different inputs (multichannel combination). In particular, this method is applied to the challenging low resolution PD problem.
Figure 1.1: Some examples of PD challenges: a) reflections of pedestrians in windows; b) drawings of people; c) mannequins; d) a pedestrian-sized poster; e) a pedestrian traffic sign and crowd-related occlusions; f) pedestrian traffic lights; and g) a gumball machine resembling a pedestrian. a), b), c) and d) were adapted from [53], e) was obtained from [3], f) was obtained from [2] and g) was obtained from the INRIA dataset [13].
1.2 State of the art in PD and related work
The PD problem has been studied for more than a decade, with several datasets, benchmarks, evaluation metrics and over 40 methods being developed and established in the process. The following subsection provides an extended overview of the most relevant methodologies proposed in the field, in conjunction with other potentially useful approaches.
1.2.1 PD methods and related approaches
The literature is rich and provides a great diversity of methods devoted to the PD problem. These methodologies can be categorized into three main families according to [6]: Decision Forests (DF) (possibly using boosting techniques, and the most common approach), Deformable Part Models (DPM) variants, and Deep Networks (DN). Additionally, some works consider the combination of diverse families, such as [39], which integrates DPM and DN to tackle the PD occlusion problem.
A survey of PD detectors (DF and DPM families) from 2000 to 2012 is presented in [20], a discussion
about the PD problem from around 2004 to 2014 is shown in [6] and a brief survey of Deep Learning
(DL) based PD methods (i.e., the DN family) is presented in [29]. The most relevant methodologies and
concepts contained in these surveys are summarized herein, complemented by other techniques outside
of their scope. Relevant approaches in other fields of study, but with the potential to be successfully applied to PD, are included as well.
In general terms, most detectors comprise three main components: feature representation, classification schemes and training methodology. However, this distinction is less evident in approaches that jointly learn several modules of the detector (e.g. [41]). In fact, [5] provides an experimentally supported overview of the detector's design options, comprising: feature pooling, normalization and selection, pre-processing, weak classifiers, and training methods and sets. The features can be obtained from different input channels, transformations or pre-processing (e.g. HOG, RGB and LUV are some of the features and channels considered in [16, 21]). Regarding the classifiers, it is possible to differentiate between two main frameworks: monolithic and part-based. While in the monolithic case the entire image is taken into account to detect the pedestrians (DF family, e.g. [13]), in the part-based case, deformation models and partial detections contribute to determine the probability that a pedestrian is present (DPM family, e.g. [24]). Training can be performed with a linear (more frequently) or non-linear Support Vector Machine (SVM) (e.g. [13]), boosting, or more complex methodologies.
Accordingly, the work of [68] aims to unify the framework of several PD methods (such as Integral Channel Features [16], Aggregated Channel Features [21], Squares Channel Features [5], Informed Haar [67] and Locally Decorrelated Channel Features [36]) by proposing a detector model in which a boosted decision forest uses filtered low-level features. Furthermore, a new method is devised, based on the particular choice of the filters used, and different filter bank design options are explored and supported with experimental results.
In recent years, the improvement in the performance of PD methodologies has relied mainly on the use of better features (concerning dimensionality and complexity) [6], as noticed when comparing the prominent benchmarks of Viola and Jones [58], HOG [13], Integral Channel Features [16], Squares Channel Features [5] and Spatial Pooling [44]. Indeed, in [13] the feature quality was improved by the introduction of the Histogram of Oriented Gradients (HOG), followed by a linear SVM classifier. The HOG features were widely adopted and extended in the literature (e.g. [69] increased the detection speed of [13]), even in current approaches (e.g. [21, 68]).
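As a concrete illustration of the HOG idea, the following is a minimal NumPy sketch, not the full pipeline of [13] (gradient interpolation and block normalization are omitted): each cell accumulates a histogram of unsigned gradient orientations weighted by gradient magnitude, and the concatenated, normalized histograms form the descriptor that would be fed to a linear SVM.

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Minimal HOG-style descriptor: per-cell histograms of unsigned
    gradient orientation, weighted by gradient magnitude."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned, in [0, 180)
    h, w = image.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            idx = (a / (180.0 / bins)).astype(int) % bins
            for b in range(bins):
                hist[i, j, b] = m[idx == b].sum()
    v = hist.ravel()
    return v / (np.linalg.norm(v) + 1e-6)  # global L2 normalization
```

For a 128x64 window with 8x8 cells and 9 bins this yields a 16x8x9 = 1152-dimensional vector, comparable in spirit (though not in detail) to the descriptor dimensionality used in sliding window detectors.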
Sliding window based PD methodologies are the most successful and the most commonly found in the literature, in contrast to keypoint [48] or segmentation [28] based approaches. Among the pioneers of sliding window PD methods is [45], which resorted to the Haar wavelet transform in conjunction with SVMs to build an overcomplete dictionary of features at multiple scales. This work was extended by [58], which developed a faster technique (named Integral Images) to compute the features, applied AdaBoost to sparsely extract the most important visual attributes, and proposed a cascade of classifiers to constrain the computational effort to the most promising regions. These core principles are still applied, to some degree or variation, in most current detectors (mainly in the DF and DPM families, but also in combination with DN approaches).
In [16], the Integral Images technique proposed in [58] was applied to different image channels (namely, gradient histograms, grayscale, RGB, HSV, LUV, and gradient magnitude) in order to extract features (in the form of local sums and Haar feature variants, expanding the set of image channels used to compute the features in [58]), which were subsequently processed by a boosted classifier. This method was improved in [21] by changing the features from local sums in image channels to pixel lookups in aggregated channels (resulting in the Aggregated Channel Features method). A variant of [16] was introduced in [5] (named Squares Channel Features), by limiting the image channels to HOG (with gradient magnitude) and LUV, and choosing candidate features according to square feature pooling subwindows (contained within the model window). In the same work [5], a final detector (named Roerei) was proposed, considering different scales, global normalization and an enhanced feature pool.
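The Integral Images technique underlying these channel-feature detectors can be sketched in a few lines of NumPy (a sketch of the general idea, not the implementation of [58] or [16]): after one cumulative-sum pass over a channel, the sum of any rectangular region is obtained with four lookups, regardless of the rectangle's size.

```python
import numpy as np

def integral_image(channel):
    """Integral image with a zero row/column prepended, so the sum over
    any rectangle is obtained with four lookups."""
    ii = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1))
    ii[1:, 1:] = channel.cumsum(0).cumsum(1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of channel[r0:r1, c0:c1] in O(1) from the integral image."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]
```

This constant-time rectangle sum is what makes evaluating thousands of local-sum features per detection window computationally feasible.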
The complementarity of different types of features has been shown to yield improved results, as discussed in [61] for the combination of HOG features, Haar wavelets and shape context using different classifiers. Expanding this approach to include motion features (obtained from optical flow) and a novel feature named color self-similarity (CSS), as proposed by [60], allows further gains to be attained. More recently, in [6], optical flow information, context information (i.e., interactions between two people) and channel augmentation using a discrete cosine transform (DCT) were added to the Squares Channel Features detector [5], resulting in an improvement of over 10% relative to the baseline Squares Channel Features method [5] on the Caltech-USA reasonable dataset (the described method is named Katamari-v1, with a total of 40 channels, 30 of which originate from the convolution of 10 channels, i.e. HOG plus gradient magnitude plus LUV, with 3 DCT basis functions).
Recent PD detectors include [67], which explores prior knowledge of the upright human body shape. This is accomplished by computing Haar-like features according to a model of the head, upper body and lower body parts. Another approach was proposed in [36], which builds on the ACF framework [21] by creating a methodology to locally decorrelate the channel features. As a result, orthogonal decision trees can be adequately and efficiently used instead of resorting to the more computationally expensive oblique decision trees. The approaches proposed in [43, 44] contemplate the use of spatially pooled low-level features in conjunction with the optimization of the area under the receiver operating characteristic curve only in the range of the detector evaluation (i.e., from 10^-2 to 10^0 false positives per image (FPPI)).
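The evaluation range mentioned above is the one over which the standard log-average miss rate metric is computed: the miss rate is sampled at reference FPPI points log-spaced in [10^-2, 10^0] and averaged in log space. A minimal NumPy sketch follows; the reference points and the geometric mean mirror the common Caltech-style protocol, while the interpolation of the curve is a simplification assumed here for illustration.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, lo=1e-2, hi=1e0, n=9):
    """Average miss rate at n reference FPPI points log-spaced in
    [lo, hi]. fppi must be strictly increasing."""
    refs = np.logspace(np.log10(lo), np.log10(hi), n)
    # sample the miss rate curve (in log-FPPI) at the reference points
    mr = np.interp(np.log10(refs), np.log10(fppi), miss_rate)
    # geometric mean of the sampled miss rates
    return np.exp(np.mean(np.log(np.maximum(mr, 1e-10))))
```

A detector whose curve is flat at 20% miss rate over the whole range scores exactly 0.2, which is why this single number is a convenient summary for ranking detectors.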
Initially, PD methods were mainly focused on enhancing accuracy (measured in miss rate versus false positives per image). Nevertheless, the emergence of real time applications, such as surveillance and robotics, demanding higher detection rates (measured in frames per second (FPS)), raised attention to the importance of combining detection accuracy with adequate detection speed. Sliding window based detectors require the computation of image pyramids and corresponding features at each scale, which can be computationally expensive and incur unsatisfactory runtimes. As a result, new methodologies were proposed (e.g. [18, 21]) to approximate the feature computation (including gradient histograms) at multiple scales, resulting in substantial runtime improvements with slight accuracy reductions. In fact, the method described in [18] reaches almost real time PD (i.e., approximately 5 FPS) using 640x480 images, whereas the detector in [57, 58, 59], devised more than ten years earlier, achieves detection rates of 15 FPS using 384x288 video. Additionally, [21] achieves detection rates above 30 FPS, [19] reaches performances in the range of 35 to 65 FPS (on 640x480 images) and [4] achieves 50 FPS for monocular images (using a GPU) and 135 FPS for stereo images (using CPU and GPU) (a more detailed review of detector speeds is presented in [21]).
In the scope of deformable part models, an unsupervised method was devised in [25] to learn scale invariant part constellations, considering object attributes such as shape, appearance, occlusion and relative scale. More recent DPM methodologies are proposed in [24, 23], where deformable part models comprising a root filter, part filters and respective deformation models (named star models) were applied to images at multiple scales and positions. A score dependent on each element of the star model is computed. Training is performed using the latent SVM approach, with partially labeled data. Various star models are combined and a score is assigned to each position and scale of the images. As a result, this methodology addresses the variability existing in the object and pedestrian classes.
An emerging and challenging problem consists in performing PD with very low resolution images. The additional challenges are noticeable in many surveillance systems, because high resolution images with a detailed shape description may not be available. In automotive systems, the detection of pedestrians far away from the vehicle (i.e., at low and medium resolutions) guarantees a safety margin, essential to perform adequate driving maneuvers. Therefore, it is desirable to achieve satisfactory performance levels at medium and low resolutions, but without employing expensive high resolution cameras, since that would constitute an obstacle to the deployment of PD methods in the automotive systems framework [20]. However, detection performance at low resolution remains disappointing, as mentioned in [20]. To address the PD problem at low resolutions and multiple scales, the works of [46] and [65] introduced improvements to the DPM approach proposed in [24]. The work of [46] created an adaptive model that works as a rigid model at low resolutions and changes to a deformable part model at high resolutions, while in [65], a multi-task model was applied to pedestrians at distinct resolutions in order to capture their similarities and dissimilarities.
Further considerations on part, shape, pose and articulation representations are present in the literature. In [62], body part detectors contemplating edgelet features (which correspond to short segments belonging to a line or curve) were devised to tackle PD in crowded scenes. Learning was achieved by a boosting technique. A joint likelihood model was designed to include occlusions among pedestrians (besides the regular PD task), by considering the output of the aforementioned detectors. Further development followed in [63], where this method was applied to the multiview paradigm. The method described in [47] introduces shapelet mid-level features, which result from applying AdaBoost to oriented gradients over local image windows (denoted low-level features). These shapelet features are then processed with AdaBoost, originating the final classifier. In [10], poselets were developed to represent and estimate parts of people's poses. PD and the localization of body elements (e.g. torso or nose) are possible by applying a linear SVM classifier to the poselet features, followed by further classification or regression. Additionally, the work of [33] proposed an HOG based pose invariant descriptor, resulting from the application of a part template model, which extracts pedestrian shapes and poses. In [32], sketch tokens were introduced, a mid-level feature representation aimed at capturing the edges and contours existing in images, subsequently applying this information to pedestrian and object detection.
The Deep Learning (DL) framework has recently been applied to the PD problem (e.g. [39, 42, 29]),
contemplating the use of distinct DN models such as Restricted Boltzmann Machines (RBMs) and CNNs.
The DL based methods belong to the DN family and were utilized to tackle occlusion, background clutter,
pedestrian intra-class variability, among other PD challenges.
The occlusion problem is addressed in [39, 42], which build on the deformable part model framework [24]. The work of [39] learns the relationship among the visibilities of the detected pedestrian parts with a deep model (comprising RBMs). In [42], a deep model is developed to predict the mutual visibility relationship among pedestrians. Additionally, the work of [40] is specifically focused on modelling pedestrian interactions and pedestrian group scenes (including single pedestrian detection as well). Finally, [41] devised a methodology to capture the interaction among feature extraction, deformation handling, occlusion handling and classification, by integrating and jointly learning these modules with a DN. Another approach to mitigate occlusion, but outside the scope of the DN family, is presented in [35], which proposes an efficient training procedure that enables the application of several classifiers particularly devoted to the occlusion task.
With the objective of dealing with background clutter and the variability within the pedestrian class, [34] developed a DN comprising: a convolutional layer to capture lower level features, Switchable Restricted Boltzmann Machine based layers to obtain higher level feature combinations and body salience maps, and a logistic regression layer for classification.
In [64], the authors propose a DN inspired by cascaded classifiers with joint training. Each classifier in the network's hierarchy refines the previous classifications as the top of the network is progressively reached. In fact, the contextual information about the features and detection scores from each network stage contributes to the decision made in the following stage.
The network pre-training subject is approached in [49], where a CNN is initialized with a unsupervised
learning methodology (consisting in convolutional sparse coding) in order to overcome the scarcity of
data available when training on the INRIA dataset [13] (which contains 614 positive images and 1218
negative images). Afterwards, the CNN is fine-tuned with the INRIA dataset. The features from multiple
layers are merged in the classifier, to obtain features that capture low level and high level pedestrian
traits (such as silhouettes and facial details). Besides that, [11] and [29] pre-train a CNN by transferring
the parameters of another network, already trained with a subset of the Imagenet [14] dataset, followed
by fine-tuning (i.e., training) with the dataset pertaining to the task of interest (mammogram datasets in
the case of [11], and pedestrian datasets in the case of [29]).
Regarding the type of inputs entered in the networks used in PD, they vary according to the adopted
methodology and can be: the RGB color space [29], the YUV color space [49], HOG [39, 40], a com-
bination of HOG and CSS [64], and a combination of YUV and gradients [34, 41]. Further experiments
were performed in [29], by using as input the LUV color space and other combinations of LUV and HOG.
The detection task is more complex than the classification task, since it is intended to assign a
class to each object (in the PD case, decide if a certain detection window contains a pedestrian or not)
and output the location of the object (in the PD case, the bounding box only surrounds pedestrians,
because the rest of the classes and the background are considered irrelevant). As mentioned before in
this subsection, the most common architecture of the detectors in PD is based on the sliding window
approach. In this technique a classifier evaluates if a detection window of a specific size contains a
pedestrian of the corresponding height. The detection window is slid over a grid of locations throughout
an entire image (outputting the confidence in the presence of a pedestrian in that window), at multiple
scales, followed by non-maximum suppression (NMS) in order to merge overlapping bounding boxes.
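The NMS step can be sketched as a greedy procedure (a minimal NumPy illustration, not the exact variant used by any particular detector): boxes are visited in order of decreasing score, and any remaining box that overlaps the accepted one beyond a threshold is discarded.

```python
import numpy as np

def nms(boxes, scores, overlap_thresh=0.5):
    """Greedy non-maximum suppression on [x1, y1, x2, y2] boxes."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # indices by decreasing score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection of the accepted box with the remaining ones
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= overlap_thresh]   # drop heavily overlapping boxes
    return keep
```
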
Some methods (mainly in the DN family) might not comprise the entire detector architecture, lacking
the scanning of the image with detection windows at multiple scales and the NMS, or were originally
designed to perform as classifiers. As a result, they require the use of other methods (faster although
still accurate) to output candidate windows (also known as regions of interest or detection proposals),
instead of fully scanning the images themselves. Moreover, this procedure reduces the computational
effort spent by the more expensive methods, leading to runtime speed-ups.
Accordingly, the detection can be performed by first using a method to select pedestrian candidate
windows and then applying another methodology to determine if the selected regions correspond, in
fact, to pedestrians or not (i.e., the detection quality is refined by a more powerful classifier). The works
published in [41, 64, 34] adopt this procedure by applying HOG, CSS and a linear SVM detector to
generate candidate windows, which are only afterwards entered into the deep network.
Additionally, [29] uses the Aggregated Channel Features detector and the Squares Channel Features
detector as the region of interest selection methods.
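This two-stage arrangement (a fast proposal method followed by a stronger classifier) can be sketched generically; `proposal_fn` and `cnn_score_fn` below are hypothetical placeholders for, e.g., an ACF-style detector and a trained CNN, not actual implementations of any cited method.

```python
import numpy as np

def detect(image, proposal_fn, cnn_score_fn, threshold=0.0):
    """Two-stage detection: a fast method proposes candidate windows,
    then a (more expensive) classifier rescores and filters them."""
    detections = []
    for (x1, y1, x2, y2) in proposal_fn(image):
        crop = image[y1:y2, x1:x2]            # candidate window contents
        score = cnn_score_fn(crop)            # refined confidence
        if score > threshold:
            detections.append(((x1, y1, x2, y2), score))
    return detections
```

Since the refinement stage only sees the proposed windows, the overall runtime stays manageable even with an expensive second classifier.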
Conversely, although applied to the object detection task, [50] convolves a network (deeper than the
one in [49]) with full images, being able not only to classify but also to locate objects, without requiring the
use of other methods to select candidate windows.
Although not present in the PD Deep Learning literature, the combination of multiple inputs
is a promising idea with the potential to be successful in PD. It is explored in [30] and in [51],
where features for images and text are jointly learned. In [30], a Convolutional Network was used to
jointly train neural language models conditioned on images. In [51], a Deep Boltzmann Machine model
is employed to jointly learn text and images, and multimodal (text and images) or unimodal queries
can be submitted, being possible to generate the missing modality in the unimodal case. Besides
that, [38] uses deep networks (including a Deep Autoencoder model and RBMs) to learn audio and
video information, resorting to the multimodal fusion, cross modality learning and shared representation
learning settings. Furthermore, the work of [11] addresses multiview mammogram analysis, by using
individual deep CNN models for each view (namely, craniocaudal and mediolateral oblique views for the
standard mammogram, micro-calcifications and masses), extracting the features for each view from the
corresponding CNN and then combining them in a joint CNN model. The methodology proposed in [11],
leads to significant improvements for the multiview combination case, when compared with the single
view results, which shows the relevance of combining multiple input channels.
1.2.2 PD Datasets
The datasets are designed to provide a substantial and representative sample of the aforementioned
PD challenges, requiring continuous updates and reformulations, in order to foster progress in the
PD field of study.
An overview of the most relevant pedestrian datasets is presented in Figure 1.2, adapted from
[20]. Although INRIA is still extensively utilized (mainly for training), recently, the most commonly used
datasets for benchmarking [6] are Caltech-USA [17] and KITTI [26].
The most recent benchmarks, corresponding to several methods and to the datasets: Caltech-USA
(training and testing), Caltech-Japan, INRIA, ETH, TUD-Brussels and Daimler (Daimler-DB), are re-
ported online at [1]. Besides that, the website in [1] allows the continuous submission and comparison
of new methods' benchmarks. Additionally, the benchmarks for more than 40 methods predominantly
resorting to the Caltech-USA [17] dataset (although residually including INRIA and KITTI datasets) are
discussed in [6]. Older benchmarks, established for 16 detectors using the aforementioned datasets,
are presented in [20].
The detection can be performed in static images (photographs), surveillance videos or images re-
sulting from video (not from surveillance, but from continuous filming in other contexts) [20]. The need
to manually select images in the photograph case incurs selection bias, which is unlikely to be
experienced in the video case [20]. Moreover, according to [53], using only pure data samples (i.e.,
where the pedestrian can be unambiguously identified) can prevent the detection of partially occluded
or smaller samples, and cause the unfair benchmarking of different detectors (since the impure samples
may be considered false positives). Furthermore, using video instead of static images makes it possible
to exploit optical flow, leading to improved detection results [6].
Figure 1.2: Overview of the pedestrian datasets (adapted from [20]).
1.2.3 PD Evaluation metrics
The adequate definition of evaluation metrics is of crucial importance in order to ensure a fair comparison
among detection methods, to establish benchmarks, and to correctly approximate the real detector
performance levels when applied to a specific task of interest.
The existing PD performance evaluation methodologies are per-window and per-image based [20].
In the per-window methodology, the detector classifies image windows containing pedestrians against
windows without pedestrians. Despite being suited to assess the performance of classifiers and frame-
works proposing automatic regions of interest, this methodology can have shortcomings when evaluating
detectors [20], because it assesses the performance of the classifier instead of considering the entire
detection system.
As a result, the per-image evaluation methodology superseded per-window metrics in 2009 (as proposed
in [17] and later in [20]). In the per-image evaluation metric, a detection window is slid over a grid
of locations throughout a full image and a bounding box and a confidence score are outputted for each
multiscale detection, followed by NMS and the application of the PASCAL measure, to determine the
existing matches between the groundtruth and the detected bounding boxes (the detailed methodology
is described in Section 4.1 of Chapter 4).
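The PASCAL measure referred to above matches a detection to a ground-truth box when their intersection-over-union exceeds 0.5; a minimal sketch:

```python
def pascal_match(det, gt, threshold=0.5):
    """PASCAL overlap criterion for [x1, y1, x2, y2] boxes: match when
    area(intersection) / area(union) exceeds the threshold (usually 0.5)."""
    iw = max(0, min(det[2], gt[2]) - max(det[0], gt[0]))  # intersection width
    ih = max(0, min(det[3], gt[3]) - max(det[1], gt[1]))  # intersection height
    inter = iw * ih
    area_det = (det[2] - det[0]) * (det[3] - det[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_det + area_gt - inter) > threshold
```
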
In practice, per-window performance does not predict per-image performance (only a weak correlation
exists between the two) [20], disproving the conjecture that better per-window results lead to better
performance on full images.
In the per-window evaluation methodology, each image in the selected dataset only needs to be
labelled binarily, as containing a pedestrian or not. However, if the per-image evaluation methodology is
used, the location of each pedestrian, in each image of the dataset, must be provided (constituting the
groundtruth). Consequently, the referred metric demands the effort of manually annotating each image.
This process of annotating pedestrians in each image of a dataset might depend on the application,
particularly, in situations where the appearance of pedestrians and other objects are identical. This
occurs, for example, in the case of images of mannequins, posters, photographs, drawings or reflections
on mirrors and windows [53] (these examples are depicted in images a), b), c) and d) of Figure 1.1).
1.3 Objectives
This thesis proposes an innovative methodology to address the PD problem, based on the Deep Learn-
ing framework. More specifically, Convolutional Neural Networks with multichannel combination are
used (including a pre-processing with the ACF detector). This approach aims to deal with the various
PD challenges and to explore additional information existent in heterogeneous channels.
Moreover, this work intends to demonstrate that, when presented with several input channels (applied
to a CNN), a useful and promising hypothesis to improve performance consists in combining (entirely
or partially) the features extracted from top-level CNN layers. Particularly, the more challenging subject,
regarding the state of the art, of low resolution PD is expected to be successfully tackled by the proposed
methodology.
1.4 Main contributions
The main contributions of this master thesis are four-fold:
• A review of the state of the art in pedestrian detection and an outline of the CNNs background.
• An innovative method to approach the PD problem, based on multiple input channels combination
with CNNs. The obtained results support the conjecture that combining multiple sources of
input can improve the overall performance, when compared to the single input cases.
• Application of the proposed method to the low resolution PD problem. This subject is more chal-
lenging than the standard PD problem and is scarcely explored in the literature, constituting a
novel contribution from this thesis. According to the obtained results, the combination of several
low resolution inputs leads to a significant improvement, when compared with the results of each
single input.
• Competitive results achieved with the developed method and performance benchmarking against
some of the best performing pedestrian detection approaches.
1.5 Dissertation outline
This thesis is structured in five chapters. In Chapter 2 the CNNs background is discussed.
The CNNs are introduced and the training and testing problems are formulated. The main operations and
layers comprised in these networks are detailed and the pre-training, fine-tuning and feature extraction
procedures are mentioned.
Chapter 3 describes the proposed method, consisting of the combination of several input channels
using CNNs. The selected datasets and input channels are specified in conjunction with the pre-training
and fine-tuning techniques for the CNN single input model. The multichannel input combination pro-
cedure is discussed, where the features extracted from each CNN input model are combined in a joint
CNN model. The overall detection methodology is outlined and the implementation details (software and
toolboxes) are mentioned.
The experimental results achieved with the proposed CNN input multichannel combination method
are depicted in Chapter 4. First, the performance evaluation methodology, the experimental setups
behind the attained results and the training methodology are introduced. Afterwards, the results for the
two experimental setups (namely, full INRIA dataset with better resolution and partial INRIA dataset with
lower resolution) are presented in the form of tables and miss rate versus false positives per image plots.
Finally, a comparison with the state of the art is performed and the obtained results are discussed.
Chapter 5 provides an overview of the thesis, describes the most significant achievements and pro-
poses possible extensions to the developed work.
Chapter 2
Convolutional Neural Networks
This chapter provides a brief outline of the Convolutional Neural Networks background. The CNNs are
defined and the contextualization in the Deep Learning framework is mentioned. Afterwards, the CNN
training and testing problems are formulated and the CNN operations and layers are discussed. Finally,
the transfer learning subject is approached by analyzing the tasks of pre-training, fine-tuning and feature
extraction.
2.1 Introduction
CNNs are variants of Neural Networks (NN), specially designed to exploit structured input data with
lattice-like topology [9], such as: sound, which can be interpreted as a one dimensional time series
data; images, which can be interpreted as a two or three dimensional grid of pixels; and video, which
can be interpreted as a four dimensional input composed by a temporal sequence of (three dimensional)
images. Moreover, these networks derive from the Deep Learning framework, assuming deep architec-
tures that comprise various layers of non-linear operations, and with more representational power than
more shallow ones (i.e., with 1, 2 or 3 layers, for example) [7]. As a result, when compared with NN, the
CNN architecture handles the high dimensionality of inputs, attains invariance to slight changes, reduces
the number of connections and parameters, and is more robust to overfitting.
More formally, a CNN consists in a multi-layer processing architecture that can be regarded as a
function f assigning input data, denoted by x0 ∈ RH0×W0×D0, to an output, denoted by y ∈ {1, . . . , C}
(where C is the number of classes), and where the subscript 0 denotes the raw input data before any
layer computation [55]. Alternatively, the output can be a vector y ∈ RC with the probabilities (or scores)
of x0 belonging to each of the C classes. For instance, a CNN can be used to classify an image (in this
case, H0, W0 and D0 denote the height, width and depth, respectively) according to a defined set of
classes, by outputting a probability for each class (for RGB images, the depth equals 3 and for grayscale
images the depth equals 1). This function f results from composing diverse functions f1, . . . , fl, . . . , fL,
typically assuming a sequential topology (which is the one considered herein), and where the subscript
denotes the network module (or layer). However, a more complex disposition in the form of a directed
acyclic graph is also possible [56]. In this thesis the former topology is adopted as follows [55]:
f(x0,w) = fL(. . . f2(f1(x0,w1,b1),w2,b2) . . . ,wL,bL), (2.1)
where w = [w,b] = [w1, . . . ,wL] = [w1,b1, . . . ,wL,bL] represents the parameters, which include the
weights (denoted by w) and the biases (denoted by b).
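Equation 2.1 is a plain function composition, which can be mirrored directly in code; a toy sketch with two placeholder modules (an affine map and a ReLU), not an actual CNN:

```python
import numpy as np

def forward(x0, layers):
    """Sequential CNN model f(x_0) = f_L(... f_2(f_1(x_0)) ...):
    each module maps the previous feature map to the next."""
    x = x0
    for f in layers:
        x = f(x)
    return x

# Toy modules: f_1 has parameters (w_1, b_1); f_2 (ReLU) has none.
w1 = np.array([[2.0, 0.0], [0.0, 2.0]])
b1 = np.array([1.0, -5.0])
layers = [lambda x: x @ w1 + b1,
          lambda x: np.maximum(0.0, x)]
y = forward(np.array([1.0, 1.0]), layers)   # -> array([3., 0.])
```
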
In fact, each function (e.g. fl) represents a module that generates feature maps as output (e.g.
xl) [55]. Starting from the input x0 ∈ RH0×W0×D0 , the output of each network module is respectively
denoted by x1,x2, . . . ,xL, where xl ∈ RHl×Wl×Dl , wl ∈ RH′l×W ′l×Dl×Ql and bl ∈ RQl , for an arbitrary
module (or layer) l (as shown in Figure 2.1). Hl,Wl, Dl denote the height, width and depth of the feature
map (respectively), H ′l and W ′l denote the height and width of the weights (or filters), respectively, and
Ql denotes the number of weights (or filters) and biases (further details are provided in Section 2.3). For
instance, the feature map produced at module l depends on the results of the preceding modules and
is denoted by xl = fl(xl−1; wl). Some modules do not contain weights and biases, as is the case of the
activation functions (e.g. the rectified linear unit (ReLU)).
The most common CNN modules (to be detailed in Section 2.3) are as follows: (i) convolution layers,
(ii) fully connected layers, (iii) pooling layers (e.g. max or average pooling), (iv) activation functions (e.g.
hyperbolic tangent, sigmoid function and rectified linear unit (ReLU)), (v) different types of normaliza-
tion (e.g. contrast normalization or cross-channel normalization), (vi) classification layers (e.g. softmax
classification layer) and (vii) loss functions (which are appended to the core architecture of the CNN in
order to train the network, e.g. cross entropy loss or hinge loss). Two main structural organizations of
the network components are possible [9]: several simple layers or only a few complex layers (e.g. only
convolutional and fully connected layers), where each one has stages that correspond to simpler oper-
ations (e.g. the convolution, the activation function and the pooling are comprised in the convolutional
layer).
2.2 Training and testing problem formulation
The process of training a CNN aims to learn the parameters w = [w1,b1, . . . ,wL,bL], given a set of
training data D = {(xi0, yi) : i = 1, . . . , N}, in order to obtain a representative model, capable of generalizing the
information obtained during training, and correctly classifying new unseen inputs during the test phase.
The subscript 0 denotes the raw input data before any layer computation and the superscript denotes
the i-th data sample. When there is only one data sample or mentioning the dataset is superfluous, the
superscript i is suppressed (as was done in Section 2.1). The data is contained in xi0 ∈ RH×W×D and
the corresponding class is indicated by yi ∈ Y = {1, ..., C}, where C is the number of classes and N is
the number of training data samples.
Since the focus is on the classification task, the objective function is chosen to be discriminative [56],
allowing to achieve the distribution of the labels given the input data, i.e., p(y|x0), instead of being gen-
erative, which provides a representation of the joint distribution of the input data and the corresponding
class, i.e., p(x0, y).
Consequently, the training problem can be cast as the following optimization problem:
\[
\arg\min_{\mathbf{w}} J(\mathbf{x}_0,\mathbf{y},\mathbf{w}) = \arg\min_{\mathbf{w}_1,\mathbf{b}_1,\ldots,\mathbf{w}_L,\mathbf{b}_L} \frac{1}{N}\sum_{i=1}^{N} l\big(f(\mathbf{x}_0^i;\mathbf{w}_1,\mathbf{b}_1,\ldots,\mathbf{w}_L,\mathbf{b}_L), y^i\big) + \frac{\lambda}{2}\big(\|\mathbf{w}\|^2 + \|\mathbf{b}\|^2\big), \tag{2.2}
\]
where l is a loss function¹ (e.g. cross entropy loss), l : f(xi0,w) → R, which penalizes the errors
associated with wrong class predictions (i.e., for each input xi0, the CNN model produces an incorrect
prediction f(xi0; w) if it is different from its associated label yi, and a correct one if it is equal to its label
yi) [55]. The squared Euclidean norm of the weights and biases acts as a regularization penalty, in order
to reduce their size and contribute to mitigate overfitting. The parameter λ controls the relevance of the
regularization approach, when compared with the loss term [22].
For training purposes, the loss function (i.e., l) can be appended to the end of the network (as de-
picted in Figure 2.1) by composing it with the model of the CNN (i.e., f(x0,w)), resulting in the expanded
CNN model, denoted by z = l ◦ f(x0,w) [55] (the superscript notation was suppressed for simplicity,
since the results apply to each data sample i).
Figure 2.1: Diagram of the expanded CNN model (obtained from [55]).
The optimization problem described in Equation 2.2 can be solved by using backpropagation with
stochastic gradient descent (SGD), or its variant mini-batch SGD with momentum, since it typically pro-
cesses substantial amounts of training data [55]. While the standard gradient descent (also known as
batch gradient descent) uses the complete training set in each iteration, SGD acts only on one training
example in each iteration and mini-batch SGD utilizes only a subset of the complete training set in each
iteration (i.e., a batch with some training examples) [8]. Although this distinction assumes a pure view
of the SGD method, this assumption can be relaxed to allow batches in SGD, in which case SGD
and mini-batch SGD correspond to the same approach. The SGD approaches mentioned are less
computationally expensive than standard gradient descent, but require approximately independent sampling
of the batches or examples [8]. Adding momentum to the original SGD, smooths the gradient and, in
a physical interpretation, allows the parameters to acquire velocity (or momentum), which aggregates
previous gradient values [22, 9].
Indeed, resorting to mini-batch SGD with momentum [54], the parameters of the objective function J
are updated according to:
\[
\mathbf{m}^{\mathbf{w}}_{t+1} = \mu_t \mathbf{m}^{\mathbf{w}}_t + \eta_t \frac{\partial J}{\partial \mathbf{w}_t}, \tag{2.3a}
\]
\[
\mathbf{w}_{t+1} = \mathbf{w}_t - \mathbf{m}^{\mathbf{w}}_{t+1}, \tag{2.3b}
\]
\[
\mathbf{m}^{\mathbf{b}}_{t+1} = \mu_t \mathbf{m}^{\mathbf{b}}_t + \eta_t \frac{\partial J}{\partial \mathbf{b}_t}, \tag{2.3c}
\]
\[
\mathbf{b}_{t+1} = \mathbf{b}_t - \mathbf{m}^{\mathbf{b}}_{t+1}, \tag{2.3d}
\]
¹The notation regarding the letter l was previously used to index an arbitrary network layer. Whether l refers to the loss function or a network layer should be clear from the context.
where µt ∈ ]0, 1] is the momentum value (a hyperparameter) at the t-th iteration (because it can be
increased during training iterations), ηt is the learning rate (a hyperparameter) at the t-th iteration (since
it can be decreased during training iterations), m^w_t and m^b_t are the momentum terms at the t-th iteration
for the weights and the biases (respectively), wt are the weights at the t-th iteration, and bt are the
biases at the t-th iteration [37, 54]. Regarding the weights and biases notation, the module (or layer)
subscript was temporarily replaced by the iteration subscript for simplicity, since the equations
are valid for every layer².
Without momentum, the parameter update for mini-batch SGD consists in [54]:

\[
\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \frac{\partial J}{\partial \mathbf{w}_t}, \tag{2.4a}
\]
\[
\mathbf{b}_{t+1} = \mathbf{b}_t - \eta_t \frac{\partial J}{\partial \mathbf{b}_t}. \tag{2.4b}
\]
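The momentum update can be written in a few lines; a sketch assuming scalar hyperparameters and NumPy arrays for the parameters and gradients:

```python
import numpy as np

def momentum_step(w, m, grad, lr, mu):
    """One mini-batch SGD update with momentum: the momentum term
    accumulates the scaled gradients and is then subtracted from the
    parameters.  With mu = 0 this reduces to the plain SGD update."""
    m = mu * m + lr * grad     # m_{t+1} = mu_t m_t + eta_t dJ/dw_t
    w = w - m                  # w_{t+1} = w_t - m_{t+1}
    return w, m

w, m = np.array([1.0]), np.array([0.0])
w, m = momentum_step(w, m, np.array([1.0]), lr=0.1, mu=0.9)  # w = [0.9]
w, m = momentum_step(w, m, np.array([1.0]), lr=0.1, mu=0.9)  # w = [0.71]
```

The second step illustrates the velocity effect: the accumulated momentum (0.19) exceeds the plain gradient step (0.1), since previous gradient values are aggregated.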
Other hyperparameters and refinements introduced by the mini-batch SGD with momentum are: the
number of epochs (one epoch corresponds to a complete sweep across the training set) and the batch
size (the number of training examples used at a certain iteration occurring in each epoch) [37, 54].
The parameters are initialized randomly or transferred from a learned representation (as discussed
in Section 2.4). The hyperparameters can be obtained through cross-validation or from reference values
and procedures [22].
The partial derivative of the objective function J, present in Equations 2.3a and 2.4a, comprises two
parts: the derivative of the loss function term, i.e.:

\[
\frac{\partial L}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} = \frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} = \frac{\partial}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} \frac{1}{N}\sum_{i=1}^{N} l\big(f(\mathbf{x}_0^i; \mathbf{w}_1,\ldots,\mathbf{w}_L), y^i\big), \tag{2.5}
\]
and the derivative of the regularization term, i.e.:

\[
\frac{\partial R}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} = \frac{\partial}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} \frac{\lambda}{2}\|\mathbf{w}_l\|^2, \tag{2.6}
\]

where an arbitrary layer l is considered (the subscript now indexes the layer rather than the iteration);
this derivative is simple to calculate.
To compute the partial derivative of the loss function term of the objective function J with respect to
the weights in the l-th layer, it is necessary to perform a forward pass in the expanded CNN model (i.e.,
from x0 to z as illustrated in Figure 2.1) and apply the chain rule in conjunction with backpropagation
2The operator ∂ denotes the partial derivative. Computing the partial derivative with respect to a considered multidimensionalvariable, corresponds to the application of this operator to each component of the matrix.
from the end of the network to the beginning, as follows [55]:

\[
\frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} = \frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{x}_L)^T}\,\frac{\partial \mathbf{x}_L}{\partial (\mathrm{vec}\,\mathbf{x}_{L-1})^T}\cdots\frac{\partial \mathbf{x}_{l+1}}{\partial (\mathrm{vec}\,\mathbf{x}_l)^T}\,\frac{\partial \mathbf{x}_l}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T}, \tag{2.7}
\]
where the vec function reshapes its argument to a column vector and the derivatives are computed at
the working point, set when the input was forwarded through the network [55].
The intermediate matrices involved in these calculations have high dimensionality, leading to expensive
and even unfeasible computations [55]. To solve this problem, the explicit computation of these
intermediate matrices must be circumvented. As shown in Figure 2.2, the network modules from an
arbitrary intermediate layer (or module) l to the end of the network z can be aggregated in the function
h = l ◦ fL(·; wL) ◦ fL−1(·; wL−1) ◦ · · · ◦ fl+1(·; wl+1) [55]. The derivatives of the composition h ◦ fl can be expressed as [55]:
\[
\frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{x}_{l-1})^T} = \frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{x}_l)^T}\,\frac{\partial\,\mathrm{vec}\,\mathbf{x}_l}{\partial (\mathrm{vec}\,\mathbf{x}_{l-1})^T}, \tag{2.8a}
\]
\[
\frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T} = \frac{\partial z}{\partial (\mathrm{vec}\,\mathbf{x}_l)^T}\,\frac{\partial\,\mathrm{vec}\,\mathbf{x}_l}{\partial (\mathrm{vec}\,\mathbf{w}_l)^T}, \tag{2.8b}
\]
which are now feasible to compute, since ∂z/∂(vec xl−1)T has the same dimensions as xl−1 and
∂z/∂(vec xl)T has the same dimensions as xl [55]. By applying this methodology recursively, it is possible to backpropagate
the derivative ∂z/∂(vec xl)T, computed at an arbitrary layer l, to the preceding layer l − 1. Only the derivatives
∂z/∂(vec wl)T and ∂z/∂(vec xl−1)T need to be computed, avoiding the explicit computation of the high dimensional
intermediate matrices [55].
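This recursion can be illustrated with element-wise toy layers, where every Jacobian is diagonal and the backward pass reduces to pointwise products; this is a didactic sketch (with an identity loss), not the general matrix form.

```python
import numpy as np

# Each layer is a pair (f, df): the module and its element-wise derivative.
layers = [(lambda x: 3.0 * x, lambda x: 3.0 * np.ones_like(x)),
          (lambda x: x ** 2,  lambda x: 2.0 * x)]

def backward(x0, layers):
    """Forward pass storing each working point x_l, then backward pass
    propagating dz/dx_l from the output back to the input."""
    xs = [x0]
    for f, _ in layers:
        xs.append(f(xs[-1]))
    grad = np.ones_like(xs[-1])               # dz/dx_L for the identity loss
    for (_, df), x in zip(reversed(layers), reversed(xs[:-1])):
        grad = grad * df(x)   # dz/dx_{l-1} = dz/dx_l * dx_l/dx_{l-1}
    return grad

g = backward(np.array([1.0]), layers)   # d/dx of (3x)^2 = 18x -> [18.]
```

Note that each local derivative is evaluated at the working point stored during the forward pass, exactly as required in the text.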
The overall backpropagation procedure (including its feasible variant described previously) can be
applied to the biases b in a similar way.
Figure 2.2: Diagram illustrating the methodology used to feasibly perform the backpropagation proce-dure (adapted from [54]).
Figure 2.3 provides an abstract illustration of the backpropagation procedure.
In the test phase, the input to test is forwarded through the previously trained original CNN model
(without the loss function), outputting a prediction, which is compared with the corresponding label to
determine the test error.
Figure 2.3: Example of the backpropagation procedure (obtained from [54]).
2.3 Operations and layers
2.3.1 Convolutional layer
As mentioned in Section 2.1, the CNN explores the properties of data with lattice-like topology, to obtain
representative deep models in a computationally viable manner, to which task the convolutional layer
contributes significantly.
Since the statistics across natural images (and similarly in other inputs) repeat themselves, patterns
emerge locally, which can be represented by features [37]. These features are learned in certain regions
of the image but, due to their meaningfulness, can then be used throughout the entire image [37]. In
fact, the feature learning process is based on the activations obtained from convolving a set of filters or
kernels (with more reduced dimensions than the input) with the input image [37, 9]. For example, edge
representations can be found in the first layer of the CNN model of [31], as they are useful features at
every region of an image [9].
Departing from this reusability of features perspective, the parameter sharing property is employed,
where the parameters are shared for each location of the input (in the width and height dimensions,
denoted by H and W , respectively) in which the filters are applied, although different filters can be
learned (its number corresponds to the depth dimension of the layer output and is denoted by D′′) [22].
Supporting full connections between neurons in adjacent layers can be computationally demanding,
specially for inputs of substantial size [37]. This problem is addressed by establishing the local con-
nectivity property (or sparse connectivity), which limits the number of connections in the width (W ) and
height (H) dimensions, and has complete connectivity in the depth dimension (denoted by D) [22]. In-
deed, only a spatial region (in terms of width and height), named receptive field, composed by various
input layer adjacent neurons is connected to each neuron in the next layer (as depicted in Figure 2.4)
[22]. The receptive field (of size H ′ ×W ′ ×D) corresponds to the height, width and depth dimensions
of the filters (of size H ′ ×W ′ ×D ×D′′), where the height and width are usually equal (i.e., H ′ = W ′),
and the fourth dimension corresponds to the number of filters (i.e., D′′ neurons seek to learn different
filters from the same positions of the image) [22]. Other hyperparameters existent in the convolutional
layer are the stride, which is the quantity by which the filters are displaced among contiguous image
positions while applying convolution, and the zero padding, which is the number of zeros to append to
the boundaries of the input (allowing to guarantee certain output dimensions) [22].
Resulting from the parameter sharing framework described above, equivariance to translation is
achieved, implying that translating an input and applying convolution yields the same output as applying
convolution to the input first and translating the resulting output afterwards [9].
Given an input x ∈ RH×W×D, multidimensional filters (also known as weights) w ∈ RH′×W′×D×D′′
and biases b ∈ RD′′, the convolution of x with the filters w (and biases b) is given by [55] (without zero
padding and with stride equal to 1):

\[
y_{i''j''d''} = b_{d''} + \sum_{i'=1}^{H'} \sum_{j'=1}^{W'} \sum_{d'=1}^{D} w_{i'j'd'd''} \, x_{i''+i'-1,\; j''+j'-1,\; d'}, \tag{2.9}
\]
where the output is y ∈ RH′′×W′′×D′′, with H′′ = 1 + floor((H − H′ + Pt + Pb)/Sh) and W′′ =
1 + floor((W − W′ + Pl + Pr)/Sw); Sh and Sw correspond to the stride for the height and the width,
respectively, and Pt, Pb, Pr and Pl correspond to the zero padding of the top, bottom, right and left of
the input, respectively [55]. Usually, the inputs have equal height and width dimensions, i.e., H = W,
the strides are equal, i.e., Sh = Sw = S, the padding values are equal, i.e., Pt = Pb = Pr = Pl = P,
and the receptive field height and width are equal, i.e., H′ = W′. With parameter sharing, there are
H′ × W′ × D × D′′ weights and D′′ biases.
For example, as illustrated in Figure 2.4, considering a CNN input layer of size 227x227x3 (i.e.,
H = 227×W = 227×D = 3), followed by a Convolutional Layer with receptive field size 11x11x3 (i.e.,
H ′ = 11×W ′ = 11×D = 3), number of filters D′′ = 96, stride S = 4 and no zero padding (i.e., P = 0),
the height and width output size are equal to: H′′ = W′′ = 1 + floor((227 − 11)/4) = 55. Consequently,
this layer has 11×11×3×96 weights and 96 biases (assuming parameter sharing), and its output size is:
H′′ × W′′ × D′′ = 55 × 55 × 96, meaning that the same input layer patch of size 11×11×3 is connected
to 96 neurons (with distinct parameters) [22].
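The output size and parameter count used in this example follow directly from the formulas above; a small sketch:

```python
import math

def conv_output_size(H, Hp, P, S):
    """H'' = 1 + floor((H - H' + 2P) / S), assuming symmetric padding P."""
    return 1 + math.floor((H - Hp + 2 * P) / S)

def conv_param_count(Hp, Wp, D, Dpp):
    """With parameter sharing: H' * W' * D * D'' weights plus D'' biases."""
    return Hp * Wp * D * Dpp + Dpp

# The first-layer configuration from the example above:
assert conv_output_size(227, 11, 0, 4) == 55
assert conv_param_count(11, 11, 3, 96) == 34944   # 11*11*3*96 + 96
```
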
2.3.2 Fully connected layer
In fully connected layers, the neurons in adjacent layers are fully connected, similarly to the case of
Neural Networks layers [22] (the procedure to convert from convolutional layers to fully connected layers
and vice-versa, is explained in [22]).
2.3.3 Pooling
The pooling layer collects statistics about the feature representations, typically in non-overlapping input
patches [22]. In the image case, applying pooling subsamples the image by a factor of the stride,
using patches of the receptive field size. Consequently, the feature dimensionality, the parameters and
Figure 2.4: Example of the convolution operation (adapted from [22]).
the computational power demands are decreased, mitigating overfitting [22] and attaining invariance
to slight translations. Although multiple types of statistics exist (e.g. maximum, average or L2 norm),
maximum (or max) pooling is the most common.
Considering an input x ∈ RH×W×D, the max pooling (in patches of size H′ × W′ and stride S) [55]
is given by:

\[
y_{i''j''d} = \max_{1 \le i' \le H',\; 1 \le j' \le W'} x_{i''+i'-1,\; j''+j'-1,\; d}, \tag{2.10}
\]
where the output is y ∈ RH′′×W ′′×D. An example of the max pooling operation application is depicted in
Figure 2.5.
Figure 2.5: Example of the max pooling operation (obtained from [22]).
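Equation 2.10 can be implemented directly; a minimal single-channel sketch with non-overlapping patches (stride equal to the patch size, the common configuration):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling over a 2-D single-channel input: each output value is
    the maximum of a size x size patch."""
    H, W = x.shape
    Hpp = 1 + (H - size) // stride            # output height
    Wpp = 1 + (W - size) // stride            # output width
    y = np.empty((Hpp, Wpp))
    for i in range(Hpp):
        for j in range(Wpp):
            y[i, j] = x[i*stride:i*stride + size,
                        j*stride:j*stride + size].max()
    return y

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 10, 13, 14],
              [11, 12, 15, 16]])
# max_pool(x) -> [[4, 8], [12, 16]]
```
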
2.3.4 Activation function
The activation functions enrich the network representational power by adding a non-linearity and are
typically placed after the convolutional layers (similar to the layers in the NN case) [22]. One of the most
popular activation functions is the Rectified Linear Unit (ReLU), mainly because it leads to faster training
times than the sigmoid and hyperbolic tangent activation functions [31], can be easily computed and
does not saturate [22]. Examples of several activation functions are depicted in Figure 2.6.
The application of ReLU to an input x ∈ RH×W×D is given by [55]:

\[
y_{ijd} = \max\{0, x_{ijd}\}. \tag{2.11}
\]
Figure 2.6: Examples of activation functions (obtained from [54]).
2.3.5 Normalization
Normalization is another module of the CNNs which contemplates various methodologies, such as:
local contrast normalization and cross-channel normalization. Given an input x ∈ R^(H×W×D), the cross-
channel normalization is performed according to [55]:
y_{ijd} = x_{ijd} (κ + α Σ_{t ∈ G(d)} x²_{ijt})^{−β},  (2.12)
where the output y has the same size as the input x and G(d) ⊂ {1, . . . , D} represents the subset of
input channels used to normalize each output channel d. A scheme of the cross-channel normalization operation is
shown in Figure 2.7.
Figure 2.7: Cross-channel normalization scheme (obtained from [54]).
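A minimal NumPy sketch of Eq. (2.12) follows; the channel grouping G(d) is taken here as a window of n channels centred on channel d (an AlexNet-style choice, which is an assumption not fixed by the text):

```python
import numpy as np

def cross_channel_norm(x, n=5, kappa=2.0, alpha=1e-4, beta=0.75):
    """Cross-channel normalization (Eq. (2.12)). The grouping G(d) is assumed
    to be a window of n channels centred on channel d."""
    H, W, D = x.shape
    y = np.empty_like(x, dtype=float)
    half = n // 2
    for d in range(D):
        lo, hi = max(0, d - half), min(D, d + half + 1)
        s = (x[:, :, lo:hi] ** 2).sum(axis=2)   # sum of squares over G(d)
        y[:, :, d] = x[:, :, d] * (kappa + alpha * s) ** (-beta)
    return y

y = cross_channel_norm(np.ones((2, 2, 3)))  # output has the same size as the input
```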
2.3.6 Classification
The classification module assigns a class to the inputs. A popular approach is based on SVMs (including
all their variants), which provide class scores. In the context of CNNs, the most common choice is the
softmax function given by [55]:
y_{ijk} = e^{x_{ijk}} / Σ_{t=1}^{T} e^{x_{ijt}},  (2.13)
where x ∈ R^(H×W×T) is the input (typically with H = W = 1), k = 1, . . . , T , T represents the total number
of classes and y_{ijk} denotes the probability of the input belonging to class k, out of all the T possible
classes.
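Eq. (2.13), for the typical case H = W = 1, can be sketched as follows (the max subtraction is a standard numerical-stability device and does not change the result):

```python
import numpy as np

def softmax(x):
    """Softmax over the class dimension (Eq. (2.13))."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([1.0, 2.0, 3.0])  # x with H = W = 1 and T = 3
p = softmax(scores)                 # probabilities: sum to 1, largest for class 3
```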
2.3.7 Loss function
The loss function is applied during the training process for optimization purposes (as mentioned in
Section 2.2) and the two main types are: the hinge loss (possibly squared) and the softmax logarithmic
loss (also known as the cross entropy loss), which merges the softmax function and the logarithmic loss.
The cross entropy loss is given by [55]:
y = − Σ_{ij} (x_{ijc} − log Σ_{t=1}^{T} e^{x_{ijt}}),  (2.14)
where x ∈ R^(H×W×T) is the input (typically with H = W = 1), x_{ijc} denotes the score of the ground
truth class c and T is the total number of classes.
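For H = W = 1, Eq. (2.14) reduces to the negative log-probability of the ground truth class; an illustrative sketch:

```python
import numpy as np

def softmax_log_loss(x, c):
    """Softmax log (cross entropy) loss for H = W = 1 (Eq. (2.14)).
    x: vector of T class scores; c: index of the ground truth class."""
    m = x.max()
    log_sum = m + np.log(np.exp(x - m).sum())  # stable log-sum-exp
    return -(x[c] - log_sum)

x = np.array([1.0, 2.0, 3.0])
loss = softmax_log_loss(x, c=2)  # equals -log of the softmax probability of class c
```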
2.4 Transfer Learning
In image trained CNNs, the features are organized according to a hierarchy: the features in the initial
(lower-level) layers are more general (e.g. edge representations and color blobs), and their specificity to
the classes of the task of interest gradually increases towards the top (high-level) layers of the network
[66, 22]. For instance, while the lower layers of the ImageNet CNN model [31] may contain edge
representations, the top layers are more focused on differentiating between specific types of felines and,
even more specifically, between cat breeds. As a result, this idea inspires the possibility of transferring
learned features from one CNN to another [66].
2.4.1 Pre-training
CNNs can be trained only with the dataset pertaining to the chosen task (also known as base task),
starting from random initialization (e.g. according to a known probability distribution such as the normal
distribution). However, the size of the dataset may not be enough to prevent overfitting (due to the large
number of parameters to learn), motivating the use of pre-trained models [22]. These pre-trained models
result from training a CNN (possibly with random initialization) with a dataset of more substantial size
(e.g. a subset of ImageNet [14], which contains 1.2 million images in 1000 categories) and different
from the one of the task of interest, also known as target task (although some resemblances may exist)
[22].
2.4.2 Fine-tuning
The classification layer of the pre-trained CNN model can be adapted to the number of classes of the
target task dataset, and the entire (or some of the layers of the) network can be trained (as in [27]) with
the target task dataset (which is a process called fine-tuning), originating the target task CNN model.
Determining which layers are fine-tuned (all, lower-level, higher-level or only the classification layer)
depends essentially on the size of the target task dataset and on its resemblances with the base task
dataset [22]. Qualitatively, when resemblances exist and the dataset has a substantial size, the entire
network can be fine-tuned [22]. Nevertheless, when the data is more scarce (but with resemblances as
well), a better approach consists in extracting the penultimate layer's features and using them to train a
linear classification layer [22]. Although the dataset's size may be enough to train a CNN from random initialization,
fine-tuning a pre-trained model (trained with a different type of data) can be more advantageous [66].
Finally, if the dataset is small and the data type distinct, a more adequate choice may consist in
training a classifier from lower-level layers of the pre-trained network [22].
2.4.3 Feature extraction
CNNs are able to learn rich feature representations from raw inputs (such as image pixels). These
features can be extracted from the top-level layers (typically the penultimate layer, i.e., the one before the
classification layer) of a (pre-trained and fine-tuned) network and used to train a classifier (e.g. softmax
classifier or SVM). Images from the task specific dataset can be forwarded through the same network,
followed by the extraction of features (from the same layer as before) and testing on the previously
trained classifier [27].
Chapter 3
Proposed method
This chapter describes the innovative method developed to solve the Pedestrian Detection problem by
combining features extracted from several CNN input models. First, the used datasets, input channels
and pre-trained CNN models are specified. Subsequently, the pre-training and fine-tuning processes
are discussed for the single channel input case. An analysis of the final CNN model, resulting from the
combination of the input channels (multichannel combination), is performed and the overall detection
methodology is mentioned. Finally, the implementation is outlined, including the software and toolboxes
used.
3.1 Description of the Proposed Method
3.1.1 Datasets and Input channels
Two types of datasets are considered: the dataset containing pedestrian images (which is the INRIA
dataset [13]) and the dataset used to pre-train the CNN model (which is a subset of the Imagenet dataset
[14], comprising 1000 object categories with approximately 1.2 million training images, 50 thousand
validation images and 100 thousand test images, and whose images range from mammals and birds to
vehicles and fruit).
The chosen pedestrian dataset underwent six transformations, resulting in seven distinct datasets
(i.e., the original RGB image and six more). These seven different datasets are used as input channels
to fine-tune the pre-trained CNN model and are enumerated as follows: 1) RGB color space, denoted
by D_RGB = {(x_RGB, y)_n}_{n=1}^N; 2) gradient magnitude, denoted by D_GM = {(x_GM, y)_n}_{n=1}^N; 3) horizontal
derivative (along all 3 dimensions of the depth), denoted by D_Gx = {(x_Gx, y)_n}_{n=1}^N; 4) vertical derivative
(along all 3 dimensions of the depth), denoted by D_Gy = {(x_Gy, y)_n}_{n=1}^N; 5) grayscale, denoted by
D_Gs = {(x_Gs, y)_n}_{n=1}^N; 6) YUV color space, denoted by D_YUV = {(x_YUV, y)_n}_{n=1}^N; and 7) LUV color
space, denoted by D_LUV = {(x_LUV, y)_n}_{n=1}^N.
The images in each dataset are represented by x_RGB, x_Gx, x_Gy, x_YUV, x_LUV ∈ R^(H×W×D) and x_GM, x_Gs ∈
R^(H×W) (with x_RGB, x_GM, x_Gx, x_Gy, x_Gs, x_YUV, x_LUV : Ω → R, where Ω is the image lattice), in which H
and W denote the height and width of the images (respectively), D their depth (D is equal to 1 in the 2D
image case, i.e., for xGM and xGs), and N the total number of images. Example images of each input
channel are depicted in Figure 3.1.
The corresponding class of each of the N images is denoted by y ∈ Y = {1, 2}, where y = 1
represents the absence of pedestrians in the corresponding image and y = 2 represents their presence.
The complete dataset of pedestrian images is denoted by
D = {D_RGB, D_GM, D_Gx, D_Gy, D_Gs, D_YUV, D_LUV}.
Similar notation is adopted to denote the Imagenet dataset [14] used to pre-train the CNN model,
namely: D = {(x, y)_n}_{n=1}^N, with x ∈ R^(H×W×D) and y ∈ Y = {1, . . . , 1000}.
Figure 3.1: Example images of the input channels: 1) RGB color space (x_RGB); 2) gradient magnitude (x_GM); 3) horizontal derivative (x_Gx); 4) vertical derivative (x_Gy); 5) grayscale (x_Gs); 6) YUV color space (x_YUV); 7) LUV color space (x_LUV).
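For illustration, four of the seven channels can be computed from an RGB image as sketched below (NumPy; the grayscale weights, here ITU-R BT.601, and the central-difference derivative filter are assumptions, since the text does not fix these choices):

```python
import numpy as np

def input_channels(rgb):
    """Illustrative computation of four of the seven input channels from an
    RGB image (float array of shape H x W x 3)."""
    gs = rgb @ np.array([0.299, 0.587, 0.114])     # grayscale, H x W
    gx = np.gradient(rgb, axis=1)                  # horizontal derivative, all 3 depths
    gy = np.gradient(rgb, axis=0)                  # vertical derivative, all 3 depths
    gm = np.sqrt((gx ** 2 + gy ** 2).sum(axis=2))  # gradient magnitude, H x W
    return gs, gx, gy, gm

rgb = np.random.rand(100, 41, 3)  # an image with the dimensions used in Chapter 4
gs, gx, gy, gm = input_channels(rgb)
```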
3.1.2 Pre-trained CNN model
According to [66], initializing a CNN with features from a pre-trained network can be more advantageous
than random initialization and can lead to enhancements in the generalization capability, even if the
tasks assigned to the two models are distant.
Therefore, the publicly available CNN-F pre-trained model [12], represented by f(x, w) with param-
eters w = [wcn, wfc, wcl], was selected. The mentioned pre-trained model architecture is depicted in
Figure 3.2 and contains a total of 8 layers: 5 being convolutional (including non-linear sub-sampling and
parameters wcn), 2 being fully connected (with parameters wfc) and the last 1 being a classification
layer (also fully connected with parameters wcl). This CNN has an architecture similar to the one used
in the ILSVRC-2012 competition (ImageNet Large Scale Visual Recognition Challenge 2012) [31] and
was pre-trained with a subset of the ImageNet dataset [14] referred to in Subsection 3.1.1.
More specifically, the CNN input size is 224x224x3 (which requires resizing or pre-processing the
input images in order to achieve these dimensions) and the first convolutional layer comprises: 64
filters with an 11x11 receptive field, convolutional stride 4 and no zero spatial padding, followed by local
response normalization and max-pooling with a downsampling factor of 2 and zero padding on the
bottom and on the right. The second convolutional layer contains: 256 filters with a 5x5 receptive field,
convolutional stride 1 and two levels of zero spatial padding, followed by local response normalization
and max-pooling with a downsampling factor of 2 (and no zero padding). The third convolutional layer
comprises: 256 filters with a 3x3 receptive field, convolutional stride 1 and one level of zero spatial
padding. The fourth convolutional layer contains: 256 filters with a 3x3 receptive field, convolutional
stride 1 and one level of zero spatial padding. The fifth convolutional layer comprises: 256 filters with a
3x3 receptive field, convolutional stride 1 and one level of zero spatial padding, followed by max-pooling
with a downsampling factor of 2 (and no zero padding). The sixth and seventh layers are fully
connected with size 4096 (and support dropout regularization, although it is not used herein). The
eighth and last layer is a softmax classifier of size 1000. All layers containing parameters (except for the
last one) use the Rectified Linear Unit as the activation function.
3.1.3 CNN single input model
To construct the CNN model for the target task of pedestrian detection, the parameters wcn and wfc,
pertaining to the CNN model obtained in the ILSVRC-2012 task (from layer 1 to 7), were transferred to
the pedestrian detection task CNN model.
The architecture of the pedestrian detection task CNN model is equal to the pre-trained model’s
architecture, with the exception of the last layer (i.e., the eighth), which was replaced by a new softmax
classification layer (having parameters wcl and randomly initialized using a Gaussian distribution with
zero mean and variance equal to 0.01), adapted to the new number of classes, which is now two (i.e.,
pedestrian and non-pedestrian) instead of 1000.
Each of the seven previously mentioned input channels (datasets contained in the complete dataset
D) was used to fine-tune a CNN (similar to the procedure in [27] and [11], and resorting to the logarithmic
loss function), resulting in a total of seven single input CNN models (1 per distinct input channel), namely:
1) for RGB color space, f(xRGB ,wRGB); 2) for Gradient Magnitude, f(xGM ,wGM ); 3) for the Horizontal
derivative, f(xGx,wGx); 4) for the Vertical derivative, f(xGy,wGy); 5) for grayscale, f(xGs,wGs); 6) for
YUV color space, f(xY UV ,wY UV ); 7) for LUV color space, f(xLUV ,wLUV ). This process is illustrated
in Figure 3.3 a).
As mentioned previously, the expected CNN input image size is 224x224x3. Consequently, for the
considered input channels (datasets) that contain two dimensional images, the third dimension (depth)
was constructed by stacking 3 replicas of the image. Then, before entering the CNN, all images were
resized to the size 224x224x3 using cubic interpolation. Prior to these two operations, the mean of the
training images was computed and subtracted from all training, validation and test images, as a normalization
approach.
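These preparation steps can be sketched as follows (illustrative NumPy; the cubic-interpolation resize to 224x224x3, performed with MatConvNet/Matlab in the thesis, is omitted here):

```python
import numpy as np

def preprocess(images, train_mean=None):
    """Sketch of the two steps above: 2-D channels are stacked to depth 3 and
    the training-set mean image is subtracted from every image."""
    out = []
    for x in images:
        if x.ndim == 2:                       # 2-D channel (e.g. x_GM, x_Gs)
            x = np.stack([x, x, x], axis=2)   # replicate 3 times to build the depth
        out.append(x.astype(float))
    out = np.stack(out)                       # N x H x W x 3
    if train_mean is None:                    # computed on the training set only
        train_mean = out.mean(axis=0)
    return out - train_mean, train_mean

imgs = [np.ones((4, 3)), np.zeros((4, 3, 3))]  # tiny 2-D and 3-D examples
norm, mean_img = preprocess(imgs)
```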
Figure 3.2: Pre-trained model architecture (obtained from [54]).
3.1.4 CNN multichannel input combination model
To build the multichannel combination CNN model, the features from the seventh layer (designated by
the (L − 1)th layer) of each of the seven single input CNN models are extracted and combined among
themselves (entirely or partially). These features are then utilized to train (by minimizing the logarith-
mic loss function) a softmax classification layer (randomly initialized using a Gaussian distribution with
zero mean and variance equal to 0.01), originating the multichannel input combination CNN model rep-
resented by: f(xRGB,L−1,xGM,L−1,xGx,L−1,xGy,L−1,xGs,L−1,xY UV,L−1,xLUV,L−1; wcl) (similarly to the
procedure in [11]) and depicted in Figure 3.3 b).
Figure 3.3: Scheme of the a) single input and b) multiple input combination CNN models.
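This combination can be sketched as below; the penultimate-layer features are replaced by synthetic data of reduced dimensionality (a hypothetical stand-in: the real features are 4096-dimensional per channel), and the new softmax layer is trained by plain gradient descent on the log loss (a simplification of the mini-batch SGD used in the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (L-1)-layer features of three single input CNN models,
# replaced here by synthetic data of reduced dimensionality.
N, F = 200, 64
feats = [rng.normal(size=(N, F)) for _ in range(3)]
X = np.concatenate(feats, axis=1)    # multichannel combination of the features
y = rng.integers(0, 2, size=N)       # non-pedestrian (0) / pedestrian (1)

# New softmax classification layer, Gaussian-initialized with zero mean and
# variance 0.01 as in the text, trained by minimizing the logarithmic loss.
W = rng.normal(0.0, 0.1, size=(X.shape[1], 2))

def forward(W):
    z = X @ W
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return p, -np.log(p[np.arange(N), y] + 1e-12).mean()

_, loss_before = forward(W)
for _ in range(200):                 # plain gradient descent
    p, _ = forward(W)
    W -= 0.1 * (X.T @ (p - np.eye(2)[y]) / N)
_, loss_after = forward(W)           # the log loss decreases during training
```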
3.1.5 Overall detection methodology
The overall detection process comprises two parts, as shown in Figure 3.4. First, the Aggregate
Channel Features (ACF) detector [21] is applied to the test images in order to obtain the pedestrian candidate
windows (i.e., windows potentially containing pedestrians), output in the form of bounding boxes
that surround each detected person (containing the height, width and coordinates), each with a
corresponding confidence score.
Next, each of the candidate windows is extracted from the full image and passed through the multiple
input combination CNN model previously described, being classified as pedestrian or non pedestrian.
The candidate windows classified as non pedestrians are discarded and the ones classified as pedes-
trians maintain the bounding box and confidence score obtained with the ACF detector. Then, the
bounding box and the confidence score are utilized to perform the per-image evaluation of the overall
detector (as described in Section 4.1 of Chapter 4).
Figure 3.4: Scheme of the overall detection methodology.
3.2 Implementation
The developed method was implemented in Matlab and the MatConvNet Matlab toolbox [55] was used
to implement the CNN framework. Piotr's Computer Vision Matlab Toolbox [15] (2014, version
3.40 and including the channels and detector packages) was used to perform the detection with the
Aggregated Channel Features [21] method and to assess the performance of the proposed method,
including benchmarking against other methods and plotting the miss rate versus false positives per
image graph.
Chapter 4
Experimental Results
The results obtained with the CNN multichannel combination method proposed in Chapter 3 are pre-
sented in this chapter. The CNN single input channel results are also shown, in order to analyze the dif-
ference between the multichannel and single channel approaches. The performance evaluation method-
ology, the training methodology and the experimental setups are described and the results discussed.
For benchmarking purposes, the performances of several other state of the art PD methods are depicted
(including the performance of the ACF detector alone, without using the CNN) and compared with the
developed methodology.
4.1 Performance evaluation methodology
Regarding evaluation metrics, the most suitable methodology to evaluate pedestrian detectors was
shown to be based on per-image performance evaluation, instead of the per-window counterpart [20].
Accordingly, the most recent benchmarks in PD were obtained with full image evaluation metrics and,
therefore, this was the methodology used to evaluate the performance of the proposed method. A de-
tailed description of the full image evaluation metric is presented in [20] (and summarized in this section).
In the per-image evaluation metric, a detection window is slid over a grid of locations throughout
a full image, at multiple scales, outputting a bounding box and a confidence score per detection [20].
Afterwards, the detections placed in the same neighborhood are merged using non-maximal suppres-
sion, originating the definitive bounding boxes and confidence scores [20]. To determine if a detected
bounding box (BB_dt) and a ground truth bounding box (BB_gt) constitute a match, the PASCAL measure
is used, expressed as:
a_o = area(BB_dt ∩ BB_gt) / area(BB_dt ∪ BB_gt) > 0.5,  (4.1)
meaning that the overlap ratio (a_o) between the two bounding boxes must be greater than 50% [20].
Only one match between each BB_dt and BB_gt is allowed [20]. The first matches occur between the
detection bounding boxes having the highest confidence scores and the corresponding ground truth
bounding boxes (according to the PASCAL measure described previously) [20]. After these matches,
a single BB_dt might have been matched to several BB_gt. This ambiguity is solved by selecting the
highest overlapping match (ties are broken arbitrarily) [20]. False positives correspond to unmatched
BB_dt and false negatives to unmatched BB_gt [20].
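The PASCAL measure of Eq. (4.1) can be sketched as follows (boxes taken as (x, y, width, height) tuples, an assumed convention):

```python
def pascal_overlap(bb_dt, bb_gt):
    """PASCAL overlap measure a_o of Eq. (4.1); boxes as (x, y, width, height)."""
    x1 = max(bb_dt[0], bb_gt[0])
    y1 = max(bb_dt[1], bb_gt[1])
    x2 = min(bb_dt[0] + bb_dt[2], bb_gt[0] + bb_gt[2])
    y2 = min(bb_dt[1] + bb_dt[3], bb_gt[1] + bb_gt[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)      # area of the intersection
    union = bb_dt[2] * bb_dt[3] + bb_gt[2] * bb_gt[3] - inter
    return inter / union

full = pascal_overlap((0, 0, 10, 20), (0, 0, 10, 20))  # 1.0: a match
part = pascal_overlap((0, 0, 10, 20), (5, 0, 10, 20))  # 1/3: not a match (< 0.5)
```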
The detectors' performance can be illustrated with a curve, portraying the variation of the miss rate
(MR) with the false positives per image (FPPI), when the threshold established for the detection con-
fidence score is changed [20]. These curves have logarithmic scales and are particularly useful for
detector benchmarking purposes [20]. In order to concisely express the information contained in the
MR against FPPI performance curve, the log-average miss rate was introduced (henceforth, referred to
simply as miss rate, except in this section) [20], which is the average of nine miss rate values selected
for nine FPPI rates belonging to the interval [10^−2, 10^0] [20]. More specifically, these FPPI rates
correspond to points equally spaced (in the logarithmic domain) over the mentioned interval [20]. In
the special case where a certain miss rate value cannot be computed for a certain FPPI rate, because
the curve ended before sweeping through the entire FPPI interval (i.e., [10^−2, 10^0]), it is replaced
by the minimum miss rate obtained [20].
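Following the description above, the log-average miss rate can be sketched as below (note that reference implementations average the sampled miss rates in the logarithmic domain; the plain average is used here to match the text):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Miss rates sampled at nine FPPI points equally spaced in log space over
    [1e-2, 1e0] and then averaged. The curve is given as arrays sorted by
    ascending FPPI; where the curve ends before a reference point, the minimum
    miss rate is used."""
    refs = np.logspace(-2, 0, 9)
    samples = []
    for r in refs:
        below = miss_rate[fppi <= r]          # curve points up to this FPPI rate
        samples.append(below[-1] if below.size else miss_rate.min())
    return float(np.mean(samples))

fppi = np.logspace(-2, 0, 9)
lamr = log_average_miss_rate(fppi, np.full(9, 0.2))  # a constant curve gives 0.2
```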
4.2 Experimental setup
The experiments were performed on the INRIA dataset [13]. The methodology used to obtain the pos-
itive and negative training images is discussed in Section 4.3. Two main experimental setups were
considered: one including the entire INRIA training set with higher image resolution and the second one
using only a small portion of the INRIA training set with lower image resolution. Although the training
sets have a different number of images, the test sets have the same number of images, but with different
resolutions, i.e., for the higher resolution case the test set resolution is higher and for the lower resolution
case its resolution is lower.
In the first experimental setup, designated by full INRIA dataset, 17128 images were used for training
(90%) and validation (10%). In this set of 17128 images, 12180 are negative (i.e., without containing
pedestrians) and the remaining 4948 images include: 1237 positive images (i.e., containing pedestri-
ans), their horizontally flipped (mirrored) versions (another 1237 images) and random deformations applied to the
previous 2474 images (positive images and their flipped versions). The deformations consist in performing
cubic interpolation between randomly chosen start and end values for the width and height of the
image (drawn from a uniform distribution on the interval ]0, 15[), while preserving the original size. The
size of the images is 100 for the height and 41 for the width, for all input channels, corresponding to the
higher resolution images.
In the second experimental setup, designated by partial INRIA dataset, 400 images were used for
training (75%) and validation (25%). In this set of 400 images, 300 are negative and the remaining 100
images are positive, without data augmentation (no horizontal flipping and no random deformations),
which were randomly chosen from the entire set of positive images. The RGB images result from resizing
the ACF detections from 100x41x3 to 25x10x3 (height, width and depth dimensions, respectively; the
details about the ACF pre-processing and the training methodology are mentioned in Section 4.3). The
gradient magnitude and the gradient histogram result from the ACF features computation applied to the
100x41x3 RGB images and then shrunk by a factor of 4 (i.e., with size 25x10). The Felzenszwalb's
histogram of oriented gradients [24] was applied to the 100x41x3 RGB images, having size 12x5x32
and was then reshaped to the size 32x20x3. These are the lower resolution images.
The test set contains 1835 images which correspond to candidate windows obtained by running the
ACF detector on the 288 positive full test images of INRIA (each containing at least one pedestrian,
possibly more). This is the INRIA test set used to establish the benchmarks shown in [1] and must be
considered for comparison purposes.
4.3 Training methodology
Two training methodologies were considered to obtain the positive training images for the full INRIA
dataset experimental setup, namely: 1) cropping and resizing the INRIA positive training images and 2)
pre-processing the INRIA positive training dataset with the ACF detector. For the partial INRIA dataset,
the second training methodology was adopted.
For both cases, the negative images (i.e., without pedestrians) result from randomly extracting win-
dows of size 100x41x3 from the INRIA negative training images (i.e., 10 windows per negative full
image).
Concerning the positive training and validation images, their acquisition procedure differs depending
on the adopted methodology. In the first case, the positive train and validation images result from
cropping and resizing the INRIA positive training dataset to achieve the size 100x41x3 (corresponding
to the height, width and depth, respectively), keeping the pedestrians centered and fully present. The
cropping and resizing operations reduce the size of the images and decrease the computational
cost. Moreover, the dimensions 100x41x3 were selected to match those of the second methodology,
allowing a fair comparison between the performance of the two methodologies.
In the second case, to obtain the positive images (for training and validation), the ACF detector is
applied to the INRIA positive training dataset and the bounding boxes corresponding to the true positives
are selected (with size 100x41x3, which is the ACF detection window size). The original bounding
boxes, corresponding to the false negatives, are extracted as well (by comparing the true positives with
the ground truth). The true positives and the false negatives constitute the total set of positive images,
containing only true pedestrian bounding boxes. This methodology yields a more robust and
diversified training, since the method learns deformations, translations and occlusions that would not
have been experienced with a centered and adequately formatted version of the dataset. Furthermore,
the candidate images of the test set are obtained by applying the ACF detector, which indicates that
pre-processing the train images with the same detector might be advantageous.
For the previously stated reasons, the training methodology used to obtain the results in Section
4.4 was the second one, where the ACF was applied to the INRIA positive training dataset to obtain the
positive training images. The difference between the results reached with the two training methodologies
is analyzed in Subsection 4.4.3.
4.4 Results
Training and testing were performed using the Matlab-based MatConvNet toolbox [55] running on CPU
mode (no GPU was used) on a 2.50 GHz Intel Core i7-4710HQ with 12 GB of RAM and a 64 bit architecture.
No cross validation technique was used.
4.4.1 PD using the full INRIA dataset with 7 input channels
Each single input CNN model took approximately 4.5 hours to train and 3 minutes to test. The training
process only concerns the fine-tuning of the entire network (adapted to the pedestrian and non pedes-
trian classes and comprising 8 layers) with the INRIA [13] pedestrian dataset. The time spent pre-training
the initial CNN-F model [12] with a subset of the Imagenet [14] dataset was not taken into account. In
the multichannel combination CNN model, the feature extraction per input channel took approximately
from 2 hours to 3 hours and the test time was under 1 minute. However, estimating the runtime of the
entire method (in frames per second) requires the inclusion of the feature extraction time, besides the
classification time (calculated previously for the test phase).
The optimization algorithm used for training (included in the MatConvNet toolbox [55] ) was the mini-
batch stochastic gradient descent with momentum, having the following parameters: batch size equal to
100, number of epochs equal to 10, learning rate equal to 0.001 and momentum equal to 0.9.
Table 4.1 presents the results for the single input channels: RGB color space (denoted by RGB),
gradient magnitude (denoted by GradMag), horizontal derivative across all RGB channels (denoted
by Gx), vertical derivative across all RGB channels (denoted by Gy), grayscale (denoted by Grayscale),
YUV color space (denoted by YUV) and LUV color space (denoted by LUV). As mentioned in Subsection
3.1.3 of Chapter 3, the input channels with original size 100x41, which are the gradient magnitude and
grayscale, are replicated in order to achieve the size 100x41x3. The other input channels, having
original size 100x41x3, do not undergo this operation. Before entering the CNN, all inputs are resized
to 224x224x3 using cubic interpolation (since these are the network's expected input dimensions).
Table 4.2 depicts the performance when the following 4 input channels are combined: gradient mag-
nitude (GradMag), horizontal derivative (Gx), vertical derivative (Gy) and LUV color space (LUV) (ac-
cording to the methodology described in Figure 3.3 of Chapter 3). Table 4.3 shows the performance
for the combination of all the 7 input channels (according to the methodology described in Figure 3.3
of Chapter 3). The combination of the inputs: gradient magnitude, horizontal derivative and LUV color
space provides the best result, which is further improved by changing the training parameters (only
for the multichannel combination) as follows: the learning rate was changed to 0.0001, the number of
epochs to 80 and the batch size to 2000, resulting in a miss rate of 14.64%
(not shown in Tables 4.2 and 4.3, but presented in Figure 4.1).
The comparison of the results obtained with the developed method for the combination of different in-
put channels, using the full INRIA dataset, are shown in Figure 4.2. In order to perform a fair comparison
with the best result, the training parameters of each combination presented in Figure 4.2 were changed
to have the same values as the best method training parameters in the multichannel combination, i.e.:
the learning rate was changed to 0.0001, the number of epochs was changed to 80 and the batch size
was changed to 2000.
The comparison of the developed method best result, which occurs for the combination of the inputs
(using the full INRIA dataset): gradient magnitude (GradMag), horizontal derivative (Gx) and LUV color
space (LUV), with other PD benchmarks is presented in Figure 4.1 (denoted by Multichannel CNN in the
box).
Table 4.1: Miss rate % using single channels as input and without feature combinations for the full INRIA dataset.

Channel     Miss Rate %
RGB         16.27
GradMag     15.24
Gx          16.42
Gy          16.31
Grayscale   15.75
YUV         16.03
LUV         15.97
Table 4.2: Miss rate % using 4-feature combinations for the full INRIA dataset (a one in the table means that the feature of that channel is present in the combination and a zero represents its absence).

GradMag  Gx  Gy  LUV  Miss Rate %
   0      0   1   1     15.97
   0      1   0   1     15.94
   1      0   0   1     15.09
   0      1   1   0     16.12
   1      0   1   0     16.04
   1      1   0   0     15.62
   0      1   1   1     16.02
   1      0   1   1     14.91
   1      1   0   1     14.82
   1      1   1   0     15.99
   1      1   1   1     15.77
4.4.2 PD using the partial INRIA dataset with 4 input channels
Each single input CNN model took approximately 14 minutes to train and 1.5 minutes to test. The
training process only concerns the fine-tuning of the entire network (adapted to the pedestrian and non
pedestrian classes and comprising 8 layers) with the INRIA [13] pedestrian dataset. The time spent
pre-training the initial CNN-F model [12] with a subset of the Imagenet [14] dataset was not taken into
account. In the multichannel combination CNN model, the feature extraction per input channel took
approximately 10 minutes and the test time was under 1 minute. However, estimating the runtime of the
entire method (in frames per second) requires the inclusion of the feature extraction time, besides the
classification time (calculated previously for the test phase).
Table 4.3: Miss rate % using 7-feature combinations for the full INRIA dataset (a one in the table means that the feature of that channel is present in the combination and a zero represents its absence).

RGB  GradMag  Grayscale  Gx  Gy  YUV  LUV  Miss Rate %
 1      1         1       1   1   1    1     16.70
 1      1         0       1   0   0    1     16.55
 1      1         0       1   1   0    1     16.70
 0      0         1       1   1   1    1     15.95
 1      1         0       1   1   1    1     16.00
 1      0         1       0   1   1    1     16.75
 0      1         0       1   0   0    1     14.82
 0      1         0       1   1   0    1     15.77
 0      1         1       1   1   0    1     15.92
 0      1         0       1   1   1    1     15.89
 0      1         0       0   1   0    1     14.91
 0      1         0       1   0   0    0     15.62
 0      0         0       1   0   0    1     15.94
 1      1         0       1   0   0    0     16.46
 0      1         1       1   0   0    1     16.22
Figure 4.1: Comparison of the developed method's best result, denoted by Multichannel CNN, with other PD benchmarks for the full INRIA dataset. The box contains the log-average miss rate % for each method.
The optimization algorithm used for training was the mini-batch stochastic gradient descent with
momentum (included in the MatConvNet toolbox [55]), having the following parameters: batch size equal
to 10, number of epochs equal to 10, learning rate equal to 0.001 and momentum equal to 0.9.
Figure 4.2: Comparison of the results obtained with the developed method for the combination of different input channels for the full INRIA dataset. The box contains the log-average miss rate % for each method.
Table 4.4 presents the results for the single input channels: RGB color space with size 25x10x3
(denoted by RGB), gradient magnitude with size 25x10 (denoted by GradMag), gradient histogram in
the orientation range from 150 degrees to 180 degrees with size 25x10 (denoted by GradHist6) and the
reshaped Felzenszwalb’s histogram of oriented gradients [24] with size 32x20x3 (denoted by FHOG).
The performance for the combination of these input channels is depicted in Table 4.5. As mentioned in
Subsection 3.1.3 of Chapter 3, for two dimensional input channels, the image is replicated in order to fill
the third dimension. Before entering the CNN, all inputs are resized to 224x224x3 using cubic
interpolation (since these are the network's expected input dimensions).
Table 4.4: Miss rate % using single channels as input and without feature combinations for the partial INRIA dataset.

Channel     Miss Rate %
RGB         21.05
GradMag     24.23
GradHist6   21.83
FHOG        22.34
4.4.3 Discussion
By analyzing Table 4.1, Table 4.2 and Table 4.3, it is possible to observe that the combination of multiple
input channels is advantageous and can lead to improved results. In fact, the best result is a 14.64%
miss rate, obtained for the combination of the GradMag, Gx and LUV input channels (individually,
these channels perform worse: the miss rate of GradMag is 15.24%, the miss rate of Gx is 16.42%,
and the miss rate of LUV is 15.97%).

Table 4.5: Miss rate % using 4-feature combinations for the partial INRIA dataset (a one in the table means that the feature of that channel is present in the combination and a zero represents its absence).

RGB  GradMag  GradHist6  FHOG  Miss Rate %
 0      0        1         1     25.62
 0      1        0         1     25.17
 1      0        0         1     21.04
 0      1        1         0     23.52
 1      0        1         0     23.46
 1      1        0         0     23.82
 0      1        1         1     23.40
 1      0        1         1     18.68
 1      1        0         1     18.44
 1      1        1         0     21.76
 1      1        1         1     19.95
However, some combinations of input channels lead to miss rates that are worse than those of
their individual input channels alone (the interaction among some input channels seems to deteriorate
the overall performance). For example, the Gx miss rate (16.42%) is better than that of the 7 input
channel combination in Table 4.3 (16.70%), but worse than that of the 4 input channel
combination in Table 4.2 (15.77%).
Comparing the results from Tables 4.1, 4.2, 4.3, 4.4 and 4.5, it is reasonable to conjecture that, when
combining several input channels, a higher performance improvement can be obtained if each single
channel has reduced quality (e.g. in terms of resolution or dimensionality), whereas higher quality single
input channels seem to lead to a smaller (although still existent) improvement when combined. Accordingly,
the maximum improvement from the multichannel combination in the lower resolution single input channel case
(shown in Tables 4.4 and 4.5, where the images have lower dimensions than in Tables 4.1, 4.2 and
4.3) is 5.79% (the difference between the GradMag miss rate and the RGB, GradMag, FHOG
combination miss rate), whereas in the higher resolution single input channel case (shown in Tables 4.1,
4.2 and 4.3) it is 1.78% (the difference between the Gx miss rate and the GradMag, Gx, LUV
combination miss rate). Despite the resizing to 224x224x3 undergone by all images before entering
the CNN, the scarce image resolution affects the resize operation and causes a loss of image quality (from the
higher resolution to the lower resolution case, the height and width were reduced to approximately one
fourth of the initial size, except for FHOG, in which the height was reduced to approximately
one third and the width to one half of the initial RGB image size, before being transformed into FHOG).
According to the previous analysis, the proposed PD method is suited for the low resolution PD
problem. However, the low resolution images were only used in the CNN based approach. The ACF
method, which generated the candidate images, used the original INRIA test images at high resolution;
the candidate images were only resized (or reshaped) afterwards in order to obtain the lower
resolution. These low resolution candidate images were then fed into the CNN. Consequently, the low
resolution results only comprise the CNN methodology, not the entire detection system (which is composed
of the ACF and the developed CNN based method). The performance for the CNN multichannel
combination case cannot be compared with the performance of the ACF method alone (i.e., the baseline),
because the image resolution was not reduced before the generation of the candidate windows.
Nevertheless, within the scope of the CNN framework, it is still possible to compare the performance
of the single input channels with that of the multiple input channel combinations (as done
previously).
The performance improvements from channel combination are not more significant, possibly
due to the lack of heterogeneity among the channels (or, at least, they could be substantially
more heterogeneous). For instance, if the input channels represented different views of the pedestrians,
more noticeable differences would be expected. As a result, a better and more heterogeneous selection
of the input channels may enhance the performance of their combination. Indeed, the combination of
the RGB, GradMag, Gx and LUV input channels produces a miss rate (16.55%) that is 1.91% worse
than the miss rate of the combination of GradMag, Gx and LUV (14.64%), possibly due to
the redundancy between the RGB and LUV color spaces (the combination of RGB and LUV
may be incompatible as well, since the miss rates of the combination of RGB, GradMag and Gx, and
of the combination of GradMag, Gx and LUV, are better than the miss rates obtained after adding LUV and RGB,
respectively).
Regarding the sensitivity of the model, when the batch size hyperparameter increases, the performance
tends to improve in the single channel case (mainly when the training dataset is small),
while showing no significant differences in the multichannel combination case. Increasing the number
of epochs beyond 10 produces no substantial changes in performance (especially when the training
dataset is large), since the training and validation errors are not able to decrease substantially further. If
the training and validation errors are constant after some number of epochs (or vary only slightly, e.g. by less
than 0.5%), the training has stabilized. Indeed, increasing the number of epochs has no substantial effect
in the multichannel combination case.
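The stabilization criterion above can be expressed as a simple check on the recent error history (an illustrative sketch: the function name and the 3-epoch window are assumptions, while the 0.5% threshold comes from the text):

```python
def has_stabilized(errors, window=3, tol=0.5):
    """Training is considered stabilized when the last `window` epoch
    errors (in %) vary by less than `tol` percentage points."""
    if len(errors) < window:
        return False
    recent = errors[-window:]
    return max(recent) - min(recent) < tol

# e.g. validation error (%) per epoch: still decreasing vs. plateaued
still_training = has_stabilized([30.0, 22.0, 18.0])   # False
plateaued = has_stabilized([19.0, 16.2, 16.0, 15.9])  # True
```

A check of this kind justifies stopping at 10 epochs in the experiments, since further epochs leave the errors essentially unchanged.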
Concerning the adopted training methodology, the results reached with the ACF pre-processing training
methodology (the one in use) were superior, by less than approximately 1%, to the ones obtained
with the cropped and resized INRIA positive training images (as discussed in Section 4.3). A relevant
remark is that the developed PD method is robust with respect to the adopted training methodology,
with maximum variations in the results of approximately 1%.
The selected experimental setups, namely the full INRIA dataset with higher resolution images and
the partial INRIA dataset with lower resolution images, are intended to test two extreme cases by varying the
resolution of the images and the amount of data used during training. When more data and higher
resolution images are available, the network undergoes a better training procedure, leading to the best
result (e.g. a 14.64% miss rate for the combination of the GradMag, Gx and LUV inputs). Conversely,
when the images have lower resolution and the amount of data used is substantially scarcer, the
training of the network is not as good, leading to the worst situation regarding resolution and data
quantity (e.g. an 18.44% miss rate for the combination of the RGB, GradMag and FHOG inputs).
The best result achieved with the developed method occurs for the combination of the inputs
gradient magnitude (GradMag), horizontal derivative (Gx) and LUV color space (LUV), using the full
INRIA dataset (as depicted in Figures 4.1 and 4.2). When compared with other PD benchmarks, as shown
in Figure 4.1 (denoted by Multichannel CNN in the box), it is possible to conclude that the proposed
approach is competitive with the state of the art (it is among the top 7 methods for the INRIA dataset, according
to the benchmarks acquired in [1]) and introduces an improvement of 2.64% over the
ACF method alone (displaying a better performance curve).
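For reference, the log-average miss rate reported in the boxes of Figures 4.1 and 4.2 follows the standard Caltech evaluation protocol [20]: the miss rate is sampled at nine FPPI values evenly spaced in log-space between 10^-2 and 10^0 and averaged in log space. A minimal sketch of this computation (illustrative only; the evaluation in this Thesis relies on the standard benchmark toolbox [15]):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Log-average miss rate: miss rate sampled at nine FPPI values
    log-spaced in [1e-2, 1e0], averaged in log space (geometric mean).
    `fppi` and `miss_rate` describe the detector's miss-rate-vs-FPPI curve."""
    ref = np.logspace(-2, 0, 9)
    # interpolate the curve at the reference points, in log-FPPI space
    sampled = np.interp(np.log10(ref), np.log10(fppi), miss_rate)
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))

# e.g. a detector with a constant 20% miss rate over the FPPI range
fppi = np.logspace(-3, 1, 50)
lamr = log_average_miss_rate(fppi, np.full(50, 0.20))
```

Averaging in log space keeps the summary consistent with the log-log miss-rate-vs-FPPI curves plotted in the benchmark figures.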
Chapter 5
Conclusions
5.1 Thesis overview
Throughout this master thesis, the PD problem was formulated and its main challenges identified.
The existing datasets, benchmarks and evaluation methodologies were reviewed. The most relevant
PD approaches were discussed, including an analysis of Deep Learning based PD methods and
other relevant techniques. The background on CNNs was outlined. The proposed method, comprising the
multichannel input combination, was described. Specifically, the chosen pre-trained CNN model
was fine-tuned with each single input channel, and the final CNN model was built by combining
the features of each input. The experimental results, obtained by applying the proposed method to the
INRIA dataset, were presented in conjunction with a benchmark against other state of the art PD
methods.
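The combination step summarized above can be sketched schematically (a toy Python illustration with stand-in feature extractors; the function names and feature dimensions are hypothetical, not the Thesis implementation):

```python
import numpy as np

def combine_channel_features(extractors, channels):
    """Schematic multichannel combination: each per-channel fine-tuned CNN
    maps its input to a feature vector, and the vectors are concatenated
    into a single descriptor for the final classifier."""
    return np.concatenate([f(c) for f, c in zip(extractors, channels)])

# Toy stand-ins for three fine-tuned CNNs (e.g. GradMag, Gx, LUV), each
# producing a 4-dimensional feature vector from a 224x224x3 input
extractors = [lambda x, i=i: np.full(4, float(i)) for i in range(3)]
channels = [np.zeros((224, 224, 3)) for _ in range(3)]
descriptor = combine_channel_features(extractors, channels)  # 12-D descriptor
```

The concatenated descriptor is what the final stage classifies, which is why adding a redundant channel can dilute rather than enrich the representation.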
5.2 Achievements
An innovative method to solve the PD problem was proposed, based on the combination of
different input channels using CNNs. In particular, the performed experiments motivate the application of
this method to low resolution PD, since it is possible to create synergies between low resolution inputs,
leading to enhanced detection rates in the input combination case. The experimental results obtained
using the full INRIA dataset are competitive with state of the art PD approaches. Moreover, when
multiple input channels are available, the developed method can be applied in other areas
beyond the scope of PD, allowing inputs to be integrated to achieve improved performance.
5.3 Future Work
This work can be expanded by selecting more heterogeneous input channels, such as pedestrian body
parts (which can be obtained with [10], for example) or different views of the pedestrians (although these
are difficult to obtain from the current pedestrian datasets).
Furthermore, pre-trained CNN models with more layers (deeper) and diversified architectures could
be fine-tuned in order to extract more representative features for each input channel.
The low resolution results obtained in this Thesis can be extended to the entire detection system
(which is composed of the ACF and the developed CNN based method). In Chapter 4, the experimental
setup was not adequate for evaluating the ACF method's performance at low resolution. To
solve this problem, the image resolution can be reduced before applying the ACF method to
generate candidate windows. Afterwards, the low resolution candidate images could be fed into the
CNN. As a result, the performances for the CNN single channel and multichannel combination cases
could then be compared with the performance of the ACF method alone (i.e., the baseline).
Regarding the detection of pedestrian candidate windows for the test set, a distinct detector, such
as the Square Channels Feature or the Roerei detectors [5], could be used. Another possibility is to
integrate the multiscale sliding window task and the pedestrian bounding box prediction task into the
CNN, similarly to [50], instead of using an external detector to generate pedestrian candidate windows
(i.e., the ACF detector would not be needed).
Bibliography
[1] Caltech pedestrian detection benchmark. URL www.vision.caltech.edu/Image_Datasets/CaltechPedestrians. Last Access: July, 2015.
[2] Pedestrian traffic light image. URL http://novuslight.com/transportation-solution-for-pedestrians-in-cologne_N1232.html. Last Access: July, 2015.
[3] Pedestrian traffic sign image. URL http://iica.de/pd/index.py. Last Access: July, 2015.
[4] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool. Pedestrian detection at 100 frames per
second. In CVPR, 2012.
[5] R. Benenson, M. Mathias, T. Tuytelaars, and L. Van Gool. Seeking the strongest rigid detector. In
CVPR, 2013.
[6] R. Benenson, M. Omran, J. H. Hosang, and B. Schiele. Ten Years of Pedestrian Detection, What
Have We Learned? CoRR, abs/1411.4304, 2014.
[7] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):
1–127, 2009. doi: 10.1561/2200000006. Also published as a book. Now Publishers, 2009.
[8] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. CoRR,
abs/1206.5533, 2012.
[9] Y. Bengio, I. J. Goodfellow, and A. Courville. Deep learning. Book in preparation for MIT Press,
2015. URL http://www.iro.umontreal.ca/~bengioy/dlbook.
[10] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3D human pose annotations.
In International Conference on Computer Vision (ICCV), 2009.
[11] G. Carneiro, J. Nascimento, and A. Bradley. Unregistered multiview mammogram analysis with
pre-trained deep learning models. To appear in MICCAI, 2015.
[12] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the Devil in the Details: Delving
Deep into Convolutional Nets. In British Machine Vision Conference, 2014.
[13] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In International
Conference on Computer Vision & Pattern Recognition, volume 2, pages 886–893, 2005.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical
Image Database. In CVPR, 2009.
[15] P. Dollar. Piotr's Computer Vision Matlab Toolbox (PMT). http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.
[16] P. Dollar, Z. Tu, P. Perona, and S. Belongie. Integral Channel Features. In BMVC, 2009.
[17] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian Detection: A Benchmark. In CVPR, June
2009.
[18] P. Dollar, S. Belongie, and P. Perona. The fastest pedestrian detector in the west. In BMVC, 2010.
[19] P. Dollar, R. Appel, and W. Kienzle. Crosstalk Cascades for Frame-Rate Pedestrian Detection. In
ECCV, 2012.
[20] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian Detection: An Evaluation of the State of
the Art. PAMI, 34, 2012.
[21] P. Dollar, R. Appel, S. Belongie, and P. Perona. Fast Feature Pyramids for Object Detection. PAMI,
2014.
[22] L. Fei-Fei and A. Karpathy. Notes on Convolutional Neural Networks from the course CS231n: Convolutional Neural Networks for Visual Recognition, taught at Stanford University, Winter quarter, 2015. URL http://cs231n.stanford.edu/. Last Access: July, 2015.
[23] P. F. Felzenszwalb, D. A. McAllester, and D. Ramanan. A discriminatively trained, multiscale, de-
formable part model. In CVPR, 2008.
[24] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discrimina-
tively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence,
32(9):1627–1645, 2010.
[25] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant
learning. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition
CVPR, pages 264–271, 2003.
[26] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark
suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[27] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detec-
tion and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2014.
[28] C. Gu, J. J. Lim, P. Arbelaez, and J. Malik. Recognition using regions. In IEEE Computer Society
Conference on Computer Vision and Pattern Recognition CVPR, pages 1030–1037, 2009.
[29] J. H. Hosang, M. Omran, R. Benenson, and B. Schiele. Taking a Deeper Look at Pedestrians.
CoRR, abs/1501.05790, 2015.
[30] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Multimodal neural language models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, pages 595–603, 2014.
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional
Neural Networks. In NIPS, pages 1106–1114, 2012.
[32] J. J. Lim, C. L. Zitnick, and P. Dollar. Sketch tokens: A learned mid-level representation for contour
and object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages
3158–3165, 2013.
[33] Z. Lin and L. S. Davis. A pose-invariant descriptor for human detection and segmentation. In ECCV,
pages 423–436, 2008.
[34] P. Luo, Y. Tian, X. Wang, and X. Tang. Switchable deep network for pedestrian detection. CVPR,
2014.
[35] M. Mathias, R. Benenson, R. Timofte, and L. J. V. Gool. Handling occlusions with franken-
classifiers. In IEEE International Conference on Computer Vision ICCV, pages 1505–1512, 2013.
[36] W. Nam, P. Dollar, and J. H. Han. Local decorrelation for improved pedestrian detection. In Ad-
vances in Neural Information Processing Systems 27: Annual Conference on Neural Information
Processing Systems, pages 424–432, 2014.
[37] A. Ng, J. Ngiam, C. Y. Foo, Y. Mai, C. Suen, A. Coates, A. Maas, A. Hannun, B. Huval, T. Wang, and
S. Tandon. Deep Learning Tutorial. URL http://ufldl.stanford.edu/tutorial. Last Access:
July, 2015.
[38] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceed-
ings of the 28th International Conference on Machine Learning, ICML, pages 689–696, 2011.
[39] W. Ouyang and X. Wang. A discriminative deep model for pedestrian detection with occlusion
handling. CVPR, 2012.
[40] W. Ouyang and X. Wang. Single-pedestrian detection aided by multi-pedestrian detection. CVPR,
2013.
[41] W. Ouyang and X. Wang. Joint deep learning for pedestrian detection. ICCV, 2013.
[42] W. Ouyang, X. Zeng, and X. Wang. Modeling mutual visibility relationship in pedestrian detection.
In IEEE Conference on Computer Vision and Pattern Recognition, pages 3222–3229, 2013.
[43] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Strengthening the effectiveness of pedestrian
detection with spatially pooled features. CoRR, abs/1407.0786, 2014.
[44] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Pedestrian detection with spatially pooled
features and structured ensemble learning. CoRR, abs/1409.5209, 2014.
[45] C. Papageorgiou and T. Poggio. A trainable system for object detection. International Journal of
Computer Vision, 38(1):15–33, 2000.
[46] D. Park, D. Ramanan, and C. Fowlkes. Multiresolution models for object detection. In ECCV, pages
241–254, 2010.
[47] P. Sabzmeydani and G. Mori. Detecting pedestrians by learning shapelet features. In IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition CVPR, 2007.
[48] E. Seemann, B. Leibe, K. Mikolajczyk, and B. Schiele. An evaluation of local shape-based features
for pedestrian detection. In Proceedings of the British Machine Vision Conference, 2005.
[49] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised
multi-stage feature learning. CVPR, 2013.
[50] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated
recognition, localization and detection using convolutional networks. In International Conference
on Learning Representations (ICLR). CBLS, April 2014.
[51] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep boltzmann machines. In
F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Pro-
cessing Systems 25, pages 2222–2230. Curran Associates, Inc., 2012.
[52] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[53] M. Taiana, J. C. Nascimento, and A. Bernardino. On the purity of training and testing data for
learning: The case of Pedestrian Detection. Neurocomputing, vol. 150, Part A:214–226, 2015.
URL http://www.sciencedirect.com/science/article/pii/S0925231214012636.
[54] A. Vedaldi. AIMS Big Data, Lecture 3: Deep Learning, 2015. URL http://www.robots.ox.ac.uk/~vedaldi/assets/teach/2015/vedaldi15aims-bigdata-lecture-4-deep-learning-handout.pdf. Last Access: July, 2015.
[55] A. Vedaldi and K. Lenc. MatConvNet – Convolutional Neural Networks for MATLAB (including the
manual). CoRR, abs/1412.4564, 2014.
[56] A. Vedaldi and A. Zisserman. VGG Convolutional Neural Networks Practical. URL http://www.robots.ox.ac.uk/~vgg/practicals/cnn. Last Access: July, 2015.
[57] M. Viola, M. J. Jones, and P. Viola. Fast multi-view face detection. CVPR, 2001.
[58] P. A. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer
Vision, 57(2):137–154, 2004.
[59] P. A. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appear-
ance. In ICCV, pages 734–741, 2003.
[60] S. Walk, N. Majer, K. Schindler, and B. Schiele. New features and insights for pedestrian detection.
In IEEE Conference on Computer Vision and Pattern Recognition CVPR, pages 1030–1037, 2010.
[61] C. Wojek and B. Schiele. A performance evaluation of single and multi-feature people detection. In
Pattern Recognition (DAGM), 2008.
[62] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by
bayesian combination of edgelet part detectors. In IEEE International Conference on Computer
Vision ICCV, pages 90–97, 2005.
[63] B. Wu and R. Nevatia. Cluster boosted tree classifier for multi-view, multi-pose object detection. In
IEEE International Conference on Computer Vision ICCV, pages 1–8, 2007.
[64] X. Zeng, W. Ouyang, and X. Wang. Multi-stage contextual deep learning for pedestrian detection. ICCV, 2013.
[65] J. Yan, X. Zhang, Z. Lei, S. Liao, and S. Z. Li. Robust multi-resolution pedestrian detection in traffic
scenes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3033–3040,
2013.
[66] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural
networks? CoRR, abs/1411.1792, 2014.
[67] S. Zhang, C. Bauckhage, and A. B. Cremers. Informed haar-like features improve pedestrian de-
tection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 947–954,
2014.
[68] S. Zhang, R. Benenson, and B. Schiele. Filtered channel features for pedestrian detection. CoRR,
abs/1501.05759, 2015.
[69] Q. Zhu, M. Yeh, K. Cheng, and S. Avidan. Fast human detection using a cascade of histograms
of oriented gradients. In IEEE Computer Society Conference on Computer Vision and Pattern
Recognition CVPR, pages 1491–1498, 2006.