Use of Convolutional Neural Networks to Automate the ...ncfrn.mcgill.ca/members/pubs/CISRAM.pdf ·...

6
Use of Convolutional Neural Networks to Automate the Detection of Wildlife from Remote Cameras Rajarshi Maiti, Yi Hou, Colleen Cassady St. Clair and Hong Zhang Abstract— Determining the abundance, distribution, and habitat associations of wildlife species is important for under- standing their ecology, behavior and conservation. For mobile, rare, and wide-ranging species, biologists often obtain this information from remote cameras and time-lapse photography. The captured images are then visually inspected to identify those that contain useful information. Due to the large number of images to be processed, the task of visual inspection is painstaking and tedious. In this paper, we describe preliminary results of an automated screening system that is intended to alleviate this problem. Specifically, we study the problem of detecting grizzly bears (Ursus acrtos) in still images, using a convolutional neural network (CNN). Given each image, we first use the Maximally Stable Extremal Regions (MSER) to segment sub-regions that potentially contain a bear and then apply a pre-trained convolutional neural network as the classifier to determine if a bear is present in a sub-region. Experimental results from a real-world dataset demonstrate that our system is able to eliminate over 90% of the images from human inspection while recalling over 60% of the positive images that contain a bear, at a rate of approximately one minute per image. I. INTRODUCTION Determining the abundance and distribution of species is a fundamental goal of ecology and provides essential information for conservation programs. This information can be difficult to obtain for species that are mobile, especially if they are wide-ranging, secretive, or rare. For these species, wildlife ecologists increasingly rely on remote cameras to detect animal presence, approximate population sizes, and assess habitat use. Cameras are sometimes triggered via infrared signals, but this technique precludes use of remote cameras to monitor areas more than a few meters distant. Larger-scale monitoring is useful in many contexts, but espe- cially for determining the presence of animals in landscapes where they may be endangered by human infrastructure or activity. Consequently, remote cameras are often set to take images at specified intervals, which are processed to provide indices of relative use by animals over both time and space. An example of this common conservation context is pro- vided by the need to monitor grizzly bear (Ursus arctos) ac- tivity along railroad tracks in Banff and Yoho National Parks, Canada. There, mortality of bears by train strikes threatens the viability of the local population, creating an urgent need for mitigation. As part of a larger research project, time- lapse photography is being used at several locations that have historically been frequented by bears and will later be used R. Maiti and H. Zhang are with Department of Computing Science, University of Alberta, Edmonton, Alberta Canada. Yi Hou is visiting from National University of Defense Technology, China. Colleen Cassady St. Clair is with Department of Biological Sciences, University of Alberta. Email: {rmaiti, cstclair, hzhang}@ualberta.ca to evaluate the efficacy of warning and deterrent devices. The cameras are programmed to capture images once every fixed time interval (e.g., one minute), and operate on a 24- 7 basis. Consequently, many thousand images are generated each day, and these images need to be visually inspected to identify those that contain a positive sighting. Obviously, visual inspection of such a large number of images is time consuming, tedious and costly. In this paper, we describe a solution to this problem derived from collaboration between wildlife biologists and computer scientists. In particular, we describe an automated system that can screen the images to eliminate as many images as possible that contain no bears while retaining as many as possible of those images that contain a bear (Fig 1). We present preliminary results from our screening system based on the application of computer vision to a set of photos taken at a site in Banff National Park. Since the purpose of our proposed method is to determine whether an image contains a bear, it is essentially a visual object class detection problem, which aims at locating and identifying in an image instances of objects in a finite number of object classes. In our application, although we are interested only in the bear class, it is still a special case of class-specific detection and can benefit from the rich body of research in object detection where tremendous progress has been made in recent years. One popular and successful approach is based on first identifying in the input image sub- regions that likely contain objects of a predefined category and then applying a classifier to either accept or reject a region. A solution in this approach typically proceeds in three steps: (a) enumerate all possible regions in an input image (b) decide if any of them correspond to any of the predefined categories, and (c) combine the responses of all regions to obtain the final detection result [1]. Numerous works, including this paper, focus on the first two critical steps. To find regions of interest, the sliding window strategy is based on exhaustive search [2, 3]. This strategy involves exhaustively scanning the input image with a fixed-size rectangular window, aiming to locate all potential objects. Then, for each sub-region of the image bounded by this window, its features are extracted and fed into a classifier to determinate the class label of the sub-image. Although this method is conceptually simple, it is computationally expensive due to the complexity of exhaustive search. In order to reduce the number of image sub-regions to consider, many object proposal techniques [4, 5, 6, 7] became popular in recent years. Some of them are based on the hierarchical grouping of oversegmentation [4] or the voting by over-

Transcript of Use of Convolutional Neural Networks to Automate the ...ncfrn.mcgill.ca/members/pubs/CISRAM.pdf ·...

Page 1: Use of Convolutional Neural Networks to Automate the ...ncfrn.mcgill.ca/members/pubs/CISRAM.pdf · detecting grizzly bears (Ursus acrtos) in still images, using a convolutional neural

Use of Convolutional Neural Networks to Automate the Detection ofWildlife from Remote Cameras

Rajarshi Maiti, Yi Hou, Colleen Cassady St. Clair and Hong Zhang

Abstract— Determining the abundance, distribution, andhabitat associations of wildlife species is important for under-standing their ecology, behavior and conservation. For mobile,rare, and wide-ranging species, biologists often obtain thisinformation from remote cameras and time-lapse photography.The captured images are then visually inspected to identifythose that contain useful information. Due to the large numberof images to be processed, the task of visual inspection ispainstaking and tedious. In this paper, we describe preliminaryresults of an automated screening system that is intended toalleviate this problem. Specifically, we study the problem ofdetecting grizzly bears (Ursus acrtos) in still images, using aconvolutional neural network (CNN). Given each image, we firstuse the Maximally Stable Extremal Regions (MSER) to segmentsub-regions that potentially contain a bear and then apply apre-trained convolutional neural network as the classifier todetermine if a bear is present in a sub-region. Experimentalresults from a real-world dataset demonstrate that our system isable to eliminate over 90% of the images from human inspectionwhile recalling over 60% of the positive images that contain abear, at a rate of approximately one minute per image.

I. INTRODUCTION

Determining the abundance and distribution of speciesis a fundamental goal of ecology and provides essentialinformation for conservation programs. This information canbe difficult to obtain for species that are mobile, especially ifthey are wide-ranging, secretive, or rare. For these species,wildlife ecologists increasingly rely on remote cameras todetect animal presence, approximate population sizes, andassess habitat use. Cameras are sometimes triggered viainfrared signals, but this technique precludes use of remotecameras to monitor areas more than a few meters distant.Larger-scale monitoring is useful in many contexts, but espe-cially for determining the presence of animals in landscapeswhere they may be endangered by human infrastructure oractivity. Consequently, remote cameras are often set to takeimages at specified intervals, which are processed to provideindices of relative use by animals over both time and space.

An example of this common conservation context is pro-vided by the need to monitor grizzly bear (Ursus arctos) ac-tivity along railroad tracks in Banff and Yoho National Parks,Canada. There, mortality of bears by train strikes threatensthe viability of the local population, creating an urgent needfor mitigation. As part of a larger research project, time-lapse photography is being used at several locations that havehistorically been frequented by bears and will later be used

R. Maiti and H. Zhang are with Department of Computing Science,University of Alberta, Edmonton, Alberta Canada. Yi Hou is visiting fromNational University of Defense Technology, China. Colleen Cassady St.Clair is with Department of Biological Sciences, University of Alberta.

Email: {rmaiti, cstclair, hzhang}@ualberta.ca

to evaluate the efficacy of warning and deterrent devices.The cameras are programmed to capture images once everyfixed time interval (e.g., one minute), and operate on a 24-7 basis. Consequently, many thousand images are generatedeach day, and these images need to be visually inspectedto identify those that contain a positive sighting. Obviously,visual inspection of such a large number of images is timeconsuming, tedious and costly. In this paper, we describe asolution to this problem derived from collaboration betweenwildlife biologists and computer scientists. In particular, wedescribe an automated system that can screen the images toeliminate as many images as possible that contain no bearswhile retaining as many as possible of those images thatcontain a bear (Fig 1). We present preliminary results fromour screening system based on the application of computervision to a set of photos taken at a site in Banff NationalPark.

Since the purpose of our proposed method is to determinewhether an image contains a bear, it is essentially a visualobject class detection problem, which aims at locating andidentifying in an image instances of objects in a finitenumber of object classes. In our application, although we areinterested only in the bear class, it is still a special case ofclass-specific detection and can benefit from the rich bodyof research in object detection where tremendous progresshas been made in recent years. One popular and successfulapproach is based on first identifying in the input image sub-regions that likely contain objects of a predefined categoryand then applying a classifier to either accept or reject aregion. A solution in this approach typically proceeds in threesteps: (a) enumerate all possible regions in an input image(b) decide if any of them correspond to any of the predefinedcategories, and (c) combine the responses of all regions toobtain the final detection result [1].

Numerous works, including this paper, focus on the firsttwo critical steps. To find regions of interest, the slidingwindow strategy is based on exhaustive search [2, 3]. Thisstrategy involves exhaustively scanning the input image witha fixed-size rectangular window, aiming to locate all potentialobjects. Then, for each sub-region of the image bounded bythis window, its features are extracted and fed into a classifierto determinate the class label of the sub-image. Althoughthis method is conceptually simple, it is computationallyexpensive due to the complexity of exhaustive search. Inorder to reduce the number of image sub-regions to consider,many object proposal techniques [4, 5, 6, 7] became popularin recent years. Some of them are based on the hierarchicalgrouping of oversegmentation [4] or the voting by over-

Page 2: Use of Convolutional Neural Networks to Automate the ...ncfrn.mcgill.ca/members/pubs/CISRAM.pdf · detecting grizzly bears (Ursus acrtos) in still images, using a convolutional neural

Fig. 1: Bear detection system where from an input image bear region proposals are firstly extracted with MSER (a), then every bear region proposal isclassified by CNN (b), and the top-5 classes of the softmax layer of CNN are checked to decide whether the input image has a bear (c).

lapping contours [7]. The computational efficiency of theseobject-proposal methods is better than the sliding windowapproach, although their performance heavily depends on thequality of that of the oversegmentation or the correspondingextracted boundaries. Other methods use machine learning,cascaded boosting classifier [5] or a pre-trained linear classi-fier [6] for example, to screen the bounding boxes producedin the sliding window technique. Although this process isefficient, their localization is inaccurate and the pre-trainingof classifiers requires time [8]. Furthermore, in practicalapplications the number of image sub-regions that need tobe examined and classified is still high in order to achieve agood performance.

In this paper we use a simple approach based on Maxi-mally Stable Extremal Regions (MSER) [9] detector to ex-tract regions of interest. MSER is a blob detector that extractsco-variant image regions that are stable under affine geomet-ric and monotonic intensity transformations. Our motivationin using MSER is based on the following observation. Asshown in Fig. 1(a) where a bear is annotated in a yellowdashed bounding box, the pixels of the bear share a similarcolor or intensity value and, consequently, a likely imageregion with a bear can be found by grouping pixels in aneighborhood based on their intensity. If this can be donesuccessfully, then we can further examine all such regionscarefully to determine if a bear is present. MSER is oneregion detector which can detect such homogeneous regionswell (as shown in Fig. 1 (b)) with a reasonable computationalcost.

The second crucial step of object detection is object classi-fication for an extracted object proposal. Object classificationinvolves two key components: appearance modeling andclassification. Appearance modeling establishes an abstractrepresentation of the appearance of each object class, andthen each input image is expressed in this representationand matched with the those of various object classes by aclassifier. Finally, we can get a classification score for eachobject class or category.

The basis of object classification is visual representa-tion or visual feature extraction. As a traditional approach,hand-crafted features, i.e., those that are designed by theprocess of feature engineering to capture various desiredproperties of an image with expert knowledge, dominate incomputer vision research. These handcrafted features can be

at the patch-level – such as scale-invariant feature transform(SIFT) [10] – or at the region-level – such as bag-of-words(BoVW) [11]. A patch-level feature mainly captures low-level image properties such as intensity, texture, color, andit is often not sufficiently robust to deal with challengingscenarios. In contrast, a region-level feature has better invari-ant properties and can encode higher-level semantics of anobject category. These higher-level semantics are usually ab-stracted from low-level properties of the patch-level featuresusing, for example, machine learning algorithms. Despitehaving some higher-level semantics, the invariance propertyof region-level features is still limited with respect to affinetransformation and illumination because their performanceessentially depends on that of low-level features. In addition,the hand-crafted features can be time-consuming to extractbecause of the keypoint detection process and their vectorquantization.

Recent research in deep convolutional neural networks(CNN) have produced impressive results. A CNN is a typeof feed-forward neural network where the individual neuronsare grouped in such a way that they respond to overlappingregions of an image. A deep CNN typically consists of anumber of convolutional and one or more fully-connectedlayers whose parameters or weights are learned through theprocess of back-propagation that minimizes the predictionerror of the network. A CNN can be considered as ablackbox that computes various representations or features ofan input image with its various layers. These features havebeen shown to outperform the state-of-the-art hand-craftedfeatures on a number of standard visual tasks includingobject classification. For example, on the PASCAL [12] andILSVRC [13] visual recognition challenges, top-performingobject recognition [14, 15, 16, 17] methods are based onCNN, beating systems with traditional hand-crafted features.Similarly, for a special application such as face identifi-cation [18, 19], the remarkable advance is also attributedto CNN features. Clearly, these achievements demonstratethe ability of a CNN to learn features from visual datawith increasing levels of abstraction. This motivates us toinvestigate CNN as a potential feature descriptor, and useit in our bear detection system, which is faced with thechallenge of handling a diverse range of illumination, season,and background clutter conditions.

Our bear detection system combines MSER and CNN.

Page 3: Use of Convolutional Neural Networks to Automate the ...ncfrn.mcgill.ca/members/pubs/CISRAM.pdf · detecting grizzly bears (Ursus acrtos) in still images, using a convolutional neural

Specifically, MSER is used as a bear proposal detector. In thisway, we can efficiently extract MSER-based bear proposalsin an input image, which all potentially contain a bear. Then,these MSER-based bear proposals are sent into a pre-trainedCNN model, trained on over one million images including,fortunately, several species of bear as its classes within the1000 class set. By a layer-by-layer abstract representationsof the CNN, the extracted CNN-based features encode high-level semantic and discriminative information, critical to anobject classification solution. Next, these high-level CNN-based features are presented to the softmax layer of the CNNto obtain ranked class labels. Finally, based on an analysisof the classification results for all MSER-based bear regionproposals, we decide whether an input image contains a bear.

The rest of this paper is organized as follows. Section IIdetails our proposed method, especially its two key com-ponents: the MSER-based bear proposal extraction and theCNN-based bear classification. In Section III experimentalresults on a real-world monitoring dataset demonstrate theperformance of our method. Finally, we conclude the paperin Section IV with a summary and future work.

II. PROPOSED BEAR DETECTION SYSTEM

Fig. 1 gives an overview of our proposed bear imageidentification system. In this section we explain in detailits components, by following the flowchart in Fig. 1 andfocusing on its proposal generation and classification steps.

A. Overview of the Proposed System

In our system, first, MSER is applied to the input image todetect the regions that are potentially a bear. Second, for eachdetected MSER region, pre-trained CNN is used to generateprobabilities or levels of confidence on the class labels ofthe region. Third, class labels, each in the form of a list ofwords as exemplified in Fig. 1(c), are parsed and, if any labelbelongs to a “bear” class (a total of eight bear classes existout of the total of 1000 classes of objects), the region isconsidered a bear region. The image is a considered a bearimage if any of its MSER regions is a bear region.

B. MSER-Based Bear Proposal Detection

The MSER is originally developed as a local invariantfeature detector to extract co-variant regions from an image.Intuitively each of these regions has the property that itsintensity value is constant, and they are found by searchingfor connected components that are stable when variousintensity thresholds are applied to the image. As seen inFig. 1 (a), the pixels that make up the bear can often begrouped together based on their similar intensity values withrespect to the surrounding background.

As shown in Fig. 1 (b), there will be multiple MSERregions in an image, and these regions need to be analyzedby CNN to identify those that contain a bear. The defaultparameters of the MSER algorithm result in on the ordertwo thousand viable regions, which would impose a highcomputational burden on the subsequent classification step.Instead, we use pixel area of the regions to exclude from

Fig. 2: Bear class: brown bear, bruin. These are example images used inthe training of ImageNet CNN in Caffe [13, 21] adopted in this paper. Onecan see the large range of color, texture, and background objects and thisvariation provides strength to the CNN based classifier.

further consideration many that are either too small or toolarge and therefore highly unlikely to belong to a bear.Empirically, analysis of the positive cases in our datasethas allowed us to set 280 and 9000 as the lower and upperbounds of the pixel area, to filter out over 60% of the MSERregions and reduce the number of MSERs to a few hundreds,although the specific thresholds are obviously applicationdependent.

C. CNN-Based Bear Classification

A CNN is a multi-layer neural network. We use a pre-trained CNN, shown in Fig. 2, that has been trained torecognize objects in 1000 classes, eight of which are variousspecies of bear. The architecture of this CNN model [20]is available with an open-source deep learning frameworkcalled Caffe [21]. As can be seen, this CNN consists of fiveconvolutional layers and three fully-connected layers. A max-pooling layer follows the first, second and fifth convolutionallayer. The function of the max pooling layers is to enhancethe transformation invariance of the feature map and reduceits dimension. In fact, it is a process of building an abstractrepresentation, by merging the lower-level local information.In a fully-connected layer, all responses of all neurons in theprevious layer are connected into every single neuron of thecurrent layer. This full connection is another step of high-level abstraction beneficial to image classification or objectrecognition. With such a deep architecture, the CNN is ableto learn high-level semantic features. Finally, the output ofthe last fully-connected layer is fed to a 1000-way softmaxlayer, which computes the probabilities of the 1000 classlabels.

In our study, we use the CNN in Caffe to classify MSER-based bear proposals extracted in the first stage. This CNNwas trained on a dataset called ImageNet LSVRC-2010 [13]with over 1.2 million images of 1000 different object cate-gories of which eight define “bear” classes:

– Koala, koala bear, kangaroo bear, native bear– Brown bear, bruin– American black bear, black bear, Ursus americanus

Page 4: Use of Convolutional Neural Networks to Automate the ...ncfrn.mcgill.ca/members/pubs/CISRAM.pdf · detecting grizzly bears (Ursus acrtos) in still images, using a convolutional neural

224

Stride

of 4

Maxing

pool

Maxing

poolMaxing

pool

96

256

384 384 256

4096

1000

4096

1000

11

11

224

5

5

55

55

27

27

3

33

3 3

3

13

13 13 13

13 13

dense dense

CONV1 CONV2 CONV3 CONV4 CONV5 FC6 FC7 FC8 SoftmaxImage

Fig. 3: The architecture of the standard CNN in Caffe. It is a multi-layerneural network, which consists of eight layers: five convolutional layers(CONV1-CONV5) and three fully-connected layers (FC6-FC8). Finally, theoutput of FC8 is fed to a 1000-way softmax layer (yellow rectangle) whichpresents the probability over the 1000 class labels. In our method the 1000-way softmax layer is used as a bear classifier.

– Ice bear, polar bear, Ursus Maritimus– Sloth bear, Melursus ursinus– Lesser panda, red panda, panda, bear cat, cat bear– Giant panda, panda, panda bear, coon bear– Teddy, teddy bear

Example training images for the Brown bear class (#2) areshown in Fig. 2, which most closely resemble the onesobserved in the dataset we work with in our application.Note that within each class, the terms are synonymous. Afterwe use the 1000-way softmax layer of CNN to compute theprobabilities of the class labels, we follow the conventionin [13], and take the five classes with the highest probabilitiescomputed by the softmax layer as the final description of theinput image region. An image is classified as containing abear if at least one bear class appears in the top-5 labels ofany of the regions.

III. EXPERIMENTAL RESULTS

A. Dataset

Experiments were conducted on a dataset whose imageswere collected near Lake Louise where the railway crossesunder a road in Banff National Park, Alberta, Canada.Initially a total of 37,000 images were captured over aperiod of three months. All 37,000 images were then visuallyinspected and 27 images were found to contain bears. These27 constituted positive cases in the ground truth. Sincetesting our system against all 37,000 images would be timeconsuming, we created a smaller subset of 3027 images,including the 27 positive cases. As a result, the final datasetused in the evaluation contained 3000 negative cases, i.e.,without bears, and 27 images with bears. The 3000 plusimages cover a time frame of roughly 36 hours under a widerange of lighting conditions for dawn, daytime, evening andnight. Fig. 4(a) and Fig. 4(b) show some examples of imagesfrom this dataset, with and without bears respectively.

B. Implementation Details

For MSER, we use the OpenCV [22] implementation,which is a popular open source library of computer visionalgorithms. This MSER implementation has a number ofparameters to be set. We use its default parameters in ourexperiments except for the minimum and maximum areaparameters, at 280 and 9000 respectively, in order to limitthe number of regions to be classified, as explained inthe previous section. For CNN, we use Caffe [21] with

Fig. 5: Some examples of correct MSER-based proposals. For visualization,detected bears are bounded by red rectangles in cropped regions from theactual images.

[‘n02134418 sloth bear, Melursus ursinus, ...’‘n02133161 American black bear, black bear, ...’‘n02132136 brown bear, bruin, Ursus arctos’‘n02483708 siamang, Hylobates syndactylus, ...’‘n02481823 chimpanzee, chimp, Pan troglodytes’]

Fig. 6: An example of correct CNN-based classification for a MSER-basedproposal with a bear. The first three of the top-5 classes are bear labels.

a reconstruction of the standard CNN architecture [20] toclassify extracted MSER-based bear proposals, and we usethe pre-trained model of Caffe on ImageNet. The hardwareused is an Intel Core2 Duo processor running at 3.0 GHzwith 5 GB of RAM. To achieve time efficiency, we run theCNN of the detection system on a GPU by Nvidia (QuadroK4200) with 1344 cores and 4GB of RAM.

To measure performance of our system, we use the stan-dard evaluation criterion of Precision-Recall. Precision isdefined as the ratio of true positives Tp over the sum oftrue positives Tp and false positives Fp. Recall is defined asthe ratio of true positives Tp over the sum of true positivesTp and false negatives Fn. A high precision or a high recallby itself is easy to achieve but an ideal identification systemstrikes the right balance between precision and recall.

C. Results and Analysis

Table I shows the recall and precision values when ourmethod was tested against the dataset described above. Outof the 27 positive images, 17 were detected successfully.Examples of correct detections are shown in Fig. 5. Outof the 3000 negative images, 2788 were classified correctly(true negatives) whereas 212 were detected as positive im-ages (false positives). In other words, approximately 93%(2788/3000) of the negative images are correctly excludedfrom further consideration. An example of CNN output isshown in Fig. 6. In this case, we see three bear classesin the top-5 class labels. This performance represents arecall of 63% and precision of 7%. Although they are farfrom perfect, it is a promising start and a baseline forour future research. The processing time is on the averageapproximately one minute per image, with most time spenton CNN classification at 0.1 second per image sub-region.

TABLE I: Recall and Precision on the Banff Morant Bridge dataset.

Recall = Tp / (Tp + Fn) = 17 / (17 + 10) = 63%Precision = Tp / (Tp + Fp) = 17 / (17 + 212) = 7%

Among the 17 successful detections, visual inspectionof the result shows that all are due to the correct operationof both MSER proposal and CNN classification, rather thanby chance (e.g., a non-bear MSER region could be classified

Page 5: Use of Convolutional Neural Networks to Automate the ...ncfrn.mcgill.ca/members/pubs/CISRAM.pdf · detecting grizzly bears (Ursus acrtos) in still images, using a convolutional neural

(a) Nine examples of positive images with bears (b) Nine examples of negative images without bears

Fig. 4: Example images in the experimental dataset (a) nine with bears (bounded by a yellow box) and (b) nine without bears.

as positive in a positive image). Of the 10 false negatives,MSER is responsible for three whereas CNN is responsiblefor seven. Examples of a false negative and a false positiveare shown in Fig. 7(a) and Fig. 7(b) respectively. Due toocclusion by the fence, the MSER fails to detect the bear asa contiguous region in Fig. 7(a) and in Fig. 7(b), the CNNincorrectly classified a portion of the fence as a “polar bear”.

(a) Missed detection by MSER (b) False classification by CNN

Fig. 7: Two failure cases due to (a) missed detection by MSER and (b)false classifications by CNN.

IV. CONCLUSION AND FUTURE WORK

We have presented in this paper an automated detectionsystem for detecting grizzly bears in still images obtainedfrom remote cameras positioned to cover approximately100 m2 of adjacent habitat. We achieved this by first usingMSER as an object proposal generator to extract a number ofregions that likely contained a bear. Then a pre-trained CNNwas used to classify the MSER-based proposals to obtaintheir class labels. The decision on bear presence was madeon the basis of whether the keyword “bear” was contained inthe top-5 class labels, Experimental evaluation of our systemshowed that it was able to eliminate over 90% of the bear-free images from further consideration while retaining over60% of the bear images.

The success of our technique is attributable largely tothe ability of deep convolutional neural networks to extracta semantic description of an image region and effectivelyhandle the large degree of variation in bear appearances,lighting, and background clutter. The previous processing ofthese images by people was tedious and time-consuming,which invited procrastination to create processing backlogs.By formulating the problem as one of object detection andclassification, our system has vastly reduced the time neededby people, with concomitant increases in their availability forother tasks and more rapid production of analysis-ready data.

Our system is easily generalized to other contexts thatinvolve large-scale monitoring of wildlife for diverse pur-poses by academics, industry, government, and not-for-profitorganizations. Our own future work will refine the system by(a) reducing processing time by eliminating irrelevant areasvia automated pre-processing, and (b) increasing the speci-ficity of regions in the MSER and improving the accuracy ofbear classification via the CNN. With this and similar tools,the processing of time-lapse photos from remote wildlifecameras will provide information comparable to that affordedby more invasive techniques, such as GPS collars and tags,but without their monetary, ethical, and logistical costs.

ACKNOWLEDGMENTS

This work was supported partially by Natural Sci-ence and Engineering Research Council (NSERC), theHunan Provincial Innovation Foundation for Postgraduate(No.CX2014B021), the Fund of Innovation of NUDT Grad-uate School (No.B140406), the Hunan Provincial NaturalScience Foundation of China (No.2015JJ3018) and the ChinaScholarship Council Scholarship. Images in this work werecaptured and used with permission as part of the GrizzlyBear Conservation Initiative, jointly supported by CanadianPacific Railway and Parks Canada Agency, and NSERC.

Page 6: Use of Convolutional Neural Networks to Automate the ...ncfrn.mcgill.ca/members/pubs/CISRAM.pdf · detecting grizzly bears (Ursus acrtos) in still images, using a convolutional neural

REFERENCES

[1] Xin Zhang, Yee-Hong Yang, Zhiguang Han, Hui Wang,and Chao Gao. Object class detection: A survey. ACMComputing Surveys (CSUR), 46(1):10:1–10:53, 2013.

[2] Navneet Dalal and Bill Triggs. Histograms of orientedgradients for human detection. In IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR),volume 1, pages 886–893, 2005.

[3] Pedro F Felzenszwalb, Ross B Girshick, DavidMcAllester, and Deva Ramanan. Object Detectionwith Discriminatively Trained Part-based Models. IEEETransactions on Pattern Analysis and Machine Intelli-gence, 32(9):1627–45, 2010.

[4] K E A van de Sande, J R R Uijlings, T Gevers,and A W M Smeulders. Segmentation as SelectiveSearch for Object Recognition. In IEEE InternationalConference on Computer Vision (ICCV), pages 1879–1886, 2011.

[5] Xiaoyu Wang, Ming Yang, Shenghuo Zhu, and Yuan-qing Lin. Regionlets for Generic Object Detection.In IEEE International Conference on Computer Vision(ICCV), pages 17–24, 2013.

[6] Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, andPhilip H S Torr. BING: Binarized Normed Gradients forObjectness Estimation at 300fps. In IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR),pages 3286 – 3293, 2014.

[7] C Lawrence Zitnick and Piotr Dollar. Edge boxes:Locating object proposals from edges. In ComputerVision–ECCV 2014, pages 391–405. Springer, 2014.

[8] Jan Hendrik Hosang, Rodrigo Benenson, and BerntSchiele. How Good Are Detection Proposals, Really?arXiv:1406.6962, 2014.

[9] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robustwide baseline stereo from maximally stable extremalregions. In Proceedings of the British Machine VisionConference, pages 36.1–36.10. BMVA Press, 2002.

[10] David G Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of ComputerVision, 60(2):91–110, November 2004.

[11] J. Sivic and A. Zisserman. Video Google: A TextRetrieval Approach to Object Matching in Videos. InIEEE International Conference on Computer Vision(ICCV), volume 2, pages 1470–1477, 2003.

[12] Mark Everingham, Luc Van Gool, Christopher KIWilliams, John Winn, and Andrew Zisserman. Thepascal visual object classes (voc) challenge. Interna-tional Journal of Computer Vision (IJCV), 88(2):303–338, 2010.

[13] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause,Sanjeev Satheesh, Sean Ma, Zhiheng Huang, AndrejKarpathy, Aditya Khosla, Michael Bernstein, Alexan-der C. Berg, and Li Fei-Fei. ImageNet Large ScaleVisual Recognition Challenge. arXiv:1409.0575, 2014.

[14] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoff-man, Ning Zhang, Eric Tzeng, and Trevor Darrell.

DeCAF: A Deep Convolutional Activation Feature forGeneric Visual Recognition. arXiv:1310.1531, 2013.

[15] Pierre Sermanet, David Eigen, Xiang Zhang, MichaelMathieu, Rob Fergus, and Yann LeCun. Overfeat: In-tegrated Recognition, Localization and Detection usingConvolutional Networks. arXiv:1312.6229, 2013.

[16] M Oquab, L Bottou, I Laptev, and J Sivic. Learning andTransferring Mid-level Image Representations UsingConvolutional Neural Networks. In IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR),pages 1717–1724, 2014.

[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, andJian Sun. Spatial pyramid pooling in deep convolu-tional networks for visual recognition. arXiv preprintarXiv:1406.4729, 2014.

[18] Yi Sun, Yuheng Chen, Xiaogang Wang, and XiaoouTang. Deep learning face representation by jointidentification-verification. In Advances in Neural Infor-mation Processing Systems, pages 1988–1996, 2014.

[19] Y. Taigman, Ming Yang, M. Ranzato, and L. Wolf.Deepface: Closing the gap to human-level performancein face verification. In IEEE Conference on ComputerVision and Pattern Recognition (CVPR), pages 1701–1708, 2014.

[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hin-ton. ImageNet Classification with Deep ConvolutionalNeural Networks. In Advances in Neural InformationProcessing Systems (NIPS), pages 1097–1105, 2012.

[21] Yangqing Jia, Evan Shelhamer, Jeff Donahue, SergeyKarayev, Jonathan Long, Ross Girshick, Sergio Guadar-rama, and Trevor Darrell. Caffe: Convolutional archi-tecture for fast feature embedding. arXiv:1408.5093,2014.

[22] G. Bradski. The OpenCV Library. Dr. Dobb’s Journalof Software Tools, 2000.