
SUPPLEMENTARY MATERIAL

for

How do targets, nontargets, and context influence real-world object detection?

Harish Katti1, Marius V. Peelen2 & S. P. Arun1

1Centre for Neuroscience, Indian Institute of Science, Bangalore, India, 560012; 2Center for Mind/Brain Sciences, University of Trento, 38068 Rovereto, Italy.

CONTENTS

SECTION S1: COMPUTATIONAL MODELS

SECTION S2: ANALYSIS OF LOW-LEVEL FACTORS

SECTION S3: PARTICIPANT FEEDBACK

SECTION S4: ANALYSIS OF DEEP NEURAL NETWORKS

SECTION S5: SUPPLEMENTARY REFERENCES



SECTION S1: COMPUTATIONAL MODELS

Target features: We used a histogram of oriented gradients (HOG) to learn a bag of six templates consisting of two views each of three poses of people, and six unique templates for cars (Felzenszwalb, Girshick, McAllester, & Ramanan, 2010). These templates are essentially filters: convolving them with intensity information at different locations and scales of a scene indicates whether some regions bear a strong or weak resemblance to cars or people. In this implementation, a detector score of zero indicates an exact match between a region and a template, and more negative values indicate weaker matches. The degree of match between the person template and a scene region was thresholded at two levels: a tight threshold of -0.7 that yields very few false alarms across the entire data set, and a weaker threshold of -1.2 that allows correct detections as well as false alarms. A total of 31 attributes were then defined over the person detections in an image:

1. Number of high-confidence detections (estimate of hits).

2. Difference between high- and low-confidence detections (estimate of false alarms).

3. Average area of detected person boxes, weighted by the detector score of each detection box. We weighted detections by the detector score based on feedback from subjects, who indicated greater ease of target detection for larger targets and when the target appearance is more conspicuous.

4. Average deformation of each unique part in detected boxes. This is calculated by first normalizing each detection to a unit square and finding the displacement of each detected part from the mean location of that part over the entire set of 1,300 scenes in the car task or 1,300 scenes in the person task, as applicable.

5. Five-bin histogram of the eccentricity of person detections with respect to fixation.

6. Six-bin histogram of the six person model types as detected in the scene.

A similar set of 31 attributes was defined for car detections; in this manner, we represented each scene by a 62-dimensional vector capturing various attributes of partial matches to targets in the scene. The coarse and part structure information captured by the car and person HOG templates is visualized in Fig. S1. Representative examples of hits and false alarms from partial matches of car and person HOG templates are shown in Fig. S2.
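As an illustration of how such detection summaries can be assembled, the following minimal sketch collapses a scene's raw HOG detections into a fixed-length attribute vector. It assumes detections are already available as dictionaries holding a detector score, bounding box, matched view index, eccentricity from fixation, and normalized part locations; the score-based weighting and the eccentricity range are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

TIGHT, WEAK = -0.7, -1.2  # detector-score thresholds quoted in the text

def summarize_detections(dets, mean_parts, n_views=6, n_parts=8, ecc_bins=5, max_ecc=20.0):
    """Collapse the raw HOG detections of one scene into a fixed-length summary.

    Each entry of `dets` is assumed to be a dict with keys 'score', 'box'
    (x1, y1, x2, y2 in degrees), 'view' (0..n_views-1), 'ecc' (degrees from
    fixation), and 'parts' ((n_parts, 2) part locations in the unit square).
    `mean_parts` holds the mean part locations over the whole scene set.
    """
    if not dets:
        return np.zeros(3 + 2 * n_parts + ecc_bins + n_views)

    scores = np.array([d['score'] for d in dets])
    hits = int((scores >= TIGHT).sum())          # high-confidence matches
    fas = int((scores >= WEAK).sum()) - hits     # weak-only matches ~ false alarms

    # Score-weighted average box area (the weighting scheme is an assumption).
    areas = np.array([(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in (d['box'] for d in dets)])
    weights = np.exp(scores)                     # scores closer to 0 (better) weigh more
    avg_area = float(np.average(areas, weights=weights))

    # Average displacement of each part from its data-set-wide mean location.
    parts = np.stack([d['parts'] for d in dets])           # (n_det, n_parts, 2)
    deform = (parts - mean_parts).mean(axis=0).ravel()     # 2 * n_parts values

    ecc_hist, _ = np.histogram([d['ecc'] for d in dets], bins=ecc_bins, range=(0, max_ecc))
    view_hist = np.bincount([d['view'] for d in dets], minlength=n_views)

    return np.concatenate([[hits, fas, avg_area], deform, ecc_hist, view_hist])
```

Concatenating one such summary for the person detector with one for the car detector gives the per-scene target feature vector used by the regression models.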

To establish that this manner of summarizing HOG template matches is indeed useful, we also evaluated target model performance using only the average HOG histograms (Dalal & Triggs, 2005) over all partial target match boxes arising from the same source detector (Felzenszwalb et al., 2010). Model correlations for this baseline detector (r = 0.39 ± 0.026 for person detection and r = 0.37 ± 0.027 for car detection) are lower than the correlations for our HOG summary models (r = 0.45 ± 0.01 for person detection and r = 0.57 ± 0.01 for car detection).

We also evaluated whether unique aspects of response time variation can be explained by specific target attributes such as configural changes in the parts detected within a HOG template. For this purpose, we retained only the 16-dimensional part deformation information and trained models for target detection.


We recomputed person model correlations after regressing out the remaining person target attributes, responses on the same scenes in car detection, distance of the nearest person to the scene center, largest person size, number of people, and predictions of a SIFT-based clutter model. Likewise, for models trained to predict detection responses in the car task, we recomputed correlations after regressing out the remaining car target attributes, responses on the same scenes in person detection, distance of the nearest car to the scene center, largest car size, number of cars, and predictions of a SIFT-based clutter model. We observe that models trained with part deformation information alone predict target detection response times (r = 0.16 ± 0.015 for person detection and r = 0.1 ± 0.036 for car detection).
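A minimal sketch of this regressing-out step is shown below, assuming the nuisance variables listed above are stacked as columns of a matrix; the use of ordinary least squares here is an assumption about the exact estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def regress_out(y, nuisance):
    """Return the residuals of y after linearly removing the nuisance predictors."""
    fit = LinearRegression().fit(nuisance, y)
    return y - fit.predict(nuisance)

# Hypothetical inputs: rt is (n_scenes,) person-detection RTs, deform_pred is the
# prediction of the 16-D part-deformation model, and nuisance is (n_scenes, k) with
# columns for the other target attributes, car-task RTs on the same scenes,
# nearest-person distance, largest person size, number of people, and SIFT clutter.
# r_unique = np.corrcoef(regress_out(rt, nuisance), deform_pred)[0, 1]
```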

Nontarget features: All 1,300 scenes used in the car detection task and 1,300 scenes used in the person detection task were annotated for the presence of objects. We avoided extracting features from each nontarget object because isolating each object is extremely cumbersome, and because nontarget objects may potentially share visual features with the target. Instead, we annotated each scene with binary labels corresponding to the presence of each particular nontarget object. Objects were included in the annotation only if they occurred close to the typical scale of objects in the data set; global scene attributes and visual concepts such as ‘sky’, ‘water’, etc. were not annotated. Annotations were standardized to a 67-word dictionary, and the final list of unique object labels, along with frequency of occurrence in the 1,300 scenes used in the car detection task, is: text (277), sign (460), stripe (243), pole (504), window (679), entrance (77), tree (687), lamppost (308), fence (271), bush (116), colour (133), roof (225), box (84), thing (90), glass (151), manhole-cover (19), door (279), hydrant (37), dustbin (80), bench (51), snow (4), stair (79), cable (148), traffic-light (78), parking-meter (16), lamp (38), cycle (61), boat (22), rock (47), flower-pot (46), statue (20), flower (33), flag (31), wheel (10), table (24), animal (14), cloud (27), cone (15), chair (34), shadow (6), umbrella (15), bag (18), hat (2), lights (1), cannon (1), grating (1), bird (7), bright (2), cap (1), cart (1), lamp-post (2), spot (1), wall (0), light (0), branch (0), clock (0), shoe (0), vehicle (0), spectacles (0), shelter (0), gun (0), drum (0), sword (0), pumpkin (0), bottle (0), pipe (1), leaf (1).

The frequency of occurrence of each of these 67 labels in the 1,300 scenes used in the person detection task is: text (294), sign (412), stripe (134), pole (446), window (563), entrance (81), tree (598), lamppost (280), fence (227), bush (50), colour (136), roof (186), box (90), thing (118), glass (170), manhole-cover (10), door (242), hydrant (22), dustbin (87), bench (78), snow (4), stair (63), cable (87), traffic-light (71), parking-meter (10), lamp (47), cycle (105), boat (23), rock (55), flower-pot (66), statue (33), flower (35), flag (33), wheel (31), table (37), animal (21), cloud (7), cone (11), chair (59), shadow (8), umbrella (44), bag (116), hat (6), lights (1), cannon (1), grating (1), bird (8), bright (2), cap (11), cart (1), lamp-post (4), spot (1), wall (1), light (1), branch (1), clock (1), shoe (2), vehicle (1), spectacles (1), shelter (1), gun (1), drum (1), sword (1), pumpkin (1), bottle (1), pipe (1), leaf (1).

We excluded visual concepts that could be global (snow) or are more like visual features (bright, color, shadow, lights/reflections, and bright spot).

We also excluded labels with very rare (fewer than 10) occurrences in the target-present or target-absent data sets; this was done to ensure stable regression and model fits. In this manner, we limited the models to use a maximum of 36 unique nontarget labels. We also verified that there is no qualitative change in the reported results on including these rare nontarget objects.

Regression models were trained to predict target rejection and detection RTs for cars and people separately.


These regression models yield informative weights, shown in Fig. S3, that indicate which nontargets typically speed up or slow down target rejection or detection. Examples are ‘cycle’, ‘dustbin’, and ‘hydrant’, which speed up person detection and slow down person rejection (Fig. S3a, b). Similarly, nontargets such as ‘cone’ (traffic cone) and ‘entrance’ speed up car detection and slow down car rejection.
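A minimal sketch of this nontarget analysis, assuming the per-scene annotations are available as sets of label words and using a cross-validated ridge regression (the exact regularizer used in the paper is not specified here), is:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def binary_label_matrix(scene_labels, vocab):
    """Scenes x labels indicator matrix from per-scene sets of annotation words."""
    X = np.zeros((len(scene_labels), len(vocab)))
    for i, words in enumerate(scene_labels):
        for j, label in enumerate(vocab):
            X[i, j] = float(label in words)
    return X

# vocab: the 36 retained nontarget labels; rejection_rt: per-scene rejection RTs.
# X = binary_label_matrix(scene_labels, vocab)
# model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, rejection_rt)
# Positive weights mark nontargets that slow rejection down, negative weights mark
# nontargets that speed it up (cf. Fig. S3).
# slowest = [vocab[j] for j in np.argsort(model.coef_)[-5:]]
```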

Coarse scene structure: These features are derived from the energy information over representative orientations at multiple scales in a scene and were first proposed by Oliva and Torralba (2001). This coarse scene envelope was extracted over blurred versions of the input scene to prevent object-scale statistics from contaminating the coarse scene description. This process yields a 512-dimensional feature vector for each scene. In post hoc analysis, we verified that blurred scenes do not give rise to target matches (Felzenszwalb et al., 2010). We found that this method of modelling coarse scene envelopes outperforms other approaches such as extracting activations from a scene classification CNN given blurred scenes as input. One important reason for the better performance of the GIST operator on our set is that it captures variations arising from changes in field of view and scene depth more reliably in its first few principal components than the CNN-based descriptor.
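A GIST-style descriptor of this kind can be sketched as oriented Gabor energy pooled over a coarse spatial grid of a blurred scene; the specific filter bank, blur width, and pooling grid below are assumptions chosen only to reproduce the 512-dimensional size quoted above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.filters import gabor
from skimage.transform import resize

def gist_like(scene_gray, blur_sigma=8, frequencies=(0.05, 0.1, 0.2, 0.4),
              n_orient=8, grid=4):
    """Coarse scene-envelope descriptor in the spirit of Oliva and Torralba (2001).

    The scene is blurred first so that object-scale detail cannot leak into the
    descriptor; oriented Gabor energy is then averaged over a grid x grid layout.
    4 frequencies x 8 orientations x 4 x 4 blocks = 512 values.
    """
    img = gaussian_filter(scene_gray.astype(float), blur_sigma)
    feats = []
    for f in frequencies:
        for k in range(n_orient):
            real, imag = gabor(img, frequency=f, theta=k * np.pi / n_orient)
            energy = np.hypot(real, imag)
            feats.append(resize(energy, (grid, grid), anti_aliasing=True).ravel())
    return np.concatenate(feats)  # length 512 for the defaults above
```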

Other alternative features considered: We considered several alternative feature representations, but their performance was generally inferior to that of our final feature representation for each channel. The details of these models and their performance relative to the best models are listed below.

Table S1. Description of baseline models.

Information type | Description | Notes
Target features | Whole-target HOG description trained iteratively using support vector machines and hard negative examples. | Fewer hits and less meaningful false alarms when compared to the deformable sum-of-parts HOG model (Felzenszwalb et al., 2010).
Target features | Average HOG histograms from partial matches to target appearance. | Standard HOG histograms were extracted (Dalal & Triggs, 2005) from the locations of partial template matches using the method in Felzenszwalb et al. (2010). Models informed with these features explained less than 30% of the variance explained by models informed with the detection summaries described above.
Nontarget features | Softmax confidence scores on object categories from a deep convolutional network (object CNN) trained for 1,000-way object classification (Zhou, Khosla, Lapedriza, Oliva, & Torralba, 2014). | The network is biased and always gives false alarms for some objects more than others. False alarms for some important nontarget categories indicate that the CNN has learnt context more than object appearance for some categories. Regression models on softmax confidence scores predict RTs poorly compared to features in the penultimate layers of the CNN.
Coarse scene structure features | Deep convolutional features over blurred versions of input scenes. | Each scene was represented by a 4,096-dimensional real-valued vector obtained by presenting the scene as input to a pretrained deep convolutional network (CNN) trained for 205-way scene classification (Zhou et al., 2014).
Coarse scene structure features | Combination of GIST and activations of a deep convolutional network to blurred scenes. | Combining GIST features with scene classification CNN activations did not improve performance either.


Table S2. Baseline model generalization to new scenes in person and car detection.

Feature type | Model name | rc (Person detection) | Best model of same type | Model name | rc (Car detection) | Best model of same type
Noise ceiling | | 0.45 ± 0.02 | | | 0.45 ± 0.02 |
Avg HOG | T | 0.24 ± 0.02 | 0.41 ± 0.01 | T | 0.31 ± 0.02 | 0.50 ± 0.01
Nontarget Softmax | N | 0.12 ± 0.01 | 0.14 ± 0.02 | N | 0.04 ± 0.02 | 0.17 ± 0.02
Blur scene CNN | C | 0.25 ± 0.01 | 0.30 ± 0.01 | C | 0.28 ± 0.01 | 0.41 ± 0.01
Blur scene CNN + GIST | C | 0.23 ± 0.01 | 0.30 ± 0.01 | C | 0.35 ± 0.01 | 0.41 ± 0.01

Note. Performance of baseline models for each information channel is shown alongside the performance of the model trained with the most informative target/nontarget/coarse scene features. Conventions are as in Table 1 in the main manuscript.

Table S3. Baseline model generalization to new scenes in person and car rejection.


Feature type | Model name | rc (Person rejection) | Best model of same type | Model name | rc (Car rejection) | Best model of same type
Noise ceiling | | 0.45 ± 0.02 | | | 0.45 ± 0.02 |
Avg HOG | T | 0.01 ± 0.02 | 0.25 ± 0.01 | T | 0.04 ± 0.02 | 0.06 ± 0.02
Nontarget Softmax | N | 0.14 ± 0.02 | 0.20 ± 0.02 | N | 0.09 ± 0.02 | 0.34 ± 0.01
Blur scene CNN | C | 0.06 ± 0.02 | 0.14 ± 0.02 | C | 0.13 ± 0.02 | 0.15 ± 0.02
Blur scene CNN + GIST | C | 0.06 ± 0.02 | 0.14 ± 0.02 | C | 0.09 ± 0.02 | 0.15 ± 0.02

Note. Performance of baseline models for each information channel is shown alongside the performance of the model trained with the most informative target/nontarget/coarse scene features. Conventions are as in Table 1 in the main manuscript.
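The rc values in Tables S2–S3 are correlations between model predictions and observed RTs on held-out scenes. A minimal sketch of such a cross-validated correlation is given below; the 80/20 split, the ridge regularizer, and the number of repeats are assumptions rather than the exact protocol of the paper.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import ShuffleSplit

def cross_validated_r(X, rt, n_splits=20, test_size=0.2, seed=0):
    """Mean and spread of the correlation between predicted and observed RTs
    on held-out scenes, across repeated random train/test splits."""
    rs = []
    splitter = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
    for train, test in splitter.split(X):
        model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X[train], rt[train])
        rs.append(np.corrcoef(model.predict(X[test]), rt[test])[0, 1])
    return float(np.mean(rs)), float(np.std(rs))
```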


Fig. S1. Visualization of target templates learned for cars and people. a Histogram of oriented gradient structure learnt for three out of six canonical views of isolated cars. Three more views are generated by flipping the shown templates about the vertical axis. b Histogram of oriented gradient structure learnt for eight parts within each view of a car. c Deformation penalties imposed on the eight parts defined within each car template; part detections in whiter regions incur more penalty. d Histogram of oriented gradient structure learnt for three out of six canonical views of isolated people. Three more views are generated by flipping the shown templates about the vertical axis. e Histogram of oriented gradient structure learnt for eight parts within each view of a person. f Deformation penalties imposed on the eight parts defined within each person template; part detections in whiter regions incur more penalty.


Fig. S2. Representative examples of true and false detections from the HOG detectors trained for cars and people. Illustrative examples of correct and incorrect matches to person (a–c) and car (d–f) HOG templates. The HOG representation of each scene is visualized along with the correct (hits) and incorrect (false alarms) matches. a Person-like structure embedded in the intensity gradient information on doors, giving rise to person false alarms. b Car false alarm due to the box-like shape and internal structure of a parking entrance.


Fig. S3. Feature weights estimated for nontarget labels using regression analysis. Regression weights estimated for nontarget labels over car- and person-absent scenes, for the best models that either predict person rejection RTs (x-axis; best model contains target + nontarget features) or car rejection RTs (y-axis; best model contains nontarget and coarse scene features). A positive feature weight for labels such as ‘animal’, ‘flowerpot’, and ‘door’ indicates that the presence of those features slows down person rejection in target-absent scenes. Nontarget labels such as ‘traffic cone’, box-like structures, and ‘fence’ provide greater evidence for cars than for people and hence slow down car rejection. Nontargets such as ‘trees’ and ‘roof’ either have evidence for both cars and people, or contribute to general clutter in the scene, and hence slow down both car and person rejection. Error bars indicate the standard deviation of the regression weight across 20 cross-validated regression model instances. These results remain qualitatively unchanged for choices of car rejection models containing either nontarget + coarse scene information or nontarget + target feature information.


SECTION S2: ANALYSIS OF LOW-LEVEL FACTORS

Targets that are large or that occur close to the center of the scene are detected faster by humans (Wolfe, Alvarez, Rosenholtz, Kuzmova, & Sherman, 2011). While it is non-trivial to find target size or eccentricity without processing target features, we nonetheless investigated whether such “low-level” factors can predict the observed rapid detection data. We calculated a number of such factors, as detailed below.

1) To estimate the area of the target, we recorded the area of the largest high-confidence detection in the scene, as yielded by the HOG detector.

2) To estimate the number of targets in the scene, we manually counted the number of cars and people in target-present scenes and assigned one of the following levels (0, 1, 2, 3, 4, 5, greater than 5 and less than 10, greater than 10).

3) To estimate clutter in the scene, we evaluated a variety of computational metrics, including the number of corners in the scene (Harris & Stephens, 1988; Shi & Tomasi, 1994) and the scale-invariant feature transform (SIFT; Lowe, 2004), which discovers key points that afford some degree of invariance when the scene undergoes scaling, translation, and rotation transformations. SIFT key points have been used extensively to represent local image properties and for object representation and retrieval (Mikolajczyk & Schmid, 2005). We found that they also measure scene clutter well and used the number of SIFT key points as a rough estimate of the number of objects in a scene.

4) To estimate target eccentricity, we measured the radial distance of the nearest high-confidence target detection from the center of the scene.

An illustrative example scene containing both cars and people, overlaid with the information used to extract these independent factors, is shown in Fig. S4.

Fig. S4. Visualization of task-independent factors derived from a scene containing both cars and people. Locations of the most confidently detected person and car closest to the scene center are marked by red and green boxes, respectively. The radial distance is marked using dashed lines. SIFT (Lowe, 2004) interest points are shown using yellow circles; a randomly chosen 10% of detected points are shown here.
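A minimal sketch of how such low-level factors could be computed for one scene is given below, using OpenCV's SIFT implementation (OpenCV >= 4.4) as the clutter measure; the detection-box format and the pixels-to-degrees conversion factor are assumptions of the sketch, not values from the paper.

```python
import cv2
import numpy as np

def low_level_factors(image_bgr, boxes, deg_per_px=0.05):
    """Task-independent factors for one scene (cf. Fig. S4).

    `boxes` is assumed to be a list of high-confidence target detections,
    each given as (x1, y1, x2, y2, score) in pixel coordinates.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # Clutter estimate: number of SIFT key points (Lowe, 2004).
    n_sift = len(cv2.SIFT_create().detect(gray, None))

    # Area of the largest high-confidence detection, in squared degrees.
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2, _ in boxes]
    largest_area = max(areas, default=0) * deg_per_px ** 2

    # Eccentricity of the nearest detection from the scene center, in degrees.
    cy, cx = gray.shape[0] / 2, gray.shape[1] / 2
    dists = [np.hypot((x1 + x2) / 2 - cx, (y1 + y2) / 2 - cy)
             for x1, y1, x2, y2, _ in boxes]
    nearest_ecc = min(dists, default=np.nan) * deg_per_px

    return {'n_sift': n_sift, 'largest_area': largest_area, 'nearest_ecc': nearest_ecc}
```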


The correlation of each factor with the observed detection time is shown in Fig. S5. At the level of individual factors, we observe that larger target sizes and the presence of low-eccentricity targets reliably speed up detection RTs for both cars and people. To assess how the combination of all factors might explain detection performance, we trained a model including all factors. These models explained person detection times (r = 0.28 ± 0.02) and car detection times (r = 0.31 ± 0.04) to some degree, but their performance was still inferior to the best models based on target and coarse scene features.

Fig. S5. Scatter plots for the correlation of independent factors with detection response times. a Observed person detection times for each scene plotted against the size of the largest high-confidence box, in squared degrees of visual angle. Significance values are depicted by asterisks, where * is p < 0.05, ** is p < 0.005, etc. All results are computed on common scenes that contained both cars and people (n = 225). b Person detection times plotted against the distance of the nearest detected person from the scene center, in degrees of visual angle. c Person detection times plotted against the number of people in the scene. d Person detection times plotted against a measure of scene clutter using a computational model based on SIFT points (see text). e–h Analogous plots for target-present response times in the car task.


SECTION S3: PARTICIPANT FEEDBACK

At the end of each task, subjects filled out a questionnaire in which they were asked to report visual attributes of objects and scenes that made the task hard or easy. To what degree are subjects aware of the different factors that affect object detection? To investigate this issue, we categorized all participant responses and report below the number of subjects who mentioned each attribute in each task (P = person task, C = car task). Each participant took part in either car detection or person detection only, and thus answered one set of four questions regarding visual aspects of targets, nontargets, and coarse scene layout that made target detection or rejection easier or harder. Keywords from those responses were then standardized and collated to obtain the frequency distribution tabulated here. We first summarize the collated data in the following table and then discuss it in greater detail. Concepts highlighted in orange (or brown) occur in subjective responses with respect to both rejection and detection of people (or cars) and indicate shared features between rejection and detection. Some of these, such as clutter, slow down both detection and rejection, and some, like “streets”, were reported to speed up person detection and slow down person rejection.

PERSON detection task (P), n = 30 participants | Number of subjects | CAR detection task (C), n = 31 participants | Number of subjects

Easy DETECTION
person at center | 19 | large car | 15
many people | 19 | many cars | 9
streets | 4 | bright/colorful car | 8
urban scenes | 3 | road with people | 7
high contrast | 3 | car at center | 4
cycles | 2 | simple scene | 3
boat | 2 | relevant scene | 2
large person | 1 | relevant nontargets | 1
well-lit scene | 1 | building with people | 1
restaurant | 1 | |
simple scene | 1 | |
relevant nontargets | 1 | |

Hard DETECTION
very small | 10 | very small/long shot | 18
person-like shape | 7 | low contrast | 9
shadows | 4 | eccentric | 3
clutter | 4 | road with people | 3
eccentric people | 3 | many people | 2
salient nontarget | 3 | silver/gray colour | 1
market place | 2 | facing away | 1
animals | 2 | salient nontargets | 1
street corners | 2 | winding road | 1
cycles | 2 | narrow road or highway | 1
forest | 2 | cycles | 1
low contrast | 2 | clutter | 1
very big | 1 | |
urban scenes | 1 | |
highway/road | 1 | |
mountain | 1 | |

Easy REJECTION
big building/wall | 15 | blank wall | 13
sparse scene | 12 | scenery | 13
scenery | 12 | landscape | 11
landscape | 7 | big building | 8
mountain | 3 | water body | 6
large nontargets | 3 | irrelevant scene | 5
horizon | 2 | garden/grass | 5
highway/road | 2 | road | 4
signboard | 1 | mountain | 3
restaurant | 1 | irrelevant nontargets | 3
 | | window/verandah | 2
 | | tables | 2
 | | simple scene | 2
 | | irrelevant scene | 1
 | | cycles | 1

Hard REJECTION
statues | 6 | relevant scenes | 6
shadows | 5 | small nontargets | 5
animals | 4 | shadows | 5
clutter | 4 | road with people | 5
shop | 4 | eccentric cars | 3
relevant nontargets | 4 | partial match to shape | 3
person-like shape | 3 | street | 3
urban scenes | 3 | traffic crossing | 2
bicycles | 2 | relevant nontargets | 2
streets | 2 | urban scene | 1
clutter | 2 | animals | 1
lamppost | 2 | cycles | 1
low contrast | 2 | clutter | 1
car | 2 | |
forest | 2 | |
scenery | 1 | |
salient nontarget | 1 | |


Relevant to target detection only: Subjects reported finding targets easy to detect when they were at or near the center of the image (P = 19, C = 4), when the target was large (P = 1, C = 15), when the scene was frequently associated with the target (P = 8, C = 10), when the scene was well lit (P = 1, C = 8), when the target was at high contrast (P = 3, C = 8), and when the scene contained relevant nontargets (P = 3, C = 1). In contrast, they reported finding targets hard to detect when targets were small (P = 10, C = 18) and when nontargets were salient (P = 5, C = 1).

Relevant to target-absent responses only: Subjects reported target-absent judgments to be easier when the scene contained large and expansive structures like buildings, landscapes, and water bodies (P = 15, C = 21), or contained frequently associated contexts (P = 5, C = 6). In contrast, target rejection was reported as harder when scenes contained partial matches to target appearance (P = 5, C = 3) or relevant nontargets (P = 10, C = 8).

Relevant to target-present and -absent responses: Subjects reported finding cluttered scenes hard for both target detection (P = 4, C = 1) and rejection (P = 2, C = 1). Conversely, sparse scenes or those with simple structures were reported as easy for both target detection (P = 1, C = 3) and rejection (P = 12, C = 2).

Relevant to the person task alone: The presence of animals and birds was reported as making both person-present (P = 2) and person-absent judgments harder (P = 4). Person-like shapes were reported as interfering with both person detection (P = 7) and person rejection (P = 3). Specific nontargets such as cycles (P = 2) and boats (P = 2) were mentioned as speeding up person detection.

Relevant to the car task alone: Low-contrast scenes were reported to be hard for both detection and rejection of cars alone (C = 2 for detection and C = 9 for rejection). Coarse scene layouts such as highways and parking lots, containing nontargets that typically co-occur with cars, were reported as making car rejection harder (C = 11). Subjects reported finding car detection easy when cars were conspicuous or colourful (C = 8) and when scenes contained people and roads (C = 7). Some subjects reported people and cars as interfering with detection too (C = 3).

Overall, we found that participants reported an advantage in target detection when targets were centrally located, large, and salient, and when discriminative target features were easy to detect. Target attributes were mentioned more frequently as being relevant to target detection than to rejection, although partial matches to target appearance were reported to slow down rejection. Coarse scene layouts typically associated with the targets were reported as weakly facilitating detection, and clutter in the scene was reported to interfere with both target detection and rejection. These observations are in agreement with our computational models that explain trial-by-trial variability in target detection responses, which we found to be informed by target and coarse scene information alone. Strongly associated nontargets were reported to play a greater role in slowing down target rejection than in speeding up target detection, and a less significant role was ascribed to target features during target rejection.


This again is reflected in our computational results, which show that a significant proportion of target rejection response variability is uniquely explained by nontarget features alone.


SECTION S4: ANALYSIS OF DEEP NEURAL NETWORKS

To further investigate differences between our best models for target detection and rejection and popular deep convolutional neural networks, we analyzed the unique contributions of our target, nontarget, and coarse scene features to the predictions of such a deep convolutional architecture. We took a version of this architecture that was fine-tuned for 20-way object categorization, including cars and people (Lapuschkin et al., 2016), and obtained detection confidence probabilities for cars and for people on every scene in the two data sets of 1,300 scenes used in the car detection task and the person detection task, respectively.

In this analysis we simply replaced car detection or rejection response times by the CNN's output car probabilities, and similarly replaced person detection and rejection response times by the CNN decision layer's output person probabilities. Several interesting observations emerge from this analysis. First, coarse scene information played a very significant role in CNN predictions for targets in target-present scenes (r = 0.51 ± 0.01 for persons and r = 0.6 ± 0.01 for cars). Second, and more surprisingly, the contribution of target features alone in target-present scenes is smaller than that of coarse scene information (r = 0.42 ± 0.01 for persons and r = 0.37 ± 0.01 for cars). This finding perhaps originates from the use of intact scenes for training this convolutional network. To ensure that these results were not an accident of our definition of target and coarse information features, we also analyzed model correlations over target-absent scenes. Here we found that target information had very little role in explaining CNN predictions (r = 0.08 ± 0.03 for persons and r = -0.02 ± 0.3 for cars), and models trained with coarse scene information alone gave the best predictions of CNN outputs on target-absent scenes (r = 0.31 ± 0.01 for persons and r = 0.66 ± 0.01 for cars). These findings indicate that rapid classification of targets in real-world scenes seems to operate on different information-processing principles in human vision than in this popular computational architecture.
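The substitution described above amounts to fitting the same channel-wise regressions with the CNN's per-scene target probability as the dependent variable. A minimal sketch, assuming the probabilities have already been extracted from the 20-way network and that a cross-validated ridge regression stands in for the exact estimator, is:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

def feature_channel_r(X_channel, cnn_prob, cv=5):
    """Correlation between one feature channel's predictions and the CNN's
    per-scene target probability (which replaces the behavioural RT here)."""
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))
    pred = cross_val_predict(ridge, X_channel, cnn_prob, cv=cv)
    return np.corrcoef(pred, cnn_prob)[0, 1]

# e.g. compare feature_channel_r(X_coarse, car_prob) against
# feature_channel_r(X_target, car_prob) on car-present scenes.
```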

Deep convolutional networks are rapidly evolving, and we additionally evaluated a more recent architecture (RCNN: region proposal CNN) that identifies and classifies object-like regions in a scene (Ren et al., 2016). To give this model the maximum benefit, we took scores from the most confident detections over the two sets of 1,300 scenes used for car detection and 1,300 scenes used for person detection, and obtained 98% classification accuracy for cars and 96% classification accuracy for people. Though this performance seems impressive, it is not clear whether this deep network still incorporates coarse scene information despite being trained to learn and classify object-like regions. On visual inspection of the detection results, we observed that detection boxes are sometimes quite large and exceed the extent of the object, particularly in the case of weak detections. We conjectured that weak object region proposals at training time can also bring contextual information into this deep convolutional network. There have been attempts to improve the quality of object region proposals for RCNN and improve its performance (Shrivastava & Gupta, 2016). To investigate this further, we trained models with subsets of target, nontarget, and coarse scene information in a similar manner as we did for car and person detection responses. In this case we simply replaced car detection response times by the score for the most confidently detected car instance in each car-present scene. Likewise, we replaced person detection response times with the score for the most confidently detected person instance in each person-present scene. We found that models trained with coarse scene features derived from blurred scenes can predict RCNN scores for cars (r = 0.16 ± 0.04), and that model performance improved only slightly when HOG target summary features were included as well (r = 0.19 ± 0.03).


Similarly, we found that coarse scene information can also predict RCNN scores for persons (r = 0.08 ± 0.01). Thus, our scheme of parcelling real-world scene information into target, nontarget, and coarse scene information has the potential to enable novel analyses and greater interpretability of deep convolutional networks as well.
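As an illustration of how per-scene region-proposal scores can be obtained, the sketch below uses torchvision's Faster R-CNN (recent torchvision, >= 0.13) as a modern stand-in for the network analyzed here; it is not the exact model used in this study. The resulting scores can then replace detection RTs in the same regression analysis.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO class indices used by torchvision's detection models: person = 1, car = 3.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def max_detection_score(image_path, coco_class):
    """Score of the most confident detection of `coco_class` in one scene."""
    img = to_tensor(Image.open(image_path).convert("RGB"))
    out = model([img])[0]            # dict with 'boxes', 'labels', 'scores'
    scores = out["scores"][out["labels"] == coco_class]
    return scores.max().item() if len(scores) else 0.0

# car_scores = [max_detection_score(p, coco_class=3) for p in car_scene_paths]
```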


SECTION S5: SUPPLEMENTARY REFERENCES

Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 1, 886–893.

Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

Harris, C., & Stephens, M. (1988). A combined corner and edge detector. Proceedings of the Alvey Vision Conference 1988, 147–151. Retrieved from http://www.bmva.org/bmvc/1988/avc-88-023.html

Lapuschkin, S., Binder, A., Montavon, G., Müller, K.-R., & Samek, W. (2016). Analyzing classifiers: Fisher vectors and deep neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Retrieved from http://iphome.hhi.de/samek/pdf/LapCVPR16.pdf

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110. Retrieved from http://portal.acm.org/citation.cfm?id=996342

Mikolajczyk, K., & Schmid, C. (2005). A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1615–1630. https://doi.org/10.1109/TPAMI.2005.188

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.


Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Retrieved from https://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks

Shi, J., & Tomasi, C. (1994). Good features to track. Proceedings of the 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 1994).

Shrivastava, A., & Gupta, A. (2016). Contextual priming and feedback for Faster R-CNN. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9905 LNCS, 330–348.


Wolfe, J. M., Alvarez, G. A., Rosenholtz, R., Kuzmova, Y. I., & Sherman, A. M. (2011). Visual search for arbitrary objects in real scenes. Attention, Perception & Psychophysics, 73(6), 1650–1671. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3153571&tool=pmcentrez&rendertype=abstract


Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2014). Object detectors emerge in deep scene CNNs. arXiv. Retrieved from http://arxiv.org/abs/1412.6856
