http://www.iaeme.com/IJARET/index.asp 104 [email protected]
International Journal of Advanced Research in Engineering and Technology
(IJARET) Volume 6, Issue 12, Dec 2015, pp. 104-133, Article ID: IJARET_06_12_010
Available online at
http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=6&IType=12
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
© IAEME Publication
REVIEW ON GENERIC OBJECT
RECOGNITION TECHNIQUES:
CHALLENGES AND OPPORTUNITIES
Prof. Deepika Shukla
Comp. Science and Engineering Department,
Institute of Technology, Nirma University, Ahmedabad, India
Apurva Desai
Department of Computer Science and Information Technology,
VNSGU, Surat, India
ABSTRACT
Recognizing objects automatically in an image is a fundamental step for
many real-world computer vision applications. It is the task of identifying an
instance of an object in an image or video sequence with little or no human
intervention or assistance. Despite the very high complexity of the task,
human beings perform it with very little effort, even in a state of minimal
attention. Humans need little effort to recognize a huge number of object
categories in images, even though the ‘object’ in the image may differ in
size/scale, viewpoint, position or orientation. We are even able to recognize
objects in an image when they are only partially visible or appear against a
cluttered background. Moreover, recognition can target a specific instance of
an object or an object category/class. When the task is performed for classes
of objects, it is known as generic object recognition, object-class detection
or category-level object recognition. Over the years many techniques have
evolved for recognizing object classes from images, but no automated object
recognition system to date has gained this capability fully on par with human
beings. This very fact makes recognition of objects from an image the most
basic and fundamental challenge in the field of computer vision research. The
purpose of this study is to give an overview and categorization of the
approaches used in the literature for generic object recognition and of the
various technical advancements achieved in the field. The survey mostly
focuses on the leading work since the year 2000.
We have discussed the challenges that the field is currently facing. We
have also attempted to suggest future research directions in the area of
generic object recognition. Finally, we conclude the study with the hope that
in the near future more sophisticated object-class recognition systems will be
developed in an efficient and cost-effective manner.
Key words: Object Recognition, Generic Object Recognition, Object Class
Recognition, Scene Understanding, Scene Categorization, Image Analysis,
Computer Vision, Machine Vision, Scene Analysis.
Cite this Article: Prof. Deepika Shukla and Apurva Desai, Review on
Generic Object Recognition Techniques: Challenges and Opportunities.
International Journal of Advanced Research in Engineering and Technology,
6(12), 2015, pp. 104-133.
http://www.iaeme.com/IJARET/issues.asp?JType=IJARET&VType=6&IType=12
1. INTRODUCTION
Automated recognition of objects in images is a critical and fundamental step for
many real-world computer vision applications. It is the task of finding a given object
in an image or video sequence with little or no human intervention or assistance. As
we know, very little effort is required on our part to detect and recognize a huge
number of classes of objects in images, even though the image of the object may
differ in size/scale, viewpoint, position or orientation. Human beings are able to
recognize objects in an image even when they are only partially visible or appear
against a cluttered background. The ability to generalize from examples and to
categorize objects, events, scenes and places is one of the core capabilities of the
human visual system. For a human being this is a mundane activity, but imbuing
machines with these capabilities has proved to be a significantly challenging task for
computer vision systems in general.
The reason behind this may be rooted in the fact that automatic object recognition
requires an understanding of human visual perception and so becomes a
multidisciplinary research area involving knowledge and expertise from fields like
optics, psychology, pattern recognition, artificial intelligence, machine learning and,
most importantly, cognitive science, which in itself needs sophisticated concepts and
tools from mathematics as well as computer science [1].
Object recognition is a dominant field of research in computer vision as well as
image analysis applications, and even the simplest machine vision task cannot be
solved without the help of recognition. This is evidenced by the vast volume of
research conducted in the area over the past three decades: a search for “object
recognition from images” on ieeexplore.org returns more than 20,000 results. From
the substantial volume of current literature on the topic, we can also say that the field
of object recognition is closely tied to, and part and parcel of, computer vision
research.
This paper reviews most of the leading state-of-the-art research performed in the
area of generic object recognition. More specifically, it is focused on gaining insight
into the following research questions pertaining to generic object recognition.
What generic object recognition techniques and approaches are drawn from the
literature?
What different techniques are used for object representation?
Which feature detection and extraction methods are used by most of the prominent
researchers on the topic?
Which classification/learning techniques have been used in the classification stage of
the object recognition pipeline?
The rest of this paper is organized as follows. Section 2 introduces and explains
the problem of generic object recognition, which can be considered a specific subset
of the object recognition problem. Section 3 concentrates on the challenges that the
field of object recognition faces in general and generic object recognition in
particular. Section 4 discusses the vast literature on the topic. Section 5 lays out a
roadmap to future research areas and directions. Section 6 concludes the study.
2. GENERIC OBJECT RECOGNITION PROBLEM
The problem of object recognition can be viewed as a classification or labelling
problem where models/representations of known objects are available to the system
and, when a novel image is given, the system has to predict the class of the object(s)
present in the image. Formally, it can be stated as: given an image containing one or
more objects of interest (and background) and a set of labels corresponding to a set of
models known to the system, the system should assign correct labels to regions, or a
set of regions, in the image. In other words, object recognition systems should assign
a high-level definition to an object based on the image data by which it is represented.
Oftentimes, the task of object recognition is considered as broadly comprising three
sub-tasks:
Object detection: detecting whether an instance of the object category is present in
the image or not.
Localization: giving the location of the object category. Drawing a bounding box
around the object instance is the most prominent way in the literature to show the
result of localization.
Visual category recognition: recognizing and labelling the class/category of the
object present in the image.
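Localization quality, for instance, is typically judged by how well a predicted bounding box overlaps the ground-truth box. The following is a minimal, illustrative intersection-over-union (IoU) scorer, not part of any surveyed method; the (x1, y1, x2, y2) box format is an assumption for the sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (empty if boxes do not overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A detection is then usually counted as correct when the IoU with a ground-truth box exceeds a fixed threshold (0.5 is a common choice).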
Moreover, the image presented to the object recognition framework may contain a
single instance of some object class, multiple instances of a single class, or multiple
instances of multiple classes. Therefore, object recognition approaches at the top-most
level can broadly be categorized as following a top-down, bottom-up or hybrid
approach, and within that they can aim at recognizing specific or generic objects. So,
basically, image-based object recognition can be stated as: given a database of objects
and an image, determine which, if any, of the objects are present in the image. Thus
the problem of object-class recognition can be considered an instance of supervised
classification.
Another dimension along which the task of object recognition can be categorized
is the following. First, a specific object to be recognized is known to the system and
the system is trained for that specific object category only; examples are face
recognition and pedestrian recognition. Second, a generic object recognition system.
Generic object recognition means that the computer recognizes objects in images by
their general name [2] or common name. Figure 1 shows an instance of generic object
recognition. Generic object recognition has also been referred to as object-class
detection or category-level object recognition in the literature [14]; it aims at
recognizing the class to which the object present in the image belongs. The images
can contain a single instance of a class, multiple instances of the same class or
multiple instances of multiple classes. When categorization of multiple objects of
multiple classes in an image is performed, it is known as scene categorization.
Figure 1 Generic Object Recognition
2.1. Architecture of the object recognition system
Current vision systems can be said to consist of the activities shown in Figure 2.
Figure 2 Activities involved in a typical vision system
Any recognition system would involve these activities, or some subset of them, in
its life cycle. In general, after the image acquisition stage, the image is pre-processed
for noise removal and some kind of enhancement. The pre-processing stage is
followed by the feature extraction and description/representation stage, whose output
is then passed on for recognition. In the representation stage, objects can be
represented in 2-D or 3-D. Figure 3 shows the general architecture of an object
recognition system.
The object recognition task is affected by several factors and can differ along
various aspects, as shown in Figure 4, which categorizes the aspects along which
work is going on in the field of object recognition. Approaches may differ on the
basis of the form and representation of objects, matching schemes, the image
formation model, the type of features, the type of image and the type of data suited
for categorization. Studying these aspects, we found that the approaches mainly differ
in the object representation method, based on the type of features, or in the
classification approach adopted in the recognition phase.
As these factors change, it can easily be observed that the approach changes
substantially, but basically the approaches broadly follow three paradigms for
formulating and attempting a solution to the problem of object recognition from an
image: the bottom-up, top-down and hybrid paradigms [103].
Figure 3 Generic Architecture of Object Recognition System
Bottom-up: This can be considered as image analysis from low-level data and is
based on image segmentation techniques. It considers the raw image data in the form
in which it is acquired. Boundaries of homogeneous regions are extracted by
performing non-purposive segmentation, without prior knowledge about the
properties of individual object classes. No attempt is made to make any prior
assumptions about what these objects are. A fixed set of attributes is used to
characterize these regions, and objects are linked together to characterize the scene
itself. However, without some additional information, purely bottom-up approaches
had, until 2009, been unable to yield figure-ground segmentations of sufficient
quality for object categorization [Leibe & Schiele]; since then, many approaches have
been developed [85, 86, 88, 89] that use bottom-up segmentation methods, as
discussed in [82] and [85], and have achieved remarkable results, which will be
discussed in detail in the literature review section of this paper.
Top-down: This is image analysis from semantic-level data. In contrast to the
previous approach, this methodology proceeds from the assumption that the image
does contain a particular object or, if the problem is one of scene categorization, that
it is a particular type of scene. The system attempts to verify the existence of a
hypothesized object. Purposive segmentation may be performed, or specialized ways
of representing the object are used.
Hybrid: A combination of the two earlier paradigms is used in this kind of
approach [61], [79].
3. KEY CHALLENGES
3.1. Challenges overview
As stated earlier, the problem of Object recognition in general and Generic Object
Recognition in particular faces various challenges.
(I) The appearance of an object in the image can have a large range of variation due
to:
1. Viewpoint changes
2. Scale, orientation and shape changes (e.g., non-rigid objects)
3. Photometric effects (scene illumination etc.)
4. Scene/background clutter (so objects may be occluded)
(II) Different views of the same object can give rise to widely different images.
(III) A large number of object categories exist in the real world, and these categories
may have very little inter-class variation.
Figure 4 Factors affecting the task of object recognition
3.2. Description
Object recognition can be considered as yet another data processing task, so the
data is given the highest priority, and acquisition should therefore be considered the
most important step. In recent years, with the advent of high-quality cameras and
other image capturing devices, we can collect a huge amount of data (images) in
various forms, such as intensity images and range images, and from various sources
such as the web, but the major problem that the computer vision research community
faces today is the scarcity of accurately and precisely labelled image examples. As
stated earlier, the object recognition problem can essentially be considered a
supervised classification task, and for that to work successfully labelled image
examples are needed. The problem is aggravated by the fact that the annotation task is
labour intensive; the non-availability of human experts who can perform image
annotation efficiently and accurately makes it more challenging still.
‘Feature extraction’ is the next crucial step in the pipeline of generic object
recognition. Assuming that the data is available, feature extraction becomes the most
important stage of the entire object recognition framework. If suitable features of the
right dimensions are not extracted, this phase can become the bottleneck of the
recognition pipeline. Although many sophisticated approaches have recently been
developed and exist in the literature, they are not sufficient to describe every object,
so feature extraction becomes too object-specific and varies as the viewpoint, size or
illumination conditions of image capture vary. Thus, representing images by effective
features is crucial to the performance of various image analysis tasks. Features can be
low-level (colour, texture, intensity), middle-level (image patches) or high-level
(objects, textually annotated objects). Figure 5 shows a possible classification of
different kinds of features.
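As a concrete illustration of a low-level colour feature, the sketch below builds a quantized RGB colour histogram; the bin count and the flat list-of-(r, g, b)-tuples pixel format are illustrative assumptions, not taken from any surveyed method:

```python
def colour_histogram(pixels, bins=4):
    """Quantize each RGB channel into `bins` ranges and count pixels per
    (r_bin, g_bin, b_bin) cell, yielding a bins**3-dimensional feature vector."""
    hist = [0] * (bins ** 3)
    step = 256 // bins  # width of each quantization range per channel
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    total = float(len(pixels))
    # Normalise so that histograms of differently sized images are comparable.
    return [h / total for h in hist]
```

Such histograms are cheap and invariant to translation and rotation, but, as noted above, low-level features alone are rarely discriminative enough and are usually combined into richer region-level descriptors.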
Choosing and deploying an appropriate classifier is the next important step of the
pipeline. The classifier can be linear or non-linear. Various classifiers, such as the
Bayesian classifier, SVMs, decision trees and neural networks, are utilized in the
literature for the purpose of classification, each possessing its own benefits and
drawbacks. One important inherent issue related to the classification stage is the
scalability of the classifier. The number of object categories in the real world is very
large, and many visual features are required to model each category, forcing the
system to hold a huge volume of training data to model the variety of category
classifiers. To keep scalability manageable, a linear classifier is commonly utilized,
but its classification performance is inferior to that of a non-linear one, whereas non-
linear classifiers are more computation intensive. To remedy this shortcoming of
linear classifiers, a rich image feature set (which is, after all, a key factor in the
success of an image recognition system) must be designed per object class, so that the
system can distinctly recognize objects in images exhibiting inter-class and intra-class
variation, as shown in Figure 6. Additionally, the classifier has to be updated
continuously: even if it has been trained once for a category/class of object, when
previously unseen instances of the object emerge or the appearance of the object
evolves, the earlier trained classifier will no longer give correct results. This kind of
flexibility and resilience to change is inherently expected from any object recognition
framework.
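To make the linear-classifier trade-off above concrete, here is a minimal perceptron sketch over fixed-length feature vectors. It is an illustrative toy, not a classifier used by any surveyed system; labels are assumed to be +1/-1 and the features linearly separable:

```python
def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Minimal linear (perceptron) classifier: cheap to train and evaluate,
    but only able to separate classes that are linearly separable in
    feature space -- the scalability/accuracy trade-off noted above."""
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):        # y is +1 or -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                   # misclassified: nudge the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Sign of the linear score decides the class."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

Evaluating such a model is a single dot product per class, which is why linear models remain attractive when the number of categories grows large.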
Figure 5 Classification of Image Features
Figure 6 Images of different instances of an object (dog) under varied imaging conditions.
Intra-class appearance variations refer to the appearance differences among
different objects of the same class [14]. Intra-class appearance variation may be due
either to differences in colour, shape and size of the object’s instances or to
differences in imaging conditions. For example, images of the same object taken at
different times of day, in different seasons or at different places, with different
devices and from different viewpoints, will be entirely different. In addition to intra-
class appearance variations, a generic object recognition system has to efficiently and
distinctly handle inter-class appearance variations as well, which in many cases may
be very small, as shown in Figure 7. For example, an object recognition system
should be capable of distinguishing between a donkey and a horse, or a horse and a
mule.
Figure 7 Images of Horses and Donkeys with very small inter-class appearance
variation: Lower row is images of horses (adapted from [14])
The performance of a generic object recognition framework is generally judged
on criteria such as robustness against noise; invariance to basic geometric
transformations, illumination and viewpoint changes; the ability to handle large
numbers and different types of objects; the ability to handle intra-class and inter-class
variations; the ability to recognize objects in the presence of clutter or a complicated
background; and the ability to recognize an object accurately and efficiently even
when it is partially occluded. These requirements are expected implicitly and must be
met by any framework for object recognition; as a result, these issues can be
considered the key challenges of the field of generic object recognition.
4. LITERATURE REVIEW
4.1. Overview
The object recognition pipeline, as stated in the earlier section, consists of key
tasks such as image acquisition, pre-processing, feature extraction, feature
representation/description and classification. However, the image acquisition and
pre-processing phases fall outside the scope of this study. Although most of the
related work surveyed and cited here focuses on one or another phase of this pipeline,
our main focus in this study is on feature extraction and description techniques, and
on obtaining answers to the research questions put forward at the beginning of this
manuscript. Various groups of researchers have attempted to survey and review work
in the field of computer vision, but these surveys either relate to some specific object
(e.g., surveys on face recognition are presented in [108, 109]), compare and survey
various descriptors [14, 45, 48], or, as in the separate survey presented in [114], cover
object recognition using deep neural networks; that is, one particular aspect of the
topic is explored and the related literature review is presented while the authors
discuss their core work. Comprehensive surveys on generic object recognition
[14, 15] have periodically been published in the past, but given the rapid pace of
achievements in the field, it seems natural to survey the most recent developments
and object recognition techniques available in the literature. In this study, we have
mostly tried to review the work done in the field since the year 2000, with more
emphasis on surveying the work done after 2011.
The rationale behind this is that most existing surveys and papers talk extensively
of the approaches before 2011. Given the pace of technical advancement in the field,
however, many approaches have emerged since 2011 that demand detailed reporting,
and covering them is the basic motive of this review. The study is therefore also
aimed at presenting the survey in a way that should help the reader gain insight into
this field of research. As noted in the introduction, the task of object recognition is
considered as broadly comprising three sub-tasks: object detection, object
localization and object classification. In this manuscript we have studied approaches
to generic object recognition, which is the highest-level task among these sub-tasks:
to categorize the class of an object in the image, object detection is inherently
performed, and many times the objects need to be localized as well. For this reason
we have not segregated the approaches on the basis of detection, localization or
categorization.
4.2. Features and Feature Descriptors
The foundations of the field can be traced back to the 1950s and 1960s, when early
work was done in very simplistic domains [1]. The world was modelled as being
composed of blocks defined by the coordinates of their vertices and edge information.
The “block image” represented areas of uniform brightness in the image, and the
edges of blocks were located in areas of intensity discontinuity. It was very soon
realised, however, that this is not an ideal way to represent the complicated
information present in an image. Since then, various strategies have been developed
for the task of object recognition, with an emphasis on the feature extraction stage
and on the usage of novel and efficient types of feature descriptors.
Object recognition approaches can be classified into various broad categories,
including model-based, shape-based and appearance-based approaches. Model-based
approaches try to represent objects as sets of three-dimensional primitives [1, 12, 13]
like generalized cylinders, cones, cubes, cuboids, spheres etc. Shape-based
approaches [13, 19, 20, 21, 52, 53] represent objects by shape primitives like
boundary fragments, contours, shapelets, etc. In contrast, appearance-based models
use only appearance, which is usually captured by different two-dimensional views of
the object of interest. It can thus easily be observed that, whatever the representation
method, object representation takes centre stage in the entire object recognition
pipeline, and in turn the problem of object-class recognition reduces to generating an
efficient representation of the object that can detect, localise and identify the class of
the object discriminatively and repeatably.
As stated earlier, extracting and describing the features of objects in images
efficiently decides the fate and success of a typical object recognition system. In a
generic object recognition or categorization system, the relevant features or
descriptors from a characteristic point, patch or region of an image are obtained by
different approaches. As shown in Figure 5, features at the topmost level can be
divided into two categories, global and local, wherein the former characterizes the
image as a whole whereas the latter represents some local information in the form of
a pixel, patch or region. Yet another direction along which many researchers have
tried to classify features is structural versus statistical. Although there are various
classifications for features, significant overlap exists among these classes; for
example, local features can be structural as well as statistical. These features are often
combined to form various descriptors; in particular, region-level descriptors are
formed by combining colour, texture and other such low-level features.
Review on Generic Object Recognition Techniques: Challenges and Opportunities
http://www.iaeme.com/IJARET/index.asp 113 [email protected]
As far as pixel-level features are concerned, they are regarded as low-level
information of the image, are computed directly from the grayscale values of
individual pixels, and are generally used to build more sophisticated patch-level or
region-level descriptors. We now briefly discuss some of the best-performing
descriptors proposed and utilized over the years. This is not meant to be an exhaustive
discussion of the existing approaches, but rather to provide a sample of some
relatively successful and widely used ones.
4.2.1. Appearance-Based Object Representation
The Scale-Invariant Feature Transform (SIFT) [3][4], introduced by Lowe, is
regarded as one of the most popular patch-level feature descriptors reported in the
literature. The features identified are shown to be invariant to basic geometric
transformations and partially invariant to illumination changes and occlusion. SIFT
features proved more successful because they do not depend on the exact grey-level
distribution within an image patch, but instead use the general configuration of the
image gradient [60]. This was considered one of the prominent approaches in the area
of object recognition, and the work is regarded as a milestone in research on object
recognition, computer vision and other image analysis problems. However, as the
descriptors are appearance-based, they may produce poor results, especially if the
object does not carry enough texture information. SIFT has been applied to the
problem of object recognition in many works; two such usages are described in [3]
and [4]. In various other works [2, 39, 42, 75, 110, 111, 112], some improvement has
been achieved by combining other features with SIFT or by using filters other than
the Gaussian [110]. The number of key-points obtained when SIFT is applied is
relatively large, and each descriptor is high-dimensional, resulting in high-
dimensional data overall. This drawback was recognized by the authors of [5], who
extended SIFT as PCA-SIFT, where Principal Component Analysis is applied to the
normalized gradient patch, resulting in a lower-dimensional descriptor. PCA-SIFT
yields a 36-dimensional descriptor which is fast to compute and match but less
distinctive [6], while the descriptor introduced by Mikolajczyk and Schmid [45],
namely GLOH (Gradient Location and Orientation Histogram), is another variant of
SIFT which proved to be more distinctive at the same dimension [6]. A colour-image-
based SIFT has also been demonstrated in [75], wherein colour gradients are used in
place of intensity gradients in the Gaussian framework.
As mentioned earlier, the high dimensionality of the descriptor is the major
limitation of SIFT. Another effective patch-level descriptor, SURF (Speeded-Up
Robust Features), was proposed in [6] by Bay et al. The authors made use of integral
images, which yields features that are not only faster to compute but also distinctive
and repeatable. They based their descriptor on the Hessian matrix but use a very basic
approximation of it. Moreover, only 64 dimensions are used, much less than SIFT’s
128-dimensional vector. Though one can argue that PCA-SIFT results in only a
36-dimensional vector, it loses distinctiveness at the same time, whereas SURF has
proved more distinctive and repeatable.
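The integral-image trick underlying SURF’s speed can be sketched as follows; this is an illustrative reimplementation of the general summed-area-table idea, not code from [6]:

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y-1][0..x-1].
    An extra row/column of zeros keeps the box-sum lookup branch-free."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, x1, y1, x2, y2):
    """Sum of img[y1..y2][x1..x2] with four look-ups, independent of box size."""
    return ii[y2 + 1][x2 + 1] - ii[y1][x2 + 1] - ii[y2 + 1][x1] + ii[y1][x1]
```

Because any rectangular sum costs only four look-ups, box-filter approximations of the Hessian can be evaluated at every scale without rescaling the image.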
Another level at which feature descriptors are generated in numerous papers is the
region level. Dalal and Triggs [32, 33, 34] used grids of locally normalised
Histograms of Oriented Gradients (HOG) as descriptors for object detection in static
images. The technique counts occurrences of gradient orientations in localized
portions of an image. The detector window is tiled with a grid of overlapping blocks
from which Histogram of Oriented Gradients feature vectors are extracted. The
detector thus presented is contrast-based, which makes it robust to small changes in
image contour locations and directions, and to significant changes in image
illumination and colour, while remaining highly discriminative of overall visual form.
The work of Dalal and Triggs [32, 33, 34] is aimed at the detection of humans in
particular, but also proved effective in detecting other object classes in images. The
HOG descriptor has proved very efficient for representing structured objects; for
example, it has outperformed all other descriptors in pedestrian detection from videos
and images. Inspired by HOG [32], Bosch et al. [36] proposed a novel descriptor
called PHOG (Pyramid of HOG). The idea is to represent local image shape and its
spatial layout, together with a spatial pyramid kernel, in a Bag of Features (BoF)
framework [25, 26]. Each image is divided into a sequence of increasingly finer
spatial grids by repeatedly doubling the number of divisions along each axis (as in a
quadtree), and the number of points in each grid cell is recorded. A HOG vector is
computed for each grid cell at each pyramid resolution level, and the final PHOG
descriptor for the image is the concatenation of all the HOG vectors. This
concatenated vector is then normalized to ensure that texture-rich images, or images
with more edges, are not weighted more strongly than others. Another descriptor built
on the idea of histograms of gradients is CoHOG (Co-occurrence Histograms of
Oriented Gradients), proposed in [37]. CoHOG can express shapes in more detail
than HOG, as CoHOG descriptors are histograms whose basic units are pairs of
gradient orientations; such a histogram is referred to as a co-occurrence matrix. Due
to this pairing, the vocabulary size increases, resulting in a more specific expression
of the shape of the object in the image. The usage of a higher-dimensional matrix
makes CoHOG powerful in terms of its discriminative power, but at the same time
highly computation intensive.
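The per-cell computation at the heart of HOG-style descriptors can be sketched as follows. This is an illustrative simplification: real HOG adds bilinear vote interpolation and block normalization, which are omitted here, and the grayscale cell format is an assumption:

```python
import math

def hog_cell_histogram(cell, n_bins=9):
    """Histogram of gradient orientations for one grayscale cell.
    Gradients are taken via central differences; each interior pixel votes
    into an unsigned orientation bin (0..180 degrees) with its magnitude."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(ang // (180.0 / n_bins)) % n_bins] += mag
    return hist
```

Concatenating and normalizing such cell histograms over a grid of blocks gives the full HOG vector; PHOG additionally repeats this at several grid resolutions and concatenates across levels.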
Bag of Features and visual codebook based approaches
The approach is inspired by the BoW (Bag of Words) approach, first proposed in 1997 by [38] for describing textual data for the purpose of various text-analysis tasks. BoW represents a text document or a sentence written in natural language as a set of words, without taking into consideration its grammar or the order in which the words occur in the original text. The frequency of occurrence of each word is calculated and then used for various language-processing tasks. The analogous term BoF (Bag of Features) is used for the image-based approach: similar to the BoW model, an image is represented as an orderless collection of local image features. Various researchers use similar terms, such as Bag of Keypoints (BoK) and Bag of Visual Words (BoVW), in their works. The method is based on vector quantization of affine-invariant
descriptors of image patches [39]. A bag of keypoints corresponds to a histogram of
the number of occurrences of particular image patterns in a given image. The method
uses clustering to obtain quite high-dimensional feature vectors for a classifier. Since a codebook is constructed in the BoF approach, it is often also referred to as a codebook-based approach. The method includes the following main steps.
Detection of image patches.
Computation of patch descriptors for these patches. These descriptors can be any invariant descriptor such as SIFT [3,4] or any of its variants, or any other lower-level descriptor such as Harris-affine [43] or MSER.
Construction of a visual codebook/vocabulary/dictionary by assigning patch descriptors to clusters (a vocabulary) with a vector quantization algorithm that groups similar features together. Several clustering techniques have been used to determine the clusters; most frequently k-means clustering is applied [39], whereas hierarchical k-means clustering is adopted by [49] and mean-shift by the authors in [35].
Review on Generic Object Recognition Techniques: Challenges and Opportunities
http://www.iaeme.com/IJARET/index.asp 115 [email protected]
Generating a histogram of number of occurrences of particular patches assigned to
each cluster. The size of the resulting histogram equals the size of the codebook and
hence the number of clusters obtained from the clustering technique [40].
Treating the bag of features as a feature vector and applying a classifier to classify the image. A distance measure is required when comparing two term vectors for similarity, but this measure operates in the term-vector space as opposed to the feature space.
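The main steps above can be sketched end to end. This is a minimal illustration, assuming random vectors as stand-ins for real SIFT patch descriptors and a plain k-means in place of a production clustering implementation; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_codebook(descriptors, k, n_iter=20):
    """Vector-quantize local descriptors into k visual words (plain k-means)."""
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(n_iter):
        # assign each descriptor to its nearest cluster center
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bof_histogram(descriptors, codebook):
    """Represent an image as an orderless histogram of visual-word counts."""
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()          # normalise so images are comparable

# Stand-ins for real 128-D SIFT descriptors, purely illustrative.
train_desc = rng.random((500, 128))
codebook = build_codebook(train_desc, k=32)
image_desc = rng.random((80, 128))
h = bof_histogram(image_desc, codebook)
```

The resulting histogram h is the fixed-length vector that a classifier such as an SVM would consume.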
There are two reasons why the bag-of-features (BoF) image representation proved popular for indexing and categorization applications. First, the representation benefits from powerful local descriptors such as the SIFT descriptor; second, these vector representations can be compared with standard distances and subsequently used by robust classification methods such as support vector machines [50]. Also, the codebook-model-based approaches, while ignoring any structural aspect of vision, provide state-of-the-art performance on current datasets [40]. The discriminative power of the visual codebook determines the quality of the codebook model, whereas the size of the codebook controls the complexity of the model. Codebook-based approaches are considered simple and efficient, and can also be made robust to clutter, occlusion, viewpoint change, and even non-rigid deformations [26, 25]. In spite of being among the most popular and successful approaches, BoF and visual-codebook generation also have certain limitations. Because BoF expresses the image as appearance-frequency histograms of visual words obtained by quantizing SIFT-like features, location information and the geometric relationships between key-points are lost. Moreover, since vector quantization is involved, some loss of information is inherent; and because the geometric relations between features are lost, localization of the object is not possible.
To overcome the limitation of this orderless representation of objects, several researchers have proposed approaches that augment the bag of features with global spatial relations in a way that significantly improves classification performance while remaining simple and computationally efficient enough for real-world applications [27]. Authors in [27] have demonstrated that the
bag of feature description of the image can be extended to spatial pyramids so that the
spatial location information of the features can be retained. To generate these spatial
pyramids, the input image is partitioned into increasingly fine sub-regions.
Histograms of local features are computed over these sub-regions. The histograms are
further concatenated to generate the final features. This representation is combined
with a kernel-based pyramid matching scheme proposed by [24] that efficiently
computes approximate global geometric correspondence between sets of features in
two images. While the spatial pyramid representation sacrifices the geometric
invariance properties of bags of features, it compensates for this loss with increased
discriminative power derived from the global spatial information.
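The spatial-pyramid extension can be illustrated as follows. This is a sketch under the assumption that keypoints already carry visual-word labels and coordinates normalised to [0, 1); the function name spatial_pyramid is ours, and the per-level weighting used in [27] is omitted.

```python
import numpy as np

def spatial_pyramid(points, words, k, levels=2):
    """Concatenate visual-word histograms over increasingly fine grids.

    points : (n, 2) keypoint coordinates normalised to [0, 1)
    words  : (n,) visual-word index of each keypoint, in [0, k)
    """
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level                       # grid is refined at each level
        cell_idx = (points * cells).astype(int).clip(0, cells - 1)
        for i in range(cells):
            for j in range(cells):
                in_cell = (cell_idx[:, 0] == i) & (cell_idx[:, 1] == j)
                feats.append(np.bincount(words[in_cell], minlength=k))
    return np.concatenate(feats).astype(float)

rng = np.random.default_rng(1)
pts = rng.random((200, 2))
wds = rng.integers(0, 16, 200)
f = spatial_pyramid(pts, wds, k=16)   # (1 + 4 + 16) * 16 = 336 dimensions
```

Level 0 reproduces the plain BoF histogram; the finer levels add the coarse spatial layout that plain BoF discards.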
Similarly, in [2], to overcome this problem inherent to the BoF approach, a graph is constructed by connecting SIFT key-points with lines. As a result the key-points maintain their relationships, and a structural representation with location information is achieved. Since a graph representation is not directly suitable for statistical learning, the graph is embedded into a vector space according to the graph edit distance. With this representation, the authors achieved improved recognition accuracy over the conventional method in their experiments on the PASCAL VOC and Caltech-101 datasets. So, the basic idea to
achieve the improvement in BoF approach is to somehow incorporate the spatial
location information of features in BoF features so that the method can not only be
used for recognition but can also be successfully applied for object localization. The
authors in [47] have achieved an improvement by adding binary signatures to the descriptors: first, a Hamming Embedding (HE) of the SIFT descriptors, analogous to the Hamming distance; and second, a weak geometric consistency (WGC) check integrated within the inverted file system, which penalizes descriptors that are not consistent in terms of angle and scale. In this way geometric information is incorporated in the index even for very large datasets. At the same time, both HE and WGC require additional information to be stored, so the memory requirement of the index increases.
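The Hamming Embedding idea, reduced to its core, can be sketched as below. This is only an illustration of binary signatures and Hamming distances, not the exact scheme of [47]: the random projection and median thresholds here stand in for the trained per-cell parameters, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def hamming_signature(desc, projection, thresholds):
    """Binarise a descriptor: project it, then threshold each output dimension."""
    return (projection @ desc > thresholds).astype(np.uint8)

def hamming_distance(a, b):
    """Number of bit positions where the two signatures differ."""
    return int(np.count_nonzero(a != b))

# Illustrative setup: 64-bit signatures for 128-D descriptors.
P = rng.standard_normal((64, 128))
# Thresholds would normally be medians learnt per quantization cell on training data.
train = rng.random((1000, 128))
thresholds = np.median(train @ P.T, axis=0)

sig = hamming_signature(rng.random(128), P, thresholds)
```

At query time, two descriptors quantized to the same visual word are kept as a match only if the Hamming distance between their signatures falls below a threshold, which refines the coarse visual-word match at the cost of storing the extra bits.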
The visual codebook approach has been used by several other researchers in slightly different ways. For example, Leibe et al. in [7,8,9] adopted a two-stage approach. In the first stage, a codebook of local appearances is learnt, which captures which local structures may appear on objects of the target category. Next, an Implicit Shape Model (ISM) is learnt, which specifies where on the object the codebook entries may occur. To create the codebook, the authors adopted the method presented by Agarwal and Roth in [17]: from a variety of images, 25 x 25 pixel patches are extracted with the Harris interest point detector, and these patches are grouped using agglomerative clustering to generate a compact codebook. These codebook entries are used to define the implicit shape model of the objects. The approach does not try to create a separate model for every possible shape an object can take, but rather defines the shapes of an object in terms of patches that are consistent in local appearance; as a consequence, fewer training examples are needed to learn an object's probable shapes. The codebook entries are then scanned a second time, and all entries whose similarity is above a certain chosen threshold are activated; the threshold chosen is the same as the one used during the clustering performed in the first step. In the recognition stage, a generalized Hough transform is performed to identify possible object centres.
GIST: Humans can recognize the gist of a novel image in a single glance, independent of its complexity [69], by considering the image in a “holistic” manner while overlooking most details of the constituent objects. Intuitively, GIST summarizes the gradient information (scales and orientations) for different parts of an image, which provides a rough description (the gist) of the scene. The input image is divided into non-overlapping regions; each region is further divided into sub-regions, and a gradient-orientation histogram is computed for each sub-region. The GIST descriptor for a region is formed by concatenating the gradient-orientation histograms of all its sub-regions. The approach is most prevalently used for scene understanding. GIST-based approaches cannot be considered an alternative to image analysis based on local features, but they can serve as additional support for recognition problems by helping to constrain local-feature-based image analysis. In [72,73] short binary codes are used to compress GIST descriptors, and it is demonstrated that the approach works on millions of images obtained from the internet without sacrificing recognition accuracy and effectiveness.
4.2.2. Shape-Based Approaches
Many approaches based on the intensity or colour gradients of image patches or regions have been discussed in the previous part of this paper. As noted, these descriptors are very powerful and have been shown to perform object recognition with remarkable effectiveness. Still, there may be cases where two object classes share the same colour and texture and differ only in shape, or where the appearance varies greatly across instances of the class. Such objects cannot be represented
with colour- and intensity-based features alone. For example, in the fruit class, a raw mango and a capsicum are both green but have entirely different shapes. The recognition community also understood very early that, across the exemplars that belong to a category, shape is a more invariant property than appearance. As a result, the majority of recognition systems from the mid-1960s to the late 1990s attempted to extract shape features, typically beginning with the extraction of edges, since edges at occluding boundaries and surface discontinuities capture shape information. So shape is yet another important cue that can be used to generate a discriminative representation of objects. To compute the shape of an object, different authors have taken different approaches. Shape cues are frequently captured and described at the region level for object-class recognition or detection using contour or boundary fragments [19], shapelets, edgelets [20], shock graphs, etc. Another area of research, as far as shape-based detection is concerned, is how to set up the correspondence between shapes extracted from training and test images, i.e., how to decide that two shapes match [52,53]. One limitation of shape-based object description is that it cannot capture intra-class variations in a very discriminative way; for example, a zebra cannot be differentiated from a horse by shape alone. Shape-based cues are therefore often combined with other appearance-based cues.
4.2.3 Part-Based approaches
Object as 3D volumetric parts
The earliest attempts at solving the object recognition problem used high-level 3D volumetric parts, such as generalized cylinders (Binford) and other deformable primitives, such as geons (Biederman [13]) and superquadrics (Pentland) [79]. The common characteristic among them is that they are all based on symmetry, a physical regularity of our world which is exploited by the human visual system. In practice, however, extracting such parts efficiently and inexpensively is too complex. Once extracted, though, they are semantically closer to a description of the image content, and such parts are limited in number compared to approaches where low-level and mid-level features describe the object. Although methods based on low-level and mid-level features score on simplicity, ease of extraction and attractive invariance properties, they have proved weak at expressing high-level semantic information about the image. These facts meant that object representation using 3D volumetric parts attracted a lot of attention in the 1970s and 1980s. A detailed coverage of the topic is outside the scope of this study, but the works of Binford and Nevatia [115] can be explored for further information on the concept.
Recognition based on parts
In part-based object recognition approaches, an object is modelled as a geometrically constrained set of parts, where each part has a distinctive appearance and spatial position. In such approaches, shape is represented by the mutual position of the parts [22]. Using such features it is determined whether an instance of the object of interest exists in the image and, if so, where. The various methods in the literature differ in how these parts are detected, how their positions are represented, and what the ideal number of parts to represent an image should be; generally these parameters are tuned to the requirements of the approach. In [22] objects are modelled as flexible constellations of parts. A probabilistic representation (which in this case the authors
have used Gaussian) is adopted for all aspects of the object: shape, appearance, occlusion and relative scale. To learn and model an object category, regions and their scales are first detected. Once the regions are identified, they are cropped from the image, rescaled to a small patch of typically 11×11 pixels, and the parameters of the above densities are estimated from these regions such that the model gives a maximum-likelihood description of the training data. To detect the features, a histogram of the intensities in a circular region of some radius is generated around each point in the image. The entropy of this histogram is calculated, and local maxima of the entropy over scale are taken as the scale of the region. The N regions with the highest saliency over the image provide the features for learning and recognition. To reduce the dimension of the feature set, PCA is used.
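The entropy-based region detection described above (in the spirit of the Kadir-Brady saliency detector) can be sketched as follows. This is a simplified single-scale version using square windows instead of circular regions; all names are ours.

```python
import numpy as np

def patch_entropy(patch, n_bins=16):
    """Shannon entropy of the grey-level histogram of a patch."""
    hist, _ = np.histogram(patch, bins=n_bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def salient_regions(image, radius=5, top_n=10):
    """Rank window centres by local histogram entropy and keep the top N."""
    h, w = image.shape
    scores = []
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            patch = image[y - radius:y + radius + 1, x - radius:x + radius + 1]
            scores.append((patch_entropy(patch), (y, x)))
    scores.sort(reverse=True)               # highest-entropy (most salient) first
    return [pos for _, pos in scores[:top_n]]

rng = np.random.default_rng(3)
img = rng.random((32, 32))
regions = salient_regions(img, radius=4, top_n=5)
```

In the full detector the entropy is also tracked across scales, and only local maxima over scale are retained as salient regions.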
Deformable part based approach
Deformable Part Models (DPMs) constitute the state of the art for sliding-window object detection [99]. DPMs are inspired by the pictorial structure representation introduced by Fischler and Elschlager in [91], where an object is modelled as a collection of parts arranged in a deformable configuration [92]. Small picture segments represent the visual properties of the object, whereas the deformable configuration is captured by spring-like connections between these segments. An energy function is formed from a match cost for each part and a deformation cost for each pair of connected parts, and this energy is minimized to find the best match of the model within an image. The effectiveness of the pictorial representation for image matching demonstrated in [91] stems from its simplicity. In addition, the representation is widely applicable because it does not depend on any particular scheme for modelling the appearance of the parts, so it can represent quite generic objects. On the other hand, the model suffers from certain critical limitations. Many parameters are involved in constructing the model, so solving the energy-minimization problem becomes very computation intensive. Moreover, only the single best match is found; if the image contains multiple instances of the same object, they are not all detected by the pictorial representation of [91]. These issues are aptly handled by Felzenszwalb in the pioneering work reported in [92].
The pictorial representation proposed by Fischler and Elschlager can be viewed as a general graph, whereas Felzenszwalb and Huttenlocher used a tree representation, realising that many real-world objects, especially humans and animals, can be modelled by a tree structure. With this improvement, the best match of the model to an image can be computed in polynomial time. The approach demands that the graph representing the object be acyclic, and that the function dij(li, lj), measuring the degree of deformation of the model when part vi is placed at location li and part vj at location lj, be a Mahalanobis distance between transformed locations. DPMs are an impressive way of representing objects. While deformable models can capture significant variations in appearance, a single deformable model is often not expressive enough to represent a rich object category [93]. It can also be noted that in practice simpler models often outperform deformable part-based approaches, because simpler models can be trained easily whereas sophisticated models like DPM are more difficult to train. The authors in [93] illustrate that a deformable part-based model represents an object by a low-resolution root filter and a set of higher-resolution part filters arranged in a flexible spatial configuration. The flexible spatial configuration helps to
model the visual appearance at multiple scales. The approach has achieved benchmark results in the PASCAL object detection challenges. It basically uses HOG (histograms of oriented gradients) [32] within a star-structured part-based model defined by a root filter similar to the filter used in [32], a set of part filters, and deformation models.
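The scoring of such a star-structured model can be sketched as below. This is a toy version: in [92] the maximum over part placements is computed for all root locations at once via a generalized distance transform, whereas here it is evaluated brute-force at a single root location, with our own names and a simple quadratic deformation cost.

```python
import numpy as np

def dpm_score(root_resp, part_resps, anchors, def_costs):
    """Score of a star-structured part model at one root location (simplified).

    root_resp  : scalar root-filter response at the root location
    part_resps : list of 2-D part-filter response maps
    anchors    : list of (y, x) anchor positions, one per part
    def_costs  : list of (a, b) quadratic deformation weights, one per part
    """
    score = root_resp
    for resp, (ay, ax), (a, b) in zip(part_resps, anchors, def_costs):
        ys, xs = np.mgrid[0:resp.shape[0], 0:resp.shape[1]]
        # each part may shift from its anchor, paying a quadratic penalty
        penalty = a * (ys - ay) ** 2 + b * (xs - ax) ** 2
        score += (resp - penalty).max()
    return score

rng = np.random.default_rng(4)
parts = [rng.random((8, 8)) for _ in range(3)]
s = dpm_score(0.5, parts, anchors=[(4, 4)] * 3, def_costs=[(0.1, 0.1)] * 3)
```

The spring-like connections of the pictorial structure appear here as the deformation penalty: a part placed exactly at its anchor pays nothing, and the cost grows quadratically with displacement.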
The model presented in [101] is effective for shallow structures consisting of at most two layers, but as the number of layers in the structure increases, it becomes difficult to scale the model without incorporating and tuning additional parameters. Yuille et al. in [106] have extended the model discussed in [101], proposing that an object class be described using several templates from different viewpoints. Each template is represented as a tree structure of three layers: the first layer represents the entire image, the second layer divides the image into 9 sub-images, and the third layer divides each sub-image of the second layer into four, making a third layer of 36 sub-images.
The approach used by Dalal and Triggs [32] to detect pedestrians fails in the presence of articulation, whereas [93, 94, 95, 96] allow an intermediate layer of parts that can be shifted with respect to each other, making the overall model deformable and in this way achieving generalization. But such approaches do not work when the task is to extract human pose from images. In [102] Bourdev and Malik introduced 'poselets', parts that are tightly clustered in both appearance and configuration, for detection and pose estimation in images containing human bodies. In [79], Pablo et al. unified the approaches presented by Dalal and Triggs [32], Felzenszwalb [95] and Bourdev and Malik [102] into a single recognition framework that tries to take the benefit of each. Region-based object descriptors are used to perform purposive semantic segmentation, and their outputs are subsequently combined to improve performance.
4.2.4. Recent Approaches and Advancement
We have discussed many approaches, along with their benefits and limitations, in the earlier sections. It can also be noted that all of those approaches make essential use of machine learning methods. Most such methods work well because of human-designed representations and input features: early conventional approaches involve hand-crafted features for object representation and search for these features in the image. To do this, the programmer needs deep knowledge of the data and must laboriously engineer each of the feature-detection algorithms [114]. There have been big improvements in image analysis over the last few years due to the adoption of deep neural networks for vision problems. Fig-8 schematically shows the difference between traditional vision systems and recent deep neural network based systems.
Neural Nets for Object Recognition: Neural networks have been used in object recognition systems for decades. Neural nets implement a classification approach; their attraction lies in their ability to partition the feature space using nonlinear class boundaries. Earlier, neural networks were used only as the classifier in the classification stage of the object recognition pipeline (Figure 8); only recently, with progress in vision research and the increase in computational power, have neural networks been utilized for automatic feature learning (from the raw image data) as well as for classification. LeCun [123] in 1989 demonstrated an algorithm for training neural networks in a supervised way and showed that applications like hand-written digit recognition perform remarkably well and benefit from it. Since then convolutional neural networks have been used by many research communities.
Convolutional neural networks differ from conventional approaches like BoF and DPM (Deformable Part Models) for two very important reasons. First, they are deep architectures whereas the conventional approaches are shallow architectures. Second, they do not need prior knowledge of the image data: deep learning makes it possible to learn features directly from data instead of handcrafting them explicitly. The approach has greatly helped vision tasks, particularly object recognition, by enabling effective capture of low-level as well as mid-level cues of the object to be recognized. As a result, deep neural networks have brought huge improvements in the performance of image analysis over the last few years.
What makes deep architectures achieve such good results? Conventional neural nets used 1 to 2 layers of neurons, whereas a deep neural network (DNN) is a class of neural nets that uses a deep architecture with several layers (2 to 6 or more) stacked on top of each other. As a result a DNN can learn more complex models without the need for hand-designed features. DNNs have shown good results on the ImageNet dataset [126]: on the test data the authors achieved top-1 and top-5 error rates of 37.5% and 17%. Their neural network consisted of 650,000 neurons, had 5 convolutional layers, and learnt 60 million parameters in ILSVRC 2010.
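The basic building block of such networks, a convolutional filter followed by a non-linearity, can be sketched as follows. This is a naive loop-based illustration only; real systems use optimised libraries, many filters per layer, and learn the kernel weights by back-propagation rather than fixing them.

```python
import numpy as np

def conv2d_relu(image, kernel):
    """'Valid' 2-D cross-correlation followed by ReLU: one convolutional unit."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            # inner product of the kernel with the image window at (y, x)
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return np.maximum(out, 0.0)          # ReLU non-linearity

rng = np.random.default_rng(5)
fmap = conv2d_relu(rng.random((28, 28)), rng.standard_normal((5, 5)))
```

Stacking many such filter banks, interleaved with pooling, is what lets the lower layers capture low-level cues (edges, blobs) and the deeper layers capture mid-level structure.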
Like every other approach, deep architectures also have certain limitations:
They need very sophisticated hardware and images of a fixed size, typically 224 x 224.
They contain a huge number of parameters to be trained, and so are computation intensive.
When trained using gradient descent, the gradient does not trickle down well to the lower layers, so sub-optimal sets of weights are obtained [114].
Various modifications to DNNs have been suggested in the literature to overcome these limitations. To overcome the fixed-input-size constraint of deep neural networks, several efficient pooling strategies have been proposed. In [113], the network is equipped with a spatial pyramid pooling strategy (SPP-net). SPP-net can generate a fixed-length representation irrespective of image scale and size. Spatial pyramid pooling is based on spatial pyramid matching [24], which in turn is an extension of the BoF approach [26]. Another improvement is the Recursive Neural Network (RNN) [130] used for scene classification, which predicts a tree structure for scene images.
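The fixed-length property of spatial pyramid pooling can be illustrated directly. This is a sketch of the pooling step only, assuming a single-channel feature map whose sides are at least as large as the finest grid; the levels and names are illustrative, not those of [113].

```python
import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """Spatial pyramid pooling: max-pool the map into fixed grids per level,
    giving a fixed-length vector regardless of the input's spatial size."""
    h, w = feature_map.shape
    pooled = []
    for n in levels:
        # bin boundaries adapt to the input size, so the output size does not
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                pooled.append(feature_map[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max())
    return np.array(pooled)

rng = np.random.default_rng(6)
# Two inputs of different sizes yield a vector of the same length (1+4+16 bins).
v1 = spp(rng.random((13, 17)))
v2 = spp(rng.random((30, 9)))
```

Because the bin boundaries scale with the input, the fully connected layers after the pooling always receive the same number of values, which is exactly what removes the fixed-input-size constraint.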
Figure 8(a). Block diagram representing typical traditional object recognition system
Various competitive challenges and Datasets
Here in this section we present some of the challenges that the computer vision community organizes annually to invite, evaluate and report the innovative approaches developed by research groups across the world. These challenges serve the purpose of setting a common platform for researchers to present their work and compete with each other in the areas of object detection, localization and categorization. These
challenges also provide datasets of sample images so that the approaches can be evaluated for all possible image content and variations in image-capturing conditions.
Figure 8(b). Block diagram of Deep Architectures (Image taken from:
ufldl.stanford.edu/eccv10-tutorial/eccv10-tutorial_part4.ppt)
Early approaches to object recognition used very small sets of images to evaluate their algorithms. But with the advent of the sophisticated world-wide web, large numbers of annotated images are readily available in public as well as private repositories. To harness the benefit of such repositories, datasets are created and made available to the research community. As mentioned earlier, these challenges are an effort to bring the research community together in a framework of competition so that the best approaches in computer vision can be evaluated and publicized. These challenges consist of two components: first, a publicly available dataset with ground-truth annotations and standardised evaluation software, and second, a competition and workshop [119]. To review these challenges, we first discuss the datasets made available by these competitions, along with certain other widely used datasets.
Datasets: No research is possible in any research area without appropriate datasets [30], and the same applies to object recognition and computer vision research. Appropriate datasets are needed at all stages of recognition research: for learning visual models of objects and scene categories, for detecting and localizing instances of these models in images, and for evaluating the performance of recognition algorithms. The work in [30] reviews existing image datasets from the point of view of expectations, challenges and limitations. Datasets should ideally offer a wide range of image variability and be sufficiently challenging that algorithms can be meaningfully evaluated. One of the major difficulties in creating such datasets is that the images must be annotated. Annotation has to be done by human experts and turns out to be a mammoth task, considering the huge number of objects existing in the real world which are to be recognized for various applications, and it is not easy to get human experts to accomplish this effectively, correctly and efficiently. A wonderful approach to automatic dataset collection using the web is presented in [66], which uses object recognition techniques in an incremental method: the images present on the web are used to learn the model in a robust way. Another solution for getting annotated
training examples is crowdsourcing, but the most common error an untrained annotator is susceptible to is failing to consider a relevant class as a possible label because they are unaware of its existence.
Now we discuss some of the most prevalent datasets.
Caltech-101 & Caltech-256: Caltech-101 is a collection of pictures of objects belonging to 101 categories, collected by Fei-Fei et al. [64] in 2003. About 40 to 800 images exist per category, with most categories having about 50 images. Most images have little or no clutter, and the objects tend to be centred in each image and in a stereotypical pose. In comparison to Caltech-101, Caltech-256 is a collection of 256 object categories with 30,608 images in total. Fig-9 compares Caltech-101 and Caltech-256.
Figure 9 (Courtesy: http://www.vision.caltech.edu/Image_Datasets/Caltech256/details.html)
TRECVID: TRECVID organizes a competition every year and, for evaluating performance, releases a dataset consisting of video shots. The goal of the conference series is to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. Annotation is not provided by the organizers.
LabelMe: LabelMe [74] is a publicly available annotated image database open for public contribution. The dataset is provided with an annotation tool, so that anyone can annotate any image. As images are annotated by experts as well as casual users, it cannot be relied upon for obtaining a test set, although a huge quantity of training images can be obtained.
COIL-20 & COIL-100: COIL-20 and COIL-100 are databases of grayscale images of 20 and 100 object categories respectively [120]. Different poses of the objects were generated by placing each object on a rotating turntable and capturing images at angular displacements of 5 degrees, generating 72 views per object. The collection also contains 720 unprocessed images of 10 object categories, and 1440 size-normalized images are provided as well.
Microsoft COCO: Common Objects in Context database is a large-scale database of
images that addresses three core research problems in scene understanding: detecting
non-iconic views of objects, contextual reasoning between objects and the precise 2D
localization of objects [122]. Contextual knowledge can be helpful to boost all the
components of the object recognition framework. The dataset is provided to support
object recognition based on the context in which objects lie in the scene. The dataset consists of 91 common object categories, 82 of which have more than 5,000 labeled instances; in total the dataset has 2,500,000 labeled instances in 328,000 images. It contains fewer object categories but a very high number of instances per category, which differentiates it from other popular large-scale datasets like the PASCAL VOC and ImageNet datasets discussed in the following sections.
The PASCAL Visual Object Classes Challenge
The PASCAL VOC challenge was first organized in 2005 and was then held annually up to 2012. The challenge basically consists of two components: a dataset consisting of 1000 images related to objects of
20 categories, obtained from the Flickr web-site, which was made publicly available, and a competition involving object classification, detection, segmentation, action classification and person layout. Everingham et al. have reviewed PASCAL VOC in [119]. The images were fully annotated for each of the objects. Note that no such rich dataset was released in 2005: only a dataset consisting of four categories (motorbikes, bicycles, cars, and people) was made available then, but every year the organizers kept enriching it, until finally in 2011 the full set of 1000 images was released. To assess different methods, bootstrapping of the ROC curve is used.
The evaluation technique is used in a number of different ways: to simply judge
the variability for a given method, to compare the relative strength of two methods, or
to look at rank ranges in order to get an overall sense of all methods in a competition
[119].
ImageNet Large Scale Visual Recognition Challenge (ILSVRC): ILSVRC was first organized in 2010 and has been held annually since then. ILSVRC is one of the most prestigious series of competitions and workshops in the computer vision community for evaluating the performance of contemporary approaches developed by various researchers. The challenge is nicely reviewed from various aspects in [118]. Similar to PASCAL VOC, ILSVRC also provides a huge collection of annotated images, under the name ImageNet, by Deng et al.
ImageNet Dataset: ImageNet is an image database organized according to
the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds
to thousands of images; currently there exist over five hundred images per node on
average [121]. Images of each concept are quality-controlled and human-annotated.
ImageNet is of far higher dimension than PASCAL VOC [121]: as per 2010 data, it
is organized into 12 subtrees with 5,247 synsets and 3.2 million images in total.
Since ImageNet's organization is inspired by the WordNet structure, which contains
around 80,000 noun synsets, ImageNet aims to cover the majority of these 80,000
synsets with an average of 500-1000 clean, full-resolution images each. To evaluate
the approaches, the bootstrapping strategy used by PASCAL VOC is also employed in
the ILSVRC challenge series. In Table 2, we present a comparison between the
PASCAL VOC and ILSVRC challenge datasets as of 2012, as referred from [101].
Table 2 Comparison of PASCAL VOC and ILSVRC as per [101]

Aspect for comparison | PASCAL VOC | ILSVRC
Diversity of object classes | Objects carry a single class label, e.g. "boat" for all types of boat, be it lifeboat or fireboat | Objects are refined into subcategories, e.g. a boat is not just a boat but a lifeboat or a gondola
Chance Performance of Localization (CPL) | 8.8% on the validation set for 20 categories | 20.8% for all 1000 categories
Average object scale per class | 0.241 | 0.358
Average number of instances per class | 1.69 | 1.59
Clutter per class (clutter computed as number of boxes) | 129.96 | 106.98
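Localization measures like the CPL in Table 2 rest on a box-overlap criterion: both challenges accept a predicted bounding box as correct when its intersection-over-union (IoU) with the ground-truth box is high enough, the standard PASCAL VOC threshold being 0.5 (ILSVRC adjusts the threshold for very small objects). A minimal sketch of that criterion, as an unofficial illustration:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the intersection rectangle (empty if boxes are disjoint).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

def is_correct_detection(pred, gt, threshold=0.5):
    """PASCAL VOC-style acceptance test for a single detection."""
    return iou(pred, gt) >= threshold
```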
In addition to all these, various other datasets are used in the literature: GRAZ-01
by Opelt and Pinz, which contains four types of images (bikes, people, background
with no bikes, background with no people); the INRIA (people) dataset used by Dalal
and Triggs in [32, 33]; MNIST, a dataset of handwritten digits; ImageCLEF; INRIA
(horses, cars); the TinyImages dataset by Torralba et al.; ETH-80; and others. The
CIFAR-10 set has 6,000 examples of each of 10 classes, and the CIFAR-100 set has
600 examples of each of 100 non-overlapping classes [125]. The list we have
considered is not exhaustive but exemplary; for an exhaustive list, [127] can further
be explored.
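Several of these datasets ship as plain files rather than through an API; for instance, the python version of the CIFAR batches is a set of pickled dictionaries whose `b"data"` entry stores each 32x32 RGB image as 3,072 bytes (all red values, then green, then blue). A minimal unofficial loader, assuming that documented layout:

```python
import pickle

import numpy as np

def load_cifar_batch(path):
    """Load one CIFAR-10 python-version batch file into images and labels.

    Each batch holds N images flattened to 3072 bytes per image
    (1024 red, then 1024 green, then 1024 blue values).
    """
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    data = np.asarray(batch[b"data"], dtype=np.uint8)
    # Reshape channel-major bytes to (N, H, W, C) for display/processing.
    images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    labels = np.asarray(batch[b"labels"])
    return images, labels
```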
5. FUTURE RESEARCH DIRECTIONS
Object recognition is one of the most exciting research areas in the field of computer
vision. The need today is for systems that are computationally efficient and, at the
same time, cost-effective. We suggest some future research directions that can be
explored and, in turn, incorporated into recognition systems. These suggestions are
made at the algorithmic level or at the product level; many of them can at present be
considered ideas that may need a knowledge base drawn from multiple disciplines.
Deep learning is currently the state of the art in object recognition and has produced
promising results, but deep networks suffer from the serious limitation of being
resource-intensive. In the absence of sophisticated hardware, DNNs cannot easily be
adopted for object recognition. In such cases, enhancing the performance of
conventional feature extraction techniques on shallow architectures can be helpful;
new approaches are needed that require only shallow architectures and are still
efficient. It has also been observed that DNNs have not shown very impressive
results on the task of object localization, so this area can be explored further.
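As a small illustration of the "shallow" route, the gradient-orientation histogram underlying HOG-style descriptors can be computed in a few lines of NumPy. A real HOG adds cells, blocks and contrast normalization and typically feeds a linear SVM; the toy global variant below is only a sketch of the idea.

```python
import numpy as np

def orientation_histogram(img, n_bins=9):
    """A tiny HOG-flavoured descriptor: a global histogram of gradient
    orientations, weighted by gradient magnitude and L2-normalized."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    # Unsigned orientation in [0, pi), as in standard HOG.
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```

A descriptor like this, computed over a grid of cells instead of globally, is exactly the kind of handcrafted feature that a shallow classifier can consume cheaply on commodity hardware.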
It also remains to be understood how the learning of features takes place in
convolutional neural networks, and what makes deep architectures achieve such high
recognition accuracy.
With the advent of mobile and other hand-held devices with very good image-capturing
abilities, recognition algorithms designed for such devices are in great need.
Considerable work on action recognition systems exists in the literature, and a
complete line of research continues in this direction, as the area in itself involves
many varied issues and research problems; products involving action and activity
recognition from videos can be developed.
Computer vision techniques can be a good basis for assistive technology for blind
people. For example, products can be developed that see the surroundings, generate a
natural-language description of the scene, and give it as output in spoken form.
This will help blind people understand their surroundings and navigate.
Research in the area of understanding videos from their content has already started,
but it is still in its infancy. Generic object recognition also paves the path for
research in areas like emotion recognition, which would actually enable us to
recognize the meaning of the content in a video.
Robotics is another important field that can benefit from active object recognition.
Today's robots are able to work only in well-structured and constrained
environments, whereas the requirement is for robots that can learn, adapt and execute
their tasks in real human environments.
Almost every device has a camera and devices are now powerful enough to record
and process live video. These videos can be exploited for real-time applications. How
do we organize and personalize all of this content for the common man?
New performance evaluation techniques are needed.
Many rich datasets, like ImageNet and PASCAL VOC, have been generated and made
publicly available by the computer vision research community. Although these
datasets hold huge numbers of images across various categories, they must be
enriched further if we are to reach the level of near-human vision capability in
terms of flexibility and dynamism. Novel ways of labelling the huge amount of
unlabeled image data should be found, so that images annotated with ground truth can
be generated and made publicly available.
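One much-studied route to labelling unlabeled data is self-training (pseudo-labelling): a model trained on the labeled seed set predicts labels for the unlabeled pool, and only its most confident predictions are promoted to training data. The sketch below uses a toy nearest-centroid "model" and a distance-margin confidence purely for illustration; real systems would use a stronger classifier and a calibrated confidence.

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, n_rounds=5, margin=0.5):
    """Toy self-training loop with a nearest-centroid classifier.

    Unlabeled points whose distance margin between the two nearest class
    centroids exceeds `margin` are promoted to labeled data, and the
    centroids are re-estimated. Requires at least two classes.
    """
    X, y = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    classes = np.unique(y)
    for _ in range(n_rounds):
        if len(pool) == 0:
            break
        centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
        # Distance of every pool point to every class centroid.
        d = np.linalg.norm(pool[:, None, :] - centroids[None], axis=2)
        order = np.argsort(d, axis=1)
        rows = np.arange(len(pool))
        conf = d[rows, order[:, 1]] - d[rows, order[:, 0]]  # margin
        keep = conf > margin
        if not keep.any():
            break
        X = np.vstack([X, pool[keep]])
        y = np.concatenate([y, classes[order[keep, 0]]])   # pseudo-labels
        pool = pool[~keep]
    return X, y
```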
6. CONCLUSION
From the literature available on the subject, it was found that the demand for
efficient generic object recognition systems is increasing rapidly, as the spectrum of
applications in which object recognition is needed is very wide and rich. One
major problem in generic object recognition is that the categories in the real world
are varied and huge in number. Training a recognition system for such a large number
of categories and classes is therefore a challenging task, however sophisticated the
approach may be. Such a system is also expected to exhibit plasticity, i.e., to
gradually train itself on unseen categories, which further adds to its complexity.
Systems should be developed that are flexible enough to train themselves for new
classes of objects. Another important issue in generic object recognition lies in the
feature extraction and description phase: in most approaches, the number of features
obtained is too large, and the features are handcrafted. This critical limitation has
been overcome by deep architectures, which in turn exploit the sophisticated
hardware acceleration that has evolved recently. Approaches are needed that make the
entire setup cost-effective, requiring fewer resources. Ideally, the recognition task
should be performed at the semantic level, which would result in near-human vision
systems.
One of the key objectives behind this survey was to answer the research questions
identified by us and mentioned in Section 1. From the literature surveyed it can be
deduced that earlier work on generic object recognition put more weight on the
feature extraction stage and the type of features, whereas later work gave more
prominence to the type of classifier used. Also, recent approaches learn features
directly from the image data, which can be regarded as a very striking innovation
achieved by the vision community. Ways are now needed to enhance these approaches
further; the above efforts can also be extended to 3D images and to videos.
As a result of this study, and from the referred material, a general remark can also
be made about the kind of work done in the field. Most papers before 2008 mainly
present novel ways of modelling the object class, i.e., they emphasize novel ways of
feature detection and description. In work published more recently, since 2011, with
the advent of sophisticated hardware, more emphasis is placed on handling more
categories accurately and efficiently.
In this paper, the current scenario of generic object recognition is portrayed in
brief, with the hope that in the near future an object recognition system will be
developed that is capable of performing vision tasks similar to the human vision
system, with the least possible effort and in a cost-effective manner.
REFERENCES
[1] Bennamoun, Mohammed, and George J. Mamic. Object recognition:
fundamentals and case studies. Springer Science & Business Media, 2002.
[2] Takahiro Hori, Tetsuya Takiguchi, Yasuo Ariki. Generic Object Recognition
Using Graph Embedding into a Vector Space, American Journal of Software
Engineering and Applications. Vol. 2, No. 1, 2013, pp. 13-18.
[3] David G. Lowe, "Object Recognition from Local Scale-Invariant Features",
Proc. of the International Conference on Computer Vision, Corfu, Sept. 1999.
[4] David G. Lowe, "Distinctive Image Features from Scale-Invariant
Keypoints", International Journal of Computer Vision 60.2 (2004): 91-110.
[5] Ke, Yan, and Rahul Sukthankar. "PCA-SIFT: A more distinctive
representation for local image descriptors." Computer Vision and Pattern
Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer
Society Conference on. Vol. 2. IEEE, 2004.
[6] Bay, Herbert, Tinne Tuytelaars, and Luc Van Gool. "Surf: Speeded up robust
features." Computer Vision–ECCV 2006. Springer Berlin Heidelberg, 2006.
404-417.
[7] B. Leibe, A. Leonardis, and B. Schiele. Combined Object Categorization and
Segmentation with an Implicit Shape Model”, In ECCV04. Workshop on
Stat. Learning in Computer Vision, pages 17–32, May 2004.
[8] Alexander Thomas, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, Bernt
Schiele, Luc Van Gool”, Towards Multi-View Object Class Detection”,
Proceedings of the 2006 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’06)
[9] Bastian Leibe, Aleš Leonardis, and Bernt Schiele, "Robust Object Detection
with Interleaved Categorization and Segmentation", IJCV (2008) 77: 259-289.
[10] Jia, Menglei , Li, Hua, Xie, Xing , Chen, Zheng Ma, Wei-ying, “Automatic
Classification Of Objects Within IMAGES” United states Microsoft
Corporation (Redmond, WA, US)20080037877
http://www.freepatentsonline.com/y2008/0037877.html
[11] Pisipati, Radha Krishna (Hyderabad, IN), Syed, Shahanaz (Guntur, IN),
Jonna, Kishore (Proddatur, IN), Bandyopadhyay, Subhadip (Kolkata, IN),
Narayan, Rudra Narayan (Jemadeipur, IN) 2014. Systems And Methods
For Multi-Dimensional Object Detection United States 20140029852
http://www.freepatentsonline.com/y2014/0029852.html
[12] Besl, Paul J., and Ramesh C. Jain. "Three-dimensional object recognition."
ACM Computing Surveys (CSUR) 17.1 (1985): 75-145.
[13] Irving Biederman. Recognition-by-components: A theory of human image
understanding. Psychological Review, 94(2):115-147, 1987.
[14] Zhang, Xin, et al. "Object class detection: A survey." ACM Computing
Surveys (CSUR) 46.1 (2013): 10.
[15] Andreopoulos, Alexander, and John K. Tsotsos. "50 Years of object
recognition: Directions forward." Computer Vision and Image Understanding
117.8 (2013): 827-891.
[16] Roth, Peter M., and Martin Winter. "Survey of appearance-based methods for
object recognition." Inst. for Computer Graphics and Vision, Graz University
of Technology, Austria, Technical Report ICGTR0108 (ICG-TR-01/08)
(2008).
[17] Agarwal, Shivani, Aatif Awan, and Dan Roth. "Learning to detect objects in
images via a sparse, part-based representation." Pattern Analysis and
Machine Intelligence, IEEE Transactions on 26.11 (2004): 1475-1490.
[18] Fergus, Robert, Pietro Perona, and Andrew Zisserman. "Object class
recognition by unsupervised scale-invariant learning." Computer Vision and
Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society
Conference on. Vol. 2. IEEE, 2003.
[19] A. Opelt, A. Pinz, and A. Zisserman. A Boundary-Fragment Model For
Object Detection. In Proc. ECCV, volume 2, pp 575–588, May 2006.
[20] Wu, Bo, and Ramakant Nevatia. "Detection of multiple, partially occluded
humans in a single image by bayesian combination of edgelet part
detectors."Computer Vision, 2005. ICCV 2005. Tenth IEEE International
Conference on. Vol. 1. IEEE, 2005.
[21] Andreas Opelt, Axel Pinz,Andrew Zisserman,” Incremental Learning Of
Object Detectors Using A Visual Shape Alphabet”, Proceedings of the 2006
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’06)
[22] Andreas Opelt, Axel Pinz ,Andrew Zisserman,,“Fusing shape and appearance
information for object category detection”, 2006 - eprints.pascal-network.org
[23] Andreas Opelt, Axel Pinz ,Andrew Zisserman, “Learning an Alphabet of
Shape and Appearance for Multi-Class Object Detection”, IJCV (2008) 80:
16–44
[24] Grauman, Kristen, and Trevor Darrell. "Pyramid match kernel and related
techniques." U.S. Patent No. 7,949,186. 24 May 2011.
[25] Zhang, Jianguo, et al. "Local features and kernels for classification of texture
and object categories: A comprehensive study." International journal of
computer vision 73.2 (2007): 213-238.
[26] Lazebnik, Svetlana, Cordelia Schmid, and Jean Ponce. "Beyond bags of
features: Spatial pyramid matching for recognizing natural scene categories."
Computer Vision and Pattern Recognition, 2006 IEEE Computer Society
Conference on. Vol. 2. IEEE, 2006.
[27] Lazebnik, Svetlana, Cordelia Schmid, and Jean Ponce. "Spatial pyramid
matching." Object Categorization: Computer and Human Vision Perspectives
3 (2009): 4.
[28] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. “A Discriminative
Framework for Texture and Object Recognition Using Local Image Features.
In Toward Category-Level Object Recognition. Springer-Verlag Lecture
Notes in Computer Science, J. Ponce, M. Hebert, C. Schmid, and A.
Zisserman (eds.), 2006.
[29] Lazebnik, Svetlana. "Local, semi-local and global models for texture, object
and scene recognition." (2006).
[30] J. Ponce, T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik,
M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, C. K. I. Williams, J.
Zhang, and A. Zisserman. “Dataset Issues in Object Recognition” In Toward
Category-Level Object Recognition. Springer-Verlag Lecture Notes in
Computer Science, J. Ponce, M. Hebert, C. Schmid, and A. Zisserman (eds.),
2006.
[31] Dorko, Gyuri, and Cordelia Schmid. "Object class recognition using
discriminative local features." (2005): 22.
[32] Dalal, N. and Triggs, B. 2005. Histograms of Oriented Gradients for Human
Detection. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR'05).
[33] Dalal, N., Triggs, B., and Schmid, C. 2006. Human Detection Using Oriented
Histograms of Flow and Appearance. In Proceedings of the European
Conference on Computer Vision (ECCV'06).
[34] Dalal, Navneet. Finding people in images and videos. Diss. Institut National
Polytechnique de Grenoble-INPG, 2006.
[35] Jurie, Frederic, and Bill Triggs. "Creating efficient codebooks for visual
recognition." Computer Vision, 2005. ICCV 2005. Tenth IEEE International
Conference on. Vol. 1. IEEE, 2005.
[36] Bosch Anna, Andrew Zisserman, and Xavier Munoz. "Representing shape
with a spatial pyramid kernel." Proceedings of the 6th ACM international
conference on Image and video retrieval. ACM, 2007.
[37] Watanabe, Tomoki, Satoshi Ito, and Kentaro Yokoi. "Co-occurrence
histograms of oriented gradients for pedestrian detection." Advances in
Image and Video Technology. Springer Berlin Heidelberg, 2009. 37-47.
[38] JOACHIMS, T. 1997. A probabilistic analysis of the rocchio algorithm with
tfidf for text categorization. In Proceedings of the International Conference
on Machine Learning (ICML’97)
[39] Csurka, Gabriella, et al. "Visual categorization with bags of keypoints."
Workshop on statistical learning in computer vision, ECCV. Vol. 1. No. 1-22.
2004.
[40] Ramanan, Amirthalingam, and Mahesan Niranjan. "A review of codebook
models in patch-based visual object recognition." Journal of Signal
Processing Systems 68.3 (2012): 333-352.
[41] K. Mikolajczyk, C.Schmid,” Indexing based on scale invariant interest
points”, International Conference on Computer Vision (ICCV '01) 1 (2001)
525—531.
[42] K.Mikolajczyk A. Zisserman C. Schmid,” Shape recognition with edge-based
features”, British Machine Vision Conference (BMVC '03) 2 (2003) 779—
788.
[43] K. Mikolajczyk and C. Schmid. “Scale and affine invariant interest point
detectors”. Int. J. Comput. Vision, 60(1):63–86, 2004.
[44] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, "A Comparison
of Affine Region Detectors", International Journal of Computer Vision 65,
1/2 (2005): 43-72.
[45] K.Mikolajczyk and C.Schmid,”A performance evaluation of local
descriptors”, IEEE Transactions on Pattern Analysis and Machine
Intelligence 27, 10 (2005) 1615—1630.
[46] Douze, Matthijs, et al. "Evaluation of gist descriptors for web-scale image
search." Proceedings of the ACM International Conference on Image and
Video Retrieval. ACM, 2009.
[47] Jégou, Hervé, Matthijs Douze, and Cordelia Schmid. "Improving bag-of-
features for large scale image search." International Journal of Computer
Vision 87.3 (2010): 316-336.
[48] Tuytelaars, Tinne, and Krystian Mikolajczyk. "Local invariant feature
detectors: a survey." Foundations and Trends® in Computer Graphics and
Vision 3.3 (2008): 177-280.
[49] K. Mikolajczyk, Bastian Leibe, Bernt Schiele,” Multiple Object Class
Detection with a Generative Model”, Computer Vision and Pattern
Recognition, 2006 IEEE Computer Society Conference on vol-1,pp 26 – 36
[50] Jégou, Hervé, et al. "Aggregating local descriptors into a compact image
representation." Computer Vision and Pattern Recognition (CVPR), 2010
IEEE Conference on. IEEE, 2010.
[51] Wengert, Christian, Matthijs Douze, and Hervé Jégou. "Bag-of-colors for
improved image search." Proceedings of the 19th ACM international
conference on Multimedia. ACM, 2011.
[52] Belongie, S., Malik, J., And Puzicha, J. 2001. “Matching shapes”, In
Proceedings of the IEEE International Conference on Computer Vision
(ICCV’01).
[53] Belongie, Serge, Jitendra Malik, and Jan Puzicha. "Shape matching and
object recognition using shape contexts." Pattern Analysis and Machine
Intelligence, IEEE Transactions on 24.4 (2002): 509-522
[54] Andras Ferencz, Erik G. Learned-Miller, Jitendra Malik,” Building a
Classification Cascade for Visual Identification from One Example”,
Proceedings of the Tenth IEEE International Conference on Computer Vision
(ICCV’05).
[55] Hao Zhang Alexander C. Berg Michael Maire Jitendra Malik,”SVM-
KNN:Discriminative Nearest Neighbor Classification for Visual
Recognition”, Proceedings of the 2006 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR’06)
[56] Bjorn Ommer Jitendra Malik,”Multi-Scale Object Detection by Clustering
Lines”, 2009 IEEE 12th International Conference on Computer Vision
(ICCV)pp 484-491
[57] Subhransu Maji, Jitendra Malik,”Object Detection using a Max-Margin
Hough Transform”,IEEE 2009, pp 1038-1045 .
[58] Vidal-Naquet, Michel, and Shimon Ullman. "Object Recognition with
Informative Features and Linear Classification." ICCV. Vol. 3. 2003
[59] Fergus, Robert, Pietro Perona, and Andrew Zisserman. "Object class
recognition by unsupervised scale-invariant learning." Computer Vision and
Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society
Conference on. Vol. 2. IEEE, 2003.
[60] Boris Epshtein Shimon Ullman,” Identifying Semantically Equivalent Object
Fragments”, Proceedings of the 2005 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’05)
[61] Eran Borenstein and Shimon Ullman,” Combined Top-Down/Bottom-Up
Segmentation”, IEEE Transactions On Pattern Analysis And Machine
Intelligence, Vol. 30, No. 12, December 2008, pp 2109-2125
[62] L. Fei-Fei, R. Fergus, and P. Perona,”Learning generative visual models from
few training examples: An incremental bayesian approach tested on 101
object categories.” In Proc. CVPR Workshop on Generative-Model Based
Vision, 2004.
[63] Fei-Fei, Li, Rob Fergus, and Pietro Perona. "Learning generative visual
models from few training examples: An incremental bayesian approach tested
on 101 object categories." Computer Vision and Image Understanding 106.1
(2007): 59-70.
[64] www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html
[65] Fei-Fei, Li, Robert Fergus, and Pietro Perona. "One-shot learning of object
categories." Pattern Analysis and Machine Intelligence, IEEE Transactions
on28.4 (2006): 594-611.
[66] Li-Jia Li, Gang Wang and Li Fei-Fei,” OPTIMOL: Automatic Online Picture
Collection via Incremental Model Learning”, 2007 IEEE
[67] Hao Su, Min Sun, Li Fei-Fei,Silvio Savarese,”Learning a Dense Multi-View
Representation For Detection, Viewpoint Classification And Synthesis Of
Object Categories”, 2009 IEEE 12th International Conference on Computer
Vision (ICCV)
[68] Bangpeng Yao, Li Fei-Fei,” Recognizing Human-Object Interactions in Still
Images by Modeling Mutual Context of Object and Human Pose in Human-
Object Interaction Activities”, Ieee Transactions on Pattern Analysis and
Machine Intelligence, Vol. 34, No. 9, September 2012, pp 1691-1703
[69] Oliva, Aude, and Antonio Torralba. "Building the gist of a scene: The role of
global image features in recognition." Progress in brain research 155 (2006):
23-36.
[70] Antonio Torralba, Kevin P. Murphy and William T. Freeman, “Sharing
Visual Features for Multiclass and Multiview Object Detection”, April 2004.
[71] Antonio Torralba , Kevin P. Murphy, William T. Freeman,” Sharing features:
efficient boosting procedures for multiclass object detection”
[72] Oliva, Aude, and Antonio Torralba. "Modeling the shape of the scene: A
holistic representation of the spatial envelope." International journal of
computer vision 42.3 (2001): 145-175.
[73] Torralba, Antonio, Robert Fergus, and Yair Weiss. "Small codes and large
image databases for recognition." Computer Vision and Pattern Recognition,
2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
[74] BC Russell, A Torralba, KP Murphy,“LabelMe: a database and web-based
tool for image annotation”,International journal of Computer Vision,
2008(77) – Springer, pp-157-173.
[75] Taha H. Rassem, Bee Ee Khoo,” Object Class Recognition using
Combination of Color SIFT Descriptors”, 2011 IEEE
[76] Gyuri Dorkó, Cordelia Schmid, "Object Class Recognition Using
Discriminative Local Features", Technical Report.
[77] Gy. Dorko and C. Schmid. Selection of scale-invariant parts for object class
recognition”. In Proceedings of the Ninth IEEE International Conference on
Computer Vision (ICCV’03), pages 634–639, 2003.
[78] Thomas Serre, Lior Wolf, Stanley Bileschi, Maximilian Riesenhuber, and
Tomaso Poggio,” Robust Object Recognition with Cortex-Like
Mechanisms”, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND
MACHINE INTELLIGENCE, VOL. 29, NO. 3, MARCH 2007 pp 411-426.
[79] B. Mayurathan, A. Ramanan, S. Mahesan & U.A.J. Pinidiyaarachchi,”
Speeded-up and Compact Visual Codebook for Object Recognition”,
International Journal of Image Processing (IJIP), Volume (7): Issue (1): 2013
pp 31-50
[80] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing.
Prentice-Hall Inc., 2002.
[81] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman, “Learning object
Categories from google’s Image Search,” Computer Vision, 2005. ICCV’
2005. Tenth IEEE International Conference on, 2005.
[82] Joao Carreira and Cristian Sminchisescu, “Constrained Parametric Min-Cuts
for Automatic Object Segmentation”, Computer Vision and Pattern
Recognition (CVPR), 2010 IEEE Conference pp 3241-3248
[83] Van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2008). Evaluation of
color descriptors for object and scene recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition (CVPR’08)
[84] Uijlings, Jasper RR, et al. "Selective search for object
recognition."International journal of computer vision 104.2 (2013): 154-171.
[85] Van de Sande, Koen EA, et al. "Segmentation as selective search for object
recognition." Computer Vision (ICCV), 2011 IEEE International Conference
on. IEEE, 2011.
[86] Fuxin Li_ and Joao Carreira_ and Cristian Sminchisescu, “Object
Recognition as Ranking Holistic Figure-Ground Hypotheses” , Computer
Vision and Pattern Recognition (CVPR), 2010 IEEE Conference , pp 1712 –
1719.
[87] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection
and semantic segmentation." Computer Vision and Pattern Recognition
(CVPR), 2014 IEEE Conference on. IEEE, 2014.
[88] Carreira, Joao, et al. "Semantic segmentation with second-order pooling."
Computer Vision–ECCV 2012. Springer Berlin Heidelberg, 2012. 430-443.
[89] Carreira, Joao, and Cristian Sminchisescu. "CPMC: Automatic object
segmentation using constrained parametric min-cuts." Pattern Analysis and
Machine Intelligence, IEEE Transactions on 34.7 (2012): 1312-1328.
[90] Li, Fuxin, Joao Carreira, and Cristian Sminchisescu. "Object recognition as
ranking holistic figure-ground hypotheses." Computer Vision and Pattern
Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010
[91] Fischler, Martin A., and Robert A. Elschlager. "The representation and
matching of pictorial structures." IEEE Transactions on Computers 22.1
(1973): 67-92.
[92] Felzenszwalb, Pedro F., and Daniel P. Huttenlocher. "Pictorial structures for
object recognition." International Journal of Computer Vision 61.1 (2005):
55-79.
[93] Felzenszwalb, Pedro F., et al. "Object detection with discriminatively trained
part-based models." Pattern Analysis and Machine Intelligence, IEEE
Transactions on 32.9 (2010): 1627-1645.
[94] Girshick, Ross B., Pedro F. Felzenszwalb, and D. McAllester.
"Discriminatively trained deformable part models, release 5." (2012).
[95] Felzenszwalb, Pedro, David McAllester, and Deva Ramanan. "A
discriminatively trained, multiscale, deformable part model." Computer
Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on.
IEEE, 2008.
[96] Felzenszwalb, Pedro F., Ross B. Girshick, and David McAllester. "Cascade
object detection with deformable part models." Computer vision and pattern
recognition (CVPR), 2010 IEEE conference on. IEEE, 2010.
[97] Ferrari, Vittorio, Frederic Jurie, and Cordelia Schmid. "Accurate object
detection with deformable shape models learnt from images." Computer
Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE,
2007.
[98] Pentland, Alex P. "Automatic extraction of deformable part models."
International Journal of Computer Vision 4.2 (1990): 107-126.
[99] Pandey, Megha, and Svetlana Lazebnik. "Scene recognition and weakly
supervised object localization with deformable part-based models." Computer
Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[100] Ren, Xiaofeng, and Deva Ramanan. "Histograms of sparse codes for object
detection." Computer Vision and Pattern Recognition (CVPR), 2013 IEEE
Conference on. IEEE, 2013.
[101] Yang, Yi, and Deva Ramanan. "Articulated pose estimation with flexible
mixtures-of-parts." Computer Vision and Pattern Recognition (CVPR), 2011
IEEE Conference on. IEEE, 2011.
[102] Bourdev, Lubomir, and Jitendra Malik. "Poselets: Body part detectors trained
using 3d human pose annotations." Computer Vision, 2009 IEEE 12th
International Conference on. IEEE, 2009.
[103] Arbeláez, Pablo, et al. "Semantic segmentation using regions and parts."
Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference
on. IEEE, 2012.
[104] Bourdev, Lubomir, et al. "Detecting people using mutually consistent poselet
activations." Computer Vision–ECCV 2010. Springer Berlin Heidelberg,
2010. 168-181.
[105] Arbeláez, Pablo, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir
Bourdev, and Jitendra Malik. "Semantic segmentation using regions and
parts." In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE
Conference on, pp. 3378-3385. IEEE, 2012.
[106] Zhu, Long, et al. "Latent hierarchical structural learning for object detection."
Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference
on. IEEE, 2010.
[107] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding
convolutional networks." Computer Vision–ECCV 2014. Springer
International Publishing, 2014. 818-833.
[108] Zhao, Wenyi, et al. "Face recognition: A literature survey." Acm Computing
Surveys (CSUR) 35.4 (2003): 399-458.
[109] Yang, Ming-Hsuan, David Kriegman, and Narendra Ahuja. "Detecting faces
in images: A survey." Pattern Analysis and Machine Intelligence, IEEE
Transactions on 24.1 (2002): 34-58.
[110] T. Yamazaki, T. Fujikawa, J. Katto, "Improving the Performance of SIFT
Using Bilateral Filter and its Application to Generic Object Recognition",
ICASSP 2012, IEEE, pp. 945-948.
[111] Chiu, Liang-Chi, et al. "Fast SIFT Design For Real-Time Visual Feature
Extraction." Image Processing, IEEE Transactions on 22.8 (2013): 3158-
3167.
[112] Kamencay, Patrik, et al. "Feature extraction for object recognition using
PCA-KNN with application to medical image analysis." Telecommunications
and Signal Processing (TSP), 2013 36th International Conference on. IEEE,
2013.
[113] He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks
for visual recognition."arXiv preprint arXiv: 1406.4729 (2014).
[114] Goyal, Soren, and Paul Benjamin. "Object Recognition Using Deep Neural
Networks: A Survey." arXiv preprint arXiv: 1412.3684 (2014).
[115] Nevatia, Ramakant, and Thomas O. Binford. "Description and recognition of
curved objects." Artificial Intelligence 8.1 (1977): 77-98.
[116] Fidler, Sanja, Marko Boben, and Ales Leonardis. "Learning a hierarchical
compositional shape vocabulary for multi-class object representation." arXiv
preprint arXiv: 1408.5516 (2014).
[117] Lee, Tom, Sanja Fidler, and Sven Dickinson. "Multi-cue mid-level
grouping."
[118] Russakovsky, Olga, et al. "Imagenet large scale visual recognition
challenge." arXiv preprint arXiv: 1409.0575 (2014).
[119] Everingham, Mark, et al. "The pascal visual object classes challenge: A
retrospective." International Journal of Computer Vision 111.1 (2014): 98-
136.
[120] Nene, Sameer A., Shree K. Nayar, and Hiroshi Murase. Columbia object
image library (COIL-20). Technical Report CUCS-005-96, 1996.
[121] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-
scale hierarchical image database, IEEE Computer Vision and Pattern
Recognition, 2009. <http://www.image-net.org/>.
[122] Lin, Tsung-Yi, et al. "Microsoft COCO: Common objects in
context." Computer Vision–ECCV 2014. Springer International Publishing,
2014. 740-755.
[123] LeCun, Yann, et al. "Backpropagation applied to handwritten zip code
recognition." Neural computation 1.4 (1989): 541-551.
[124] Humphrey, Eric J., Juan Pablo Bello, and Yann LeCun. "Moving Beyond
Feature Design: Deep Architectures and Automatic Feature Learning in
Music Informatics." ISMIR. 2012.
[125] Krizhevsky, Alex, and Geoffrey Hinton. "Learning multiple layers of features
from tiny images." Computer Science Department, University of Toronto,
Tech. Rep 1.4 (2009): 7.
[126] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet
classification with deep convolutional neural networks." Advances in neural
information processing systems. 2012.
[127] http://riemenschneider.hayko.at/vision/dataset/index.php as referred on 12th
April 2015.
[128] http://image-net.org/challenges/LSVRC/2012/analysis/ as referred on 12th
April 2015.
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding
convolutional networks." Computer Vision–ECCV 2014. Springer
International Publishing, 2014. 818-833.
[129] Bengio, Yoshua, et al. "Greedy layer-wise training of deep
networks." Advances in neural information processing systems 19 (2007):
153.
[130] Bhisekar, Manisha, and Prajakta Deshmane. "Image Retrieval and
Face Recognition Techniques: Literature Survey." International Journal of
Electronics and Communication Engineering and Technology 5(1) (2014):
52-58.
[131] Almeida, Yoel E., Ashray S. Bhandare, and Aishwary P. Nipane. "Computer
Vision Based Adaptive Lighting Solutions for Smart and Efficient System."
International Journal of Computer Engineering and Technology 6(3) (2015):
01-11.
[132] Socher, Richard, et al. "Parsing natural scenes and natural language with
recursive neural networks." Proceedings of the 28th International Conference
on Machine Learning (ICML-11). 2011.