arXiv:1907.03069v1 [cs.CV] 6 Jul 2019

Deep learning for fine-grained image analysis: A survey

Xiu-Shen Wei¹, Jianxin Wu², Quan Cui¹,³

¹Megvii Research Nanjing, Megvii Technology, Nanjing, China
²National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
³Graduate School of IPS, Waseda University, Fukuoka, Japan

Abstract

Computer vision (CV) is the process of using machines to understand and analyze imagery, and is an integral branch of artificial intelligence. Among the various research areas of CV, fine-grained image analysis (FGIA) is a longstanding and fundamental problem, and has become ubiquitous in diverse real-world applications. The task of FGIA targets analyzing visual objects from subordinate categories, e.g., species of birds or models of cars. The small inter-class variations and the large intra-class variations caused by the fine-grained nature make it a challenging problem. With the booming of deep learning, recent years have witnessed remarkable progress in FGIA using deep learning techniques. In this paper, we aim to give a survey of recent advances in deep learning based FGIA techniques in a systematic way. Specifically, we organize the existing studies of FGIA techniques into three major categories: fine-grained image recognition, fine-grained image retrieval and fine-grained image generation. In addition, we also cover other important issues of FGIA, such as publicly available benchmark datasets and related domain-specific applications. Finally, we conclude this survey by highlighting several directions and open problems which need to be further explored by the community in the future.

1 Introduction

Computer vision (CV) is an interdisciplinary scientific field of artificial intelligence (AI), which deals with how computers can be made to gain high-level understanding from digital images or videos. The tasks of computer vision include methods for acquiring, processing, analyzing and understanding digital images, and the process of extracting numerical or symbolic information, e.g., in the form of decisions or predictions, from high-dimensional raw image data in the real world.

As an interesting, fundamental and challenging problem in computer vision, fine-grained image analysis (FGIA) has been an active area of research for several decades. The goal of FGIA is to retrieve, recognize and generate images belonging to multiple subordinate categories of a super-category (aka

Figure 1: Fine-grained image analysis vs. generic image analysis (taking the recognition task as an example). [Figure: generic image recognition distinguishes meta-categories such as Birds, Dogs and Oranges, while fine-grained image recognition distinguishes sub-categories of dogs such as Husky, Alaska and Samoyed.]

meta-category), e.g., different species of animals/plants, different models of cars, different kinds of retail products, etc. (cf. Fig. 1). In the real world, FGIA enjoys a wide range of applications in both industry and research, such as automatic biodiversity monitoring, climate change evaluation, intelligent retail, intelligent transportation, and many more. In particular, a number of influential academic competitions on FGIA are frequently held on Kaggle.¹ Several representative competitions, to name a few, are the Nature Conservancy Fisheries Monitoring (for fish species categorization) and Humpback Whale Identification (for whale identity categorization). Each competition attracted more than 300 teams worldwide, and some even exceeded 2,000 teams.

On the other hand, deep learning techniques [LeCun et al., 2015] have emerged in recent years as powerful methods for learning feature representations directly from data, and have led to remarkable breakthroughs in the field of FGIA. By rough statistics, on average around ten conference papers on deep learning based FGIA techniques are published each year at each of AI's and CV's premier conferences, such as IJCAI, AAAI, CVPR, ICCV and ECCV. This shows that FGIA with deep learning is of notable research interest. Given this period of rapid evolution, the aim of this paper is to provide a comprehensive survey of the recent achievements in the FGIA field brought by deep learning techniques.

In the literature, there is one existing survey related to fine-

¹Kaggle is an online community of data scientists and machine learners: https://www.kaggle.com/.



Figure 2: Main aspects of our hierarchical and structural organization of fine-grained image analysis (FGIA) in this survey paper. [Figure: a tree rooted at FGIA with four branches — FG image recognition (by localization-classification subnetworks; by end-to-end feature encoding; with external information: web data, multi-modality data, humans in the loop), FG image retrieval (unsupervised with pre-trained models; supervised with metric learning), FG image generation (generating from FG image distributions; generating from text descriptions), and future directions of FGIA (automatic fine-grained models; fine-grained few shot learning; fine-grained hashing; FGIA within more realistic settings).]

grained tasks, i.e., [Zhao et al., 2017], which simply included several fine-grained recognition approaches for comparison. Our work differs from it in that ours is more comprehensive. Specifically, beyond fine-grained recognition, we also analyze and discuss the other two central fine-grained analysis tasks, i.e., fine-grained image retrieval and fine-grained image generation, which cannot be overlooked as they are two integral aspects of FGIA. Additionally, at another important AI conference in the Pacific Rim nations, PRICAI, Wei and Wu organized a specific tutorial² on the fine-grained image analysis topic. We refer interested readers to the tutorial, which provides some additional detailed information.

In this paper, our survey takes a unique deep learning based perspective to review the recent advances of FGIA in a systematic and comprehensive manner. The main contributions of this survey are three-fold:

• We give a comprehensive review of FGIA techniques based on deep learning, including problem backgrounds, benchmark datasets, a family of FGIA methods with deep learning, domain-specific FGIA applications, etc.

• We provide a systematic overview of recent advances of deep learning based FGIA techniques in a hierarchical and structural manner, cf. Fig. 2.

• We discuss the challenges and open issues, and identify

²http://www.weixiushen.com/tutorial/PRICAI18/FGIA.html

Figure 3: Key challenge of fine-grained image analysis, i.e., small inter-class variations and large intra-class variations. [Figure: each row shows one of four tern species — Arctic Tern, Common Tern, Forster's Tern and Caspian Tern — illustrating inter-class vs. intra-class variances.]

the new trends and future directions to provide a potential road map for fine-grained researchers or other interested readers in the broad AI community.

The rest of the survey is organized as follows. Section 2 introduces the background of this paper, i.e., the FGIA problem and its main challenges. In Section 3, we review multiple commonly used fine-grained benchmark datasets. Section 4 analyzes the three main paradigms of fine-grained image recognition. Section 5 presents recent progress in fine-grained image retrieval. Section 6 discusses fine-grained image generation from a generative perspective. Furthermore, in Section 7, we introduce some other real-world domain specific applications related to FGIA. Finally, we conclude this paper and discuss future directions and open issues in Section 8.

2 Background: problem and main challenges

In this section, we summarize the related background of this paper, including the problem and its key challenges.

Fine-grained image analysis (FGIA) focuses on dealing with objects belonging to multiple sub-categories of the same meta-category (e.g., birds, dogs and cars), and generally involves central tasks like fine-grained image recognition, fine-grained image retrieval, fine-grained image generation, etc.

What distinguishes FGIA from generic image analysis is the following: in generic image analysis, the target objects belong to coarse-grained meta-categories (e.g., birds, oranges and dogs), and are thus visually quite different. However, in FGIA, since objects come from sub-categories of one meta-category, the fine-grained nature makes them visually quite similar. We take image recognition for illustration. As shown in Fig. 1, fine-grained recognition is required to identify multiple similar species of dogs, e.g., Husky, Samoyed and Alaska. For accurate recognition, it is desirable to distinguish them by capturing slight and subtle differences (e.g., ears, noses, tails), which also meets the demand of other FGIA tasks (e.g., retrieval and generation).


Figure 4: Example fine-grained images belonging to different species of flowers/vegetables, different models of cars/aircraft and different kinds of retail products. Accurate identification of these fine-grained objects depends on discriminative but subtle object parts or image regions. (Best viewed in color and zoomed in.)

Table 1: Summary of popular fine-grained image datasets. Note that "BBox" indicates whether the dataset provides object bounding box supervision. "Part anno." means providing key part localizations. "HRCHY" corresponds to hierarchical labels. "ATR" represents attribute labels (e.g., wing color, male, female, etc.). "Texts" indicates whether fine-grained text descriptions of images are supplied.

Dataset name                                   Meta-class        # images  # categories  BBox  Part anno.  HRCHY  ATR  Texts
Oxford Flower [Nilsback and Zisserman, 2008]   Flowers              8,189           102                                   ✓
CUB200-2011 [Wah et al., 2011]                 Birds               11,788           200    ✓       ✓                ✓     ✓
Stanford Dog [Khosla et al., 2011]             Dogs                20,580           120    ✓
Stanford Car [Krause et al., 2013]             Cars                16,185           196    ✓
FGVC Aircraft [Maji et al., 2013]              Aircraft            10,000           100    ✓                  ✓
Birdsnap [Berg et al., 2014]                   Birds               49,829           500    ✓       ✓                ✓
Fru92 [Hou et al., 2017]                       Fruits              69,614            92                       ✓
Veg200 [Hou et al., 2017]                      Vegetables          91,117           200                       ✓
iNat2017 [Horn et al., 2017]                   Plants & Animals   859,000         5,089    ✓                  ✓
RPC [Wei et al., 2019a]                        Retail products     83,739           200    ✓                  ✓

Furthermore, the fine-grained nature also brings small inter-class variations caused by highly similar sub-categories, and large intra-class variations in poses, scales and rotations, as presented in Fig. 3. This is the opposite of generic image analysis (i.e., small intra-class variations and large inter-class variations), which makes fine-grained image analysis a challenging problem.

3 Benchmark datasets

In the past decade, the vision community has released many benchmark fine-grained datasets covering diverse domains such as birds [Wah et al., 2011; Berg et al., 2014], dogs [Khosla et al., 2011], cars [Krause et al., 2013], airplanes [Maji et al., 2013], flowers [Nilsback and Zisserman, 2008], vegetables [Hou et al., 2017], fruits [Hou et al., 2017], retail products [Wei et al., 2019a], etc. (cf. Fig. 4). In Table 1, we list a number of image datasets commonly used by the fine-grained community, and specifically indicate their meta-category, the number of fine-grained images, the number of fine-grained categories, and the different kinds of extra available supervision, i.e., bounding boxes, part annotations, hierarchical labels, attribute labels and text descriptions, cf. Fig. 5.

These datasets have been one of the most important factors for the considerable progress in the field, not only as a common ground for measuring and comparing the performance of competing approaches, but also by pushing this field towards increasingly complex, practical and challenging problems.

Specifically, among them, CUB200-2011 is one of the most popular fine-grained datasets. Almost all FGIA approaches

Figure 5: An example image with its supervisions associated with CUB200-2011. As shown, the multiple types of supervision include: image labels (here, "American Goldfinch"), part annotations (aka key point localizations), object bounding boxes (i.e., the green one), attribute labels (i.e., "ATR", e.g., Has_Bill_Shape: Cone, Has_Wing_Color: Black and white, Has_Back_Color: Yellow), and text descriptions in natural language (e.g., "This very distinct looking bird has a black crown, a yellow belly, and white and black wings."). (Best viewed in color.)

choose it for comparison with the state of the art. Moreover, constant contributions are made upon CUB200-2011 for further research, e.g., collecting text descriptions of the fine-grained images for multi-modality analysis, cf. [Reed et al., 2016; He and Peng, 2017a].

Additionally, in recent years, more challenging and practical fine-grained datasets have been increasingly proposed, e.g., iNat2017 for natural species of plants and animals [Horn et al., 2017] and RPC for daily retail products [Wei et al., 2019a]. These datasets bring many novel features, to name a few: large scale, hierarchical structure, domain gaps and long-tail distributions, which reveal the practical requirements of the real world and could spur studies of FGIA in more realistic settings.


4 Fine-grained image recognition

Fine-grained image recognition has been the most active research area of FGIA in the past decade. In this section, we review the milestones of fine-grained recognition frameworks since deep learning entered the field. Broadly, these fine-grained recognition approaches can be organized into three main paradigms, i.e., fine-grained recognition (1) with localization-classification subnetworks; (2) with end-to-end feature encoding; and (3) with external information. Among them, the first and second paradigms restrict themselves to only utilizing the supervision associated with fine-grained images, such as image labels, bounding boxes, part annotations, etc. Since automatic recognition systems cannot yet achieve excellent performance due to the fine-grained challenges, researchers have gradually attempted to involve external but cheap information (e.g., web data, text descriptions) in fine-grained recognition to further improve accuracy, which corresponds to the third paradigm. The popularly used evaluation metric in fine-grained recognition is the classification accuracy averaged across all the subordinate categories of the datasets.
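The averaged per-class accuracy mentioned above weighs every subordinate category equally, regardless of how many test images it has. A minimal sketch (the function name and toy labels are illustrative, not from any benchmark):

```python
import numpy as np

def per_class_average_accuracy(y_true, y_pred):
    """Average the per-class accuracies so every subordinate
    category contributes equally, regardless of its image count."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accs = []
    for c in np.unique(y_true):
        mask = (y_true == c)
        accs.append(np.mean(y_pred[mask] == c))
    return float(np.mean(accs))

# Three classes; class 2 is rare but weighs as much as the others.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1, 2, 1]
print(per_class_average_accuracy(y_true, y_pred))  # (0.75 + 0.75 + 0.5) / 3
```

Note how this differs from overall accuracy (7/10 here): long-tail categories are not drowned out by the head classes.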

4.1 By localization-classification subnetworks

To mitigate the challenge of intra-class variations, researchers in the fine-grained community pay attention to capturing discriminative semantic parts of fine-grained objects and then constructing a mid-level representation corresponding to these parts for the final classification. Specifically, a localization subnetwork is designed for locating these key parts, and a classification subnetwork then follows and is employed for recognition. The framework of these two collaborative subnetworks forms the first paradigm, i.e., fine-grained recognition with localization-classification subnetworks.

Thanks to the localization information, e.g., part-level bounding boxes or segmentation masks, the model can obtain more discriminative mid-level (part-level) representations w.r.t. these fine-grained parts. This further enhances the learning capability of the classification subnetwork, which can significantly boost the final recognition accuracy.

Earlier works belonging to this paradigm depend on additional dense part annotations (aka key point localizations) to locate semantic key parts (e.g., head, torso) of objects. Some of them learn part-based detectors [Zhang et al., 2014; Lin et al., 2015a], and some leverage segmentation methods for localizing parts [Wei et al., 2018a]. These methods then concatenate multiple part-level features into a whole-image representation, which is fed into the following classification subnetwork for final recognition. Thus, these approaches are also termed part-based recognition methods.

However, obtaining such dense part annotations is labor-intensive, which limits both the scalability and practicality of real-world fine-grained applications. Recently, a trend has emerged of techniques under this paradigm that require only image labels [Jaderberg et al., 2015; Fu et al., 2017; Zheng et al., 2017; Sun et al., 2018] for accurate part localization. Their common motivation is to first find the corresponding parts and then compare their appearance. Concretely, it is desirable to capture semantic parts (e.g., head and torso) shared across fine-grained categories, and meanwhile, it

is also desirable to discover the subtle differences between these part representations. Advanced techniques, like attention mechanisms [Yang et al., 2018] and multi-stage strategies [He and Peng, 2017b], complicate the joint training of the integrated localization-classification subnetworks.
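To make the "localize, then classify" idea concrete, the following toy sketch derives an attention map from convolutional activations, takes the tight box around above-average positions, and crops the corresponding image region for a second, finer-grained pass. This is a crude stand-in, not the learned localization of any cited method; all names and shapes are illustrative:

```python
import numpy as np

def attend_and_crop(feature_map, image):
    """Locate a discriminative region from a C x H x W feature map and
    crop the corresponding patch of the (h x w x 3) input image."""
    C, H, W = feature_map.shape
    attention = feature_map.sum(axis=0)        # aggregate channels -> H x W
    ys, xs = np.where(attention > attention.mean())
    # tight bounding box of above-average positions, in feature-map coords
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    # upscale the box to image coordinates (integer stride assumed)
    h, w = image.shape[:2]
    sy, sx = h // H, w // W
    return image[y0 * sy : y1 * sy, x0 * sx : x1 * sx]

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
fmap = np.zeros((8, 14, 14))
fmap[:, 4:9, 6:11] = 1.0                       # fake "bird head" activations
patch = attend_and_crop(fmap, img)
print(patch.shape)  # (80, 80, 3): rows 4..8 and cols 6..10, scaled by 16
```

In real systems the attention is learned and the cropped patch is fed to a classification subnetwork; here the crop just illustrates the division of labor between the two subnetworks.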

4.2 By end-to-end feature encoding

Different from the first paradigm, the second paradigm, i.e., end-to-end feature encoding, leans towards directly learning a more discriminative feature representation by developing powerful deep models for fine-grained recognition. The most representative method among them is Bilinear CNNs [Lin et al., 2015b], which represents an image as a pooled outer product of features derived from two deep CNNs, and thus encodes higher-order statistics of convolutional activations to enhance the mid-level learning capability. Thanks to its high model capacity, Bilinear CNNs achieve remarkable fine-grained recognition performance. However, the extremely high dimensionality of bilinear features still makes them impractical for realistic applications, especially large-scale ones.
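The pooled outer product is simple to write down. A minimal numpy sketch, assuming two feature maps flattened to channels x locations (the signed square-root and L2 normalization steps follow common practice for bilinear features; exact dimensions are illustrative):

```python
import numpy as np

def bilinear_pool(fa, fb):
    """Bilinear pooling of two feature maps fa (Ca x N) and fb (Cb x N),
    where N = H*W locations: sum of outer products over all locations,
    followed by signed-sqrt and L2 normalization."""
    z = fa @ fb.T                          # Ca x Cb == sum_n fa[:, n] fb[:, n]^T
    z = z.ravel()
    z = np.sign(z) * np.sqrt(np.abs(z))    # signed square root
    return z / (np.linalg.norm(z) + 1e-12)

rng = np.random.default_rng(0)
fa = rng.random((512, 28 * 28))            # stream A: 512 channels
fb = rng.random((512, 28 * 28))            # stream B: 512 channels
desc = bilinear_pool(fa, fb)
print(desc.shape)                          # (262144,) -- i.e., 512 * 512
```

The 512 x 512 = 262,144-dimensional output of this small example is exactly the dimensionality problem the text refers to.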

Aiming at this problem, more recent attempts, e.g., [Gao et al., 2016; Kong and Fowlkes, 2017; Cui et al., 2017], try to aggregate low-dimensional embeddings by applying tensor sketching [Pham and Pagh, 2013; Charikar et al., 2002], which can approximate the bilinear features and maintain comparable or higher recognition accuracy. Other works, e.g., [Dubey et al., 2018], focus on designing a specific loss function tailored for fine-grained recognition that is able to drive the whole deep model to learn discriminative fine-grained representations.
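The core of the tensor-sketch trick is that the outer product of two count sketches can be computed as a circular convolution, i.e., a product in the FFT domain. A toy sketch of that idea for a single pair of feature vectors (h and s are the random hash and sign vectors; the 512-to-8,192 dimensions are illustrative, not from any cited implementation):

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x (length c) to d dims: y[h[i]] += s[i] * x[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)                 # unbuffered scatter-add
    return y

def tensor_sketch(x1, x2, h1, s1, h2, s2, d):
    """Approximate the outer product x1 (x) x2 in only d dims via
    circular convolution of the two count sketches (FFT domain)."""
    f = np.fft.rfft(count_sketch(x1, h1, s1, d)) * \
        np.fft.rfft(count_sketch(x2, h2, s2, d))
    return np.fft.irfft(f, n=d)

rng = np.random.default_rng(0)
c, d = 512, 8192                           # 512*512 = 262,144 dims shrunk to 8,192
h1, h2 = rng.integers(0, d, c), rng.integers(0, d, c)
s1, s2 = rng.choice([-1.0, 1.0], c), rng.choice([-1.0, 1.0], c)
x, y = rng.random(c), rng.random(c)
phi = tensor_sketch(x, y, h1, s1, h2, s2, d)
print(phi.shape)  # (8192,)
```

In expectation, inner products between such sketches approximate inner products between the full bilinear features, which is why recognition accuracy can be preserved at a fraction of the dimensionality.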

4.3 With external information

As aforementioned, beyond the conventional recognition paradigms, another paradigm is to leverage external information, e.g., web data, multi-modality data or human-computer interactions, to further assist fine-grained recognition.

With web data

To identify the minor distinctions among various fine-grained categories, sufficient well-labeled training images are in high demand. However, accurate human annotations for fine-grained categories are not easy to acquire, due to the difficulty of annotation (which always requires domain experts) and the myriad fine-grained categories (i.e., often more than thousands of subordinate categories in a meta-category).

Therefore, some fine-grained recognition methods seek to utilize free but noisy web data to boost recognition performance. The majority of existing works in this line can be roughly grouped into two directions. One of them is to crawl noisy labeled web data for the test categories as training data, which is regarded as webly supervised learning [Zhuang et al., 2017; Sun et al., 2019]. The main efforts of these approaches concentrate on: (1) overcoming the dataset gap between easily acquired web images and well-labeled data from standard datasets; and (2) reducing the negative effects caused by noisy data. To deal with these problems, deep learning techniques such as adversarial learning [Goodfellow et al., 2014] and attention mechanisms [Zhuang et al., 2017] are frequently utilized. The other direction of using web data is to transfer knowledge from auxiliary categories with


Figure 6: An example knowledge graph for modeling the category-attribute correlations on CUB200-2011. [Figure: warbler species (Pine Warbler, Tennessee Warbler, Kentucky Warbler, Hooded Warbler) linked to attributes such as belly, eye, throat, crown and wing colors, e.g., Belly: Yellow, Throat: Black, Crown: Black, Wing: Grey Yellow.]

well-labeled training data to the test categories, which usually employs zero-shot learning [Niu et al., 2018] or meta learning [Zhang et al., 2018] to achieve that goal.

With multi-modality data

Multi-modal analysis has attracted a lot of attention with the rapid growth of multi-media data (e.g., images, texts, knowledge bases, etc.). In fine-grained recognition, multi-modality data is used to establish joint representations/embeddings which incorporate multi-modality information, and is able to boost fine-grained recognition accuracy. In particular, frequently utilized multi-modality data includes text descriptions (e.g., sentences and phrases in natural language) and graph-structured knowledge bases. Compared with the strong supervision of fine-grained images, e.g., part annotations, text descriptions are weak supervision. Moreover, text descriptions can be relatively accurately provided by ordinary humans, rather than by experts in a specific domain. In addition, high-level knowledge graphs are an existing resource and contain rich professional knowledge, such as DBpedia [Lehmann et al., 2015]. In practice, both text descriptions and knowledge bases are effective as extra guidance for better fine-grained image representation learning.

Specifically, [Reed et al., 2016] collects text descriptions, and introduces a structured joint embedding for zero-shot fine-grained image recognition by combining texts and images. Later, [He and Peng, 2017a] combines the vision and language streams in a joint end-to-end training fashion to preserve the intra-modality and inter-modality information for generating complementary fine-grained representations. For fine-grained recognition with knowledge bases, some works, e.g., [Chen et al., 2018; Xu et al., 2018a], introduce knowledge base information (always associated with attribute labels, cf. Fig. 6) to implicitly enrich the embedding space (also reasoning about the discriminative attributes of fine-grained objects).
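A common ingredient behind such joint embeddings is a ranking objective over a compatibility matrix of image and text embeddings. The sketch below is a generic symmetric hinge ranking loss, not the specific objective of any cited paper; the function name and margin value are illustrative:

```python
import numpy as np

def joint_embedding_loss(img_emb, txt_emb, margin=0.2):
    """Symmetric hinge ranking loss over a batch of image/text embeddings
    whose i-th rows are matching pairs: push each matching score above
    every mismatched one by at least `margin`."""
    scores = img_emb @ txt_emb.T                            # B x B compatibility
    pos = np.diag(scores)                                   # matching-pair scores
    cost_t = np.maximum(0, margin + scores - pos[:, None])  # rank texts per image
    cost_i = np.maximum(0, margin + scores - pos[None, :])  # rank images per text
    B = scores.shape[0]
    mask = 1.0 - np.eye(B)                                  # ignore the diagonal
    return float(((cost_t + cost_i) * mask).sum() / B)

# Perfectly aligned toy embeddings incur zero loss.
print(joint_embedding_loss(np.eye(4), np.eye(4)))  # 0.0
```

Minimizing this drives matching image/text pairs together and mismatched pairs apart in the shared space, which is the intra-/inter-modality structure the text describes.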

With humans in the loop

Fine-grained recognition with humans in the loop usually uses an iterative system composed of a machine and a human user, which combines both human and machine efforts and intelligence. It also requires the system to work in as labor-economical a way as possible. Generally, for these kinds of recognition methods, the system in each round seeks to understand how humans perform recognition, e.g., by asking untrained humans to label the image class and pick out hard examples [Cui et al., 2016], or by identifying key part localizations and selecting discriminative features [Deng et al., 2016] for fine-grained recognition.

Figure 7: An illustration of fine-grained retrieval. Given a query image (aka probe) of "Dodge Charger Sedan 2012", fine-grained retrieval is required to return images of the same car model from a car database (aka gallery). In this figure, the fourth returned image, marked with a red rectangle, is a wrong result, since its model is "Dodge Caliber Wagon 2012".

5 Fine-grained image retrieval

Beyond image recognition, fine-grained retrieval is another crucial aspect of FGIA and has emerged as a hot topic. Its evaluation metric is the common mean average precision (mAP).

In fine-grained image retrieval, given database images of the same sub-category (e.g., birds or cars) and a query, the system should return images which are of the same variety as the query, without resorting to any other supervision signals, cf. Fig. 7. Generic image retrieval focuses on retrieving near-duplicate images based on similarities in their content (e.g., textures, colors and shapes), while fine-grained retrieval focuses on retrieving images of the same type (e.g., the same subordinate species for animals and the same model for cars). Meanwhile, objects in fine-grained images have only subtle differences, and vary in poses, scales and rotations.

In the literature, [Wei et al., 2017] is the first attempt at fine-grained image retrieval using deep learning. It employs pre-trained CNN models to select meaningful deep descriptors by localizing the main object in fine-grained images in an unsupervised fashion, and further reveals that selecting only useful deep descriptors and removing background or noise can significantly benefit retrieval tasks. Recently, to break through the limitations of unsupervised fine-grained retrieval with pre-trained models, some trials [Zheng et al., 2018; Zheng et al., 2019] aim to discover novel loss functions under the supervised metric learning paradigm. Meanwhile, they still design additional specific sub-modules tailored for fine-grained objects, e.g., the weakly-supervised localization module proposed in [Zheng et al., 2018], which is inspired by [Wei et al., 2017].
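The unsupervised descriptor-selection idea can be sketched in a few lines: aggregate the pre-trained activations across channels into a rough "objectness" map, keep only the above-average positions (assumed to cover the main object), and pool just those descriptors. This is a simplified sketch in the spirit of [Wei et al., 2017], not a faithful reimplementation; shapes are illustrative:

```python
import numpy as np

def select_and_pool(feature_map):
    """Select useful deep descriptors from a C x H x W activation tensor
    of a pre-trained CNN, then average- and max-pool only the selected
    ones into a single L2-normalized image descriptor."""
    aggregation = feature_map.sum(axis=0)        # H x W "objectness" map
    mask = aggregation > aggregation.mean()      # keep above-average positions
    selected = feature_map[:, mask]              # C x (#selected descriptors)
    desc = np.concatenate([selected.mean(axis=1), selected.max(axis=1)])
    return desc / (np.linalg.norm(desc) + 1e-12)

rng = np.random.default_rng(0)
fmap = rng.random((512, 7, 7))                   # e.g., a pre-trained conv output
desc = select_and_pool(fmap)
print(desc.shape)  # (1024,) -- avg-pooled plus max-pooled halves
```

Descriptors from background positions never enter the pooled representation, which is exactly the "removing background or noise" effect the text credits with benefiting retrieval.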

6 Fine-grained image generation

Apart from supervised learning tasks, image generation is a representative topic of unsupervised learning. It deploys deep generative models, e.g., GANs [Goodfellow et al., 2014], to learn to synthesize realistic images which look visually authentic. With the quality of generated images becoming


higher, more challenging goals are expected, i.e., fine-grained image generation. As the term suggests, fine-grained generation synthesizes images in fine-grained categories, such as faces of a specific person or objects of a subordinate category.

The first work in this line was CVAE-GAN, proposed in [Bao et al., 2017], which combines a variational auto-encoder with a generative adversarial network under a conditional generative process to tackle this problem. Specifically, CVAE-GAN models an image as a composition of label and latent attributes in a probabilistic model. Then, by varying the fine-grained category fed into the resulting generative model, it can generate images in a specific category. More recently, generating images from text descriptions [Xu et al., 2018b] has become popular in light of its diverse and practical applications, e.g., art generation and computer-aided design. By employing an attention-equipped generative network, the model can synthesize fine-grained details of subtle regions by focusing on the relevant words in the text descriptions.

7 Domain specific applications related to fine-grained image analysis

In the real world, deep learning based fine-grained image analysis techniques have also been adopted in diverse domain-specific applications and show great performance, such as clothes/shoes retrieval [Song et al., 2017] in recommendation systems, fashion image recognition [Liu et al., 2016] on e-commerce platforms, and product recognition [Wei et al., 2019a] in intelligent retail. These applications are closely related to both the fine-grained retrieval and the fine-grained recognition parts of FGIA.

Additionally, if we move down the spectrum of granularity, in the extreme, face identification can be viewed as an instance of fine-grained recognition, where the granularity is at the identity level. Moreover, person/vehicle re-identification is another related fine-grained task, which aims to determine whether two images are taken of the same specific person/vehicle. Apparently, re-identification tasks also operate at identity granularity.

In practice, these works solve the corresponding domain-specific tasks by following the motivations of FGIA, which include capturing the discriminative parts of objects (faces, persons and vehicles) [Suh et al., 2018], discovering coarse-to-fine structural information [Wei et al., 2018b], developing attribute-based models [Liu et al., 2016], and so on.

8 Concluding remarks and future directions

Fine-grained image analysis (FGIA) based on deep learning has made great progress in recent years. In this paper, we gave an extensive survey on recent advances in FGIA with deep learning. We mainly introduced the FGIA problem and its challenges, discussed the significant improvements in fine-grained image recognition/retrieval/generation, and also presented some domain-specific applications related to FGIA. Despite this great success, there are still many unsolved problems. Thus, in this section, we point out these problems explicitly and introduce some research trends for future evolution. We hope that this survey not only provides a better understanding of FGIA but also facilitates future research activities and application developments in this field.

Automatic fine-grained models. Nowadays, automated machine learning (AutoML) [Feurer et al., 2015] and neural architecture search (NAS) [Elsken et al., 2018] are attracting fervent attention in the artificial intelligence community, especially in computer vision. AutoML targets automating the end-to-end process of applying machine learning to real-world tasks, while NAS, the process of automating neural network architecture design, is a logical next step within AutoML. Recent AutoML and NAS methods can match or even outperform hand-designed architectures in various computer vision applications. Thus, it is promising that automatic fine-grained models developed with AutoML or NAS techniques could yield better and more tailor-made deep models, while in turn advancing the study of AutoML and NAS.

Fine-grained few-shot learning. Humans are capable of learning a new fine-grained concept with very little supervision, e.g., a few exemplar images of a bird species, yet our best deep fine-grained systems need hundreds or thousands of labeled examples. Even worse, supervision for fine-grained images is both time-consuming and expensive, since fine-grained objects must be accurately labeled by domain experts. It is therefore desirable to develop fine-grained few-shot learning (FGFS) [Wei et al., 2019b]. The FGFS task requires learning systems to build classifiers for novel fine-grained categories from few examples (only one, or fewer than five) in a meta-learning fashion. Robust FGFS methods could greatly strengthen the usability and scalability of fine-grained recognition.

Fine-grained hashing. As attention to FGIA grows, more large-scale and well-constructed fine-grained datasets have been released, e.g., [Berg et al., 2014; Horn et al., 2017; Wei et al., 2019a].
In real applications such as fine-grained image retrieval, a natural problem arises: the cost of finding the exact nearest neighbor is prohibitively high when the reference database is very large. Hashing [Wang et al., 2018; Li et al., 2016], one of the most popular and effective techniques for approximate nearest neighbor search, has the potential to handle large-scale fine-grained data. Therefore, fine-grained hashing is a promising direction worth further exploration.

Fine-grained analysis within more realistic settings. In the past decade, fine-grained image analysis techniques have been developed and achieve good performance under traditional settings, e.g., the empirical protocols of [Wah et al., 2011; Khosla et al., 2011; Krause et al., 2013]. However, these settings cannot satisfy the daily requirements of today's real-world applications, e.g., recognizing retail products on storage racks with models trained on images collected in controlled environments [Wei et al., 2019a], or recognizing and detecting natural species in the wild [Horn et al., 2017]. In consequence, novel fine-grained image analysis topics, such as fine-grained analysis with domain adaptation, with knowledge transfer, with long-tailed distributions, and on resource-constrained embedded devices, deserve substantial research effort towards more advanced and practical FGIA.
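The appeal of hashing for large-scale retrieval can be sketched with the classic random-hyperplane scheme: encode each feature vector as a short binary code and rank the database by Hamming distance, which is far cheaper than exact nearest-neighbor search. The NumPy sketch below is illustrative only; real fine-grained hashing methods, such as those surveyed in [Wang et al., 2018], learn the hash functions rather than drawing them at random:

```python
import numpy as np

def hash_codes(features, projections):
    """Map real-valued features to binary codes via random hyperplanes
    (a classic locality-sensitive hashing scheme for cosine similarity)."""
    return (features @ projections > 0).astype(np.uint8)   # (N, bits)

def hamming_search(query_code, database_codes, k=5):
    """Return indices of the k database items closest in Hamming distance."""
    distances = (database_codes != query_code).sum(axis=1)
    return np.argsort(distances, kind="stable")[:k]

# Toy usage on random "image features"; in fine-grained retrieval the
# features would come from a CNN instead.
rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 128))      # 1000 reference images
projections = rng.normal(size=(128, 64))     # 64-bit codes
db_codes = hash_codes(database, projections)
query = database[42] + 0.01 * rng.normal(size=128)  # near-duplicate query
top = hamming_search(hash_codes(query[None], projections)[0], db_codes)
```

Because a near-duplicate of item 42 flips almost no bits, it is retrieved first, while unrelated items sit around 32 of the 64 bits away; comparing compact codes in this way is what makes billion-scale reference databases tractable.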


References

[Bao et al., 2017] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In ICCV, pages 2745–2754, 2017.

[Berg et al., 2014] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In CVPR, pages 2019–2026, 2014.

[Charikar et al., 2002] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In ICALP, pages 693–703, 2002.

[Chen et al., 2018] T. Chen, L. Lin, R. Chen, Y. Wu, and X. Luo. Knowledge-embedded representation learning for fine-grained image recognition. In IJCAI, pages 627–634, 2018.

[Cui et al., 2016] Y. Cui, F. Zhou, Y. Lin, and S. Belongie. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In CVPR, pages 1153–1162, 2016.

[Cui et al., 2017] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural networks. In CVPR, pages 2921–2930, 2017.

[Deng et al., 2016] J. Deng, J. Krause, M. Stark, and L. Fei-Fei. Leveraging the wisdom of the crowd for fine-grained recognition. TPAMI, 38(4):666–676, 2016.

[Dubey et al., 2018] A. Dubey, O. Gupta, R. Raskar, and N. Naik. Maximum entropy fine-grained classification. In NeurIPS, pages 637–647, 2018.

[Elsken et al., 2018] T. Elsken, J. H. Metzen, and F. Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018.

[Feurer et al., 2015] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In NIPS, pages 2962–2970, 2015.

[Fu et al., 2017] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, pages 4438–4446, 2017.

[Gao et al., 2016] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In CVPR, pages 317–326, 2016.

[Goodfellow et al., 2014] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.

[He and Peng, 2017a] X. He and Y. Peng. Fine-grained image classification via combining vision and language. In CVPR, pages 5994–6002, 2017.

[He and Peng, 2017b] X. He and Y. Peng. Weakly supervised learning of part selection model with spatial constraints for fine-grained image classification. In AAAI, pages 4075–4081, 2017.

[Horn et al., 2017] G. Van Horn, O. M. Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie. The iNaturalist species classification and detection dataset. In CVPR, pages 8769–8778, 2017.

[Hou et al., 2017] S. Hou, Y. Feng, and Z. Wang. VegFru: A domain-specific dataset for fine-grained visual categorization. In ICCV, pages 541–549, 2017.

[Jaderberg et al., 2015] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.

[Khosla et al., 2011] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In CVPR Workshop on Fine-Grained Visual Categorization, pages 806–813, 2011.

[Kong and Fowlkes, 2017] S. Kong and C. Fowlkes. Low-rank bilinear pooling for fine-grained classification. In CVPR, pages 365–374, 2017.

[Krause et al., 2013] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In ICCV Workshop on 3D Representation and Recognition, 2013.

[LeCun et al., 2015] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.

[Lehmann et al., 2015] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, pages 167–195, 2015.

[Li et al., 2016] W.-J. Li, S. Wang, and W.-C. Kang. Feature learning based deep supervised hashing with pairwise labels. In IJCAI, pages 1711–1717, 2016.

[Lin et al., 2015a] D. Lin, X. Shen, C. Lu, and J. Jia. Deep LAC: Deep localization, alignment and classification for fine-grained recognition. In CVPR, pages 1666–1674, 2015.

[Lin et al., 2015b] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, pages 1449–1457, 2015.

[Liu et al., 2016] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, pages 1096–1104, 2016.

[Maji et al., 2013] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

[Nilsback and Zisserman, 2008] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Indian Conf. on Comput. Vision, Graph. and Image Process., pages 722–729, 2008.

[Niu et al., 2018] L. Niu, A. Veeraraghavan, and A. Sabharwal. Webly supervised learning meets zero-shot learning: A hybrid approach for fine-grained classification. In CVPR, pages 7171–7180, 2018.

[Pham and Pagh, 2013] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In KDD, pages 239–247, 2013.

[Reed et al., 2016] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, pages 49–58, 2016.

[Song et al., 2017] J. Song, Q. Yu, Y.-Z. Song, T. Xiang, and T. M. Hospedales. Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In ICCV, pages 5551–5560, 2017.

[Suh et al., 2018] Y. Suh, J. Wang, S. Tang, T. Mei, and K. M. Lee. Part-aligned bilinear representations for person re-identification. In ECCV, pages 402–419, 2018.

[Sun et al., 2018] M. Sun, Y. Yuan, F. Zhou, and E. Ding. Multi-attention multi-class constraint for fine-grained image recognition. In ECCV, pages 834–850, 2018.

[Sun et al., 2019] X. Sun, L. Chen, and J. Yang. Learning from web data using adversarial discriminative neural networks for fine-grained classification. In AAAI, 2019.

[Wah et al., 2011] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD birds-200-2011 dataset. Tech. Report CNS-TR-2011-001, 2011.

[Wang et al., 2018] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen. A survey on learning to hash. TPAMI, 40(4):769–790, 2018.

[Wei et al., 2017] X.-S. Wei, J.-H. Luo, J. Wu, and Z.-H. Zhou. Selective convolutional descriptor aggregation for fine-grained image retrieval. TIP, 26(6):2868–2881, 2017.

[Wei et al., 2018a] X.-S. Wei, C.-W. Xie, J. Wu, and C. Shen. Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76:704–714, 2018.

[Wei et al., 2018b] X.-S. Wei, C.-L. Zhang, L. Liu, C. Shen, and J. Wu. Coarse-to-fine: A RNN-based hierarchical attention model for vehicle re-identification. In ACCV, 2018.

[Wei et al., 2019a] X.-S. Wei, Q. Cui, L. Yang, P. Wang, and L. Liu. RPC: A large-scale retail product checkout dataset. arXiv preprint arXiv:1901.07249, 2019.

[Wei et al., 2019b] X.-S. Wei, P. Wang, L. Liu, C. Shen, and J. Wu. Piecewise classifier mappings: Learning fine-grained learners for novel categories with few examples. TIP, in press, 2019.

[Xu et al., 2018a] H. Xu, G. Qi, J. Li, M. Wang, K. Xu, and H. Gao. Fine-grained image classification by visual-semantic embedding. In IJCAI, pages 1043–1049, 2018.

[Xu et al., 2018b] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, pages 1316–1324, 2018.

[Yang et al., 2018] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang. Learning to navigate for fine-grained classification. In ECCV, pages 438–454, 2018.

[Zhang et al., 2014] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV, pages 834–849, 2014.

[Zhang et al., 2018] Y. Zhang, H. Tang, and K. Jia. Fine-grained visual categorization using meta-learning optimization with sample selection of auxiliary data. In ECCV, pages 233–248, 2018.

[Zhao et al., 2017] B. Zhao, J. Feng, X. Wu, and S. Yan. A survey on deep learning-based fine-grained object classification and semantic segmentation. International Journal of Automation and Computing, 14(2):119–135, 2017.

[Zheng et al., 2017] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In ICCV, pages 5209–5217, 2017.

[Zheng et al., 2018] X. Zheng, R. Ji, X. Sun, Y. Wu, F. Huang, and Y. Yang. Centralized ranking loss with weakly supervised localization for fine-grained object retrieval. In IJCAI, pages 1226–1233, 2018.

[Zheng et al., 2019] X. Zheng, R. Ji, X. Sun, B. Zhang, Y. Wu, and F. Huang. Towards optimal fine grained retrieval via decorrelated centralized loss with normalize-scale layer. In AAAI, 2019.

[Zhuang et al., 2017] B. Zhuang, L. Liu, Y. Li, C. Shen, and I. Reid. Attend in groups: A weakly-supervised deep learning framework for learning from web data. In CVPR, pages 1878–1887, 2017.