Aniket Bhatnagar Sanchit Aggarwal - arXiv.org e Aniket Bhatnagar Sanchit Aggarwal Abstract The...
Fine-grained Apparel Classification and Retrieval without rich annotations
Aniket Bhatnagar · Sanchit Aggarwal
Abstract The ability to correctly classify and retrieve apparel images has a variety of applications important to e-commerce, online advertising, internet search, and the visual surveillance industry. In this work, we propose a robust framework for fine-grained apparel classification, in-shop retrieval, and cross-domain retrieval that eliminates the need for rich annotations such as bounding boxes, human joints, or clothing landmarks, and for training a bounding box or key-landmark detector to produce them. Factors such as subtle appearance differences, variations in human poses, different shooting angles, apparel deformations, and self-occlusion add to the challenges in classification and retrieval of apparel items. Cross-domain retrieval is even harder because of the large variation between online shopping images, usually taken with ideal lighting, pose, positive angle, and clean background, and street photos captured by users in complicated conditions with poor lighting, cluttered scenes, and complex backgrounds. Our framework utilizes a compact bilinear CNN with the tensor sketch algorithm to generate embeddings that capture local pairwise feature interactions in a translationally invariant manner. For apparel classification, we pass the obtained feature embeddings through a softmax classifier, while the in-shop and cross-domain retrieval pipelines use a triplet-loss based optimization approach and deploy three compact BCNNs with a ranking loss, such that the squared Euclidean distance between embeddings measures the dissimilarity between the images. Unlike previous settings that relied on bounding box, key clothing landmark, or human joint detectors to assist the final deep classifier, the proposed framework can be trained directly on the provided category labels or on generated triplets for triplet loss optimization. Lastly, experimental results on the DeepFashion fine-grained categorization, in-shop retrieval, and consumer-to-shop retrieval datasets provide a comparative analysis with previous work performed in the domain.
Aniket Bhatnagar Squadrun Solutions Private Limited Mobile.: +91 9999105171 E-mail: email@example.com
Sanchit Aggarwal Squadrun Solutions Private Limited E-mail: firstname.lastname@example.org
Keywords Apparel classification · In-shop apparel retrieval · Cross-domain apparel retrieval · Compact bilinear CNN
1 Introduction
Many methods have been proposed for apparel recognition [1, 17, 25, 38], clothes parsing [10, 13, 16, 30, 34, 36, 37, 39], apparel attribute detection and description [5, 6, 9, 18, 26, 31], and clothing item retrieval and recommendation [8, 12, 14, 15, 20, 21, 23, 24, 33, 35], owing to the tremendous impact of these tasks on various industries.
Online retail stores themselves have huge opportunities, ranging from a rich user discovery experience to quality control operations such as product identification, tagging, moderation, enrichment, and contextual advertisement. Consequently, algorithms for automatic categorization and retrieval of visually similar fashion products would have significant benefits.
However, clothes categorization and retrieval is still an open problem, especially because of the large number of fine-grained categories with very subtle visual differences in style, texture, and cut compared with other common object categories. Classifying apparel images is made even more challenging by factors such as appearance, variations in human poses, and different shooting angles.
Another challenge is the susceptibility of apparel to deformation and self-occlusion. Moreover, apparel retrieval is often confronted with difficulties due to the large variation between online shopping images and selfies. Usually, online shopping images have ideal lighting, pose, and clean backgrounds and are captured from a positive angle, while street photos captured by users have poor lighting, cluttered scenes, and complex backgrounds.
The last few years have seen growing interest in fine-grained categorization [3, 22, 33, 41]. Since the discriminating parts of apparel categories tend to be subtle differences in shape, style, and texture, categorizing a clothing item or identifying its attributes, such as sleeve length or pattern, can be posed as a fine-grained classification problem.
Fine-grained classification, compared to general-purpose visual categorization, focuses on the characteristic challenge of making subtle distinctions despite high intra-class variance due to factors such as pose, viewpoint, or location of the object. A common approach to fine-grained classification is to first localize various parts of the object and then model the appearance conditioned on the detected locations.
Parts are often defined manually, and a part detector is learned as part of the overall pipeline. Branson et al. approached fine-grained visual categorization of bird species by estimating the object's pose and computing features by applying deep convolutional nets to image patches that are located and normalized by the pose. Zhang et al. leveraged deep convolutional features computed on bottom-up region proposals. These methods generally increase the cost of labeling, as annotating tags is easier than annotating bounding box coordinates for part detectors.
Recently, an easier method for fine-grained classification that does not require bounding boxes or landmark locations, namely Bilinear Convolutional Neural Nets (BCNNs), has been introduced. Lin et al. proposed a framework which places a bilinear pooling layer just before the last fully connected layer and achieves remarkable performance on fine-grained classification datasets.
Bilinear pooling collects second-order statistics of local features over the whole image and then performs a global pooling operation across each channel. The second-order statistics capture pairwise correlations between the feature channels, and global pooling introduces invariance to deformations. However, the representational power of bilinear features comes at the cost of very high-dimensional feature maps. To reduce the model size and enable end-to-end optimization of the visual recognition system, Gao et al. proposed Compact Bilinear CNNs using the TensorSketch or Random Maclaurin algorithm. Gao et al. used a kernelized representation to show that comparing two bilinear descriptors amounts to comparing each local descriptor in the first image with each local descriptor in the second image, where the comparison operator is a second-order polynomial kernel. It follows that a compact version of bilinear pooling can be obtained from any low-dimensional approximation of the second-order polynomial kernel. Further, compact bilinear CNNs exhibit nearly equal, and at times better, performance compared to the full bilinear CNN.
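The pooling step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the feature matrix `X` (one row per spatial location, one column per channel), and the projection dimension `d` are assumptions made for illustration only. The first function computes the full bilinear descriptor; the other two approximate it with a Tensor Sketch, that is, two Count Sketch projections combined by circular convolution via the FFT.

```python
import numpy as np


def full_bilinear_pool(X):
    """Full bilinear descriptor of local features X with shape (n_locations, c).

    Sums the outer product x x^T over all locations and flattens the
    resulting (c, c) matrix, so the output has c * c entries.
    """
    return (X.T @ X).ravel()


def count_sketch_matrix(c, d, rng):
    """Random Count Sketch projection from c to d dimensions.

    Row i has a single non-zero entry: a random sign placed in a
    randomly hashed column.
    """
    h = rng.integers(0, d, size=c)        # hash bucket for each channel
    s = rng.choice([-1.0, 1.0], size=c)   # random sign for each channel
    P = np.zeros((c, d))
    P[np.arange(c), h] = s
    return P


def compact_bilinear_pool(X, P1, P2):
    """Tensor Sketch approximation of the bilinear descriptor.

    Each location is Count-Sketched twice; the two sketches are combined
    by circular convolution (element-wise product in the Fourier domain),
    and the per-location results are sum-pooled into one d-dim vector.
    """
    f1 = np.fft.fft(X @ P1, axis=1)
    f2 = np.fft.fft(X @ P2, axis=1)
    per_location = np.fft.ifft(f1 * f2, axis=1).real
    return per_location.sum(axis=0)
```

The inner product of two full descriptors equals the sum of squared dot products between all pairs of local features, which is the second-order polynomial kernel mentioned above. The inner product of two compact descriptors built with the same `P1` and `P2` estimates that same quantity, and the estimate is expected to tighten as `d` grows, while the descriptor size drops from c² to `d`.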
Thus, in this work, we propose an efficient and reliable framework based on the compact bilinear CNN for fine-grained apparel classification, in-shop retrieval, and cross-domain retrieval, eliminating the requirement for bounding boxes or key landmarks. To our knowledge, this is the first attempt at solving classification and retrieval of fashion products (apparel items) without using bounding boxes or finding key landmarks, and the first to employ a compact bilinear CNN to do so.
2 Related Work
We now take a closer look at the more recent work on apparel classification and retrieval [1, 9, 12, 14, 17, 20, 23–25, 33]. These methods are quite efficient, but they depend on strong requirements: manually labelled clothing landmarks, object detectors to predict bounding boxes [9, 12, 14, 20], human pose estimation [17, 24], or body part detectors [1, 23].
Liu et al. proposed FashionNet, which tries to simultaneously model local attribute-level, general category-level, and clothing image similarity-level representations, and depends on clothing attributes and landmarks. Apart from this, FashionNet also requires bounding box annotations around the clothing item, or around the human model wearing the apparel in the image, for learning classifiers. Obtaining these massive attribute annotations along with clothing landmarks for apparel items is a tedious and costly task. It is not always possible for online marketplaces that maintain huge catalogues of clothing items to create such hand-crafted annotated datasets.
Dong et al. constructed a deep model capable of recognizing fine-grained clothing attributes on images in the wild using multi-task curriculum transfer learning. They collected a large clothing dataset, with meta-labels as attributes, from different online shopping websites. They learned a pre-processor to detect clothing images using Faster R-CNN and then employed an object detector which was trained on PASCAL VOC2007, followed by fine-tuning on an assembled bounding-box-annotated clothing dataset consisting of 8,000 street/shop photos. Their model was then trained using the obtained bounding boxes and rich annotations.
Hadi et al. utilized an AlexNet with activations from the fully connected layer FC6 to identify exactly matching clothes from the street to the shop domain. They collected street and shop photos and obtained bounding box annotations using the Mechanical Turk service. Huang et al. proposed a dual attribute-aware ranking network, which