

Utilizing Google Images for Semantic Segmentation via CRF-MAP

    Rizki Perdana Rangkuti, Vektor Dewanto, Wisnu Jatmiko

Abstract—This research aims to improve semantic segmentation from a data perspective by utilizing Google Images as a source of training data. Google Images returns images related to a given keyword, so keywords can be used to find images that represent the desired objects; for example, the keyword 'car' retrieves images of cars. This allows semantic segmentation to collect many training images at no cost. A common rule of thumb is that more data yields higher accuracy. This paper challenges that argument: combining the VOC PASCAL dataset with a Google Images dataset gives competitive predictions in both accuracy and visual quality, but increasing the number of Google Images does not significantly improve the prediction accuracy compared to using the VOC dataset alone, and can even do worse.

Keywords—Computer Society, IEEEtran, journal, LaTeX, paper, template.


    1 INTRODUCTION

The goal of semantic segmentation is to label every pixel with an object-class label from a pre-defined set of labels, e.g., car, person, and bus. Figure 1 provides some semantic segmentation examples. The three pictures in the first row are original images of the kind typically taken by cameras. The next row shows the semantic segmentation of each original image. In practice, a semantic segmentation marks the pixels with unique colors according to their labels; these colored images are called labeled images. For example, the cow label is blue, the grass label is green, the body or person label is brown, and so on. A pixel may have a label different from its neighbors. The snapshot or configuration of labels is called a labeling. Two identical original images should have the same labeling and identical labeled images.

Semantic segmentation needs a number of datasets to establish a prediction model. Recent works depend heavily on a limited number of datasets. The rule of thumb in machine learning is that the more the data, the better the prediction accuracy. Producing pixel-wise labeled training images is currently costly, since they are generated by hand. The lack of training data leaves the classifier with poor accuracy.

Fig. 1: Pixel-wise semantic segmentation of [1]. The first row displays the original images. The second row displays the labeled images of the respective original images.

VOC PASCAL 2010 [2] and MSRC [3] are popular datasets in semantic segmentation. Both are categorized as strongly labeled datasets, because they are labeled pixel-wise. VOC PASCAL 2010 and MSRC contain 1928 and 591 labeled images respectively. In a multiclass setting, those numbers are not enough for training the classifiers as the number of classes grows.

On the other hand, the Internet provides thousands of images. Some image search services provide images with metadata, which enriches each image with useful information. However, unlike VOC PASCAL 2010 and MSRC, many of these images cannot be used directly as training data, because straightforward label information is unavailable.


Google Images, for example, is categorized as a weakly labeled dataset, because it is not labeled pixel-wise. Google Images returns images according to a given keyword. The desired object often appears salient in the search results when the keyword refers to a real-world object. The keyword hints at which labels should be assigned, but it is not informative enough to place the labels. A salient region of an image covers the important part that belongs to a particular class of labels. If computers can recognize the salient region, then it becomes possible to place labels on the weakly labeled images. In other words, the computer can generate strongly labeled datasets from Google Images automatically. The author argues that by enabling computers to recognize saliency, Google Images can be utilized as training data for semantic segmentation.

2 CONDITIONAL RANDOM FIELDS - MAXIMUM A POSTERIORI (CRF-MAP)

A Conditional Random Field (CRF) describes an image as a graph in which the pixels are represented as nodes. A CRF determines an unobserved label y_i based on an observed value x_i. The observed value x_i is simply a feature, e.g., color, texture, or location. A node may correlate with its neighbours.

A Conditional Random Field is a variant of Markov Random Fields that directly estimates the posterior probability. According to the Markov-Gibbs equivalence, the posterior probability is exponentially proportional to the energy of a labeling y given the data x.

\[ P(y \mid x) = \frac{1}{Z} e^{-E(y, x)} \tag{1} \]

The energy of a labeling y is the sum of potential functions.

\[ E(y, x) = \sum_{i}^{N} u(y_i, x_i) + \sum_{i}^{N} \sum_{j \in N_i} p(y_i, y_j, x_i) \]

N_i denotes the neighboring pixel indices of pixel i. The term Z normalizes the probability to a range between 0 and 1. Computing Z for a large CRF can be intractable, because it is the sum of exponentially many terms.

\[ Z = \sum_{y \in \mathcal{Y}} e^{-E(y, x)} \]

To obtain an optimal prediction, one should minimize the misclassification risk. According to [4], the minimal-risk estimate is equivalent to the Maximum A Posteriori (MAP) estimate.

\[ y^{*} = \operatorname*{arg\,max}_{y \in \mathcal{Y}} P(y \mid x) \]

The Hammersley-Clifford theorem proves that the probability of a pixel i being labeled as y_i depends on the potentials of the neighbours of pixel i [5]. The theorem provides practical simplicity for estimating the joint probability of a labeling y by specifying the potential functions of the energy. Here u and p denote the unary potential and the pairwise potential respectively. A potential function can be regarded as a penalty on a label. A unary potential penalizes a label assignment based on its likelihood given the features. For example, a furry texture would penalize animal-related labels, i.e., cat and dog, less than man-made labels, such as aeroplane and car. A pairwise potential penalizes a pair of labels that are unlikely to coexist. For example, an aeroplane label is penalized less when its neighbor also carries an aeroplane label rather than another label. To realize the potential function terms, the CRF can use a classifier for the unary potentials, such as the TextonBoost classifier proposed by [1], and the Potts model as the pairwise potential (see Equation 2).

\[ p(y_i, y_j, x_i) = [\, y_i \neq y_j \,] \tag{2} \]
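To make these terms concrete, the following sketch (a minimal illustration, not the TextonBoost-based model of [1]) evaluates the energy E(y, x) of a labeling on a 4-connected pixel grid, combining given per-pixel unary costs with the Potts pairwise penalty of Equation 2. The function names and the unary costs are assumptions for the example; any pixel classifier could supply the unaries, e.g. as negative log-probabilities.

```python
import numpy as np

def crf_energy(labels, unary, pairwise_weight=1.0):
    """Energy of a labeling on a 4-connected grid.

    labels : (H, W) int array, the labeling y.
    unary  : (H, W, L) array, unary[i, j, k] = u(y_ij = k, x_ij),
             e.g. negative log-probabilities from a pixel classifier.
    pairwise_weight : scalar weight of the Potts penalty [y_i != y_j].
    """
    H, W = labels.shape
    # Unary term: sum of the chosen label's cost at every pixel.
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    unary_term = unary[ii, jj, labels].sum()

    # Pairwise Potts term over horizontal and vertical neighbour pairs:
    # every pair of unequal neighbouring labels is penalised once.
    horizontal = (labels[:, :-1] != labels[:, 1:]).sum()
    vertical = (labels[:-1, :] != labels[1:, :]).sum()
    pairwise_term = pairwise_weight * (horizontal + vertical)

    return unary_term + pairwise_term

# Tiny usage example with random unaries over 3 classes.
rng = np.random.default_rng(0)
unary = rng.random((4, 5, 3))
labels = unary.argmin(axis=2)          # independent per-pixel choice
print(crf_energy(labels, unary))
```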

The role of the energy function is to assist the validation of the CRF. The parameters and the potential functions are learnt from training data such that the ground-truth labeling has the lowest energy. In the opposite direction, once the parameters and the potential functions are established, a semantic segmentation can be predicted for unseen data.

A good labeling y maximizes the posterior probability globally. Maximizing the posterior probability is equivalent to minimizing the energy of the labeling y. This fact has an important consequence, because it turns the vision problem into an optimization problem.

\[ y^{*} = \operatorname*{arg\,min}_{y \in \mathcal{Y}} E(y, x) \]

Efficient optimization methods for discrete labels are known to exist in some problem domains, such as binary segmentation and multiclass segmentation. The optimization method is used not only for inference but also for learning the CRF parameters.
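As a hedged illustration of the argmin formulation only, the sketch below minimizes the same unary-plus-Potts energy with iterated conditional modes (ICM), a simple greedy coordinate descent. It is not the optimizer used in this paper's experiments, and it only reaches a local minimum, but it shows how inference reduces to repeatedly picking the cheapest label given the current neighbour labels.

```python
import numpy as np

def icm(unary, pairwise_weight=1.0, iterations=5):
    """Greedy energy minimisation (ICM) for a unary + Potts grid CRF.

    unary : (H, W, L) array of per-pixel label costs.
    Returns an (H, W) labeling that locally minimises the energy.
    """
    H, W, L = unary.shape
    labels = unary.argmin(axis=2)            # initialise from unaries alone
    for _ in range(iterations):
        for i in range(H):
            for j in range(W):
                costs = unary[i, j].copy()
                # Add the Potts penalty against each 4-neighbour's current label.
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        penalty = np.full(L, pairwise_weight)
                        penalty[labels[ni, nj]] = 0.0
                        costs += penalty
                labels[i, j] = costs.argmin()
    return labels
```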

3 SALIENCY FILTERS

Saliency or salience is the state of standing out and being very easy to see. According to the Longman dictionary, an object is salient when it appears as the most important or noticeable object among other objects. The saliency of an object is interpreted as visual existence and being the central object. Salient objects tend to have a more compact appearance compared to background objects. For example, Figure 2a shows a cat as the main object in the image, dominating the visual perception over all other objects, whereas Figure 2b shows a cat and a baby where the cat is not a salient object.

Fig. 2: (a) The picture on the left shows a cat as a salient object according to human perception. (b) The picture on the right shows a scene of a cat and a baby; the cat is not a salient object, because it shares its appearance with the baby.

Most methods for saliency detection use contrast information. The work of [6] recasts the contrast information into two measurements, element uniqueness and element distribution. The SLIC method of [7] is used to abstract the image into superpixels. The abstraction removes undesired details. The element uniqueness of each element is computed as follows.

\[ U_i = \sum_{j=1}^{N} \lVert c_i - c_j \rVert^2 \, w^{p}_{ij}, \qquad w^{p}_{ij} = \frac{1}{Z_i} \exp\!\left( -\frac{1}{2\sigma_p^2} \lVert p_i - p_j \rVert^2 \right) \tag{3} \]

p_i and c_i denote the position of element i and the color of superpixel i respectively. Equation 4 describes the calculation of the element distribution of each element. The value D_i is computed as a sum of distances between each element position p_j and the weighted mean position \mu_i = \sum_j w^{c}_{ij} p_j, weighted by the color similarity w^{c}_{ij}.

\[ D_i = \sum_{j=1}^{N} \lVert p_j - \mu_i \rVert^2 \, w^{c}_{ij}, \qquad w^{c}_{ij} = \frac{1}{Z_i} \exp\!\left( -\frac{1}{2\sigma_c^2} \lVert c_i - c_j \rVert^2 \right) \tag{4} \]

The location information encodes locality. From another viewpoint, the locality can be expressed as a Gaussian filtering kernel. This allows an approximation that reduces the complexity from O(N^2) to O(N) through the permutohedral lattice [6]. The saliency level S_i of element i is formulated as:

\[ S_i = U_i \cdot \exp(-k \cdot D_i) \tag{5} \]
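The sketch below computes Equations 3-5 directly in O(N^2) over superpixel descriptors, without the permutohedral-lattice acceleration of [6]. The inputs (mean colors and normalized positions per superpixel, e.g. from SLIC) and the parameter values sigma_p, sigma_c, and k are assumptions chosen for illustration, not the settings used in the experiments.

```python
import numpy as np

def saliency_filters(colors, positions, sigma_p=0.25, sigma_c=20.0, k=6.0):
    """Naive O(N^2) version of Equations 3-5.

    colors    : (N, 3) mean color of each superpixel (e.g. CIELab).
    positions : (N, 2) mean position of each superpixel in [0, 1]^2.
    Returns a length-N array of saliency values S_i.
    """
    color_d2 = np.sum((colors[:, None, :] - colors[None, :, :]) ** 2, axis=2)
    pos_d2 = np.sum((positions[:, None, :] - positions[None, :, :]) ** 2, axis=2)

    # Element uniqueness (Eq. 3): color contrast weighted by spatial proximity.
    w_p = np.exp(-pos_d2 / (2.0 * sigma_p ** 2))
    w_p /= w_p.sum(axis=1, keepdims=True)           # Z_i normalisation
    uniqueness = np.sum(color_d2 * w_p, axis=1)

    # Element distribution (Eq. 4): spatial spread weighted by color similarity.
    w_c = np.exp(-color_d2 / (2.0 * sigma_c ** 2))
    w_c /= w_c.sum(axis=1, keepdims=True)
    mu = w_c @ positions                             # weighted mean positions
    spread = np.sum((positions[None, :, :] - mu[:, None, :]) ** 2, axis=2)
    distribution = np.sum(spread * w_c, axis=1)

    # Saliency (Eq. 5): unique and compact elements score highest.
    return uniqueness * np.exp(-k * distribution)
```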

4 DATASETS

The VOC PASCAL 2010 dataset was introduced in the VOC PASCAL 2010 competition [2]. It provides raw JPEG images, PNG annotation files, and an evaluation program written in Matlab. A set of 1928 JPEG and PNG files is provided as the ground truth. Every pixel is classified into one of the classes aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor, and background.

VOC PASCAL 2010 is split into training, validation, and testing portions. The split follows the sizes that [8] determined in his work: 600 images are used for training, 364 for validation, and 964 for testing.
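A minimal sketch of such a split follows, assuming a hypothetical list of image identifiers; only the portion sizes (600/364/964) come from the text, and the actual assignment fixed by [8] is not reproduced here.

```python
import random

# Hypothetical image identifiers; the real VOC PASCAL 2010 lists are not reproduced.
image_ids = [f"2010_{i:06d}" for i in range(1928)]
random.Random(0).shuffle(image_ids)

train_ids = image_ids[:600]        # 600 training images
val_ids = image_ids[600:964]       # 364 validation images
test_ids = image_ids[964:]         # remaining 964 test images
print(len(train_ids), len(val_ids), len(test_ids))
```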


Fig. 3: Samples of VOC images. (a) The original image (JPEG files). (b) The ground truth (PNG files).

5 GOOGLE IMAGES TRANSFORMATION

A keyword can be used to represent a class. Google Images returns images based on the keyword. The author observes that the salient regions of the returned images come from results of the same keyword. Meanwhile, the saliency filters can guide the computer to recognize salient regions. This creates the possibility of segmenting the salient regions and regarding them as part of the object class that the keyword refers to. Figure 4 illustrates the Google Images transformation.

Fig. 4: Google Images can be transformed into a strongly labeled dataset (Google Images, then a saliency detector, then labeled images, for the query 'aeroplane'). A keyword determines the labels of the images. The transformation regards the foreground object as the queried object class (i.e. aeroplane). The aeroplane label class is colored red, while the background label class is colored black.

The Google Images transformation performs binary segmentation to differentiate the salient regions from the background.

Fig. 5: Saliency filters assign a saliency level to each pixel. A binary segmentation uses them as potential functions to separate the salient from the non-salient region (the rightmost image).

The binary segmentation employs a CRF and the saliency filters. From Equation 5, S_i can be regarded as a saliency map that rates the saliency level of every pixel. Figure 5 shows the results of the saliency filters. The saliency map is shown by the image in the middle. It can be used to segment the salient part of an image through a binary segmentation in which the unary potential is represented by S_i [6]. The energy formulation of the CRF is written as follows.

\[ y^{*} = \operatorname*{arg\,min}_{y \in \mathcal{Y}} E(y, l_f, x) \]

\[ E(y, l_f, x) = \sum_{i}^{N} \mathrm{saliency}(y_i, l_f, x) + \sum_{i}^{N} \sum_{j \in N_i} [\, y_i \neq y_j \,] \]

\[ \mathrm{saliency}(y_i, l_f, x) = \begin{cases} 1 - \mathrm{saliency}(i, x) & \text{if } y_i = l_f \\ \mathrm{saliency}(i, x) & \text{otherwise} \end{cases} \]

Here saliency denotes a unary potential function that takes S_i as its value. The rightmost image in Figure 5 is the result of the binary segmentation. l_f tells the inference algorithm which label class the CRF should assign to the foreground. Since l_f has the value doll, the foreground region is regarded as the figure of a doll and the rest is labeled as background. Figure 6 shows some examples of transformation results.
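A minimal sketch of this unary term: given a saliency map S and the foreground label l_f, it builds the per-pixel costs that the binary CRF minimizes together with the Potts pairwise term from Section 2. The thresholding at the end ignores the pairwise term and is only a rough stand-in for the actual inference.

```python
import numpy as np

def saliency_unary(saliency_map):
    """Per-pixel unary costs for the two labels {background, foreground l_f}.

    saliency_map : (H, W) array with values in [0, 1] (S_i from Equation 5).
    Returns an (H, W, 2) array: channel 0 = cost of background,
    channel 1 = cost of the queried foreground class l_f.
    """
    background_cost = saliency_map          # salient pixel -> expensive as background
    foreground_cost = 1.0 - saliency_map    # salient pixel -> cheap as foreground
    return np.stack([background_cost, foreground_cost], axis=2)

# Without the pairwise term the minimiser is a simple threshold at 0.5;
# plugging the same unaries into the grid CRF of Section 2 adds the
# smoothing that produces masks like the rightmost image in Figure 5.
saliency_map = np.random.default_rng(1).random((4, 4))
unary = saliency_unary(saliency_map)
foreground_mask = unary.argmin(axis=2).astype(bool)
print(foreground_mask)
```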

6 EXPERIMENT RESULTS

The experiment scenarios aim to investigate the behaviour of the semantic segmentation under different settings.


Fig. 6: The results of the Google Images transformation. (a) Original Google Image. (b) Labeled image.

The procedure of the experiment mainly consists of two steps, training and testing. In the training phase, a CRF model is learned from the given datasets. In the testing phase, the CRF model is used to predict unseen samples. There are three experiment scenarios. Each scenario follows the steps described above but differs in the dataset composition used for training.

The first scenario compares the performance of two cases: an experiment using the combination of VOC PASCAL 2010 and Google Images, and an experiment using the VOC PASCAL 2010 dataset alone.

The second scenario compares the performance of two cases: an experiment using the VOC PASCAL 2010 dataset and an experiment using Google Images only.

The third scenario compares the performance among several cases, each of which uses a certain number of Google Images only. In this paper, the scenario uses 600, 700, 800, and 900 Google Images respectively. Performance is measured with averaged class accuracy (abbreviated as CA) and global accuracy (abbreviated as GA).
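The two metrics follow their usual definitions: CA averages per-class recall over the classes present in the ground truth, and GA is the fraction of all pixels labeled correctly. The sketch below computes both from label maps under those assumed definitions; it is a generic implementation, not the VOC evaluation script mentioned in Section 4.

```python
import numpy as np

def segmentation_accuracies(prediction, ground_truth, num_classes):
    """Averaged class accuracy (CA) and global accuracy (GA) in percent.

    prediction, ground_truth : integer label maps of the same shape.
    CA averages per-class recall over the classes that occur in the ground
    truth; GA is the fraction of all pixels labeled correctly.
    """
    prediction = prediction.ravel()
    ground_truth = ground_truth.ravel()
    confusion = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(confusion, (ground_truth, prediction), 1)

    ga = 100.0 * np.trace(confusion) / confusion.sum()
    per_class_total = confusion.sum(axis=1)
    present = per_class_total > 0                     # ignore absent classes
    per_class_recall = confusion.diagonal()[present] / per_class_total[present]
    ca = 100.0 * per_class_recall.mean()
    return ca, ga

# Example on a toy 3-class problem.
gt = np.array([[0, 0, 0], [1, 2, 2]])
pred = np.array([[0, 0, 1], [1, 2, 0]])
print(segmentation_accuracies(pred, gt, num_classes=3))  # (~72.2, ~66.7)
```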

Table 1 summarizes the results of the first scenario. The first experiment (CRF+VOC) uses VOC PASCAL 2010 as the training dataset, while the second experiment (CRF+VOC+Google Images) uses VOC PASCAL 2010 and Google Images as the training datasets. The first experiment achieves 11.0450% CA and 79.213% GA. In the second experiment, the CA and GA increase by 0.6592% and 0.0860% respectively. For comparison, the related work [9] reported 13% averaged CA using unary potentials without pairwise potentials. The difference in accuracy between this result and the result of the original work is due to the different image scale: this experiment used images rescaled to half the original size, whereas the original work employed full-sized images. This result shows that Google Images improves the prediction accuracy. Figure 7 shows that the second experiment tends to correctly predict parts that the baseline method cannot, despite imperfect segmentation boundaries. One possible reason is that Google Images introduces novel characteristics that VOC PASCAL 2010 does not provide to the CRF. Based on this result, the combination of VOC and Google Images improves the accuracy of the semantic segmentation by broadening the characteristics of the classes.

Experiment Name            Averaged CA (%)   GA (%)
(CRF+VOC)                  11.0450           79.213
(CRF+VOC+Google Images)    11.7042           79.299

TABLE 1: Summarized results from the first scenario.

Table 2 details the performance for each class. The combination of VOC PASCAL 2010 and Google Images fails to improve the CA of several classes, namely aeroplane, cat, chair, dog, horse, person, potted plant, sofa, and background.

Table 3 summarizes the results of the second scenario. The first experiment (CRF+VOC) uses VOC PASCAL 2010 as the training dataset, while the second experiment (CRF+Google Images) uses Google Images as the training dataset. The first experiment achieves 11.045% CA and 79.213% GA. In the second experiment, the CA and GA decrease by 1.789% and 5.854% respectively.


No   Class          (CRF+VOC)   (CRF+VOC+Google Images)
1    aeroplane      15.1694     12.3211
2    bicycle         0.0000      0.9129
3    bird            3.1685      3.5391
4    boat            5.3229     10.4128
5    bottle          0.7876      3.5560
6    bus            10.5571     14.9576
7    car            14.3507     16.6133
8    cat            12.5929     10.6088
9    chair           4.0795      2.0624
10   cow             1.6487      6.1896
11   diningtable     3.9595      4.0120
12   dog             5.1569      2.2727
13   horse           4.7258      4.7002
14   motorbike      14.9855     16.2211
15   person         24.1621     22.7428
16   pottedplant     1.1087      0.7029
17   sheep          11.7376     12.8854
18   sofa            1.5923      1.2566
19   train          15.5669     15.4514
20   tvmonitor       3.4973      6.9930
21   background     77.7755     77.3758

     Averaged CA    11.0450     11.7042

TABLE 2: Comparison of per-class prediction accuracy (%) in the first scenario between the (CRF+VOC) case and the (CRF+VOC+Google Images) case. The VOC PASCAL 2010 and Google Images combination fails to improve the CA of the aeroplane, cat, chair, dog, horse, person, potted plant, sofa, and background classes.

Experiment Name        Averaged CA (%)   GA (%)
(CRF+VOC)              11.045            79.213
(CRF+Google Images)     9.256            73.359

TABLE 3: Summarized results from the second scenario.

This result confirms that Google Images alone cannot surpass the baseline accuracy. The reason is that the Google Images dataset lacks class variability within a single image: each image usually contains only a particular object class and the background class. This removes the opportunity for the CRF to learn correlations between object classes, so it can hardly perform multiclass segmentation in the testing phase.

Table 4 details the performance for each class. Training with the Google Images dataset alone fails for most classes.

Table 5 summarizes the results of the third scenario. The prediction accuracy decreases as the number of Google Images increases. The number of images needed to achieve an optimal result might depend on the complexity of the objects.

No   Class          (CRF+VOC)   (CRF+Google Images)
1    aeroplane      15.169       8.012
2    bicycle         0.000       0.058
3    bird            3.169       2.138
4    boat            5.323       8.056
5    bottle          0.788       0.000
6    bus            10.557      17.212
7    car            14.351       4.112
8    cat            12.593       9.461
9    chair           4.080       0.554
10   cow             1.649       6.106
11   diningtable     3.960       0.654
12   dog             5.157       1.753
13   horse           4.726       3.891
14   motorbike      14.986       7.364
15   person         24.162       9.664
16   pottedplant     1.109       2.207
17   sheep          11.738      13.959
18   sofa            1.592       0.721
19   train          15.567      17.521
20   tvmonitor       3.497       5.352
21   background     77.776      78.314

     Averaged CA    11.045       9.256

TABLE 4: Comparison of per-class prediction accuracy (%) in the second scenario between the (CRF+VOC) case and the (CRF+Google Images) case.

No   Class             600      700      800      900
1    aeroplane         8.012    7.610   11.095    9.477
2    bicycle           0.058    0.465    0.880    1.431
3    bird              2.138    0.743    0.566    0.959
4    boat              8.056   10.260    7.982    7.810
5    bottle            0.000    0.041    0.000    1.266
6    bus              17.212   19.210   19.067   18.783
7    car               4.112    4.845    6.014    6.013
8    cat               9.461    7.204    6.695    8.359
9    chair             0.554    0.260    0.085    0.379
10   cow               6.106    7.961    5.635    8.121
11   diningtable       0.654    0.912    1.153    0.623
12   dog               1.753    3.890    2.775    2.566
13   horse             3.891    4.651    4.198    5.464
14   motorbike         7.364    4.887    6.762    5.064
15   person            9.664    9.677    7.442    7.479
16   pottedplant       2.207    2.039    3.603    1.691
17   sheep            13.959   11.578   10.771    9.975
18   sofa              0.721    0.922    1.242    0.371
19   train            17.521   17.446   13.478   12.826
20   tvmonitor         5.352    3.548    5.188    4.485
21   background       75.578   75.459   75.690   75.170

     Averaged CA       9.256    9.219    9.063    8.967
     Global Accuracy  73.359   73.503   73.770   73.046

TABLE 5: Comparison of per-class prediction accuracy (%) for different numbers of Google Images.

    7 CONCLUSIONS AND FUTURE WORKS

This research proposes Google Images as a source of training data. Google Images is converted into a strongly labeled dataset by saliency filtering. The performance improvement varies across the scenarios. Combining datasets from both VOC PASCAL 2010 and Google Images increases the prediction accuracy.


Fig. 7: Example results from the first scenario. (a) The original images. (b) The ground-truth labeled images. (c) The results of the first experiment (CRF+VOC). (d) The results of the second experiment (CRF+VOC+Google Images).

Google Images helps the semantic segmentation enlarge the class characteristics. On the other hand, using Google Images alone does not improve the performance. Furthermore, adding more Google Images does not lead to better performance.

The author realizes that this research leaves many things to explore. It requires an investigation of the effective number of Google Images, because the experiment shows that adding more data does not increase the performance. The experiments have not established an exact rate of improvement that explains how many Google Images are needed to achieve a given accuracy. In addition, the keywords can affect the search results in several ways. There might be a better keyword to describe an object, which could give a better result; choosing the right keyword would be an interesting problem.


REFERENCES

[1] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context," Int. J. Comput. Vision, vol. 81, no. 1, pp. 2-23, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1007/s11263-007-0109-1

[2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results," http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html, 2010.

[3] Research.microsoft.com, "Object class recognition - Microsoft Research," 2015. [Online]. Available: http://research.microsoft.com/en-us/projects/objectclassrecognition/

[4] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 6, no. 6, pp. 721-741, Nov. 1984. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.1984.4767596

[5] J. M. Hammersley and P. Clifford, "Markov fields on finite graphs and lattices," 1971.

[6] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, "Saliency filters: Contrast based filtering for salient region detection," in CVPR, 2012, pp. 733-740.

[7] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels," EPFL, Tech. Rep. 149300, June 2010.

[8] P. Kraehenbuehl, "Efficient inference in fully connected CRFs with Gaussian edge potentials," 2014. [Online]. Available: http://graphics.stanford.edu/projects/densecrf/

[9] P. Krahenbuhl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 109-117. [Online]. Available: http://papers.nips.cc/paper/4296-efficient-inference-in-fully-connected-crfs-with-gaussian-edge-potentials.pdf