Paper introduction: Fast Image Tagging

Fast Image Tagging. M. Chen (Amazon.com), A. Zheng (MSR, Redmond), and K. Weinberger (Washington Univ.). ICML 2013. ICML 2013 reading group, July 9, 2013. Preferred Infrastructure, Inc. Takashi Abe <[email protected]>

Transcript of: Paper introduction: Fast Image Tagging

Page 1: Paper introduction: Fast Image Tagging

Fast Image Tagging
M. Chen (Amazon.com), A. Zheng (MSR, Redmond), and K. Weinberger (Washington Univ.)

ICML 2013

ICML 2013 reading group, July 9, 2013

Preferred Infrastructure, Inc.

Takashi Abe <[email protected]>

Page 2: Paper introduction: Fast Image Tagging

About me

Takashi Abe (阿部厳)

Twitter: @tabe2314

Okatani Lab, Tohoku University (computer vision) → PFI intern → new employee at PFI

Page 3: Paper introduction: Fast Image Tagging

The paper

M. Chen, A. Zheng and K. Weinberger. Fast Image Tagging. ICML, 2013.

Note: the figures and tables in these slides are quoted from this paper.

Page 4: Paper introduction: Fast Image Tagging

Image Tagging (1)

Estimate a set of relevant tags from an image.

training:

Input: {(image, tags), …}; output: a mapping image → tag set

testing:

Input: an image; output: the estimated tag set

[Slide figure: example training images with tags such as "bear, polar, snow, tundra" and "buildings, clothes, shops, street", and a test image whose tags are to be predicted (???).]

Page 5: Paper introduction: Fast Image Tagging

Image Tagging (2): what makes it hard?

Effective features differ from object to object → we want to use many kinds of features

Wide variation in appearance → we want to use large datasets

Incomplete annotation data

Precision aside, only low-recall data is available

(i.e., the data we get has some of the true tags missing)

Example: Flickr tags

Skewed tag-frequency distribution

[Slide figure: color features vs. edge features]

Page 6: Paper introduction: Fast Image Tagging

FastTag

Page 7: Paper introduction: Fast Image Tagging

Basic idea

Complete the annotated tag sets while learning a (linear) mapping from images to the completed tags.

B: tag set → completed tag set

W: image features → completed tag set

Training: jointly minimize the co-regularization loss (1) together with the blank-out regularizer r(B) on B and an ℓ2 penalty on W (see the excerpt below)

[Figure 1. Schematic illustration of FastTag. During training, two classifiers B and W are learned and co-regularized to predict similar results: the enricher B maps incomplete user tags y (e.g. "snow, lake, feet") to predicted relevant tags (e.g. "mountain, snow, sky, lake, water, feet, legs, boat, trees"), while W maps visual features x to the same predicted tags. At testing time, a simple linear mapping x → Wx predicts tags from image features.]

(Excerpt from the paper:)

[…] jointly convex and has closed-form solutions in each iteration of the optimization.

Co-regularized learning. As we are only provided with an incomplete set of tags, we create an additional auxiliary problem and obtain two sub-tasks: 1) training an image classifier x_i → W x_i that predicts the complete tag set from image features, and 2) training a mapping y_i → B y_i to enrich the existing sparse tag vector y_i by estimating which tags are likely to co-occur with those already in y_i. We train both classifiers simultaneously and force their output to agree by minimizing

    (1/n) Σ_{i=1}^{n} ‖B y_i − W x_i‖².    (1)

Here, B y_i is the enriched tag set for the i-th training image, and each row of W contains the weights of a linear classifier that tries to predict the corresponding (enriched) tag based on image features.

The loss function as currently written has a trivial solution at B = 0 = W, suggesting that the current formulation is underconstrained. We next describe additional regularizations on B that guide the solution toward something more useful.

Marginalized blank-out regularization. We take inspiration from the idea of marginalized stacked denoising autoencoders (Chen et al., 2012) and related works (?) in formulating the tag enrichment mapping B : {0, 1}^T → R^T. Our intention is to enrich the incomplete user tags by turning on relevant keywords that should have been tagged but were not. Imagine that the observed tags y are randomly sampled from the complete set of tags: it is a "corrupted" version of the original set. We leverage this insight and train the enrichment mapping B to reverse the corruption process. To this end, we construct a further corrupted version ỹ of the observed tags y and train B to reconstruct y from ỹ. If this secondary corruption mechanism matches the original corruption mechanism, then re-applying B to y would recover the likely original pristine tag set.

For simplicity, we use uniform corruption as the secondary corruption mechanism. In practice, human labelers may select tags with bias, not uniform probability. We can approximate the unknown corrupting distribution with piecewise uniform corruption in the learning step (see section 3.2 of the paper). If prior knowledge on the original corruption mechanism is available, it can also easily be incorporated into our model.

More formally, for each y, a corrupted version ỹ is created by randomly removing (i.e., setting to zero) each entry in y with some probability p ≥ 0; therefore, for each user tag vector y and dimension t, p(ỹ_t = 0) = p and p(ỹ_t = y_t) = 1 − p. We train B to optimize

    B = argmin_B (1/n) Σ_{i=1}^{n} ‖y_i − B ỹ_i‖².

Here, each row of B is an ordinary least squares regressor that predicts the presence of a tag given all existing tags in ỹ. To reduce variance in B, we take repeated samples of ỹ. In the limit (with infinitely many corrupted versions of y), the expected reconstruction error under the corrupting distribution can be expressed as

    r(B) = (1/n) Σ_{i=1}^{n} E[‖y_i − B ỹ_i‖²]_{p(ỹ_i | y_i)}.    (2)

Let us denote as Y ≡ [y_1, · · · , y_n] the matrix containing the partial labels for each image in each column. Define P ≡ Σ_{i=1}^{n} y_i E[ỹ_i]ᵀ and Q ≡ Σ_{i=1}^{n} E[ỹ_i ỹ_iᵀ]; then we can rewrite the loss in (2) as

    r(B) = (1/n) trace(B Q Bᵀ − 2 P Bᵀ + Y Yᵀ).    (3)

We use Eq. (3) to regularize B. For the uniform "blank-out" noise introduced above, the expected value of the corruptions is E[ỹ]_{p(ỹ|y)} = (1 − p) y, […]
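As a concrete reading of Eq. (1), the co-regularization term is just a mean squared disagreement between the two classifiers. A minimal numpy sketch (the matrix layout, with one image per column, is my own assumption):

```python
import numpy as np

def coreg_loss(B, W, Y, X):
    """Eq. (1): (1/n) * sum_i ||B y_i - W x_i||^2.

    Y: T x n binary tag matrix (one image per column)
    X: d x n feature matrix   (one image per column)
    B: T x T tag-enrichment map; W: T x d linear tagger.
    """
    n = Y.shape[1]
    R = B @ Y - W @ X          # T x n matrix of per-image residuals
    return (R ** 2).sum() / n

# Note: B = 0 = W drives this loss to zero, which is exactly the
# trivial solution the blank-out regularizer r(B) is meant to rule out.
```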

Page 8: Paper introduction: Fast Image Tagging

Marginalized blank-out regularization (1)

Naively minimizing (1) yields the trivial solution B = 0 = W, so a constraint is needed

We want B to map the annotation y_i to the true tag set z_i

Since z_i is unavailable, consider a B that reconstructs y_i from ỹ_i, obtained by dropping each element of y_i independently with probability p

Considering repeated generation of ỹ_i, the expected reconstruction error is Eq. (2) of the paper
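The blank-out corruption and the sampled reconstruction error can be sketched directly (a hypothetical illustration; the paper never actually samples, since the expectation is later taken in closed form):

```python
import numpy as np

def blank_out(y, p, rng):
    """Corrupt a binary tag vector: drop each entry independently with prob. p."""
    keep = rng.random(y.shape) >= p
    return y * keep

def sampled_recon_error(B, Y, p, rng, n_samples=100):
    """Monte-Carlo estimate of (1/n) sum_i E||y_i - B y~_i||^2."""
    n = Y.shape[1]
    total = 0.0
    for _ in range(n_samples):
        Ytil = blank_out(Y, p, rng)                 # fresh corrupted copy
        total += ((Y - B @ Ytil) ** 2).sum() / n
    return total / n_samples
```

With p = 0 the corruption is the identity, so an identity B gives exactly zero error, which is a quick sanity check.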


Page 9: Paper introduction: Fast Image Tagging

Marginalized blank-out regularization (2)

Rewriting (2) gives the closed form (3): r(B) = (1/n) trace(BQBᵀ − 2PBᵀ + YYᵀ)

So there is no need to actually generate ỹ_i (we only need to choose p)

This expected reconstruction error is added to the loss function
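For uniform blank-out noise the moments have closed forms: E[ỹ] = (1−p)y, and E[ỹỹᵀ] equals (1−p)² yyᵀ off the diagonal with (1−p)·y on the diagonal. A sketch of Eq. (3) under these assumptions (variable names are my own; no corrupted samples are ever generated):

```python
import numpy as np

def blankout_moments(Y, p):
    """P = sum_i y_i E[y~_i]^T and Q = sum_i E[y~_i y~_i^T] for uniform blank-out."""
    S = Y @ Y.T                                          # sum_i y_i y_i^T
    P = (1 - p) * S
    Q = (1 - p) ** 2 * S + p * (1 - p) * np.diag(np.diag(S))
    return P, Q

def r_closed_form(B, Y, p):
    """Eq. (3): (1/n) trace(B Q B^T - 2 P B^T + Y Y^T)."""
    n = Y.shape[1]
    P, Q = blankout_moments(Y, p)
    return (np.trace(B @ Q @ B.T) - 2 * np.trace(P @ B.T)
            + np.trace(Y @ Y.T)) / n
```

With p = 0 this reduces to the plain reconstruction error (1/n) Σ_i ‖y_i − B y_i‖², which is an easy sanity check.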


Page 10: Paper introduction: Fast Image Tagging

Optimization

With B fixed, the W minimizing (5) is given by a closed-form expression

Likewise, with W fixed, B has a closed-form solution

Alternating the two converges to the global optimum (jointly convex)
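The slide's formula images are missing, so here is a hedged sketch of the alternating closed-form updates, assuming an objective of the form ℓ(B, W) = (1/n)‖BY − WX‖²_F + γ·r(B) + λ‖W‖²_F with r(B) as in Eq. (3). This is a reconstruction from those pieces, not the paper's exact Eq. (5):

```python
import numpy as np

def blankout_moments(Y, p):
    S = Y @ Y.T
    return (1 - p) * S, (1 - p) ** 2 * S + p * (1 - p) * np.diag(np.diag(S))

def fasttag_train(X, Y, p=0.3, gamma=1.0, lam=0.1, n_iter=10):
    """Alternate exact minimizations over W and B (sketch, assumed objective)."""
    T, n = Y.shape
    d = X.shape[0]
    P, Q = blankout_moments(Y, p)
    B = np.eye(T)
    for _ in range(n_iter):
        # W-step: min_W (1/n)||B Y - W X||^2_F + lam ||W||^2_F
        W = B @ Y @ X.T @ np.linalg.inv(X @ X.T + n * lam * np.eye(d))
        # B-step: min_B (1/n)||B Y - W X||^2_F + (gamma/n) tr(B Q B^T - 2 P B^T)
        B = (W @ X @ Y.T + gamma * P) @ np.linalg.inv(Y @ Y.T + gamma * Q)
    return B, W
```

Each sub-problem is a convex quadratic, so each update can only decrease the assumed objective, matching the slide's claim that alternating closed-form solves suffice.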

Page 11: Paper introduction: Fast Image Tagging

Extensions

Tag bootstrapping

Since B is learned from tag co-occurrence, similar tags that do not co-occur are not filled in (e.g., lake and pond)

stacking

Retrain with By_i as the new annotations, and repeat

Intuitively, this propagates co-occurrence relations(?)

The number of stacking levels is chosen experimentally
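The stacking idea can be sketched as a loop: fit an enricher B on the current annotations, replace the annotations with the thresholded enriched tags, and repeat. A hypothetical sketch (the closed-form fit B = P Q⁻¹, the 0.5 threshold, and keeping original tags switched on are my own simplifications, not the paper's exact procedure):

```python
import numpy as np

def fit_enricher(Y, p=0.3, eps=1e-6):
    """Closed-form denoising regressor B = P Q^{-1} under uniform blank-out."""
    S = Y @ Y.T
    P = (1 - p) * S
    Q = (1 - p) ** 2 * S + p * (1 - p) * np.diag(np.diag(S))
    return P @ np.linalg.inv(Q + eps * np.eye(Y.shape[0]))

def stack_tags(Y, n_stacks=2, threshold=0.5):
    """Repeatedly enrich the tag matrix, never dropping the original tags."""
    Ycur = Y.astype(float)
    for _ in range(n_stacks):
        B = fit_enricher(Ycur)
        Ycur = np.maximum(Y, (B @ Ycur > threshold).astype(float))
    return Ycur
```

Each stage can only switch tags on, which is one way to read the slide's intuition that co-occurrence relations get propagated across stages.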

Page 12: Paper introduction: Fast Image Tagging

Image features

A combination of multiple features (the same set as existing methods)

GIST

6 kinds of color histograms

Bag-of-words over 8 kinds of local features

The features are first mapped into a space where the inner product approximates the χ² kernel

Vedaldi, A. and Zisserman, A. Efficient additive kernels via explicit feature maps. PAMI, 34(3):480–492, 2012.
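The explicit map of Vedaldi & Zisserman approximates the additive χ² kernel k(x, y) = Σ_d 2 x_d y_d / (x_d + y_d) with a finite-dimensional embedding built from the kernel's spectrum κ(λ) = sech(πλ). A sketch of that construction (the sampling period L and number of frequencies n here are illustrative defaults, not the paper's settings):

```python
import numpy as np

def chi2_feature_map(x, n=4, L=0.5):
    """Map a non-negative vector x so that phi(x).phi(y) ~ sum_d 2 x_d y_d/(x_d+y_d).

    Discretizes the chi^2 kernel's spectrum kappa(lambda) = sech(pi*lambda)
    at frequencies j*L, j = 0..n. Output dimension is (2n+1)*len(x).
    """
    x = np.asarray(x, dtype=float)
    logx = np.log(np.where(x > 0, x, 1.0))   # log only matters where x > 0
    feats = []
    for j in range(n + 1):
        kappa = 1.0 / np.cosh(np.pi * j * L)
        if j == 0:
            feats.append(np.sqrt(x * L * kappa))
        else:
            feats.append(np.sqrt(2 * x * L * kappa) * np.cos(j * L * logx))
            feats.append(np.sqrt(2 * x * L * kappa) * np.sin(j * L * logx))
    return np.concatenate(feats)
```

After this mapping, the plain dot product (and hence a linear W) behaves approximately like the χ² kernel, which is what lets FastTag stay linear while using kernel-style similarities.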

Page 13: Paper introduction: Fast Image Tagging

Experimental results

Page 14: Paper introduction: Fast Image Tagging

Accuracy evaluation

leastSquares: FastTag without tag completion; the baseline

TagProp: the previous state of the art; O(n²) training, O(n) testing

FastTag's accuracy is almost the same as TagProp's
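The reported metrics (per-keyword precision P, recall R, their F1, and N+, the number of keywords with non-zero recall) can be sketched as follows. This assumes the standard protocol of averaging precision and recall over keywords; treat it as an illustration, not the paper's exact evaluation code:

```python
def tag_metrics(true_tags, pred_tags):
    """Per-keyword mean precision/recall, F1, and N+ over a list of images.

    true_tags, pred_tags: lists of sets of tag names, one set per image.
    A keyword that is never predicted counts as precision 0.
    """
    keywords = set().union(*true_tags)
    precs, recs, n_plus = [], [], 0
    for t in sorted(keywords):
        tp = sum(1 for gt, pr in zip(true_tags, pred_tags) if t in gt and t in pr)
        n_pred = sum(1 for pr in pred_tags if t in pr)
        n_true = sum(1 for gt in true_tags if t in gt)
        precs.append(tp / n_pred if n_pred else 0.0)
        rec = tp / n_true if n_true else 0.0
        recs.append(rec)
        n_plus += rec > 0
    P = sum(precs) / len(precs)
    R = sum(recs) / len(recs)
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1, n_plus
```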

(Excerpt from the paper:)

[Figure 2. Predicted keywords using FastTag for sample images in the ESP game dataset (using all 268 keywords); examples are grouped into high F-1 score, low F-1 score, and random.]

Table 1. Comparison of FastTag and TagProp in terms of P, R, F1 score and N+ on the Corel5K dataset. Previously reported results using other image annotation techniques are also included for reference.

    Name                                 P    R    F1   N+
    leastSquares                         29   32   30   125
    CRM (Lavrenko et al., 2003)          16   19   17   107
    InfNet (Metzler & Manmatha, 2004)    17   24   20   112
    NPDE (Yavlinsky et al., 2005)        18   21   19   114
    SML (Carneiro et al., 2007)          23   29   26   137
    MBRM (Feng et al., 2004)             24   25   24   122
    TGLM (Liu et al., 2009)              25   29   27   131
    JEC (Makadia et al., 2008)           27   32   29   139
    TagProp (Guillaumin et al., 2009)    33   42   37   160
    FastTag                              32   43   37   166

[…] report the number of keywords with non-zero recall value (N+). In all metrics a higher value indicates better performance.

Baselines. We compare against leastSquares, a ridge regression model which uses the partial subset of tags y_1, …, y_n as labels to learn W, i.e., FastTag without tag enrichment. We also compare against the TagProp algorithm (Guillaumin et al., 2009), a local kNN method combining different distance metrics through metric learning. It is the current best performer on these benchmark sets. Most existing works do not provide publicly available implementations. As a result, we include their previously reported results for reference (Lavrenko et al., 2003; Metzler & Manmatha, 2004; Yavlinsky et al., 2005; Carneiro et al., 2007; Feng et al., 2004; Liu et al., 2009; Makadia et al., 2008).

Table 2. Comparison of FastTag and TagProp in terms of P, R, F1 score and N+ on the ESP game and IAPR TC-12 datasets.

                    ESP game              IAPR
    Name            P    R    F1   N+    P    R    F1   N+
    leastSquares    35   19   25   215   40   19   26   198
    MBRM            18   19   18   209   24   23   23   223
    JEC             24   19   21   222   29   19   23   211
    TagProp         39   27   32   238   45   34   39   260
    FastTag         46   22   30   247   47   26   34   280

4.2. Comparison with related work

Table 1 shows a detailed comparison of FastTag to the leastSquares baseline and eight published results on the Corel5K dataset. We can make three observations: 1. The performance of FastTag aligns with that of TagProp (so far the best algorithm in terms of accuracy on this dataset), and significantly outperforms the other methods; 2. The leastSquares baseline, which corresponds to FastTag without the tag enricher, performs surprisingly well compared to existing approaches, which suggests the advantage of a simple model that can extend to a large number of visual descriptors, as opposed to a complex model that can afford fewer descriptors. One may instead more cheaply glean the benefits of a complex model via non-linear transformation of the features. 3. The duo classifier formulation of FastTag, which adds the tag enricher, alleviates the intrinsic label sparsity problem of image annotation. It leads to a 10% improvement on precision, 28% on recall, and an overall 20% improvement on F1 score over the leastSquares baseline. We also […]


Page 15: Paper introduction: Fast Image Tagging

Maximum number of tags

Page 16: Paper introduction: Fast Image Tagging

[Figure 2 from the paper: predicted keywords using FastTag for sample images in the ESP game dataset (using all 268 keywords).]