Paper introduction: Fast Image Tagging

Fast Image Tagging. M. Chen (Amazon.com), A. Zheng (MSR, Redmond), and K. Weinberger (Washington Univ.). ICML 2013. ICML 2013 reading group, July 9, 2013. Preferred Infrastructure, Inc. Takashi Abe <[email protected]>

Transcript of: Paper introduction: Fast Image Tagging

Page 1: Paper introduction: Fast Image Tagging

Fast Image Tagging
M. Chen (Amazon.com), A. Zheng (MSR, Redmond), and K. Weinberger (Washington Univ.)

ICML 2013

ICML 2013 reading group, July 9, 2013

Preferred Infrastructure, Inc.

Takashi Abe <[email protected]>

Page 2: Paper introduction: Fast Image Tagging

About me

Takashi Abe (阿部厳)

Twitter: @tabe2314

Okatani Lab, Tohoku University (computer vision) → PFI intern → new employee at PFI

Page 3: Paper introduction: Fast Image Tagging

The paper

M. Chen, A. Zheng and K. Weinberger. Fast Image Tagging. ICML, 2013.

Note: the figures and tables in these slides are quoted from this paper.

Page 4: Paper introduction: Fast Image Tagging

Image Tagging (1)

Estimate a set of relevant tags from an image.

training:

Input: {(image, tags), …}; output: a mapping image → tag set

testing:

Input: an image; output: the estimated tag set

[Slide figure: example training images with tags such as "bear, polar, snow, tundra" and "buildings, clothes, shops, street", and a test image whose tags are to be predicted (???).]

Page 5: Paper introduction: Fast Image Tagging

Image Tagging (2): what makes it hard?

Effective features differ from object to object → we want to use many kinds of features

Wide variation in appearance → we want to use large datasets

Incomplete annotation data

Precision aside, only low-recall data is available

(i.e., the data we get has some of the true tags missing)

Example: Flickr tags

Skewed tag-frequency distribution

[Slide figure: color features vs. edge features]

Page 6: Paper introduction: Fast Image Tagging

FastTag

Page 7: Paper introduction: Fast Image Tagging

Basic idea

Complete the annotated tag sets while learning a (linear) mapping from images to the completed tags.

B: tag set → completed tag set

W: image features → completed tag set

Training: jointly minimize the co-regularization loss (1) together with the blank-out regularizer r(B) on B and an ℓ2 penalty on W (see the excerpt below)

[Figure 1. Schematic illustration of FastTag. During training, two classifiers B and W are learned and co-regularized to predict similar results: the enricher B maps incomplete user tags y (e.g. "snow, lake, feet") to predicted relevant tags (e.g. "mountain, snow, sky, lake, water, feet, legs, boat, trees"), while W maps visual features x to the same predicted tags. At testing time, a simple linear mapping x → Wx predicts tags from image features.]

(Excerpt from the paper:)

[…] jointly convex and has closed-form solutions in each iteration of the optimization.

Co-regularized learning. As we are only provided with an incomplete set of tags, we create an additional auxiliary problem and obtain two sub-tasks: 1) training an image classifier x_i → W x_i that predicts the complete tag set from image features, and 2) training a mapping y_i → B y_i to enrich the existing sparse tag vector y_i by estimating which tags are likely to co-occur with those already in y_i. We train both classifiers simultaneously and force their output to agree by minimizing

    (1/n) Σ_{i=1}^{n} ‖B y_i − W x_i‖².    (1)

Here, B y_i is the enriched tag set for the i-th training image, and each row of W contains the weights of a linear classifier that tries to predict the corresponding (enriched) tag based on image features.

The loss function as currently written has a trivial solution at B = 0 = W, suggesting that the current formulation is underconstrained. We next describe additional regularizations on B that guide the solution toward something more useful.

Marginalized blank-out regularization. We take inspiration from the idea of marginalized stacked denoising autoencoders (Chen et al., 2012) and related works (?) in formulating the tag enrichment mapping B : {0, 1}^T → R^T. Our intention is to enrich the incomplete user tags by turning on relevant keywords that should have been tagged but were not. Imagine that the observed tags y are randomly sampled from the complete set of tags: it is a "corrupted" version of the original set. We leverage this insight and train the enrichment mapping B to reverse the corruption process. To this end, we construct a further corrupted version ỹ of the observed tags y and train B to reconstruct y from ỹ. If this secondary corruption mechanism matches the original corruption mechanism, then re-applying B to y would recover the likely original pristine tag set.

For simplicity, we use uniform corruption as the secondary corruption mechanism. In practice, human labelers may select tags with bias, not uniform probability. We can approximate the unknown corrupting distribution with piecewise uniform corruption in the learning step (see section 3.2 of the paper). If prior knowledge on the original corruption mechanism is available, it can also easily be incorporated into our model.

More formally, for each y, a corrupted version ỹ is created by randomly removing (i.e., setting to zero) each entry in y with some probability p ≥ 0; therefore, for each user tag vector y and dimension t, p(ỹ_t = 0) = p and p(ỹ_t = y_t) = 1 − p. We train B to optimize

    B = argmin_B (1/n) Σ_{i=1}^{n} ‖y_i − B ỹ_i‖².

Here, each row of B is an ordinary least squares regressor that predicts the presence of a tag given all existing tags in ỹ. To reduce variance in B, we take repeated samples of ỹ. In the limit (with infinitely many corrupted versions of y), the expected reconstruction error under the corrupting distribution can be expressed as

    r(B) = (1/n) Σ_{i=1}^{n} E[‖y_i − B ỹ_i‖²]_{p(ỹ_i | y_i)}.    (2)

Let us denote as Y ≡ [y_1, · · · , y_n] the matrix containing the partial labels for each image in each column. Define P ≡ Σ_{i=1}^{n} y_i E[ỹ_i]ᵀ and Q ≡ Σ_{i=1}^{n} E[ỹ_i ỹ_iᵀ]; then we can rewrite the loss in (2) as

    r(B) = (1/n) trace(B Q Bᵀ − 2 P Bᵀ + Y Yᵀ).    (3)

We use Eq. (3) to regularize B. For the uniform "blank-out" noise introduced above, the expected value of the corruptions is E[ỹ]_{p(ỹ|y)} = (1 − p) y, […]
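As a concrete reading of Eq. (1), the co-regularization term is just a mean squared disagreement between the two classifiers. A minimal numpy sketch (the matrix layout, with one image per column, is my own assumption):

```python
import numpy as np

def coreg_loss(B, W, Y, X):
    """Eq. (1): (1/n) * sum_i ||B y_i - W x_i||^2.

    Y: T x n binary tag matrix (one image per column)
    X: d x n feature matrix   (one image per column)
    B: T x T tag-enrichment map; W: T x d linear tagger.
    """
    n = Y.shape[1]
    R = B @ Y - W @ X          # T x n matrix of per-image residuals
    return (R ** 2).sum() / n

# Note: B = 0 = W drives this loss to zero, which is exactly the
# trivial solution the blank-out regularizer r(B) is meant to rule out.
```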

Page 8: Paper introduction: Fast Image Tagging

Marginalized blank-out regularization (1)

Naively minimizing (1) yields the trivial solution B = 0 = W, so a constraint is needed

We want B to map the annotation y_i to the true tag set z_i

Since z_i is unavailable, consider a B that reconstructs y_i from ỹ_i, obtained by dropping each element of y_i independently with probability p

Considering repeated generation of ỹ_i, the expected reconstruction error is Eq. (2) of the paper
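The blank-out corruption and the sampled reconstruction error can be sketched directly (a hypothetical illustration; the paper never actually samples, since the expectation is later taken in closed form):

```python
import numpy as np

def blank_out(y, p, rng):
    """Corrupt a binary tag vector: drop each entry independently with prob. p."""
    keep = rng.random(y.shape) >= p
    return y * keep

def sampled_recon_error(B, Y, p, rng, n_samples=100):
    """Monte-Carlo estimate of (1/n) sum_i E||y_i - B y~_i||^2."""
    n = Y.shape[1]
    total = 0.0
    for _ in range(n_samples):
        Ytil = blank_out(Y, p, rng)                 # fresh corrupted copy
        total += ((Y - B @ Ytil) ** 2).sum() / n
    return total / n_samples
```

With p = 0 the corruption is the identity, so an identity B gives exactly zero error, which is a quick sanity check.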


Page 9: Paper introduction: Fast Image Tagging

Marginalized blank-out regularization (2)

Rewriting (2) gives the closed form (3): r(B) = (1/n) trace(BQBᵀ − 2PBᵀ + YYᵀ)

So there is no need to actually generate ỹ_i (we only need to choose p)

This expected reconstruction error is added to the loss function
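For uniform blank-out noise the moments have closed forms: E[ỹ] = (1−p)y, and E[ỹỹᵀ] equals (1−p)² yyᵀ off the diagonal with (1−p)·y on the diagonal. A sketch of Eq. (3) under these assumptions (variable names are my own; no corrupted samples are ever generated):

```python
import numpy as np

def blankout_moments(Y, p):
    """P = sum_i y_i E[y~_i]^T and Q = sum_i E[y~_i y~_i^T] for uniform blank-out."""
    S = Y @ Y.T                                          # sum_i y_i y_i^T
    P = (1 - p) * S
    Q = (1 - p) ** 2 * S + p * (1 - p) * np.diag(np.diag(S))
    return P, Q

def r_closed_form(B, Y, p):
    """Eq. (3): (1/n) trace(B Q B^T - 2 P B^T + Y Y^T)."""
    n = Y.shape[1]
    P, Q = blankout_moments(Y, p)
    return (np.trace(B @ Q @ B.T) - 2 * np.trace(P @ B.T)
            + np.trace(Y @ Y.T)) / n
```

With p = 0 this reduces to the plain reconstruction error (1/n) Σ_i ‖y_i − B y_i‖², which is an easy sanity check.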


Page 10: Paper introduction: Fast Image Tagging

Optimization

With B fixed, the W minimizing (5) is given by a closed-form expression

Likewise, with W fixed, B has a closed-form solution

Alternating the two converges to the global optimum (jointly convex)
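The slide's formula images are missing, so here is a hedged sketch of the alternating closed-form updates, assuming an objective of the form ℓ(B, W) = (1/n)‖BY − WX‖²_F + γ·r(B) + λ‖W‖²_F with r(B) as in Eq. (3). This is a reconstruction from those pieces, not the paper's exact Eq. (5):

```python
import numpy as np

def blankout_moments(Y, p):
    S = Y @ Y.T
    return (1 - p) * S, (1 - p) ** 2 * S + p * (1 - p) * np.diag(np.diag(S))

def fasttag_train(X, Y, p=0.3, gamma=1.0, lam=0.1, n_iter=10):
    """Alternate exact minimizations over W and B (sketch, assumed objective)."""
    T, n = Y.shape
    d = X.shape[0]
    P, Q = blankout_moments(Y, p)
    B = np.eye(T)
    for _ in range(n_iter):
        # W-step: min_W (1/n)||B Y - W X||^2_F + lam ||W||^2_F
        W = B @ Y @ X.T @ np.linalg.inv(X @ X.T + n * lam * np.eye(d))
        # B-step: min_B (1/n)||B Y - W X||^2_F + (gamma/n) tr(B Q B^T - 2 P B^T)
        B = (W @ X @ Y.T + gamma * P) @ np.linalg.inv(Y @ Y.T + gamma * Q)
    return B, W
```

Each sub-problem is a convex quadratic, so each update can only decrease the assumed objective, matching the slide's claim that alternating closed-form solves suffice.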

Page 11: Paper introduction: Fast Image Tagging

Extensions

Tag bootstrapping

Since B is learned from tag co-occurrence, similar tags that do not co-occur are not filled in (e.g., lake and pond)

stacking

Retrain with By_i as the new annotations, and repeat

Intuitively, this propagates co-occurrence relations(?)

The number of stacking levels is chosen experimentally
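The stacking idea can be sketched as a loop: fit an enricher B on the current annotations, replace the annotations with the thresholded enriched tags, and repeat. A hypothetical sketch (the closed-form fit B = P Q⁻¹, the 0.5 threshold, and keeping original tags switched on are my own simplifications, not the paper's exact procedure):

```python
import numpy as np

def fit_enricher(Y, p=0.3, eps=1e-6):
    """Closed-form denoising regressor B = P Q^{-1} under uniform blank-out."""
    S = Y @ Y.T
    P = (1 - p) * S
    Q = (1 - p) ** 2 * S + p * (1 - p) * np.diag(np.diag(S))
    return P @ np.linalg.inv(Q + eps * np.eye(Y.shape[0]))

def stack_tags(Y, n_stacks=2, threshold=0.5):
    """Repeatedly enrich the tag matrix, never dropping the original tags."""
    Ycur = Y.astype(float)
    for _ in range(n_stacks):
        B = fit_enricher(Ycur)
        Ycur = np.maximum(Y, (B @ Ycur > threshold).astype(float))
    return Ycur
```

Each stage can only switch tags on, which is one way to read the slide's intuition that co-occurrence relations get propagated across stages.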

Page 12: Paper introduction: Fast Image Tagging

Image features

A combination of multiple features (the same set as existing methods)

GIST

6 kinds of color histograms

Bag-of-words over 8 kinds of local features

The features are first mapped into a space where the inner product approximates the χ² kernel

Vedaldi, A. and Zisserman, A. Efficient additive kernels via explicit feature maps. PAMI, 34(3):480–492, 2012.
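The explicit map of Vedaldi & Zisserman approximates the additive χ² kernel k(x, y) = Σ_d 2 x_d y_d / (x_d + y_d) with a finite-dimensional embedding built from the kernel's spectrum κ(λ) = sech(πλ). A sketch of that construction (the sampling period L and number of frequencies n here are illustrative defaults, not the paper's settings):

```python
import numpy as np

def chi2_feature_map(x, n=4, L=0.5):
    """Map a non-negative vector x so that phi(x).phi(y) ~ sum_d 2 x_d y_d/(x_d+y_d).

    Discretizes the chi^2 kernel's spectrum kappa(lambda) = sech(pi*lambda)
    at frequencies j*L, j = 0..n. Output dimension is (2n+1)*len(x).
    """
    x = np.asarray(x, dtype=float)
    logx = np.log(np.where(x > 0, x, 1.0))   # log only matters where x > 0
    feats = []
    for j in range(n + 1):
        kappa = 1.0 / np.cosh(np.pi * j * L)
        if j == 0:
            feats.append(np.sqrt(x * L * kappa))
        else:
            feats.append(np.sqrt(2 * x * L * kappa) * np.cos(j * L * logx))
            feats.append(np.sqrt(2 * x * L * kappa) * np.sin(j * L * logx))
    return np.concatenate(feats)
```

After this mapping, the plain dot product (and hence a linear W) behaves approximately like the χ² kernel, which is what lets FastTag stay linear while using kernel-style similarities.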

Page 13: Paper introduction: Fast Image Tagging

Experimental results

Page 14: Paper introduction: Fast Image Tagging

Accuracy evaluation

leastSquares: FastTag without tag completion; the baseline

TagProp: the previous state of the art; O(n²) training, O(n) testing

FastTag's accuracy is almost the same as TagProp's
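The reported metrics (per-keyword precision P, recall R, their F1, and N+, the number of keywords with non-zero recall) can be sketched as follows. This assumes the standard protocol of averaging precision and recall over keywords; treat it as an illustration, not the paper's exact evaluation code:

```python
def tag_metrics(true_tags, pred_tags):
    """Per-keyword mean precision/recall, F1, and N+ over a list of images.

    true_tags, pred_tags: lists of sets of tag names, one set per image.
    A keyword that is never predicted counts as precision 0.
    """
    keywords = set().union(*true_tags)
    precs, recs, n_plus = [], [], 0
    for t in sorted(keywords):
        tp = sum(1 for gt, pr in zip(true_tags, pred_tags) if t in gt and t in pr)
        n_pred = sum(1 for pr in pred_tags if t in pr)
        n_true = sum(1 for gt in true_tags if t in gt)
        precs.append(tp / n_pred if n_pred else 0.0)
        rec = tp / n_true if n_true else 0.0
        recs.append(rec)
        n_plus += rec > 0
    P = sum(precs) / len(precs)
    R = sum(recs) / len(recs)
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1, n_plus
```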

(Excerpt from the paper:)

[Figure 2. Predicted keywords using FastTag for sample images in the ESP game dataset (using all 268 keywords); examples are grouped into high F-1 score, low F-1 score, and random.]

Table 1. Comparison of FastTag and TagProp in terms of P, R, F1 score and N+ on the Corel5K dataset. Previously reported results using other image annotation techniques are also included for reference.

    Name                                 P    R    F1   N+
    leastSquares                         29   32   30   125
    CRM (Lavrenko et al., 2003)          16   19   17   107
    InfNet (Metzler & Manmatha, 2004)    17   24   20   112
    NPDE (Yavlinsky et al., 2005)        18   21   19   114
    SML (Carneiro et al., 2007)          23   29   26   137
    MBRM (Feng et al., 2004)             24   25   24   122
    TGLM (Liu et al., 2009)              25   29   27   131
    JEC (Makadia et al., 2008)           27   32   29   139
    TagProp (Guillaumin et al., 2009)    33   42   37   160
    FastTag                              32   43   37   166

[…] report the number of keywords with non-zero recall value (N+). In all metrics a higher value indicates better performance.

Baselines. We compare against leastSquares, a ridge regression model which uses the partial subset of tags y_1, …, y_n as labels to learn W, i.e., FastTag without tag enrichment. We also compare against the TagProp algorithm (Guillaumin et al., 2009), a local kNN method combining different distance metrics through metric learning. It is the current best performer on these benchmark sets. Most existing works do not provide publicly available implementations. As a result, we include their previously reported results for reference (Lavrenko et al., 2003; Metzler & Manmatha, 2004; Yavlinsky et al., 2005; Carneiro et al., 2007; Feng et al., 2004; Liu et al., 2009; Makadia et al., 2008).

Table 2. Comparison of FastTag and TagProp in terms of P, R, F1 score and N+ on the ESP game and IAPR TC-12 datasets.

                    ESP game              IAPR
    Name            P    R    F1   N+    P    R    F1   N+
    leastSquares    35   19   25   215   40   19   26   198
    MBRM            18   19   18   209   24   23   23   223
    JEC             24   19   21   222   29   19   23   211
    TagProp         39   27   32   238   45   34   39   260
    FastTag         46   22   30   247   47   26   34   280

4.2. Comparison with related work

Table 1 shows a detailed comparison of FastTag to the leastSquares baseline and eight published results on the Corel5K dataset. We can make three observations: 1. The performance of FastTag aligns with that of TagProp (so far the best algorithm in terms of accuracy on this dataset), and significantly outperforms the other methods; 2. The leastSquares baseline, which corresponds to FastTag without the tag enricher, performs surprisingly well compared to existing approaches, which suggests the advantage of a simple model that can extend to a large number of visual descriptors, as opposed to a complex model that can afford fewer descriptors. One may instead more cheaply glean the benefits of a complex model via non-linear transformation of the features. 3. The duo classifier formulation of FastTag, which adds the tag enricher, alleviates the intrinsic label sparsity problem of image annotation. It leads to a 10% improvement on precision, 28% on recall, and an overall 20% improvement on F1 score over the leastSquares baseline. We also […]


Page 15: Paper introduction: Fast Image Tagging

Maximum number of tags

Page 16: Paper introduction: Fast Image Tagging

[Figure 2 from the paper: predicted keywords using FastTag for sample images in the ESP game dataset (using all 268 keywords).]