Paper introduction: Fast Image Tagging
Fast Image Tagging
M. Chen (Amazon.com), A. Zheng (MSR, Redmond), and K. Weinberger (Washington Univ.)
ICML 2013
ICML 2013 reading group, 2013-07-09
Preferred Infrastructure, Inc.
Takashi Abe <[email protected]>
About me
Takashi Abe (阿部厳)
Twitter: @tabe2314
Okatani Lab, Tohoku University (computer vision) → intern at PFI → new PFI employee
2
Paper introduced
M. Chen, A. Zheng and K. Weinberger. Fast Image Tagging. ICML, 2013.
Note: the figures and tables in these slides are taken from this paper.
3
Image Tagging (1)
4
Estimate the relevant tags (more than one) for a given image.
training:
  input: {(image, tags), …}  output: image → tag set
testing:
  input: image  output: estimated tag set
[Figure: training images tagged "bear, polar, snow, tundra" and "buildings, clothes, shops, street"; at test time the tags are unknown]
Image Tagging (2): what makes it hard?
Effective features differ by object → want to throw in many kinds of features
Large variation in appearance → want to use a large dataset
Incomplete annotation data
  Precision aside, only low-recall data is available
  (i.e., the data we get has some tags missing from the true tag set)
  Example: Flickr tags
Skewed tag frequency distribution
5
[Figure panels: Color, Edges]
FastTag
6
Basic idea
Complete the annotated tag sets, while learning a (linear) mapping from images to the completed tags.
B: tag set → completed tag set
W: image features → completed tag set
Learning: (the objective is shown in the paper excerpt below)
7
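The two mappings B and W and the agreement term they are trained under can be sketched in a few lines of numpy. All sizes and data here are made up, and `agreement_loss` is an illustrative name, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, T = 100, 32, 10                         # images, feature dims, tag vocabulary (made up)
X = rng.normal(size=(d, n))                   # column i: visual features x_i
Y = (rng.random((T, n)) < 0.2).astype(float)  # column i: observed (incomplete) tags y_i

B = np.eye(T)                        # tag enrichment map: tag set -> completed tag set
W = 0.01 * rng.normal(size=(T, d))   # image map: features -> completed tag set

def agreement_loss(W, B, X, Y):
    # Co-regularization term: force B's and W's predictions to agree (cf. Eq. (1)).
    return np.sum((B @ Y - W @ X) ** 2) / X.shape[1]

loss = agreement_loss(W, B, X, Y)
print(loss)
```

Note that setting B = W = 0 drives this term to zero, which is exactly why the paper adds a regularizer on B.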
[Figure 1: at training time, incomplete user tags y (e.g. "snow, lake, feet") are enriched by B into relevant tags (e.g. "mountain, snow, sky, lake, water, feet, legs, boat, trees"), while visual features x are mapped by W to agree with By; at test time, Wx predicts the relevant tags (e.g. "sky, clouds, lake, water, feet, legs, boat, trees")]
Figure 1. Schematic illustration of FastTag. During training two classifiers B and W are learned and co-regularized to predict similar results. At testing time, a simple linear mapping x → Wx predicts tags from image features.
jointly convex and has closed form solutions in each iteration of the optimization.

Co-regularized learning. As we are only provided with an incomplete set of tags, we create an additional auxiliary problem and obtain two sub-tasks: 1) training an image classifier x_i → Wx_i that predicts the complete tag set from image features, and 2) training a mapping y_i → By_i to enrich the existing sparse tag vector y_i by estimating which tags are likely to co-occur with those already in y_i. We train both classifiers simultaneously and force their output to agree by minimizing

$$\frac{1}{n}\sum_{i=1}^{n}\|By_i - Wx_i\|^2. \qquad (1)$$

Here, By_i is the enriched tag set for the i-th training image, and each row of W contains the weights of a linear classifier that tries to predict the corresponding (enriched) tag based on image features.

The loss function as currently written has a trivial solution at B = 0 = W, suggesting that the current formulation is underconstrained. We next describe additional regularizations on B that guide the solution toward something more useful.
Marginalized blank-out regularization. We take inspiration from the idea of marginalized stacked denoising autoencoders (Chen et al., 2012) and related works (?) in formulating the tag enrichment mapping $B : \{0,1\}^T \to \mathbb{R}^T$. Our intention is to enrich the incomplete user tags by turning on relevant keywords that should have been tagged but were not. Imagine that the observed tags y are randomly sampled from the complete set of tags: it is a "corrupted" version of the original set. We leverage this insight and train the enrichment mapping B to reverse the corruption process. To this end, we construct a further corrupted version $\tilde{y}$ of the observed tags y and train B to reconstruct y from $\tilde{y}$. If this secondary corruption mechanism matches the original corruption mechanism, then re-applying B to y would recover the likely original pristine tag set.

For simplicity, we use uniform corruption as the secondary corruption mechanism. In practice, human labelers may select tags with bias, not uniform probability. We can approximate the unknown corrupting distribution with piecewise uniform corruption in the learning step (see section 3.2). If prior knowledge on the original corruption mechanism is available, it can also easily be incorporated into our model.

More formally, for each y, a corrupted version $\tilde{y}$ is created by randomly removing (i.e., setting to zero) each entry in y with some probability p ≥ 0 and therefore, for each user tag vector y and dimension t, $p(\tilde{y}_t = 0) = p$ and $p(\tilde{y}_t = y_t) = 1 - p$. We train B to optimize

$$B = \operatorname*{argmin}_B \frac{1}{n}\sum_{i=1}^{n}\|y_i - B\tilde{y}_i\|^2.$$

Here, each row of B is an ordinary least squares regressor that predicts the presence of a tag given all existing tags in $\tilde{y}$. To reduce variance in B, we take repeated samples of $\tilde{y}$. In the limit (with infinitely many corrupted versions of y), the expected reconstruction error under the corrupting distribution can be expressed as

$$r(B) = \frac{1}{n}\sum_{i=1}^{n} E\big[\|y_i - B\tilde{y}_i\|^2\big]_{p(\tilde{y}_i|y_i)}. \qquad (2)$$

Let us denote as $Y \equiv [y_1, \cdots, y_n]$ the matrix containing the partial labels for each image in each column. Define $P \equiv \sum_{i=1}^{n} y_i E[\tilde{y}_i]^\top$ and $Q \equiv \sum_{i=1}^{n} E[\tilde{y}_i \tilde{y}_i^\top]$, then we can rewrite the loss in (2) as

$$r(B) = \frac{1}{n}\operatorname{trace}\big(BQB^\top - 2PB^\top + YY^\top\big). \qquad (3)$$

We use Eq. (3) to regularize B. For the uniform "blank-out" noise introduced above, we have the expected value of the corruptions $E[\tilde{y}]_{p(\tilde{y}|y)} = (1-p)y$,
Marginalized blank-out regularization (1)
Minimizing (1) naively gives B = 0 = W, so constraints are needed.
We want B to map the annotations y_i to the true tag sets z_i.
Since z_i is not available, consider a corrupted version $\tilde{y}_i$ obtained by dropping each element of y_i with probability p, and learn a B that reconstructs y_i from $\tilde{y}_i$.
Thinking of generating $\tilde{y}_i$ repeatedly, the expected reconstruction error is Eq. (2).
8
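The corrupt-and-reconstruct step above can be sketched as follows; the sizes, the sample count, and the tiny ridge term are my own choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

n, T, p = 200, 8, 0.3                         # images, tags, drop probability (made up)
Y = (rng.random((T, n)) < 0.3).astype(float)  # observed tag vectors y_i as columns

# Draw several corrupted copies: each entry of y survives with probability 1 - p.
samples = 50
Yt = np.concatenate([Y * (rng.random((T, n)) >= p) for _ in range(samples)], axis=1)
Yr = np.tile(Y, samples)                      # matching reconstruction targets

# Least-squares fit of B (tiny ridge added only for numerical stability).
B = Yr @ Yt.T @ np.linalg.inv(Yt @ Yt.T + 1e-6 * np.eye(T))

err_fit = np.mean(np.sum((Yr - B @ Yt) ** 2, axis=0))   # reconstruction error of fitted B
err_zero = np.mean(np.sum(Yr ** 2, axis=0))             # error of the trivial map B = 0
print(err_fit, err_zero)
```

Each row of the fitted B is an ordinary least-squares regressor that predicts one tag from all (corrupted) tags, so its reconstruction error is necessarily below that of the trivial zero map.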
Marginalized blank-out regularization (2)
Rearranging (2) gives Eq. (3).
In other words, there is no need to actually generate $\tilde{y}$ (only p needs to be chosen).
This expected reconstruction error is added to the loss function.
9
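The slide's claim — Eq. (2) has the closed form Eq. (3), so $\tilde{y}$ never has to be generated — can be checked numerically. A small sketch with arbitrary data; the P and Q formulas follow from $E[\tilde{y}] = (1-p)y$ for blank-out noise:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

T, n, p = 4, 6, 0.3
Y = (rng.random((T, n)) < 0.5).astype(float)  # columns are tag vectors y_i
B = rng.normal(size=(T, T))

# Closed form: for blank-out noise E[y~] = (1-p) y, so with S = Y Y^T,
# P = (1-p) S and Q = (1-p)^2 S + p (1-p) diag(S).
S = Y @ Y.T
P = (1 - p) * S
Q = (1 - p) ** 2 * S + p * (1 - p) * np.diag(np.diag(S))
r_closed = np.trace(B @ Q @ B.T - 2 * P @ B.T + S) / n

# Exact expectation of Eq. (2) by enumerating all 2^T blank-out patterns.
r_exact = 0.0
for mask in product([0.0, 1.0], repeat=T):
    m = np.array(mask)
    prob = np.prod(np.where(m == 1.0, 1 - p, p))  # probability of this pattern
    Ytil = Y * m[:, None]                         # corrupted copies of all y_i
    r_exact += prob * np.sum((Y - B @ Ytil) ** 2) / n

print(r_closed, r_exact)
```

The two quantities agree to machine precision, so only p has to be chosen; no sampling of $\tilde{y}$ is needed.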
Optimization
With B fixed, the W that minimizes (5) has a closed-form solution.
Likewise, with W fixed, B has a closed-form solution.
Alternating the two converges to the global optimum (the objective is jointly convex).
10
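Objective (5) itself is not reproduced in this transcript; assuming it is the agreement term (1) plus λ times the blank-out regularizer (3) and a ridge penalty γ‖W‖², the alternating closed-form updates can be sketched as below (made-up data; the exact scaling of the constants is my choice):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, T = 200, 16, 8                          # images, feature dims, tags (made up)
X = rng.normal(size=(d, n))                   # visual features as columns
Y = (rng.random((T, n)) < 0.3).astype(float)  # observed tags as columns
p, lam, gam = 0.3, 1.0, 0.1                   # blank-out prob., regularization weights

# Blank-out statistics, no sampling needed: P = (1-p) S, Q = (1-p)^2 S + p(1-p) diag(S).
S = Y @ Y.T
P = (1 - p) * S
Q = (1 - p) ** 2 * S + p * (1 - p) * np.diag(np.diag(S))

def objective(W, B):
    return (np.sum((B @ Y - W @ X) ** 2) / n
            + lam * np.trace(B @ Q @ B.T - 2 * P @ B.T + S) / n
            + gam * np.sum(W ** 2))

# Alternating closed-form updates; each step exactly minimizes one block of a
# jointly convex objective, so the recorded values can only go down.
B = np.eye(T)
W = np.zeros((T, d))
vals = []
for _ in range(10):
    W = B @ Y @ X.T @ np.linalg.inv(X @ X.T + n * gam * np.eye(d))
    B = (W @ X @ Y.T + lam * P) @ np.linalg.inv(Y @ Y.T + lam * Q + 1e-9 * np.eye(T))
    vals.append(objective(W, B))
print(vals[0], vals[-1])
```

Monotone decrease of `vals` is exactly the convergence behavior the slide describes.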
Extensions
Tag bootstrapping
Since B is learned from tag co-occurrence, similar tags that do not co-occur (e.g., lake and pond) are not filled in.
stacking
Retrain with By_i as the new annotations, and repeat.
Intuitively, this propagates co-occurrence relations(?)
The number of stacking levels is chosen experimentally.
11
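The stacking loop might look like the sketch below. This is my illustrative reading of the slide: the closed-form fit $B = PQ^{-1}$ minimizes the marginalized reconstruction error r(B) alone, and the clipping used to normalize the enriched annotations is an arbitrary choice, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

T, n = 8, 200
Y = (rng.random((T, n)) < 0.3).astype(float)  # observed tag vectors as columns

def fit_enricher(Y, p=0.3):
    # Marginalized blank-out fit (cf. Eq. (3)): argmin_B r(B) = P Q^{-1}.
    S = Y @ Y.T
    P = (1 - p) * S
    Q = (1 - p) ** 2 * S + p * (1 - p) * np.diag(np.diag(S))
    return P @ np.linalg.inv(Q + 1e-9 * np.eye(Y.shape[0]))

# Stacking: treat B @ Y as the new annotations and fit again, a few times.
stacks = 3
Ycur = Y.copy()
for _ in range(stacks):
    B = fit_enricher(Ycur)
    Ycur = np.clip(B @ Ycur, 0.0, 1.0)  # enriched annotations for the next round
print(Ycur.shape)
```

Each round lets tags reachable through chains of co-occurrence leak into the annotations, which is one way to read the "propagate co-occurrence relations" remark.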
Image features
A combination of multiple features (the same set as existing methods):
GIST
6 kinds of color histograms
bag-of-words over 8 kinds of local features
The features are mapped in advance into a space where inner products approximate the χ² distance:
Vedaldi, A. and Zisserman, A. Efficient additive kernels via explicit feature maps. PAMI, 34(3):480–492, 2012.
12
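For the one-dimensional χ² kernel k(x, y) = 2xy/(x + y), the explicit feature map from the Vedaldi–Zisserman reference can be sketched as below; the spectrum sech(πω) is the one given there for χ², while the sampling period L and order n here are arbitrary choices of mine:

```python
import numpy as np

def chi2_feature_map(x, L=0.5, n=3):
    """Approximate explicit feature map for the chi^2 kernel k(x, y) = 2xy/(x + y),
    following Vedaldi & Zisserman (2012): the kernel signature has spectrum
    kappa(w) = sech(pi * w), sampled with period L at n frequencies (x > 0)."""
    kappa = lambda w: 1.0 / np.cosh(np.pi * w)
    feats = [np.sqrt(x * L * kappa(0.0))]
    for j in range(1, n + 1):
        c = np.sqrt(2.0 * x * L * kappa(j * L))
        feats.append(c * np.cos(j * L * np.log(x)))
        feats.append(c * np.sin(j * L * np.log(x)))
    return np.array(feats)

x, y = 0.3, 0.5
exact = 2 * x * y / (x + y)
approx = chi2_feature_map(x) @ chi2_feature_map(y)
print(exact, approx)
```

Applied per histogram bin, such a map lets plain inner products (and hence the linear classifier W) approximate a χ² comparison of the histograms.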
Experimental results
13
Accuracy evaluation
leastSquares: FastTag without tag completion; the baseline
TagProp: the previous state of the art; O(n²) training, O(n) testing
FastTag's accuracy is roughly the same as TagProp's
14
[Figure 2: sample ESP game images, each with five predicted keywords (e.g. "bug, green, insect, tree, wood"; "blue, cloud, ocean, sky, water"; "plane, red, sky, train, truck"), grouped into high F-1 score, low F-1 score, and random examples]
Figure 2. Predicted keywords using FastTag for sample images in the ESP game dataset (using all 268 keywords).
Table 1. Comparison of FastTag and TagProp in terms of P, R, F1 score and N+ on the Corel5K dataset. Previously reported results using other image annotation techniques are also included for reference.

Name                                 P   R   F1  N+
leastSquares                         29  32  30  125
CRM (Lavrenko et al., 2003)          16  19  17  107
InfNet (Metzler & Manmatha, 2004)    17  24  20  112
NPDE (Yavlinsky et al., 2005)        18  21  19  114
SML (Carneiro et al., 2007)          23  29  26  137
MBRM (Feng et al., 2004)             24  25  24  122
TGLM (Liu et al., 2009)              25  29  27  131
JEC (Makadia et al., 2008)           27  32  29  139
TagProp (Guillaumin et al., 2009)    33  42  37  160
FastTag                              32  43  37  166

We also report the number of keywords with non-zero recall value (N+). In all metrics a higher value indicates better performance.
Baselines. We compare against leastSquares, a ridge regression model which uses the partial subset of tags y_1, . . . , y_n as labels to learn W, i.e., FastTag without tag enrichment. We also compare against the TagProp algorithm (Guillaumin et al., 2009), a local kNN method combining different distance metrics through metric learning. It is the current best performer on these benchmark sets. Most existing work does not provide publicly available implementations. As a result, we include their previously reported results for reference (Lavrenko et al., 2003; Metzler & Manmatha, 2004; Yavlinsky et al., 2005; Carneiro et al., 2007; Feng et al., 2004; Liu et al., 2009; Makadia et al., 2008).
Table 2. Comparison of FastTag and TagProp in terms of P, R, F1 score and N+ on the Espgame and IAPRTC-12 datasets.

                    ESP game            IAPR
Name                P   R   F1  N+      P   R   F1  N+
leastSquares        35  19  25  215     40  19  26  198
MBRM                18  19  18  209     24  23  23  223
JEC                 24  19  21  222     29  19  23  211
TagProp             39  27  32  238     45  34  39  260
FastTag             46  22  30  247     47  26  34  280
4.2. Comparison with related work
Table 1 shows a detailed comparison of FastTag to the leastSquares baseline and eight published results on the Corel5K dataset. We can make three observations: 1. The performance of FastTag aligns with that of TagProp (so far the best algorithm in terms of accuracy on this dataset), and significantly outperforms the other methods; 2. The leastSquares baseline, which corresponds to FastTag without the tag enricher, performs surprisingly well compared to existing approaches, which suggests the advantage of a simple model that can extend to a large number of visual descriptors, as opposed to a complex model that can afford fewer descriptors. One may instead more cheaply glean the benefits of a complex model via non-linear transformation of the features. 3. The duo classifier formulation of FastTag, which adds the tag enricher, alleviates the intrinsic label sparsity problem of image annotation. It leads to a 10% improvement on precision, 28% on recall, and an overall 20% improvement on F1 score over the leastSquares baseline.
15
Maximum number of tags
16