Finding similar items by leveraging social tag clouds
Chu-Cheng Hsieh
University of California, Los Angeles
4732 Boelter Hall, Los Angeles, CA 90095, USA
(+1)310-773-0733

Junghoo Cho
University of California, Los Angeles
4732 Boelter Hall, Los Angeles, CA 90095, USA
(+1)310-773-0733
ABSTRACT Recently social collaboration projects such as Wikipedia and Flickr have been gaining popularity, and more and more social tag information is being accumulated. In this study, we demonstrate how to effectively use social tags created by humans to find similar items. We create a query-by-example interface for finding similar items through offering examples as a query. Our work aims to measure the similarity between a query, expressed as a group of items, and another item through utilizing the tag information. We show that using human-generated tags to find similar items has at least two major challenges: popularity bias and the missing tag effect. We propose several approaches to overcome the challenges. We build a prototype website allowing users to search over all entries in Wikipedia based on tag information, and then collect 600 valid questionnaires from 69 students to create a benchmark for evaluating our algorithms based on user satisfaction. Our results show that the presented techniques are promising and surpass the leading commercial product, Google Sets, in terms of user satisfaction.
Categories and Subject Descriptors H.3.3 [INFORMATION STORAGE AND RETRIEVAL]: Information Search and Retrieval – clustering, information filtering, retrieval models, selection process.
General Terms Algorithms, Experimentation
Keywords World Wide Web (WWW), Social tags, Information retrieval, Entity resolution, Folksonomy
1. INTRODUCTION The dominant interface of search engines today requires users to pinpoint their information needs with a few keywords. However, users sometimes find it difficult to identify the keywords that best describe their needs. For example, a user who plans to apply for a graduate school in California may issue the query “outstanding universities in California”. Many outstanding schools, such as Stanford University, are missing from the top results of all major search engines, because the keywords “outstanding” and “California” do not appear in the web pages of those schools.
As a potential solution to this problem, we study methodologies for providing a “query-by-example” interface. In this interface, users provide a few representative examples of the ultimate information they seek. The system then returns the search results most similar to the examples provided; for instance, to find outstanding graduate schools, a user may issue a query like “Caltech, UC Berkeley” and expect that the system will return similar outstanding schools in California such as UCLA and Stanford University.
The major challenge in building such a system is to identify similar items based on the user-provided set of examples. In this study, we leverage the tag clouds that are collaboratively created by web users in defining and measuring the similarity between multiple items. To verify the effectiveness of our solutions, we conduct experiments on one of the largest social collaboration projects, Wikipedia. In the Wikipedia dataset, we consider a wiki page or entry as an entity and a category label of a page as a tag. We aim to identify and rank entities that are similar to the user-provided examples based on tag information.
As other researchers [9] have noted, the uncontrolled nature of user-generated metadata, such as free-form tagging in Wikipedia, often causes problems of imprecision and ambiguity when these tags are used as a foundation of designing algorithms. In our study, we identify and deal with two challenges associated with free-form uncontrolled tag clouds: popularity bias and the missing tag effect (Section 3.2).
We propose several approaches to overcome the challenges and subsequently build a prototype website allowing users to issue a query by examples. Our results show that our techniques are able to return a sizable number of high-quality similar items even when the user provides only a few examples in the query. The proposed approaches are evaluated against a benchmark dataset that is built based on 600 valid questionnaire responses from 69 students. In terms of user satisfaction, the questionnaire responses show that our techniques outperform Google Sets.
In summary, we highlight our contributions as follows:
We investigate how to extract a set of similar items through analyzing noisy social tags created by human beings, and show that the tag information is effective in identifying relevant similar items.
We identify and solve two challenges in tag-based search frameworks: popularity bias and the missing tag effect.
We propose and compare several models based on tag information. We build algorithms on top of these models, and study their advantages and disadvantages.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’12, March 25-29, 2012, Riva del Garda, Italy. Copyright 2012 ACM 978-1-4503-0857-1/12/03…$10.00.
We perform an extensive evaluation based on user surveys and show that, in terms of user satisfaction, our tag-based approaches outperform Google Sets in most testing cases.

2. PROBLEM OVERVIEW
The essential problem can be phrased as the following: users want to retrieve relevant items sharing some characteristics, and their queries are composed of representative examples with the desired characteristics. We consider the “query-by-example” task as the process of finding similar items in a dataset, where the query is composed of a number of items from the same dataset.

Assuming a user issues a query q = {e1, e2, …}, the input to the framework is the query itself, where q should be composed of entities, like {Caltech, UC Berkeley}. Our goal is to create a function to measure the similarity between an entity e in the dataset and the query q, where a higher score implies a higher similarity between the entity and the query.

2.1 Tag generation
We start the discussion of our tag-entity model with an artificial example, as shown in Fig. 1, to explain how tags are generated. In this example, we say a user reads the content of a Wikipedia page about Washington D.C.; then, the user labels the page with the tags City, Capital, North America, etc. Note that a tag does not have to be the same word used in the content; for example, the concept of a metropolis is matched by the tag City. In most social collaboration projects, a user can select any phrase as a tag, i.e. free-form tagging. The phrase may be an existing tag, a modified phrasing of an existing tag, or even a newly created tag.

Fig. 1 The data model

Although we illustrate our tag-entity model by considering “wiki pages” as entities and their titles as identifiers, the content of entities is not limited to plain texts. Our proposed solutions utilize only tag information, ignoring the content of an entity. This simplification allows our techniques to be broadly applicable to non-textual datasets as well.

We use X to represent all entities in a dataset, that is, the universe of entities. We use T(e1), T(e2), … to represent the tags associated with the entities e1, e2, …. The universe of tags, referring to all tags in a dataset, is denoted by T.

We illustrate our notations with the following example:

Example 1 Consider a dataset that contains four entities:
T(D.C.) = {City, Capital, North America}
T(Paris) = {City, Capital, Europe}
T(Beijing) = {City, Capital, Asia}
T(…) = {City, …}
In this example, the dataset contains a total of six tags: City, Capital, North America, Europe, Asia, and …. □

We define the function E(t) as an operation for acquiring all entities associated with the tag t. For example, in Example 1, E(City) = {D.C., Paris, Beijing, …}.

Without referring to the content of entities, we limit our similarity measurement to tag information. One might argue that tag-entity relations are unreliable and sometimes a relation is missing from an entity. We will discuss these concerns regarding the imperfect nature of tags later in Section 3.2.

3. THE INTERSECTION-DRIVEN APPROACH
The core challenge in providing a query-by-example service is to figure out what types of entities a user is looking for based on the input entities provided by the user. To convey how we approach this problem, we start with an artificial scenario.

In Example 1, when the user-provided input is q = {D.C., Paris}, what will be a reasonable interpretation of the user’s intention? Both entities are associated with the tags {City, Capital}; that is, both entities are cities and capitals. Given this result, we may have two interpretations: the user is looking for cities that are also capitals, or the user is simply looking for cities, but the input examples happen to be capitals as well. Since the second interpretation is possible, not only the tag Capital but also the tag City should be weighed when we identify other similar entities.

In the intersection-driven approach, if a tag is associated with only a subset of the input examples, we claim that the user is unlikely to be looking for entities associated with such a tag. For instance, although the entity Beijing is associated with the tag Asia, it is unlikely that the user is only looking for cities in Asia because this tag is not associated with all input examples.

3.1 Similarity Measurement
Here, we define the degree of similarity between an entity e and the query q as follows:

score(e, q) = |T(e) ∩ q∩|    (1)

The symbol q∩ stands for the tags associated with all entities in a query, i.e. q∩ = T(e1) ∩ T(e2) ∩ … ∩ T(e|q|), ∀ ei ∈ q. In Example 1, if a query q consists of the entities {D.C., Paris}, then q∩ = {City, Capital}.

According to our definition in Equation (1), the more tags in q∩ an entity is associated with, the more similar (to the query q) the entity is. For instance, if an entity is a city but not a capital, it is associated with City but not with Capital; according to Equation (1), its ranking score is 1, meaning that the entity has one tag in common with q∩. Likewise, the ranking score of the entity Beijing is |T(Beijing) ∩ q∩| = 2, meaning that Beijing has two tags in common with q∩.
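The intersection-driven scoring of Equation (1) reduces to a few set operations. Below is a minimal Python sketch; the toy entities and tag sets are illustrative assumptions in the spirit of Example 1, not the paper's exact listing.

```python
from typing import Dict, List, Set

# Toy dataset (illustrative assumption): entity -> T(entity).
TAGS: Dict[str, Set[str]] = {
    "D.C.":    {"City", "Capital", "North America"},
    "Paris":   {"City", "Capital", "Europe"},
    "Beijing": {"City", "Capital", "Asia"},
    "NYC":     {"City", "North America"},
}

def q_intersection(query: List[str]) -> Set[str]:
    """q∩: the tags shared by every entity in the query."""
    return set.intersection(*(TAGS[e] for e in query))

def score(entity: str, query: List[str]) -> int:
    """Equation (1): score(e, q) = |T(e) ∩ q∩|."""
    return len(TAGS[entity] & q_intersection(query))

query = ["D.C.", "Paris"]            # q∩ = {City, Capital}
assert q_intersection(query) == {"City", "Capital"}
assert score("Beijing", query) == 2  # shares both City and Capital
assert score("NYC", query) == 1      # a city, but not a capital
```

Because q∩ only shrinks as more examples are added, a single absent tag-entity relation can knock the desired tag out of the intersection entirely, which is exactly the missing tag effect examined in Section 3.2.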
In the above example, our approach ranks Beijing higher than the city that is not a capital. This result seems intuitive because, no matter whether the user’s intent is to find cities or to find cities that are also capitals, the entity Beijing is always an appropriate answer. The non-capital city is an inappropriate answer if the user’s intent is to find cities that are also capitals; since we cannot deny such a possibility, ranking it below Beijing is reasonable.

3.2 Challenges
We create Example 2 based on our observations on the Wikipedia dataset. The example illustrates and highlights the challenges we expect to encounter in a real dataset. In Example 2, an underlined notation signifies that a tag is very popular. Also, we strike through a tag, representing that a tag-entity relation should exist but does not appear in a real dataset.

Example 2 Consider a dataset that contains six entities:
T(D.C.) = {City, Capital, North America, Object}
T(Beijing) = {City, Capital, Asia, China, Summer Olympic, Object}
T(Paris) = {City, Capital, Europe, Object}
T(…) = {City, North America, Object}
T(…) = {Summer Olympic, Object}
T(…) = {City, Europe, Object}
In this example, the dataset contains a total of eight tags: City, Capital, North America, Europe, Asia, China, Summer Olympic, and Object. Here, Object is a very popular tag (underlined in the original), and the relation between the tag Capital and one of the input capitals is missing (struck through in the original). □

3.2.1 Missing Tag Effect
Not only in Example 2, but also in practice, a tag-entity relation can be missing. There are several possible reasons why this may happen. In any social collaboration project, a newly created entity might not be well tagged until its editors finish revising all the content of the entity. At the same time, the community may not be aware of a newly created tag or may decide not to use the newly created tag for other entities.

Missing tag-entity relations could cause the system to misinterpret user intent. For example, suppose that a user’s query q = {D.C., Paris} has been issued against the Example 2 dataset; since the relation between the tag Capital and one of the input entities is missing, the intersection-driven approach interprets the user’s intention as finding entities associated with the tags {City, Object}. Given these two input entities, we feel that such an interpretation is problematic because the interpretation “Object” is too general: the intersection-driven approach considers a relevant entity an irrelevant one and ranks clearly dissimilar entities as equally similar to the query q.

The impact of the missing tag effect is more pronounced as more entities are included in a query. Suppose that a user is looking for a set of entities associated with a tag t; ideally, the tag t is expected to be associated with all the entities in the query. If we use the notation P_miss to represent the probability of missing the tag t in an input example, the probability of t being not associated with all input examples becomes

P(t ∉ q∩) = 1 − (1 − P_miss)^|q|    (2)

where |q| is the number of entities in the query. If the value of P_miss is 20% and the number of input examples is 3, the probability of missing the tag in q∩ is close to a half (48.8%). When the number of input examples increases to 10, the chance of missing the desired tag increases to 89.26%.

3.2.2 Partial Weighting Generalization
To generalize the intersection-driven approach for addressing the missing tag effect, one solution is to assign scores in real numbers, instead of either zero or one, to tags that are associated with only some entities in a query. In the previous example (q = {D.C., Paris}), we now assign 0.5 to tags such as Capital, because we have two entities in the query and each of these tags associates with only one entity. These tags, although not in q∩({D.C., Paris}), are now assigned partial weights in real numbers, and thus contribute to the similarity function score(e, q). As a result, the tag City still contributes 1 point, while each partially matched tag, such as Capital, contributes 0.5 points; entities that were previously tied or misordered (scoring, e.g., 1.5 versus 2) are now separated, and the system returns a better ranking result.

The strategy weights tags that appear in q∩ the most heavily, and thus ensures that we follow the same intuition: the more tags in q∩ an entity is associated with, the more similar (to the query q) the entity is. Moreover, in case the intent cannot be captured in q∩, the strategy helps the system return some related results as long as some tags are associated with one or more entities in the query.

Such a generalization brings more entities to the results. For example, in Example 2, assigning a partial weight to a tag carried by only one input entity brings a loosely related entity into the result. Nevertheless, introducing this suspicious result may not be a good idea. Therefore, if the system already returns satisfying results, we tend to not adopt generalization unless the user asks for more results.

3.2.3 Popularity Bias
Both the results in previous research [5] and the results in our experiments show that the number of tags associated with an entity follows a power law distribution. That is, only a few entities are associated with a large number of tags, and most entities are associated with a small number of tags.

We define the popularity of an entity e based on the number of tags associated with e, denoted by |T(e)|. Then, an entity is more popular than another entity if |T(e)| > |T(e′)|. Similarly, we say that a tag t is more popular than another tag t′ if |E(t)| > |E(t′)|, where E(t) represents the set of all entities associated with t.

We define the popularity entity-bias as follows: if we measure the similarity between an entity and a query based on tag information in the query, we tend to favor entities that are more popular. For example, suppose a query consists of entities that are all cities. To further process the query, we first identify and rank all cities, assigning full weight (1.0) to the tag City.
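The missing-tag probability of Equation (2) and the partial-weighting generalization can be sketched together in Python; the toy tag sets below are our own assumptions.

```python
from typing import Dict, List, Set

# Equation (2): probability that a desired tag drops out of q∩ when each
# input example independently misses it with probability p_miss.
def p_not_in_intersection(p_miss: float, q_size: int) -> float:
    return 1 - (1 - p_miss) ** q_size

# Partial weighting: weight each tag by the fraction of input entities
# carrying it, instead of the all-or-nothing membership test of q∩.
def partial_weights(query_tags: List[Set[str]]) -> Dict[str, float]:
    weights: Dict[str, float] = {}
    for tags in query_tags:
        for t in tags:
            weights[t] = weights.get(t, 0.0) + 1.0 / len(query_tags)
    return weights

def score(entity_tags: Set[str], query_tags: List[Set[str]]) -> float:
    w = partial_weights(query_tags)
    return sum(w.get(t, 0.0) for t in entity_tags)

# Numbers quoted in the text: p_miss = 20% with 3 (resp. 10) examples.
assert round(p_not_in_intersection(0.2, 3), 3) == 0.488
assert round(p_not_in_intersection(0.2, 10), 4) == 0.8926

# "Capital" is observed on only one of two input entities, e.g. because
# the tag-entity relation is missing from the other (toy data).
q = [{"City", "North America"}, {"City", "Capital", "Europe"}]
assert partial_weights(q)["City"] == 1.0     # in both: full weight
assert partial_weights(q)["Capital"] == 0.5  # in one of two: partial
assert score({"City", "Capital", "Asia"}, q) == 1.5
```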
We then assign partial weight (0.5) to the remaining tags of the input entities: Capital, Asia, China, Europe, and Object. Since most of these tags originate from the same popular input entity, entities similar to that entity are more likely to be returned in the result and ranked in higher positions.

The popularity bias happens in tags as well, and we name this kind of bias the popularity tag-bias. We argue that although a popular tag like Object in Example 2 is associated with the input example(s), this popular tag is probably not the concept the user intends to search for. That is, the tag Object exists in q∩ probably only because it is popular, i.e. |E(Object)| is a large number. In Example 2, if we randomly select two entities to create a query, Object is the most likely tag shown in q∩.

4. BALANCED VOTING MODEL
We now further refine the intersection-driven approach to include the following properties: (1) a popular entity in a query should not unfairly influence the results such that the results are similar only to the popular entity but not to the others, and (2) even if a few tag-entity relations are missing in the input examples, the system should be able to identify relevant entities based on tags associated with a subset of the input examples.

To achieve the above desired properties, in this model, we define the similarity between an entity x and the query q as follows:

vote(t, e) = 1/|T(e)|, if t ∈ T(e); 0, otherwise    (3)

score(x, q) = Σ_{t ∈ T(x) ∩ q∪} Σ_{e ∈ q} vote(t, e)    (4)

Equations (3) and (4) can mitigate both the missing tag effect and the popularity entity-bias. In a nutshell, we consider every entity in the query as having a total score of one in Equation (3). Then, each tag associated with the entity is assigned an equal portion of the entity’s score. The non-zero assignment alleviates the missing tag effect in that we now value the significance of tags that are missing in some input examples. Furthermore, the assigned score of a tag is inversely proportional to |T(e)|, i.e., the number of tags associated with e. As a result, a tag originating from a popular entity will get a lower score than a tag originating from a less popular entity, alleviating the popularity entity-bias.

In Equation (4), we introduce the symbol q∪ to represent the tags associated with one or more entities in a query q, where the symbol ∪ stands for the notion “union”. We calculate the relevance score of every entity x in the dataset with Equation (4) by summing up the relevance scores of all the tags associated with the entity; here, Σ_{e ∈ q} vote(t, e) represents the relevance score of the tag t to the query q. Note that in this equation we assign a non-zero score to each tag in q∪; therefore, a tag associated with only a subset of the input entities will still get a non-zero score. The following example shows how the scores are computed under this definition.

In Example 2, each tag associated with an input entity that carries five tags gains 0.2 points, because five tags share that entity’s one-point vote. Similarly, each tag associated with an input entity that carries three tags gets 0.33 points. Namely, if a query consists of these two entities, a tag shared by both of them receives 0.2 + 0.33 = 0.53 points. Although the first entity is associated with more tags than the second, each tag of the second entity contributes a higher ranking score.

We calculate the relevance score for every entity in the dataset with Equation (4). For instance, an entity gets a relevance score of 1.39 points by summing up the scores of its four matching tags: 0.2, 0.53, 0.33, and 0.33.

Tags belonging to the set q∩ still contribute the highest ranking scores. Therefore, the balanced voting approach clings to our belief that tags shared by all entities in a query likely capture a user’s intention; meanwhile, it compensates for the biases caused by a popular entity in a query. Compared to our intersection-driven approach, each tag in a popular entity now contributes less influence in terms of ranking scores. This difference makes our balanced voting approach less sensitive to the popularity entity-bias.

5. ONE-CLASS PROBABILISTIC MODEL
So far, we have viewed the similarity measurement as determining a good weighting scheme for the tags associated with the input examples. Alternatively, we can consider the problem as computing the probability that an entity is the target a user is looking for. Given some entities as possible results, we can then rank the entities based on the calculated probabilities.

We start by thinking about how people create a query for finding similar items. We think that a user firstly has a desired property in mind, and then the user tries to recall some entities with the desired property based on his/her knowledge. If the intent can be captured by a tag (or tags) in a dataset, entities associated with the tag(s) in the dataset are considered as similar entities.

We propose the one-class probabilistic model based on the following assumptions. At first, we assume that a user’s intent corresponds to one tag in a dataset. Since the intent is unpredictable, we claim that the desired tag t* is randomly selected from all tags. Then, once the desired tag t* is selected, the user randomly selects |q| entities from E(t*) to synthesize the query q, where E(t*) are all entities that are associated with t*.

5.1 Measuring Similarity Using Probability
In order to better understand how our one-class probabilistic model works, in this section we discuss the model under the premise that there is no missing tag in the dataset. We will reconsider the problem of missing tags in the next section.

When a query q is given, based on our one-class probabilistic model, the probability that x is what a user is looking for can be computed as

P(x | q) ≜ Σ_t P(t = t* | q) ∗ P(x | t = t*)    (5)

In Equation (5), P(t = t* | q) stands for the probability that a tag t is the desired tag when the system is given the user query q, and P(x | t = t*) stands for the probability that the system should return the entity x if the desired tag is t.
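The balanced voting scheme of Equations (3) and (4) can be sketched in a few lines of Python; the five-tag and three-tag toy entities are assumptions chosen to mirror the 0.2 and 0.33 vote sizes discussed in Section 4.

```python
from typing import List, Set

def relevance(tag: str, query_tags: List[Set[str]]) -> float:
    """Equations (3)-(4): each query entity distributes a total vote of 1
    equally over its tags, so a tag inherits 1/|T(e)| from every query
    entity e that carries it."""
    return sum(1.0 / len(tags) for tags in query_tags if tag in tags)

def score(entity_tags: Set[str], query_tags: List[Set[str]]) -> float:
    """Sum the relevance of every tag the entity shares with q∪."""
    q_union = set().union(*query_tags)
    return sum(relevance(t, query_tags) for t in entity_tags & q_union)

# A richly tagged query entity dilutes each of its votes to 0.2, while a
# sparsely tagged one casts stronger 1/3 votes (toy data, assumed names).
e1 = {"City", "Capital", "Asia", "Summer Olympic", "Object"}  # 5 tags
e2 = {"City", "Capital", "Europe"}                            # 3 tags
q = [e1, e2]
assert abs(relevance("City", q) - (0.2 + 1 / 3)) < 1e-9  # shared tag
assert abs(relevance("Asia", q) - 0.2) < 1e-9            # only in e1
x = {"City", "Capital", "North America"}
assert abs(score(x, q) - 2 * (0.2 + 1 / 3)) < 1e-9
```

Note that, unlike the intersection-driven score, a tag held by only one query entity (such as Asia above) still contributes a small non-zero vote, which is what softens the missing tag effect.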
Here, we assume that all tags are independent from each other, so we can sum up all the probability scores when more than one tag is considered.

Using Bayes’ theorem, we can find an equivalent form to Equation (5), as shown in Proposition 1.

Proposition 1 Ranking entities based on P(x | q) is equivalent to ranking entities through the formula Σ_{t ∈ T(x)} P(q | t = t*). That is,

P(x | q) ∝ Σ_{t ∈ T(x)} P(q | t = t*)    (6)

(Proof) Because of space constraints, we defer all derivations and proofs in this section to the appendix*.

* The appendix can be downloaded from http://oak.cs.ucla.edu/~chucheng/publication/SAC_2012_Appendix_Chucheng.pdf

Given a query q, |q| is the number of input examples. In the model, if the tag t is the desired tag, representing the user’s intention, our premise says that the query is created by randomly selecting |q| entities from the set E(t). |E(t)| is the number of entities associated with a tag t, so we have C(|E(t)|, |q|) distinct choices to create a query, because the user can randomly select |q| entities from E(t); here C(n, k) stands for the number of ways to choose k elements from a set with n distinct elements. Since the query q is one of the C(|E(t)|, |q|) outcomes, the probability of the query q having been created given t being the desired tag is

P(q | t = t*) = 1 / C(|E(t)|, |q|)    (7)

In this model, the entities in the query are selected from the entities associated with the desired tag t*, representing the desired property in the user’s mind. If all tag-entity relations are established, the desired tag must be one of the tags shared by all entities in the query, i.e., t* ∈ q∩. In addition, we know an entity x can be a similar entity to the query only if the desired tag is associated with the entity, i.e., t* ∈ T(x). Thus, for any entity x, only the tags in T(x) ∩ q∩ are considered.

If an entity is associated with two or more tags in q∩, we sum up all probability values to get the probability of the entity being a similar entity. Then, we rank every entity in a dataset based on this probability.

We summarize the above discussions as follows:

Proposition 2 If every tag is correctly associated with all entities to which it is related (no missing tag), we show that

P(x | q) ∝ Σ_{t ∈ T(x) ∩ q∩} 1 / C(|E(t)|, |q|)    (8)

(Proof) See the appendix.

Equation (8) has the advantage of alleviating the popularity tag-bias because the system will assign a low value to 1/C(|E(t)|, |q|) for a popular tag. The equation conveys the notion: when a rarely seen tag (|E(t)| is a small number) is shared by all entities of the query, the probability that the rarely seen tag is what the user desires is high, because it is unlikely that the input examples are associated with the tag by chance. As a result, Equation (8) places more value on it than on a popular common tag (|E(t)| is a large number).

In Example 2, suppose that we see the tag Object in q∩; we could argue that Object exists because the user is looking for objects, or because Object is associated with almost every entity in the dataset. If |E(Object)| is a large number, the chance of the latter is high, and thus we could reason that Object might not be the desired tag. In other words, the model alleviates the popularity tag-bias through assigning a low value to a popular tag.

5.2 Refinement of the Approach
In this section, we deal with the missing tag effect. We start with introducing some new notations for extending our one-class probabilistic model.

The function Ē(t) returns the entities that are relevant to the tag t but whose tag-entity relation is missing, and |Ē(t)| is the number of entities missing the tag t. For instance, in Example 2, Ē(Capital) contains the input capital whose Capital relation is struck through.

The symbol m(t, q) denotes the number of query entities missing the tag t. For example, in Example 2, a query of three input examples in which only one entity is missing the tag t has the value m(t, q) = 1.

The symbol |X| stands for the number of all entities in our dataset; the symbol q∪ represents all tags associated with “at least one” input example. Then, we generalize Proposition 2 as follows:

Proposition 3 When considering the missing tag effect, Σ_{t ∈ T(x)} P(q | t = t*) in Proposition 1 can be approximated by the following formula:

Σ_{t ∈ T(x) ∩ q∪} [1 / C(|E(t) ∪ Ē(t)|, |q|)] ∗ [|Ē(t)| / |E(t) ∪ Ē(t)|]^m(t, q)    (9)

where we claim that the dataset has the following properties: (1) |X| ≫ |E(t)| and (2) |E(t)| ≫ |q|.

(Proof) See the appendix.

The two properties in Proposition 3 are easily satisfied in practice.
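A minimal sketch of the one-class probabilistic ranking of Equation (8), assuming an invented toy dataset: a rare tag shared by the whole query dominates the score, while a ubiquitous tag (here Object) contributes only a small term.

```python
from math import comb
from typing import Dict, List, Set

TAGS: Dict[str, Set[str]] = {  # toy dataset; names are assumptions
    "D.C.":    {"City", "Capital", "Object"},
    "Paris":   {"City", "Capital", "Object"},
    "Beijing": {"City", "Capital", "Object"},
    "Rock":    {"Object"},
    "Ball":    {"Object"},
    "Tree":    {"Object"},
}

def entities_with(tag: str) -> Set[str]:
    """E(t): all entities associated with the tag t."""
    return {e for e, ts in TAGS.items() if tag in ts}

def score(entity: str, query: List[str]) -> float:
    """Equation (8): sum 1/C(|E(t)|, |q|) over tags t in T(x) ∩ q∩."""
    q_cap = set.intersection(*(TAGS[e] for e in query))
    return sum(1.0 / comb(len(entities_with(t)), len(query))
               for t in TAGS[entity] & q_cap)

q = ["D.C.", "Paris"]
# City and Capital (3 entities each) contribute 1/C(3,2) = 1/3 apiece,
# while the popular Object (6 entities) contributes only 1/C(6,2) = 1/15.
assert abs(score("Beijing", q) - (1/3 + 1/3 + 1/15)) < 1e-9
assert abs(score("Rock", q) - 1/15) < 1e-9
```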
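The missing-tag refinement can be sketched as a single per-tag term: the query-likelihood factor computed over E(t) ∪ Ē(t), damped exponentially in the number of input entities missing the tag. The decomposition and function name below are ours, and the 50%-missing setting follows the paper's working assumption.

```python
from math import comb

def refined_term(n_tagged: int, n_missing: int, q_size: int, m: int) -> float:
    """One summand of Equation (9) for a tag t (a sketch):
    n_tagged = |E(t)|, n_missing = |Ē(t)|, m = number of input entities
    missing the tag. The adjustment equals 1 when m == 0."""
    n = n_tagged + n_missing               # |E(t) ∪ Ē(t)|
    adjustment = (n_missing / n) ** m      # exponential damping in m
    return adjustment / comb(n, q_size)

# With half of all tag-entity relations assumed missing (|Ē(t)| ≈ |E(t)|),
# each input entity that misses the tag halves the tag's contribution.
full = refined_term(50, 50, q_size=3, m=0)
one_missing = refined_term(50, 50, q_size=3, m=1)
assert abs(one_missing / full - 0.5) < 1e-9
```

Under |Ē(t)| ≈ |E(t)|, each additional input entity missing the tag roughly halves that tag's contribution, matching the exponential decay of the adjustment term.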
thecn(W
thin
Tatathainthm
W
a
e
pp
Insrmmu
6Etacap
ShsqWc
Indu
6WoeE
+
he number of alentities associatcollaboration prnumber, for examWikipedia), so | | |hat in practice a nput examples a
The part of the eadjustment for mag is shared by his part of the eq
are required. Asncreases ( inchis part adheres
model that tags in
When we con| | |
addressing the po
explanation of 1
place more valupopular tag, so th
n our experimensimply make telations being
missing, we coumake a reasonabunderstanding of
6. EXPERIEvaluating effecask because the
corpus exists. Thalgorithm is to perceive our new
Some IR commuhttp://trec.nist.gosearch. Unfortunquery-by-examplWikipedia and ccollecting user su
n this section, dataset, explain users’ satisfactio
6.1 A studyWe implement ouon November 3rencyclopedia, anEnglish. The coll
+ The user survey
~chucheng/pub
l entities in a dated with any ojects, the nummple, 3,459,565
the property is| ≫ , is
user provides a and expects a siz
quation |missing a tag
all the input enquation becomess the number ocreases), the vals to the notion n ∩ are importa
nsider the mis|
can be i
opularity of a ta
/| |
in P
ue on a rarely-shat the model all
nts, since the ratihe following missing. Thus,
uld conclude |le approximation
f target datasets a
IMENT tiveness of diffe
e quality of resuhus, we think thbuild a search
w ranking results
unities, such as Tov), provide danately, those datle search. As a r
create a search iurveys+.
we are going the design of thn scores for each
y of Dataset ur system basedrd, 2009. Wikipnd it contains laborative nature
ys can be downloblication/qbe_su
ataset is larger thindividual tag.
mber of entitiesentities in our e
s satisfied. The also satisfied besmall portion ofable number of r
|/ can bein some input e
ntities, the numbs one, meaning tf input entities lue decreases expwe learned in n
ant.
ssing tag effec
interpreted as
ag. The concept
Proposition 2, w
seen tag in the leviates the popu
io of missing tagassumption: 50
if half tag-en|~| |
n of tag-missingand some heuris
erent approacheults is subjectivehe best way to e
engine and see.
TREC (Text retratasets for evaasets are not coresult, we downlnterface on top
to provide the he surveys, andh proposed algor
– Wikipedid on a Wikipediapedia is the largmore than 3 me means that all
oaded from httprvey_results_pu
han the number In most soci
s is a very largexperiment datas
second propertecause we believf desired results results [1].
e interpreted as aexamples. Whenber is zero anthat no adjustmemissing a tag
ponentially. Thunaïve intersectio
ct, the part 1
a refinement f
is identical to th
where we tend
query than ularity tag-bias.
gs is unknown, w0% of tag-entintity relations a
. In practical, g ratio requires thtic trials.
s is a challengine and no standaevaluate a rankine how well use
rieval conferencaluating keywoollected for testinload the dataset of the dataset f
overview of od then analyze thrithm.
ia a snapshot dumpegest collaborativ
million articles tags in Wikiped
://oak.cs.ucla.edublic.zip
of ial ge set ty, ve as
an n a nd ent
us, on
1/
for
he
to
a
we ity are to he
ng ard ng ers
ce; ord ng of
for
our he
ed ve in
dia
du/
Tags in Wikipedia are generated by humans, and some tag-entity relations might be missing.

We consider every wiki page as an entity, so the name of the wiki page (page title) becomes a unique identifier of the entity. For tag information, every wiki page contains "category" entries, and we consider each category as a tag. When an article is labeled with a category, we consider the category (the tag) associated with the article (the entity). The dataset contains 3,459,565 entities, 471,443 unique tags, and 13,196,971 associations.

Fig. 2 Size of category distribution

In Fig. 2, the x-axis refers to |g|, the number of entities associated with the tag g, and the y-axis refers to the number of tags with the size |g|. For example, the y value at "x=1" is 79,870, which means that 79,870 tags (about 16% of all tags) appear only once. These tags are possibly newly created, or they have not yet been accepted by other users. On the other hand, the right-most point of the line in Fig. 2 is the tag Living People; its y value, one, means that only this tag is associated with 413,981 entities (the value of the x-axis).

The chart in Fig. 2 shows that some tags are extremely popular and many tags are unpopular. Our statistics show that about 16% of tags are singletons, 74% of tags are associated with 2 to 50 entities, and less than 10% of tags are associated with more than fifty entities (|g|>50). Golder et al. [5] reported a similar distribution on Del.icio.us data, where a power-law distribution was also observed.

6.2 Effectiveness Evaluation
Evaluating the similarity between an entity and a query is difficult because whether the entity is similar to another entity in the query is subjective. For example, someone might argue that the entity Nikon (a manufacturer of cameras) is not similar to the entity Toyota (a manufacturer of cars), but another person may feel they are similar because both of them are Japanese companies. Thus, we decided to create a benchmark through conducting user surveys.

6.2.1 Experiment design
We collected 600 valid questionnaires from 69 students at UCLA to create a benchmark for evaluating user satisfaction. In each questionnaire, the system randomly selects one of 10 pre-set scenarios (listed in the appendix), where every scenario contains two to three entities to form a query.
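The bucket percentages reported for Fig. 2 are easy to tally once the tag-entity associations are loaded. Below is a small illustrative sketch (the function and variable names are assumptions, not from the paper):

```python
from collections import Counter

def tag_size_buckets(tag_to_entities):
    """Bucket tags by |g|, the number of entities associated with tag g,
    mirroring the singleton / 2-50 / >50 breakdown discussed for Fig. 2."""
    buckets = Counter()
    for entities in tag_to_entities.values():
        size = len(entities)
        if size == 1:
            buckets["singleton"] += 1
        elif size <= 50:
            buckets["2-50"] += 1
        else:
            buckets[">50"] += 1
    return buckets
```

Dividing each bucket count by the total number of tags yields the percentages cited in the text (about 16% singletons, 74% in the 2-50 range, and under 10% above fifty).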
Once the scenario is selected, the query is issued to our system, and we collect the top 200 results from each algorithm. We mix all the top 200 results, and the questionnaire is then generated through randomly selecting entities from the mixed set. Note that we run a customized questionnaire generation process so that high-rank results are reviewed by more persons. Entities in the first few pages, the high-rank results, are important because users tend to read results according to the order of entities. With limited resources (questionnaire takers), we are more interested in knowing how each approach performs in the high-rank results. Therefore, in the questionnaire generation process, 30 questions are drawn from high-rank entities, particularly the entities in the top 40 results of any model, and 10 questions are drawn from the other entities.

Fig. 3 A screen clip of a questionnaire

To avoid bias, questionnaire takers are unaware of how the questionnaires are generated. They are told that the entities in a query belong to a group where all entities share some properties (the intent of the query), and they are asked to examine whether an entity in the questionnaire belongs to the intent of the query. The intent of the query, such as Car Manufacturers, is not revealed to the questionnaire taker. They are asked to read the Wikipedia pages and make an evaluation of possible intentions of the query. Fig. 3 is a (partial) snapshot of a questionnaire.

Every questionnaire contains 50 entities; however, 10 of the 50 entities are known to be clearly unrelated to the query, for filtering out invalid surveys. If any pre-set unrelated entity is marked as a positive example, we rule out the entire questionnaire. Thus, only 600 of 714 questionnaires are considered valid after the screening process, and a total of 24,000 feedbacks are collected.

6.2.2 Benchmarks and Results
We use Equation (10) to calculate the satisfaction score of an entity to a query. Whenever a questionnaire taker reports a good answer, the corresponding entity earns the full credit of one; when the taker cannot make a clear judgment, we still assign 0.5 point to the entity:

    satisfaction = (#good + 0.5 ∗ #unsure) / #total    (10)

After averaging the scores of an entity, a high satisfaction score implies that most users believe that the entity and the entities in the query are similar, and a score close to 0.5 could imply a situation where users hold differing opinions. We then consider the satisfaction scores as benchmarks for evaluating different algorithms.

Fig. 4 contains the comparison results of all models based on the ten scenarios we use in the experiments. In addition to the four algorithms proposed in the paper, we add Google Sets [6] and the term frequency–inverse document frequency (TFIDF) algorithm to the comparison. The x-axis represents the top results, and the y-axis indicates the running average of satisfaction scores over the top results.
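The scoring rule of Equation (10) maps directly to code. The sketch below is a hedged reading of that rule (full credit for a good answer, half credit when the taker cannot judge); the feedback label names are assumptions for illustration:

```python
def satisfaction(feedbacks):
    """Satisfaction score of an entity per Equation (10): each 'good'
    feedback earns 1.0, each 'unsure' earns 0.5, anything else earns 0;
    the total is averaged over all feedbacks for the entity."""
    if not feedbacks:
        raise ValueError("no feedback collected for this entity")
    credit = {"good": 1.0, "unsure": 0.5}
    return sum(credit.get(f, 0.0) for f in feedbacks) / len(feedbacks)
```

For instance, two good answers, one unclear judgment, and one rejection average to (1 + 1 + 0.5 + 0) / 4 = 0.625.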
In Fig. 4, our one-class probabilistic model outperforms the other models in the top 10 results; our balanced voting model and our one-class probabilistic model have high user satisfaction when comparing the top 40 results. We notice that the intersection model and Google Sets return approximately only 25 entities on average. As a result, their average user satisfaction promptly drops when a user cannot find any results. The probabilistic model reports a promising result: 5.26% more satisfaction than Google Sets, based on 2,400 questions in 600 questionnaires.

Fig. 4 Comparison of different models (top 40)

6.2.3 Discussion
Although the algorithms of Google Sets are not disclosed, we compare our results against its results because Google is currently the leading search engine and Google Sets aims to solve the same problem: finding similar items through queries consisting of examples.

Fig. 5 One-class probabilistic model vs. Google Sets

Fig. 5 demonstrates a query in which most survey responders feel the results provided by Google Sets are inferior or questionable. We create the query {Beijing, Atlanta}, where the intent behind the query is to find cities that hosted the Olympic Games. Google Sets seems to interpret our intent as finding cities in America. In contrast, our one-class probabilistic model identifies cities that hosted the Olympic Games.

Though which interpretation is better is controversial, during interviews after the survey many responders said that their first impressions after seeing {Beijing, Atlanta} in the query were about the event that Beijing hosted the 2008 Summer Olympics, or about the notion that both of them are metropolises. Even responders preferring the notion "metropolises" were still unhappy when seeing a result that is biased towards Atlanta and neglects properties contributed by Beijing.

7. RELATED RESEARCH
The study of social tagging systems has been attracting attention from researchers. Mathes [9] investigated the
social tag data and pointed out that it was fundamentally chaotic. Shirky [12] also argued that using tag information is difficult because a user has the freedom of choosing any word he or she wants as a tag. Later, Golder et al. [5] analyzed the structure of collaborative tagging systems on top of Del.icio.us data and concluded that tag data follow a power-law distribution. Their studies back up our argument that some tags are extremely popular while others are rarely used.
Despite the challenges in using tag-entity information, many researchers continue to work in the field and have shown the potential of social tag clouds. Tso-Sutter et al. [14] used relationships among tags, users, and entities to recommend potentially interesting entities to users. Penev et al. [10] used WordNet to acquire terms related to a tag and applied TFIDF [11] similarity on both the tags and their related terms to find similar entities (pages in their research); in other words, they expanded tags with WordNet to interpret their meaning and measured the similarity between entities over the expanded keywords. Although we aim to solve similar problems, our approaches focus on using only tag information, because we believe, as Strohmaier et al. [13] suggested, that users use social tags for categorizing and describing resources.
We believe that our study can benefit many applications. For example, Givon et al. [4] showed that social tags can be used for recommendations in large-scale book datasets. Moreover, the keyword generation task can be considered a similar problem: Fuxman et al. [3] draw an analogy from a keyword to a URL and from an entity to a tag. Finding similar entities based on shared tags is similar to finding related keywords based on shared URLs that users click on.
Many researchers also work on tag ranking or tag aggregation. Recently, Wetzker et al. [15] focused on creating a mapping between personal tags to aggregate tags with the same meaning, Heymann et al. [7] used tag information to organize library data, Wu et al. [16] explained how to avoid noise and compensate for semantic loss, Liu et al. [8] studied how to rank tags based on importance, and Dattolo et al. [2] studied how to identify similar tags through detecting relationships between them.
8. CONCLUSION
In this paper, we investigated the problem of finding similar items with a query-by-example interface on top of social tag clouds. We introduced three approaches, built a search engine on top of them, and created a benchmark for evaluating user satisfaction by collecting 600 questionnaires. The experimental results suggest that social tag data, even though uncontrolled and noisy, are sufficient for finding similar items. Finally, we showed that both the voting model and the one-class probabilistic model reach high user satisfaction.
We explained two important challenges of utilizing tag information, popularity bias and the missing tag effect, and showed how to overcome these difficulties through partial-weight strategies and probability utilization. We showed that, in terms of user satisfaction, our algorithms are superior or at least comparable to Google Sets and the TFIDF model. Our approaches return hundreds of relevant entities without sacrificing quality in the top results. Moreover, our models rely only on social tag information.
Our proposed framework not only provides the ability to find similar items but also shows the application potential of social tag information. We demonstrate that the task can be accomplished by providing a query consisting of entities and using only tag information, even though the tag information is uncontrolled and noisy. Through this study, queries for finding similar items, such as "Honda or Toyota or similar", are handled properly. Our research also highlights the value of using social collaboration data, tag clouds, to refine existing search technologies.
9. REFERENCES
[1] Arampatzis, A., and Kamps, J. A study of query length. In
Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval, 2008, 811-812.
[2] Dattolo, A., Eynard, D., and Mazzola, L. An integrated approach to discover tag semantics. In Proceedings of the ACM Symposium on Applied Computing, 2011, 814–820.
[3] Fuxman, A., Tsaparas, P., Achan, K., and Agrawal, R. Using the wisdom of the crowds for keyword generation. In Proceedings of the International World Wide Web Conference, 2008, 61-70.
[4] Givon, S., and Lavrenko, V. Large Scale Book Annotation with Social Tags. In Proceedings of International AAAI Conference on Weblogs and Social Media, 2009, 210-213.
[5] Golder, S. and Huberman, B. A. Usage Patterns of Collaborative Tagging Systems. Journal of Information Science, 32(2), 2006, 198-208.
[6] Google Sets. http://labs.google.com/sets
[7] Heymann, P., Paepcke, A., and Garcia-Molina, H. Tagging
human knowledge. In Proceedings of the Web Search and Web Data Mining, 2010, 51-60.
[8] Liu, D., Hua, X., Yang, L., Wang, M., and Zhang, H. Tag ranking. In Proceedings of the International World Wide Web Conference, 2009, 351-360.
[9] Mathes, A. Folksonomies - cooperative classification and communication through shared metadata. Computer Mediated Communication, LIS590CMC (Doctoral Seminar), University of Illinois Urbana-Champaign, Dec. 2004.
[10] Penev, A., and Wong, R. K. Finding similar pages in a social tagging repository. In Proceeding of the 17th international conference on World Wide Web , 2008, 1091–1092.
[11] Robertson, S. E., and Sparck-Jones, K. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 1975, 129–146.
[12] Shirky, C. Ontology is overrated: categories, links and tags. http://www.shirky.com/writings/ontology-overrated.html, 2005.
[13] Strohmaier, M., Körner, C., and Kern, R. Why do Users Tag? Detecting Users' Motivation for Tagging in Social Tagging Systems. In Proceedings of International AAAI Conference on Weblogs and Social Media, 2010, 339-342.
[14] Tso-Sutter, K.H.L., Marinho, L.B., and Schmidt-Thieme, L. Tag-aware recommender systems by fusion of collaborative filtering algorithms. In Proceedings of the ACM Symposium on Applied Computing, 2008, 1995–1999.
[15] Wetzker, R., Zimmermann, C., Bauckhage, C., and Albayrak, S. I tag, you tag: translating tags for advanced user models. In Proceedings of the Web Search and Web Data Mining, 2010, 71-80.
[16] Wu, L., Yang, L., Yu, N., and Hua, X. Learning to tag, In Proceedings of the International World Wide Web Conference, 2009, 361-370.