
Finding similar items by leveraging social tag clouds

Chu-Cheng Hsieh University of California, Los Angeles

4732 Bolter Hall Los Angeles, CA 90095, USA

(+1)310-773-0733

[email protected]

Junghoo Cho

University of California, Los Angeles 4732 Bolter Hall

Los Angeles, CA 90095, USA (+1)310-773-0733

[email protected]

ABSTRACT

Recently, social collaboration projects such as Wikipedia and Flickr have been gaining popularity, and more and more social tag information is being accumulated. In this study, we demonstrate how to effectively use social tags created by humans to find similar items. We create a query-by-example interface, in which users find similar items by offering a few examples as a query. Our work aims to measure the similarity between a query, expressed as a group of items, and another item by utilizing tag information. We show that using human-generated tags to find similar items poses at least two major challenges: popularity bias and the missing tag effect. We propose several approaches to overcome these challenges. We build a prototype website allowing users to search over all entries in Wikipedia based on tag information, and then collect 600 valid questionnaire responses from 69 students to create a benchmark for evaluating our algorithms based on user satisfaction. Our results show that the presented techniques are promising and surpass the leading commercial product, Google Sets, in terms of user satisfaction.

Categories and Subject Descriptors
H.3.3 [INFORMATION STORAGE AND RETRIEVAL]: Information Search and Retrieval – clustering, information filtering, retrieval models, selection process.

General Terms
Algorithms, Experimentation

Keywords
World Wide Web (WWW), Social tags, Information retrieval, Entity resolution, Folksonomy

1. INTRODUCTION

The dominant interface of search engines today requires users to pinpoint their information needs with a few keywords. However, users sometimes find it difficult to identify the keywords that best describe their needs. For example, a user who plans to apply for a graduate school in California may issue the query “outstanding universities in California”. Many outstanding schools, such as Stanford University, are missing from the top results of all major search engines, because the keywords “outstanding” and “California” are not present in the web pages of those schools.

As a potential solution to this problem, we study methodologies for providing a “query-by-example” interface. In this interface, users provide a few representative examples of the ultimate information they seek. The system then returns the search results most similar to the examples provided; for instance, to find outstanding graduate schools, a user may issue a query like “Caltech, UC Berkeley” and expect the system to return similar outstanding schools in California such as UCLA and Stanford University.

The major challenge in building such a system is to identify similar items based on the user-provided set of examples. In this study, we leverage the tag clouds that are collaboratively created by web users in defining and measuring the similarity between multiple items. To verify the effectiveness of our solutions, we conduct experiments on one of the largest social collaboration projects, Wikipedia. In the Wikipedia dataset, we consider a wiki page or entry as an entity and a category label of a page as a tag. We aim to identify and rank entities that are similar to the user-provided examples based on tag information.

As other researchers [9] have noted, the uncontrolled nature of user-generated metadata, such as free-form tagging in Wikipedia, often causes problems of imprecision and ambiguity when these tags are used as a foundation of designing algorithms. In our study, we identify and deal with two challenges associated with free-form uncontrolled tag clouds: popularity bias and the missing tag effect (Section 3.2).

We propose several approaches to overcome the challenges and subsequently build a prototype website allowing users to issue a query by examples. Our results show that our techniques are able to return a sizable number of high-quality similar items even when the user provides only a few examples in the query. The proposed approaches are evaluated against a benchmark dataset built from 600 valid questionnaire responses collected from 69 students. In terms of user satisfaction, the questionnaire responses show that our techniques outperform Google Sets.

In summary, we highlight our contributions as follows:

We investigate how to extract a set of similar items through analyzing noisy social tags created by human beings, and show that the tag information is effective in identifying relevant similar items.

We identify and solve two challenges in tag-based search frameworks: popularity bias and the missing tag effect.

We propose and compare several models based on tag information. We build algorithms on top of these models, and study their advantages and disadvantages.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’12, March 25-29, 2012, Riva del Garda, Italy. Copyright 2012 ACM 978-1-4503-0857-1/12/03…$10.00.


We perform an extensive evaluation based on user surveys and show that, in terms of user satisfaction, our tag-based approaches outperform Google Sets in most testing cases.

2. PROBLEM OVERVIEW

The essential problem can be phrased as follows: users want to retrieve relevant items sharing some characteristics, and their queries are composed of representative examples with the desired characteristics. We consider the “query-by-example” task as the process of finding similar items in a dataset, where the query is composed of a number of items from the same dataset.

Assuming a user issues a query q: {x1, x2, …}, the input to the framework is the query itself, where q should be composed of entities, like {Caltech, UC Berkeley}. Our goal is to create a function to measure the similarity between an entity x in the dataset and the query q, where a higher score implies a higher similarity between the entity and the query.

2.1 Tag generation

We start the discussion of our tag-entity model with an artificial example, as shown in Fig. 1, to explain how tags are generated. In this example, we say a user reads the content of a Wikipedia page about Washington D.C.; then, the user labels the page with the tags City, Capital, North America, etc. Note that a tag does not have to be a word used in the content; for example, the concept of a metropolis is matched by the tag City. In most social collaboration projects, a user can select any phrase as a tag, i.e. free-form tagging. The phrase may be an existing tag, a modified phrasing of an existing tag, or even a newly created tag.

Fig. 1 The data model

Although we illustrate our tag-entity model by considering “wiki pages” as entities and their titles as identifiers, the content of entities is not limited to plain text. Our proposed solutions utilize only the tag information, ignoring the content of an entity. This simplification allows our techniques to be broadly applicable to non-textual datasets as well.

We use E to represent all the entities in a dataset, that is, the universe of entities, and t(x) to represent the tags associated with an entity x. The universe of tags, denoted by T, refers to all the tags in a dataset.

We illustrate our notations with the following example:

Example 1 Consider a dataset that contains four entities:
t(x1): {…}    t(x2): {…}
t(x3): {…}    t(x4): {…}
In this example, the dataset contains a total of six tags: {…}.
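The tag-entity model above amounts to a map from entity identifiers to tag sets. A minimal sketch in Python (the entities and tags here are illustrative stand-ins patterned on the Fig. 1 discussion, not the paper's actual Example 1 data):

```python
# A toy tag-entity dataset in the spirit of Fig. 1: each entity
# (identified by its title) maps to the set of tags users gave it.
dataset = {
    "Washington D.C.": {"City", "Capital", "North America"},
    "Beijing":         {"City", "Capital", "Asia"},
    "New York City":   {"City", "North America"},
    "Paris":           {"City", "Capital", "Europe"},
}

def t(x):
    """t(x): the set of tags associated with entity x."""
    return dataset[x]

E = set(dataset)                        # the universe of entities
T = set().union(*dataset.values())     # the universe of tags

print(sorted(t("Beijing")))
print(len(E), len(T))
```

Because the model ignores entity content entirely, any dataset that can be coerced into such a mapping (photos, products, wiki pages) fits the same framework.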

We define the function e(t) as an operation for acquiring all the entities associated with the tag t; for example, in Example 1, e(…) consists of the entities {…}. In addition, unions of tag sets, t(x1) ∪ t(x2) ∪ …, collect the tags of several entities at once.

Without referring to the content of an entity, we limit our similarity measurement to the tag information. One might argue that tag-entity relations are unreliable and that sometimes a reasonable relation is missing from an entity. We discuss these concerns regarding the imperfect nature of tags later, in Section 3.2.

3. THE INTERSECTION-DRIVEN APPROACH

The core challenge in providing a query-by-example service is to figure out what types of entities a user is looking for based on the input entities provided by the user. To convey how we approach this problem, we start with an artificial scenario.

In Example 1, when the user-provided input is the query q: {…, …}, what will be a reasonable interpretation of the user's intention? Both entities are associated with the tags City and Capital; that is, both entities are cities and capitals. Given this result, we may have two interpretations: the user is looking for cities that are also capitals, or the user is simply looking for cities, and the input examples happen to be capitals as well. Since the second interpretation is possible, not only the capitals but also the other cities should be weighed when we identify similar entities.

In the intersection-driven approach, if a tag is associated with only a subset of the input examples, we claim that the user is unlikely to look only for entities associated with such a tag. For instance, although the entity Beijing is also associated with the tag Asia, it is unlikely that the user is looking only for cities in Asia, because this tag is not associated with the other input entity.

3.1 Similarity Measurement

Here, we define the degree of similarity between an entity x and the query q as follows:

f(x, q) = |t(x) ∩ T∩(q)|    (1)

The symbol T∩(q) stands for the tags associated with all the entities in a query, i.e. T∩(q) = t(x1) ∩ t(x2) ∩ … ∩ t(x|q|), ∀ xi ∈ q. In Example 1, T∩(q) consists of the tags {City, Capital}.

According to our definition in Equation (1), the more tags in T∩(q) an entity is associated with, the more similar (to the query q) the entity is. For instance, an entity that is a city but not a capital is associated with the tag City but not with the tag Capital; by Equation (1), its ranking score is 1, meaning that the entity has one tag in common with T∩(q). Likewise, an entity associated with both tags has a ranking score of 2, meaning that the entity has two tags in common with T∩(q).
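The intersection-driven scoring of Equation (1) can be sketched as follows (the dataset is an illustrative stand-in, and `t_cap` plays the role of T∩(q)):

```python
# Intersection-driven similarity, Equation (1):
# f(x, q) = |t(x) ∩ T∩(q)|, where T∩(q) is the set of tags
# shared by every entity in the query.
dataset = {
    "Washington D.C.": {"City", "Capital", "North America"},
    "Beijing":         {"City", "Capital", "Asia"},
    "New York City":   {"City", "North America"},
    "Paris":           {"City", "Capital", "Europe"},
}

def t_cap(q):
    """T∩(q): tags associated with all entities in the query."""
    return set.intersection(*(dataset[x] for x in q))

def f(x, q):
    """Equation (1): number of tags x shares with T∩(q)."""
    return len(dataset[x] & t_cap(q))

q = ["Washington D.C.", "Beijing"]
# Paris (city AND capital) outranks New York City (city only).
for x in sorted(dataset, key=lambda x: -f(x, q)):
    print(x, f(x, q))
```

Note the failure modes the next section describes: a single missing City or Capital relation would shrink T∩(q), and a ubiquitous tag would inflate every score equally.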

In the above example, our approach ranks the entity that is both a city and a capital higher than the entity that is only a city. This result seems intuitive: no matter whether the user's intent is to find cities or to find cities that are also capitals, the former entity is always an appropriate answer, whereas the latter is an inappropriate answer if the user's intent is to find cities that are also capitals. Since we cannot deny such a possibility, this ranking is reasonable.

3.2 Challenges

We create Example 2 based on our observations of the Wikipedia dataset. The example illustrates and highlights the challenges we expect to encounter in a real dataset. In Example 2, an underline notation signifies that a tag is very popular, and a strike-through notation signifies that a tag-entity relation should exist but does not appear in the real dataset.

Example 2 Consider a dataset that contains six entities:
t(x1): {City, Capital, North America, Object}
t(x2): {City, Capital, Asia, China, Summer Olympic, Object}
t(x3): {City, Capital, Europe, Object}
t(x4): {City, North America, Object}
t(x5): {Summer Olympic, Object}
t(x6): {City, Europe, Object}
In this example, the dataset contains a total of eight tags: City, Capital, North America, Europe, Asia, China, Summer Olympic, and Object. Among them, Object is the very popular tag.

3.2.1 Missing Tag Effect

Not only in Example 2 but also in practice, a tag-entity relation can be missing. There are several possible reasons why this may happen. In any social collaboration project, a newly created entity might not be well tagged until its editors finish revising all the content of the entity. At the same time, the community may not be aware of a newly created tag, or may decide not to use the newly created tag for other entities.

Missing tag-entity relations could cause the system to misinterpret user intent. For example, suppose that a user's query q: {…, …} has been issued against the Example 2 dataset; since the tag … is missing from one input entity and the tag … is missing from the other, the intersection-driven approach interprets the user's intention as finding entities associated with the tag Object. Given these two input entities, we feel that such an interpretation is problematic because the interpretation “Object” is too general. The intersection-driven approach considers the entity … an irrelevant one, and suggests that … and … are equally similar to the query q.

The impact of the missing tag effect is more pronounced as more entities are included in a query. Suppose that a user is looking for a set of entities associated with a tag t; ideally, the tag t is expected to be associated with all the entities in the query. If we use p to represent the probability that a single tag-entity relation is missing, the probability of t not being associated with all input examples becomes

1 − (1 − p)^|q|    (2)

where |q| is the number of entities in the query. If the value of p is 20% and the number of input examples is 3, the probability of missing the tag in T∩(q) is close to a half (48.8%). When the number of input examples increases to 10, the chance of missing the desired tag increases to 89.26%.

3.2.2 Partial Weighting Generalization

To generalize the intersection-driven approach for addressing the missing tag effect, one solution is to assign scores in real numbers, instead of either zero or one, to tags that are associated with only some entities in a query. In the previous example, we now assign 0.5 to such tags, because we have two entities in the query and each of these tags is associated with only one of them. These tags, although not in T∩(q), are now assigned partial weights in real numbers, and thus contribute to the similarity function f. As a result, an entity can now score 1.5 points, with one fully shared tag contributing 1 point and one partially shared tag contributing 0.5 points, and the system returns a better ranking result ({1st position: …, 2nd position: …}).

The strategy weights tags that appear in T∩(q) the most heavily, and thus ensures that we follow the same intuition: the more tags in T∩(q) an entity is associated with, the more similar (to the query q) the entity is. Moreover, in case the intent cannot be captured in T∩(q), the strategy helps the system return some related results as long as some tags are associated with one or more entities in the query.

Such a generalization brings more entities into the results. For example, in Example 2, assigning partial weight to the tag … brings the entity … into the result. Nevertheless, introducing this suspicious result may not be a good idea. Therefore, if the system already returns satisfying results, we tend not to adopt the generalization unless the user asks for more results.

3.2.3 Popularity Bias

Both the results in previous research [5] and the results in our experiments show that the number of tags associated with an entity follows a power-law distribution. That is, only a few entities are associated with a large number of tags, and most entities are associated with a small number of tags.

We define the popularity of an entity x based on the number of tags associated with x, denoted by |t(x)|; an entity x is more popular than another entity x' if |t(x)| > |t(x')|. Similarly, we say that a tag t is more popular than another tag t' if |e(t)| > |e(t')|, where e(t) represents the set of all entities associated with t.

We define the popularity entity-bias as follows: if we measure the similarity between an entity and a query based on the tag information in the query, we tend to favor entities that are more popular. For example, suppose a query q consists of two entities that are both cities. To further process the query, we first identify the tag City shared by both entities and assign full weight (1.0) to it.

We then assign partial weight (0.5) to these tags: Capital, Asia, Summer Olympic, China, Europe, and Object. Since most of these tags originate from the more heavily tagged input entity, entities similar to that entity are more likely to be returned in the result and ranked in higher positions.

The popularity bias happens in tags as well, and we name this kind of bias the popularity tag-bias. We argue that although a popular tag like Object in Example 2 is associated with the input example(s), this popular tag is probably not the concept the user intends to search for. That is, the tag Object exists in T∩(q) probably only because it is popular, i.e. because |e(Object)| is a large number. In Example 2, if we randomly select two entities to create a query, Object is the most likely tag shown in T∩(q).

4. BALANCED VOTING MODEL

We now further refine the intersection-driven approach to include the following properties: (1) a popular entity in a query should not unfairly influence the results such that the results are similar only to the popular entity but not to the others, and (2) even if a few tag-entity relations are missing in the input examples, the system should be able to identify relevant entities based on tags associated with a subset of the input examples.

To achieve the above desired properties, in this model we define the similarity between an entity x and the query q as follows:

score(t, xi) = 1/|t(xi)|, if t ∈ t(xi); 0 otherwise    (3)

f(x, q) = Σ_{t ∈ t(x) ∩ T∪(q)} Σ_{xi ∈ q} score(t, xi)    (4)

Equation (3) can mitigate both the missing tag effect and the popularity entity-bias. In a nutshell, we consider every entity in the query as having a total score of one in Equation (3); then, each tag associated with the entity is assigned an equal portion of the entity's score. The non-zero assignment alleviates the missing tag effect in that we now value the significance of tags that are missing in some input examples. Furthermore, the assigned score of a tag is inversely proportional to |t(xi)|, i.e., the number of tags associated with xi. As a result, a tag originating from a popular entity will get a lower score than a tag originating from a less popular entity, alleviating the popularity entity-bias.

In Equation (4), we introduce the symbol T∪(q) to represent the tags associated with one or more entities in a query q, where the symbol ∪ stands for the notion “union”. We calculate the relevance score for every entity in the dataset with Equation (4) by summing up the relevance scores of all the tags associated with the entity; f(x, q) represents the relevance score of the entity x to the query q. Note that in this equation we assign a non-zero score to each tag in T∪(q); therefore, a tag associated with only a subset of the input entities will still get a non-zero score. The following example shows how the scores are computed under this definition.

In Example 2, each tag associated with one input entity gains 0.2 points because five tags are associated with that entity; similarly, each tag associated with the other input entity gets 0.33 points because it carries three tags. Namely, if a query consists of these two entities, a tag shared by both of them scores 0.2 + 0.33 = 0.53. Although the first entity is associated with more tags than the second, each tag of the second contributes a higher ranking score than a tag of the first.
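The balanced voting scores of Equations (3) and (4) can be sketched as follows; the stand-in query pairs an entity with five tags and an entity with three tags, mirroring the 0.2/0.33/0.53 arithmetic described above (the entity names are illustrative, not the paper's actual Example 2 data):

```python
# Balanced voting, Equations (3) and (4): each query entity holds a
# total "vote" of 1.0, split evenly across its tags; an entity's
# relevance is the sum of the votes received by the tags it carries.
dataset = {
    "Beijing": {"City", "Capital", "Asia", "China", "Summer Olympic"},  # 5 tags -> 0.2 each
    "Paris":   {"City", "Capital", "Europe"},                           # 3 tags -> 1/3 each
    "Lyon":    {"City", "Europe"},
    "Tokyo":   {"City", "Capital", "Asia"},
}

def tag_votes(q):
    """Accumulate Equation (3) over the query: tag -> total vote."""
    votes = {}
    for xi in q:
        share = 1.0 / len(dataset[xi])      # equal portion of xi's score
        for tag in dataset[xi]:
            votes[tag] = votes.get(tag, 0.0) + share
    return votes

def f(x, q):
    """Equation (4): sum the votes of every tag associated with x."""
    votes = tag_votes(q)
    return sum(votes.get(tag, 0.0) for tag in dataset[x])

q = ["Beijing", "Paris"]
votes = tag_votes(q)
print(round(votes["City"], 2))   # 0.2 + 0.33 = 0.53, as in the text
```

Because the heavily tagged entity spreads its single vote thinly, no one popular entity can dominate the ranking, while tags missing from some query entities still receive partial credit.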

Applying Equation (4) to every entity in the dataset, one entity, for instance, gets a relevance score of 1.39 points by summing up the scores of its tags (0.2, 0.53, 0.33, and 0.33).

Tags belonging to the set T∩(q) contribute the highest ranking scores. Therefore, the balanced voting approach clings to our belief that tags shared by all entities in a query likely capture a user's intention; meanwhile, it compensates for the biases caused by a popular entity in a query. Compared to our intersection-driven approach, each tag of a popular entity now contributes less influence in terms of ranking scores. This difference makes our balanced voting approach less sensitive to the popularity entity-bias.

5. ONE-CLASS PROBABILISTIC MODEL

So far, we have viewed the similarity measurement as determining a good weighting scheme for tags associated with the input examples. Alternatively, we can consider the problem as computing the probability that an entity is the target a user is looking for. Given some entities as possible results, we can then rank the entities based on the calculated probabilities.

We start by thinking about how people create a query for finding similar items. We think that a user first has a desired property in mind, and then the user tries to recall some entities with the desired property based on his/her knowledge. If the intent can be captured by a tag (or tags) in a dataset, the entities associated with the tag(s) in the dataset are considered similar entities.

We propose the one-class probabilistic model based on the following assumptions. At first, we assume that a user's intent corresponds to one tag in a dataset. Since the intent is unpredictable, we claim that the desired tag td is randomly selected from all tags. Then, once the desired tag is selected, the user randomly selects |q| entities from e(td) to synthesize the query, where e(td) are all the entities that are associated with td.

5.1 Measuring Similarity Using Probability

In order to better understand how our one-class probabilistic model works, in this section we discuss the model under the premise that there is no missing tag in the dataset. We reconsider the problem of missing tags in the next section.

When a query q is given, based on our one-class probabilistic model, the probability that x is what the user is looking for can be computed as

P(x|q) ≜ Σt P(x | td = t) * P(td = t | q)    (5)

In Equation (5), P(td = t | q) stands for the probability that a tag t is the desired tag when the system is given the user query q, and P(x | td = t) stands for the probability that the system should return the entity x if the desired tag is t. Here, we assume that all tags are independent from each other, so we can sum up the probability scores when more than one tag is considered.

Page 5: Finding Similar Items by Leveraging Social Tag Cloudscho/papers/SAC12.pdf · Flickr have been gaining popularity, and more and more social tag information is being accumulated. In

The probability scores are independent from each other, so we can sum up all the scores when more than one tag is considered. Using Bayes' theorem, we can find an equivalent form of Equation (5), as shown in Proposition 1.

Proposition 1  Ranking entities based on P(e is similar | Q) is equivalent to ranking entities through the formula ∑_{t ∈ T(e) ∩ T∩} P(t* = t | Q). That is,

P(e is similar | Q) ∝ ∑_{t ∈ T(e) ∩ T∩} P(t* = t | Q)    (6)

(Proof) Because of space constraints, we defer all derivations and proofs in this section to the appendix*.

* The appendix can be downloaded from http://oak.cs.ucla.edu/~chucheng/publication/SAC_2012_Appendix_Chucheng.pdf

Given a query Q, |Q| is the number of input examples. In this model, if the tag t is the desired tag, representing the user's intention, our premise says that the query is created by randomly selecting |Q| entities from the set E(t). |E(t)| represents the number of entities associated with a tag t, so we have C(|E(t)|, |Q|) distinct choices to create a query because the user can randomly select |Q| entities from E(t), where C(n, k) stands for the number of combinations in a set with n distinct elements. Since the query is one of the C(|E(t)|, |Q|) outcomes, the probability of the query given t being the desired tag is

P(Q | t* = t) = 1 / C(|E(t)|, |Q|)    (7)

In this model, entities in the query are selected from the entities associated with the desired tag t*, representing the desired property in the user's mind. If all tag-entity relations in a dataset are established, the desired tag must be one of the tags that are shared by all entities in the query, i.e., t* ∈ T∩. In addition, we know an entity e can be a similar entity to the query only if the desired tag is associated with the entity, i.e., t* ∈ T(e). Thus, for any entity e, only tags in T(e) ∩ T∩ are considered.

If an entity is associated with two or more tags in T(e) ∩ T∩, we sum up all probability values to get the probability of the entity being a similar entity. Then, we rank every entity in a dataset based on this probability.

We summarize the above discussions as follows:

Proposition 2  If every tag is correctly associated with all entities to which it is related (no missing tag), we show that

P(t* = t | Q) ∝ 1 / C(|E(t)|, |Q|)    (8)

(Proof) See the appendix.
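As a concrete illustration of the ranking idea in Propositions 1 and 2, the following is a minimal Python sketch (not the authors' implementation; the function name and the toy entities and tags are invented). It weights each tag t shared by all examples by the reciprocal of the number of ways the |Q| examples could have been drawn from E(t), then sums those weights over each candidate entity's tags.

```python
from collections import defaultdict
from math import comb


def rank_entities(tags_of, query):
    """Rank candidate entities for a query-by-example request.

    tags_of: dict mapping entity -> set of tags.
    query:   list of example entities (the query Q).
    A tag t shared by all examples gets weight 1 / C(|E(t)|, |Q|),
    so a rare shared tag counts more than a popular one.
    """
    E = defaultdict(set)                  # E[t]: entities carrying tag t
    for entity, tags in tags_of.items():
        for t in tags:
            E[t].add(entity)
    t_cap = set.intersection(*(tags_of[e] for e in query))   # T∩
    q = len(query)
    weight = {t: 1.0 / comb(len(E[t]), q) for t in t_cap}
    scores = {}
    for entity, tags in tags_of.items():
        if entity in query:
            continue
        score = sum(weight.get(t, 0.0) for t in tags)        # sum of Eq. (8) weights
        if score > 0:
            scores[entity] = score
    return sorted(scores, key=scores.get, reverse=True)
```

On a toy dataset where a tag such as "ev" is shared only by the two examples while "car" is common, the rare tag dominates the ranking, which is exactly the popularity-bias mitigation described above.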

Equation (8) has an advantage of alleviating the popularity tag-bias because the system will assign a low value to a popular tag. The equation conveys the notion: when a rarely seen tag is shared by all entities of the query, the probability that the user desires the rarely seen tag is high, because it is unlikely that the input examples are associated with the tag by chance. As a result, Equation (8) places more value on a rarely seen tag (|E(t)|: a small number) than on a popular common tag (|E(t)|: a large number).

In Example 2, suppose that we see a tag t shared by the examples. We could argue that t appears because the user is looking for entities with that property, or because t is associated with almost every entity in the dataset. If |E(t)| is a large number, the chance of the latter is high, and thus t might not be the desired tag. In other words, the model alleviates the popularity tag-bias through assigning a low value to a popular tag.

5.2 Refinement of the Approach
In this section, we deal with the missing tag effect. We start with introducing some new notations for extending our one-class probabilistic model.

The function m(t) returns the entities that are relevant to the tag t but whose tag-entity relation is missing, i.e., m(t) = { e : e is relevant to t but the relation is not recorded }, and |m(t)| is the number of entities missing the tag t.

The symbol |m(t, Q)| denotes the number of input examples in a query that are missing the tag t. For example, in Example 2, |m(t, Q)| has value 1 for a tag when the query contains three input examples and only one of them is missing the tag.

The symbol N stands for the number of all entities in our dataset; the symbol T∪ represents all tags associated with "at least one" input example. Then, we generalize Proposition 2 as follows:

Proposition 3  When considering the missing tag effect, we show that ∑_{t ∈ T(e) ∩ T∩} P(t* = t | Q) in Proposition 1 can be approximated by the following formula:

P(e is similar | Q) ∝ ∑_{t ∈ T(e) ∩ T∪} (1 / (|E(t)| + |m(t)|))^{|Q|} ∗ (|m(t)| / (|E(t)| + |m(t)|))^{|m(t,Q)|}    (9)

where we claim that the dataset has the following properties: (1) N ≫ |E(t)| + |m(t)| and N ≫ |T(e)|; (2) |E(t)| + |m(t)| ≫ |Q|.

(Proof) See the appendix.

The two properties in Proposition 3 are easily satisfied in practice.
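The two refinement factors of Section 5.2 can be sketched as a small helper. This is a toy reading, not the exact normalization proved in the appendix: `alpha` is a hypothetical per-example missing probability (the paper's experiments assume 50% of tag-entity relations are missing, so |m(t)| ≈ |E(t)| and alpha ≈ 0.5), and the function name is ours.

```python
def refined_tag_weight(E_t, m_t, m_tQ, alpha=0.5):
    """Per-tag weight combining the two refinement factors of Section 5.2.

    E_t:  |E(t)|, entities recorded as carrying tag t.
    m_t:  |m(t)|, entities relevant to t but missing the relation.
    m_tQ: |m(t,Q)|, query examples missing the tag t.
    """
    # Popularity refinement: 1/(|E(t)|+|m(t)|) generalizes the 1/|E(t)|
    # of Proposition 2, still favoring rarely-seen tags.
    popularity_part = 1.0 / (E_t + m_t)
    # Missing-tag adjustment: equals 1 when every example carries the tag
    # (m_tQ == 0) and decays exponentially as more examples miss it.
    missing_part = alpha ** m_tQ
    return popularity_part * missing_part
```

The helper reproduces the qualitative behavior discussed in the text: no adjustment when the tag is shared by all input entities, an exponential penalty as |m(t,Q)| grows, and lower weight for popular tags.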


The first property is satisfied because the number of all entities in a dataset is larger than the number of entities associated with any individual tag. In most social collaboration projects, the number of entities is a very large number, for example, 3,459,565 entities in our experiment dataset (Wikipedia), so the property N ≫ |E(t)| + |m(t)| is satisfied. The second property is also satisfied because we believe that in practice a user provides a small portion of the desired results as input examples and expects a sizable number of results [1].

The part of the equation (|m(t)| / (|E(t)| + |m(t)|))^{|m(t,Q)|} can be interpreted as an adjustment for missing a tag in some input examples. When the tag is shared by all the input entities, the exponent is zero and this part of the equation becomes one, meaning that no adjustments are required. As the number of input entities missing a tag increases (|m(t, Q)| increases), the value decreases exponentially. Thus, this part adheres to the notion we learned in the naïve intersection model that tags in T∩ are important.

When we consider the missing tag effect, the part 1 / (|E(t)| + |m(t)|) can be interpreted as a refinement for addressing the popularity of a tag. The concept is identical to the 1/|E(t)| in Proposition 2, where we tend to place more value on a rarely-seen tag in the query than on a popular tag, so the model alleviates the popularity tag-bias.

In our experiments, since the ratio of missing tags is unknown, we simply make the following assumption: 50% of tag-entity relations are missing. Thus, if half of the tag-entity relations are missing, we could conclude |m(t)| ≈ |E(t)|. In practice, making a reasonable approximation of the tag-missing ratio requires an understanding of the target datasets and some heuristic trials.

6. EXPERIMENT
Evaluating the effectiveness of different approaches is a challenging task because the quality of results is subjective and no standard corpus exists. Thus, we think the best way to evaluate a ranking algorithm is to build a search engine and see how well users perceive our new ranking results.

Some IR communities, such as TREC (Text REtrieval Conference; http://trec.nist.gov), provide datasets for evaluating keyword search. Unfortunately, those datasets are not collected for testing query-by-example search. As a result, we download the dataset of Wikipedia and create a search interface on top of the dataset for collecting user surveys+.

+ The user surveys can be downloaded from http://oak.cs.ucla.edu/~chucheng/publication/qbe_survey_results_public.zip

In this section, we are going to provide the overview of our dataset, explain the design of the surveys, and then analyze the users' satisfaction scores for each proposed algorithm.

6.1 A study of Dataset – Wikipedia
We implement our system based on a Wikipedia snapshot dumped on November 3rd, 2009. Wikipedia is the largest collaborative encyclopedia, and it contains more than 3 million articles in English.

The collaborative nature of Wikipedia means that all tags are generated by humans, and some tag-entity relations might be missing.

We consider every wiki page as an entity, so the name (page title) of the wiki page becomes a unique identifier of the entity. For tag information, every wiki page contains "category" entries, and we consider each category as a tag. When a category is labeled with an article, we consider the category (the tag) associated with the article (the entity). The dataset contains 471,443 unique tags, 3,459,565 entities, and 13,196,971 associations.

Fig. 2 Size of category distribution

In Fig. 2, the x-axis refers to |E(t)|, the number of entities associated with the tag t, and the y-axis refers to the number of tags with the size |E(t)|. For example, the y value at "x=1" is 79,870, which means that 79,870 tags (about 16% of all tags) appear only once. These tags are possibly newly created, or they have not yet been accepted by other users. On the other hand, the right-most point of the line in Fig. 2 is the Living People tag, and its y value, one, means that only this tag is associated with 413,981 entities (value of x-axis).

The chart in Fig. 2 shows that some tags are extremely popular and many tags are unpopular. Our statistics show that about 16% of tags are singleton, 74% of tags are associated with 2 to 50 entities, and less than 10% of tags are associated with more than fifty entities (|E(t)|>50). Golder et al. [5] reported a similar distribution on Del.icio.us data where a power law distribution was also observed.

6.2 Effectiveness Evaluation
Evaluating the similarity between an entity and a query is difficult because whether the entity is similar to another entity in the query is subjective. For example, someone might argue that the entity Nikon (a manufacturer of cameras) is not similar to the entity Toyota (a manufacturer of cars), but another person may feel they are similar because both of them are Japanese companies. Thus, we decide to create a benchmark through conducting user surveys.

6.2.1 Experiment design
We collected 600 valid questionnaires from 69 students in UCLA to create a benchmark for evaluating user satisfaction. In each questionnaire, the system randomly selects one of 10 pre-set scenarios (listed in the appendix), where every scenario contains two to three entities to form a query.


Once the scenario is selected, the query is issued to our system and we collect the top 200 results from each algorithm. We mix all the top 200 results, and the questionnaire is then generated through randomly selecting entities from the mixed set. Note that we run a customized questionnaire generation process so that high-rank results are reviewed by more persons. High-ranked results, the results in the first few pages, are important because users tend to read results according to the order of entities. With limited resources (questionnaire takers), we are more interested in knowing how each approach performs in the high-rank results. Therefore, in the process of questionnaire generation, 30 questions are drawn from high-rank entities, particularly the entities in the top 40 results of any model, and 10 questions are drawn from the other entities.

Fig. 3 A screen clip of a questionnaire

To avoid bias, questionnaire takers are unaware of how the questionnaires are generated. They are told that entities in a query belong to a group where all entities share some properties (the intent of the query). They are asked to examine whether an entity in a questionnaire belongs to the intent of the query. The intent of the query, such as Car Manufacturers, is not revealed to the questionnaire taker. They are asked to read the Wikipedia pages and make an evaluation of possible intentions to the query. Fig. 3 is a (partial) snapshot of a questionnaire.

Every questionnaire contains 50 entities; however, 10 of the 50 entities are known to be clearly unrelated to the query, for filtering out invalid surveys. If any pre-set unrelated entity is marked as a positive example, we rule out the entire questionnaire. Thus, only 600 of 714 questionnaires are considered valid after the screening process, and a total of 24,000 feedbacks are collected.

6.2.2 Benchmarks and Results
We use Equation (10) to calculate the satisfaction score of an entity to a query. Whenever a questionnaire taker reports a good answer, the corresponding entity earns the full credit of one. When he cannot make a clear judgment, we still assign 0.5 point to the entity. After averaging the scores of an entity, a high satisfaction score implies that most users believe the entity and the entities in the query are similar, and a score close to 0.5 could imply a situation in which users hold their own opinions. We then consider the satisfaction scores as benchmarks for evaluating different algorithms.

satisfaction = (#good + 0.5 ∗ #unclear) / #total    (10)

Fig. 4 contains the comparison results of all models based on the ten scenarios we use in experiments. In addition to the algorithms proposed in the paper, we add Google Sets [6] and the term frequency–inverse document frequency (TFIDF) algorithm into the comparison. The x-axis represents the top results, and the y-axis indicates the running average of satisfaction scores for the top results.
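The scoring rule of Equation (10) is simple enough to state as a one-line helper; the counter names (good, unclear, total) are ours, not the paper's:

```python
def satisfaction(n_good, n_unclear, n_total):
    """Satisfaction score of an entity over all collected judgments:
    full credit (1.0) for a 'good' answer, half credit (0.5) when the
    questionnaire taker cannot make a clear judgment, zero otherwise."""
    if not 0 <= n_good + n_unclear <= n_total:
        raise ValueError("counts exceed total answers")
    return (n_good + 0.5 * n_unclear) / n_total
```

Under this rule an entity judged "good" by 3 of 5 takers and "unclear" by 2 scores 0.8, while one on which all takers are unclear scores exactly 0.5, the "users hold their own opinions" case above.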

Fig. 4 Comparison of different models (top 40)

In Fig. 4, our one-class probabilistic model outperforms the other models in the top 10 results; our balanced voting model and our one-class probabilistic model have high user satisfaction when comparing the top 40 results. We notice that the intersection model and Google Sets return approximately only 25 entities on average. As a result, the average user satisfaction promptly drops in that a user cannot find any results. The probabilistic model reports a promising result: 5.26% more satisfaction than Google Sets, based on 2400 questions in 600 questionnaires.

6.2.3 Discussion
Although the algorithms of Google Sets are not disclosed, we compare our results against its results because Google is currently the leading search engine, and Google Sets aims to solve the same problem: finding similar items through queries consisting of examples.

Fig. 5 One class probabilistic model vs. Google Sets

Fig. 5 demonstrates a query for which most survey responders feel the results provided by Google Sets are inferior or questionable. We create a query {Beijing, Atlanta}, where the intent behind the query is to find cities that hosted the Olympic Games. Google Sets seems to interpret our intent as finding cities in America. In contrast, our one-class probabilistic model identifies cities that hosted the Olympic Games.

Though which interpretation is better is controversial, during interviews after the survey, many responders said that their first impressions after seeing {Beijing, Atlanta} shown in the query were about the event that Beijing hosted the 2008 Summer Olympics, or about the notion that both of them are metropolises. Even responders preferring the notion "metropolises" were still unhappy when seeing a result that is biased towards Atlanta and neglects properties contributed by Beijing.

7. RELATED RESEARCH
The study of social collaboration tagging systems has been attracting attention from researchers. Mathes [9] investigated the


social tag data and pointed out that it was fundamentally chaotic. Shirky [12] also argued that using tag information is difficult because a user has the freedom to choose any word he or she wants as a tag. Later, Golder et al. [5] analyzed the structure of collaborative tagging systems on top of Del.icio.us data and concluded that tag data follow a power law distribution. Their studies back up our argument that some tags are extremely popular while others are rarely used.

Despite the challenges in using tag-entity information, many researchers continue to work in the field and have shown the potential of social tag clouds. Tso-Sutter et al. [14] used relationships among tags, users, and entities to recommend potentially interesting entities to users. Penev et al. [10] used WordNet to acquire terms related to a tag and applied TFIDF [11] similarity on both the tags and their related terms to find similar entities (pages in their research); in other words, they expanded tags with WordNet to interpret their meaning, measuring the similarity between entities based on keywords. Although we aim to solve similar problems, our approaches focus on using only tag information, because we believe, as Strohmaier et al. [13] suggested, that users use social tags for categorizing and describing resources.

We believe that our study can benefit many applications. For example, Givon et al. [4] showed that social tags can be used for recommendations in large-scale book datasets. Moreover, the keyword generation task can be considered a similar problem: Fuxman et al. [3] relate keywords to clicked URLs much as we relate entities to tags, so finding similar entities based on shared tags is similar to finding related keywords based on shared URLs that users click on.

Many researchers also work on tag ranking and tag aggregation. Recently, Wetzker et al. [15] focused on creating a mapping between personal tags to aggregate tags with the same meaning; Heymann et al. [7] used tag information to organize library data; Wu et al. [16] explained how to avoid noise and compensate for semantic loss; Liu et al. [8] studied how to rank tags by importance; and Dattolo et al. [2] studied how to identify similar tags by detecting relationships between them.

8. CONCLUSION
In this paper, we investigated the problem of finding similar items with a query-by-example interface on top of social tag clouds. We introduced three approaches, built a search engine on top of them, and created a benchmark for evaluating user satisfaction by collecting 600 questionnaires. The experiment results suggest that social tag data, even though they are uncontrolled and noisy, are sufficient for finding similar items. Finally, we show that both the voting model and the one-class probabilistic model reach high user satisfaction.

We explain two important challenges of utilizing tag information, popularity bias and the missing tag effect, and show how to overcome these difficulties through partial weight strategies and probability models. We show that, in terms of user satisfaction, our algorithms are superior or at least comparable to Google Sets and the TFIDF model. Our approaches return hundreds of relevant entities without sacrificing quality in the top results. Moreover, our models rely only on social tag information.

Our proposed framework not only provides the ability to find similar items, but also shows the application potential of social tag information. We demonstrate that the task can be accomplished by providing a query consisting of entities and using only tag information, even though the tag information is uncontrolled and noisy. Through this study, queries for finding similar items, such as "Honda or Toyota or similar", are handled properly. Our research also highlights the value of using social collaboration data, tag clouds, to refine existing search technologies.

9. REFERENCES
[1] Arampatzis, A., and Kamps, J. A study of query length. In Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval, 2008, 811-812.

[2] Dattolo, A., Eynard, D., and Mazzola, L. An integrated approach to discover tag semantics. In Proceedings of the ACM Symposium on Applied Computing, 2011, 814–820.

[3] Fuxman, A., Tsaparas, P., Achan, K., and Agrawal, R. Using the wisdom of the crowds for keyword generation. In Proceedings of the International World Wide Web Conference, 2008, 61-70.

[4] Givon, S., and Lavrenko, V. Large Scale Book Annotation with Social Tags. In Proceedings of International AAAI Conference on Weblogs and Social Media, 2009, 210-213.

[5] Golder, S., and Huberman, B. A. Usage Patterns of Collaborative Tagging Systems. Journal of Information Science, 32(2), 2006, 198-208.

[6] Google Sets. http://labs.google.com/sets
[7] Heymann, P., Paepcke, A., and Garcia-Molina, H. Tagging human knowledge. In Proceedings of the Web Search and Web Data Mining, 2010, 51-60.

[8] Liu, D., Hua, X., Yang, L., Wang, M., and Zhang, H. Tag ranking. In Proceedings of the International World Wide Web Conference, 2009, 351-360.

[9] Mathes, A. Folksonomies - cooperative classification and communication through shared metadata. Computer Mediated Communication, LIS590CMC (Doctoral Seminar), University of Illinois Urbana-Champaign, Dec. 2004.

[10] Penev, A., and Wong, R. K. Finding similar pages in a social tagging repository. In Proceedings of the 17th International Conference on World Wide Web, 2008, 1091–1092.

[11] Robertson, S. E., and Sparck-Jones, K. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 1975, 129–146.

[12] Shirky, C. Ontology is overrated: categories, links and tags. http://www.shirky.com/writings/ontology-overrated.html, 2005.

[13] Strohmaier, M., Körner, C., and Kern, R. Why do Users Tag? Detecting Users' Motivation for Tagging in Social Tagging Systems. In Proceedings of International AAAI Conference on Weblogs and Social Media, 2010, 339-342.

[14] Tso-Sutter, K.H.L., Marinho, L.B., and Schmidt-Thieme, L. Tag-aware recommender systems by fusion of collaborative filtering algorithms. In Proceedings of the ACM Symposium on Applied Computing, 2008, 1995–1999.

[15] Wetzker, R., Zimmermann, C., Bauckhage, C., and Albayrak, S. I tag, you tag: translating tags for advanced user models. In Proceedings of the Web Search and Web Data Mining, 2010, 71-80.

[16] Wu, L., Yang, L., Yu, N., and Hua, X. Learning to tag. In Proceedings of the International World Wide Web Conference, 2009, 361-370.
