An Integrated Approach to Extracting Ontological Structures from Folksonomies Huairen Lin, Joseph...

An Integrated Approach to Extracting Ontological Structures from Folksonomies

Huairen Lin, Joseph Davis, Ying Zhou
ESWC 2009

Hyewon Lim
October 9th, 2009

Contents
– Introduction
– Related Research
– System Architecture
– Experimental Evaluation
– Conclusion and Future Work

Introduction (1/3)

Introduction (2/3)

Problems with folksonomies
– Tags can be idiosyncratic
– Not understood by many users
– Concepts and internal structure are not explicit to the machine

Various solutions have been proposed
– Refine the query result
    Clustering, tag cloud
– Take an existing upper ontology as the base structure
    WordNet

An integrated approach
– Knowledge extracted from folksonomies + relevant terms from an existing upper ontology

Introduction (3/3)

Ontological structure extracted from folksonomies can be useful in many areas of collaborative tagging systems (CTS)
– Providing multi-dimensional views
– Cataloguing and indexing
– Query translation and tagging suggestion

Can enhance precision and recall
– by matching the query keywords and the potential results at the level of semantics

Related Research

Cosine similarity between tags
– Measure the distance from one tag to another
– Organize them into a hierarchical tree

Association rule mining has been adopted to analyze and structure folksonomies
– Output of association rule mining on a folksonomy dataset
    Association rules like A→B

To discover the relationships within tags in clusters, several existing ontology resources can be used as reference
– E.g. WordNet
    An et al., “Automatic Generation of Ontology from the Deep Web”
    Laniado et al., “Using WordNet to turn a folksonomy into a hierarchy of concepts”

System Architecture (1/7)

Vocabularies used in folksonomies
– Standard tags: genomics
– Compound tags: evolutionary-genomics
– Jargon tags: scientometrics, CSCW
– Other nonsense tags: misspelled tags

System Architecture (2/7) - Low Support Association Rule Mining

Aim of association rule mining in CTS
– Generate associations in the form ta → tc between tags ta and tc that have support and confidence above certain thresholds

Traditional association rule mining
– Set a relatively high support and confidence threshold
– This is likely to miss important associations among tags
    Tags in folksonomies usually follow a Zipf distribution
    The majority of the tags do not occur very frequently in the dataset

Low support association rule mining
– Very low support threshold
– Lower support may bring a lot of noise into the rule set
    Cosine similarity to filter out possible noise
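
For reference, the three measures can be written out as below. These are the standard set-based definitions, given here as an assumption about how the slides use the terms; the notation R(t) (the set of tagged objects carrying tag t) and N (the total number of tagged objects) is introduced only for this note.

```latex
\[
\operatorname{supp}(t_a \rightarrow t_c) = \frac{|R(t_a) \cap R(t_c)|}{N},
\qquad
\operatorname{conf}(t_a \rightarrow t_c) = \frac{|R(t_a) \cap R(t_c)|}{|R(t_a)|},
\qquad
\cos(t_a, t_c) = \frac{|R(t_a) \cap R(t_c)|}{\sqrt{|R(t_a)|\,|R(t_c)|}}
\]
```

Under these definitions, a rare tag pair can have very low support yet high cosine similarity, because the cosine is normalized by how often each tag occurs rather than by N. That is why it can serve as a noise filter once the support threshold is lowered.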

System Architecture (3/7) - Low Support Association Rule Mining

LApriori algorithm (a simplified version of the Apriori algorithm)
– Only calculates the relationship between tag pairs

* Apriori algorithm
– Finds frequent itemsets using candidate generation
– Finds Lk-1, the set of frequent (k-1)-itemsets, and uses Lk-1 to find Lk
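
A minimal Python sketch of pair-only rule mining with a cosine filter, in the spirit of the simplification described above. It is not the authors' LApriori code: the function name `mine_pair_rules`, the representation of each post as a set of tags, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def mine_pair_rules(transactions, min_support, min_confidence, min_cosine):
    """Return sorted (antecedent, consequent, support, confidence, cosine) tuples."""
    n = len(transactions)
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in transactions:
        unique = sorted(set(tags))
        tag_counts.update(unique)
        # Only 2-itemsets are counted; no larger candidates are generated.
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), both in pair_counts.items():
        support = both / n
        cosine = both / sqrt(tag_counts[a] * tag_counts[b])
        if support < min_support or cosine < min_cosine:
            continue
        # Each surviving pair may yield a rule in either direction.
        for ante, cons in ((a, b), (b, a)):
            confidence = both / tag_counts[ante]
            if confidence >= min_confidence:
                rules.append((ante, cons, support, confidence, cosine))
    return sorted(rules)


# Toy data: each post is the set of tags on one object.
posts = [
    {"wine", "beverage"},
    {"wine", "beverage", "food"},
    {"milk", "beverage"},
    {"beverage", "food"},
    {"food"},
]
rules = mine_pair_rules(posts, min_support=0.2, min_confidence=0.8, min_cosine=0.2)
for ante, cons, support, confidence, cosine in rules:
    print(f"{ante} -> {cons}  sup={support:.2f} conf={confidence:.2f} cos={cosine:.2f}")
# milk -> beverage  sup=0.20 conf=1.00 cos=0.50
# wine -> beverage  sup=0.40 conf=1.00 cos=0.71
```

Counting only pairs keeps the work to a single pass over the data, which is what makes a very low support threshold affordable.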

System Architecture (4/7) - Standard Tags

Use WordNet as the upper ontology
– Compute each semantic relation between tags in terms of the hypernym relation from WordNet
– Possible semantic relations
    more general (⊇), less general (⊆), equivalence (=)

In folksonomies, additional definitions
– essential tags: all distinct tags existing in association rules filtered by thresholds
– candidate hypernyms: hypernyms of a tag that also appear among its related tags
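
The three relations can be illustrated with a small Python sketch. The hypernym links and the synonym entry below are hand-written stand-ins for WordNet lookups, and `HYPERNYM`, `SYNONYMS`, and `relation` are hypothetical names chosen for this example only.

```python
# Toy hypernym links (child -> parent); a hand-written stand-in for WordNet.
HYPERNYM = {"wine": "beverage", "milk": "beverage", "beverage": "food"}
# Hypothetical equivalence mapping (terms treated as the same concept).
SYNONYMS = {"drink": "beverage"}


def canonical(tag):
    """Map a tag to its representative term."""
    return SYNONYMS.get(tag, tag)


def ancestors(tag):
    """All hypernyms of a tag, nearest first."""
    chain = []
    current = HYPERNYM.get(canonical(tag))
    while current is not None:
        chain.append(current)
        current = HYPERNYM.get(current)
    return chain


def relation(a, b):
    """Return '=', '⊇' (a is more general), '⊆' (a is less general), or None."""
    a_c, b_c = canonical(a), canonical(b)
    if a_c == b_c:
        return "="
    if a_c in ancestors(b_c):
        return "⊇"
    if b_c in ancestors(a_c):
        return "⊆"
    return None


print(relation("food", "wine"))       # ⊇
print(relation("wine", "beverage"))   # ⊆
print(relation("drink", "beverage"))  # =
print(relation("wine", "milk"))       # None
```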

System Architecture (5/7) - Standard Tags

Folk2Onto algorithm
Tags: {food, beverage, wine, milk}

For tag “wine”:
① Uk = {}, candidate hypernym = {food}; then Uk = {food}
② Uk = {beverage}, candidate hypernym = {food}; then Uk = {beverage} – break!
③ Uk = {food}, candidate hypernym = {beverage}; then Uk = {beverage}

[Figure: hierarchy with food at the top, beverage below it, and wine and milk at the bottom; cases ①, ②, ③ marked]
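
The three cases in the walkthrough amount to keeping the most specific hypernym seen so far. A minimal Python sketch under that reading, assuming a hand-written child → parent map in place of WordNet; `HYPERNYM`, `ancestors`, and `nearest_parent` are illustrative names, not the paper's Folk2Onto implementation.

```python
# Toy hypernym links (child -> parent); a hand-written stand-in for WordNet.
HYPERNYM = {"wine": "beverage", "milk": "beverage", "beverage": "food"}


def ancestors(tag):
    """All hypernyms of a tag, nearest first."""
    chain = []
    current = HYPERNYM.get(tag)
    while current is not None:
        chain.append(current)
        current = HYPERNYM.get(current)
    return chain


def nearest_parent(tag, related_tags):
    """Pick the most specific hypernym of `tag` among its related tags."""
    chain = ancestors(tag)
    best = None  # plays the role of Uk in the walkthrough
    for candidate in related_tags:
        if candidate not in chain:
            continue  # not a hypernym of this tag
        if best is None:
            best = candidate  # case 1: nothing chosen yet
        elif chain.index(candidate) < chain.index(best):
            best = candidate  # case 3: candidate is more specific
        # case 2: the current choice is already more specific, so keep it
    return best


tags = {"food", "beverage", "wine", "milk"}
hierarchy = {tag: nearest_parent(tag, tags - {tag}) for tag in sorted(tags)}
print(hierarchy)
# {'beverage': 'food', 'food': None, 'milk': 'beverage', 'wine': 'beverage'}
```

The result does not depend on the order in which related tags are visited, which matches the slide: whether "food" or "beverage" is seen first, "wine" ends up under "beverage".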

System Architecture (6/7) - Compound Tags

Compound tags are non-standard terms
– Cannot be processed by WordNet without transformation

Jawbone (by Mike Wallace)
– If they match certain defined criteria, the compound tags are retained and represented by their base term when searching for a more general parent
– EndWithFilter
    The last token is used to represent the whole compound
    collaborative_tagging → tagging
– StartsWithFilter
    The first token is used to represent the whole word
    Applied after the EndWithFilter
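
A minimal Python sketch of the last-token-then-first-token rule described above. It does not call Jawbone; the `LEXICON` set is a hypothetical stand-in for a WordNet lookup, and splitting on hyphens, underscores, and whitespace is an assumption about how compound tags are delimited.

```python
import re

# Hypothetical mini-lexicon standing in for WordNet lookups.
LEXICON = {"tagging", "genomics", "semantic", "web", "social"}


def base_term(compound, lexicon=LEXICON):
    """Return the term that represents a compound tag, or None if no token is known."""
    tokens = [t for t in re.split(r"[-_\s]+", compound.lower()) if t]
    if not tokens:
        return None
    if tokens[-1] in lexicon:  # EndWithFilter-style rule: try the last token first
        return tokens[-1]
    if tokens[0] in lexicon:   # StartsWithFilter-style rule: then the first token
        return tokens[0]
    return None


print(base_term("collaborative_tagging"))  # tagging
print(base_term("evolutionary-genomics"))  # genomics
print(base_term("semantic-wiki"))          # semantic
print(base_term("foo_bar"))                # None
```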

System Architecture (7/7) - Jargon Tags

Association rules show their relations with other common tags

Jargon tags are incorporated into the previously built ontological structure with a matcher using graph centrality in a similarity graph of tags
– Considers each jargon tag as the central node of a subgraph
– If there is more than one standard tag associated with the jargon tag
    The tag with the highest cosine similarity index has priority
    “folksonomy” and “tagging, plurality, social, ontology”
– “Folksonomy → tagging” was selected (ranked by cosine similarity)
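
The selection step can be sketched in a few lines of Python. The similarity values below are made up for illustration and are not taken from the paper; `attach_jargon` is a hypothetical helper name.

```python
def attach_jargon(jargon_tag, similarities):
    """Return (jargon_tag, parent), where parent is the associated standard tag
    with the highest cosine similarity, or None if there are no associated tags."""
    if not similarities:
        return (jargon_tag, None)
    parent = max(similarities, key=similarities.get)
    return (jargon_tag, parent)


# Made-up cosine similarities between "folksonomy" and its associated standard tags.
neighbours = {"tagging": 0.61, "plurality": 0.22, "social": 0.35, "ontology": 0.40}
edge = attach_jargon("folksonomy", neighbours)
print(f"{edge[0]} → {edge[1]}")  # folksonomy → tagging
```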

Experimental Evaluation (1/6)

CiteULike
– Crawling keywords: including “science”, “philosophy”, “research”
– 30,769 rows of data

Flickr
– Crawling keyword: “fruit”
– 18,555 rows of data

Pre-processing operations were performed to clean up the datasets
– For the dataset from Flickr, only one record was kept for each user
– Remove the tags called “no-tag” (a system-generated tag for an empty tag)
– Remove objects with only one tag
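
The clean-up steps can be sketched in Python as follows. The record layout `(user, object_id, tags)` and the choice to keep each user's first surviving record are assumptions; the slides do not say which record was kept or in what order the steps ran.

```python
def clean(records, one_record_per_user=False):
    """Apply the clean-up steps to an iterable of (user, object_id, tags) records."""
    seen_users = set()
    cleaned = []
    for user, object_id, tags in records:
        # Drop the system-generated placeholder tag.
        tags = {t for t in tags if t != "no-tag"}
        # Drop objects left with fewer than two tags.
        if len(tags) < 2:
            continue
        # Optionally keep only the first surviving record per user (Flickr case).
        if one_record_per_user:
            if user in seen_users:
                continue
            seen_users.add(user)
        cleaned.append((user, object_id, tags))
    return cleaned


raw = [
    ("u1", "p1", ["apple", "fruit"]),
    ("u1", "p2", ["pear", "fruit"]),
    ("u2", "p3", ["no-tag"]),
    ("u3", "p4", ["fruit"]),
    ("u4", "p5", ["no-tag", "banana", "fruit"]),
]
flickr_like = clean(raw, one_record_per_user=True)
print([(user, obj) for user, obj, _ in flickr_like])  # [('u1', 'p1'), ('u4', 'p5')]
```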

Experimental Evaluation (2/6)

Parameter thresholds
– Minimum support: 0.02%
– Minimum confidence: 0.8
– Minimum cosine similarity: 0.2

24,025 rules were obtained from CiteULike at 0.02% minimum support, 0.2 cosine similarity, and 0.8 confidence thresholds

Experimental Evaluation (3/6)

Measure how far the extracted ontological structure helps to influence and improve the results of certain tasks
– Multi-dimensional view, cataloguing and indexing

Multi-dimensional view
– Results retrieved with “fruit” were organized into several dimensions

Experimental Evaluation (4/6)

Multi-dimensional view
– Comparing structure to an ontology (sei.cmu.edu)

Experimental Evaluation (5/6)

Multi-dimensional view
– Comparing structure to cluster result (flickr.com)

Experimental Evaluation (6/6)

Cataloguing and Indexing
– Evaluated the catalogues manually
– Observed that compound and jargon terms have been appropriately incorporated
    In total, 1540 terms were incorporated into the ontological structure
    – 35.65%: standard terms
    – 64.35%: non-standard terms (including 36.17% compound and 28.18% jargon terms)

Conclusion and Future Work

Mapping terms with the WordNet ontology is not enough to find the relationships among them
– WordNet does not cover special domain vocabulary and cannot reflect usage change
– In CTS, many of the tags are in the form of jargon and compound terms

Applied the association rules to find semantically related tags

Ontological structures could be enriched and deepened using larger tag datasets and more specialized semantic lexical resources

Representing the extracted ontologies on the web using RDF and SPARQL will enable integration with other web services
