An Integrated Approach to Extracting Ontological Structures from Folksonomies
Huairen Lin, Joseph Davis, Ying Zhou
ESWC 2009
Hyewon Lim
October 9th, 2009

Contents
  Introduction
  Related Research
  System Architecture
  Experimental Evaluation
  Conclusion and Future Work

Introduction (2/3)
Problems with folksonomies
  – Tags can be idiosyncratic
  – Not understood by many users
  – Concepts and internal structure are not explicit to the machine
Various solutions have been proposed
  – Refine the query result
      Clustering, tag cloud
  – Take an existing upper ontology as the base structure
      WordNet
An integrated approach
  – Knowledge extracted from folksonomies + relevant terms from an existing upper ontology

Introduction (3/3)
Ontological structures extracted from folksonomies can be useful in many areas of collaborative tagging systems (CTS)
  – Providing multi-dimensional views
  – Cataloguing and indexing
  – Query translation and tagging suggestion
Can enhance precision and recall
  – By matching the query keywords and the potential results at the level of semantics

Related Research
Cosine similarity between tags
  – Measure the distance from one tag to another
  – Organize them into a hierarchical tree
Association rule mining has been adopted to analyze and structure folksonomies
  – Output of association rule mining on a folksonomy dataset
      Association rules of the form A → B
To discover the relationships among tags within clusters, several existing ontology resources can be used as reference
  – E.g. WordNet
An et al., “Automatic Generation of Ontology from the Deep Web”
Laniado et al., “Using WordNet to turn a folksonomy into a hierarchy of concepts”

System Architecture (1/7)
Vocabularies used in folksonomies
  – Standard tags: genomics
  – Compound tags: evolutionary-genomics
  – Jargon tags: scientometrics, CSCW
  – Other nonsense tags: misspelled tags

System Architecture (2/7) - Low Support Association Rule Mining
Aim of association rule mining in CTS
  – Generate associations in the form ta → tc between tags ta and tc that have support and confidence above certain thresholds
Traditional association rule mining
  – Set a relatively high support and confidence threshold
  – This is likely to miss important associations among tags
      Tags in folksonomies usually follow a Zipf distribution
      The majority of tags do not occur very frequently in the dataset
Low support association rule mining
  – Very low support threshold
  – Lower support may bring a lot of noise into the rule set
      Cosine similarity is used to filter out possible noise
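The three measures named on this slide can be written out for a rule ta → tc. The notation is an assumption made for this note and does not appear on the slides: O is the set of tagged objects and O(t) is the subset of objects carrying tag t. The cosine shown is the usual set-based form for binary tag vectors; the slides do not state which variant the authors used.

```latex
\begin{align}
\mathrm{supp}(t_a \rightarrow t_c) &= \frac{|O(t_a) \cap O(t_c)|}{|O|} \\
\mathrm{conf}(t_a \rightarrow t_c) &= \frac{|O(t_a) \cap O(t_c)|}{|O(t_a)|} \\
\cos(t_a, t_c) &= \frac{|O(t_a) \cap O(t_c)|}{\sqrt{|O(t_a)|\,|O(t_c)|}}
\end{align}
```

Under these definitions, a rare pair of tags that almost always occur together has low support but high confidence and high cosine, which is the kind of association a high support threshold would discard.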

System Architecture (3/7) - Low Support Association Rule Mining
LApriori algorithm (a simplified version of the Apriori algorithm)
  – Only calculates the relationship between tag pairs
* Apriori algorithm
  – Finds frequent itemsets using candidate generation
  – Finds L_{k-1}, the set of frequent (k-1)-itemsets, and uses L_{k-1} to find L_k
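A minimal sketch of pair-only rule mining with a low support threshold and a cosine filter, in the spirit of the slide. This is not the authors' LApriori implementation: the function name `mine_pair_rules`, the toy dataset, and the set-based cosine are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
import math

def mine_pair_rules(objects, min_support, min_confidence, min_cosine):
    """Return rules (ta, tc, support, confidence, cosine) over tag pairs only."""
    n = len(objects)
    tag_count = Counter()   # number of objects carrying each tag
    pair_count = Counter()  # number of objects carrying both tags of a pair
    for tags in objects:
        unique = sorted(set(tags))
        tag_count.update(unique)
        pair_count.update(combinations(unique, 2))

    rules = []
    for (a, b), both in pair_count.items():
        support = both / n
        if support < min_support:
            continue
        # Assumed set-based cosine, used to filter low-support noise.
        cosine = both / math.sqrt(tag_count[a] * tag_count[b])
        if cosine < min_cosine:
            continue
        # A surviving pair can yield a rule in either direction.
        for ta, tc in ((a, b), (b, a)):
            confidence = both / tag_count[ta]
            if confidence >= min_confidence:
                rules.append((ta, tc, support, confidence, cosine))
    return rules

# Hypothetical toy dataset: each entry is the tag set of one object.
objects = [
    {"wine", "beverage"},
    {"wine", "beverage", "food"},
    {"milk", "beverage"},
    {"beverage", "food"},
    {"genomics", "evolution"},
]
rules = mine_pair_rules(objects, min_support=0.2,
                        min_confidence=0.8, min_cosine=0.2)
for ta, tc, support, confidence, cosine in rules:
    print(f"{ta} -> {tc}  supp={support:.2f} conf={confidence:.2f} cos={cosine:.2f}")
```

Restricting candidates to pairs means only the first two levels of Apriori are needed, so a single pass that counts tags and co-occurring pairs is enough.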

System Architecture (4/7) - Standard Tags
Use WordNet as the upper ontology
  – Compute each semantic relation between tags in terms of the hypernym relation from WordNet
  – Possible semantic relations
      more general (⊇), less general (⊆), equivalence (=)
Additional definitions for folksonomies
  – essential tags: all distinct tags existing in association rules filtered by thresholds
  – candidate hypernyms: hypernyms of a tag that exist among its related tags
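The three relations can be illustrated with a small sketch. The `HYPERNYM` and `SYNONYM` tables are hypothetical stand-ins for WordNet lookups, and `relation` is a name introduced here for illustration, not one taken from the paper.

```python
# Hypothetical child -> parent hypernym links standing in for WordNet.
HYPERNYM = {"wine": "beverage", "milk": "beverage", "beverage": "food"}
# Hypothetical synonym mapping standing in for WordNet synsets.
SYNONYM = {"drink": "beverage"}

def ancestors(tag):
    """All hypernyms of a tag, nearest first."""
    chain = []
    while tag in HYPERNYM:
        tag = HYPERNYM[tag]
        chain.append(tag)
    return chain

def relation(a, b):
    """Return '⊇', '⊆', '=' or None for the relation of tag a to tag b."""
    a, b = SYNONYM.get(a, a), SYNONYM.get(b, b)
    if a == b:
        return "="
    if a in ancestors(b):
        return "⊇"  # a is more general than b
    if b in ancestors(a):
        return "⊆"  # a is less general than b
    return None     # no hypernym path between the two tags
```

For instance, `relation("food", "wine")` follows the chain wine, beverage, food and reports that "food" is the more general tag, while two siblings such as "wine" and "milk" have no direct relation.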

System Architecture (5/7) - Standard Tags
Folk2Onto algorithm
Example tag set: {food, beverage, wine, milk}
For tag “wine”:
  ① Uk = {}, candidate hypernym = {food}; then Uk = {food}
  ② Uk = {beverage}, candidate hypernym = {food}; then Uk = {beverage} – break!
  ③ Uk = {food}, candidate hypernym = {beverage}; then Uk = {beverage}
[Figure: tree with food above beverage, and beverage above wine and milk; ①, ②, ③ mark the three cases]
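One reading of the three cases above is that the parent chosen for a tag is its most specific hypernym among the related tags. The sketch below follows that reading only; it is not the published Folk2Onto algorithm, and the `HYPERNYMS` chains and function names are hypothetical.

```python
# Hypothetical hypernym chains (nearest first), standing in for WordNet lookups.
HYPERNYMS = {
    "wine": ["beverage", "food"],
    "milk": ["beverage", "food"],
    "beverage": ["food"],
    "food": [],
}

def nearest_parent(tag, related_tags):
    """Pick the most specific hypernym of `tag` among its related tags."""
    chain = HYPERNYMS.get(tag, [])
    parent = None  # plays the role of Uk on the slide
    for candidate in related_tags:
        if candidate not in chain:
            continue  # not a candidate hypernym of this tag
        if parent is None or chain.index(candidate) < chain.index(parent):
            parent = candidate  # keep the more specific hypernym
    return parent

def build_edges(tags):
    """Map each tag to its chosen parent; root tags are left out."""
    edges = {}
    for tag in tags:
        parent = nearest_parent(tag, [t for t in tags if t != tag])
        if parent is not None:
            edges[tag] = parent
    return edges

print(build_edges(["food", "beverage", "wine", "milk"]))
```

With the slide's tag set, "wine" and "milk" attach to "beverage" and "beverage" attaches to "food", whichever order the candidates are visited in.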

System Architecture (6/7) - Compound Tags
Compound tags are non-standard terms
  – Cannot be processed by WordNet without transformation
Jawbone (by Mike Wallace)
  – If compound tags match certain defined criteria, they are kept and represented by their base term when searching for a more general parent
  – EndWithFilter
      The last token is used to represent the whole compound
      collaborative_tagging → tagging
  – StartsWithFilter
      The first token is used to represent the whole word
      Applied after the EndWithFilter
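The two rules can be sketched as a small function that tries the last token first and the first token second. This illustrates the slide's description only and does not use or reproduce the Jawbone API; `LEXICON` is a hypothetical stand-in for a WordNet lookup, and the separator set (hyphen, underscore, whitespace) is an assumption.

```python
import re

# Hypothetical stand-in for "this word exists in WordNet".
LEXICON = {"tagging", "genomics", "semantic", "web", "social"}

def base_term(tag, lexicon=LEXICON):
    """Return a lexicon word that represents a compound tag, or None."""
    tokens = [t for t in re.split(r"[-_\s]+", tag.lower()) if t]
    if len(tokens) < 2:
        return None            # not a compound tag
    if tokens[-1] in lexicon:  # end-token rule is tried first
        return tokens[-1]
    if tokens[0] in lexicon:   # start-token rule is the fallback
        return tokens[0]
    return None                # no usable base term

print(base_term("collaborative_tagging"))
```

Trying the last token first matches the slide's ordering and suits English compounds, where the head noun usually comes last.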

System Architecture (7/7) - Jargon Tags
Association rules show their relations with other common tags
Jargon tags are incorporated into the previously built ontological structure with a matcher using graph centrality in a similarity graph of tags
  – Considers each jargon tag as the central node of a subgraph
  – If more than one standard tag is associated with the jargon tag
      The tag with the highest cosine similarity index has priority
      Example: “folksonomy” and “tagging, plurality, social, ontology”
  – “Folksonomy → tagging” was selected (ranked by cosine similarity)
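The selection step can be sketched as picking the associated standard tag with the highest cosine similarity. The similarity values below are made up for illustration, since the slides give none, and `attach_jargon` is a hypothetical name.

```python
def attach_jargon(jargon_tag, similarities, standard_tags):
    """Return (jargon_tag, parent), where parent is the most similar standard tag."""
    candidates = {t: s for t, s in similarities.items() if t in standard_tags}
    if not candidates:
        return None  # no associated standard tag to attach to
    parent = max(candidates, key=candidates.get)
    return (jargon_tag, parent)

# Cosine values are invented for this example; only the tag names come from the slide.
similarities = {"tagging": 0.61, "plurality": 0.22, "social": 0.35, "ontology": 0.40}
standard = {"tagging", "plurality", "social", "ontology"}
edge = attach_jargon("folksonomy", similarities, standard)
print(edge)
```

The jargon tag is the centre of its own small subgraph here, and only its strongest edge to a standard tag is kept.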

Experimental Evaluation (1/6)
CiteULike
  – Crawling keywords: including “science”, “philosophy”, “research”
  – 30,769 rows of data
Flickr
  – Crawling keyword: “fruit”
  – 18,555 rows of data
Pre-processing operations were performed to clean up the datasets
  – For the dataset from Flickr, only one record was kept for each user
  – Removed the tag “no-tag” (a system-generated tag for an empty tag)
  – Removed objects with only one tag
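The cleaning steps can be sketched as a single pass over the records. The record layout `(user, object_id, tags)`, the choice to keep each user's first record, and the order in which the steps are applied are assumptions; the slides do not specify them.

```python
NO_TAG = "no-tag"  # system-generated placeholder tag named on the slide

def preprocess(records, one_record_per_user=False):
    """Clean (user, object_id, tags) records; this layout is an assumption."""
    seen_users = set()
    cleaned = []
    for user, object_id, tags in records:
        if one_record_per_user:  # applied to the Flickr data only
            if user in seen_users:
                continue
            seen_users.add(user)
        tags = {t for t in tags if t != NO_TAG}  # drop the placeholder tag
        if len(tags) < 2:
            continue  # objects with a single tag carry no pair information
        cleaned.append((user, object_id, tags))
    return cleaned

# Hypothetical sample records.
records = [
    ("u1", "p1", ["fruit", "apple"]),
    ("u1", "p2", ["fruit", "pear"]),
    ("u2", "p3", ["no-tag"]),
    ("u3", "p4", ["fruit"]),
    ("u4", "p5", ["fruit", "no-tag", "red"]),
]
flickr_like = preprocess(records, one_record_per_user=True)
```

Objects with a single tag are dropped because they cannot contribute to any tag-pair association.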

Experimental Evaluation (2/6)
Thresholds of parameters
  – Minimum support: 0.02%
  – Minimum confidence: 0.8
  – Minimum cosine similarity: 0.2
Obtained 24,025 rules from CiteULike at thresholds of 0.02% minimum support, 0.2 cosine similarity, and 0.8 confidence

Experimental Evaluation (3/6)
Measure how far the extracted ontological structure helps to improve the results of certain tasks
  – Multi-dimensional view, cataloguing and indexing
Multi-dimensional view
  – Results retrieved with the tag “fruit” were organized into several dimensions

Experimental Evaluation (4/6)
Multi-dimensional view
  – Comparing the structure to an ontology (sei.cmu.edu)

Experimental Evaluation (5/6)
Multi-dimensional view
  – Comparing the structure to a cluster result (flickr.com)

Experimental Evaluation (6/6)
Cataloguing and indexing
  – Evaluated the catalogues manually
  – Observed that compound and jargon terms have been appropriately incorporated
In total, 1540 terms were incorporated into the ontological structure
  – 35.65%: standard terms
  – 64%: non-standard terms (including 36.17% compound and 28.18% jargon terms)

Conclusion and Future Work
Mapping terms to the WordNet ontology is not enough to find the relationships among them
  – WordNet does not cover special domain vocabulary and cannot reflect usage change
  – In CTS, many of the tags are in the form of jargon and compound terms
Applied association rules to find semantically related tags
Ontological structures could be enriched and deepened using larger tag datasets and more specialized semantic lexical resources
Representing the extracted ontologies on the web using RDF and SPARQL will enable integration with other web services