Metadata first, ontologies second

Post on 11-May-2015


Towards a solution to extract knowledge from the social web

(“metadata first, ontologies second”)

Project Collaborative Ontology Building System (CollOnBus)

INTEK Nets 2005-2007

Aitor Almeida, Borja Sotomayor,

Joseba Abaitua, Diego Lopez de Ipiña

ESWC 2008 (Tenerife)

Social web: source of knowledge

Crowds share and tag resources of different types:
– pictures, music, posts, videoclips, slides, books, bookmarks, etc.

Social tagging (or crowd-tagging) is a very effective and economic way of generating knowledge

Crowdsourcing: “the trend of leveraging the mass collaboration enabled by Web2.0 technologies to achieve business goals.”

<http://en.wikipedia.org/wiki/Crowdsourcing>

Related work (since 2006)

Mapping tags to ontologies
– Schmitz 2006. Inducing Ontology from Flickr tags. WWW’2006: Collaborative Web Tagging workshop
– Abbasi et al. 2007. Organizing Resources on Tagging Systems using T-ORG. ESWC2007 SemNet workshop

Identifying semantic relations
– Specia, Motta. 2007. Integrating Folksonomies with the Semantic Web. ESWC2007

Transforming folksonomies into formal representations
– Marlow et al. 2006. Tagging, Taxonomy, Flickr, Article, ToRead. WWW’2006: Collaborative Web Tagging workshop
– Hotho et al. 2006. Trend Detection in Folksonomies. Semantics And Digital Media Technology SAMT2006
– Maala et al. A Conversion Process From Flickr Tags to RDF Descriptions. BIS2007 workshop

Which knowledge representation model?

Extracting knowledge from data-sharing Web 2.0 sites, but into which formal representation?

Semantic networks
– Lexical networks (WordNet)

Taxonomies
– e.g. categories from Wikipedia, thesauri

Metadata
– “mapping to Dublin Core is a weak choice”

Ontologies

“metadata first, ontologies second”


Crowds tagging pictures


Aitor Almeida

Borja Sotomayor

Diego López de Ipiña



Crowds tagging posts

Crowds tagging slides

Crowds tagging books

Crowds tagging URLs

Crowd-sharing of tags

Flickr, del.icio.us... group tags by social sharing (or “co-usage”)
– but the semantic information that socially shared tags acquire is poorly exploited

Mapping folksonomies into tag clusters

RawSugar <http://rawsugar.com/>
– allows users to assign hierarchies to their tags, improving the navigation and searching of folksonomies
– non-expert users will find it easier to tag resources without any restrictions

Tag clustering

Tag clustering is the main technique used to improve the wealth of social tagging
– but semantic relations are not detected


Beyond tag clusters?

Should we map them into ontologies?

Better: first map them into metadata

blog, japan, personal, spanish, geek

“Kirai.NET” http://kirai.bitacoras.com

Title: Kirai.NET
Format: text/html
Type: Text
Identifier: http://kirai.bitacoras.com
Subject: blog, personal, geek
Language: spanish
Coverage: japan

Metadata vs ontologies

Why are metadata structures better than ontologies (for resource classification and categorisation)?

Let’s reflect on different knowledge representations and on who uses them:
– Folksonomies (crowds)
– Taxonomies, ontologies (knowledge engineers, AI/SW practitioners)
– Metadata structures (librarians, archivists, documentalists)

What are metadata?


TAG vs metadata?

Metadata vs ontologies

Why are metadata structures better?
– Because metadata provide a wide and complete range of facets for representing knowledge about an entity or resource
– Each facet (or data type) could be part of one or several ontological structures
– Facet: “any of the definable aspects that make up a subject (as of contemplation) or an object (as of consideration)”
– “A faceted classification system allows the assignment of multiple classifications to an object, enabling the classifications to be ordered in multiple ways, rather than in a single, pre-determined, taxonomic order” (Wikipedia).

Better: first map folksonomies into metadata structures

blog, japan, personal, spanish, geek

“Kirai.NET” http://kirai.bitacoras.com

Title: Kirai.NET
Format: text/html
Type: Text
Identifier: http://kirai.bitacoras.com
Subject: blog, personal, geek
Language: spanish
Coverage: japan

Dublin Core Metadata Initiative

http://jodi.tamu.edu/Articles/v02/i02/Greenberg/metadataform.gif

Our mapping tool: folk2onto (? folk2meta)

[Figure: folk2onto architecture. Components shown: Tagged resource (RSS/HTML), Tag Retriever, Folksonomy tags, Tag Trainer, Tag Distiller, Trained Tags DB (training), Filtered Tags (XML), Mappings Trainer, Mappings Distiller, WordNet synsets, Mapping DB (mapping), Annotated resource (RDF).]

Designed by Borja Sotomayor

folk2onto: Tag Distiller

Tag Distiller:
– Downloads tags from Web 2.0 sites
– Matches each tag against WordNet (taking into account the tag’s context/cloud)
– Filters out synonyms
– Keeps the list of remaining tags
– Generates an XML file

Implemented by Aitor Almeida

TAG clouds from del.icio.us

1. Requests http://del.icio.us/url/check?url=site
2. Looks for <title> and gets its content: the hash
3. Gets the RSS in http://del.icio.us/rss/url/ + hash
4. Then tag clouds are downloaded from <rdf:li resource="http://del.icio.us/tag/">
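The four steps can be sketched roughly as below. This is only an illustration, not the project's code: it builds the URLs named on the slide and parses text that is passed in, without touching the network. The function names, the sample check page and the sample feed are all invented for the example, and the real feed layout may have differed.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import quote

CHECK_URL = "http://del.icio.us/url/check?url="  # step 1 (from the slide)
RSS_URL = "http://del.icio.us/rss/url/"          # step 3 (from the slide)
TAG_PREFIX = "http://del.icio.us/tag/"           # step 4 (from the slide)
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def check_url(site):
    """Step 1: URL of the check page for a given site."""
    return CHECK_URL + quote(site, safe="")


def hash_from_check_page(html):
    """Step 2: the content of <title>, which the slide says is the hash."""
    match = re.search(r"<title>\s*(.*?)\s*</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else None


def rss_url(url_hash):
    """Step 3: URL of the RSS feed for that hash."""
    return RSS_URL + url_hash


def tags_from_rss(rss_text):
    """Step 4: collect tag names from rdf:li elements pointing at /tag/."""
    root = ET.fromstring(rss_text)
    tags = []
    for li in root.iter("{%s}li" % RDF_NS):
        # Accept both a plain and an rdf-qualified "resource" attribute.
        resource = li.get("resource") or li.get("{%s}resource" % RDF_NS)
        if resource and resource.startswith(TAG_PREFIX):
            tag = resource[len(TAG_PREFIX):]
            if tag:
                tags.append(tag)
    return tags


# Hypothetical inputs, made up for this example.
SAMPLE_CHECK_PAGE = "<html><head><title>0123abcd</title></head></html>"
SAMPLE_RSS = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Bag>
    <rdf:li resource="http://del.icio.us/tag/blog"/>
    <rdf:li resource="http://del.icio.us/tag/japan"/>
  </rdf:Bag>
</rdf:RDF>"""

url_hash = hash_from_check_page(SAMPLE_CHECK_PAGE)
print(rss_url(url_hash))
print(tags_from_rss(SAMPLE_RSS))
```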

TAG clouds from Technorati

Technorati: blog aggregator

We can get tag clouds from Technorati through:
http://api.technorati.com/blogposttags?key=[apikey]&url=[blog URL]

TAG clouds from Technorati

<?xml version="1.0" encoding="utf-8"?>
<!-- generator="Technorati API version 1.0 /blogposttags" -->
<!DOCTYPE tapi PUBLIC "-//Technorati, Inc.//DTD TAPI 0.02//EN" "http://api.technorati.com/dtd/tapi-002.xml">
<tapi version="1.0">
  <document>
    <result>
      <querycount>13</querycount>
    </result>
    <item>
      <tag>christmas cookie recipes</tag>
      <posts>274</posts>
    </item>
    ….
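A response of this shape could be read along the following lines. This is a minimal sketch under stated assumptions: the helper names are invented, the sample document is a closed-off version of the fragment on the slide containing only the one item shown there, and nothing is fetched from the API.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API_BASE = "http://api.technorati.com/blogposttags"  # from the slide


def blogposttags_url(api_key, blog_url):
    """Build the request URL with the two parameters named on the slide."""
    return API_BASE + "?" + urlencode({"key": api_key, "url": blog_url})


def parse_tag_cloud(xml_text):
    """Return (tag, post count) pairs for every <item> in the response."""
    root = ET.fromstring(xml_text)
    cloud = []
    for item in root.iter("item"):
        tag = item.findtext("tag")
        posts = (item.findtext("posts") or "").strip()
        if tag is not None:
            cloud.append((tag.strip(), int(posts) if posts.isdigit() else 0))
    return cloud


# Hypothetical, shortened response: only the item visible on the slide.
SAMPLE_RESPONSE = """<tapi version="1.0">
  <document>
    <result>
      <querycount>13</querycount>
    </result>
    <item>
      <tag>christmas cookie recipes</tag>
      <posts>274</posts>
    </item>
  </document>
</tapi>"""

print(parse_tag_cloud(SAMPLE_RESPONSE))
```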

Tagged URL at Technorati

All <tag> elements are downloaded

To get the “title”:
http://api.technorati.com/bloginfo?key=[apikey]&url=[blog url]

And <name> is recovered

Semantic relations in WordNet

WordNet relations for tag ‘Spanish’:

[Figure: WordNet relation graph for ‘Spanish’. Nodes: “Romance, Romance language, Latinian language”; “Spanish”; “Mexican Spanish”; “national, subject”; “nation, land, country, a people”; “Spanish, Spanish people”. Edge labels: hypernym, hyponym, meronym.]

TAG filtering algorithm

Tags are filtered out by means of WordNet

If a tag has only one meaning (synset), that meaning is assigned

If it has more than one, then:

$$\max \left\{ w_T \, \frac{\sum_{t \in T} \mathrm{related}(tag, t)}{|T|} + w_F \, \frac{\sum_{t \in F} \mathrm{related}(tag, t)}{|F|} \right\}$$

– T: the resource’s tag set
– F: not defined on the slide; the surviving label “folktags” suggests the folksonomy’s tag set
– related(a,b): gives 1 if a and b have some type of relation (hypernym, hyponym, holonym, meronym)
– w: weights

Several iterations are made until a meaning is found (10 iterations max.)
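The selection rule can be illustrated with a small sketch. It covers only part of the slide: a tag with a single sense keeps it, and otherwise each candidate sense is scored by the weighted fraction of the resource's other tags it is related to, with the best score winning. The sense inventory and relation pairs below are toy data invented for the example (a real run would query WordNet), and the iteration limit from the slide is not modelled.

```python
# Hypothetical toy data: candidate senses per tag, and which senses are
# related (hypernym, hyponym, holonym or meronym) to which context tags.
SENSES = {
    "spanish": ["spanish.language", "spanish.people"],
    "blog": ["blog.weblog"],
}
RELATED = {
    ("spanish.language", "language"),
    ("spanish.language", "romance"),
    ("spanish.people", "nation"),
}


def related(sense, tag):
    """1 if the sense and the tag share some relation, else 0."""
    return 1 if (sense, tag) in RELATED else 0


def context_score(sense, tag_set, weight=1.0):
    """Weighted share of the tag set that the sense is related to."""
    if not tag_set:
        return 0.0
    return weight * sum(related(sense, t) for t in tag_set) / len(tag_set)


def pick_sense(tag, resource_tags):
    """Choose a sense for a tag given the other tags on the same resource."""
    candidates = SENSES.get(tag, [])
    if not candidates:
        return None            # unknown to the toy inventory
    if len(candidates) == 1:
        return candidates[0]   # only one meaning: assign it directly
    context = [t for t in resource_tags if t != tag]
    return max(candidates, key=lambda s: context_score(s, context))


print(pick_sense("spanish", ["spanish", "language", "romance", "blog"]))
```

With these toy relations, the language sense wins for a resource also tagged "language" and "romance", while a resource tagged "nation" would pull the choice towards the people sense.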

TAG filtering algorithm

Once senses have been discarded, synonyms are also filtered out

Words are then grouped in senses using WordNet’s relation network

The output is exported to:
– an XML file with senses
– an XML file with tags that were discarded
– an RDF file containing WordNet’s relation network

TAG XML file

<?xml version="1.0" encoding="UTF-8"?>
<resource>
  <tittle>PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL</tittle>
  <type>Text</type>
  <format>text/html</format>
  <identifier>www.postgresql.org/docs/faqs.FAQ_brazilian.html</identifier>
  <tags>
    <tag>
      <lemma>tune</lemma>
      <idlex>236726</idlex>
    </tag>
    <tag>
      <lemma>bd</lemma>
      <idlex>5604473</idlex>
    </tag>

TAG file without senses

<resource>
  <tittle>Wired News: The Virus That Ate DHS</tittle>
  <type>Text</type>
  <format>text/html</format>
  <identifier>www.wired.com/news/technology/0,72051-0.html?tw=rss.index</identifier>
  <tags>
    <tag>bit200f06</tag>
    <tag>group141</tag>
    <tag>dhs</tag>
    <tag>group35</tag>
    <tag>malware</tag>
    <tag>group91</tag>
    <tag>group17</tag>
    <tag>group53</tag>
    <tag>computer_security</tag>
  </tags>
</resource>

WordNet’s sense sets

Words are grouped in sense sets
– If related(a,b) = 1, then the words are grouped in the same set
– The relation depth has to be less than or equal to 3
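One way to read the grouping rule is sketched below: two words count as related when a path of at most three relation links connects them, and related words end up in the same set. The relation graph is toy data made up for the example rather than WordNet itself, and letting sets merge transitively is an assumption of this sketch, not something the slide states.

```python
from collections import deque

# Hypothetical toy relation graph (undirected), standing in for WordNet.
EDGES = {
    "spanish": {"romance_language"},
    "romance_language": {"spanish", "language"},
    "language": {"romance_language", "communication"},
    "communication": {"language", "abstraction"},
    "abstraction": {"communication", "entity"},
    "entity": {"abstraction"},
    "jaguar": {"big_cat"},
    "big_cat": {"jaguar"},
}


def within_depth(a, b, max_depth=3):
    """Breadth-first search: is b reachable from a in at most max_depth links?"""
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == b:
            return True
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in EDGES.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return False


def sense_sets(words, max_depth=3):
    """Group words; a word joins (and merges) every set it is related to."""
    groups = []
    for word in words:
        merged = [g for g in groups
                  if any(within_depth(word, w, max_depth) for w in g)]
        new_group = {word}
        for g in merged:
            new_group |= g
            groups.remove(g)
        groups.append(new_group)
    return groups


print(sense_sets(["spanish", "language", "entity", "jaguar", "big_cat"]))
```

Note that "entity" joins the first set through "language" (three links away) even though it is five links from "spanish"; a stricter reading of the slide would require every pair in a set to respect the depth limit.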

folk2onto: Tag Trainer

folk2onto: Map Trainer

folk2onto: Tag Mapper

The Mapper makes tag-element associations

These associations are made according to the senses assigned by the Distiller

The mapping targets are Dublin Core metadata records

folk2onto: Dublin Core

The Distiller gets 4 elements from the tag source (del.icio.us, Technorati, etc.):
– Title: URL’s title -> from the <title> XML tag
– Type: content type -> depending on the source (here both are “Text”)
– Format: MIME class -> depending on the source (here both are text/html)
– Identifier: we take the resource’s URL

folk2onto: Dublin Core

The Tag-Mapper deals with:
– Subject: the “topic”
– Language: en, es, fr, de, ru...
– Coverage: when, where (about the topic)
– Rights: type of licence
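The division of labour between the Distiller (Title, Type, Format, Identifier) and the Tag-Mapper (Subject, Language, Coverage, Rights) can be pictured with the Kirai.NET example from the earlier slide. The sketch below is only an illustration: the two lookup sets are hypothetical stand-ins for the trained mappings, the function name is invented, and Rights is left out because the example has no licence tag.

```python
# Hypothetical lookup sets standing in for the trained tag-element mappings.
LANGUAGE_TAGS = {"spanish", "english", "french"}
COVERAGE_TAGS = {"japan", "spain"}


def build_record(title, identifier, tags, dc_type="Text", dc_format="text/html"):
    """Assemble a Dublin Core style record from source data and tags."""
    record = {
        # Elements taken from the tag source (the Distiller's part).
        "Title": title,
        "Format": dc_format,
        "Type": dc_type,
        "Identifier": identifier,
        # Elements filled from the tags (the Tag-Mapper's part).
        "Subject": [],
        "Language": [],
        "Coverage": [],
    }
    for tag in tags:
        if tag in LANGUAGE_TAGS:
            record["Language"].append(tag)
        elif tag in COVERAGE_TAGS:
            record["Coverage"].append(tag)
        else:
            record["Subject"].append(tag)  # default: treat the tag as a topic
    return record


record = build_record(
    "Kirai.NET",
    "http://kirai.bitacoras.com",
    ["blog", "japan", "personal", "spanish", "geek"],
)
for element, value in record.items():
    print(element, value)
```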

folk2onto: mapping formulae

When a TAG has one mapping, that TAG is used

If it has more than one:

[Formula lost in extraction; surviving terms: map, m ∈ M, related(t, m), |M|, w_rel, T]

If it has no mapping, then:

[Formula lost in extraction; surviving terms: SameOrSon(T, M), count(t), |T|, weights w_i and w_o]

folk2onto: file mapping

<rdf:RDF xmlns:j.0="http://purl.org/dc/elements/1.1" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:nodeID="A0">
    <rdf:type rdf:resource="http://purl.org/dc/elements/1.1identifier"/>
    <j.0:identifier>www.postgresql.org/docs/faqs.FAQ_brazilian.html</j.0:identifier>
    <j.0:type>Text</j.0:type>
    <j.0:format>text/html</j.0:format>
    <j.0:tittle>PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL</j.0:tittle>
    <j.0:subject>database</j.0:subject>
    <j.0:subject>performance</j.0:subject>
    <j.0:subject>bd</j.0:subject>
  </rdf:Description>
</rdf:RDF>
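For comparison, a record like this could be serialised with the Python standard library as sketched below. This is not the tool's own writer: the function name is invented, and the sketch deliberately uses the standard Dublin Core element namespace (with its trailing slash) and the standard element name "title", whereas the output shown on the slide differs on both points.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # standard namespace, trailing slash

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def to_rdf_xml(record):
    """Serialise a dict of Dublin Core elements as a small RDF/XML document."""
    root = ET.Element("{%s}RDF" % RDF_NS)
    description = ET.SubElement(root, "{%s}Description" % RDF_NS)
    for element, values in record.items():
        if isinstance(values, str):
            values = [values]  # single values and lists are handled alike
        for value in values:
            child = ET.SubElement(description, "{%s}%s" % (DC_NS, element))
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Values taken from the slide's example record.
record = {
    "identifier": "www.postgresql.org/docs/faqs.FAQ_brazilian.html",
    "type": "Text",
    "format": "text/html",
    "title": "PostgreSQL: Perguntas Frequentes (FAQ) sobre PostgreSQL",
    "subject": ["database", "performance", "bd"],
}

xml_text = to_rdf_xml(record)
print(xml_text)
```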

Mapping trainer

folk2onto: 6 tests (A-F)

Experiment A: Selecting random synsets for the tags.

Experiment B: Without any limit in the semantic relation depth. Only taking into account the trained synsets (frec=0, wordnet=0, trained=1).

Experiment C: Without any limit in the semantic relation depth. Only taking into account the context (frec=0, wordnet=1, trained=0).

Experiment D: Without any limit in the semantic relation depth. Taking the context and the trained synsets into account (frec=0, wordnet=0.4, trained=0.6).

Experiment E: Without any limit in the semantic relation depth. Taking all three components of the equation (familiarity, context and trained synsets) into account (frec=0.1, wordnet=0.3, trained=0.6).

Experiment F: Limiting the semantic relation depth to 3 and taking the context and the trained synsets into account (frec=0, wordnet=0.4, trained=0.6).

folk2onto: tests output

Experiment   Correct synsets   Erroneous synsets
A            706 (32.5%)       1466 (67.5%)
B            1594 (73.4%)      578 (26.6%)
C            1199 (55.2%)      973 (44.8%)
D            1492 (68.7%)      680 (31.3%)
E            1349 (62.1%)      823 (37.9%)
F            1894 (87.2%)      278 (12.8%)
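The percentages in the table follow directly from the counts, as the short check below shows. The counts are copied from the slide; the dictionary layout and function name are just for this example.

```python
# (correct, erroneous) synset counts per experiment, copied from the table.
RESULTS = {
    "A": (706, 1466),
    "B": (1594, 578),
    "C": (1199, 973),
    "D": (1492, 680),
    "E": (1349, 823),
    "F": (1894, 278),
}


def percentages(correct, erroneous):
    """Share of correct and erroneous synsets, rounded to one decimal."""
    total = correct + erroneous
    return round(100 * correct / total, 1), round(100 * erroneous / total, 1)


for name, (correct, erroneous) in RESULTS.items():
    ok, bad = percentages(correct, erroneous)
    print(name, correct + erroneous, ok, bad)

# Experiment with the most correct synsets.
best = max(RESULTS, key=lambda name: RESULTS[name][0])
print("best:", best)
```

Every experiment covers the same 2172 tag senses, so the rows are directly comparable; experiment F, the only one that limits the relation depth to 3, has the highest share of correct synsets.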

folk2onto: tests output

[Figure: bar chart of correct vs. erroneous synsets for Exp. A to Exp. F; axis scale from 0 to 2500.]

Open issues

Tag filtering through WordNet
– blog, wiki
– xml, rdf, rss
– wordpress, tuenti, flickr
– social, open

“tags can be about so many things – mapping to Dublin Core is a weak choice”

Mappings
– Coverage: Japan
– Language: Spanish

Learning the right synset of e.g. "jaguar"
– "vehicle", "video game console", or "cat of prey"
– "<dc:subject>Jaguar</dc:subject>"

Word-sense disambiguation
– tag-category disambiguation

That was all about CollOnBus/folk2onto

Thank you very much! Any questions?