O N T O P E D I A The Identity of Everything Subject Identity Steve Pepper [email protected]...

www.ontopedia.net

O N T O P E D I A The Identity of Everything

Subject Identity

Steve Pepper [email protected]

INF5909, 2009-02-23



Agenda

Merging in Topic Maps

The Importance of Identity

The Topic Maps Approach to Identity

The Identity Crisis of the Web

Published Subjects

(Subject-centric Computing)



Merging in Topic Maps

An Example of Knowledge Federation


Merging topic maps

Topic Maps can be merged automatically

– Arbitrary topic maps can be merged into a single topic map

– This cannot be done with databases or XML documents

Merging enables many advanced applications

– Information integration across repositories

– Sharing and reusing taxonomies

– Automated content aggregation

– Distributed knowledge management

– Global knowledge federation

Merging made possible by subject identity



Principles of merging

By definition: Every topic represents exactly one subject

The goal: Every subject represented by exactly one topic

1. When two topic maps are merged, topics that represent the same subject should be merged to a single topic

2. When two topics are merged, the resulting topic has the union of the characteristics of the two original topics

(Figure: a topic T with names, occurrences and association roles; a second topic, in another topic map, “about” the same subject. Merge the two topics together... and the resulting topic has the union of the original characteristics.)

(Demo of merging in the Omnigator…)
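The two merging principles on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Omnigator's implementation: the `Topic` class, its field names, and the rule that two topics represent the same subject when they share a subject identifier are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A hypothetical, simplified topic: identifiers plus three kinds of characteristics."""
    subject_identifiers: set = field(default_factory=set)
    names: set = field(default_factory=set)
    occurrences: set = field(default_factory=set)
    roles: set = field(default_factory=set)


def merge_topics(a, b):
    """Principle 2: the merged topic has the union of both topics' characteristics."""
    return Topic(
        a.subject_identifiers | b.subject_identifiers,
        a.names | b.names,
        a.occurrences | b.occurrences,
        a.roles | b.roles,
    )


def merge_maps(map_a, map_b):
    """Principle 1: topics that represent the same subject become a single topic.

    Assumption: two topics represent the same subject when they share at
    least one subject identifier.
    """
    merged = []
    for topic in list(map_a) + list(map_b):
        # Pull out every already-collected topic that shares an identifier.
        same = [t for t in merged if t.subject_identifiers & topic.subject_identifiers]
        for t in same:
            merged.remove(t)
            topic = merge_topics(topic, t)
        merged.append(topic)
    return merged


# Two topics from different topic maps, both "about" Puccini.
puccini_a = Topic({"http://psi.ontopedia.net/Puccini"}, names={"Puccini"})
puccini_b = Topic({"http://psi.ontopedia.net/Puccini"}, names={"Giacomo Puccini"})
result = merge_maps([puccini_a], [puccini_b])
```

After the merge, `result` holds a single topic that carries both names.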



The vision of seamless knowledge

Starting with ITU in 2001, Norway has seen an explosion in the number of portals that are based on Topic Maps

– Today there are dozens, especially in the public sector

As the number of portals multiplies, the amount of overlap increases…

– The potential for integration is … staggering

Take these three portals as an example:

– forskning.no (Research Council web site aimed at young adults)

– forbrukerportalen.no (Norwegian Consumer Association)

– matportalen.no (Biosecurity portal of the Department of Agriculture)


Genetically modified food at forskning.no



Genetically modified food at Forbrukerrådet


Genetically modified foodstuffs at Matportalen



Three portals – one subject

one “virtual portal” with seamless navigation in all directions


The Importance of Identity



Identity and knowledge federation

Knowledge federation requires subject-based merging


The big challenge is

Knowing when we’re talking about the same thing

the computer domain

the real world



Humans get by using names

But names are ambiguous (homonyms)

– Humans disambiguate using (a) context and (b) negotiation

Many names have the same referent (synonyms)

– Humans can generally handle this

– Computers can’t – at least not without our help...

Computers need a simpler mechanism

– Local identifiers (database keys, XML IDs, controlled vocabularies, code sets, etc.) work OK in closed systems

– but not across systems or domains (e.g. the code “nor”)

– Open and multilingual systems need global identifiers


Requirements on global identifiers

The mechanism as a whole should be

– open and democratic: top-down solutions won’t work

– scalable: the number of potential subjects is open-ended

– easy to adopt: based on existing tools and methods

The identifiers themselves should be

– easy for humans to use: locate, create, interpret, apply

• given a subject, find an identifier (if one exists)

• given a subject, create an identifier

• given an identifier, find out what subject it identifies

• given an identifier, attach it to the information in question

– efficient for computers to use: comparison of identifiers

• lexical comparison simplest

• avoid normalization, network access, other computation


Some proposed solutions

URL based proposals

For web documents

– HTTP URIs (URLs)

– address = identifier

For resources in general

– Source: SemWeb community

– URIs for arbitrary “resources” (esp. classes and properties)

Published Subjects

– Source: Topic Maps community

– Continuation of SemWeb practice

Non-URL based proposals

URN (RFC 1737)

– Uniform Resource Names

XRI (OASIS)

– Extensible Resource Identifiers

Domain specific

– ISBN (books)

– DOI (“digital objects”)

– GUID & UUID

– UPC & EAN

– RFID

– (what else is out there?)


The Topic Maps Approach to Identity

Direct identification (subject locators)

Indirect identification (subject identifiers)


Subjects and topics

Topics are surrogates, or “proxies” (inside the computer) for the ineffable subjects that you want to talk about, such as Puccini, love, these slides, or the second law of thermodynamics

(Figure: a subject in the real world (referent) and the topic T that stands for it in the computer domain (symbol))


Topics and subjects

Topics represent subjects

– By definition every topic represents exactly one subject

– The goal when merging is to ensure that every subject is represented by exactly one topic (the collocation objective)

A subject can be anything you want

– ISO 13250 definition: “A subject is any ‘thing’ whatsoever, whether or not it exists or has any other specific characteristics, about which anything whatsoever may be asserted by any means whatsoever.”

Some examples...

(Puccini)

(Lucca)

(Tosca)

(Madame Butterfly)


The identity of subjects

Topics exist in order to allow us to talk about subjects

– The relationship between the two is sometimes called intentionality

We need to know exactly which subject a topic represents

– That is, we need to establish its subject identity

– The collocation objective depends on knowing when applications are talking about the same thing

Lucca

Tosca

Puccini

Madame Butterfly


Subject locators

Sometimes the subject is an information resource (like these slides)

– It exists somewhere within the computer system

– It has a location and can be “addressed”, e.g. http://www.ontopedia.net/tutorials/tm-intro.ppt

– The address of such an addressable subject can be used to unequivocally establish the subject’s identity

– An address used in this way to identify a subject directly is called a subject locator

But most subjects are not information resources

– Puccini, Tosca, love, subject-centric computing, …

– They lie outside the computer domain and cannot be addressed directly...

(Figure: the topic for these slides points directly at its subject through the subject locator http://www.ontopedia.net/tutorials/tm-intro.ppt)




Subject identifiers

(Figure labels: Life, the Universe and Everything; The Computer Domain; The Topic Map Domain)

The identity of most subjects can only be established indirectly

– An information resource can provide an indication of the subject’s identity to a human

– Such a resource is called a subject descriptor*

A subject descriptor has an address, even though the subject it indicates does not

– Computers can use the address of the subject descriptor to establish identity

– Such addresses are called subject identifiers

Subject descriptors and subject identifiers represent the two faces of the human-computer dichotomy

* also known as “subject indicator”

(Figure: the topic Puccini carries the subject identifier http://psi.ontopedia.net/Puccini, which is the address of a subject descriptor reading “Giacomo Puccini, Italian composer, b. Lucca 22nd Dec 1858, d. Brussels, 29th Nov 1924. Best known for his operas, of which Tosca is the most . . .”; the descriptor in turn indicates the subject)


A dual mechanism

The subject is identified by a URL

• The URL is called a subject identifier

(Figure: the topic “Giacomo Puccini” with the subject identifier http://psi.ontopedia.net/Puccini)

The URL is the address of a web page

• The web page describes the subject such that a human can know what subject is referred to

• This web page is called a subject descriptor

(Figure: the subject descriptor at http://psi.ontopedia.net/Puccini, reading “Giacomo Puccini: Italian composer, b. Lucca 22nd Dec 1858, d. Brussels, 29th Nov 1924. Best known for his operas, of which Tosca is one of the most popular and well-known.”)

Humans use the descriptor

By inspecting the web page the person responsible for assigning the identifier can be sure that it does not refer to, say, Giacomo’s grandfather Domenico (who was also a composer of operas)

Machines use the identifier

The link is not resolved. Instead simple lexical comparison is used. If the strings are identical, the subject is deemed to be the same and the topics are merged.
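The machine side of the dual mechanism can be illustrated as follows. This is only a sketch under the slide's stated rule (plain string equality, no dereferencing, no normalization); the function name and the lower-case variant of the URL are made up for the example.

```python
def same_subject(identifier_a: str, identifier_b: str) -> bool:
    """Compare two subject identifiers as plain strings.

    The URL is never resolved and never normalized, so the check needs
    no network access and no knowledge of URL syntax.
    """
    return identifier_a == identifier_b


PUCCINI = "http://psi.ontopedia.net/Puccini"

# Identical strings: the subject is deemed the same, so the topics merge.
merge_expected = same_subject(PUCCINI, "http://psi.ontopedia.net/Puccini")

# A hypothetical variant spelling is a different string, so no merge happens,
# even if both URLs happened to lead to the same page.
no_merge_expected = same_subject(PUCCINI, "http://psi.ontopedia.net/puccini")
```

The design choice is deliberate: a stricter-than-necessary comparison costs an occasional missed merge, while anything smarter would require normalization or network access on every comparison.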



Summary of the TM approach

Allows both direct and indirect identification of subjects

Direct identification is for information resources

– “addressable subjects” only

– subject locators (orig. subject addresses)

Indirect identification is for anything

– both “addressable” and “non-addressable subjects”

– subject identifiers and subject descriptors (orig. subject indicators)

There is also a construct called “item identifier”

– used under the covers for mapping between syntax and internal representation
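The difference between direct and indirect identification shows up as soon as the same URL is used in both roles. The sketch below assumes a deliberately minimal topic (a dict with two sets) and a simplified sameness rule that ignores item identifiers; the helper names are hypothetical.

```python
def topic(subject_locators=(), subject_identifiers=()):
    """Build a hypothetical, minimal topic as a dict of two identifier sets."""
    return {
        "subject_locators": set(subject_locators),
        "subject_identifiers": set(subject_identifiers),
    }


def represent_same_subject(a, b):
    """True when two topics share a locator, or share an identifier.

    Locators are never compared with identifiers: the same URL means
    different things in the two roles.
    """
    return bool(
        a["subject_locators"] & b["subject_locators"]
        or a["subject_identifiers"] & b["subject_identifiers"]
    )


SLIDES_URL = "http://www.ontopedia.net/tutorials/tm-intro.ppt"

# Direct identification: the subject is the slide deck itself.
the_slides = topic(subject_locators=[SLIDES_URL])
# Indirect identification: the subject is whatever the deck describes.
what_the_slides_describe = topic(subject_identifiers=[SLIDES_URL])

# The URL is the same, but the two topics are not about the same subject.
clash = represent_same_subject(the_slides, what_the_slides_describe)
```

Here `clash` is `False`: keeping the two roles apart is what prevents a document and the thing it describes from collapsing into one topic.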



The Identity Crisis of the Web

Also known as the httpRange-14 issue


“Identity crisis”

Article on XML.com September 2002 by Kendall Clark

– http://www.xml.com/pub/a/2002/09/11/deviant.html

Based on a review of the work of the W3C’s Technical Architecture Group (TAG)

– Architectural Principles of the World Wide Web

– http://www.w3.org/TR/webarch/

Part of a larger discussion in the “Web community”

– What do HTTP URIs identify? (Tim Berners-Lee)

– Disambiguating RDF Identifiers (Sandro Hawke)

– Four Uses of a URL (David Booth)

– Web Proper Names (Harry Halpin & Henry S. Thompson)



The problem in a nutshell:

What do URIs identify?

– Sandro Hawke: “To date, RDF has not been clear about whether a URI like http://www.w3.org/Consortium identifies the W3C or a web page about the W3C. Throughout RDF, strings like http://www.w3.org/1999/02/22-rdf-syntax-ns#type are used with no consistent explanation of how they relate to the web.”

Why is this important? Because without clarity on this issue

– The challenge of the Semantic Web cannot be solved

– Web services cannot be implemented in a scalable manner

– Ontologies and taxonomies will not be reusable

– The goal of Global Knowledge Federation is unreachable

– The problem of Infoglut will never go away


Introducing Eric Miller

Formerly of OCLC: Dublin Core, RDF

Later Technical Lead of the W3C’s Semantic Web Activity

“I see both RDF … as well as Topic Maps working toward enabling the Semantic Web”


A simple example (1)

RDF Primer

– http://www.w3.org/TR/2003/WD-rdf-primer-20030123/

Example 1: RDF/XML Describing Eric Miller

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:mailbox rdf:resource="mailto:[email protected]"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>


A simple example (2)

(Figure: RDF graph of Example 1. The node http://www.w3.org/People/EM/contact#me is a Person with full name “Eric Miller”, personal title “Dr.” and mailbox mailto:[email protected])


Resolving the URI

Clicking on this URL displays the following document

Now let’s add some DC metadata to this document

<?xml version="1.0" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://www.w3.org/2000/10/swap/pim/contact#">
  <Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <rdf:value>Eric Miller, [email protected]</rdf:value>
    <mailbox rdf:resource="mailto:[email protected]" />
    <fullName>Eric Miller</fullName>
    <personalTitle>Semantic Web Activity Lead</personalTitle>
    <company>W3C World Wide Web Consortium</company>
    <phone>614.763.1100</phone>
  </Person>
</rdf:RDF>

dc:creation-date: April 2nd 2002


Encoding the metadata in RDF

Ex2: RDF/XML Describing the document about EM

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.0/">
  <rdf:Description rdf:about="http://www.w3.org/People/EM/contact#me">
    <dc:creator>Eric Miller</dc:creator>
    <dc:creation-date>2002/06/04</dc:creation-date>
  </rdf:Description>
</rdf:RDF>

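(Figure: the document about Eric Miller, with dc:creator and dc:creation-date “April 2nd 2002”, shown next to the Person node “Eric Miller”, “Dr.”; because both use the same URI, the dc:creator and dc:creation-date properties end up attached to the Person)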


The cause of the problem

URIs are being used for two distinct purposes

– To identify information resources

– To identify the thing that an information resource describes or indicates

And we don’t know the difference!


Problem recognized in W3C

Architectural Principles of the World Wide Web:

2.2. Uses of URIs

The two primary uses of URIs are (1) To compare identifiers and (2) Dereference a URI (that is, as identifiers and as addresses)

2.2.5. Consistent use of URIs

It is confusing and costly when people use the same URI to refer to different resources (i.e., where there is some inconsistency in usage compared to the authoritative meaning of the resource). Suppose company A uses http://example.com/coolcompany to refer to CoolCompany's home page, while company B uses http://example.com/coolcompany to refer to CoolCompany. Company A then buys company B, but when they try to merge their databases, they cannot due to this inconsistent usage of the URI.
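The CoolCompany scenario can be made concrete with a small sketch. The two dictionaries below are hypothetical stand-ins for the companies' databases; each maps a URI to the kind of thing that database takes the URI to identify.

```python
HOME_PAGE = "home page"
COMPANY = "company"

# Hypothetical records: company A means the page, company B means the company.
database_a = {"http://example.com/coolcompany": HOME_PAGE}
database_b = {"http://example.com/coolcompany": COMPANY}


def find_conflicts(db_a, db_b):
    """Return URIs that both databases use, but for different kinds of thing."""
    return {
        uri: (db_a[uri], db_b[uri])
        for uri in db_a.keys() & db_b.keys()
        if db_a[uri] != db_b[uri]
    }


conflicts = find_conflicts(database_a, database_b)
```

A merge keyed on the URI alone would silently treat the home page and the company as one thing; `conflicts` exposes the clash, but only because this toy data records what each party meant, which real URIs do not.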



Original solution (2003) was…

… ineffectual handwaving:

2.2.5. Consistent use of URIs

Good practice: Consistent URIs: Indiscriminate use of a URI undermines its value and interferes with people who rely on it.

In fairness, individuals in the Web and RDF communities have proposed solutions

• Larry Masinter: tdb URN namespace (“Thing Described By”)

• Sandro Hawke: Distinguish between “page mode” and “subject mode”

• David Booth: Distinguish between “names”, “concepts”, “web locations,” and “documents”

• Not taken seriously by the W3C

• (There is also the hash/slash proposal)



How the situation came about

In the Beginning the Web was a web of information resources

– URIs (Uniform Resource Identifiers) originally called UDIs (Uniform Document Identifiers)

– Name changed to avoid narrow interpretation of “document”

– But “resources” were still information resources

Most important kind of URI was the URL – the Uniform Resource Locator

– A locator is the address of something (e.g., an information resource)

– An address is a fairly robust way of identifying something

– So URLs started to be regarded as identifiers

All of this worked fine until someone had the bright idea of using URLs to identify things that were not information resources...



Redefining “resource”

Imperceptibly, “resource” acquired a new meaning…

– No longer just an information resource…

– Came to mean anything whatsoever…

Practice codified in RFC 2396 in August 1998

– “A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), and a collection of other resources. Not all resources are network "retrievable"; e.g., human beings, corporations, and bound books in a library can also be considered resources.” (RFC 2396)

This was a mistake

– Because it obscures a fundamental ontological feature of the Web…

– … that information resources have special significance


Information resources are special

They have locations within the system

– A document has an address, a location

– Any information resource has an address

The address can be used to identify the resource

But nothing else has an address

– Eric Miller does not have a location within the computer system

This fundamental ontological fact is recognized in Topic Maps

– Direct identification vs. indirect identification

Not recognized in RDF, or the Web Architecture in general



URIs as resource identifiers

subject locator



URIs as arbitrary subject identifiers

subject descriptor



httpRange-14: The TAG’s resolution

Agreed on 15 Jun 2005:

The TAG provides advice to the community that they may mint "http" URIs for any resource provided that they follow this simple rule for the sake of removing ambiguity:

– If an "http" resource responds to a GET request with a 2xx response, then the resource identified by that URI is an information resource;

– If an "http" resource responds to a GET request with a 303 (See Other) response, then the resource identified by that URI could be any resource;

– If an "http" resource responds to a GET request with a 4xx (error) response, then the nature of the resource is unknown.

This resolution (known as the “303 hack”) has not ended the debate...
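The TAG's rule reads naturally as a small decision function. The sketch below only encodes the three cases quoted on this slide; the function name and the return strings are illustrative, and the final fallback for other status codes is an assumption, since the quoted text says nothing about them.

```python
def classify_http_resource(status_code: int) -> str:
    """Apply the TAG's 2005 rule to the status code of a GET response."""
    if 200 <= status_code < 300:
        # 2xx: the URI identifies an information resource.
        return "information resource"
    if status_code == 303:
        # 303 See Other: the URI could identify any resource at all.
        return "any resource"
    if 400 <= status_code < 500:
        # 4xx: nothing can be concluded about the resource.
        return "unknown"
    # Assumption: other codes are outside the quoted resolution.
    return "not covered by the resolution"
```

Note what the rule costs: a client has to perform a network request before it knows what kind of thing a URI identifies, whereas subject locators and subject identifiers carry that distinction without any request.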



Published Subjects



Published Subjects

In order for identifiers to be reused, they must be made publicly available

– A subject identifier that has been made available for use outside one particular application is called a published subject identifier (PSI)

– Its descriptor is called a published subject descriptor (PSD)

Anyone can publish PSI sets

– Adoption of PSI sets will be an evolutionary process based on trust

– It will lead to greater and greater interoperability – between topic map applications, between Topic Maps and RDF, and across information and knowledge management in general

– Check out http://psi.ontopedia.net (under development)



What is “Published Subjects”?

An extremely simple mechanism (or convention) for defining and sharing globally unique identifiers for arbitrary subjects

– The identifier is an HTTP URI (i.e. a URL)

– It’s called a published subject identifier (PSI)

It resolves to a web page

– The contents of this page convey the identity of the subject in a form that is human-interpretable

– This page is called a published subject descriptor (PSD)


The advantages of PSIs

URLs (HTTP URIs) are easier to use than, e.g. URNs

– The resolution mechanism is now very widely supported

The PSI / PSD duality is simple and useful

– Makes it possible for users to understand the publisher’s “intentionality”

Open and democratic

– Anyone can create a PSI – no top-down supervision

Common sets of PSIs can emerge through consensus based on

– Trust in the publisher (stability, longevity)

– Degree of adoption in particular communities


A little terminology

Topic Maps standard (1999)

– Public Subjects

– Public Subject Descriptor

XTM 1.0 (2001)

– Published Subject Indicator (PSI)

OASIS PubSubj TC (2003)

– Published Subject Indicator (PSI)

– Published Subject Identifier (PSID)

W3C Call for Action (2006)

– Public Resource Identifier (PRI)

– Public Resource Descriptor (PRD)

Current usage

– PSI abbreviation for the identifier

– Confusion identifier / indicator

My proposal

– Published Subject Identifier (PSI)

– Published Subject Descriptor (PSD)

Rationale

– PSI most often used for the identifier

– Term “indicator” a little too opaque

– “Identifier” and “indicator” too similar

– One abbreviation for two different terms leads to confusion


Proposed definitions

Published Subjects

– a paradigm for creating globally unique identifiers for arbitrary subjects

published subject

– a subject for which a published subject identifier has been published

published subject identifier (PSI)

– an HTTP URI that was created explicitly for the purpose of serving as the identifier for some subject

published subject descriptor (PSD)

– an information resource to which a published subject identifier resolves and whose purpose is to convey to a human the identity of the subject thus identified, i.e. the intentionality of the publisher of the PSI


OASIS PubSubj TC (updated)

Requirements

– A PSI must be a URI

– A PSI must resolve to a PSD

– A PSD must explicitly state its PSI

Recommendations

– A PSD should provide human-readable metadata

– A PSD may provide machine-readable metadata

– Human-readable and machine-readable metadata should be consistent but need not be equivalent

– A PSD should indicate its intended use as a PSD

– A PSD should identify its publisher
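Two of the three requirements above can be checked offline, as the following sketch shows. It is not part of the OASIS deliverables: the function name, the message strings and the idea of passing the PSD in as already-fetched text are assumptions for the example. The second requirement (the PSI must resolve to a PSD) needs network access and is deliberately left out.

```python
from urllib.parse import urlparse


def check_psd(psi: str, psd_text: str) -> list:
    """Return the requirements that a PSI / PSD pair fails to meet.

    psd_text is assumed to be the already-retrieved content of the PSD.
    """
    problems = []
    # Requirement: a PSI must be a URI (approximated here by "has a scheme").
    if not urlparse(psi).scheme:
        problems.append("PSI is not a URI (no scheme)")
    # Requirement: a PSD must explicitly state its PSI.
    if psi not in psd_text:
        problems.append("PSD does not explicitly state its PSI")
    return problems


good = check_psd(
    "http://psi.ontopedia.net/Puccini",
    "Giacomo Puccini, Italian composer. PSI: http://psi.ontopedia.net/Puccini",
)
bad = check_psd("http://psi.ontopedia.net/Puccini", "Giacomo Puccini, Italian composer.")
```

Requiring the PSD to state its own PSI means a human who lands on the page by any route can still find the exact string to use as the identifier.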



Frequently Asked Questions

What happens if two people create PSIs for the same subject?

– This will happen, but it’s no catastrophe

– Over time, stable sets of PSIs will emerge as de facto standards

– In the interim, mapping between PSIs (or between PSI sets) is simple

– With structured information, batch updating of identifiers is easy

How do I go about finding a PSI?

– As of today there are no registries or lookup services

– We envisage an open, distributed system based on, or similar to, UDDI

What if I disagree with assertions made by the publisher?

– Doesn’t matter. You aren’t being asked to agree!

– The assertions are only there to give you sufficient indication of the identity of the subject to be able to decide if it’s the same subject as the one you’re interested in.
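The answer to the first question claims that mapping between PSIs and batch updating of identifiers are simple. A minimal sketch of what that could look like follows; the `example.org` identifier and the mapping table are invented for illustration.

```python
# Hypothetical mapping from a locally minted PSI to the one a community prefers.
PSI_MAPPING = {
    "http://example.org/psi/puccini": "http://psi.ontopedia.net/Puccini",
}


def remap_identifiers(subject_identifiers, psi_mapping):
    """Replace each identifier that has a preferred equivalent; keep the rest."""
    return {psi_mapping.get(identifier, identifier) for identifier in subject_identifiers}


before = {"http://example.org/psi/puccini", "http://psi.ontopedia.net/Tosca"}
after = remap_identifiers(before, PSI_MAPPING)
```

Once the identifiers have been rewritten, topics that previously used different PSIs for Puccini carry the same string and merge under ordinary lexical comparison.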



Discussion points

Should we only use HTTP URIs?

– Only HTTP URIs have a widely supported resolution mechanism

What form should the URI take?

– Readability, use of fragment identifiers, queries, etc.

Are Wikipedia URLs suitable?

– If so, what about other sources, e.g. Ethnologue: http://www.ethnologue.com/show_language.asp?code=nsl

What information should a PSD contain?

– Content of descriptor itself, metadata

What kinds of discovery mechanism could be used?

– Registries, search engines, ...

What is the role of the PSI server?

– In addition to publishing the PSD, what services might it offer?

Norwegian terminology

– publisert tema, publisert temaidentifikator, publisert temadeskriptor?
