www.ontopedia.net
ONTOPEDIA: The Identity of Everything
Subject Identity
Steve Pepper
INF5909, 2009-02-23
Agenda
Merging in Topic Maps
The Importance of Identity
The Topic Maps Approach to Identity
The Identity Crisis of the Web
Published Subjects
(Subject-centric Computing)
Merging in Topic Maps
An Example of Knowledge Federation
Merging topic maps
Topic maps can be merged automatically
– Arbitrary topic maps can be merged into a single topic map
– This cannot be done with databases or XML documents
Merging enables many advanced applications
– Information integration across repositories
– Sharing and reusing taxonomies
– Automated content aggregation
– Distributed knowledge management
– Global knowledge federation
Merging is made possible by subject identity
Principles of merging
By definition: Every topic represents exactly one subject
The goal: Every subject represented by exactly one topic
1. When two topic maps are merged, topics that represent the same subject should be merged to a single topic
2. When two topics are merged, the resulting topic has the union of the characteristics of the two original topics
[Diagram: a topic T with its names, occurrences, and association roles, and a second topic (in another topic map) “about” the same subject. Merge the two topics together, and the resulting topic has the union of the original characteristics.]
(Demo of merging in the Omnigator…)
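The two merging principles can be sketched in a few lines of Python. This is a minimal illustration, not how the Omnigator or any Topic Maps engine is implemented: the `Topic` class, its field names, and the Lucca identifier are assumptions made for the example, and "same subject" is approximated by "shares at least one identifier string".

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Hypothetical minimal topic: a proxy for exactly one subject."""
    identifiers: set[str] = field(default_factory=set)
    names: set[str] = field(default_factory=set)
    occurrences: set[str] = field(default_factory=set)
    roles: set[str] = field(default_factory=set)


def merge_topics(a: Topic, b: Topic) -> Topic:
    """Principle 2: the result has the union of both topics' characteristics."""
    return Topic(a.identifiers | b.identifiers, a.names | b.names,
                 a.occurrences | b.occurrences, a.roles | b.roles)


def merge_maps(map1: list[Topic], map2: list[Topic]) -> list[Topic]:
    """Principle 1: topics that represent the same subject become one topic."""
    merged: list[Topic] = []
    for topic in map1 + map2:
        keep = []
        for existing in merged:
            if existing.identifiers & topic.identifiers:
                # Shared identifier: fold the existing topic into this one.
                topic = merge_topics(topic, existing)
            else:
                keep.append(existing)
        keep.append(topic)
        merged = keep
    return merged


# Two topic maps that both contain a topic "about" Puccini.
puccini_a = Topic({"http://psi.ontopedia.net/Puccini"}, {"Puccini"},
                  set(), {"composer of Tosca"})
puccini_b = Topic({"http://psi.ontopedia.net/Puccini"}, {"Giacomo Puccini"},
                  {"biography"}, {"born in Lucca"})
lucca = Topic({"http://psi.ontopedia.net/Lucca"}, {"Lucca"})  # assumed identifier

result = merge_maps([puccini_a], [puccini_b, lucca])
puccini = next(t for t in result if "Puccini" in t.names)
```

After the merge there are two topics rather than three, and the Puccini topic carries both names, the occurrence, and both association roles.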
The vision of seamless knowledge
Starting with ITU in 2001, Norway has seen an explosion in the number of portals that are based on Topic Maps
– Today there are dozens, especially in the public sector
As the number of portals multiplies, the amount of overlap increases…
– The potential for integration is … staggering
Take these three portals as an example:
– forskning.no (Research Council web site aimed at young adults)
– forbrukerportalen.no (Norwegian Consumer Association)
– matportalen.no (Biosecurity portal of the Department of Agriculture)
Genetically modified food at forskning.no
Genetically modified food at Forbrukerrådet
Genetically modified foodstuffs at Matportalen
Three portals – one subject
one “virtual portal” with seamless navigation in all directions
The Importance of Identity
Identity and knowledge federation
Knowledge federation requires subject-based merging
The big challenge is knowing when we’re talking about the same thing
[Diagram: the computer domain and the real world]
Humans get by using names
But names are ambiguous (homonyms)
– Humans disambiguate using (a) context and (b) negotiation
Many names have the same referent (synonyms)
– Humans can generally handle this
– Computers can’t – at least not without our help...
Computers need a simpler mechanism
– Local identifiers (database keys, XML IDs, controlled vocabularies, code sets, etc.) work OK in closed systems
– but not across systems or domains (e.g. the code “nor”)
– Open and multilingual systems need global identifiers
Requirements on global identifiers
The mechanism as a whole should be
– open and democratic: top-down solutions won’t work
– scalable: the number of potential subjects is open-ended
– easy to adopt: based on existing tools and methods
The identifiers themselves should be
– easy for humans to use: locate, create, interpret, apply
  • given a subject, find an identifier (if one exists)
  • given a subject, create an identifier
  • given an identifier, find out what subject it identifies
  • given an identifier, attach it to the information in question
– efficient for computers to use: comparison of identifiers
  • lexical comparison simplest
  • avoid normalization, network access, other computation
Some proposed solutions
URL-based proposals
For web documents
– HTTP URIs (URLs)
– address = identifier
For resources in general
– Source: SemWeb community
– URIs for arbitrary “resources” (esp. classes and properties)
Published Subjects
– Source: Topic Maps community
– Continuation of SemWeb practice
Non-URL-based proposals
URN (RFC 1737)
– Uniform Resource Names
XRI (OASIS)
– Extensible Resource Identifiers
Domain-specific
– ISBN (books)
– DOI (“digital objects”)
– GUID & UUID
– UPC & EAN
– RFID
– (what else is out there?)
The Topic Maps Approach to Identity
Direct identification (subject locators)
Indirect identification (subject identifiers)
Subjects and topics
Topics are surrogates, or “proxies” (inside the computer) for the ineffable subjects that you want to talk about, such as Puccini, love, these slides, or the second law of thermodynamics
[Diagram: a subject in the real world (referent) and a topic T in the computer domain (symbol)]
Topics and subjects
Topics represent subjects
– By definition every topic represents exactly one subject
– The goal when merging is to ensure that every subject is represented by exactly one topic (the collocation objective)
A subject can be anything you want
– ISO 13250 definition: “A subject is any “thing” whatsoever, whether or not it exists or has any other specific characteristics, about which anything whatsoever may be asserted by any means whatsoever.”
Some examples: Puccini, Lucca, Tosca, Madame Butterfly
The identity of subjects
Topics exist in order to allow us to talk about subjects
– The relationship between the two is sometimes called intentionality
We need to know exactly which subject a topic represents
– That is, we need to establish its subject identity
– The collocation objective depends on knowing when applications are talking about the same thing
[Diagram: topics for Lucca, Tosca, Puccini, and Madame Butterfly]
Subject locators
Sometimes the subject is an information resource (like these slides)
– It exists somewhere within the computer system
– It has a location and can be “addressed”, e.g. http://www.ontopedia.net/tutorials/tm-intro.ppt
– The address of such an addressable subject can be used to unequivocally establish the subject’s identity
– An address used in this way to identify a subject directly is called a subject locator
But most subjects are not information resources
– Puccini, Tosca, love, subject-centric computing, …
– Outside the computer domain and cannot be addressed directly...
[Diagram: a topic linked by the subject locator http://www.ontopedia.net/tutorials/tm-intro.ppt to its subject (these slides)]
Subject identifiers
The identity of most subjects can only be established indirectly
– An information resource can provide an indication of the subject’s identity to a human
– Such a resource is called a subject descriptor*
A subject descriptor has an address, even though the subject it indicates does not
– Computers can use the address of the subject descriptor to establish identity
– Such addresses are called subject identifiers
Subject descriptors and subject identifiers represent the two faces of the human-computer dichotomy
* also known as “subject indicator”
[Diagram: three domains (Life, the Universe and Everything; The Computer Domain; The Topic Map Domain). The subject, Puccini, is indicated by a subject descriptor (“Giacomo Puccini, Italian composer, b. Lucca 22nd Dec 1858, d. Brussels, 29th Nov 1924. Best known for his operas, of which Tosca is the most . . .”), whose address http://psi.ontopedia.net/Puccini serves as the subject identifier of the topic.]
A dual mechanism
The subject is identified by a URL
• The URL is called a subject identifier
The URL is the address of a web page
• The web page describes the subject such that a human can know what subject is referred to
• This web page is called a subject descriptor
Humans use the descriptor
• By inspecting the web page the person responsible for assigning the identifier can be sure that it does not refer to, say, Giacomo’s grandfather Domenico (who was also a composer of operas)
Machines use the identifier
• The link is not resolved. Instead simple lexical comparison is used. If the strings are identical, the subject is deemed to be the same and the topics are merged.
[Diagram: the topic “Giacomo Puccini” has the subject identifier http://psi.ontopedia.net/Puccini, which is the address of the subject descriptor: “Giacomo Puccini, Italian composer, b. Lucca 22nd Dec 1858, d. Brussels, 29th Nov 1924. Best known for his operas, of which Tosca is one of the most popular and well-known.”]
Summary of the TM approach
Allows both direct and indirect identification of subjects
Direct identification is for information resources
– “addressable subjects” only
– subject locators (orig. subject addresses)
Indirect identification is for anything
– both “addressable” and “non-addressable subjects”
– subject identifiers and subject descriptors (orig. subject indicators)
There is also a construct called “item identifier”
– used under the covers for mapping between syntax and internal representation
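As a rough sketch of how a machine might apply this summary, the Python below keeps subject locators and subject identifiers in separate sets and compares them as plain strings. The `Topic` class and `same_subject` function are names invented for this illustration. The rule that locators are compared only with locators and identifiers only with identifiers is how I understand the Topic Maps data model, so treat it as an assumption here.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Hypothetical topic holding only its identity information."""
    subject_locators: set[str] = field(default_factory=set)
    subject_identifiers: set[str] = field(default_factory=set)


def same_subject(a: Topic, b: Topic) -> bool:
    """Decide by lexical comparison whether two topics share a subject.

    Nothing is dereferenced or normalized: two strings match only if they
    are identical. A locator never matches an identifier, because the same
    address can identify a document directly (locator) or, indirectly,
    whatever that document describes (identifier).
    """
    return bool(a.subject_locators & b.subject_locators
                or a.subject_identifiers & b.subject_identifiers)


puccini = Topic(subject_identifiers={"http://psi.ontopedia.net/Puccini"})
puccini_again = Topic(subject_identifiers={"http://psi.ontopedia.net/Puccini"})
# The web page itself (the descriptor), identified directly by its address.
page_about_puccini = Topic(subject_locators={"http://psi.ontopedia.net/Puccini"})
# Differs only in case: no normalization, so no match.
lowercase = Topic(subject_identifiers={"http://psi.ontopedia.net/puccini"})

composer_matches = same_subject(puccini, puccini_again)
page_matches = same_subject(puccini, page_about_puccini)
case_matches = same_subject(puccini, lowercase)
```

Only the first comparison succeeds: the composer and the page about the composer stay separate topics even though the same URL string appears in both.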
The Identity Crisis of the Web
Also known as the httpRange14 issue
“Identity crisis”
Article on XML.com September 2002 by Kendall Clark
– http://www.xml.com/pub/a/2002/09/11/deviant.html
Based on a review of the work of the W3C’s Technical Architecture Group (TAG)
– Architectural Principles of the World Wide Web
– http://www.w3.org/TR/webarch/
Part of a larger discussion in the “Web community”
– What do HTTP URIs identify? (Tim Berners-Lee)
– Disambiguating RDF Identifiers (Sandro Hawke)
– Four Uses of a URL (David Booth)
– Web Proper Names (Harry Halpin & Henry S. Thompson)
The problem in a nutshell:
What do URIs identify?
– Sandro Hawke: “To date, RDF has not been clear about whether a URI like http://www.w3.org/Consortium identifies the W3C or a web page about the W3C. Throughout RDF, strings like http://www.w3.org/1999/02/22-rdf-syntax-ns#type are used with no consistent explanation of how they relate to the web.”
Why is this important? Because without clarity on this issue
– The challenge of the Semantic Web cannot be solved
– Web services cannot be implemented in a scalable manner
– Ontologies and taxonomies will not be reusable
– The goal of Global Knowledge Federation is unreachable
– The problem of Infoglut will never go away
Introducing Eric Miller
Formerly of OCLC: Dublin Core, RDF
Later Technical Lead of the W3C’s Semantic Web Activity
“I see both RDF … as well as Topic Maps working toward enabling the Semantic Web”
A simple example (1)
RDF Primer
– http://www.w3.org/TR/2003/WD-rdf-primer-20030123/
Example 1: RDF/XML Describing Eric Miller
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">
  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:mailbox rdf:resource="mailto:[email protected]"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>
</rdf:RDF>
A simple example (2)
[Graph: the resource http://www.w3.org/People/EM/contact#me is a Person with full name “Eric Miller”, personal title “Dr.”, and mailbox mailto:[email protected]]
Resolving the URI
Clicking on this URL displays the following document:
<?xml version="1.0" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://www.w3.org/2000/10/swap/pim/contact#">
  <Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <rdf:value>Eric Miller, [email protected]</rdf:value>
    <mailbox rdf:resource="mailto:[email protected]" />
    <fullName>Eric Miller</fullName>
    <personalTitle>Semantic Web Activity Lead</personalTitle>
    <company>W3C World Wide Web Consortium</company>
    <phone>614.763.1100</phone>
  </Person>
</rdf:RDF>
Now let’s add some DC metadata to this document
– dc:creation-date: April 2nd 2002
Encoding the metadata in RDF
Ex2: RDF/XML Describing the document about EM
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.0/">
  <rdf:Description rdf:about="http://www.w3.org/People/EM/contact#me">
    <dc:creator>Eric Miller</dc:creator>
    <dc:creation-date>2002/06/04</dc:creation-date>
  </rdf:Description>
</rdf:RDF>
[Graph: the document about Eric Miller (dc:creator, dc:creation-date April 2nd 2002) and the Person Eric Miller (Dr., mailto:[email protected]) are both identified by http://www.w3.org/People/EM/contact#me, so the two descriptions collapse into a single node.]
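The collapse can be reproduced without any RDF tooling. The sketch below writes the statements from the two examples as plain (subject, property, value) tuples and groups them by subject URI. The tuple representation and the `describe` helper are simplifications assumed for illustration, not an RDF API.

```python
from collections import defaultdict

URI = "http://www.w3.org/People/EM/contact#me"

# Statements from Example 1, intended to describe the person.
person_statements = [
    (URI, "rdf:type", "contact:Person"),
    (URI, "contact:fullName", "Eric Miller"),
    (URI, "contact:personalTitle", "Dr."),
]

# Statements from Example 2, intended to describe the document.
document_statements = [
    (URI, "dc:creator", "Eric Miller"),
    (URI, "dc:creation-date", "2002/06/04"),
]


def describe(statements):
    """Group statements by subject URI, as a merge by identifier would."""
    graph = defaultdict(dict)
    for subject, prop, value in statements:
        graph[subject][prop] = value
    return dict(graph)


merged = describe(person_statements + document_statements)
node = merged[URI]
```

The result contains a single node that is a `contact:Person` and also has a `dc:creation-date`, which reads as if Eric Miller was created on the date the document was written.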
The cause of the problem
URIs are being used for two distinct purposes
– To identify information resources
– To identify the thing that an information resource describes or indicates
And we don’t know the difference!
Problem recognized in W3C
Architectural Principles of the World Wide Web:
2.2. Uses of URIs
The two primary uses of URIs are (1) To compare identifiers and (2) Dereference a URI (that is, as identifiers and as addresses)
2.2.5. Consistent use of URIs
It is confusing and costly when people use the same URI to refer to different resources (i.e., where there is some inconsistency in usage compared to the authoritative meaning of the resource). Suppose company A uses http://example.com/coolcompany to refer to CoolCompany's home page, while company B uses http://example.com/coolcompany to refer to CoolCompany. Company A then buys company B, but when they try to merge their databases, they cannot due to this inconsistent usage of the URI.
Original solution (2003) was…
… ineffectual handwaving:
2.2.5. Consistent use of URIs
Good practice: Consistent URIs: Indiscriminate use of a URI undermines its value and interferes with people who rely on it.
In fairness, individuals in the Web and RDF communities have proposed solutions
• Larry Masinter: tdb URN namespace (“Thing Described By”)
• Sandro Hawke: Distinguish between “page mode” and “subject mode”
• David Booth: Distinguish between “names”, “concepts”, “web locations,” and “documents”
• Not taken seriously by the W3C
• (There is also the hash/slash proposal)
How the situation came about
In the Beginning the Web was a web of information resources
– URIs (Uniform Resource Identifiers) originally called UDIs (Uniform Document Identifiers)
– Name changed to avoid narrow interpretation of “document”
– But “resources” were still information resources
Most important kind of URI was the URL – the Uniform Resource Locator
– A locator is the address of something (e.g., an information resource)
– An address is a fairly robust way of identifying something
– So URLs started to be regarded as identifiers
All of this worked fine until someone had the bright idea of using URLs to identify things that were not information resources...
Redefining “resource”
Imperceptibly, “resource” acquired a new meaning…
– No longer just an information resource…
– Came to mean anything whatsoever…
Practice codified in RFC 2396 in August 1998
– “A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., "today's weather report for Los Angeles"), and a collection of other resources. Not all resources are network "retrievable"; e.g., human beings, corporations, and bound books in a library can also be considered resources.” (RFC 2396)
This was a mistake
– Because it obscures a fundamental ontological feature of the Web…
– … that information resources have special significance
Information resources are special
They have locations within the system
– A document has an address, a location
– Any information resource has an address
The address can be used to identify the resource
But nothing else has an address
– Eric Miller does not have a location within the computer system
This fundamental ontological fact is recognized in Topic Maps
– Direct identification vs. indirect identification
Not recognized in RDF, or the Web Architecture in general
URIs as resource identifiers
subject locator
URIs as arbitrary subject identifiers
subject descriptor
httpRange14: The TAG’s resolution
Agreed on 15 Jun 2005:
The TAG provides advice to the community that they may mint "http" URIs for any resource provided that they follow this simple rule for the sake of removing ambiguity:
– If an "http" resource responds to a GET request with a 2xx response, then the resource identified by that URI is an information resource;
– If an "http" resource responds to a GET request with a 303 (See Other) response, then the resource identified by that URI could be any resource;
– If an "http" resource responds to a GET request with a 4xx (error) response, then the nature of the resource is unknown.
This resolution (known as the “303 hack”) has not ended the debate...
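The three rules of the resolution amount to a lookup on the HTTP status code. A minimal sketch follows; the function name and the returned labels are made up for illustration, and actually issuing the GET request is left out so that the example stays self-contained.

```python
def classify_http_resource(status: int) -> str:
    """Apply the TAG's three rules to the status code of a GET response."""
    if 200 <= status <= 299:
        # 2xx: the URI identifies an information resource.
        return "information resource"
    if status == 303:
        # 303 See Other: the URI could identify any resource.
        return "any resource"
    if 400 <= status <= 499:
        # 4xx: the nature of the resource is unknown.
        return "unknown"
    # The resolution says nothing about other status codes.
    return "not covered by the resolution"


ok = classify_http_resource(200)
see_other = classify_http_resource(303)
not_found = classify_http_resource(404)
moved = classify_http_resource(301)
```

Note that, unlike the Topic Maps approach, this classification requires resolving the URI over the network before the kind of resource is known.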
Published Subjects
Published Subjects
In order for identifiers to be reused, they must be made publicly available
– A subject identifier that has been made available for use outside one particular application is called a published subject identifier (PSI)
– Its descriptor is called a published subject descriptor (PSD)
Anyone can publish PSI sets
– Adoption of PSI sets will be an evolutionary process based on trust
– It will lead to greater and greater interoperability – between topic map applications, between Topic Maps and RDF, and across information and knowledge management in general
– Check out http://psi.ontopedia.net (under development)
What is “Published Subjects”?
An extremely simple mechanism (or convention) for defining and sharing globally unique identifiers for arbitrary subjects
– The identifier is an HTTP URI (i.e. a URL)
– It’s called a published subject identifier (PSI)
It resolves to a web page
– The contents of this page convey the identity of the subject in a form that is human-interpretable
– This page is called a published subject descriptor* (PSD)
The advantages of PSIs
URLs (HTTP URIs) are easier to use than, e.g. URNs
– The resolution mechanism is now very widely supported
The PSI / PSD duality is simple and useful
– Makes it possible for users to understand the publisher’s “intentionality”
Open and democratic
– Anyone can create a PSI – no top-down supervision
Common sets of PSIs can emerge through consensus based on
– Trust in the publisher (stability, longevity)
– Degree of adoption in particular communities
A little terminology
Topic Maps standard (1999)
– Public Subjects
– Public Subject Descriptor
XTM 1.0 (2001)
– Published Subject Indicator (PSI)
OASIS PubSubj TC (2003)
– Published Subject Indicator (PSI)
– Published Subject Identifier (PSID)
W3C Call for Action (2006)
– Public Resource Identifier (PRI)
– Public Resource Descriptor (PRD)
Current usage
– PSI abbreviation for the identifier
– Confusion identifier / indicator
My proposal
– Published Subject Identifier (PSI)
– Published Subject Descriptor (PSD)
Rationale
– PSI most often used for the identifier
– Term “indicator” a little too opaque
– “Identifier” and “indicator” too similar
– One abbreviation for two different terms leads to confusion
Proposed definitions
Published Subjects
– a paradigm for creating globally unique identifiers for arbitrary subjects
published subject
– a subject for which a published subject identifier has been published
published subject identifier (PSI)
– an HTTP URI that was created explicitly for the purpose of serving as the identifier for some subject
published subject descriptor (PSD)
– an information resource to which a published subject identifier resolves and whose purpose is to convey to a human the identity of the subject thus identified, i.e. the intentionality of the publisher of the PSI
OASIS PubSubj TC (updated)
Requirements
– A PSI must be a URI
– A PSI must resolve to a PSD
– A PSD must explicitly state its PSI
Recommendations
– A PSD should provide human-readable metadata
– A PSD may provide machine-readable metadata
– Human-readable and machine-readable metadata should be consistent but need not be equivalent
– A PSD should indicate its intended use as a PSD
– A PSD should identify its publisher
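Two of the three requirements can be checked offline. The sketch below is a hypothetical checker and not part of the OASIS recommendation: `check_psd` is an invented name, "is a URI" is approximated by "has a URI scheme", and "explicitly states its PSI" is approximated by a substring test on the descriptor text. The requirement that the PSI resolves to the PSD needs network access and is left out.

```python
from urllib.parse import urlparse


def check_psd(psi: str, psd_text: str) -> list[str]:
    """Return the requirements that a PSI / PSD pair fails (empty if none)."""
    problems = []
    # "A PSI must be a URI": approximated by requiring a URI scheme.
    if not urlparse(psi).scheme:
        problems.append("PSI is not a URI")
    # "A PSD must explicitly state its PSI": approximated by a substring test.
    if psi not in psd_text:
        problems.append("PSD does not state its PSI")
    return problems


# Hypothetical descriptor text, loosely modelled on the Puccini example.
psd = ("Giacomo Puccini, Italian composer, b. Lucca 1858. "
       "Published subject identifier: http://psi.ontopedia.net/Puccini")

good = check_psd("http://psi.ontopedia.net/Puccini", psd)
bad = check_psd("Puccini-the-composer", "Giacomo Puccini, Italian composer")
```

The first pair passes both checks. The second fails both, since the identifier has no scheme and the descriptor never mentions it.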
Frequently Asked Questions
What happens if two people create PSIs for the same subject?
– This will happen, but it’s no catastrophe
– Over time, stable sets of PSIs will emerge as de facto standards
– In the interim, mapping between PSIs (or between PSI sets) is simple
– With structured information, batch updates of identifiers are easy
How do I go about finding a PSI?
– As of today there are no registries or lookup services
– We envisage an open, distributed system based on, or similar to, UDDI
What if I disagree with assertions made by the publisher?
– Doesn’t matter. You aren’t being asked to agree!
– The assertions are only there to give you sufficient indication of the identity of the subject to be able to decide if it’s the same subject as the one you’re interested in.
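The claim in the first answer, that mapping between PSIs and batch-updating identifiers is simple, can be illustrated with a dictionary lookup. In this sketch each topic is reduced to its set of identifier strings, and the `example.org` PSI and the Tosca identifier are made-up placeholders.

```python
def remap_identifiers(topics: list[set[str]],
                      mapping: dict[str, str]) -> list[set[str]]:
    """Replace every identifier that has a mapping; leave the rest untouched."""
    return [{mapping.get(identifier, identifier) for identifier in identifiers}
            for identifiers in topics]


# Two publishers minted PSIs for Puccini; map one onto the other.
mapping = {"http://example.org/psi/puccini": "http://psi.ontopedia.net/Puccini"}

topics = [
    {"http://example.org/psi/puccini"},
    {"http://psi.ontopedia.net/Tosca"},
]

updated = remap_identifiers(topics, mapping)
```

After the update, a topic that used the `example.org` identifier compares equal, by plain string comparison, to topics that already used the Ontopedia one.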
Discussion points
Should we only use HTTP URIs?
– Only HTTP URIs have a widely supported resolution mechanism
What form should the URI take?
– Readability, use of fragment identifiers, queries, etc.
Are Wikipedia URLs suitable?
– If so, what about other sources, e.g. Ethnologue http://www.ethnologue.com/show_language.asp?code=nsl
What information should a PSD contain?
– Content of descriptor itself, metadata
What kinds of discovery mechanism could be used?
– Registries, search engines, ...
What is the role of the PSI server?
– In addition to publishing the PSD, what services might it offer?
Norwegian terminology
– publisert tema, publisert temaidentifikator, publisert temadeskriptor?