Post on 12-Jan-2016
OKKAM – Enabling the Web of Entities
A SCALABLE AND SUSTAINABLE SOLUTION
FOR SYSTEMATIC AND GLOBAL IDENTIFIER REUSE IN DECENTRALIZED
INFORMATION ENVIRONMENTS
KnowDive Seminar, April 11, 2007
Trento, Italy
Background: KR goes Global
Knowledge representation is a field which currently seems to have the reputation of being initially interesting, but which did not seem to shake the world to the extent that some of its proponents hoped.
It made sense but was of limited use on a small scale, but never made it to the large scale. This is exactly the state which the hypertext field was in before the Web.
Each field had made certain centralist assumptions -- if not in the philosophy, then in the implementations, which prevented them from spreading globally.
But each field was based on fundamentally sound ideas about the representation of knowledge.
The Semantic Web is what we will get if we perform the same globalization process to Knowledge Representation that the Web initially did to Hypertext. We remove the centralized concepts of absolute truth, total knowledge, and total provability, and see what we can do with limited knowledge.
[Tim Berners-Lee, What the Semantic Web can represent, 1998]
In practice …
[Figure: the Web of Links, in which web pages (www.google.com, www.unitn.it, www.l3s.org, www.ryanair.com, www.paolobouquet.net, ockham.org, www.trento.it) are connected by href links, contrasted with the Web of Meanings, in which entities (Bouquet, UniTN, Niederee, L3S, VIKEF) are connected by relations (Works-for, Knows, Coordinates, Is_involved_in).]
What went wrong (personal view)
The Web of Meanings (the Semantic Web) is not happening, at least not as the WWW happened during the 1990s
Enabling factors for the Web of Links (the WWW):
◦ Any available resource has a global URL, which allows Web clients to address it
◦ The same identifier can be resolved to retrieve the resource through the HTTP protocol (running on top of TCP/IP)
◦ Creating href links is easy on top of this infrastructure
What about the Web of Meanings?
◦ Non-addressable resources do not have an infrastructure for supporting the use of global identifiers (more about this)
◦ Non-addressable resources cannot be retrieved
◦ Creating global links between non-addressable resources is difficult
Outcome: we lack the preconditions for the Web of Meanings to happen!
Further (strategic) errors
On top of these infrastructural issues, a big strategic error was made (personal opinion!):
◦ The AI people came in, and tried to “recycle” their logical know-how on the Semantic Web
◦ The plan was to build the Semantic Web starting from representations (theories, currently known as ontologies) and not from resources (entities)
◦ This led to a scalability issue: reasoning is hard for local theories, forget about going global! [Heard about semantic heterogeneity, ontology mapping, alignment, distributed reasoning, …?]
My vision
Back to the building blocks: entities!
◦ First, create the infrastructure for enabling in practice a global space of identifiers (e.g. URIs)
◦ Second, show how we can create value simply from linking globally identified entities
◦ Third, specify vocabularies and ontologies for (subsets of) globally identified entities
◦ Fourth, link ontologies to each other on top of the already integrated domain of globally identified entities
Hopefully, this will lead to the Web of Entities, namely a global digital space in which any knowledge expressed in any local web of entities can be seamlessly integrated and reasoned about
OKKAM overall goal
The goal of the OKKAM project is to implement the first part of this plan.
Establishing a scalable and sustainable infrastructure for the storage and reuse of global identifiers for non-addressable entities in decentralized information environments
Enabling different forms of OKKAMization of old and new content
Creating a primitive index which links global identifiers to OKKAMized content
Building applications which can showcase the potential value of this approach
But why “OKKAM”?
Ockham's Razor (14th century):
“entities should not be multiplied beyond necessity”
OKKAM’s Razor (21st century):
“entity identifiers should not be multiplied beyond necessity”
1. Infrastructure
Cornerstone: large-scale Entity Repository (ER)
Architecture: distributed, supports federation of local ERs, replicated (no single point of failure)
ER vs. Entity Base (or Knowledge Base): supporting reuse vs. collecting and providing knowledge about entities
Basic schema: set of attribute/value pairs (called “labels”) with no predefined semantics
Features:
◦ Size: unbelievable (billions of identifiers + profiles stored)
◦ Network traffic: massive (up to millions of requests per minute)
◦ Quality: hard to ensure
◦ Update: grows monotonically (no deletion). Aging mechanism?
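The basic schema and the no-deletion policy can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: the class name `EntityRepository`, its methods, the `urn:example:entity:` prefix and the naive "any shared label" reuse rule are hypothetical and say nothing about how OKKAM itself is implemented, distributed or replicated.

```python
import uuid


class EntityRepository:
    """Toy ER: maps an identifier to a list of (attribute, value) labels."""

    def __init__(self):
        self._profiles = {}

    def create(self, labels):
        """Mint a new identifier and store its initial labels."""
        entity_id = "urn:example:entity:" + uuid.uuid4().hex  # hypothetical scheme
        self._profiles[entity_id] = list(labels)
        return entity_id

    def add_label(self, entity_id, attribute, value):
        """Append a label; attribute names carry no predefined semantics."""
        self._profiles[entity_id].append((attribute, value))

    def find(self, attribute, value):
        """Return all identifiers whose profile contains the given label."""
        return [eid for eid, labels in self._profiles.items()
                if (attribute, value) in labels]

    def get_or_create(self, labels):
        """Reuse an existing identifier if any label matches, else mint one."""
        for attribute, value in labels:
            matches = self.find(attribute, value)
            if matches:
                return matches[0]
        return self.create(labels)

    def __len__(self):
        return len(self._profiles)

    # Deliberately no delete method: the repository only grows.


er = EntityRepository()
first = er.get_or_create([("name", "Bouquet"), ("works_for", "UniTN")])
again = er.get_or_create([("name", "Bouquet")])       # reuses the identifier
other = er.get_or_create([("name", "Niederee")])      # mints a new one
```

The `get_or_create` call is where the razor applies: a second request for an already known entity returns the stored identifier instead of multiplying identifiers.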
2. OKKAMization
Enabling the runtime or “ex post” OKKAMization of data in various formats (from unstructured to structured)
Examples:
◦ Office tools (named entity recognition and annotation)
◦ Databases (annotating records with OKKAM ids)
◦ Ontologies (replacing local URIs with global URIs)
◦ HTML pages
◦ …
Objective: creating the critical mass of OKKAMized content
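As one possible reading of the ontology case, "ex post" OKKAMization can be sketched as a rewrite of triples in which local URIs are swapped for global identifiers wherever a mapping is known. The function name `okkamize_triples`, the `local.example` URIs and the `urn:example:entity:` identifiers are all hypothetical placeholders, not OKKAM formats.

```python
def okkamize_triples(triples, local_to_global):
    """Replace local URIs with global identifiers where a mapping exists.

    Terms without a mapping (including predicates) are left unchanged.
    """
    def swap(term):
        return local_to_global.get(term, term)

    return [(swap(s), p, swap(o)) for s, p, o in triples]


# Hypothetical mapping, e.g. obtained by looking entities up in the ER.
local_to_global = {
    "http://local.example/people#bouquet": "urn:example:entity:1",
    "http://local.example/orgs#unitn": "urn:example:entity:2",
}

triples = [
    ("http://local.example/people#bouquet", "Works-for",
     "http://local.example/orgs#unitn"),
    ("http://local.example/people#bouquet", "Knows",
     "http://local.example/people#niederee"),
]

okkamized = okkamize_triples(triples, local_to_global)
```

Terms with no known global identifier stay local, so content can be OKKAMized incrementally as the repository grows.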
3. Indexing
The model of knowledge devolution:
◦ The ER stores only IDs + simple labels
◦ Knowledge about entities must be developed outside
Idea: use OKKAM to store and index pointers to external resources which mention an OKKAM id
Different types of pointers:
◦ Informal: pointing to a document which contains an OKKAM id as a simple annotation for a piece of text
◦ Formal: pointing to formal resources (e.g. ontologies) in which an OKKAM id is used as a URI of an instance
Using this index also for entity resolution / matching
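The primitive index described above could look roughly like the following sketch: a map from an identifier to typed pointers at external resources. The class `PointerIndex`, its methods, and the example identifiers and URLs are illustrative assumptions only.

```python
from collections import defaultdict


class PointerIndex:
    """Toy index: identifier -> list of (kind, resource_url) pointers."""

    KINDS = ("informal", "formal")

    def __init__(self):
        self._pointers = defaultdict(list)

    def add(self, entity_id, resource_url, kind):
        """Record that a resource mentions the identifier."""
        if kind not in self.KINDS:
            raise ValueError("kind must be 'informal' or 'formal'")
        self._pointers[entity_id].append((kind, resource_url))

    def lookup(self, entity_id, kind=None):
        """Return resource URLs for an identifier, optionally filtered by kind."""
        entries = self._pointers.get(entity_id, [])
        return [url for k, url in entries if kind is None or k == kind]


index = PointerIndex()
# Hypothetical identifier and resources.
index.add("urn:example:entity:1", "http://docs.example/report.html", "informal")
index.add("urn:example:entity:1", "http://onto.example/people.owl", "formal")

formal_only = index.lookup("urn:example:entity:1", kind="formal")
everything = index.lookup("urn:example:entity:1")
```

The index holds pointers only; the knowledge itself stays in the external resources, in line with the devolution model.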
Okkam Architecture
Okkam Applications
Three exemplary applications on top of the OKKAM infrastructure:
Entity-centric search engine
Entity-centric organizational knowledge management
Multimedia content authoring
Purpose:
◦ Show benefits of entity-centric approach
◦ Trigger the development of further applications
◦ Contribute to building a community around the OKKAM approach
Entity-centric search engine
Starting point: Different types of OKKAMized content collections, e.g. knowledge bases, document collections, metadata repositories, image collections, etc.
Goal:
◦ Enable completely new methods for browsing and searching large collections of data and documents (including the Web itself)
◦ Enable new forms of intelligent entity-centric search that exploit the OKKAMization of content
RTD Challenges:
◦ Retrieval indexing that takes into account the OKKAM IDs
◦ Combination of entity-centric and semantic search
◦ Combined ranking
◦ Adequate combination and visualization of the results from different kinds of resources (e.g. knowledge base + document collection)
◦ ...
Entity-centric organizational knowledge management
Idea:
Exploit OKKAM benefits in an organizational context:
Managing and structuring corporate knowledge using entity identifiers as pivots for aggregating information not only from structured sources, but also from poorly structured or unstructured sources, like electronic documents, email messages, slide presentations, video and audio files, etc.
Using and interlinking a local organizational entity repository
Multimedia Content Authoring
Idea: Creation of an authoring environment which makes use of the OKKAMization of content
Variants:
Authoring environment which helps the scientific author by providing targeted additional information during the writing process
Support for the creation of value-added artefacts on the basis of OKKAMized content (text, video)
◦ Creation template for (task-specific / selective) enrichment with information about the entity found in the content object (“semantic infusion”)
◦ Tool for publishers, broadcasters
Example: Semantic Infusion
VIKEF: Social Network Context of Papers
[Figure: annotated screenshot of the VIKEF application, captioned "Hover over highlighted entities to explore the context." Containers shown: Filter Container, Content Container, External Info Container, Meta Info Container, Overview Container, Legend Container. Topics: Social Networks, Internet, Biology, Scientific Community, Web Services. Topic shares shown: 25%, 10%, 15%, 5%, 45%. Other label: Reference Section Only. Legend entries: Person, Build Citation Body, Background Citation, Abstract, Author of this Publication, Author of selected citation.]
According to Agent (1), the mixture of topics of the currently viewed publication is displayed. Legend is turned off (i.e. it is shown in the Filter Component)
(14): The overview container can contain a configured SRN View, for instance the social network of authors and the co-authorship relation.
Agent (2) indicates that all persons in the content are highlighted in red. This also applies to persons in the Reference Section.
Agent (3): When hovering over a Citation Mark in the Build context, the corresponding citation body is highlighted in dark cyan.
Agent (4): All Citation Marks in the Background context are highlighted in dark green.
Agent (5): When hovering over a Citation Mark in the Background context, its corresponding citation body is highlighted in dark green.
Agent (6): Paragraphs that are about one of the topics selected in the Filter Container are highlighted in the topic’s color. This paragraph is about Social Networks, so it is highlighted in purple.
Agent (7): Sections of the content that are identified as the abstract are highlighted in yellow.
(17): The filter container allows the user to select the topics the user is interested in. This may be used by several infusion agents, e.g. (7). Besides this, it acts as a topic color legend for the meta info.
Agent (8), (8') and (9): would hide all content except the header, abstract, and reference section. In this example, it is disabled.
Agent (10): Clicking on a citation mark will make the view scroll down to the corresponding citation body.
Agent (11): Clicking on a person’s name will display that person’s DBLP website in the External Info container
Agent (12): Clicking on an organization’s name will display its website in the External Info container (not shown in this picture)
Agent (13): Clicking on a citation body will open the cited document in the application (and refresh all containers)
Agent (15) indicates that the persons in the SRN View who are authors of the currently viewed publication are highlighted in light orange. (This is a static highlight that will only change if a different publication is loaded.)
Agent (16): When the user hovers the mouse over a citation mark in the content, the authors of the cited document are highlighted in pink in the SRN view.
The legend explains the semantics of the highlight colors (as indicated in the authoring component). In principle, the user could deactivate some of the highlighting to avoid info glut.
RTD Challenges
Building a scalable entity repository in which a massive and growing number of entity IDs and profiles can be collected, stored and indexed
Guaranteeing security and privacy for the data stored in the repository
Making the repository efficiently searchable and usable by Web users as well as through APIs
Supporting effective and reliable methods for entity matching and for ranking results
Enabling several channels through which the repository can be populated, either manually or automatically (import filters, crawling, harvesting, …)
Supporting the integration of OKKAM with a variety of content creation applications (e.g. text editors, office applications, HTML and XML editors, ontology editors, DBMS, etc.)
Ensuring the quality of data in the repository
Enabling a virtuous circle of trust and collaboration with users
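To make the entity matching and ranking challenge concrete, here is a deliberately naive sketch that scores candidate profiles by the overlap (Jaccard similarity) between query labels and stored labels. The scoring rule, the function names and the example profiles are assumptions chosen for illustration; the slides do not prescribe any particular matching method, and a realistic matcher would need to cope with noisy and partial labels.

```python
def match_score(query_labels, profile_labels):
    """Jaccard overlap between two sets of (attribute, value) labels."""
    query, profile = set(query_labels), set(profile_labels)
    if not query or not profile:
        return 0.0
    return len(query & profile) / len(query | profile)


def rank_candidates(query_labels, profiles):
    """Return (score, identifier) pairs with score > 0, best match first."""
    scored = [(match_score(query_labels, labels), entity_id)
              for entity_id, labels in profiles.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    # Sort by descending score, then by identifier for a stable order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical profiles keyed by hypothetical identifiers.
profiles = {
    "urn:example:entity:1": [("name", "Bouquet"), ("works_for", "UniTN")],
    "urn:example:entity:2": [("name", "Niederee"), ("works_for", "L3S")],
    "urn:example:entity:3": [("name", "Bouquet")],
}

ranked = rank_candidates([("name", "Bouquet"), ("works_for", "UniTN")], profiles)
```

Exact label equality is the weakest point of this sketch: it is where quality and trust issues from the list above would surface first.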
Conclusions
There are many critical issues:
Size and Performance
Quality of entity search and matching
Critical mass of data and applications
Trust and community building
Sustainability and exploitation
…
… but it’s fun and I want to give it a try!!