OKKAM – Enabling the Web of Entities A SCALABLE AND SUSTAINABLE SOLUTION FOR SYSTEMATIC AND GLOBAL...

OKKAM – Enabling the Web of Entities A SCALABLE AND SUSTAINABLE SOLUTION FOR SYSTEMATIC AND GLOBAL IDENTIFIER REUSE IN DECENTRALIZED INFORMATION ENVIRONMENTS KnowDive Seminar April 11, 2007 Trento, Italy

OKKAM – Enabling the Web of Entities

A SCALABLE AND SUSTAINABLE SOLUTION

FOR SYSTEMATIC AND GLOBAL IDENTIFIER REUSE IN DECENTRALIZED

INFORMATION ENVIRONMENTS

KnowDive Seminar, April 11, 2007

Trento, Italy

Background: KR goes Global

Knowledge representation is a field which currently seems to have the reputation of being initially interesting, but which did not seem to shake the world to the extent that some of its proponents hoped.

It made sense but was of limited use on a small scale, but never made it to the large scale. This is exactly the state which the hypertext field was in before the Web.

Each field had made certain centralist assumptions -- if not in the philosophy, then in the implementations, which prevented them from spreading globally.

But each field was based on fundamentally sound ideas about the representation of knowledge.

The Semantic Web is what we will get if we perform the same globalization process to Knowledge Representation that the Web initially did to Hypertext. We remove the centralized concepts of absolute truth, total knowledge, and total provability, and see what we can do with limited knowledge.

[Tim Berners-Lee, What the Semantic Web can represent, 1998]

In practice …

[Figure: two layers. Web of Links: web pages (www.google.com, www.unitn.it, www.l3s.org, www.ryanair.com, www.paolobouquet.net, ockham.org, www.trento.it) connected by href links. Web of Meanings: entities (Bouquet, UniTN, Niederee, L3S, VIKEF) connected by relations (Works-for, Knows, Coordinates, Is_involved_in).]

What went wrong (personal view)

The Web of Meanings (the Semantic Web) is not happening, at least not in the way the WWW happened during the 1990s

Enabling factors for the Web of Links (the WWW):

◦ Any available resource has a global URL, which allows Web clients to address it

◦ The same identifier can be resolved to retrieve the resource through the HTTP protocol (running on top of TCP/IP)

◦ Creating href links is easy on top of this infrastructure

What about the Web of Meanings?

◦ Non-addressable resources do not have an infrastructure for supporting the use of global identifiers (more about this)

◦ Non-addressable resources cannot be retrieved

◦ Creating global links between non-addressable resources is difficult

Outcome: we lack the preconditions for the Web of Meanings to happen!

Further (strategic) errors

On top of these infrastructural issues, a big strategic error was made (personal opinion!):

◦ The AI people came in, and tried to “recycle” their logical know-how on the Semantic Web

◦ The plan was to build the Semantic Web starting from representations (theories, currently known as ontologies) and not from resources (entities)

◦ This led to a scalability issue: reasoning is hard for local theories, forget about going global! [Heard about semantic heterogeneity, ontology mapping, alignment, distributed reasoning, …?]

My vision

Back to the building blocks: entities!

◦ First, create the infrastructure for enabling in practice a global space of identifiers (e.g. URIs)

◦ Second, show how we can create value simply from linking globally identified entities

◦ Third, specify vocabularies and ontologies for (subsets of) globally identified entities

◦ Fourth, link ontologies to each other on top of the already integrated domain of globally identified entities

Hopefully, this will lead to the Web of Entities, namely a global digital space in which any knowledge expressed in any local web of entities can be seamlessly integrated and reasoned about

OKKAM overall goal

The goal of the OKKAM project is to implement the first part of this plan.

Establishing a scalable and sustainable infrastructure for the storage and reuse of global identifiers for non addressable entities in decentralized information environments

Enabling different forms of OKKAMization of old and new content

Creating a primitive index which links global identifiers to OKKAMized content

Building applications which can showcase the potential value of this approach

But why “OKKAM”?

Ockham's Razor (14th century):

“entities should not be multiplied beyond necessity”

OKKAM’s Razor (21st century):

“entity identifiers should not be multiplied beyond necessity”

1. Infrastructure

Cornerstone: large-scale Entity Repository (ER)

Architecture: distributed, supports federation of local ERs, replicated (no single point of failure)

ER vs. Entity Base (or Knowledge Base): supporting reuse vs. collecting and providing knowledge about entities

Basic schema: set of attribute/value pairs (called “labels”) with no predefined semantics

Features:

◦ Size: unbelievable (billions of identifiers + profiles stored)

◦ Network traffic: massive (up to millions of requests per minute)

◦ Quality: hard to ensure

◦ Update: grows monotonically (no deletion). Aging mechanism?
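A minimal sketch of the repository idea on this slide, assuming a single in-memory store rather than a distributed, federated one. The class names, the urn:example: identifier scheme, and the label-overlap matching rule are illustrative assumptions, not part of OKKAM. It shows profiles as free-form attribute/value labels, reuse of an existing identifier when enough labels match, and growth without deletion.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class EntityProfile:
    """An identifier plus free-form (attribute, value) labels, no fixed schema."""
    entity_id: str
    labels: list = field(default_factory=list)


class EntityRepository:
    """Toy in-memory ER. It is append-only: there is no delete operation."""

    def __init__(self):
        self._profiles = {}

    def find(self, labels):
        """Rank stored entities by how many of the given labels they share."""
        wanted = set(labels)
        hits = []
        for profile in self._profiles.values():
            overlap = len(wanted & set(profile.labels))
            if overlap:
                hits.append((profile.entity_id, overlap))
        return sorted(hits, key=lambda hit: hit[1], reverse=True)

    def reuse_or_create(self, labels, min_overlap=2):
        """Reuse the best-matching identifier, or mint a new one if none fits."""
        hits = self.find(labels)
        if hits and hits[0][1] >= min_overlap:
            return hits[0][0]
        # Hypothetical identifier scheme, chosen only for this sketch.
        entity_id = f"urn:example:entity:{uuid.uuid4()}"
        self._profiles[entity_id] = EntityProfile(entity_id, list(labels))
        return entity_id

    def add_labels(self, entity_id, labels):
        """Profiles only grow: new labels are appended, nothing is removed."""
        profile = self._profiles[entity_id]
        for label in labels:
            if label not in profile.labels:
                profile.labels.append(label)

    def __len__(self):
        return len(self._profiles)


repo = EntityRepository()
first = repo.reuse_or_create([("name", "Bouquet"), ("affiliation", "UniTN")])
again = repo.reuse_or_create(
    [("name", "Bouquet"), ("affiliation", "UniTN"), ("city", "Trento")]
)
other = repo.reuse_or_create([("name", "Niederee"), ("affiliation", "L3S")])
```

Here the second request reuses the identifier minted by the first, while the third mints a new one. The overlap threshold is deliberately crude; real entity matching is one of the open challenges listed later in the talk.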

2. OKKAMization

Enabling the runtime or “ex post” OKKAMization of data in various formats (from unstructured to structured)

Examples:

◦ Office tools (named entity recognition and annotation)

◦ Databases (annotating records with OKKAM ids)

◦ Ontologies (replacing local URIs with global URIs)

◦ HTML pages

◦ …

Objective: creating the critical mass of OKKAMized content
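Two of the examples above can be sketched in a few lines, assuming that a mapping from local names to global identifiers is already available (in practice it would come from the Entity Repository). The function names, the okkam_id field, and the urn:example: identifiers are hypothetical and used only for illustration.

```python
def okkamize_triples(triples, local_to_global):
    """Replace local URIs with global ones wherever a mapping is known.

    Subjects and objects are rewritten; predicates are left untouched.
    """
    def swap(term):
        return local_to_global.get(term, term)

    return [(swap(s), p, swap(o)) for s, p, o in triples]


def okkamize_record(record, key_field, name_to_id):
    """Return a copy of a database-style record annotated with an identifier.

    The record is returned unchanged (as a copy) if no identifier is known.
    """
    annotated = dict(record)
    entity_id = name_to_id.get(record.get(key_field))
    if entity_id is not None:
        annotated["okkam_id"] = entity_id  # hypothetical field name
    return annotated


mapping = {
    "local:bouquet": "urn:example:entity:1",
    "local:unitn": "urn:example:entity:2",
}
triples = [
    ("local:bouquet", "works-for", "local:unitn"),
    ("local:bouquet", "knows", "local:niederee"),
]
global_triples = okkamize_triples(triples, mapping)
row = okkamize_record(
    {"name": "Bouquet", "org": "UniTN"}, "name", {"Bouquet": "urn:example:entity:1"}
)
```

Terms without a known global identifier (local:niederee here) are left as they are, so content can be OKKAMized incrementally.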

3. Indexing

The model of knowledge devolution:

◦ The ER stores only IDs + simple labels

◦ Knowledge about entities must be developed outside

Idea: use OKKAM to store and index pointers to external resources which mention an OKKAM id

Different types of pointers:

◦ Informal: pointing to a document which contains an OKKAM id as a simple annotation for a piece of text

◦ Formal: pointing to formal resources (e.g. ontologies) in which an OKKAM id is used as a URI of an instance

Using this index also for entity resolution / matching
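The pointer index described on this slide could look roughly as follows. This is a sketch under stated assumptions: the Pointer and PointerIndex names, the example.org URLs, and the co-mention count used as a matching signal are all illustrative and not taken from the OKKAM design.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    """A link from an identifier to an external resource that mentions it."""
    entity_id: str
    resource_url: str
    kind: str  # "informal" (annotated document) or "formal" (e.g. ontology)


class PointerIndex:
    """Toy index: the knowledge itself stays in the external resources."""

    def __init__(self):
        self._by_entity = defaultdict(set)

    def add(self, pointer):
        if pointer.kind not in ("informal", "formal"):
            raise ValueError(f"unknown pointer kind: {pointer.kind!r}")
        self._by_entity[pointer.entity_id].add(pointer)

    def resources_for(self, entity_id, kind=None):
        """Sorted URLs of resources mentioning the entity, optionally by kind."""
        pointers = self._by_entity.get(entity_id, set())
        return sorted(
            p.resource_url for p in pointers if kind is None or p.kind == kind
        )

    def co_mentioned(self, entity_id):
        """Count shared resources per other entity (a crude matching signal)."""
        urls = set(self.resources_for(entity_id))
        counts = {}
        for other, pointers in self._by_entity.items():
            if other == entity_id:
                continue
            shared = urls & {p.resource_url for p in pointers}
            if shared:
                counts[other] = len(shared)
        return counts


index = PointerIndex()
index.add(Pointer("e1", "http://example.org/doc1", "informal"))
index.add(Pointer("e1", "http://example.org/onto.owl", "formal"))
index.add(Pointer("e2", "http://example.org/doc1", "informal"))
```

Entities that are mentioned in the same resources are candidates for being related, or for being duplicates of each other, which is one way the index might support entity resolution.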

Okkam Architecture

Okkam Applications

Three exemplary applications on top of the OKKAM infrastructure:

◦ Entity-centric search engine

◦ Entity-centric organizational knowledge management

◦ Multimedia content authoring

Purpose:

◦ Show benefits of entity-centric approach

◦ Trigger the development of further applications

◦ Contribute to building a community around the OKKAM approach

Entity-centric search engine

Starting point: Different types of OKKAMized content collections, e.g. knowledge bases, document collections, metadata repositories, image collections, etc.

Goal:

◦ Enable completely new methods for browsing and searching large collections of data and documents (including the Web itself)

◦ Enable new forms of intelligent entity-centric search that exploit the OKKAMization of content

RTD Challenges

◦ Retrieval indexing that takes into account the OKKAM IDs

◦ Combination of entity-centric and semantic search

◦ Combined ranking

◦ Adequate combination and visualization of the results from different kinds of resources (e.g. knowledge base + document collection)

◦ ...

Entity-centric organizational knowledge management

Idea:

Exploit OKKAM benefits in an organizational context:

Managing and structuring corporate knowledge using entity identifiers as pivots for aggregating information not only from structured sources, but also from poorly or non-structured sources, like electronic documents, email messages, slide presentations, video and audio files, etc.

Using and interlinking a local organizational entity repository

Multimedia Content Authoring

Idea: Creation of an authoring environment which makes use of the OKKAMization of content

Variants:

Authoring environment which helps the scientific author by providing targeted additional information during the writing process

Support for the creation of value-added artefacts on the basis of OKKAMized content (text, video)

◦ Creation template for task-specific (selective) enrichment with information about the entity found in the content object (“semantic infusion”)

◦ Tool for publishers, broadcasters

Example: Semantic Infusion

VIKEF: Social Network Context of Papers

[Figure: annotated screenshot of the VIKEF interface, with the hint “Hover over highlighted entities to explore the context.” Containers: Filter Container (topics Social Networks, Internet, Biology, Scientific Community, Web Services, with percentage shares 25%, 10%, 15%, 5%, 45%), Content Container (option “Reference Section Only”), External Info Container, Meta Info Container, Overview Container, Legend Container. Legend entries: Person, Build Citation Body, Background Citation, Abstract, Author of this Publication, Author of selected citation.]

According to Agent (1), the mixture of topics of the currently viewed publication is displayed. Legend is turned off (i.e. it is shown in the Filter Component)

(14): The overview container can contain a configured SRN View, for instance the social network of authors and the co-authorship relation.

Agent (2) indicates that all persons in the content are highlighted in red. This also applies to persons in the Reference Section.

Agent (3): When hovering over a Citation Mark in the Build context, the corresponding citation body is highlighted in dark cyan.

Agent (4): All Citation Marks in the Background context are highlighted in dark green.

Agent (5): When hovering over a Citation Mark in the Background context, its corresponding citation body is highlighted in dark green.

Agent (6): Paragraphs that are about one of the topics selected in the Filter Container get highlighted in the topic’s color. This paragraph is about Social Networks, so it is highlighted in purple.

Agent (7): Sections of the content that are identified as the abstract are highlighted in yellow.

(17): The filter container allows the user to select the topics the user is interested in. This may be used by several infusion agents, e.g. (7). Besides this, it acts as a topic color legend for the meta info.

Agent (8), (8') and (9): would hide all content, besides the header, abstract, and the reference section. In this example, it is disabled.

Agent (10): Clicking on a citation mark will make the view scroll down to the corresponding citation body.

Agent (11): Clicking on a person’s name will display his or her DBLP website in the External Info container.

Agent (12): Clicking on an organization’s name will display its website in the External Info container (not shown in this picture).

Agent (13): Clicking on a citation body will open the cited document in the application (and refresh all containers).

Agent (15) indicates that the persons in the SRN View who are authors of the currently viewed publication are highlighted in light orange. (This is a static highlight that will only change if a different publication is loaded.)

Agent (16): When the user hovers the mouse over a citation mark in the content, the authors of the cited document are highlighted in pink in the SRN view.

The legend explains the semantics of the highlight colors (as indicated in the authoring component). In principle, the user could deactivate some of the highlighting to avoid info glut.

RTD Challenges

Building a scalable entity repository in which a massive and growing number of entity IDs and profiles can be collected, stored and indexed

Guaranteeing security and privacy for the data stored in the repository

Making the repository efficiently searchable and usable by Web users as well as through APIs

Supporting effective and reliable methods for entity matching and for ranking results

Enabling several channels through which the repository can be populated, either manually or automatically (import filters, crawling, harvesting, …)

Supporting the integration of OKKAM with a variety of content creation applications (e.g. text editors, office applications, HTML and XML editors, ontology editors, DBMS, etc.)

Ensuring the quality of data in the repository

Enabling a virtuous circle of trust and collaboration with users

Conclusions

There are many critical issues:

◦ Size and performance

◦ Quality of entity search and matching

◦ Critical mass of data and applications

◦ Trust and community building

◦ Sustainability and exploitation

◦ …

… but it’s fun and I want to give it a try!!