Ontotext @ JRC

5-6 Oct 2005

Semantic Web

• The Semantic Web is the abstract representation of data on the WWW, based on RDF and other standards

• SW is being developed by the W3C, in collaboration with a large number of researchers and industrial partners

http://www.w3.org/2001/sw/ http://www.SemanticWeb.org

Semantic Web (II)

• "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." [Berners-Lee et al. 2001]

The spirit:

• Automatically processable metadata regarding:

– the structure (syntax) and

– the meaning (semantics)

of the content.

• Presented in a standard form;

• Dynamic interpretation for unforeseen purposes

Semantic Web: Languages

• RDF(S) – the next slides

• SHOE, XOL, etc. – the pioneers

• Topic Maps – a metadata language with limited impact

• OIL – Ontology Interchange Language, the basis of the next two, http://www.ontoknowledge.org/oil/

– Description Logics-based multilayered language

• DAML+OIL – the predecessor of OWL, not to be developed further

• OWL – the W3C standard for Semantic Web ontology language, http://www.w3.org/2001/sw/WebOnt/

– Extends RDF(S), but also constrains it

– Has multiple layers (Lite, DL, Full)

– Transitive/symmetric/etc. properties, disjointness, cardinality restrictions

Semantic Web: Problems

• Critical mass of metadata is necessary

• Still lack of consensus on many issues (like query languages)

• Lack of practices at the proper scale and complexity

• Lack of robust Semantic (in our days RDFS) repositories:

– Should be as flexible, multi-purpose and easy to use as HTTP servers and

– As efficient in structured knowledge management as RDBMS

What are Sirma & Ontotext?

• Established in 1992 as a Bulgarian AI Lab.

• Current structure:

– Sirma Group International Corp, Montreal, Canada;

– 8 subsidiary companies; the most important ones follow below.

• Sirma AI, Sofia

– The R&D backbone of the group with two divisions:

– Sirma Solutions: e-Business, banking, C3, e-Publishing, consultancy;

– Ontotext Lab: Knowledge and Language Engineering.

• EngView Systems, Montreal

– CAD/CAM systems and applications.

• WorkLogic.Com, Ottawa

– Web-based collaboration, workflow, e-Gov.

Software Development and Research since 1992

• Track record of success – large companies and government organizations in US, Canada, Western Europe and Bulgaria;

• Top-3 Software Company in Bulgaria;

• About 70 developers;

• ISO 2001 Certificate;

• 1999 EIST prize winner;

Sirma Businesses and Domains

Diverse business, ranging from COTS products to custom projects, consultancy, and outsourcing services.

Major areas:

• AI – expert systems (beside Ontotext);

• b2b market places

• CAD/CAM (for packaging, quality control)

• e-Government, CSCW, Groupware, Workflow;

• Banking

• C3/C4 Systems (military, airport traffic);

• VOIP billing systems;

• e-Publishing, Proofing tools.

Ontotext Lab

An R&D lab of Sirma for Knowledge and Language Engineering

Research and core technology development for knowledge discovery, management, and engineering.

Specialized for applications in Semantic Web, Knowledge Management, and Web Services.

Aside from the scientific matters, most of us are just professional software developers.

Leading Semantic Web Technology Provider

Ontotext is a leading Semantic Web technology provider, being:

• the developer of the KIM Semantic Annotation Platform;

• a co-developer of the GATE language engineering platform;

• a co-developer of the Sesame semantic repository and the OWLIM high-performance OWL reasoner;

• the developer of the WSMO4J semantic web services API;

• a partner in the SWAN Semantic Web Annotator project.

Ontotext is part of most of the major European research projects in the field; the most successful Bulgarian participant in FP6.

Mission

• A critical mass of research in a number of AI areas made efficient KM almost possible.

• The technology on the market is mostly of two sorts:

– Expensive black boxes

– Academic prototypes

Our mission is:

• To develop and popularize open, skillfully engineered tools...

• For Information Extraction and Knowledge Management,

• Which considerably reduce the cost for implementation and use of KM applications.

Major Research Areas

We focus on building cutting-edge expertise and technology in the following areas:

• ontology design, management, and alignment;

• knowledge representation, reasoning;

• information extraction (IE), applications in IR;

• semantic web services;

• upper-level ontologies and lexical semantics;

• NLP: POS, gazetteers, co-reference resolution, named entity recognition (NER)

• machine learning (HMM, NN, etc.)

Academic & Technology Partners

• NLP Group, Sheffield University, UK;

• Digital Enterprise Research Institute (DERI), Institut für Informatik, Innsbruck, Austria, and National University of Ireland, Galway;

• Aduna (Aidministrator) b.v., The Netherlands;

• Linguistic Modelling Lab, CLPOI, Bulgarian Academy of Sciences;

• British Telecommunications Plc (BT), UK;

• Forschungszentrum Informatik (FZI) and Institut AIFB, Karlsruhe, Germany.

Customers

• SemanticEdge GmbH, Berlin, Germany;

• QinetiQ Ltd, UK;

• Fairway Consultants, UK;

Research Projects

We were/are part of a number of FP5 research projects:

• On-To-Knowledge - the project which invented OIL. Ontology Middleware Module and a DAML+OIL reasoner.

• VISION - Towards Next Generation Knowledge Management.

• OntoWeb - Ontology-based information exchange for knowledge management ….

• SWWS - Semantic Web enabled Web Services.

Research Projects (II)

FP6 integrated projects that started Jan 2004, durations ~3 years:

• SEKT: Semantic Knowledge Technologies. Targeting a synergy of Ontology and Metadata Technology, Knowledge Discovery and Human Language Technology.

• DIP: Data, Information, and Process Integration with Semantic Web Services.

• PrestoSpace: Preservation towards storage and access. Standardized Practices for Audiovisual Contents in Europe.

• Infrawebs: Intelligent Framework for Generating Open (Adaptable) Development Platforms for Web-Service Enabled Applications Using Semantic Web Technologies, Distributed Decision Support Units and Multi-Agent-Systems

Introduction to Ontologies

Despite the formal definitions, ontologies are:

• Conceptual models or schemata

– Represented in a formalism which allows

– Unambiguous “semantic” interpretation

– Inference

• Can be considered a combination of:

– DB schema

– XML Schema

– OO-diagram (e.g. UML)

– Subject hierarchy/taxonomy (think of Yahoo)

– Business logic rules

Introduction to Ontologies (II)

• Imagine a DB storing “John is a son of Mary”.

• It will be able to "answer" just:

– Which are the sons of Mary? Whose son is John?

• An ontology with a definition of the family relationships could infer:

– John is a child of Mary (more general);

– Mary is a woman;

– Mary is the mother of John (inverse);

– Mary is a relative of John (generalized inverse).

• The above facts would remain "invisible" to a typical DB, whose model of the world is limited to data structures of strings and numbers.
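
The family example can be sketched in a few lines of Python. This is only an illustration, not how any Ontotext tool works: the property names (sonOf, childOf, parentOf, motherOf, relativeOf), the class Woman, and the rules themselves are assumptions made for the sketch. The slide does not say how "Mary is a woman" is derived, so the sketch simply states it as an explicit fact.

```python
# Illustrative only: the property names, the class "Woman", and the explicit
# fact that Mary is a woman are assumptions made for this sketch.
SUB_PROPERTY = {"sonOf": "childOf", "childOf": "relativeOf"}  # more general relation
INVERSE = {"childOf": "parentOf"}                              # inverse relation
SYMMETRIC = {"relativeOf"}                                     # holds in both directions


def infer(facts):
    """Forward-chain the rules above until no new fact can be derived."""
    closure = set(facts)
    while True:
        new = set()
        for s, p, o in closure:
            if p in SUB_PROPERTY:
                new.add((s, SUB_PROPERTY[p], o))
            if p in INVERSE:
                new.add((o, INVERSE[p], s))
            if p in SYMMETRIC:
                new.add((o, p, s))
            # A parent who is a woman is a mother.
            if p == "parentOf" and (s, "type", "Woman") in closure:
                new.add((s, "motherOf", o))
        if new <= closure:
            return closure
        closure |= new


facts = {("John", "sonOf", "Mary"), ("Mary", "type", "Woman")}
inferred = infer(facts) - facts  # only the facts a plain DB would not "see"
```

Here `inferred` holds the derived statements, such as John being a child of Mary and Mary being the mother and a relative of John.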

Products

• The Ontology Middleware Module (OMM) is an enterprise back-end for formal KR and KM applications based on Semantic Web standards

• An extension of the Sesame RDF(S) repository that adds a Knowledge Control System.

• OMM integration options: Built-In, RMI, SOAP, HTTP.

[Diagram: Knowledge Control System – Meta-Information, Access Control, Tracking Changes, Current User Info., Change Investigation]

Products

• BOR – a DAML+OIL reasoner.

• Proprietary GATE components:

– Hash Gazetteer. A high-performance lookup tool.

– Hidden Markov Model Learner. A stochastic module for filtering annotations, disambiguation, etc., based on confidence measures.

• The News Collector is a web service, collecting and indexing articles from the top-10 global news wires:

– About 1000 articles/day, annotated and indexed using KIM;

– Used to validate the heuristics and resources of KIM;

Products (II)

• The KIM Platform (the next slides), http://www.ontotext.com/kim.

• SWWS Studio (http://swws.ontotext.com)

– Semantic Web Service description development environment

– Developed in the course of the SWWS project

– Based on WSMO (http://www.wsmo.org)

• WSMO4J (http://wsmo4j.sourceforge.net)

– A WSMO API and a reference implementation for building Semantic Web Services applications

– Used in WSMO Studio (http://www.wsmostudio.org/)

– The basis for ORDI, used in OMWG (http://www.omwg.org)

– Used in projects DIP, SEKT, Infrawebs

OWLIM

• OWLIM is a high-performance OWL repository

• Storage and Inference Layer (SAIL) for Sesame RDF database

• OWLIM performs OWL DLP reasoning

• It uses the IRRE (Inductive Rule Reasoning Engine) for forward-chaining and “total materialization”

• In-memory reasoning and query evaluation

• OWLIM provides reliable persistence, based on RDF N-Triples

• OWLIM can manage millions of statements on desktop hardware

• Extremely fast upload and query evaluation even for huge ontologies and knowledge bases

Scalability: Upload and Reasoning

[Chart: upload speed (statements/sec, axis 0 to 20000) against size of repository (0.5 to 10 million explicit statements). Series: 2Xeon1GB, 2Opt3GB, 2Opt5GB, PM512MB, 1/log]

Scalability: Query Answering

[Chart: evaluation time of Q2 (msec, axis 0 to 400) against size of repository (0.5 to 10 million explicit statements). Series: 2Xeon1GB, 2Opt3GB, 2Opt5GB, PM512MB]

• Q2: Pattern of 12 statement-joins and LIKE literal constraint

OWLIM under LUBM Benchmark

• The Lehigh Univ. evaluation is one of the most comprehensive benchmark experiments published recently (ISWC 2004, WSJ 2005)

• Synthetically generated OWL knowledge bases

• The biggest set generated is LUBM(50,0) – 6M explicit statements

• 14 queries, checking different inferences

• OWLIM on LUBM:

– On a desktop machine OWLIM loads LUBM(50,0) in 10 min

– The only other system known to load it takes 12 hours

– All the queries are answered correctly

• Based on this we can claim that:

– OWLIM is the fastest OWL repository in the world!

JOCI

• “Jobs & Contacts Intelligence”, Innovantage, Fairway Consultants

• Gathering recruitment-related information from web-sites of UK organizations

• Offering services on top of this data to recruitment agencies, job portals, and others.

• JOCI uses KIM for information extraction (IE, text-mining)

• JOCI makes use of a domain ontology to:

– support the IE process,

– structure the knowledge base with the obtained results, and

– facilitate semantic queries.

• Sirma is a shareholder in Fairway Consultants

JOCI Dataflow

[Diagram: UK Web Space, Focused Crawler, Crawler Classifier, Information Extraction, Single-Document IE, Object Consolidation, KIM Server, Semantic Repository, Document Store, Web UI]

JOCI: Vacancy Consolidation/Matching

[Diagram labels: Vacancy 1, Vacancy 2, Consolidated Vacancy; job titles “IT Applications Support Analyst” and “Support Analyst” (sub-string); classes Location, City, Country; instances U.K., Scotland, Glasgow; relations subRegionOf, hasJobTitle, locatedIn, type, subClassOf]

JOCI Statistics

• The figures below are indicative and reflect an old state of the JOCI system:

– The actual figures are to be announced after the launch of JOCI

• Web-sites inspected: 0.5M

• Web-sites with vacancy announcements: 30K

• Extracted vacancies: 100K

The KIM Platform

• A platform offering services and infrastructure for:

– (semi-) automatic semantic annotation and

– ontology population

– semantic indexing and retrieval of content

– query and navigation over the formal knowledge

• Based on Information Extraction technology

KIM What’s Inside?

The KIM Platform includes:

• Ontologies (PROTON + KIMSO + KIMLO) and KIM World KB

• KIM Server – with a set of APIs for remote access and integration

• Front-ends: Web-UI and plug-in for Internet Explorer.

The AIM of KIM

• Aim: to arm Semantic Web applications

- by providing a metadata generation technology

- in a standard, consistent, and scalable framework

What Does KIM Do? Semantic Annotation

Simple Usage: Highlight, Hyperlink, and…

Simple Usage: … Explore and Navigate

Simple Usage: … Enjoy a Hyperbolic Tree View

KIM is Based On…

KIM is based on the following open-source platforms:

• GATE – the most popular NLP and IE platform in the world, developed at the University of Sheffield. Ontotext is its biggest co-developer. www.gate.ac.uk and www.ontotext.com/gate

• OWLIM – OWL repository, compliant with Sesame RDF database from Aduna B.V. www.ontotext.com/owlim

• Lucene – an open-source IR engine by Apache. jakarta.apache.org/lucene/

How KIM Searches Better

KIM can match a query like:

Documents about a telecom company in Europe, John Smith, and a date in the first half of 2002.

With a document containing:

“At its meeting on the 10th of May, the board of Vodafone appointed John G. Smith as CTO”

The classical IR could not match:

– Vodafone with a “telecom in Europe”, because:

• Vodafone is a mobile operator, which is a sort of a telecom;

• Vodafone is in the UK, which is a part of Europe.

– 10th of May with a “date in first half of 2002”;

– “John G. Smith” with “John Smith”.
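
The Vodafone match can be sketched as two hierarchy walks plus a loose name comparison. This is a toy illustration only, not KIM's implementation: the class names, the region hierarchy, the `ENTITIES` table, and the token-subset name test are all assumptions made for the sketch.

```python
# Hypothetical fragments of a class hierarchy and a region hierarchy.
SUBCLASS_OF = {"MobileOperator": "Telecom", "Telecom": "Company"}
SUBREGION_OF = {"UK": "Europe"}
ENTITIES = {"Vodafone": {"type": "MobileOperator", "locatedIn": "UK"}}


def ancestors(item, parent_of):
    """Yield the item and everything above it in the given hierarchy."""
    while item is not None:
        yield item
        item = parent_of.get(item)


def matches(entity, wanted_class, wanted_region):
    """True if the entity belongs to the class and lies within the region."""
    info = ENTITIES[entity]
    return (wanted_class in ancestors(info["type"], SUBCLASS_OF)
            and wanted_region in ancestors(info["locatedIn"], SUBREGION_OF))


def same_person(mention, query):
    """Crude alias test: every token of the query appears in the mention."""
    return set(query.split()) <= set(mention.split())
```

With these tables, `matches("Vodafone", "Telecom", "Europe")` holds because MobileOperator is a kind of Telecom and the UK is part of Europe, while a keyword index would see no overlap between the query and the sentence.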

Entity Pattern Search

Pattern Search: Entity Results

Entity Pattern Search: KIM Explorer

Semantic Metadata in KIM…

• Provides a specific metadata schema,

– focusing on named entities (particulars),

– as well as number and time-expressions, addresses, etc.,

– everything “specific”, apart from the general concepts.

• Defines specific tasks for generation and usage of the metadata which are well-understood and measurable.

• Why not metadata about general things (universals)?

– It is too complex…

– but we leave the door open.

• The particulars seem to provide a good 80/20 compromise.

World Knowledge in KIM

Rationale:

• Provide common knowledge about world entities;

• KIM bets on scale and avoids heavy semantics: minimum modeling of common-sense, almost no axioms;

• The ontology is encoded in OWL Lite and RDF.

• In addition, a number of rules (generative axioms) are defined, e.g.:

<X,locatedIn,Y> and <Y,subRegionOf,Z> => <X,locatedIn,Z>

• Axioms of this sort are supported by OWLIM and they provide a consistent mechanism for “custom” extensions to the OWL or RDF(S) semantics with respect to a particular ontology
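
The effect of the locatedIn rule under forward chaining can be sketched as follows. This is a minimal illustration over plain Python tuples, not OWLIM's rule engine; the function name and the sample data are assumptions.

```python
def materialize(triples):
    """Apply <X,locatedIn,Y> and <Y,subRegionOf,Z> => <X,locatedIn,Z> to a fixpoint."""
    closure = set(triples)
    while True:
        derived = {
            (x, "locatedIn", z)
            for (x, p1, y) in closure if p1 == "locatedIn"
            for (y2, p2, z) in closure if p2 == "subRegionOf" and y2 == y
        }
        if derived <= closure:  # nothing new: the closure is complete
            return closure
        closure |= derived


# Hypothetical sample data.
kb = {
    ("Vacancy1", "locatedIn", "Glasgow"),
    ("Glasgow", "subRegionOf", "Scotland"),
    ("Scotland", "subRegionOf", "U.K."),
}
closure = materialize(kb)
```

After materialization the closure also states that Vacancy1 is located in Scotland and in the U.K., so a query for vacancies in the U.K. needs no reasoning at query time.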

PROTON

• Name. PROTON is an acronym for Proto Ontology

– ex-names: BULO (basic upper-level ontology), GO (generic ontology);

– not a Russian space rocket

– “proto” – used in the sense of “primary”, “beginning”, “giving rise to”, vs. “first in time” or “oldest”;

– connotations: positive, fundamental, elemental, “in favour of”, even romantic (like a science-fiction novel from the 60-ies)

• Intended usage. A Basic Upper-Level Ontology like PROTON is used for:

– ontology population

– knowledge modelling and integration strategy of a KM environment;

– generation of domain, application, and other ontologies.

PROTON Design

• Design principles:

1. domain-independence;

2. light-weight logical definitions;

3. Compliance with popular metadata standards;

4. good coverage of concrete and/or named entities (i.e. people, organizations, numbers);

5. no specific support for general concepts (such as “apple”, “love”, “walk”), however the design allows for such extensions

Some Figures…

• PROTON defines about 250 classes and 100 properties

• Providing coverage of most of the upper-level concepts necessary for semantic annotation, indexing, and retrieval

• A modular architecture, allowing for great flexibility of usage and extension:

– SYSTEM module - contains a few meta-level primitives (6 classes and 7 properties); introduces the notion of 'entity', which can have aliases;

– TOP module - the highest, most general, conceptual level, consisting of about 20 classes;

– UPPER module - over 200 general classes of entities, which often appear in multiple domains.

PROTON Ontology Language

• The current version of the ontology is encoded in OWL Lite.

• A few custom entailment rules (axioms) are also defined for usage in tools that support them, for instance:

Premise:

<xxx, protont:roleHolder, yyy>

<xxx, protont:roleIn, zzz>

<yyy, rdf:type, protont:Agent>

Consequent:

<yyy, protont:involvedIn, zzz>

• Axioms of this sort are interpreted by OWLIM

• PROTON is portable to any OWL(Lite)-compliant tool.

• PROTON can also be used without such axioms.
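
To make the rule concrete, here is a small sketch that applies it to triples held as Python tuples. It is not how OWLIM evaluates rules; the function name and the `ex:` identifiers are hypothetical.

```python
def involved_in(triples):
    """Return the <agent, protont:involvedIn, situation> triples entailed by the rule."""
    triples = set(triples)
    agents = {s for s, p, o in triples
              if p == "rdf:type" and o == "protont:Agent"}
    return {
        (holder, "protont:involvedIn", situation)
        for role, p1, holder in triples
        if p1 == "protont:roleHolder" and holder in agents
        for role2, p2, situation in triples
        if p2 == "protont:roleIn" and role2 == role
    }


# Hypothetical instance data: ex:john holds a role in ex:vodafone.
kb = {
    ("ex:ctoRole", "protont:roleHolder", "ex:john"),
    ("ex:ctoRole", "protont:roleIn", "ex:vodafone"),
    ("ex:john", "rdf:type", "protont:Agent"),
    ("ex:otherRole", "protont:roleHolder", "ex:thing"),  # not typed as an Agent
    ("ex:otherRole", "protont:roleIn", "ex:vodafone"),
}
entailed = involved_in(kb)
```

Only ex:john satisfies all three premises, so the single entailed triple links ex:john to ex:vodafone through protont:involvedIn.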

Other Standards: Relations

• ADL Feature Type Thesaurus and GNS

– the backbone of the Location branch;

– in its turn aligned with the geographic feature designators of the GNS database of NIMA;

– PROTON is more coarse-grained, taking about 80 out of 300 types.

• Dublin Core

– the basic element set available as properties of protont:InformationResource and protont:Document classes;

– the resource type vocabulary is mapped to sub-classes of InformationResource.

• OpenCyc and WordNet – consulted and referred to in glosses.

• ACE (Automatic Content Extraction) annotation types – covered.

• FOAF – assure easy mapping (e.g. the Account class was added).

• DOLCE, EuroWordnet Top, and others – consulted to various extent.

Other Standards: Compliance

• Other models are not directly imported (for consistency reasons)

• The mapping of the appropriate primitives is easy, on the basis of

– a compliant design, and

– formal notes in the PROTON glosses, which indicate the appropriate mappings.

• For instance, in PROTON, a protont:inLanguage property is defined

– as an equivalent of the dc:language element in Dublin Core

– with a domain protont:InformationResource

– and a range protont:Language

KIM World KB

A quasi-exhaustive coverage of the most popular entities in the world …

• What a person is expected to have heard about that is beyond the horizons of his country, profession, and hobbies.

• Entities of general importance … like the ones that appear in the news …

KIM “knows”:

• Locations: mountains, cities, roads, etc.

• Organizations, all important sorts of: business, international, political, government, sport, academic…

• Specific people, etc.

KIM World KB: Entity Description

• The NE-s are represented with their Semantic Descriptions via:

– Aliases (Florida & FL);

– Relations with other entities (Person hasPosition Position);

– Attributes (latitude & longitude of geographic entities);

– their proper Class

The Scale of KIM World KB

RDF Statements         Small KB      Full KB
- explicit              444,086    2,248,576
- after inference     1,014,409    5,200,017

Instances
- Entity                 40,804      205,287
- Location               12,528       35,590
- Country                   261          261
- Province                4,262        4,262
- City                    4,400        4,417
- Organization            8,339      146,969
- Company                 7,848      146,262
- Person                  6,022        6,354
- Alias                  64,589      429,035

KIM IE Pipeline

JAPE Grammars

• JAPE grammars are based on the last MUSE version

• Class/instance information included

• Better class granularity in grammars

• Relation recognition grammars - LocatedIn and HasPositionWithinOrganization

Disambiguation & Filtering

• Simple disambiguation (longest match), e.g. San Francisco Journal

• Based on the main alias, e.g. “Beijing”

• By priority of the class, instance or relative class priority

– E.g. Brand “Microsoft” vs. Company “Microsoft Corp.”

– We assign a priority (1-1000) to each class and instance

– For pairs of classes we define relative priority

– If the difference between the priorities is greater than a certain threshold, the possible reference to the entity with the lower priority is ignored

• Still to be improved
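
The priority-threshold filter can be sketched as below. This is a simplified reading of the slide, not KIM's code: it compares every candidate with the highest-priority one, it leaves out the relative priorities defined for pairs of classes, and the priority values and thresholds are made up for the example.

```python
def filter_by_priority(candidates, threshold):
    """Drop candidates whose priority trails the best one by more than threshold.

    candidates: dict mapping a candidate entity label to its priority (1-1000).
    """
    best = max(candidates.values())
    return {label: priority for label, priority in candidates.items()
            if best - priority <= threshold}


# Hypothetical priorities for two readings of the string "Microsoft".
candidates = {"Company: Microsoft Corp.": 800, "Brand: Microsoft": 300}
kept_strict = filter_by_priority(candidates, threshold=200)
kept_loose = filter_by_priority(candidates, threshold=600)
```

With the stricter threshold only the company reading survives; with the looser one both readings are kept for later stages to resolve.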

KIM Scaling on Data

• The Semantic Repository is based on OWLIM

• In our practical tests we observe perfect performance on top of:

– 1.2M entity descriptions;

– about 15M explicit statements;

– above 30M statements after forward chaining.

• Document and Annotation storage and indexing with Lucene:

– One million docs, processed on a $1000-worth machine;

– retrieval in milliseconds.

Entity Ranking: a sketch for Jan-May 2004

No   Instance                          Label                         Rank
1    Country_T.5                       United States                 0.032
4    Country_T.IZ                      Republic of Iraq              0.011
6    Person_T.51                       George W. Bush                0.010
9    Country_T.IS                      State of Israel               0.006
11   DayOfWeek_T.4                     Tuesday                       0.005
12   NewsAgency_T.6                    The Associated Press          0.005
14   InternationalOrganization_T.13    United Nations                0.005
27   Country_T.CH                      People's Republic of China    0.004
32   City_T.3068                       New York                      0.004
36   InternationalOrganization_T.18    European Union                0.004
40   Person_T.115                      Ariel Sharon                  0.003
43   Country_T.JA                      Japan                         0.003
44   Country_T.UK                      United Kingdom                0.003
45   CountryCapital_T.93               Baghdad                       0.003

SWAN/KIM Cluster Architecture

• At present, KIM is used for massive semantic annotation in the context of the SWAN and SEKT projects

Here are some of its features:

• support for a virtually unlimited number of annotators

• centralized ontology storage and querying;

• centralized meta-data (annotations) and document storage, indexing, and querying;

• support for multiple crawlers (or other data sources);

• dynamic reconfiguration of the cluster (e.g. starting new crawlers or annotators on demand).

SWAN/KIM Cluster Console

SWAN Project: Semantic Web Annotator

Large Scale Annotation of human language for the Semantic Web using Human Language Technology (HLT).

Hosted by DERI (NUIG, Galway); it also involves:

• the GATE team (from Sheffield University's NLP Group) and

• Ontotext Lab.

• For more details take a look at http://deri.ie/projects/swan/

The current status:

• KIM Cluster of 7 servers in DERI

• Above 0.5TB shared storage

• 6 AMD64 Opterons, 6 Xeons, 36GB RAM

CoreDB: Name and Goals

• CoreDB is a component of KIM

• Stands for: Co-Occurrence and Ranking of Entities DB

• In a nutshell, it is designed to allow fast queries of the sort:

– Q1: the number of appearances of “UK” in documents during Jan 2005

– Q2: all people co-occurring with John Smith and some bank institution in documents from the second half of 2003

– Q3: Q2 + where the documents contain “fraud” and the name of the institution contains “capital”

CoreDB: Functionality

• It allows asking in a structured manner for:

– The number of references to entities in a (sub-)set of documents

– The entities, which co-occur together with other entities

• Entities can be constrained by:

– Class (and its sub-classes)

– Keyword/token in one of its names/aliases/labels

• Documents can be constrained according to DC-like features:

– Date (range; could be any date in the doc)

– Type (exact match; could be any string)

– Authors

– Title and Sub-title

– Keyword/token in the content, authors or the title fields
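
The two basic query types can be sketched over an in-memory list of annotated documents. This is only an illustration of the idea, not the CoreDB API: the document layout, function names, and sample data are assumptions, and class constraints are matched exactly rather than through sub-classes.

```python
from datetime import date

# Hypothetical annotated documents: each lists (entity name, class) occurrences.
DOCS = [
    {"date": date(2005, 1, 10),
     "entities": [("UK", "Country"), ("UK", "Country"), ("John Smith", "Person")]},
    {"date": date(2003, 9, 2),
     "entities": [("John Smith", "Person"), ("Jane Doe", "Person"),
                  ("Capital Bank", "Bank")]},
    {"date": date(2003, 3, 5),
     "entities": [("John Smith", "Person"), ("Bob Roe", "Person")]},
]


def in_range(doc, start, end):
    """True if the document date falls within [start, end]."""
    return start <= doc["date"] <= end


def count_references(docs, entity, start, end):
    """Number of occurrences of an entity in documents dated within [start, end]."""
    return sum(name == entity
               for doc in docs if in_range(doc, start, end)
               for name, _ in doc["entities"])


def co_occurring(docs, anchor, wanted_class, start, end):
    """Entities of a class that share a dated document with the anchor entity."""
    found = set()
    for doc in docs:
        names = {name for name, _ in doc["entities"]}
        if in_range(doc, start, end) and anchor in names:
            found |= {name for name, cls in doc["entities"]
                      if cls == wanted_class and name != anchor}
    return found
```

A count of “UK” in January 2005 is then `count_references(DOCS, "UK", date(2005, 1, 1), date(2005, 1, 31))`, and the people co-occurring with John Smith in the second half of 2003 come from `co_occurring(DOCS, "John Smith", "Person", date(2003, 7, 1), date(2003, 12, 31))`.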

The Scale of Ambition

• The major point is to allow such queries in an *efficient* manner over data with the following cardinality:

– 10^6 entities/terms

– 10^7 documents

– 10^2 entities occurring in an average document

• This means managing and querying efficiently 10^9 entity occurrences

• We have tested the current implementation with 10^7 occurrences, and it answers the basic queries in milliseconds.

CoreDB Applications

• Detection of “associative” links between entities, based on co-occurrence in documents

– It is an alternative to the detection of strong links based on local context parsing

• Ranking, measuring popularity, of an entity over a set of documents

– The ranking is as good/relevant/representative as the set of documents is

• Computing timelines (changes over time) for entity ranking or co-occurrence

– “How did our popularity in the IT press change during June?” (i.e. “What is the effect of this 1.5MEuro media campaign ?!?”)

– “How does the strength of association between organization X and RDF change over Q1?”
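
A popularity timeline can be sketched as follows. The slides do not define how rank is computed, so this example assumes the simplest possible measure: the share of all entity mentions in a period that belong to the entity in question. The function name and the input layout are likewise hypothetical.

```python
from collections import Counter


def rank_timeline(occurrences, entity):
    """Share of all entity mentions taken by `entity`, per period.

    occurrences: iterable of (period, entity_name) pairs, one per mention.
    """
    totals = Counter()
    hits = Counter()
    for period, name in occurrences:
        totals[period] += 1
        if name == entity:
            hits[period] += 1
    return {period: hits[period] / totals[period] for period in sorted(totals)}


# Hypothetical mentions, keyed by month.
mentions = [
    ("2005-05", "X"), ("2005-05", "Y"), ("2005-05", "Y"), ("2005-05", "Z"),
    ("2005-06", "X"), ("2005-06", "Y"),
]
timeline = rank_timeline(mentions, "X")
```

Comparing the values of consecutive periods gives the kind of "before and after the campaign" reading the example questions ask for.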

Implementation

• It is a new component in the architecture of KIM

– Having an API (part of the KIM API) allows different implementations

• There are now a couple of RDBMS-based implementations:

– Derby (free, open-source, 100% Java, was Cloudscape from IBM)

– ORACLE (v. 10g)

• The Derby implementation

– does not allow for efficient searches involving keywords

• The ORACLE implementation is used also for FTS-style indexing of the document contents

– Makes possible efficient combination of semantic and keyword search (which is already available through the SemanticQuery API)

• In both RDBMS implementations:

– Part of the ontology and the KB are replicated

– Same with part of the document and index related information

Ontotext Facts

• Founded in 2000

• 14 employees (permanent, without the shared personnel and associates)

• Daily statistics for http://www.ontotext.com: over 150 visits and 2000 hits

• Number of scientific publications: above 30

• Number of projects running: 9

• More than 20 partners we directly cooperate with on projects

• Average age: about 28

• Number of servers per developer: 0.7

Ontotext Lab

Robust Technology and Professional Services for

Knowledge and Language Engineering

http://www.ontotext.com
