Dm2 e ontotext-nov2012

OWLIM

Mariana Damova, PhD

DM2E, Vienna, November 2012

Ontotext

– Top-5 provider of core Semantic Technology

– Established in year 2000; offices in Bulgaria, UK, USA

– Active both in research and commercial projects (FP7 funding for 10 years)

• 360° semantic technology – unique portfolio:

– Semantic Databases: high-performance RDF DBMS, scalable reasoning

– Semantic Search: text-mining (IE), metadata generation, Information Retrieval (IR)

– Web Mining: focused crawling, screen scraping, data fusion

– Linked Data Management and Data Integration

Good recognition in the SemTech community

– Ontotext pages are ranked #1 for “semantic annotation” and “semantic repository” at GYM, #3 for “linked data management” at Google

Several joint ventures and subsidiaries

– Innovantage: leading online recruitment intelligence provider in UK

Ontotext Clients (selected)

British Broadcasting Corporation (BBC)

– Ran its World Cup 2010 sites on top of OWLIM

– Since Mar’12, the BBC Sports and 2012 Olympics sections are driven by OWLIM and a Concept Extraction service developed by Ontotext

Press Association (UK)

– Analysis of sports news

– Concept extraction

– Linked data generation

Top-3 USA media (not allowed to name)

The National Archives (UK) contracted Ontotext to implement semantic KB and semantic search for the Government Web Archive

British Museum (UK): Ontotext leads the development of Phase 3 of the ResearchSpace project on collaborative research in cultural heritage; the British Museum’s public SPARQL end-point is powered by OWLIM

de Bibliothek (Holland): aggregation of data from 150 library databases

Semantic Technologies

• Semantic technologies (RDF, LOD) allow for an unprecedented ease of integration of heterogeneous data sources

– Already adopted in the pharmaceutical and publishing industries

– Cultural heritage is next

BBC: when MySQL was replaced with OWLIM in their “Dynamic Semantic Publishing” architecture, the BBC team observed a considerable reduction in the complexity of database design, query specification and application development, and in query evaluation time.

Source: “BBC World Cup 2010 dynamic semantic publishing”, Jem Rayfield, Senior Technical Architect, BBC News and Knowledge.

http://www.bbc.co.uk/blogs/bbcinternet/2010/07/bbc_world_cup_2010_dynamic_sem.html

• OWLIM is a family of scalable semantic repositories

• OWLIM-Lite: in-memory, fastest, scales to ~100 million statements

• OWLIM-SE: file-based, sameAs & query optimizations, scales to 20 billion statements

• OWLIM-Enterprise: replication cluster deployment for resilience and high-performance parallel query answering

Semantic Repository for RDFS and OWL

• OWLIM provides

– Management, integration and analysis of heterogeneous data

– Combined with light-weight, high-performance reasoning

– The inference is based on logical rule-entailment

– Full RDFS, OWL Horst, restricted OWL-Lite, OWL 2 QL and OWL 2 RL

– Custom semantics can be defined via rules and axiomatic triples
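To make rule-based entailment concrete, here is a minimal Python sketch of forward chaining over two standard RDFS rules (subclass transitivity and type inheritance). It illustrates the general technique only and is not OWLIM's engine; the function name `materialise` and the `ex:` sample data are hypothetical.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def materialise(triples):
    """Apply two RDFS rules repeatedly until no new triple is derived."""
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                # Both rules join on "object of the first = subject of a subClassOf triple".
                if o != s2 or p2 != SUBCLASS:
                    continue
                if p == SUBCLASS:      # (A subClassOf B), (B subClassOf C) => (A subClassOf C)
                    new.add((s, SUBCLASS, o2))
                elif p == TYPE:        # (x type A), (A subClassOf B) => (x type B)
                    new.add((s, TYPE, o2))
        if new <= closure:             # fixpoint reached
            return closure
        closure |= new


# Hypothetical sample data.
explicit = {
    ("ex:Painting", SUBCLASS, "ex:Artwork"),
    ("ex:Artwork", SUBCLASS, "ex:MuseumObject"),
    ("ex:nightWatch", TYPE, "ex:Painting"),
}
inferred = materialise(explicit) - explicit
for triple in sorted(inferred):
    print(triple)
```

Custom semantics would correspond to adding further rules of the same shape; a production reasoner would index triples rather than use the nested loop shown here.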

OWLIM in the Cultural Heritage Domain

Selected commercial projects

ResearchSpace project, funded by the Andrew W. Mellon Foundation: support for collaborative web-based research, information sharing and web publishing for the cultural heritage scholarly community. An Ontotext-led international consortium.

The Polish Digital National Museum aggregates artifacts from over 70 contributing cultural institutions in the Digital Libraries Federation PIONIER Network using the OWLIM repository of Ontotext.

LODAC (Linked Open Data in Academia): Japan's National Institute of Informatics aggregates information from multiple Japanese resources as LOD. The system uses 8 OWLIM nodes and aggregates 19 collections with 700 000 entities and 15M triples.

SemTech for Cultural Heritage project, funded by ITCC: semantic publishing of Bulgarian cultural heritage to Europeana; establishing a Bulgarian technical aggregator for Europeana.

Selected research projects

MOLTO FP7 project: a use case in cultural heritage for a semantic knowledge representation infrastructure for querying RDF and presenting query results; includes close to 9K museum objects from two collections of The Gothenburg City

Charisma (Cultural Heritage Advanced Research Infrastructures): an EU-funded integrating activity project with a consortium of 21 partners and metadata from 6 major European cultural institutions; it has selected the OWLIM repository of Ontotext.

OWLIM PERFORMANCE

• OWLIM is a scalable, robust and efficient triple store

– Serving the two most important web-sites for the London Olympic Games

• Official Olympics website

• BBC Olympics website

– Performance highlights

• OWLIM loads the 100M and the 200M datasets almost twice as fast as the next best product (17 min. for 100M)

• Best query performance among those repositories that can handle update and multi-client query tasks (5,285 Query-mixes-per-hour, where a query mix contains 25 queries; e.g. about 100 queries/sec)

• OWLIM v5 is 43% faster than v4.3 on the BSBM Explore and Update scenario

• OWLIM v5 requires between 25% and 70% less storage space

• OWL 2 RL-type languages have proven to be the only feasible approach for reasoning with billions of statements

Reasoning complexity

owl:sameAs Optimization

A way to handle equivalent statements through a single master node. This gives efficient and compact handling of inferred statements, with 4-6 times more statements available to query than those explicitly introduced.
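The idea of a single master node per equivalence class can be sketched in Python with a union-find structure: statements are stored once against the master, and the equivalent variants are expanded only at query time. This is an illustrative sketch of the general approach, not OWLIM's internal implementation; the class `SameAsIndex`, the helper `expand` and all sample identifiers are hypothetical.

```python
from itertools import product


class SameAsIndex:
    """Maps every node to the master node of its owl:sameAs class."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            self.parent[node] = self.parent[self.parent[node]]  # path halving
            node = self.parent[node]
        return node

    def merge(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Deterministic choice: the lexicographically smaller root is the master.
            self.parent[max(ra, rb)] = min(ra, rb)

    def members(self, master):
        return sorted(n for n in list(self.parent) if self.find(n) == master)


index = SameAsIndex()
index.merge("dbpedia:Vienna", "geonames:Vienna")
index.merge("geonames:Vienna", "freebase:Vienna")

# Hypothetical explicit statements, using different identifiers for the same city.
explicit = [
    ("dbpedia:Vienna", "ex:hosts", "ex:DM2E_meeting"),
    ("freebase:Vienna", "ex:capitalOf", "ex:Austria"),
]

# Compact storage: one statement per equivalence class.
stored = {(index.find(s), p, index.find(o)) for s, p, o in explicit}


def expand(statements):
    """Enumerate every statement that is visible to queries."""
    for s, p, o in statements:
        for s2, o2 in product(index.members(s), index.members(o)):
            yield (s2, p, o2)


visible = set(expand(stored))
print(len(stored), "stored,", len(visible), "visible to queries")
```

In this toy data set the visible statements outnumber the stored ones by a factor of three, which is the same kind of effect as the ratio quoted on the slide.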

OWLIM Replication Cluster

• Distribution through data replication is used to ensure:

– Better handling of concurrent user requests

– Failover support

• How does it work?

– Every user request is pushed into a transaction queue

– Each data write request is multiplexed to all repository instances

– Each read request is dispatched to one instance only

– To ensure load-balancing, each read request is sent to the instance with the smallest execution queue at that point in time
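The dispatch policy described above can be sketched in a few lines of Python: writes are multiplexed to every instance, reads go to the instance with the shortest execution queue. This is a simplified model under stated assumptions (one in-memory queue per instance, no failover, no actual query execution); the class names `Instance` and `Cluster` are hypothetical and not part of any OWLIM API.

```python
from collections import deque


class Instance:
    """One repository instance with its own execution queue."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()


class Cluster:
    def __init__(self, instances):
        self.instances = list(instances)

    def write(self, statement):
        # Every write is replicated to all instances so they stay in sync.
        for instance in self.instances:
            instance.queue.append(("write", statement))

    def read(self, query):
        # A read goes to exactly one instance: the one with the shortest queue.
        # min() returns the first instance when several queues are equally short.
        target = min(self.instances, key=lambda inst: len(inst.queue))
        target.queue.append(("read", query))
        return target.name


cluster = Cluster(Instance(f"worker-{i}") for i in (1, 2, 3))
cluster.write(("ex:s", "ex:p", "ex:o"))
picks = [cluster.read(f"query-{i}") for i in range(4)]
print(picks)
```

Because writes land on every queue, only reads change the relative queue lengths, so successive reads spread evenly across the instances.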

Geo-spatial index

• Geo-spatial information concerns the geometry of points, shapes and distances relative to the surface of the Earth (or any spherical object).

• When using OWLIM-SE all angles are in decimal degrees with the latitude ranging from -90 to +90 degrees and the longitude ranging from -180 to +180 degrees.

• Airports have a reference point given by latitude, longitude and altitude.

• Political boundaries can be specified by polygons where each vertex is a 2-dimensional latitude/longitude pair.
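As an illustration of the coordinate conventions above, the following Python sketch validates decimal-degree latitude/longitude pairs and computes a great-circle distance with the haversine formula. The formula, the spherical-Earth assumption and the mean radius of 6371 km are choices made for this example; the slides do not say how the OWLIM-SE plug-in computes distances, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assuming a perfect sphere


def check_point(lat, lon):
    """Validate a decimal-degree point against the ranges used by OWLIM-SE."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    return lat, lon


def great_circle_km(a, b):
    """Haversine distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, check_point(*a))
    lat2, lon2 = map(math.radians, check_point(*b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Approximate city-centre coordinates, used purely as sample input.
vienna = (48.21, 16.37)
sofia = (42.70, 23.32)
print(round(great_circle_km(vienna, sofia)), "km")
```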

RDF Rank

• OWLIM-SE includes a plug-in that allows for efficient calculation of a modification of PageRank over RDF graphs

• Computation of rank values is fast, e.g.

– 400M LOD statements takes 310 sec (27 iterations)

• Results are available through a system predicate

• Example: get the 100 most important nodes in the RDF graph

PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
SELECT ?n WHERE { ?n rank:hasRDFRank ?r }
ORDER BY DESC(?r) LIMIT 100
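For readers unfamiliar with PageRank, the sketch below runs the standard power-iteration algorithm over an RDF graph, treating each triple as a link from subject to object. The slides only say that RDF Rank is "a modification of PageRank" without giving details, so this is plain PageRank and not the plug-in's algorithm; the damping factor, the iteration count and the sample triples are assumptions.

```python
def pagerank(triples, damping=0.85, iterations=20):
    """Standard PageRank where every triple is an edge subject -> object."""
    out_links = {}
    for s, _p, o in triples:
        out_links.setdefault(s, set()).add(o)
        out_links.setdefault(o, set())
    n = len(out_links)
    rank = dict.fromkeys(out_links, 1.0 / n)
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[v] for v, outs in out_links.items() if not outs)
        base = (1.0 - damping) / n + damping * dangling / n
        nxt = dict.fromkeys(out_links, base)
        for v, outs in out_links.items():
            for target in outs:
                nxt[target] += damping * rank[v] / len(outs)
        rank = nxt
    return rank


# Hypothetical sample graph.
triples = [
    ("ex:vase", "ex:heldBy", "ex:museum"),
    ("ex:coin", "ex:heldBy", "ex:museum"),
    ("ex:mask", "ex:heldBy", "ex:museum"),
    ("ex:mask", "ex:madeBy", "ex:carver"),
]
ranks = pagerank(triples)
top = sorted(ranks, key=ranks.get, reverse=True)
print(top[0])
```

Sorting the nodes by descending rank, as in the last lines, mirrors what the SPARQL query above does with ORDER BY DESC(?r).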

Define: nested repositories

“Nested repositories” represent a new data management concept for RDF data:

• a mechanism for sharing data stored across multiple repositories, where

• one of them contains a large body of knowledge which gets embedded in other repositories

• each containing more specific data, which are interlinked with the common body of knowledge
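The concept can be modelled in a few lines of Python: one shared repository holds the common body of knowledge, and each specific repository embeds it while keeping its own data separate. This is a conceptual sketch only; the class `Repository`, its `match` method and the sample triples are hypothetical and do not reflect how OWLIM configures or stores nested repositories.

```python
class Repository:
    """A set of triples that can embed (nest) one shared repository."""

    def __init__(self, name, nested=None):
        self.name = name
        self.nested = nested   # the common body of knowledge, if any
        self.own = set()       # data specific to this repository

    def add(self, triple):
        self.own.add(triple)

    def match(self, s=None, p=None, o=None):
        """Return visible triples matching the pattern; None is a wildcard."""
        visible = self.own | (self.nested.match() if self.nested else set())
        pattern = (s, p, o)
        return {t for t in visible
                if all(q is None or q == v for q, v in zip(pattern, t))}


# One shared repository, embedded in two more specific ones.
common = Repository("common")
common.add(("ex:Rembrandt", "rdf:type", "ex:Painter"))

museum_a = Repository("museum-a", nested=common)
museum_a.add(("ex:a/object1", "ex:createdBy", "ex:Rembrandt"))

museum_b = Repository("museum-b", nested=common)
museum_b.add(("ex:b/object7", "ex:createdBy", "ex:Rembrandt"))

print(museum_a.match(o="ex:Rembrandt"))
print(museum_a.match(s="ex:Rembrandt"))
```

Each specific repository can link its own objects to the shared entities, while the data of the other specific repository stays invisible to it.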

http://www.ontotext.com/owlim

