LOD2 Webinar Series: Virtuoso 7

LOD2 Webinar . 29.11.2011 . Page 1 http://lod2.eu Creating Knowledge out of Interlinked Data

description

This webinar, part of the LOD2 webinar series, presents Virtuoso 7: Virtuoso Column Store, Adaptive Techniques for RDF Graph Databases. We discuss the application of column store techniques to both graph (RDF) and relational data for mixed workloads ranging from lookup to analytics. Virtuoso is an innovative enterprise-grade multi-model data server for agile enterprises and individuals. It delivers an unrivaled platform-agnostic solution for data management, access, and integration. The unique hybrid server architecture of Virtuoso enables it to offer traditionally distinct server functionality within a single product. If you are interested in Linked (Open) Data principles and mechanisms, LOD tools and services, and concrete use cases that can be realised using LOD, then join us in the free LOD2 webinar series! http://lod2.eu/BlogPost/webinar-series

Transcript of LOD2 Webinar Series: Virtuoso 7

Page 1: LOD2 Webinar Series: Virtuoso 7

LOD2 Webinar . 29.11.2011 . Page 1 http://lod2.eu

Creating Knowledge out of Interlinked Data

Page 2: LOD2 Webinar Series: Virtuoso 7

http://lod2.eu

LOD2 is a large-scale integrating project co-funded by the European Commission within the FP7 Information and Communication Technologies Work Programme. This 4-year project comprises leading Linked Open Data technology researchers, companies, and service providers. Coming from across 12 countries, the partners are coordinated by the Agile Knowledge Engineering and Semantic Web Research Group at the University of Leipzig, Germany.

LOD2 will integrate and syndicate Linked Data with existing large-scale applications. The project shows the benefits in the scenarios of Media and Publishing, Corporate Data intranets and eGovernment.

Page 3: LOD2 Webinar Series: Virtuoso 7

Once per month the LOD2 webinar series offers a free webinar about tools and services along the Linked Open Data Life Cycle.

Stay with us and learn more about acquisition, editing, composing, connected applications, and finally publishing Linked Open Data.

Page 4: LOD2 Webinar Series: Virtuoso 7

© 2012 OpenLink Software, All rights reserved.

Virtuoso 7.0: Enabling Massively Scalable Big Data Analytics for RDF & SQL Data Management

By Orri Erling, Virtuoso Program Manager & Hugh Williams, Professional Services Manager

Making Technology Work For You

Page 5: LOD2 Webinar Series: Virtuoso 7

Company Overview

Page 6: LOD2 Webinar Series: Virtuoso 7

OpenLink Company Overview

OpenLink Software is a privately-held company founded in 1992 by its President & CEO, Kingsley Idehen. The company is an industry acclaimed technology innovator in the following areas:

•  ODBC, JDBC, ADO.NET, and OLE-DB compliant Data Access Drivers for Oracle, SQL Server, Informix, Ingres, Sybase, Progress, MySQL, and PostgreSQL

•  High-Performance & Scalable Multi-Model (Relational & Graph) Database Technology

•  Data Integration Middleware (Data Virtualization Technology across a wide variety of Protocols & Formats)

•  Web Application Server Technology

•  Linked Data Deployment & Management

•  Socially-enhanced Distributed Collaborative Applications Platforms (Weblogs, Wikis, Feed Aggregation and Syndication, Web File Systems, Discussion Forums, etc.)

•  Identity Management

Page 7: LOD2 Webinar Series: Virtuoso 7

Products & Services: Software Products

•  OpenLink Universal Data Access Drivers (UDA) - High-performance data access drivers for ODBC, JDBC, ADO.NET, and OLE DB that provide transparent access to enterprise databases.

•  OpenLink Virtuoso - available in single server and cluster editions that are deployed in cloud and/or enterprise modes.

•  OpenLink Data Spaces Platform and Applications

•  OpenLink Ajax Toolkit

•  OpenLink Data Explorer

•  An Open Source Data Access SDK for ODBC

All OpenLink products are delivered by download from the Internet (http, ftp, etc.). Temporary licenses are issued upon download and may be extended as needed, on a case-by-case basis. Permanent licenses are issued once payment is received.

Page 8: LOD2 Webinar Series: Virtuoso 7

Products & Services: Professional and Support Services

•  OpenLink Product Support provides front-line email and phone support, web-based online support, and a variety of premium services such as phone, emergency, and onsite support.

•  Our Support staff is comprised of individuals with extensive knowledge of data access, data migration, database administration, programming APIs, and other relevant skills.

•  Services are sold in either Standard "Bronze" or Premium "Platinum" Support packages, with varying hours of availability, response times, etc.

•  We also offer Custom Development, Training, and other Consultancy services. These services can be offered on- or off-site. Expenses for travel, accommodations, food, etc., associated with on-site services are charged separately.

Page 9: LOD2 Webinar Series: Virtuoso 7

Customers

OpenLink's installed base is in excess of 10,000 customers worldwide. Examples include:

•  Data.Gov (U.S. Govt. Open Linked Data initiative)

•  Verizon

•  Raytheon

•  Bank of America

•  CGI Federal

•  Elsevier

•  French National Library

•  Globo

•  Scottish Government

•  St Jude's Medical

•  Barclays Bank

•  Wells Fargo

•  and many more

Page 10: LOD2 Webinar Series: Virtuoso 7

Office Locations

USA: OpenLink Software, Inc, 10 Burlington Mall Road, Suite 265, Burlington, MA 01803. Tel.: +1 781 273 0900. Fax: +1 781 229 8030

UK: OpenLink Software Ltd., Airport House, Purley Way, Croydon, Surrey CR0 0XZ. Tel.: +44 (0)20 8681 7701. Fax: +44 (0)20 8681 7702

Page 11: LOD2 Webinar Series: Virtuoso 7

Virtuoso Universal Server Overview

Page 12: LOD2 Webinar Series: Virtuoso 7

Situation Analysis

Data is growing exponentially along the following dimensions:

•  Volume

•  Velocity

•  Variety

All of this happens while the total number of hours in a day remains 24.

Page 13: LOD2 Webinar Series: Virtuoso 7

Product Value Proposition

Enterprise and Individual Agility via Data Access, Integration, and Management, without compromising performance, scalability, security, and platform independence.

Virtuoso locks you into an experience (openness, performance, and scale), not the platform itself. -- Kingsley Idehen, Founder & CEO, OpenLink Software

Page 14: LOD2 Webinar Series: Virtuoso 7

Product Architecture

A high-performance, scalable, secure, and operating-system-independent server designed to handle contemporary challenges associated with standards-compliant data access, data integration, and data management.

Page 15: LOD2 Webinar Series: Virtuoso 7

Data Virtualization Middleware

An in-built middleware layer (“Sponger”) for creating Transient & Persistent Views over Heterogeneous Data Sources.

Page 16: LOD2 Webinar Series: Virtuoso 7

Sophisticated Content Crawler

DBMS-hosted Content Crawler that leverages loosely coupled binding to the Sponger Middleware component for transformation of unstructured and semi-structured data into Linked Data.

Page 17: LOD2 Webinar Series: Virtuoso 7

Core Platform behind LOD Cloud

Core Platform (Graph DBMS and Linked Data Deployment) behind DBpedia, many bubbles in the LOD Cloud, and the LOD Cloud cache itself.

Page 18: LOD2 Webinar Series: Virtuoso 7

Virtuoso Linked Data projects

•  DBpedia - public SPARQL endpoint over the DBpedia data (and international Chapters)

•  LOD Cloud Cache - public server hosting LOD cloud datasets

•  URIBurner - Linked Data generation & transformation service

•  Linked Geo Data - OpenStreetMap Spatial data as Linked Data

•  Sindice - SPARQL endpoint behind its Semantic Web Index

•  Data.gov - US Government Linked Data

•  Health.data.gov - Clinical Quality Linked Data on health.data.gov

•  Seevl - Linked Data music discovery service

•  Bio2RDF - Life science data mapped to Linked Data

•  Neurocommons - Life science data mapped to Linked Data

•  Musicbrainz - MusicBrainz database published as Linked Data

•  Open PHACTS - DBpedia-like Linked Data Space for Pharma

•  Others - Many others …

Page 19: LOD2 Webinar Series: Virtuoso 7

Powerful Standards Support

ODBC compliance enables use of client applications (e.g. Microsoft Access) as front-ends for Virtuoso, 3rd party RDBMS engines, and the World Wide Web hosted Linked Open Data Cloud.

Page 20: LOD2 Webinar Series: Virtuoso 7

Powerful Standards Support Cont’d

ODBC & HTML5 compliance enables development of rich client apps that leverage the WebDB-ODBC bridge for accessing data across Virtuoso, 3rd party RDBMS engines, and the World Wide Web hosted Linked Open Data Cloud.

Page 21: LOD2 Webinar Series: Virtuoso 7

Insight Discovery & Exploration

Native Faceted Browsing that enables multi-dimensional drill-downs via any browser

Page 22: LOD2 Webinar Series: Virtuoso 7

Insight Discovery & Exploration

Microsoft Silverlight or HTML5 based PivotViewer Front-End for SPARQL and SPARQL-FED Queries

Page 23: LOD2 Webinar Series: Virtuoso 7

Powerful SPARQL Query Service

Basic SPARQL Endpoint for Creating Query Definitions & Sharing Query Results.

Example: health.data.gov data directly from a Web Browser.

Page 24: LOD2 Webinar Series: Virtuoso 7

Powerful SPARQL Query Builder

Use Query By Example (QBE) Patterns to Construct & Share Query Results.

Page 25: LOD2 Webinar Series: Virtuoso 7

How Do I Get Going?

•  Download, install, and experience the power of coherent integration of disparate data sources, data access protocols, and data representation formats.

•  In a nutshell, commence exploitation of powerful business intelligence, socially enhanced collaboration, data virtualization, and entity analytics without writing a line of code!

•  Turn "Big Data" into exploitable "Smart Data" without compromise!

•  Will be integrated into the next release of the LOD2 Stack

Page 26: LOD2 Webinar Series: Virtuoso 7

Virtuoso 7.0

Page 27: LOD2 Webinar Series: Virtuoso 7

Flexible Big Data Challenge

•  Data Agility is challenged by Volume, Velocity, and Variety

•  “Schema Last” is great - if the price is right

•  RDF, graphs promise powerful querying with the flexibility and scale of NoSQL key-value stores

•  Inference may be good for integration, if it can express the right things, beyond OWL

•  RDF data management technology must learn from the lessons of SQL RDBMS; everything applies

Page 28: LOD2 Webinar Series: Virtuoso 7

Virtuoso 7.0 Mission Statement

Destruction of the following items as impediments to Big (Open) Linked Data exploitation:

•  Performance

•  Scalability

•  Platform Independence

•  Security & Privacy

•  Price

Page 29: LOD2 Webinar Series: Virtuoso 7

Virtuoso 7.0 & Big Data Myths

Myths put to rest:

•  Scalable Open Ended SPARQL Endpoints

•  Scalable Open Ended Read-Write SPARQL Endpoints

•  Fine-grained Access Controls underlying Read-Only or Read-Write endpoints

Page 30: LOD2 Webinar Series: Virtuoso 7

Virtuoso Column Store Features

•  Supports SQL and SPARQL query languages

•  Compact column-wise storage

•  Vectored execution of commands

•  Shared nothing scale out for clusters

•  Powerful procedure language with parallel, distributed control structures

•  Full-text and geospatial indexes

Page 31: LOD2 Webinar Series: Virtuoso 7

Storage Engine

•  Freely mix column-wise and row-wise indices

•  All SQL and RDF data types natively supported, single execution engine for SQL/SPARQL

•  Column compression 3x more space efficient than row-wise compression for RDF

•  Column stores are not only for big scans: random access surpasses rows as soon as there is some locality

•  9 B/quad with DBpedia, 7 B/quad with BSBM or RDF-H, 14 B/quad with web crawls (PSOG, POSG, SP, OP, GS, excluding literals)
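As a rough worked example of what the bytes-per-quad figures imply, the sketch below multiplies a quad count by the densities quoted above. The function name and the one-billion-quad input are illustrative assumptions, not figures from the slides, and the result covers the listed indices only, since the quoted densities exclude literals.

```python
# Bytes per quad as quoted on the slide (all listed indices, excluding literals).
BYTES_PER_QUAD = {"DBpedia": 9, "BSBM or RDF-H": 7, "web crawls": 14}


def index_size_gb(quads: int, bytes_per_quad: int) -> float:
    """Estimated index footprint in decimal gigabytes (1 GB = 10**9 bytes)."""
    return quads * bytes_per_quad / 10**9


quads = 1_000_000_000  # hypothetical dataset size, chosen only for illustration
for name, density in BYTES_PER_QUAD.items():
    print(f"{name}: about {index_size_gb(quads, density):.0f} GB per billion quads")
```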

Page 32: LOD2 Webinar Series: Virtuoso 7

Execution Engine

•  Vectoring is not only for column stores

•  Vectoring makes a random access into a linear merge join if there is any locality: always a win, mileage depends on run time factors

•  Vectoring eliminates interpretation overhead and makes CPU friendly code possible

•  Even with run time data typing, vectoring allows use of type-specific operators on homogeneous data, e.g. arithmetic

•  Dynamically adjust vector size: a larger vector may not fit in cache but will get better locality for random access
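The idea that vectoring turns random access into something close to a merge join can be sketched as follows. This is a toy illustration over a sorted Python list, not Virtuoso's implementation; the function names and the list-based "index" are assumptions made for the example. The vectored variant sorts the batch of probe keys and resumes each search from the previous position, so probes with locality touch neighbouring parts of the index in one forward pass.

```python
from bisect import bisect_left


def lookup_one_by_one(index_keys, probes):
    """Tuple-at-a-time: one independent binary search over the whole index per probe."""
    hits = []
    for key in probes:
        pos = bisect_left(index_keys, key)
        if pos < len(index_keys) and index_keys[pos] == key:
            hits.append(key)
    return hits


def lookup_vectored(index_keys, probes):
    """Vectored: sort the batch, then move through the sorted index in a single pass."""
    hits = []
    pos = 0
    for key in sorted(probes):
        # Resume from the previous position instead of searching from the start.
        pos = bisect_left(index_keys, key, pos)
        if pos == len(index_keys):
            break  # every remaining probe is larger than the last index key
        if index_keys[pos] == key:
            hits.append(key)
    return hits


index_keys = list(range(0, 100, 2))  # sorted, hypothetical index contents
probes = [40, 7, 42, 8, 44, 200]     # a batch ("vector") of lookups
print(lookup_vectored(index_keys, probes))
```

Both functions find the same keys; the vectored one returns them in key order rather than probe order.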

Page 33: LOD2 Webinar Series: Virtuoso 7

Graph operations

•  Run time computation plus caching instead of materialization

•  SPARQL/SQL extension for arbitrary transitive subqueries:

   ◦  Flexible options for returning shortest paths, all paths, all / distinct reachable, attributes of steps on paths etc.

   ◦  Efficient execution, searching the graph from both ends if looking for a path with ends given

•  Query operators for RDF hierarchy traversal

•  Special query operator for OWL sameAs and IFP based identity

•  Taking OWL sameAs / IFP identity into account for DISTINCT / GROUP BY
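Searching the graph from both ends when both endpoints of a path are given can be illustrated with a generic bidirectional breadth-first search. The sketch below is a textbook version over an in-memory edge list, written only to show the technique; it is not the engine's transitive operator, and the function name and sample edges are assumptions.

```python
def shortest_path_length(edges, start, goal):
    """Length of the shortest directed path from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    forward, backward = {}, {}
    for s, o in edges:
        forward.setdefault(s, set()).add(o)
        backward.setdefault(o, set()).add(s)

    dist_f, dist_b = {start: 0}, {goal: 0}
    frontier_f, frontier_b = [start], [goal]
    while frontier_f and frontier_b:
        # Always expand the smaller frontier, one full level at a time.
        expand_forward = len(frontier_f) <= len(frontier_b)
        if expand_forward:
            frontier, dist, other, adj = frontier_f, dist_f, dist_b, forward
        else:
            frontier, dist, other, adj = frontier_b, dist_b, dist_f, backward

        best = None
        next_frontier = []
        for node in frontier:
            for neighbour in adj.get(node, ()):
                if neighbour in other:
                    # The two searches meet on this edge.
                    candidate = dist[node] + 1 + other[neighbour]
                    best = candidate if best is None else min(best, candidate)
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    next_frontier.append(neighbour)
        if best is not None:
            return best
        if expand_forward:
            frontier_f = next_frontier
        else:
            frontier_b = next_frontier
    return None


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "d")]
print(shortest_path_length(edges, "a", "d"))
```

Each side explores only about half the path depth, which is why this pays off on graphs with high fanout.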

Page 34: LOD2 Webinar Series: Virtuoso 7

Query Optimization Challenges

•  Typical SQL stats do not help

•  Need to measure data cardinalities starting from constants in the query

•  Need to sample fanout predicate by predicate, as needed

•  Predicate and class hierarchies are easy to handle in sampling

•  sameAs or IFP inference voids all guesses

•  Is hash join worthwhile? High setup cost means that one must be sure of cardinalities first
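Sampling fanout predicate by predicate can be pictured with a small sketch: pick some subjects that carry a given predicate and average how many objects each has. The nested-dictionary layout, function name, and sample data below are assumptions made for illustration only; the slides do not describe the optimizer's actual data structures.

```python
import random


def estimate_fanout(by_predicate, predicate, sample_size=100, seed=0):
    """Estimate the average number of objects per subject for one predicate.

    by_predicate maps predicate -> {subject -> list of objects} (hypothetical layout).
    """
    subjects = by_predicate.get(predicate, {})
    if not subjects:
        return 0.0
    keys = list(subjects)
    if len(keys) > sample_size:
        # Look at a random subset of subjects instead of scanning them all.
        keys = random.Random(seed).sample(keys, sample_size)
    return sum(len(subjects[s]) for s in keys) / len(keys)


data = {
    "knows": {"alice": ["bob", "carol"], "bob": ["carol"]},
    "name": {"alice": ["Alice"], "bob": ["Bob"], "carol": ["Carol"]},
}
print(estimate_fanout(data, "knows"), estimate_fanout(data, "name"))
```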

Page 35: LOD2 Webinar Series: Virtuoso 7

Deep Sampling

•  Everything is a join -> sampling must also do joins

•  As the candidate plan grows, the cost model executes all the ops on a sample of the data

•  Actual cardinality and locality are known, also when search conditions are correlated

•  Having high confidence in the cost model, hash join plans become safe and attractive

•  Even though there is an indexed access path for all, a scan can be better because it produces results in order. Need to be sure of selectivity before taking the risk
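The point that sampling must itself perform joins can be sketched as follows: rather than multiplying independent per-pattern estimates, run the join for a sample of the rows produced by the first pattern and scale the observed match count. This is a simplified illustration under assumed names and data layout, not the cost model described in the webinar.

```python
import random


def estimate_join_cardinality(first_keys, second_index, sample_size=100, seed=0):
    """Estimate the size of a two-pattern join by executing it on a sample.

    first_keys: join-key values produced by the first pattern (hypothetical input).
    second_index: maps a join-key value to the rows matching the second pattern.
    """
    if not first_keys:
        return 0.0
    sample = first_keys
    if len(first_keys) > sample_size:
        sample = random.Random(seed).sample(first_keys, sample_size)
    # Execute the join for the sampled rows only, then scale to the full input.
    matched = sum(len(second_index.get(key, ())) for key in sample)
    return matched / len(sample) * len(first_keys)


first_keys = ["x", "y", "z", "x"]
second_index = {"x": [1, 2], "y": [3]}
print(estimate_join_cardinality(first_keys, second_index))
```

Because the sampled rows are actually joined, correlation between the two patterns shows up in the estimate instead of being assumed away.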

Page 36: LOD2 Webinar Series: Virtuoso 7

Elastic Cluster

•  Data is partitioned by key, different indices may have different partition keys

•  Partitions may split and migrate between servers

•  Partitions may be kept in duplicate for fault tolerance/load balancing

•  Actual access stats drive partition split and placement
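A minimal sketch of key-based partitioning where different indices use different partition keys is shown below. The choice of partitioning column per index, the CRC32 hash, and all names are assumptions for illustration; the slides do not specify these details, and the sketch does not cover partition splitting, migration, or replication.

```python
import zlib

# Hypothetical assignment: each index is partitioned on one column of the quad.
PARTITION_COLUMN = {"PSOG": "s", "POSG": "o"}


def partition_of(value: str, n_partitions: int) -> int:
    """Map a key value to a partition number with a stable hash."""
    return zlib.crc32(value.encode("utf-8")) % n_partitions


def placements(quad: dict, n_partitions: int) -> dict:
    """Return, for each index, the partition that would hold this quad's entry."""
    return {
        index: partition_of(quad[column], n_partitions)
        for index, column in PARTITION_COLUMN.items()
    }


quad = {"p": "foaf:knows", "s": "ex:alice", "o": "ex:bob", "g": "ex:graph"}
print(placements(quad, 8))
```

The same quad can land on different partitions for different indices, so a lookup by subject and a lookup by object are each routed to a single partition.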

Page 37: LOD2 Webinar Series: Virtuoso 7

Optimizing for Cluster

•  Vectored execution is natural in a cluster since single-tuple messages are not an option

•  Keep max ops in flight at all times, always send long messages

•  Fully distributed query coordination:

   ◦  Any node can service a client request. Correlated subqueries, stored procedures may execute anywhere, arbitrary parallelism and recursion between partitions

   ◦  On single shared memory box, cluster is approximately even with single process multithreading, low overhead

   ◦  1.8x more throughput in BSBM BI when going from 1 to 2 machines

   ◦  Distributed stored procedures, send the proc to the data, as in map-reduce, except that there are no limits on cross partition calling/recursion

   ◦  Choice of transactional and auto-commit update semantics, can have atomic ops without global transaction
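The preference for long messages over single-tuple messages can be sketched by grouping a vector of lookups by destination partition, so that each partition receives one message carrying many operations. The hashing scheme and names below are illustrative assumptions, not the cluster's wire protocol.

```python
import zlib
from collections import defaultdict


def partition_of(key: str, n_partitions: int) -> int:
    """Map a key to a partition number with a stable hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions


def build_messages(lookup_keys, n_partitions):
    """Group a vector of lookups into one batch ("message") per destination partition."""
    messages = defaultdict(list)
    for key in lookup_keys:
        messages[partition_of(key, n_partitions)].append(key)
    return dict(messages)


keys = [f"ex:s{i}" for i in range(1000)]  # hypothetical batch of subject lookups
messages = build_messages(keys, 4)
print({partition: len(batch) for partition, batch in messages.items()})
```

With 1,000 lookups and 4 partitions this sends at most 4 messages instead of 1,000.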

Page 38: LOD2 Webinar Series: Virtuoso 7

Cluster Architecture Diagrams

Page 39: LOD2 Webinar Series: Virtuoso 7

LOD Cache

•  55 billion triples in LOD cache, only 384 GB of RAM, 2TB disk

•  2 x 384 GB of RAM, 4TB SSD

•  Most of Linked Open Data and Web Crawls

•  http://lod.openlinksw.com

•  http://lod.openlinksw.com/sparql

Page 40: LOD2 Webinar Series: Virtuoso 7

Independent Benchmark Report from CWI: Berlin SPARQL Benchmark

50 Billion triples: source file size 2.8 TB; compressed source file size 240 GB; source data files per loader node 30 GB; final database file size 1.8 TB; load time 10h 54s

150 Billion triples: source file size 8.5 TB; compressed source file size 728 GB; source data files per loader node 91 GB; final database file size 5.6 TB; load time n/a
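A quick worked calculation on the load figures above gives the database size per triple and the ratio of database size to source size. The sketch assumes decimal units (1 TB = 10**12 bytes), which the slide does not state, and the function names are illustrative.

```python
TB = 10**12  # assumption: decimal terabytes; the slide does not say which unit is meant


def bytes_per_triple(size_tb: float, triples: float) -> float:
    """Average number of bytes stored per triple."""
    return size_tb * TB / triples


def size_ratio(database_tb: float, source_tb: float) -> float:
    """Final database size as a fraction of the uncompressed source size."""
    return database_tb / source_tb


# Figures taken from the two benchmark rows above.
rows = [(50e9, 2.8, 1.8), (150e9, 8.5, 5.6)]
for triples, source_tb, database_tb in rows:
    print(
        f"{triples / 1e9:.0f} billion triples: "
        f"{bytes_per_triple(database_tb, triples):.1f} B/triple in the database, "
        f"{size_ratio(database_tb, source_tb):.0%} of source size"
    )
```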

Page 41: LOD2 Webinar Series: Virtuoso 7

Store Comparisons Summary: Exploration oriented queries (QMpH)

Berlin SPARQL Benchmark

Dataset sizes: 100 Million Triples, 200 Million Triples, 1 Billion Triples

Virtuoso 6: 37,678.319 32,969.006 8,984.789

Virtuoso 7: 47,178.820 27,933.682

Page 42: LOD2 Webinar Series: Virtuoso 7

Store Comparisons Summary: Business Intelligence oriented queries (QMpH)

Berlin SPARQL Benchmark

Dataset sizes: 10 Million Triples, 100 Million Triples, 1 Billion Triples

Virtuoso 6: 431.465 35.342 2.383

Virtuoso 7: 996.795 75.236

Page 43: LOD2 Webinar Series: Virtuoso 7

Store Comparisons Summary:

Exploration oriented queries (Cluster Edition) (QMpH)

Berlin SPARQL Benchmark

10 Billion Triples 50 Billion Triples 150 Billion Triples

Virtuoso 7 2,360.210 4,253.157 2,090.574

Page 44: LOD2 Webinar Series: Virtuoso 7

Store Comparisons Summary:

Business Intelligence oriented queries (Cluster Edition) (QMpH)

Berlin SPARQL Benchmark

10 Billion Triples 50 Billion Triples 150 Billion Triples

Virtuoso 7 13.078 0.964 0.285

Page 45: LOD2 Webinar Series: Virtuoso 7

Future Work

•  Complete deep sampling: enhanced query optimization plans

•  Run TPC-H and TPC-DS in SQL and their 1:1 translation in SPARQL, demonstrating SPARQL performance as near to SQL as possible

Page 46: LOD2 Webinar Series: Virtuoso 7

Additional Information

•  OpenLink Software

   ◦  OpenLink Software - www.openlinksw.com

   ◦  OpenLink Virtuoso - virtuoso.openlinksw.com

   ◦  Universal Data Access - uda.openlinksw.com

•  Social Media Data spaces

   ◦  http://virtuoso.openlinksw.com/blog/ (weblog)

   ◦  https://plus.google.com/112399767740508618350/posts (Google+)

   ◦  https://twitter.com/OpenLink (Twitter)

   ◦  http://www.linkedin.com/company/openlink-software (LinkedIn)

   ◦  Hashtag: #LinkedData (Anywhere)

Page 47: LOD2 Webinar Series: Virtuoso 7

EU-FP7 LOD2 WP6 – 25.-26.03.2013. Page 47 http://lod2.eu

Creating Knowledge out of Interlinked Data

LOD2 Stack Usability Survey 2013

http://www.surveygizmo.com/s3/1188229/LOD2-Stack-Usability-Survey-2013