Towards distributed RDF querying · Introduction: Virtual Repository RDF Data RDF Data RDF Data Rep...
Transcript of Towards distributed RDF querying · Introduction: Virtual Repository RDF Data RDF Data RDF Data Rep...
Towards distributed RDF querying
Citation for published version (APA):Vdovjak, R., Houben, G. J. P. M., & Stuckenschmidt, H. (2004). Towards distributed RDF querying. 1-1. Abstractfrom Dutch-Belgian Database Day 2004 (DBDBD 2004), December 3, 2004, Antwerp, Belgium , Antwerp,Belgium.
Document status and date:Published: 01/01/2004
Document Version:Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)
Please check the document version of this publication:
• A submitted manuscript is the version of the article upon submission and before peer-review. There can beimportant differences between the submitted version and the official published version of record. Peopleinterested in the research are advised to contact the author for the final version of the publication, or visit theDOI to the publisher's website.• The final author version and the galley proof are versions of the publication after peer review.• The final published version features the final layout of the paper including the volume, issue and pagenumbers.Link to publication
General rightsCopyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright ownersand it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.
• Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal.
If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, pleasefollow below link for the End User Agreement:www.tue.nl/taverne
Take down policyIf you believe that this document breaches copyright please contact us at:[email protected] details and we will investigate your claim.
Download date: 26. Feb. 2021
1
Towards Distributed RDF Querying
Richard Vdovjak (TU Eindhoven)Geert-Jan Houben (TU Eindhoven)Heiner Stuckenschmidt (VU Amsterdam)
Layout
n Introduction / motivationn Hera
n Integration model
n Query Processingn Optimization
Introduction: Layout
Motivation
n Demand for combining distributed data on the Webq Comparative Shoppingq Virtual Museums
q Digital Librariesq Web Portals
n User: q formulate his queryq split the queryq assemble results
Introduction: motivation
Web Site
Web Site
Web Site
Query 1
Query 2
Query 3UserUser
Sem. Web / RDF(S) to the Rescue (?)
n RDFSq The Pivot Language of the Semantic Webq Solves the problem of syntactic heterogeneity of
different sourcesq Provides basic modelling primitives and reasoning
techniques for conceptual knowledgeq Built on an extremely simple data model that
avoids complications of other object oriented formalisms
q And (most importantly): It is a standard !
Introduction: motivation
The RDF data model
n statements are <Subject, Predicate, Object> triples:q <“http://wwwis.win.tue.nl/~houben/”, dc:creator, “Geert-Jan”>
n statements describe properties of resources
n a resource is everything that has an identifier: URIn Directed Labelled Multi-graph
���������
����� ��� �����������������
dc:creator
Introduction: RDF (S)
Sem. Web / RDF(S) to the Rescue (?)
n Benefitsq Common Syntaxq Formal (limited) Semanticsq Flexible
easy to express anything about anything
n Our User ?q formulate his queryq split the queryq assemble results
...(still unhappy)
Introduction: RDF(S)
RDFData
RDFData
RDFData
Repository
Repository
Repository
Query 1
Query 2
Query 3UserUser
2
Possible Solution: Data Warehousing
n Gather All Data Locally
n Watch for Updatesq Insert New Dataq Delete Old Data
n Evaluate (locally) User Query
Introduction: Data Warehousing
RDFData
RDFData
RDFData
Repository RepositoryRepository
Query UserUser
RDFData
Repository
Data Warehousing: Problems
n Performance Bottleneck
n Freshness
n Copyright / Data Ownership issues
Introduction: Data Warehousing
Insert into Local Repository:Processing Speed
0
50
100
150
200
250
0 10 20 30 40 50 60 70 80
Size ( Stored triplets * 1000)
Tri
ple
ts/s
Another Solution: Virtual Repository
n Split the User Query into Sub-Queries
n Translate Sub-Queries to Source Schemata
n Distribute Sub-Queriesn Assemble Results
Introduction: Virtual Repository
RDFData
RDFData
RDFData
Repository Repository Repository
Query
Sub-QuerySub-QuerySub-Query
UserUser
Path Index
Mediator
Challenges/ Requirements for Integration of RDF(S) Sourcesn Unified Interface to data sources
q Usually only a part of a source is of interestq “Freshness” of data must be guaranteedq Sources are many, autonomous, and volatileq Semantic Heterogeneity
n Schema heterogeneity n Designation heterogeneity
n Distributed Query processing q Complex/Join queries (SeRQL language)q Flexibility w.r.t. user needs/query
n (the order of importance changes)q Correctness, Completenessq Performance
Introduction: Requirements
Distinguishing aspects from Distributed Databases
Hera Design Methodology
Hera Suite
Design Methodology
RequirementsAnalysis
ConceptualDesign
IntegrationDesign
ApplicationDesign
AdaptationDesign
PresentationDesign
(Search)Agent
inforequest
(meta) data
ConceptualModel Application
Model
info request (slice)presentation
End User
RQL / RDF XSLT / XML
UserModel
PresentationModel
PresentationEngine
ApplicationEngine
IntegrationEngine
CuypersEngine
AdaptationEngine
info request HTML/WML
IntegrationModel
Semantic Layer Application Layer Presentation Layer
Hera
n Data retrieval is just a begining n Navigational Structuren Presentation Renderingn Adaptation / Adaptivity
Hera Front-end
Source Clusters:
User Query
Presentation
Conceptual Model Access Point
html/smil
Hera Back-end
IM InstanceIntegration Model (IM)
The Semantic Layer
(Search) Agent request
RDF/data
Hera: Semantic Layer
3
Conceptual Model (CM)
n Interface between data retrieval and presentation generation (via SeRQL)
n CM exists on its own (even without instances)q Made by a system designerq Top down approach
n Specifies the application’s semantics What is the information system about
n Expressed in RDF(S) n Populated on demand with data from several
heterogeneous information sourcesn Challenge: Map the Sources to the defined CM
Hera: Conceptual Model
Layout
n Introduction / motivationn Hera
n Integration model
n Query Processingn Optimization
Integration Model
Integration Model (IM)n A generic framework for describing, integrating and
relating concepts from sources to their CM counterparts
n View Definition/Translation Language + n Object reconciliation language +n Language for programming the mediator +
n Expressed in RDFS n Instantiated by the integration designer into IMIs:
program the mediator to overcome the semantic heterogeneity between the sources and the CM
Integration Model: Definition
Integration Model in RDF(S)
PathExpression
ComparatorTransformer
ProcInstruction
FromEdge
ToEdge
FromNode
ToNode
PrimaryNode
EdgeNode
From
To
Rdf:Property Rdf:Resource
Decoration Articulation
appliesToArt
Literal
edgeProducedBy
producedBy
source
target
backtrack
follow
idByValue
obtainedFromEdgeobtainedFromNode
starts
forSource
ends
ends
starts
starts
fromTarget
toTarget
rankValue
nodeProducedBy
subPropertyOf
subClassOf
property
ends
endsList
endsList
List
FromList
obtainedFromList
Source
address
name
Literal
Literal
idByProperty
SubSchema
forSource
hasClassList
ClassList identicalFor
appliesToSource
Literal
dimension
Ranking Restriction
Literal
restrictionValuerestrictionPath
From
Integration Model: RDFS syntax
Schema heterogeneity
n Sources are autonomous and can therefore differ a lot from each other
n Mappings are formed through the notion of Path Expressions (PE) which form articulations
n An articulation is a pair of two PEs, one in an external source, one in the CMq consists of a link between the start-classes and a
link between the ending-edges
Integration Model: Expressive power
Schema heterogeneityn Design the CM like source 1 :
Integration Model: Expressive power
4
Schema heterogeneity
n Sometimes a value should be mapped to a list of values
n A transformer is needed for the necessary actionn Denoted in the IM as “obtained by list”
Integration Model: Expressive power
n Different sources may have different ways to uniquely identify instances
n Need to define the identifying properties “primary key” of objects in every source
n Consolidate them into the CM so that a join can be performed across multiple sources
Designation heterogeneity
Integration Model: Expressive power
Designation heterogeneity
n IM offers three kinds of data-identification:1. idByUri � every object has an unique URI
2. idByValue � a value is unique for an object
3. idByProperty � a “super resource” defines the uniqueness of an object
n Examples:
Integration Model: Expressive power
Designation heterogeneity : idByUri
n An object is uniquely defined by its URIn Can be used/imposed within closed communities,
e.g. corporate ISn Does not work World Wide
Integration Model: Expressive power
Designation heterogeneity : idByValuen Value based (similarly to the relational model)n A object is uniquely defined by one(or more) of its
own properties
Integration Model: Expressive power
Designation heterogeneity: idByPropertyn idBy-information provided by a super-resourcen Recursive path until either idByUri or idByValue is
encountered
Integration Model: Expressive power
5
Join Examplen If the primary key is the same data is joined
n The two idBy-paths are different but joining is still possible if the end-values for the primary key are the same
Integration Model: Expressive power
Different sources - different qualities
n Sources can “get points” for certain qualities:q Data reliabilityq Data quality: e.g. picture quality for a Photo databaseq Reachabilityq Speed
n Sources can be consulted in different order based on the current user preferences
n Decorations q making background knowledge explicitq “exported” into the CM (extending the Schema)
Integration Model: Expressive power
Extra IM features
n Process Instructions : (Need to compare (primary) keys
q Transformers/Comparatos§ Conversion functions (date of birth -> Personal number)§ Look-up table value translations§ Unit conversions (km->mile)§ Format conversion (tif -> jpg)
n Direct translationq In case of homogeneous Schemas/sourcesq Lists of classes with identical outgoing edges
Integration Model: Extra features
Layout
n Introduction / motivationn Hera
n Integration model
n Query Processingn Optimization
Query Processing
CM Instance (CMI)
n The Hera presentation engine needs more data than that resulted from a “bare” user query
n The user query is extended to retrieve literal values (the real content)
n An RDF graph is constructed out of the “flat” SeRQL output
Query Processing: CMI definition
CMI generation
n User Query: q Retrieve all writers
n Extended query:q Retrieves also the name, age, hasPortrait P4,P5
Literal
Literal Literal Literal
Literal
Literal
Literal
Comic
Picture
WriterAnimator
Person
P1 P2 P3
P4
P5
hasPicture
hasWriter
hasDrawer
name
age
Given the CM:
Literal
drawStyle hasPortrait
Query Processing: CMI generation example
6
Layout
n Introduction / motivationn Integration model
n Query Processing
n Optimization
Optimization
Initial Performance experimentsn Queries against small-size applications (e.g. the comics
database or virtual museum) answered within ms
n Medium Test Set: RDF version of Wordnetq Naturally split in four parts: Nouns (10MB), Glossary (15MB),
Similar-To Definitions (2MB), Hyponyms (8MB)Test Setting: q Three local stores: Vrije Universiteit Amsterdam, CWI
(Amsterdam) and Eindhoven Technical University, Mediator at Vrije Universiteit
Resultsq When installed locally Performance of distributed system is between
50 and 200% of the original Sesame (200-1000 msec.)q The performance drops with the size of the result set due to
extensive joining and communication overhead
Optimization: Initial Performance
Where the Time Goes?
Large Result Set
Local Processing31%
Communication47%
Result Joining21%
Mediator Overhead
1%
Small Result Set
Local Processing75%
Communication9%
Result Joining9%
Mediator Overhead
7%
Depends on many factorsq Data Size / Source Processing Speedq Query (complexity, result size ...) q Connection Speed
Improving the Performance
n Large applications (hundreds of MB) require sophisticated optimization techniques:
n Currentlyq Schema/path Indexing q Join orderingq Algebraic optimizations
n Work in progressq Reducing the transferred data
n requires an architecture change (similar to P2P)
Optimization: Approaches
Schema/Path Index Hierarchy
n Central place to store the translatable paths from the articulations
n The main idea: pushing long paths to sources is more efficent than joining many small paths at the mediatorq fewer joinsq smaller data traffic
n The index is constructed out of a pool of articulations
n Articulations can be added/deleted or modified on the fly: flexible source management
n Index is able to infer “new” paths of its own:
Optimization: Path Indexing
Schema/Path Index Hierarchy
n Inference rules:
q Inclusion of “ super paths”Given: A � B and B � CInsert A � C
q Transitive closure of subClassOf / subPropertyOfGiven: A � B, C � D and B –subClassOf–› Cinsert A � D
Optimization: Path Indexing
7
Schema/Path Index Hierarchy
n Key = path (sequence of properties)n Value = list of sources to which this path can be
translated
Optimization: Path Indexing
Performance Results: Full Index vs 1-path Index
Joining Time
0
200000
400000
600000
800000
1000000
1200000
1400000
1600000
1800000
100 1000 10000 100000
Result Size per Table
Tim
e [m
s]
Full Index
1-Index
Evaluation Time
100000
300000
500000
700000
900000
1100000
1300000
1500000
100 1000 10000 100000
Result Size per Table
Tim
e [m
s] Full Index
1-Index
From a RDF Path to Relational Tables
A B C Dx y z
Source1 Source2 Source3
C B
a1 b1
a2 b2
B C
b1 c1
b3 c2
C D
c1 d1
c2 d2
Source1x
Source2y
Source3z
� �
The Problem of Join Orderingn Determine the optimal
order of join execution in a (chain) query
n Different strategies have different execution costs due to different selectivity of the join operations
n Problem is NP-hard: we consider heuristic solutions
Optimization: Join Ordering
A Cost Model for RDF Queryingn Data Access Costs
q Initializing the transmissionq Transmitting the data
n Join Costs q Nested Loop Join
q Hash Join (potentially faster but less flexible wrt. determining obhect identity
n Costs of a Query Plan:q Transmission costs for all
relationsq Join costs for the chosen
footprint
Optimization: Join ordering
Join Ordering Heuristicsn Complexity of Task
demands heuristic approaches, performance of heuristics depends on class of queries.
n For chain queries the following performs best [Steinbrunn et.al. 1997]
n 1) Iterative Improvementq Start with random solutionsq Improve solution using a
greedy heuristic q Result: local optima
n 2) Simulated Annealingq Further improve solutions
allowing for increasing costs with a certain probability
q Helps to get out of local optima and converge towards better solutions
Optimization: Join ordering
8
Optimizing Communication: Collaborating Network of Mediators
n One coordinator + Set of cooperating nodes
n The coordinator generates the (global) query plan for other nodes
n instructions:q obtain data
receive a tableuse your local data
q join obtained data in a given order
q ship data to another node
RDFData
RDFData
RDFData
Repository Repository Repository
RDFData
RDFData
RDFData
Repository Repository Repository
RDFData
RDFData
RepositoryRepository
Path IndexMediator
Path IndexMediator
Path IndexMediator
RDFData
RDFData
RepositoryRepository
Path IndexMediator
RDFData
RDFData
RepositoryRepository
Path IndexMediator
QueryUserUser
Conclusionsn Virtual Repositories are a viable solution for building
distributed WISChallenges:
q Semantic heterogeneityn Source schemas can differ n URI is not enough for joiningn Flexible Evaluation
q Performance / Scalabilityn Path Indexingn Join Orderingn Collaborating Mediators
IM}
} Optimizer + Architecture
Questions