Rihab Ayed, Mohand-Saïd Hacid
Hybrid graph indexing for aggregative queries
Problem
Query: q
Data set (labeled graphs): g1, g2, g3
Answer: {g2}
Exact matching
Approach: Filtering + Verification
[Figure: query q and graphs g1, g2, g3; Pattern A with ID-List: {g1, g2}; graphs are marked "Candidate (Pattern A)" or "Not relevant".]
Solution
Filtering:
Verification:
Aggregated Search
[Figure: query q and data set g1, g2, g3; g1 and g3 are combined (+) to answer q.]
Queries
Variables: x1, x2, x3
Constants (known resources): prof, trainee
Edge labels: co-worker, is, supervises, friend, as
Phase 1: Edge-Edge Encoding Schemes (Sakr and Al-Naymat [1])
[Figure: the graphs d3 and d5 of the Graph DB are encoded as rows of a Relational DB.]
graphID  edgeID  eLabel      sVID  sVLabel  dVID  dVLabel
3        1       group       1     Said     2     DB
3        2       as          1     Said     3     prof
3        3       in          3     prof     4     Univ-Lyon1
5        4       supervises  5     Said     6     Huy
5        5       supervises  5     Said     7     Heni
5        6       is          6     Huy      8     trainee
5        7       is          7     Heni     9     trainee
5        8       co-worker   10    Haytham  6     Huy
5        9       co-worker   10    Haytham  7     Heni
[1] Sherif Sakr and Ghazi Al-Naymat. Efficient relational techniques for processing graph queries, 2010.
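The encoding step can be illustrated with a small sketch. It assumes the relational layout shown in the edge table (one row per edge, with the IDs and labels of both endpoints) and uses an in-memory SQLite database; the table name `edges` and the helper `encode_graph` are hypothetical, and this is not claimed to be the exact scheme of [1].

```python
import sqlite3

# Graph d5 as read from the edge table: vertex IDs and labels, then edges
# given as (edgeID, eLabel, source vertex ID, destination vertex ID).
D5_VERTICES = {5: "Said", 6: "Huy", 7: "Heni", 8: "trainee", 9: "trainee", 10: "Haytham"}
D5_EDGES = [
    (4, "supervises", 5, 6),
    (5, "supervises", 5, 7),
    (6, "is", 6, 8),
    (7, "is", 7, 9),
    (8, "co-worker", 10, 6),
    (9, "co-worker", 10, 7),
]

def encode_graph(graph_id, vertices, edges):
    """Flatten a labeled graph into one relational row per edge."""
    return [
        (graph_id, edge_id, label, src, vertices[src], dst, vertices[dst])
        for edge_id, label, src, dst in edges
    ]

# Hypothetical table name; the columns follow the slide's header row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edges (graphID INTEGER, edgeID INTEGER, eLabel TEXT, "
    "sVID INTEGER, sVLabel TEXT, dVID INTEGER, dVLabel TEXT)"
)
rows = encode_graph(5, D5_VERTICES, D5_EDGES)
conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Once encoded, an edge pattern becomes an ordinary SQL selection.
supervised = conn.execute(
    "SELECT dVLabel FROM edges "
    "WHERE eLabel = 'supervises' AND sVLabel = 'Said' ORDER BY edgeID"
).fetchall()
```

Storing both endpoint labels on every row keeps single-edge lookups free of joins, at the price of repeating vertex labels.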
Phase 2: Common edges search
[Figure: query q (variables x1, x2, x3; constants prof, trainee; edge labels co-worker, is, supervises, friend, as) compared with graph d5 (labels Said, Huy, Haytham, trainee, LIRIS, Univ-Lyon1; edge labels is, co-worker, in).]
commonedges table
graphIDQ  graphID  edgeIDQ  edgeID  eLabel      sVLabel  dVLabel
1         5        1        4       supervises  Said     Huy
1         5        2        5       is          Huy      trainee
1         5        3        6       co-worker   Haytham  Huy
Discover all edges that belong to both the query graph q and the graph database D.
Then check whether all edges in q are present in D.
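A minimal sketch of this search, assuming that a query edge matches a data edge when the edge labels are equal and every constant endpoint has the same label (variables match anything). The query edges below are a hypothetical reading of q, and the function names are illustrative only.

```python
from collections import defaultdict

# Encoded edges of graphs 3 and 5, copied from the edge table:
# (graphID, edgeID, eLabel, sVLabel, dVLabel)
EDGES = [
    (3, 1, "group", "Said", "DB"),
    (3, 2, "as", "Said", "prof"),
    (3, 3, "in", "prof", "Univ-Lyon1"),
    (5, 4, "supervises", "Said", "Huy"),
    (5, 5, "supervises", "Said", "Heni"),
    (5, 6, "is", "Huy", "trainee"),
    (5, 7, "is", "Heni", "trainee"),
    (5, 8, "co-worker", "Haytham", "Huy"),
    (5, 9, "co-worker", "Haytham", "Heni"),
]

# Hypothetical query edges: (edgeIDQ, eLabel, source, destination).
# None marks a variable endpoint (x1, x2, x3); strings are constants.
QUERY = [
    (1, "supervises", None, None),
    (2, "is", None, "trainee"),
    (3, "co-worker", None, None),
    (4, "as", None, "prof"),
]

def endpoint_matches(query_label, data_label):
    """A variable (None) matches any label; a constant must be equal."""
    return query_label is None or query_label == data_label

def common_edges(query, edges):
    """Return (graphID, edgeIDQ, edgeID, eLabel, sVLabel, dVLabel) rows."""
    rows = []
    for qid, qlabel, qsrc, qdst in query:
        for gid, eid, elabel, src, dst in edges:
            if (qlabel == elabel and endpoint_matches(qsrc, src)
                    and endpoint_matches(qdst, dst)):
                rows.append((gid, qid, eid, elabel, src, dst))
    return rows

def graphs_covering_query(query, rows):
    """Graphs in which every query edge has at least one common edge."""
    covered = defaultdict(set)
    for gid, qid, *_ in rows:
        covered[gid].add(qid)
    all_query_edges = {qid for qid, *_ in query}
    return sorted(g for g, qids in covered.items() if qids == all_query_edges)

rows = common_edges(QUERY, EDGES)
complete = graphs_covering_query(QUERY, rows)
```

With this data no single graph covers every query edge (graph 5 lacks the `as` edge, graph 3 lacks the others), which is the situation where combining fragments from several graphs becomes necessary.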
Phase 3: Query Decomposition
[Flowchart: query q, qconst. and qano., verification, valid? If false: end. If true: aggregate search.]
[Figure: query q (variables x1, x2, x3; constants prof, trainee; edge labels co-worker, is, supervises, friend, as), shown with and without the additional edge prof in Univ-Lyon1.]
[Figure: data graphs d1 to d6 and query q (variables x1, x2, x3). Labels appearing in the graphs: Huy, M2 ECD, student, Said, Julien, Abdelkader, Univ-Nantes, Vietnam, prof, DB, Univ-Lyon1, trainee, Haytham, Heni, Odilon, doctorat, Fabrice, Mohamed; edge labels include friend, in, from, group, is, co-worker, know, supervises, as.]
Common edges search
Final answer set
#  x1    x2       x3
1  Said  Haytham  Heni
2  Said  Haytham  Huy
[Figure: answer 1 and answer 2 are the query graph with x1 = Said, x2 = Haytham and x3 = Heni (answer 1) or x3 = Huy (answer 2); constants prof, trainee; edge labels co-worker, is, supervises, friend, as.]
Distributed processing
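Source  ObjectID
S1      O1
S2      O1
S3      O2
ObjectID  RA  Decl
O1        40  15
O2        40  25
[Figure: sources S1, S2, S3 are linked to objects O1 and O2 by OID edges; each object carries RA and Decl edges with the values above. Query: a pattern with variables ?x and ?z over RA and Decl edges and the constant 15.]
Storage
SPO, OPS,…
RDF3X, ….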
Storage
[Figure: RDF graph with vertices x1, x2, y1, y2, z1, z2, l1, k1, n1 and edge labels a, b, c, d, e, as listed in the tables below.]
SPO
x1 a y1
x1 b z1
x1 c l1
x2 a y2
x2 b z2
x2 c l1
y1 d k1
y1 e n1
OPS
k1 d y1
l1 c x1
l1 c x2
n1 e y1
y1 a x1
y2 a x2
z1 b x1
z2 b x2
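As an illustration of the two orderings, the sketch below keeps the same triples in two sorted lists, one per permutation, and answers a lookup with a prefix scan. Systems such as RDF-3X use compressed, disk-based indexes; plain sorted Python lists and the helper `prefix_scan` are simplifying assumptions.

```python
from bisect import bisect_left

# Triples from the slide, as (subject, predicate, object).
TRIPLES = [
    ("x1", "a", "y1"), ("x1", "b", "z1"), ("x1", "c", "l1"),
    ("x2", "a", "y2"), ("x2", "b", "z2"), ("x2", "c", "l1"),
    ("y1", "d", "k1"), ("y1", "e", "n1"),
]

# SPO index: sorted by subject, then predicate, then object.
SPO = sorted(TRIPLES)
# OPS index: the same triples reordered as (object, predicate, subject).
OPS = sorted((o, p, s) for s, p, o in TRIPLES)

def prefix_scan(index, prefix):
    """Return all index rows that start with the given tuple prefix."""
    start = bisect_left(index, prefix)  # first row >= prefix
    matches = []
    for row in index[start:]:
        if row[:len(prefix)] != prefix:
            break
        matches.append(row)
    return matches

# Who points to l1? Answered from OPS without scanning SPO.
subjects_of_l1 = [s for _, _, s in prefix_scan(OPS, ("l1",))]
# All outgoing edges of y1, answered from SPO.
edges_of_y1 = prefix_scan(SPO, ("y1",))
```

A pattern with a bound object is served by OPS and a pattern with a bound subject by SPO; a pattern where only the predicate is bound has no matching prefix in either index and falls back to a scan.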
[Figure: query pattern with variables ?x, ?y, ?z and edge labels a, b, shown next to the SPO triples.]
Problem: scan of SPO?
[Figure: SPO-Lattice over S, P, O. The predicate sets of the data (a b c and d e) are placed in a lattice between ⊥ and ⊤ whose nodes include a, b, a b, a b c, d e f and h j g c.]
Data partitioning
Graph
Application-oriented partitioning
Worker 1: P1 P2 P3 P4
Worker 2: P5 P6 P7 P8
Worker 3: P9 P10 P11 P12
Worker 4: P13 P14 P15 P16
Problem 3: How to partition data?
Our proposal
• Schema-oriented partitioning
Query evaluation?
[Figure: the data graph is split over two workers. Worker 1 holds the vertices x1, x2, y1, y2, z1, z2, l1, k1, n1; Worker 2 holds x3, x4, y2, y3, z3, z4, l2, l3, k2, k3, n2, n3. Edge labels: a, b, c, d, e.]
Query: variables ?x, ?y, ?z, ?l, ?k, ?n; edge labels a, b, c, d, e
Decomposition: SQ1 and SQ2, which share ?y
SuperStep 1
Worker 1, SQ1
?x  ?y  ?z  ?l
x1  y1  z1  l1
x2  y2  z2  l1
Worker 1, SQ2
?k  ?n
k1  n1
Worker 2, SQ1
?x  ?y  ?z  ?l
x3  y2  z3  l2
x4  y3  z4  l3
Worker 2, SQ2
?k  ?n
k2  n2
k3  n3
To be transferred to W2
SuperStep 2
Transfer
?x  ?y  ?z  ?l
x2  y2  z2  l1
SQ2
?k  ?n
k2  n2
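One plausible reading of the two supersteps, sketched sequentially: in the first superstep each worker joins its local SQ1 and SQ2 results on ?y, and in the second superstep SQ1 rows without a local partner are transferred to the worker that holds a matching ?y. Keying the SQ2 results by ?y is an assumption (the slide lists only ?k and ?n), and the function names are hypothetical; a real BSP engine would run the workers in parallel with a barrier between supersteps.

```python
# Local subquery results per worker, following the tables on the slide.
# SQ1 binds (?x, ?y, ?z, ?l); SQ2 maps ?y to (?k, ?n) (assumed join key).
SQ1 = {
    "W1": [("x1", "y1", "z1", "l1"), ("x2", "y2", "z2", "l1")],
    "W2": [("x3", "y2", "z3", "l2"), ("x4", "y3", "z4", "l3")],
}
SQ2 = {
    "W1": {"y1": ("k1", "n1")},
    "W2": {"y2": ("k2", "n2"), "y3": ("k3", "n3")},
}

def superstep_1(sq1, sq2):
    """Join locally on ?y; keep rows with no local partner as pending."""
    results, pending = [], []
    for worker, rows in sq1.items():
        for row in rows:
            y = row[1]
            if y in sq2[worker]:
                results.append(row + sq2[worker][y])
            else:
                pending.append((worker, row))
    return results, pending

def superstep_2(pending, sq2):
    """Send each pending row to a remote worker holding its ?y, then join."""
    results, transfers = [], []
    for origin, row in pending:
        y = row[1]
        for worker, bindings in sq2.items():
            if worker != origin and y in bindings:
                transfers.append((origin, worker, row))
                results.append(row + bindings[y])
    return results, transfers

local, pending = superstep_1(SQ1, SQ2)
remote, transfers = superstep_2(pending, SQ2)
answers = sorted(local + remote)
```

Under these assumptions only one row crosses the network, the SQ1 row for x2, which is sent from W1 to W2.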
Query evaluation - BSP
[Figure: SQ1 to SQ5 and S1 to S5, separated by Network transfers, with synchronization (Syn).]
Acceptable plan
[Figure: two plans built from the same fragments, SQf1 (?x, ?y, ?z and the constant Cst; edge labels a, b, c) and SQf2 (?y, ?t, ?k; edge labels d, e, f), linked by cst12.]
Problem: How to compare query plans?
Costs estimation
Reassigning partitions
Analyze the costs of network transfers linked to a set of queries
Changing the partition assignment without changing the physical structure of the partition
[Figure: grid of partitions P1 to P16, shown for both assignments.]
Example
Transfer history
    P1   P2   P3    P4
P1  0    500  300   25
P2  20   0    50    25
P3  10   150  0     40
P4  300  100  1024  0
Assigning Partitions (assignment 1)
    M1  M2  M3
P1  1   0   0
P2  0   0   1
P3  0   1   0
P4  1   0   0
Cost = 800 + 95 + 200 + 1124 = 2219
Assigning Partitions (assignment 2)
    M1  M2  M3
P1  1   0   0
P2  0   0   1
P3  0   1   0
P4  0   1   0
Cost = 825 + 95 + 160 + 400 = 1480
Gain: 739!
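The costs in the example are reproduced if each partition contributes the sum of its transfers towards partitions placed on a different machine. The sketch below applies that rule, which is inferred from the numbers rather than stated on the slide; `network_cost` is a hypothetical helper.

```python
def network_cost(transfers, machine_of):
    """Sum, per partition, the transfers to partitions on other machines."""
    per_partition = {}
    for src, row in transfers.items():
        per_partition[src] = sum(
            volume
            for dst, volume in row.items()
            if machine_of[src] != machine_of[dst]
        )
    return per_partition, sum(per_partition.values())

# Transfer history from the slide: TRANSFERS[src][dst].
TRANSFERS = {
    "P1": {"P1": 0, "P2": 500, "P3": 300, "P4": 25},
    "P2": {"P1": 20, "P2": 0, "P3": 50, "P4": 25},
    "P3": {"P1": 10, "P2": 150, "P3": 0, "P4": 40},
    "P4": {"P1": 300, "P2": 100, "P3": 1024, "P4": 0},
}

# The two assignment matrices, written as partition -> machine.
FIRST = {"P1": "M1", "P2": "M3", "P3": "M2", "P4": "M1"}
SECOND = {"P1": "M1", "P2": "M3", "P3": "M2", "P4": "M2"}

parts_first, cost_first = network_cost(TRANSFERS, FIRST)
parts_second, cost_second = network_cost(TRANSFERS, SECOND)
gain = cost_first - cost_second
```

Moving P4 next to P3 removes the heaviest transfer (1024) from the network while only adding the lighter P1 and P4 exchanges, which is where the gain comes from.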
[Figure: architecture of the approach. Components: Documents (RDF graphs); Semantic clustering, using a Semantic Resource R and Predicates Similarity measures; Structural clustering, using Frequent subgraphs (n-predicates) and Cooccurrence; Semantic/Structural clustering; Indexing; clusters C1 to C5 and fragments F1, F2; Query Formulation; Query Decomposition (n-predicates); Sources querying.]
Optimizing the access to sources and fragments
[Figure: a query is evaluated over the Graph Database through an index of sources and documents. Labels of the example query: Phds, Roald, Tunisia, Nobel Prize, Supervised by, in, CV, visited, Name, ?.]
• Semantic clustering: meta-information about the metadata of RDF graphs
• Structural clustering: FSM (frequent subgraph mining) analysis
• Query decomposition: reducing the number of requests to sources of information and the number of aggregations to perform between fragments
Partitioning Strategies for RDF graphs
RDF data partitioning strategies [Akhter et al., 2018]
• Random (e.g., horizontal)
• Structure-based [Hammoud et al., 2015]: partitioning by subject or predicate, or using the structure of the graph
• Semantic-based (e.g., hierarchical: resources with the same hierarchy prefix are often queried together)
• Workload-aware [Abdelaziz et al., 2017]: based on the query workload (use of frequent query patterns)
Semantic Clustering
Predicate Relatedness
Predicates (edges) are projected as a set of concepts (nodes) in a Knowledge Representation resource (e.g., ontology, thesaurus)
Semantic Clustering - Semantic Relatedness
Measures of semantic similarity/distance in RDF graphs have been considered mainly in
• Ontology alignment [David et al., 2008] [Harispe et al., 2013]
• Web recommendation systems [Leal et al., 2013]
• Approximate RDF search [Zheng et al., 2016]
No available approach for computing semantic relatedness [4] between
• Predicates in an RDF graph and
• Properties in a semantic resource
The existing approaches use the properties to calculate the semantic similarity between concepts and instances in a knowledge representation resource (e.g., ontology)
[4] https://www.xml.com/pub/a/2001/01/24/rdf.html
We use the description of predicates to compare their semantic relatedness
Two tools devoted to Semantic Text Similarity (STS) [Resnik, 1999] – they use two knowledge representation resources
Tool: ADW [Pilehvar et al., 2013]
  KR resource: WordNet
  Similarity measure: Weighted Overlap
  Approach: Alignment of pairs of words from the two texts + random walks
Tool: UMBC [Han et al., 2013]
  KR resource: Stanford WebBase project Corpus + WordNet
  Similarity measure: LSA model using Corpus word co-occurrence + WordNet relationships
  Approach: Alignment of pairs of words from the two texts
Semantic Clustering : Semantic Relatedness
Example of semantic relatedness of the predicate “artist” with other predicates (using ADW tool)
Semantic Clustering : Semantic Relatedness
Semantic Clustering : Assumptions
Predicates are annotated by their descriptions (i.e., rdfs:comment) or their labels (i.e., rdfs:label )
Predicate labels with punctuation errors are fixed
Predicates with no descriptive metadata are included using their local name in URI (e.g., ArtistId => Artist Id)
Highly similar predicates can be clustered together (e.g., work_in, has job) if inexact graph matching is to be performed
Clusters are constructed based only on frequent predicates (frequency > support threshold)
Infrequent predicates are assigned to clusters of related frequent ones (relatedness with top-N frequent predicates per cluster)
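The assumptions above can be sketched as three small steps: deriving a label from a predicate's local name, separating frequent from infrequent predicates, and attaching an infrequent predicate to the most related cluster. The word-overlap score is only a toy stand-in for an STS tool such as ADW or UMBC, and all function names, the cluster contents and the example URI are hypothetical.

```python
import re
from collections import Counter

def label_from_uri(uri):
    """Derive words from the local name of a URI (ArtistId -> Artist Id)."""
    local = re.split(r"[/#]", uri.rstrip("/#"))[-1]
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", local)
    return re.sub(r"[_\-]+", " ", spaced).strip()

def split_by_support(occurrences, support_threshold):
    """Frequent predicates have frequency > threshold; the rest are infrequent."""
    counts = Counter(occurrences)
    frequent = {p for p, c in counts.items() if c > support_threshold}
    return frequent, set(counts) - frequent

def token_overlap(a, b):
    """Toy relatedness: Jaccard overlap of lowercase words (not ADW/UMBC)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def assign_infrequent(label, clusters, relatedness=token_overlap, top_n=2):
    """Pick the cluster whose top_n most frequent members are most related.

    `clusters` maps a cluster name to its predicate labels, assumed to be
    ordered by decreasing frequency.
    """
    def score(members):
        top = members[:top_n]
        return sum(relatedness(label, m) for m in top) / len(top)
    return max(clusters, key=lambda name: score(clusters[name]))

label = label_from_uri("http://example.org/ArtistId")
frequent, infrequent = split_by_support(["a", "a", "a", "b"], support_threshold=2)
clusters = {"music": ["artist name", "album title"], "work": ["work in", "has job"]}
chosen = assign_infrequent(label, clusters)
```

Passing the relatedness measure as a parameter keeps the assignment logic independent of the tool that scores predicate descriptions.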
Structural Clustering
Clustering (structurally) documents – represented as graphs – in an aggregated information retrieval system
Finding the most efficient implementation to use for a FSM analysis in a Centralized Graph Transaction Database
Results about the most efficient & available FSM solution [5]
• Initial list: 32 FSM algorithms
• Final list: 13 solutions out of 6 filtered algorithms
• Datasets: 12 datasets from the literature
Computing the co-frequency of predicates and their corresponding subgraphs in the database (subgraph max size =5)
[5] https://perso.liris.cnrs.fr/rihab.ayed/recherche.html
Experimental Settings
DBPSB benchmark [6]: its number of predicates is larger than in other benchmarks (e.g., LUBM and SP2Bench) [Kim et al., 2015]
DBPedia : a real cross-domain dataset
The dataset is a sample of DBPedia with a scale factor x% (i.e., 10%, 50% or 100%) and two query sets involving 50 and 40 query templates
We use the SPARQL Fuseki Server, Jena TDB Triple Store
Clustering library : scikit-learn v0.20.2, Algorithm : Spectral Clustering (using affinity matrix)
Two types of machines
• A master machine: performs clustering, sends partitions to each slave machine, decomposes the query, gets the intermediate results and aggregates them
• Slave machines: store partitions of the dataset, query them and return partial results to the master machine
[6] https://github.com/dice-group/IGUANA/wiki/How-to-execute-DBPSB
Experimental Settings : Metrics
Study the effect of semantic clustering on aggregated results
• Number of queried sources / all sources
• Number of requests sent to sources
• Number of aggregations in one source / all sources
• Completeness of results
• Runtime: the time spent by tasks devoted to query processing and aggregation of results
• Size of intermediate results and number of joins
• Number of local vs. external joins
• Partition quality: the representativeness of clustered subgraphs compared to the structure of graphs in the dataset