Rihab Ayed, Mohand-Saïd Hacid

Hybrid graph indexing for aggregative queries


Problem

Data set: labeled graphs g1, g2, g3

Query: a graph q

Answer: {g2} (exact matching)

Approach: Filtering + Verification

[Figure: query q matched against g1, g2, g3 through an index on Pattern A. ID-List: {g1, g2}. Graphs listed for Pattern A are candidates; the others are not relevant.]

Solution

Filtering:

Verification:

Aggregated Search

[Figure: data graphs g1, g2, g3 and query q; q is answered by combining g1 and g3 (g1 + g3 = q).]

Queries

Variables

Constants (known resources)

[Figure: query graph with variables x1, x2, x3, constants prof and trainee, and edge labels co-worker, is, supervises, friend, as.]

graphID  edgeID  eLabel      sVID  sVLabel  dVID  dVLabel
3        1       group       1     Said     2     DB
3        2       as          1     Said     3     prof
3        3       in          3     prof     4     Univ-Lyon1
5        4       supervises  5     Said     6     Huy
5        5       supervises  5     Said     7     Heni
5        6       is          6     Huy      8     trainee
5        7       is          7     Heni     9     trainee
5        8       co-worker   10    Haytham  6     Huy
5        9       co-worker   10    Haytham  7     Heni

Phase 1: Edge-Edge Encoding Schemes (Sakr and Al-Naymat [1])

[Figure: graph d3 with vertices Said, DB, prof, Univ-Lyon1 and edge labels group, as, in.]

[1] Sherif Sakr and Ghazi Al-Naymat. Efficient relational techniques for processing graph queries, 2010.

[Figure: graph d5 with vertices Said, Huy, Heni, Haytham, trainee; edge labels include is and co-worker.]

Graph DB, encoded as a Relational DB
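The encoding of graphs d3 and d5 into rows can be illustrated with a minimal sketch. It is not the scheme of [1] itself: the input layout (`GRAPHS`), the `EdgeRow` type and the use of one global edge counter are assumptions read off the example table.

```python
from typing import NamedTuple


class EdgeRow(NamedTuple):
    graph_id: int
    edge_id: int
    e_label: str
    s_vid: int
    s_vlabel: str
    d_vid: int
    d_vlabel: str


def encode(graphs):
    """Flatten {graph_id: (vertices, edges)} into one relational row per edge.

    Edge identifiers come from a single global counter (an assumption
    taken from the example table, where d5 starts at edgeID 4).
    """
    rows = []
    edge_id = 0
    for graph_id in sorted(graphs):
        vertices, edges = graphs[graph_id]
        for label, src, dst in edges:
            edge_id += 1
            rows.append(EdgeRow(graph_id, edge_id, label,
                                src, vertices[src], dst, vertices[dst]))
    return rows


# Hypothetical in-memory form of d3 and d5: vertex labels by id, then edges.
GRAPHS = {
    3: ({1: "Said", 2: "DB", 3: "prof", 4: "Univ-Lyon1"},
        [("group", 1, 2), ("as", 1, 3), ("in", 3, 4)]),
    5: ({5: "Said", 6: "Huy", 7: "Heni", 8: "trainee", 9: "trainee",
         10: "Haytham"},
        [("supervises", 5, 6), ("supervises", 5, 7), ("is", 6, 8),
         ("is", 7, 9), ("co-worker", 10, 6), ("co-worker", 10, 7)]),
}

ROWS = encode(GRAPHS)
for row in ROWS:
    print(*row)
```

Each printed line corresponds to one row of the encoded table (graphID, edgeID, eLabel, sVID, sVLabel, dVID, dVLabel).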

commonedges table

graphIDQ  graphID  edgeIDQ  edgeID  eLabel      sVLabel  dVLabel
1         5        1        4       supervises  Said     Huy
1         5        2        5       is          Huy      trainee
1         5        3        6       co-worker   Haytham  Huy

Phase 2: Common edges search

[Figure: query graph q (variables x1, x2, x3; constants prof, trainee; edges co-worker, is, supervises, friend, as) compared with graph d5; a further fragment shows LIRIS, Univ-Lyon1 and the edge in.]

Discover all edges that belong to both the query graph q and the graph database D.

Check whether all edges in q are present in D.
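A minimal sketch of the two steps above, under stated assumptions: query terms starting with `?` are variables and match any label, an edge matches when its label and its constant endpoints agree, and the function names are hypothetical. The rows are those of d5 from the encoded table.

```python
def is_variable(term):
    """Assumed convention: variables are written with a leading '?'."""
    return term.startswith("?")


def common_edges(query_edges, db_rows):
    """Return (query edge id, graph id, edge id, label, src, dst) matches."""
    matches = []
    for q_id, (q_label, q_src, q_dst) in enumerate(query_edges, start=1):
        for graph_id, edge_id, label, src, dst in db_rows:
            if label != q_label:
                continue
            if not is_variable(q_src) and q_src != src:
                continue
            if not is_variable(q_dst) and q_dst != dst:
                continue
            matches.append((q_id, graph_id, edge_id, label, src, dst))
    return matches


def covers_query(query_edges, matches):
    """True when every query edge has at least one matching database edge."""
    matched_ids = {m[0] for m in matches}
    return matched_ids == set(range(1, len(query_edges) + 1))


# Rows of graph d5: (graphID, edgeID, eLabel, sVLabel, dVLabel).
D5 = [
    (5, 4, "supervises", "Said", "Huy"),
    (5, 5, "supervises", "Said", "Heni"),
    (5, 6, "is", "Huy", "trainee"),
    (5, 7, "is", "Heni", "trainee"),
    (5, 8, "co-worker", "Haytham", "Huy"),
    (5, 9, "co-worker", "Haytham", "Heni"),
]

QUERY = [("supervises", "?x1", "?x3"),
         ("is", "?x3", "trainee"),
         ("co-worker", "?x2", "?x3")]

MATCHES = common_edges(QUERY, D5)
print(len(MATCHES), covers_query(QUERY, MATCHES))

# Adding an edge that d5 does not contain makes the coverage check fail.
QUERY_WITH_FRIEND = QUERY + [("friend", "?x1", "?x2")]
print(covers_query(QUERY_WITH_FRIEND, common_edges(QUERY_WITH_FRIEND, D5)))
```

When a single graph fails the coverage check, as d5 does for the `friend` edge here, the missing edges have to come from other graphs, which is the case the aggregated search addresses.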

[Flowchart: the query q is split into qconst. and qano.; after verification, if the result is not valid (false) processing ends; if it is valid (true) the aggregate search starts.]

[Figure: query graph q (variables x1, x2, x3; constants prof, trainee, Univ-Lyon1; edges co-worker, is, supervises, friend, as, in).]

Phase 3: Query Decomposition

[Figure: data graphs d1 to d6 and the query graph q. The graphs mention, among others, Said, Huy, Heni, Haytham, Julien, Abdelkader, Mohamed, Odilon, Fabrice, Univ-Lyon1, Univ-Nantes, Vietnam, M2 ECD, with edge labels such as friend, co-worker, supervises, is, as, in, from, know. A common edges search is run between q and the graphs.]

Final answer set

#  x1    x2       x3
1  Said  Haytham  Heni
2  Said  Haytham  Huy

[Figure: answer 1 (x1 = Said, x2 = Haytham, x3 = Heni) and answer 2 (x1 = Said, x2 = Haytham, x3 = Huy), each drawn as the query graph with its variables replaced.]
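The two answers can be reproduced with a small sketch of aggregated evaluation: edges from several graphs are pooled and the query is evaluated over the pool by backtracking. This is an illustration, not the authors' algorithm. The shape of the query, the edge directions and the `friend` edge attributed to d4 are assumptions read off the figures; d3 and d5 edges come from the encoded table.

```python
def is_var(term):
    """Assumed convention: variables start with '?'."""
    return term.startswith("?")


def evaluate(query, edges, binding=None):
    """Return all variable bindings that satisfy every query edge."""
    binding = binding or {}
    if not query:
        return [binding]
    (s, p, o), rest = query[0], query[1:]
    results = []
    for es, ep, eo in edges:
        if ep != p:
            continue
        candidate = dict(binding)
        consistent = True
        for term, value in ((s, es), (o, eo)):
            if is_var(term):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            results.extend(evaluate(rest, edges, candidate))
    return results


POOL = [
    # d3 (from the encoded table)
    ("Said", "group", "DB"), ("Said", "as", "prof"),
    ("prof", "in", "Univ-Lyon1"),
    # d4 (assumed from the figure)
    ("Said", "friend", "Haytham"),
    # d5 (from the encoded table)
    ("Said", "supervises", "Huy"), ("Said", "supervises", "Heni"),
    ("Huy", "is", "trainee"), ("Heni", "is", "trainee"),
    ("Haytham", "co-worker", "Huy"), ("Haytham", "co-worker", "Heni"),
]

QUERY = [
    ("?x1", "as", "prof"),
    ("?x1", "supervises", "?x3"),
    ("?x3", "is", "trainee"),
    ("?x2", "co-worker", "?x3"),
    ("?x1", "friend", "?x2"),
]

ANSWERS = {(b["?x1"], b["?x2"], b["?x3"]) for b in evaluate(QUERY, POOL)}
print(sorted(ANSWERS))
```

No single graph in the pool contains all five query edges, so each answer is assembled from fragments of d3, d4 and d5.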

Distributed processing

Source  ObjectID
S1      O1
S2      O1
S3      O2

ObjectID  RA  decl
O1        40  15
O2        40  25

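[Figure: the two tables as an RDF graph. Sources S1 and S2 point to O1 and S3 points to O2 through OID edges; O1 and O2 carry RA = 40 and Decl = 15 and 25 respectively. Query: pattern with variables ?x and ?z, an RA edge and Decl = 15.]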

Storage

SPO, OPS, …

RDF-3X, …

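[Figure: RDF graph with subjects x1 and x2. x1 has edges a, b, c to y1, z1, l1; y1 has edges d, e to k1, n1; x2 has edges a, b, c to y2, z2, l1.]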

SPO        OPS
x1 a y1    k1 d y1
x1 b z1    l1 c x1
x1 c l1    l1 c x2
x2 a y2    n1 e y1
x2 b z2    y1 a x1
x2 c l1    y2 a x2
y1 d k1    z1 b x1
y1 e n1    z2 b x2
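The two orderings can be rebuilt from the eight triples with a short sketch. Sorted Python lists stand in for the on-disk indexes of a real triple store, and `prefix_scan` is a hypothetical helper, not an API of any RDF engine.

```python
from bisect import bisect_left

TRIPLES = [
    ("x1", "a", "y1"), ("x1", "b", "z1"), ("x1", "c", "l1"),
    ("x2", "a", "y2"), ("x2", "b", "z2"), ("x2", "c", "l1"),
    ("y1", "d", "k1"), ("y1", "e", "n1"),
]

# SPO: sorted by subject, predicate, object.
SPO = sorted(TRIPLES)
# OPS: the same triples reordered as (object, predicate, subject), then sorted.
OPS = sorted((o, p, s) for s, p, o in TRIPLES)


def prefix_scan(index, prefix):
    """Return the entries of a sorted index that start with `prefix`."""
    start = bisect_left(index, prefix)
    hits = []
    for entry in index[start:]:
        if entry[:len(prefix)] != prefix:
            break
        hits.append(entry)
    return hits


# Bound subject: SPO answers with one range lookup.
print(prefix_scan(SPO, ("x1",)))
# Bound object and predicate: OPS answers with one range lookup.
print(prefix_scan(OPS, ("l1", "c")))
```

A pattern whose subject and object are both unbound has no usable prefix in either ordering, so it falls back to scanning the whole index.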

Storage

[Figure: query graph with variables ?x, ?y, ?z and edges a, b, evaluated against the SPO table.]

Problem: scan of SPO?

[Figure: SPO-lattice. The S, P, O table is grouped by predicate sets (a b c; d e; a; a b; b; d e f; h j g c), ordered between ⊥ and ⊺.]

Data partitioning

Graph

Application-oriented partitioning

[Figure: partitions P1 to P16 spread over four workers: Worker 1 (P1 to P4), Worker 2 (P5 to P8), Worker 3 (P9 to P12), Worker 4 (P13 to P16).]

Problem 3: How to partition data?

Our proposal

• Schema-oriented partitioning

Query evaluation?

[Figure: Worker 1 stores the subgraphs around x1 (y1, z1, l1, k1, n1) and x2 (y2, z2, l1); Worker 2 stores the subgraphs around x3 (y2, z3, l2, k2, n2) and x4 (y3, z4, l3, k3, n3). The query has variables ?x, ?y, ?z, ?l, ?k, ?n and edges a, b, c, d, e.]

Query

Decomposition: SQ1 and SQ2, joined on ?y

SuperStep 1

Worker 1, SQ1:
?x  ?y  ?z  ?l
x1  y1  z1  l1
x2  y2  z2  l1    (to be transferred to W2)

Worker 1, SQ2:
?k  ?n
k1  n1

Worker 2, SQ1:
?x  ?y  ?z  ?l
x3  y2  z3  l2
x4  y3  z4  l3

Worker 2, SQ2:
?k  ?n
k2  n2
k3  n3

SuperStep 2

Transfer to Worker 2:
?x  ?y  ?z  ?l
x2  y2  z2  l1

Joined with SQ2 on Worker 2:
?k  ?n
k2  n2
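The two supersteps can be simulated with a small sketch in the bulk synchronous parallel (BSP) style. It is an illustration under assumptions: the SQ2 tables are given an explicit ?y column so that the join key is visible, unmatched SQ1 rows are sent to every other worker, and `local_join` and `run_bsp` are hypothetical names.

```python
def local_join(sq1_rows, sq2_rows):
    """Join SQ1 rows (x, y, z, l) with SQ2 rows (y, k, n) on y."""
    joined, unmatched = [], []
    for x, y, z, lv in sq1_rows:
        partners = [(k, n) for (y2, k, n) in sq2_rows if y2 == y]
        if partners:
            joined.extend((x, y, z, lv, k, n) for k, n in partners)
        else:
            unmatched.append((x, y, z, lv))
    return joined, unmatched


def run_bsp(workers):
    results = []
    inbox = {name: [] for name in workers}

    # Superstep 1: every worker joins its local SQ1 and SQ2 results and
    # posts the SQ1 rows without a local partner to the other workers.
    for name, tables in workers.items():
        joined, unmatched = local_join(tables["SQ1"], tables["SQ2"])
        results.extend(joined)
        for other in workers:
            if other != name:
                inbox[other].extend(unmatched)

    # Barrier: all messages are delivered before the next superstep starts.

    # Superstep 2: every worker joins the received rows with its local SQ2.
    for name, tables in workers.items():
        joined, _ = local_join(inbox[name], tables["SQ2"])
        results.extend(joined)
    return results, inbox


WORKERS = {
    "W1": {"SQ1": [("x1", "y1", "z1", "l1"), ("x2", "y2", "z2", "l1")],
           "SQ2": [("y1", "k1", "n1")]},
    "W2": {"SQ1": [("x3", "y2", "z3", "l2"), ("x4", "y3", "z4", "l3")],
           "SQ2": [("y2", "k2", "n2"), ("y3", "k3", "n3")]},
}

RESULTS, INBOX = run_bsp(WORKERS)
for row in RESULTS:
    print(*row)
```

The sketch only forwards rows that found no local partner; a row that joins locally but also has partners on another worker would need to be forwarded as well, which is left out here.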

Query evaluation - BSP

[Figure: subqueries SQ1 to SQ5 evaluated in supersteps S1 to S5, with a network exchange and a synchronization between consecutive supersteps.]

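Acceptable plan

[Figure: a query split into two fragments, SQf1 (variables ?x, ?y, ?z, constant Cst, edges a, b, c) and SQf2 (variables ?y, ?t, ?k, edges d, e, f), with a constant labelled cst12; the two fragments are shown in both evaluation orders.]

Problem: How to compare query plans?

Costs estimation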

Reassigning partitions

[Figure: partitions P1 to P16 in four groups of four (P1 to P4, P5 to P8, P9 to P12, P13 to P16), before and after reassignment.]

Analyze the costs of network transfers linked to a set of queries.

Assigning Partitions

Changing the partition assignment without changing the physical structure of the partition.

Example

Transfer history:

      P1   P2   P3    P4
P1    0    500  300   25
P2    20   0    50    25
P3    10   150  0     40
P4    300  100  1024  0

Initial assignment:

      M1  M2  M3
P1    1   0   0
P2    0   0   1
P3    0   1   0
P4    1   0   0

Cost = 800 + 95 + 200 + 1124 = 2219

New assignment (P4 moved from M1 to M2):

      M1  M2  M3
P1    1   0   0
P2    0   0   1
P3    0   1   0
P4    0   1   0

Cost = 825 + 95 + 160 + 400 = 1480

Gain: 739
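The arithmetic of the example can be checked with a short sketch. The cost rule used here, summing every recorded transfer between two partitions placed on different machines, is inferred from the per-row sums of the example rather than stated in the source; `assignment_cost` is a hypothetical name.

```python
# Transfer history: TRANSFER[src][dst] is the volume sent from src to dst.
TRANSFER = {
    "P1": {"P1": 0, "P2": 500, "P3": 300, "P4": 25},
    "P2": {"P1": 20, "P2": 0, "P3": 50, "P4": 25},
    "P3": {"P1": 10, "P2": 150, "P3": 0, "P4": 40},
    "P4": {"P1": 300, "P2": 100, "P3": 1024, "P4": 0},
}


def assignment_cost(assignment, transfer):
    """Sum the transfers between partitions hosted on different machines."""
    total = 0
    for src, row in transfer.items():
        for dst, volume in row.items():
            if assignment[src] != assignment[dst]:
                total += volume
    return total


BEFORE = {"P1": "M1", "P2": "M3", "P3": "M2", "P4": "M1"}
AFTER = {"P1": "M1", "P2": "M3", "P3": "M2", "P4": "M2"}

cost_before = assignment_cost(BEFORE, TRANSFER)
cost_after = assignment_cost(AFTER, TRANSFER)
print(cost_before, cost_after, cost_before - cost_after)
```

Moving P4 next to P3 removes the two largest cross-machine transfers involving that pair (1024 and 40) at the price of making the P1/P4 transfers (25 and 300) remote, which accounts for the gain of 739.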

[Figure: system architecture. Components: Query Formulation; Query Decomposition (n-predicates); Documents (RDF graphs); Semantic clustering; Structural clustering; Semantic/Structural clustering; Semantic Resource; Frequent subgraphs (n-predicates); Indexing; Predicates Similarity measures; Cooccurrence; Sources querying; Optimizing the access to sources and fragments. Labels: R, F1, F2, C1 to C5.]

[Figure: an example query evaluated over graph databases through an index of sources and documents. Query elements: Phds, Roald, Nobel Prize, Tunisia, Name (?), with edges Supervised by, in CV, visited.]

Semantic clustering: meta-information about the metadata of RDF graphs

Structural clustering: FSM analysis

Query decomposition, reducing:

• the number of requests to sources of information
• the number of aggregations to perform between fragments

Partitioning Strategies for RDF graphs

RDF data partitioning strategies [Akhter et al., 2018]:

• Random (e.g., horizontal)
• Structure-based [Hammoud et al., 2015]: partitioning by subject or predicate, or using the structure of the graph
• Semantic-based (e.g., hierarchical: resources with the same hierarchy prefix are often queried together)
• Workload-aware [Abdelaziz et al., 2017]: based on the query workload (use of frequent query patterns)

Semantic Clustering

Predicate Relatedness: predicates (edges) are projected as a set of concepts (nodes) in a Knowledge Representation resource (e.g., ontology, thesaurus).

Semantic Clustering - Semantic Relatedness

Measures of semantic similarity/distance in RDF graphs have been considered mainly in:

• Ontology alignment [David et al., 2008] [Harispe et al., 2013]
• Web recommendation systems [Leal et al., 2013]
• Approximate RDF search [Zheng et al., 2016]

No approach is available for computing the semantic relatedness [4] between predicates in an RDF graph and properties in a semantic resource.

The existing approaches use properties to calculate the semantic similarity between concepts and instances in a knowledge representation resource (e.g., ontology).

[4] https://www.xml.com/pub/a/2001/01/24/rdf.html

We use the descriptions of predicates to compare their semantic relatedness.

Two tools devoted to Semantic Text Similarity (STS) [Resnik, 1999]; they use two knowledge representation (KR) resources:

Tool: ADW [Pilehvar et al., 2013]
  KR resource: WordNet
  Similarity measure: Weighted Overlap
  Approach: alignment of pairs of words from the two texts + random walks

Tool: UMBC [Han et al., 2013]
  KR resource: Stanford WebBase project corpus + WordNet
  Similarity measure: LSA model using corpus word co-occurrence + WordNet relationships
  Approach: alignment of pairs of words from the two texts

Semantic Clustering: Semantic Relatedness

Example of semantic relatedness of the predicate “artist” with other predicates (using the ADW tool)

Semantic Clustering: Assumptions

• Predicates are annotated by their descriptions (i.e., rdfs:comment) or their labels (i.e., rdfs:label)
• Predicate labels with punctuation errors are fixed
• Predicates with no descriptive metadata are included using the local name of their URI (e.g., ArtistId => Artist Id)
• Highly similar predicates (e.g., work_in, has job) can be clustered together if inexact graph matching is to be performed
• Clusters are constructed based only on frequent predicates (frequency > support threshold)
• Infrequent predicates are assigned to the clusters of related frequent ones (relatedness with the top-N frequent predicates per cluster)
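The local-name fallback (ArtistId => Artist Id) can be sketched as follows. The splitting rules (underscores, hyphens and lower-to-upper case boundaries) and the example URIs are assumptions for illustration; the source only gives the single ArtistId example.

```python
import re


def local_name(uri):
    """Return the part of a URI after its last '/' or '#'."""
    return re.split(r"[/#]", uri.rstrip("/#"))[-1]


def words_from_local_name(uri):
    """Turn a predicate URI into a space-separated description."""
    name = local_name(uri)
    # Underscores and hyphens become spaces.
    name = re.sub(r"[_\-]+", " ", name)
    # Insert a space at each lower-case (or digit) to upper-case boundary.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return name.strip()


# Hypothetical URIs, used only to exercise the helper.
print(words_from_local_name("http://example.org/ontology/ArtistId"))
print(words_from_local_name("http://example.org/ontology#work_in"))
```

The resulting text can then be passed to a text similarity tool in place of a missing rdfs:comment or rdfs:label.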

Structural Clustering

Clustering (structurally) documents, represented as graphs, in an aggregated information retrieval system

Finding the most efficient implementation to use for an FSM analysis in a centralized graph transaction database

Results about the most efficient and available FSM solution [5]:

• Initial list: 32 FSM algorithms
• Final list: 13 solutions out of 6 filtered algorithms
• Datasets: 12 datasets from the literature

Computing the co-frequency of predicates and their corresponding subgraphs in the database (maximum subgraph size = 5)

[5] https://perso.liris.cnrs.fr/rihab.ayed/recherche.html

Experimental Settings

• DBPSB benchmark [6]: its number of predicates is larger than in other benchmarks (e.g., LUBM and SP2Bench) [Kim et al., 2015]
• DBpedia: a real cross-domain dataset
• The dataset is a sample of DBpedia with a scale factor x% (i.e., 10%, 50% or 100%) and two query sets involving 50 and 40 query templates
• We use the Fuseki SPARQL server with the Jena TDB triple store
• Clustering library: scikit-learn v0.20.2; algorithm: Spectral Clustering (using an affinity matrix)
• Two types of machines:
  - A master machine: performs clustering, sends partitions to each slave machine, decomposes the query, gets the intermediate results and aggregates them
  - Slave machines: store partitions of the dataset, query them and return partial results to the master machine

[6] https://github.com/dice-group/IGUANA/wiki/How-to-execute-DBPSB

Experimental Settings: Metrics

Study the effect of semantic clustering on aggregated results:

• Number of queried sources / all sources
• Number of requests sent to sources
• Number of aggregations in one source / all sources
• Completeness of results
• Runtime: the time spent by tasks devoted to query processing and aggregation of results
• Size of intermediate results and number of joins
• Number of local vs. external joins
• Partition quality: the representativeness of clustered subgraphs compared to the structure of graphs in the dataset
