Scalable Multi-Query Optimization for SPARQL

Wangchao Le (1), Anastasios Kementsietsidis (2), Songyun Duan (2), Feifei Li (1)

(1) University of Utah, (2) IBM Research

April 2, 2012

Outline

1 Introduction

2 Preliminary

3 Our approach

4 Experiments

5 Conclusions

Introduction

We are inundated with a large collection of RDF (Resource Description Framework) data.

DBpedia, UniProt, Freebase, etc.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="urn:x-states:New York">
    <dcterms:alternative>NY</dcterms:alternative>
  </rdf:Description>
</rdf:RDF>

Internally, the data is stored in triple format (subject predicate object):

<http://. . ./New York> <http://purl.org/dc/terms/alternative> "NY" .

Query language: SPARQL

The data forms a large graph and encodes rich semantics.

Available engines to manage RDF data?

RDBMS: migrate RDF, e.g., Sesame, Jena SDB, etc.
Generic RDF stores: e.g., RDF-3X, Jena TDB, etc.

[VP07] D.J. Abadi, et al. Scalable semantic web data management using vertical partitioning. In VLDB, 2007.

[HEX08] C. Weiss, et al. Hexastore: sextuple indexing for semantic web data management. In VLDB, 2008.

[RDF3X] T. Neumann, G. Weikum. RDF-3X: a RISC-style engine for RDF. In VLDB, 2008.

[SJP09] T. Neumann, G. Weikum. Scalable Join Processing on Very Large RDF Graphs. In SIGMOD, 2009.

[BM10] M. Atre, et al. Matrix "Bit" loaded: A Scalable Lightweight Join Query Processor for RDF Data. In WWW, 2010.

[SSQ11] J. Huang, et al. Scalable SPARQL Querying of Large RDF Graphs. In VLDB, 2011.

Introduction

[Figure: a batch of SPARQL queries Q1, Q2, Q3, ..., Qn−1, Qn issued against an RDF store]

• Observation: queries share common parts
• Multi-query optimization

Introduction

A tempting choice: turn to MQO in relational databases [MQO88][MQO90][MQO00]

SPARQL ↔ relational algebra [EPS08][FSR07].
Quite a few relational solutions exist for RDF stores.

For SPARQL and RDF, new issues arise in practice.

Converting SPARQL to SQL: not all engines use an RDBMS
Conversion to SQL → a large number of joins
Store-dependent solution

[MQO88] T. Sellis, et al. Multiple-query optimization. In TODS, 1988.

[MQO90] T. Sellis, et al. On the Multiple-Query Optimization Problem. In TKDE, 1990.

[MQO00] P. Roy, et al. Efficient and extensible algorithms for multi query optimization. In SIGMOD, 2000.

[EPS08] R. Angles, et al. The Expressive Power of SPARQL. In ISWC, 2008.

[FSR07] A. Polleres, et al. From SPARQL to rules (and back). In WWW, 2007.

Preliminary

We focus on two types of queries

Type 1: Q := SELECT RD WHERE GP

Type 2: QOPT := SELECT RD WHERE GP (OPTIONAL GPOPT)+

(a) Triple table D

subj  pred  obj
p1    name  "Alice"
p1    zip   10001
p1    mbox  alice@home
p1    mbox  alice@work
p1    www   http://home/alice
p2    name  "Bob"
p2    zip   10001
p3    name  "Ella"
p3    zip   10001
p3    www   http://work/ella
p4    name  "Tim"
p4    zip   "11234"

(b) Example query QOPT

SELECT ?name ?mail ?hpage
WHERE { ?x name ?name . ?x zip 10001 .
        OPTIONAL { ?x mbox ?mail }
        OPTIONAL { ?x www ?hpage } }

(c) Output QOPT(D)

name     mail        hpage
"Alice"  alice@home  http://home/alice
"Alice"  alice@work  http://home/alice
"Bob"
"Ella"               http://work/ella

Problem statement.

Input: a set Q of Type 1 queries and a data graph G
Output: a set QOPT of rewritten queries, of Type 1 and Type 2
Requirements:

soundness and completeness: QOPT(G) ≡ Q(G).

cost: $\frac{T_r(Q) + T_e(Q_{OPT})}{T_e(Q)} \le 1$
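The OPTIONAL semantics of the example can be traced with a small sketch. It assumes the triple table D is held as a list of Python tuples with literals stored as plain strings, and it uses hypothetical helper names (`objects`, `evaluate_qopt`). It hard-codes this one query shape, where both OPTIONAL blocks join only on ?x, and is not how an RDF store evaluates queries.

```python
from itertools import product

# Triple table D from the example (literals kept as plain strings).
D = [
    ("p1", "name", "Alice"), ("p1", "zip", "10001"),
    ("p1", "mbox", "alice@home"), ("p1", "mbox", "alice@work"),
    ("p1", "www", "http://home/alice"),
    ("p2", "name", "Bob"), ("p2", "zip", "10001"),
    ("p3", "name", "Ella"), ("p3", "zip", "10001"),
    ("p3", "www", "http://work/ella"),
    ("p4", "name", "Tim"), ("p4", "zip", "11234"),
]

def objects(subj, pred):
    """All objects o such that (subj, pred, o) is in D, in table order."""
    return [o for s, p, o in D if s == subj and p == pred]

def evaluate_qopt():
    """Evaluate the example Type 2 query as a left outer join on ?x."""
    rows = []
    for x in sorted({s for s, _, _ in D}):
        # Required part: ?x zip 10001 and ?x name ?name.
        if "10001" not in objects(x, "zip"):
            continue
        for name in objects(x, "name"):
            # OPTIONAL parts: an empty match leaves the variable unbound (None).
            mails = objects(x, "mbox") or [None]
            pages = objects(x, "www") or [None]
            for mail, page in product(mails, pages):
                rows.append((name, mail, page))
    return rows

for row in evaluate_qopt():
    print(row)
```

Alice appears twice because she has two mbox values; Bob is kept with both optional columns unbound, and Tim is dropped because his zip does not match.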

Motivating example

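[Figure: (a) Query Q1 and (b) Query Q2 drawn as graphs. Nodes are variables (?x1, ?y1, ?z1, ?w1 in Q1; ?x2, ?y2, ?z2, ?w2, ?t2 in Q2) or the constant v1; edges are labeled with predicates P1 to P4 in Q1 and P1 to P5 in Q2. The legend distinguishes constants from variables.]

Q1: SELECT * WHERE { ?x1 P1 ?z1 . ?y1 P2 ?z1 . ?y1 P3 ?w1 . ?w1 P4 v1 . }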

Motivating example

(I) Structure only QOPT

SELECT *
WHERE { ?x P1 ?z . ?y P2 ?z .
        OPTIONAL { ?y P3 ?w . ?w P4 v1 }
        OPTIONAL { ?t P3 ?x . ?t P5 v1 . ?w P4 v1 } }

[Figure: graph of the rewritten query. The common part (?x P1 ?z, ?y P2 ?z) is marked "Evaluated once → potential saving"; the two OPTIONAL branches attach to it.]

OPTIONALs are evaluated on top of the common substructures (intermediate results cached by the engine).
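The shape of such a rewriting can be assembled mechanically once the common patterns and the per-query remainders are known. The sketch below only does the string assembly; it assumes variables have already been unified across the input queries, and `build_type2` is a hypothetical helper name, not something defined in the slides.

```python
def build_type2(common, optionals):
    """Assemble a Type 2 query string.

    common    -- triple patterns shared by all input queries
    optionals -- one list of remaining triple patterns per input query
    """
    lines = ["SELECT *", "WHERE { " + " . ".join(common) + " ."]
    for group in optionals:
        lines.append("  OPTIONAL { " + " . ".join(group) + " }")
    lines.append("}")
    return "\n".join(lines)

# The structure-only rewriting of Q1 and Q2 from the motivating example.
query = build_type2(
    common=["?x P1 ?z", "?y P2 ?z"],
    optionals=[
        ["?y P3 ?w", "?w P4 v1"],
        ["?t P3 ?x", "?t P5 v1", "?w P4 v1"],
    ],
)
print(query)
```

Finding the common patterns and the variable mapping is the hard part and is not covered here.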

Motivating example

(II) Using cost in optimization

pattern p    α(p)
?x P1 ?z     30%
?y P2 ?z     20%
?y P3 ?w     18%
?w P4 v1     1%
?t P5 v1     2%

*Max common subquery is not selective
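A small sketch makes the point of the selectivity table concrete. It compares two candidate common subqueries of Q1 and Q2: the two patterns shared structurally (?x P1 ?z, ?y P2 ?z) and the single shared pattern ?w P4 v1. Scoring a candidate by the minimum selectivity of its patterns follows the approximation named in the slides; the rule of preferring the candidate with the smaller score, and the names `min_selectivity` and `pick_common`, are assumptions made for illustration.

```python
# Selectivities alpha(p) from the table, as fractions.
ALPHA = {
    "?x P1 ?z": 0.30,
    "?y P2 ?z": 0.20,
    "?y P3 ?w": 0.18,
    "?w P4 v1": 0.01,
    "?t P5 v1": 0.02,
}

def min_selectivity(patterns):
    """Approximate the cost of a common subquery by its most selective pattern."""
    return min(ALPHA[p] for p in patterns)

def pick_common(candidates):
    """Assumed rule: prefer the candidate whose minimum selectivity is smallest."""
    return min(candidates, key=min_selectivity)

structural = ["?x P1 ?z", "?y P2 ?z"]  # largest common structure
selective = ["?w P4 v1"]               # smaller, but far more selective

print(min_selectivity(structural))  # 0.2
print(min_selectivity(selective))   # 0.01
print(pick_common([structural, selective]))
```

Under this rule the one-pattern candidate wins even though it shares less structure, which is the situation the starred note describes.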

Motivating example

(II) Using cost in optimization: rewriting around the selective common pattern ?w P4 v1

SELECT *
WHERE { ?w P4 v1 .
        OPTIONAL { ?x1 P1 ?z1 . ?y1 P2 ?z1 . ?y1 P3 ?w }
        OPTIONAL { ?x2 P1 ?z2 . ?y2 P2 ?z2 . ?t2 P3 ?x2 . ?t2 P5 v1 } }

Our approach

Q = {q1, q2, . . . , qn}

The input queries often do not share one common subquery.

1. Partition input queries → Group 1, Group 2, ..., Group k
   • Similar queries can be optimized together
   • Finding structural similarity is expensive
   • Group by predicates
   • Distance: Jaccard similarity of predicate sets

2. Rewriting (within each group)
   • Recursively (hierarchically) rewrite a subset of Type 1 queries → a set of Type 2 queries
   • Finding common edge subgraphs
   • Optimizations to avoid bad efficiency
   • Cost: guard against bad rewritings
   • Approximated by the min selectivity in the common subquery

3. Execution

4. Result distribution → r(q1), r(q2), ..., r(qn)

Our approach

Related issues

Distributing results, i.e., Type 2 query → Type 1 queries

[Table: example result of the Type 2 rewriting with columns X, Y, Z and cell values a to g]

RD of a Type 1 query (e.g., X and Z) ↔ columns from the results of the Type 2 rewriting

Soundness and completeness

Extensibility of the solution: more general queries

Handle variable predicates
OPTIONAL queries
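One way to picture result distribution is as a projection over the Type 2 result table. The sketch below is an assumption about how this could be done, not the authors' procedure: a row belongs to a Type 1 query only if all of that query's variables are bound, duplicates over those variables are collapsed, and the query's RD columns are then projected. The function name `distribute` and the sample rows are made up for illustration; the rows are not a reconstruction of the table on the slide.

```python
def distribute(rows, columns, required, wanted):
    """Extract one Type 1 query's answers from Type 2 result rows.

    rows     -- result tuples of the Type 2 query (None = unbound)
    columns  -- column names of the Type 2 result, in order
    required -- variables of the Type 1 query that must all be bound
    wanted   -- the Type 1 query's RD, i.e. the columns to return
    """
    pos = {c: i for i, c in enumerate(columns)}
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[pos[c]] for c in required)
        # Skip rows where this query's OPTIONAL branch did not match,
        # and rows that only differ in columns of other branches.
        if any(v is None for v in key) or key in seen:
            continue
        seen.add(key)
        out.append(tuple(row[pos[c]] for c in wanted))
    return out

# Hypothetical Type 2 result with columns X, Y, Z.
columns = ["X", "Y", "Z"]
rows = [
    ("a", "b", None),
    ("c", None, "d"),
    ("e", "f", "g"),
    ("e", "h", "g"),
]
print(distribute(rows, columns, required=["X", "Z"], wanted=["X", "Z"]))
```

Here the first row is dropped because Z is unbound, and the last row is dropped because it repeats the (X, Z) pair of the row before it.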

Experiments

Implementation highlights

C++
64-bit Linux, 2GHz Xeon(R) CPU, 4GB memory

Dataset

Extend LUBM benchmark generator: randomness in structure, variances of selectivity

[Figure: selectivity (%) per predicate ID (1 to 50), log-scale y-axis from 10^-1 to 10^1; predicates are marked as selective or non-selective]

RDF stores: Jena TDB 0.85, etc.

Queries

Ensure randomness in structure, e.g., star, chain and circle

Parameter                                Symbol    Default   Range
Dataset size                             D         4M        3M to 9M
Number of queries                        |Q|       100       60 to 160
Size of query (num of triple patterns)   |Q|       6         5 to 9
Number of seed queries                   κ         6         5 to 10
Size of seed queries                     |qcmn|    ∼|Q|/2    1 to 5
Max selectivity of patterns in Q         αmax(Q)   random    0.1% to 4%
Min selectivity of patterns in Q         αmin(Q)   1%        0.1% to 4%

Rewriting w/ structure: MQO-S; rewriting w/ structure and cost: MQO

Experiments

Time on rewriting

MQO-S-C: structure-based rewriting
MQO-C: rewriting integrated with cost

[Figure: rewriting time (seconds, 0 to 1.5) vs. number of queries |Q| (60 to 160), for MQO-S-C and MQO-C]

*Costly/bad rewritings are rejected → more rounds of comparisons.

Experiments

Time on distributing results

MQO-S-P: parsing results from MQO-S
MQO-P: parsing results from MQO

[Figure: result distribution time (seconds, 0 to 1.5) vs. number of queries |Q| (60 to 160), for MQO-S-P and MQO-P]

*Non-selective common subqueries increase the set of results.
*Both rewriting and parsing are efficiently doable.

Experiments

Varying num of queries in a batch

No-MQO: no optimization
MQO-S: optimization based on structural rewriting
MQO: integrating cost

[Figure: number of queries executed (0 to 150) vs. |Q| (60 to 160), for No-MQO, MQO-S and MQO]

*Both reduce the num of queries to be executed

[Figure: time (seconds, 0 to 400) vs. |Q| (60 to 160), for No-MQO, MQO-S and MQO]

Experiments

Varying min. selectivity in seed queries

No-MQO: no optimization
MQO-S: optimization based on structural rewriting
MQO: integrating cost

[Figure: number of queries executed (0 to 100) vs. αmin(qcmn) in % (0.1, 0.5, 1, 2, 4), for No-MQO, MQO-S and MQO]

*MQO: rejects more bad rewritings; MQO-S: not sensitive

[Figure: time (seconds, 0 to 300) vs. αmin(qcmn) in % (0.1, 0.5, 1, 2, 4), for No-MQO, MQO-S and MQO]

Experiments

Varying seed size

MQO-S: optimization based on structural rewriting
MQO: integrating cost

percentage = $\frac{T_e(\text{common subquery})}{T_e(Q_{OPT})} \times 100\%$

[Figure: percentage (%, 60 to 100) vs. seed size |qcmn| (1 to 5), for MQO-S and MQO, with a dashed No-OPTIONAL reference line]

*MQO-S: up to 25% time on optional

Conclusions

In dealing with RDF data on the Web, store independence is important

Combining the SPARQL language and graph algorithms can achieve MQO, i.e., by rewriting queries

Cost must be taken into consideration during rewriting

The End

Thank You

Q and A


Our approach

Partition input queries

Object: similar queries can be optimized together in rewriting

Represent each query as a set of predicates.

Measure the similarity of a pair of queries by set similarity

Grouping: k-means

73 / 113

Our approach

Partition input queries

Object: similar queries can be optimized together in rewriting

Represent each query as a set of predicates.

Measure the similarity of a pair of queries by set similarity

Grouping: k-means

74 / 113

Our approach

Partition input queries

Object: similar queries can be optimized together in rewriting

Represent each query as a set of predicates.

Measure the similarity of a pair of queries by set similarity

Grouping: k-means

75 / 113

Our approach

Partition input queries

Object: similar queries can be optimized together in rewriting

p1

p2 p3

p1

p3p2 p4

p1

p2 p3

p7

p8p9

p7

p7

p8

p9 p9

p8

p7p10

p2

p1

p1p3

p4

p4

p2p3

p3

p2p2 p1

p1p3

p4 p1

p2

p4

p2

p1p3

p7

p9p8

p9

p8p6

p9

p7p7

q1 q2 q3 q4 q5 q6

q12q11q10q9q8q7

p4

Represent each query as a set of predicates.

Measure the similarity of a pair of queries by set similarity

Grouping: k-means

76 / 113

Our approach

Partition input queries

Object: similar queries can be optimized together in rewriting

p1

p2 p3

p1

p3p2 p4

p1

p2 p3

p7

p8p9

p7

p7

p8

p9 p9

p8

p7p10

p2

p1

p1p3

p4

p4

p2p3

p3

p2p2 p1

p1p3

p4 p1

p2

p4

p2

p1p3

p7

p9p8

p9

p8p6

p9

p7p7

q1 q2 q3 q4 q5 q6

q12q11q10q9q8q7

p4

Distance: Jaccard similarity on predicates

Represent each query as a set of predicates.

Measure the similarity of a pair of queries by set similarity

Grouping: k-means

77 / 113

Our approach

Partition input queries

Object: similar queries can be optimized together in rewriting

[Figure: example queries q1 to q12 drawn as graph patterns with predicate-labeled edges, each annotated with its set of predicates]

q1 q2 q3 q4 q5 q6

{p1,p2,p3} {p1,p2,p3} {p1,p2,p3} {p7,p8,p9} {p7,p8,p9} {p7,p8,p9,p10}

q12 q11 q10 q9 q8 q7

{p1,p2,p3,p4} {p2,p3,p4} {p1,p2,p3,p4} {p1,p2,p4} {p7,p8,p9} {p5,p6,p7,p9}

Distance: Jaccard similarity on predicates

Represent each query as a set of predicates.

Measure the similarity of a pair of queries by set similarity

Grouping: k-means

78 / 113
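The grouping step can be sketched as follows. The slides name k-means; because the mean of several predicate sets is not itself a set, this sketch swaps in a medoid-based variant (each group is represented by one of its own queries) under Jaccard distance. That substitution, the farthest-first seeding, and the names `group_queries` and `QUERIES` are all assumptions for illustration, not the authors' implementation.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def distance(a, b):
    """Jaccard distance between two predicate sets."""
    return 1.0 - jaccard(a, b)


def group_queries(pred_sets, k, max_iter=20):
    """Group query names into at most k groups (k-medoids-style, assumed variant)."""
    names = sorted(pred_sets)
    # Farthest-first seeding keeps the sketch deterministic.
    medoids = [names[0]]
    while len(medoids) < k:
        medoids.append(max(
            names,
            key=lambda n: min(distance(pred_sets[n], pred_sets[m]) for m in medoids),
        ))
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each query joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for n in names:
            nearest = min(medoids, key=lambda m: distance(pred_sets[n], pred_sets[m]))
            clusters[nearest].append(n)
        # Update step: the new medoid minimizes total distance inside its group.
        new_medoids = [
            min(members, key=lambda c: sum(distance(pred_sets[c], pred_sets[o]) for o in members))
            for members in clusters.values() if members
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return sorted(sorted(members) for members in clusters.values() if members)


# Predicate sets of q1..q6 as listed on the slide.
QUERIES = {
    "q1": {"p1", "p2", "p3"}, "q2": {"p1", "p2", "p3"}, "q3": {"p1", "p2", "p3"},
    "q4": {"p7", "p8", "p9"}, "q5": {"p7", "p8", "p9"}, "q6": {"p7", "p8", "p9", "p10"},
}
print(group_queries(QUERIES, k=2))
# [['q1', 'q2', 'q3'], ['q4', 'q5', 'q6']]
```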

Our approach

Hierarchical rewriting and clustering (inside a group)

Rewrite pairs of queries bottom up

Pair up queries with max Jaccard similarity

[Figure: bottom-up rewriting hierarchy over the queries qa, qb, qc, qd, ..., qr, qs; pairs are rewritten into qab and qcd, which are in turn rewritten into qabcd]

Rewriting → finding maximal common triple patterns

In the language of graphs:

maximal common connected edge subgraphs

→ maximal common connected induced subgraphs in linegraphs

→ maximal cliques in the product graph

[CLQ01] I. Koch. Enumerating all connected maximal common subgraphs in two graphs. In Theoretical Computer Science, 2001.

90 / 113
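One level of the bottom-up pairing can be sketched as a greedy matching: consider all pairs inside a group, take the pair with the highest Jaccard similarity first, and never reuse a query at the same level. The function name `pair_up`, the tie-breaking by query name, and the rule that queries with zero similarity stay unpaired are assumptions for illustration; the slides only state that queries are paired by maximum Jaccard similarity.

```python
from itertools import combinations


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pair_up(pred_sets):
    """Greedily pair queries by descending Jaccard similarity.

    Returns (pairs, leftover): the chosen pairs and the queries left unpaired.
    """
    candidates = sorted(
        ((jaccard(pred_sets[a], pred_sets[b]), a, b)
         for a, b in combinations(sorted(pred_sets), 2)),
        key=lambda t: (-t[0], t[1], t[2]),  # best similarity first, names break ties
    )
    remaining = set(pred_sets)
    pairs = []
    for sim, a, b in candidates:
        # Assumption: queries with nothing in common are not worth pairing.
        if sim > 0 and a in remaining and b in remaining:
            pairs.append((a, b))
            remaining -= {a, b}
    return pairs, sorted(remaining)


group = {
    "qa": {"p1", "p2", "p3"},
    "qb": {"p1", "p2", "p3", "p4"},
    "qc": {"p7", "p8", "p9"},
    "qd": {"p7", "p8", "p9", "p10"},
    "qe": {"p5"},
}
print(pair_up(group))
# ([('qa', 'qb'), ('qc', 'qd')], ['qe'])
```

Each chosen pair would then be handed to the rewriting step, and the resulting rewritten queries form the next level of the hierarchy.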

Our approach

[Figure: four example graph patterns, (a) Query Q1, (b) Query Q2, (c) Query Q3, (d) Query Q4, over variables such as ?x, ?y, ?z, ?w, ?t, ?u, the constant v1, and predicates P1 to P6]

• Linegraph: invert vertices and edges

• sub-sub: ℓ0, sub-obj: ℓ1, obj-sub: ℓ2, obj-obj: ℓ3

[Figure: linegraphs (a) L(Q1), (b) L(Q2), (c) L(Q3), (d) L(Q4); vertices are predicates, edges carry labels ℓ0 to ℓ3]

• product graph: simultaneous walk

• blowup in size, esp. > 2 queries

  affect clique detection

• optimize the product graph

[Figure: product graph of graph query pattern 1 and graph query pattern 2]

The triangle (clique) highlights the common subgraph composed by

98 / 113
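The linegraph construction can be sketched as follows: every triple pattern becomes a vertex, and two patterns are joined by a directed, labeled edge when they share a subject or object term. The label records which positions coincide, using the ℓ0 to ℓ3 encoding from the slide (sub-sub 0, sub-obj 1, obj-sub 2, obj-obj 3). The example query, the function name `linegraph`, and the choice to treat any shared term (variable or constant) as a join are assumptions for illustration.

```python
# Edge labels from the slide: position in the first pattern, position in the second.
LABELS = {("s", "s"): 0, ("s", "o"): 1, ("o", "s"): 2, ("o", "o"): 3}


def linegraph(triples):
    """Return the labeled, directed edges of the linegraph of a query.

    triples: list of (subject, predicate, object); vertex i stands for triples[i].
    Each edge is (i, j, label) where label indexes into l0..l3.
    """
    edges = set()
    for i, (s1, _, o1) in enumerate(triples):
        for j, (s2, _, o2) in enumerate(triples):
            if i == j:
                continue
            for pos1, term1 in (("s", s1), ("o", o1)):
                for pos2, term2 in (("s", s2), ("o", o2)):
                    if term1 == term2:
                        edges.add((i, j, LABELS[(pos1, pos2)]))
    return edges


# Hypothetical query, not one of Q1..Q4 from the figure.
query = [
    ("?x", "P1", "?z"),
    ("?y", "P2", "?z"),   # shares object ?z with pattern 0   -> l3 both ways
    ("?x", "P3", "?w"),   # shares subject ?x with pattern 0  -> l0 both ways
    ("?w", "P4", "v1"),   # its subject is pattern 2's object -> l2 / l1
]
for edge in sorted(linegraph(query)):
    print(edge)
```

Because the edges are directed, a subject-object join shows up as an ℓ1 edge in one direction and an ℓ2 edge in the other, which matches the paired ℓ1/ℓ2 labels in the linegraph figure.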

Our approach

• optimize the product graph

  • prune non-common predicates
  • check the constants
  • prune vertices with non-common neighborhoods

[Figure: L(GPp) with vertices P1 and P2 and edge labels ℓ3; S: P3, P4]

101 / 113
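The product-graph step and the clique search can be sketched together. The sketch builds a modular product of two linegraphs: a product vertex pairs two patterns with the same predicate (so non-common predicates never enter the graph), and two product vertices are adjacent when the labeled edges between their components agree in both linegraphs, including the case where both are absent. Maximal cliques are then listed with the basic Bron-Kerbosch recursion. The input encoding, the function names, and the omission of the connectivity constraint that [CLQ01] enforces are simplifying assumptions.

```python
from itertools import combinations


def product_graph(labels1, edges1, labels2, edges2):
    """labels: {vertex: predicate}; edges: {(u, v): label} for directed edges."""
    # Pairing only equal predicates prunes non-common predicates up front.
    nodes = [(u, v) for u in labels1 for v in labels2 if labels1[u] == labels2[v]]
    adj = {n: set() for n in nodes}
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if u1 == u2 or v1 == v2:
            continue
        # .get returns None for a missing edge, so "no edge" must match "no edge".
        forward = edges1.get((u1, u2)) == edges2.get((v1, v2))
        backward = edges1.get((u2, u1)) == edges2.get((v2, v1))
        if forward and backward:
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return adj


def maximal_cliques(adj):
    """Basic Bron-Kerbosch enumeration (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for n in list(p):
            expand(r | {n}, p & adj[n], x & adj[n])
            p = p - {n}
            x = x | {n}

    expand(frozenset(), set(adj), set())
    return cliques


# Two hypothetical linegraphs (label 3 = obj-obj, label 0 = sub-sub).
labels1 = {0: "P1", 1: "P2", 2: "P3"}
edges1 = {(0, 1): 3, (1, 0): 3, (0, 2): 0, (2, 0): 0}
labels2 = {0: "P1", 1: "P2", 2: "P3", 3: "P5"}
edges2 = {(0, 1): 3, (1, 0): 3, (1, 2): 0, (2, 1): 0, (0, 3): 0, (3, 0): 0}

cliques = maximal_cliques(product_graph(labels1, edges1, labels2, edges2))
best = max(cliques, key=len)
print(sorted(labels1[u] for u, _ in best))  # ['P1', 'P2']
```

In this example P3 occurs in both queries but joins differently (with P1 in the first, with P2 in the second), so it cannot extend the clique formed by P1 and P2.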

Our approach

Find maximal cliques in the product graph [CLQ02][CLQ03]

Integrate cost into rewriting

Structure: maximize size of the common subquery in a rewriting

Evaluation on cost: guard against bad rewritings

Measure: min selectivity in the common subquery for approximation

Cost: discard bad rewritings, keep good ones in hierarchical rewriting

[Figure: rewriting hierarchy over qa, qb, qc, qd, ..., qr, qs, with intermediate rewritings qab and qcd and the top-level rewriting qabcd. Stop when the selectivity drops]

[CLQ02] Patric R.J. Östergård. A fast algorithm for the maximum clique problem. In Discrete Applied Mathematics, 2002.

[CLQ03] E. Tomita et al. An efficient branch-and-bound algorithm for finding a maximum clique. In Discrete Mathematics and Theoretical Computer Science, LNCS, 2003.

109 / 113
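The cost guard can be illustrated with a small sketch. It assumes that each triple pattern has an estimated selectivity expressed as the fraction of triples it matches (smaller means more selective), that the common subquery is approximated by the minimum of these values, and that a merge one level up is kept only if this minimum does not get worse. The function name `accept_merge`, the numbers, and this reading of "stop when the selectivity drops" are assumptions; the slides do not spell out the exact rule.

```python
def accept_merge(child_common, merged_common, selectivity):
    """Decide whether to keep a rewriting one level up in the hierarchy.

    child_common / merged_common: sets of patterns (here identified by predicate)
    in the common subquery before and after the merge.
    selectivity: pattern -> estimated fraction of triples matched (assumed input).
    """
    if not merged_common:
        return False  # nothing is shared, so there is nothing to gain
    before = min(selectivity[p] for p in child_common)
    after = min(selectivity[p] for p in merged_common)
    # Keep the merge only if the most selective shared pattern survives.
    return after <= before


# Hypothetical estimates: P2 is the most selective pattern.
selectivity = {"P1": 0.30, "P2": 0.02, "P3": 0.10}
child = {"P1", "P2", "P3"}

print(accept_merge(child, {"P1", "P2"}, selectivity))  # True: P2 is still shared
print(accept_merge(child, {"P1"}, selectivity))        # False: only a weak pattern is left
```

Under these assumptions the hierarchy stops growing at the first level where the shared part no longer contains a selective pattern, since evaluating a large unselective common subquery could cost more than running the queries separately.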

Our approach

Related issues

Distributing results, i.e., Type 2 query → Type 1 queries

name      mail         hpage
"Alice"   alice@home   http://home/alice
"Alice"   alice@work   http://home/alice
"Bob"
"Ella"                 http://work/ella

RD of a Type 1 query

↑↓ columns from results of the Type 2 rewriting

Soundness and completeness

Extensibility of the solution: more general queries

handle variable predicates

nested OPTIONALs

113 / 113
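Distributing the results of a Type 2 rewriting back to the original Type 1 queries can be sketched as a projection: each Type 1 query owns a subset of the columns, rows where any of those columns is unbound are skipped, and repeated projected rows are dropped. The rows follow the example table above (with the Ella URL read as an hpage value); the function name `distribute`, the use of `None` for unbound cells, and the duplicate elimination are assumptions for illustration.

```python
# Result of the Type 2 rewriting; None marks an unbound OPTIONAL column.
ROWS = [
    {"name": "Alice", "mail": "alice@home", "hpage": "http://home/alice"},
    {"name": "Alice", "mail": "alice@work", "hpage": "http://home/alice"},
    {"name": "Bob", "mail": None, "hpage": None},
    {"name": "Ella", "mail": None, "hpage": "http://work/ella"},
]


def distribute(rows, columns):
    """Project the shared result onto the columns of one Type 1 query."""
    seen, out = set(), []
    for row in rows:
        values = tuple(row[c] for c in columns)
        # Skip rows that do not bind every column, and skip repeats.
        if None in values or values in seen:
            continue
        seen.add(values)
        out.append(values)
    return out


print(distribute(ROWS, ("name", "mail")))
# [('Alice', 'alice@home'), ('Alice', 'alice@work')]
print(distribute(ROWS, ("name", "hpage")))
# [('Alice', 'http://home/alice'), ('Ella', 'http://work/ella')]
```

The second call shows why duplicates need attention: Alice's home page appears twice in the shared result only because her two mail addresses multiply the rows.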