Divide & Conquer-based Inclusion Dependency Discovery · Divide & Conquer-based Inclusion...

26
Divide & Conquer-based Inclusion Dependency Discovery Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quianè-Ruiz, Felix Naumann

Transcript of Divide & Conquer-based Inclusion Dependency Discovery · Divide & Conquer-based Inclusion...

1

Divide & Conquer-based Inclusion Dependency Discovery

Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quianè-Ruiz, Felix Naumann

ID Name Location Student Course

42 Database Systems A-1.2 Miller DBS 1

88 Database Systems B-2.2 Miller PT 1

73 Database Systems A-1.2 Smith DPDC

69 Algorithms in Java C-E.1 Miller PT 1

13 Algorithms in Java C-E.1 Smith DPDC

ID Name Location Student Course

42 Database Systems A-1.2 Miller DBS 1

88 Database Systems B-2.2 Miller PT 1

73 Database Systems A-1.2 Smith DPDC

69 Algorithms in Java C-E.1 Miller PT 1

13 Algorithms in Java C-E.1 Smith DPDC

Motivation Unary Inclusion Dependencies (INDs)

Title Author Price Pages Published

Database Systems Ullman 214 1203 2007

Algorithms in Java Sedgewick 130 768 2002

3D Computer Graphics Watt 20 570 1999

Lending

Book

2

Title Author Price Pages Published

Database Systems Ullman 214 1203 2007

Algorithms in Java Sedgewick 130 768 2002

3D Computer Graphics Watt 20 570 1999

Name ⊆ Title

ID Name Location Student Course

42 Database Systems A-1.2 Miller DBS 1

88 Database Systems B-2.2 Miller PT 1

73 Database Systems A-1.2 Smith DPDC

69 Algorithms in Java C-E.1 Miller PT 1

13 Algorithms in Java C-E.1 Smith DPDC

ID Name Location Student Course

42 Database Systems A-1.2 Miller DBS 1

88 Database Systems B-2.2 Miller PT 1

73 Database Systems A-1.2 Smith DPDC

69 Algorithms in Java C-E.1 Miller PT 1

13 Algorithms in Java C-E.1 Smith DPDC

Motivation N-ary Inclusion Dependencies (INDs)

Name Lecture Credit Semester Verified

Miller DBS 1 20 2 false

Miller PT 1 15 2 false

Smith DPDC 10 6 true

Lending

Student

3

Name Lecture Credit Semester Verified

Miller DBS 1 20 2 false

Miller PT 1 15 2 false

Smith DPDC 10 6 true

Student, Course ⊆ Name, Lecture

Motivation Necessity for IND Discovery

Independend Creation Technical Limitations

4

Motivation IND Detection Search Space

5

A B C D

AB AC AD BC BD CD

ABC ABD ACD BCD

ABCD

Test every combination with all other combina-

tions of same size!

No n-ary INDs here and above!

A ⊆ B A ⊆ C A ⊆ D

B ⊆ A B ⊆ C B ⊆ D

C ⊆ A C ⊆ B C ⊆ D

D ⊆ A D ⊆ B D ⊆ C

AB ⊆ CD AB ⊆ DC

AC ⊆ BD AC ⊆ DB

AD ⊆ BC AD ⊆ CB

BC ⊆ AD BC ⊆ DA

BD ⊆ AC BD ⊆ CA

CD ⊆ AB CD ⊆ BA

Motivation IND Detection Search Space

6

Unary IND detection:

O(n2)

for n attributes

N-ary IND detection:

O(2n · n!)

for n attributes

total 0 2 6 24 80 330 1302 5936 26784 133650 669350 3609672 19674096 113525594 664400310

15 0

14 0 0

13 0 0 0

12 0 0 0 0

11 0 0 0 0 0

10 0 0 0 0 0 0

9 0 0 0 0 0 0 0

8 0 0 0 0 0 0 0 0

7 0 0 0 0 0 0 0 17297280 259459200

6 0 0 0 0 0 0 665280 8648640 60540480 302702400

5 0 0 0 0 0 30240 332640 1995840 8648640 30270240 90810720

4 0 0 0 0 1680 15120 75600 277200 831600 2162160 5045040 10810800

3 0 0 0 120 840 3360 10080 25200 55440 110880 205920 360360 600600

2 0 0 12 60 180 420 840 1512 2520 3960 5940 8580 12012 16380

1 0 2 6 12 20 30 42 56 72 90 110 132 156 182 210

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Number of attributes: n

Nu

mb

er o

f le

vels

: k

IND candidates

Motivation Related Work

7

Bell: S. Bell and P. Brockhausen. Discovery of data dependencies in relational databases. Technical report, Universität Dortmund, 1995

Koeller: A. Koeller and E. A. Rundensteiner. Discovery of high-dimensional inclusion dependencies. In ICDE, 2002

Zigzag: F. D. Marchi and J.-M. Petit. Zigzag: A new algorithm for mining large inclusion dependencies in databases. In ICDM, 2003

SPIDER: J. Bauckmann, U. Leser, and F. Naumann. Efficiently computing inclusion dependencies for schema discovery. In ICDE Workshops, 2006

MIND: F. D. Marchi, S. Lopes, and J.-M. Petit. Unary and n-ary inclusion dependency discovery in relational databases. JIIS, 32:53–73, 2009

Clim: F. D. Marchi. CLIM: Closed inclusion dependency mining in databases. In ICDM Workshops, 2011

(best for unary INDs)

(best for n-ary INDs)

8

Experimental Evaluation

N-ary IND Discovery

Unary IND Discovery

Unary IND Discovery The Approach

Unary IND Discovery The Approach

No preparation

Unary IND Discovery The Approach

Hashing

Unary IND Discovery The Approach

Sorting

Unary IND Discovery The Approach

Bucketing

Unary IND Discovery The Approach

Bucketing

brick

N-ary IND Discovery The BINDER Algorithm

15

results

BINDER

(meta) data

data buckets

meta data

candidates

partition

INDs

Validator

Preprocessor

Bucketizer

Candidate Generator

dataset

attributes

Unary IND Discovery The BINDER Algorithm

16

z x

A x

B z

C

D

E z x

F x

G

y o

y o

y

o

v u

v u

v

u

s

t

t

t

s

qr

r

qr

q

A B C D E F G

X X

X X X X

X X X X

X X X X X

F A

s

zxzr qrvusz

A

yyxtycrtr o

B

tvttzutuzz

C

oyyyoosooy

D

vtyyvtvtyt

E

srxzzqzxxr

F G

xyycuyxqoo

Divide

Rel. 1 Rel. 2

Conquer X

attributes

dataflow

values

ignored

validation?

Dynamic Memory Handling:

Spill largest buckets to disk if memory is exhausted.

Lazy Partition Refinement:

Split a partition if it does not fit into main memory.

Unary IND Discovery The BINDER Algorithm

17

A

B

C

D

z y x v u y x v

z v x o

z

A C

y

x

o

v

u

A B

A B D

D

A B C

A

attr2value value2attr

B A C A

Never tested!

A B C D

look up B,C,D A,C,D A,B,D A,B,C

A z A,C C A,C,D A A,B,C

A y A,B - A A A,B,C

B x A,B,D - A A A,B

B v A,B,C - A A A,B

D o D - A A -

18

Experimental Evaluation

N-ary IND Discovery

Unary IND Discovery

N-ary IND Discovery The BINDER Algorithm

19

A B C D

AB AC AD BC BD CD

ABC ABD ACD BCD

ABCD

A ⊆ B A ⊆ C A ⊆ D

B ⊆ A B ⊆ C B ⊆ D

C ⊆ A C ⊆ B C ⊆ D

D ⊆ A D ⊆ B D ⊆ C

AB ⊆ CD AB ⊆ DC

AC ⊆ BD AC ⊆ DB

AD ⊆ BC AD ⊆ CB

BC ⊆ AD BC ⊆ DA

BD ⊆ AC BD ⊆ CA

CD ⊆ AB CD ⊆ BA

N-ary IND Discovery The BINDER Algorithm

20

results

BINDER

(meta) data

data buckets

meta data

candidates

partition

INDs

Validator

Preprocessor

Bucketizer

Candidate Generator

dataset

attributes

N-ary IND Discovery The BINDER Algorithm

21

AB CD

New Candidate?

1. Both valid:

A C

B D

2. Non-trivial:

A≠B≠C≠D

22

Experimental Evaluation

N-ary IND Discovery

Unary IND Discovery

Experimental Evaluation The Metanome Profiling Tool

23

Algorithm execution Result & Resource

management

Algorithm configuration Result & Resource

presentation

Configuration

Resource Links

SPIDER jar

txt tsv

xml csv

DB2 DB2

MySQL Results

DUCC jar

BINDER jar

DFD jar

MIND jar

www.metanome.de

Experimental Evaluation Unary lND Discovery

24

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

ru

nti

me in

% o

f lo

ng

est

ru

n

128 GB RAM Longest run:

0.6 5.2 28 66 1.2 1.1 1.8 4.6 18 21 5.9 7.5 14.2

sec sec sec sec min min min min min min h h h

SPIDER File BINDER File

Experimental Evaluation N-ary lND Discovery

25

> 1 Billion INDs

0,1

1

10

100

1000

10000

100000

0,1

1

10

100

1000

10000

100000

imp

rovem

en

t fa

cto

r

ru

nti

me [

sec]

Termination (7 days)

0,1110100100010000100000

0,11

10100

100010000

100000

Im

provem

en

t Faco

r

ru

nti

me [

sec]

BINDER MIND Factor (w/o >)

Up to 26x faster unary IND Discovery

Up to 2500x faster n-ary IND Discovery

No file-handle or memory limits

Divide & Conquer-based Inclusion Dependency Discovery

Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quianè-Ruiz, Felix Naumann