A Course on Probabilistic Databases

30
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014 Probabilistic Databases - Dan Suciu 1

description

A Course on Probabilistic Databases. Dan Suciu University of Washington. Outline. Part 1. Motivating Applications The Probabilistic Data ModelChapter 2 Extensional Query PlansChapter 4.2 The Complexity of Query EvaluationChapter 3 Extensional EvaluationChapter 4.1 - PowerPoint PPT Presentation

Transcript of A Course on Probabilistic Databases

Page 1: A Course on Probabilistic Databases

Probabilistic Databases - Dan Suciu 1

A COURSE ON PROBABILISTIC DATABASESDan Suciu

University of Washington

June, 2014

Page 2: A Course on Probabilistic Databases

Probabilistic Databases - Dan Suciu 2

Outline1. Motivating Applications

2. The Probabilistic Data Model Chapter 2

3. Extensional Query Plans Chapter 4.2

4. The Complexity of Query Evaluation Chapter 3

5. Extensional Evaluation Chapter 4.1

6. Intensional Evaluation Chapter 5

7. Conclusions

June, 2014P

art

1

Par

t

2P

art

3

Par

t

4

Page 3: A Course on Probabilistic Databases

3

Probabilistic Data Model: Outline• Review: Relational Data Model

• Incomplete Databases

• Probabilistic Databases (PDB)

• Query semantics

• Block Independent Disjoint

June, 2014 Probabilistic Databases - Dan Suciu

Page 4: A Course on Probabilistic Databases

4

Review: Relational Data Model

June, 2014 Probabilistic Databases - Dan Suciu

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

OwnerData: stored in relations (= tables)

Page 5: A Course on Probabilistic Databases

5

Review: Relational Data Model

June, 2014 Probabilistic Databases - Dan Suciu

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

Queries: SQL,

Data: stored in relations (= tables)

Find all owners of objects in the Office

-- SQL: e.g. postgres

SELECT DISTINCT Owner.nameFROM Owner, LocationWHERE Owner.object = Location.object and Location.loc = ‘Office’

Page 6: A Course on Probabilistic Databases

6

Review: Relational Data Model

June, 2014 Probabilistic Databases - Dan Suciu

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

OwnerData: stored in relations (= tables)

Queries: SQL,

Find all owners of objects in the Office

Unions of Conjunctive Queries

-- SQL: e.g. postgres

SELECT DISTINCT Owner.nameFROM Owner, LocationWHERE Owner.object = Location.object and Location.loc = ‘Office’

Note that x,t, are existentially quantified:

Q(z) = Owner(z,x), Location(x,t,y),y=‘Office’

Q(z) = x t (Owner(z,x), Location(x,t,’Office’))∃ ∃

Page 7: A Course on Probabilistic Databases

7

Review: Relational Data Model

June, 2014 Probabilistic Databases - Dan Suciu

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

Query answer: Q=Name

Joe

Jim

Data: stored in relations (= tables)

Queries: SQL,

Find all owners of objects in the Office

Unions of Conjunctive Queries

-- SQL: e.g. postgres

SELECT DISTINCT Owner.nameFROM Owner, LocationWHERE Owner.object = Location.object and Location.loc = ‘Office’

Note that x,t, are existentially quantified:

Q(z) = Owner(z,x), Location(x,t,y),y=‘Office’

Q(z) = x t (Owner(z,x), Location(x,t,’Office’))∃ ∃

Page 8: A Course on Probabilistic Databases

8

Review: Relational Data Model

June, 2014 Probabilistic Databases - Dan Suciu

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

Note that x,t, are existentially quantified:

Query answer: Q=Name

Joe

Jim

Q(z) = Owner(z,x), Location(x,t,y),y=‘Office’

Data: stored in relations (= tables)

Queries: SQL,

Find all owners of objects in the Office

Unions of Conjunctive Queries

Q(z) = x t (Owner(z,x), Location(x,t,’Office’))∃ ∃

-- SQL: e.g. postgres

SELECT DISTINCT Owner.nameFROM Owner, LocationWHERE Owner.object = Location.object and Location.loc = ‘Office’

Page 9: A Course on Probabilistic Databases

Review: Complexity of Query Evaluation

Query Q, database D

• Data complexity: fix Q, complexity = f(D)

• Query complexity: fix D, complexity = f(Q)

• Combined complexity: complexity = f(D,Q)

Moshe Vardi

This course

Data complexity is unique to database research

Page 10: A Course on Probabilistic Databases

10

Incomplete Database

June, 2014 Probabilistic Databases - Dan Suciu

Definition An Incomplete Database is a finite setof database instances W = (W1, W2, …, Wn)

Each Wi is calleda possible world

Page 11: A Course on Probabilistic Databases

11

Incomplete Database

June, 2014 Probabilistic Databases - Dan Suciu

Definition An Incomplete Database is a finite setof database instances W = (W1, W2, …, Wn)

Each Wi is calleda possible world

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

W1 W2 W3 W4

Page 12: A Course on Probabilistic Databases

12

Incomplete Database

June, 2014 Probabilistic Databases - Dan Suciu

Definition An Incomplete Database is a finite setof database instances W = (W1, W2, …, Wn)

Each Wi is calleda possible world

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

W1 W2 W3 W4

Page 13: A Course on Probabilistic Databases

13

Incomplete Database

June, 2014 Probabilistic Databases - Dan Suciu

Definition An Incomplete Database is a finite setof database instances W = (W1, W2, …, Wn)

Each Wi is calleda possible world

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Location

Name Object

Jim Laptop77

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

W1 W2 W3 W4

Page 14: A Course on Probabilistic Databases

14

Incomplete Database: Query Semantics

June, 2014 Probabilistic Databases - Dan Suciu

Definition Given query Q, incomplete database W:• An answer t is certain, if W∀ i, t ∈Q(Wi)• An answer t is possible if W∃ i, t ∈Q(Wi)

Page 15: A Course on Probabilistic Databases

15

Incomplete Database: Query Semantics

June, 2014 Probabilistic Databases - Dan Suciu

Definition Given query Q, incomplete database W:• An answer t is certain, if W∀ i, t ∈Q(Wi)• An answer t is possible if W∃ i, t ∈Q(Wi)

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Location

Name Object

Joe Laptop77

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Q(z) = Owner(z,x), Location(x,t,’Office’)

W1 W2 W3 W4

Page 16: A Course on Probabilistic Databases

16

W1 W2 W3 W4

Q= Q= Q= Q=

Incomplete Database: Query Semantics

June, 2014 Probabilistic Databases - Dan Suciu

Definition Given query Q, incomplete database W:• An answer t is certain, if W∀ i, t ∈Q(Wi)• An answer t is possible if W∃ i, t ∈Q(Wi)

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Location

Name Object

Joe Laptop77

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Joe

Jim

Joe Joe Joe

Jim

Q(z) = Owner(z,x), Location(x,t,’Office’)

Page 17: A Course on Probabilistic Databases

17

W1 W2 W3 W4

Q= Q= Q= Q=

Incomplete Database: Query Semantics

June, 2014 Probabilistic Databases - Dan Suciu

Definition Given query Q, incomplete database W:• An answer t is certain, if W∀ i, t ∈Q(Wi)• An answer t is possible if W∃ i, t ∈Q(Wi)

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Location

Name Object

Joe Laptop77

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Joe

Jim

Joe Joe Joe

Jim

Certain answers to Q: JoePossible answers to Q: Joe, Jim

Q(z) = Owner(z,x), Location(x,t,’Office’)

Page 18: A Course on Probabilistic Databases

18

Probabilistic Database

June, 2014 Probabilistic Databases - Dan Suciu

Definition A Probabilistic Database is (W, P), where W is an incompletedatabase, and P: W [0,1] a probability distribution: Σi=1,n P(Wi) = 1

Page 19: A Course on Probabilistic Databases

19

Probabilistic Database

June, 2014 Probabilistic Databases - Dan Suciu

Definition A Probabilistic Database is (W, P), where W is an incompletedatabase, and P: W [0,1] a probability distribution: Σi=1,n P(Wi) = 1

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Location

Name Object

Jim Laptop77

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

W1 W2 W3 W4

0.3 0.4 0.2 0.1

Page 20: A Course on Probabilistic Databases

20

Probabilistic Database: Query Semantics

June, 2014 Probabilistic Databases - Dan Suciu

Definition Given query Q, probabilistic database (W,P):• The marginal probability of an answer t is:

P(t) = Σ { P(Wi) | Wi ∈ W, t ∈Q(Wi) }

Page 21: A Course on Probabilistic Databases

21

Probabilistic Database: Query Semantics

June, 2014 Probabilistic Databases - Dan Suciu

Definition Given query Q, probabilistic database (W,P):• The marginal probability of an answer t is:

P(t) = Σ { P(Wi) | Wi ∈ W, t ∈Q(Wi) }

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Location

Name Object

Jim Laptop77

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

W1 W2 W3 W4

0.3 0.4 0.2 0.1

Q(z) = Owner(z,x), Location(x,t,’Office’)

Page 22: A Course on Probabilistic Databases

22

W1 W2 W3 W4

0.3 0.4 0.2 0.1

Q= Q= Q= Q=

Probabilistic Database: Query Semantics

June, 2014 Probabilistic Databases - Dan Suciu

Definition Given query Q, probabilistic database (W,P):• The marginal probability of an answer t is:

P(t) = Σ { P(Wi) | Wi ∈ W, t ∈Q(Wi) }

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Location

Name Object

Joe Laptop77

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Joe

Jim

Joe Joe Joe

Jim

Q(z) = Owner(z,x), Location(x,t,’Office’)

Page 23: A Course on Probabilistic Databases

23

W1 W2 W3 W4

0.3 0.4 0.2 0.1

Q= Q= Q= Q=

Probabilistic Database: Query Semantics

June, 2014 Probabilistic Databases - Dan Suciu

Definition Given query Q, probabilistic database (W,P):• The marginal probability of an answer t is:

P(t) = Σ { P(Wi) | Wi ∈ W, t ∈Q(Wi) }

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Joe Laptop77

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Location

Name Object

Joe Laptop77

Owner

Object Time Loc

Laptop77 5:07 Hall

Laptop77 9:05 Office

Book302 8:18 Office

Location

Name Object

Joe Book302

Jim Laptop77

Fred GgleGlass

Owner

Joe

Jim

Joe Joe Joe

Jim

P(Joe) = 1.0P(Jim) = 0.4

Q(z) = Owner(z,x), Location(x,t,’Office’)

Page 24: A Course on Probabilistic Databases

24

Discussion• Intuition: a probabilistic database says that the database

can be in one of possible states, each with a probability

• Possible query answers: a set of answers annotated with probabilities:

(t1, p1), (t2, p2), (t3, p3), …

Usually: p1 ≥ p2 ≥ p3 ≥ …

• Problem: the number of possible world in a probabilistic database is astronomically large. To represent it, we impose some restrictions

June, 2014 Probabilistic Databases - Dan Suciu

Page 25: A Course on Probabilistic Databases

25

Independent, Disjoint Tuples

June, 2014 Probabilistic Databases - Dan Suciu

Definition Given a probabilistic database (W, P).Two tuples t1, t2 are called:• Independent, if: P(t1 t2) = P(t1) P(t2)• Disjoint (or exclusive), if: P(t1t2) = 0

Page 26: A Course on Probabilistic Databases

26

Independent, Disjoint Tuples

June, 2014 Probabilistic Databases - Dan Suciu

Definition A probabilistic database is calledBlock-Independent-Disjoint (BID), if its tuples are groupedinto blocks such that:• Tuples from the same block are disjoint• Tuples from different blocks are independent

Definition Given a probabilistic database (W, P).Two tuples t1, t2 are called:• Independent, if: P(t1 t2) = P(t1) P(t2)• Disjoint (or exclusive), if: P(t1t2) = 0

Page 27: A Course on Probabilistic Databases

27

Example: BID Table

W={Object Time Loc

Laptop77 9:07 Rm444

Book302 9:18 Office

Object Time Loc

Laptop77 9:07 Rm444

Book302 9:18 Rm444

Object Time Loc

Laptop77 9:07 Rm444

Book302 9:18 Lift

Object Time Loc

Laptop77 9:07 Hall

Book302 9:18 Office

Object Time Loc

Laptop77 9:07 Hall

Book302 9:18 Rm444

Object Time Loc

Laptop77 9:07 Hall

Book302 9:18 Lift

Object Time Loc

Laptop77 9:07 Rm444Object Time Loc

Laptop77 9:07 HallObject Time Loc

Book302 9:18 OfficeObject Time Loc

Book302 9:18 Rm444Object Time Loc

Book302 9:18 LiftObject Time Loc

}p1p3p1p4

p1(1- p3-p4-p5)

Possibleworlds

BID Table

June, 2014 Probabilistic Databases - Dan Suciu

Disjoint

Inde- pen- dent

Disjoint

Object Time Loc P

Laptop77 9:07 Rm444 p1

Laptop77 9:07 Hall p2

Book302 9:18 Office p3

Book302 9:18 Rm444 p4

Book302 9:18 Lift p5

Page 28: A Course on Probabilistic Databases

Probabilistic Databases - Dan Suciu 28

The Query Evaluation Problem

Given: a BID database D, a query Q, and output tuple t

Compute: P(t)

Note: D has, say, 1000000 tuples, while the number of possible worlds is 21000000

Challenge: compute P(t) efficiently, in the size of D

June, 2014

Data complexity: the complexity of P depends dramatically on Q

Page 29: A Course on Probabilistic Databases

29

x y P

a1 b1 q1

a1 b2 q2

a2 b3 q3

a2 b4 q4

a2 b5 q5

S

x P

a1 p1

a2 p2

a3 p3

R

P(Q) = 1-(1-q1)*(1-q2)p1*[ ]

1-(1-q3)*(1-q4)*(1-q5)p2*[ ]

1- {1- } *

{1- }

SELECT DISTINCT ‘true’FROM R, SWHERE R.x = S.x

An Example

June, 2014 Probabilistic Databases - Dan Suciu

Boolean query

Q() = R(x), S(x,y)

One can compute P(Q) in PTIME in the size of the database D

Page 30: A Course on Probabilistic Databases

30

Summary of the Probabilistic Data Model

• Possible Worlds Semantics: very powerful but difficult to represent

• Block-Independent-Disjoint databases have efficient representation: D is stored in a traditional database

• Independent Databases: even simpler

Main challenge: evaluate Q efficiently in the size of D

June, 2014 Probabilistic Databases - Dan Suciu

This tutorial