
Searching and Integrating Information on the Web

Seminar 4: Ranking Queries and Data Privacy

Professor Chen Li

UC Irvine

Seminar 3 2

Outline and readings

• Ranking Queries
  – Fagin, R., Combining Fuzzy Information from Multiple Systems, PODS 1996.
  – Fagin et al., Optimal Aggregation Algorithms for Middleware, PODS 2001.

• Data privacy:
  – Database-as-service: Executing SQL over Encrypted Data in the Database-Service-Provider Model. Hakan Hacigumus, Bala Iyer, Chen Li, and Sharad Mehrotra. SIGMOD 2002.
  – XML data publishing: Secure XML Publishing without Information Leakage in the Presence of Data Inference. Xiaochun Yang and Chen Li. To appear in VLDB'04.


Outline

• Ranking Queries

• Data privacy:
  – XML Data publishing
  – Database-as-service


Top-k queries

1. Finding the k multi-attribute tuples with the highest scores.

2. Scoring function: aggregates the scores on individual attributes, e.g., w1*A1 + … + wn*An, where wi is the weight for attribute Ai.

3. Monotone aggregation functions: if tuple A’s grade is at least as high as tuple B’s on every attribute, then A’s overall grade is at least as high as B’s.

Car ID   Mileage   Year   Price
1        10000     1997   20000
2        20000     2000   11000
3        17000     1998   12000
4        15000     1990   8000
5        5000      1990   12000
6        15000     1990   5000
7        12000     1985   5000
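As a rough illustration of a monotone scoring function over the car table, the sketch below normalizes each attribute to [0, 1] and ranks the cars by a weighted sum. The normalization, the assumption that lower mileage, a newer year, and a lower price are better, and the weights are all choices made for this example; the slides do not specify them.

```python
import heapq

# Car table from the slide: (car_id, mileage, year, price).
CARS = [
    (1, 10000, 1997, 20000),
    (2, 20000, 2000, 11000),
    (3, 17000, 1998, 12000),
    (4, 15000, 1990, 8000),
    (5, 5000, 1990, 12000),
    (6, 15000, 1990, 5000),
    (7, 12000, 1985, 5000),
]


def normalize(values, higher_is_better):
    """Map raw values to grades in [0, 1], where 1 is the best value."""
    lo, hi = min(values), max(values)
    return [((v - lo) if higher_is_better else (hi - v)) / (hi - lo) for v in values]


def top_k(cars, weights, k):
    """Return the k best (score, car_id) pairs under a weighted-sum score."""
    # Assumed preferences: low mileage, recent year, low price.
    mileage = normalize([c[1] for c in cars], higher_is_better=False)
    year = normalize([c[2] for c in cars], higher_is_better=True)
    price = normalize([c[3] for c in cars], higher_is_better=False)
    w_mileage, w_year, w_price = weights
    scored = [
        (w_mileage * m + w_year * y + w_price * p, car[0])
        for car, m, y, p in zip(cars, mileage, year, price)
    ]
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    for score, car_id in top_k(CARS, (1, 1, 1), 2):
        print(car_id, round(score, 3))
```

With non-negative weights the weighted sum is monotone: raising any single attribute grade can never lower the overall score.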


Applications

• Multimedia databases
• Web search queries:
  – Restaurants
  – Houses
  – Cars
  – …


Modes of Data Access (Fagin)

Underlying Middleware (e.g., Search engines, Garlic, QBIC) supports 2 modes:

1. Sorted access:
   - Attribute Ai (a column) forms a list Li sorted by the score of Ai.
   - The list is output one entry at a time.

2. Random access:
   - Ask the system for the grade of any given object.

Goal: minimize the total cost to get the top-k results.

[Figure: three sorted lists, one each for price, mileage, and year; the lists begin (a, c, e, ...), (b, e, f, ...), and (a, d, e, ...).]

FA: Fagin’s algorithm [PODS96]

1. Do sorted access in parallel to each of the m sorted lists Li. Wait until there is a set H of at least k objects such that each of these objects has been seen in each of the m lists.

2. For each object R that has been seen, do random access as needed to each of the lists Li to find the i-th field xi of R.

3. Compute the aggregate results.
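The three steps can be sketched as follows. This is a minimal in-memory sketch, not the authors' implementation: the grades are made up (chosen so that the three lists begin like the lists in the slide figure), random access is simulated with dictionaries, and the aggregation function defaults to a plain sum.

```python
# Hypothetical grades (0-100). Each list is sorted by grade, descending.
LISTS = [
    [("a", 90), ("c", 80), ("e", 70), ("b", 30), ("d", 20), ("f", 10)],
    [("b", 90), ("e", 80), ("f", 70), ("a", 60), ("c", 20), ("d", 10)],
    [("a", 90), ("d", 80), ("e", 70), ("b", 30), ("c", 20), ("f", 10)],
]


def fagin_algorithm(lists, k, agg=sum):
    """Return (top-k as [(grade, object)], sorted-access depth per list)."""
    m = len(lists)
    grade_of = [dict(lst) for lst in lists]  # stands in for random access
    seen_in = {}  # object -> indices of the lists it was seen in
    depth = 0
    # Step 1: sorted access in parallel until k objects were seen in all m lists.
    while depth < len(lists[0]):
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            seen_in.setdefault(obj, set()).add(i)
        depth += 1
        if sum(len(s) == m for s in seen_in.values()) >= k:
            break
    # Steps 2 and 3: look up the grades of every seen object and aggregate.
    # (For brevity all grades are looked up, not only the missing ones.)
    scored = [(agg(g[obj] for g in grade_of), obj) for obj in seen_in]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:k], depth


if __name__ == "__main__":
    print(fagin_algorithm(LISTS, 1))
```

With these grades and k = 1, sorted access stops at depth 3, when e has been seen in all three lists, yet the best object is a, which is why the random-access step is needed.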


Example:

1. Suppose k = 1. Given the three partial lists retrieved so far, ‘e’ appears in all of them. We can say that the top 1 tuple must be in {a,b,c,e,d,f}.

2. Reason: since the function is monotonic, tuple ‘e’ “blocks” all tuples below, since they can only have a smaller overall grade than ‘e’.

3. The algorithm does random access for these tuples to get their missing grades (‘e’ needs none, since it has been seen in all three lists), and picks the top 1.

4. Notice that we cannot say ‘e’ must be the top 1, since other tuples (e.g., ‘a’) may still have a higher overall score.

5. Minor point: one possible improvement – ‘f’ can never be better than ‘e’.

[Figure: the price, mileage, and year lists, beginning (a, c, e, ...), (b, e, f, ...), and (a, d, e, ...), with a cut-off line drawn across the lists.]

General case

1. Once k tuples have appeared in all the partial lists, halt.
2. Reason: these k tuples block all the tuples below, which cannot be better than these k tuples.
3. Do random access for the retrieved tuples to get their overall grades, and find the top-k.

[Figure: the price, mileage, and year lists with a cut-off line; k tuples appear above the line in every list.]

FA’s Properties

1. Can correctly find top-k results for monotone aggregation functions

2. Cost over a database with N objects: O(N^((m-1)/m) * k^(1/m)), with arbitrarily high probability, assuming the m sorted lists are independent.


FA’s Drawbacks

• The number of sorted accesses is still large.
• Since all seen tuples must be buffered, the required buffer size is unbounded.

• Does not exploit the bound given by the aggregation function to determine when to stop sorted access.


TA: Threshold Algorithm [PODS2001]

1. Do sorted access in parallel to each of the m sorted lists. As an object R is seen under sorted access in some list, do random access to the other lists to find the grade xi of R in each of them. Then compute the aggregate grade t(x1, …, xm) of R. If this grade is one of the k highest seen so far, remember R and its grade; otherwise discard it.

2. For each list Li, let xi be the grade of the last object seen under sorted access. Define the threshold value T to be t(x1, …, xm). As soon as at least k objects have been seen whose grade is at least T, halt.

3. Return the k objects that have been seen with the highest grades.
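A minimal sketch of these steps, under the same assumptions as any toy version: the grades are made up, random access is simulated with dictionaries, the aggregation function t defaults to a sum, and the threshold is checked once per round of parallel sorted access. Only a buffer of at most k entries is kept.

```python
import heapq

# Hypothetical grades (0-100). Each list is sorted by grade, descending.
LISTS = [
    [("a", 90), ("c", 80), ("e", 70), ("b", 30), ("d", 20), ("f", 10)],
    [("b", 90), ("e", 80), ("f", 70), ("a", 60), ("c", 20), ("d", 10)],
    [("a", 90), ("d", 80), ("e", 70), ("b", 30), ("c", 20), ("f", 10)],
]


def threshold_algorithm(lists, k, agg=sum):
    """Return (top-k as [(grade, object)], sorted-access depth per list)."""
    grade_of = [dict(lst) for lst in lists]  # stands in for random access
    buffer = []  # min-heap of (overall grade, object), never more than k entries
    for depth in range(len(lists[0])):
        last_seen = []  # grade of the last object seen in each list
        for lst in lists:
            obj, grade = lst[depth]
            last_seen.append(grade)
            if any(o == obj for _, o in buffer):
                continue  # already among the current top-k
            overall = agg(g[obj] for g in grade_of)
            if len(buffer) < k:
                heapq.heappush(buffer, (overall, obj))
            elif overall > buffer[0][0]:
                heapq.heapreplace(buffer, (overall, obj))
        threshold = agg(last_seen)
        # Halt once k objects have a grade of at least the threshold.
        if len(buffer) == k and buffer[0][0] >= threshold:
            return sorted(buffer, reverse=True), depth + 1
    return sorted(buffer, reverse=True), len(lists[0])


if __name__ == "__main__":
    print(threshold_algorithm(LISTS, 1))
```

On this data TA halts after 2 rounds for k = 1, one round earlier than an FA-style cut-off, which would wait until some object has appeared in all three lists.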


Example:

1. A buffer keeps the top-k tuples that have been found so far.
2. For any tuple in a sorted list, do a random access to get its overall grade. Compare it with the tuples in the buffer queue, and decide whether to insert it or discard it.

3. The threshold window (the last record read from each of the m lists) represents the best grade any unseen tuple could have, assuming we could combine the best values from different tuples.

4. Notice that this window may not be “horizontal” if we access different lists at different speeds.

5. This window helps us decide when to stop: once we find k tuples whose grades are at least equal to the grade of the window tuple, we halt.

[Figure: the price, mileage, and year lists, beginning (a, c, e, ...), (b, e, f, ...), and (a, d, e, ...), with a buffer for the top-k and a threshold window across the lists.]

TA’s Properties

1. TA is instance optimal for all monotone aggregation functions and over every database.
2. Compared to FA, TA requires only a small, constant-size buffer.
3. TA allows early stopping.
   – Can show TA never stops later than FA. (Why?)

4. There are times when the user is satisfied with an approximate top-k list. TA can be modified to give such an approximation.

5. TA can be modified for the case where random access is impossible.

Instance Optimality

1. Algorithm b is instance optimal over a set of algorithms A and a set of database instances D if b is in A and, for every algorithm a in A and every instance d in D, we have cost(b,d) = O(cost(a,d)).

2. Similar to “competitive ratio”.
3. Essentially: b is the best algorithm in A.
4. Stronger than “optimality in the worst case”.
5. TA is instance optimal among all “correct algorithms” (nondeterministic algorithms).

[Figure: the set A of algorithms, containing b and another algorithm a.]
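Written out, the O(·) in the definition stands for an inequality with two constants that do not depend on the algorithm a or the instance d. This follows the usual reading of the definition in Fagin et al. (PODS 2001), where the constant c is referred to as the optimality ratio; the symbols c and c' are just names for the two constants.

```latex
\exists\, c, c' > 0 \ \text{ such that } \ \forall a \in \mathbf{A},\ \forall d \in \mathbf{D}: \quad
\mathrm{cost}(b, d) \;\le\; c \cdot \mathrm{cost}(a, d) + c'
```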


Variations of TA

• NRA: When no random access is possible
  – Example: Web search engines, which typically do not allow you to enter a URL and get its ranking

• TAZ: When no sorted access is possible for some predicates
  – Example: Find good restaurants near location x (sorted and random access for restaurant ratings, random access only for distances from a mapping site)

• CA: When the relative costs of random and sorted accesses matter.

• TAθ: When only approximate answers are needed
  – Example: Web search, with lots of good-quality answers

Outline

• Ranking Queries

• Data privacy:
  – XML Data publishing
  – Database-as-service


Motivation

• Privacy in publishing XML data
• Applications:
  – Web publishing
  – Data sharing and exchange, e.g., in P2P systems

Example: Hospital XML data

[Figure: XML tree rooted at hospital. Physician nodes (phname Smith and Walker) have treat children naming the patients Alice, Betty, and Cathy. Each patient node has pname, disease, and ward children: patient(1) Alice, leukemia, W305; patient(2) Betty, leukemia, W305; patient(3) Cathy, leukemia, W305; patient(4) Tom, cancer, W403.]

Goal: hide Alice’s disease

Common knowledge: patients in the same ward have the same disease

Problem

Given:
• An XML document to be published
• Sensitive data in the document
• Common knowledge that public users can use to do data inference

Find:
• A partial document to be released so that users cannot infer the sensitive data

Research challenges

• How to model data inference using common knowledge?

• How to compute all possible inferred data?

• How to compute a partial document to be published without leaking sensitive information?


Roadmap

• Information Leakage
  – Defining sensitive data
  – Describing common knowledge
  – Computing inferred documents

• Prevent information leakage

Defining sensitive data

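• Using an XQuery, called a “regulating query”
• A special node marked “*” indicates the sensitive data

[Figure: two regulating queries drawn as tree patterns. A1: a patient with name Alice whose disease node is marked “*”. A2: under hospital, a pname node with value Cathy marked “*”.]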

Example 1

[Figure: regulating query A1 (a patient named Alice whose disease node is marked “*”) mapped onto the hospital document with patients Alice, Betty, and Cathy, each with disease leukemia and ward W305.]

• Map the query to the XML tree
• For each mapping, the target of the * node is sensitive.

Example 2

[Figure: regulating query A2 (under hospital, a pname node with value Cathy marked “*”) mapped onto the hospital document with patients Alice, Betty, and Cathy.]

Common Knowledge

• Represented as XML constraints

• Could be obtained in various ways, e.g.,
  – a possible schema
  – analysis of the published data

Common Constraints

• Child constraints: //p → //p/c
  e.g., //patient → //patient/pname (every patient has a pname child)

• Descendant constraints: //p → //p//d
  e.g., //patient → //patient//disease (every patient has a disease descendant)

• Functional dependencies: //p/a → //p/b
  e.g., //patient/ward → //patient/disease: for two patients with wards w1, w2 and diseases d1, d2, if w1 = w2, then d1 = d2 (value equal)
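To make the three kinds of constraints concrete, the sketch below checks whether each one holds on a small XML document, using Python's standard xml.etree.ElementTree module. The document text and the helper names (holds_child, holds_descendant, holds_fd) are assumptions made for this illustration; they are not taken from the paper.

```python
import xml.etree.ElementTree as ET

# A small hospital document in the spirit of the running example.
HOSPITAL = """
<hospital>
  <patient><pname>Alice</pname><disease>leukemia</disease><ward>W305</ward></patient>
  <patient><pname>Betty</pname><disease>leukemia</disease><ward>W305</ward></patient>
  <patient><pname>Tom</pname><disease>cancer</disease><ward>W403</ward></patient>
</hospital>
"""


def holds_child(root, parent, child):
    """Child constraint //parent -> //parent/child."""
    return all(p.find(child) is not None for p in root.iter(parent))


def holds_descendant(root, parent, desc):
    """Descendant constraint //parent -> //parent//desc."""
    return all(p.find(".//" + desc) is not None for p in root.iter(parent))


def holds_fd(root, parent, lhs, rhs):
    """Functional dependency //parent/lhs -> //parent/rhs (value equality)."""
    value_for = {}
    for p in root.iter(parent):
        a, b = p.findtext(lhs), p.findtext(rhs)
        if a is None or b is None:
            continue  # the dependency only speaks about nodes that have both
        if value_for.setdefault(a, b) != b:
            return False
    return True


if __name__ == "__main__":
    root = ET.fromstring(HOSPITAL)
    print(holds_child(root, "patient", "pname"))
    print(holds_descendant(root, "patient", "disease"))
    print(holds_fd(root, "patient", "ward", "disease"))
```

A user who knows that such constraints hold on the original document can use them in the other direction, to fill in nodes that a published partial document leaves out.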


Modify partial document using constraints

C1: //patient → //patient/pname
C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease

[Figure: partial document P. Under hospital, patient(1) has disease leukemia and ward W305; patient(2) has ward W305 and no disease node.]

Apply C1 on document P

C1: //patient → //patient/pname

[Figure: C1(P), the partial document P extended with the pname child that C1 implies.]

Apply C2 on document P

C2: //patient → //patient//disease

[Figure: C2(P), the partial document P extended with a disease node attached to patient(2) as a floating branch.]

• Floating branch: exact location unknown

Apply C3 on document P

C3: //patient/ward → //patient/disease

[Figure: C3(P), the partial document P extended with a disease child of patient(2) whose value is leukemia.]

Apply a sequence of constraints: <C2,C3>

C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease

[Figure: the result of <C2,C3>. patient(2) has both a floating disease branch (from C2) and a disease child with value leukemia (from C3).]

Another user applies a different sequence of constraints: <C3,C2>

C2: //patient → //patient//disease
C3: //patient/ward → //patient/disease

[Figure: the result of <C3,C2>. patient(2) has a disease child with value leukemia and no floating branch.]

After applying C3, we cannot use C2 to expand the tree. No more floating branch!
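The order dependence can be reproduced with a toy model. In the sketch below a partial document is a list of patients, each with a dictionary of known children and a list of floating branches; apply_c2, apply_c3, and m_contained are hypothetical helpers written for this illustration, and m_contained is a much-simplified, patient-by-patient stand-in for the mapping-based definition.

```python
import copy

# Toy partial document P: patient(1) has a disease and a ward, patient(2) only a ward.
P = [
    {"children": {"disease": "leukemia", "ward": "W305"}, "floating": []},
    {"children": {"ward": "W305"}, "floating": []},
]


def apply_c2(doc):
    """C2: every patient has a disease descendant; add a floating branch if none is known."""
    out = copy.deepcopy(doc)
    for patient in out:
        if "disease" not in patient["children"] and "disease" not in patient["floating"]:
            patient["floating"].append("disease")
    return out


def apply_c3(doc):
    """C3: same ward implies same disease; copy a known disease value across a ward."""
    out = copy.deepcopy(doc)
    disease_of_ward = {
        p["children"]["ward"]: p["children"]["disease"]
        for p in out
        if "ward" in p["children"] and "disease" in p["children"]
    }
    for patient in out:
        ward = patient["children"].get("ward")
        if "disease" not in patient["children"] and ward in disease_of_ward:
            patient["children"]["disease"] = disease_of_ward[ward]
    return out


def m_contained(p, q):
    """Toy check: each node of p maps to a node of q; a floating branch may map to a path."""
    for a, b in zip(p, q):
        for label, value in a["children"].items():
            if b["children"].get(label) != value:
                return False
        for label in a["floating"]:
            if label not in b["floating"] and label not in b["children"]:
                return False
    return len(p) == len(q)


p1 = apply_c3(apply_c2(P))  # sequence <C2,C3>
p2 = apply_c2(apply_c3(P))  # sequence <C3,C2>

if __name__ == "__main__":
    print(p1 != p2, m_contained(p1, p2), m_contained(p2, p1))
```

The two results differ as data structures, but each maps into the other, because the floating disease branch left by C2 can be mapped onto the concrete disease child added by C3.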


They look different!

• P1 is “m-contained” in P2:– There is a mapping from P1 to P2.

– A floating branch can be mapped to a path.

– The m-containing document P2 has more information

• P2 is also “m-contained” in P1.

• Thus they are “m-equivalent”!

[Figure: the two inferred documents side by side. P1, the result of <C2,C3>, gives patient(2) both a floating disease branch and a disease child with value leukemia. P2, the result of <C3,C2>, gives patient(2) only the disease child with value leukemia.]

What documents can users infer?

• Different users can use different sequences of constraints to do inference

• Thus they can infer different documents
• Questions:
  – Can an inference process terminate?
  – What inferred document should we consider to prevent leakage of sensitive data?

Theorem

• Given a partial document P of an XML document D and a set of constraints C = {C1, …, Ck}, there is a document M that can be inferred from P using a sequence of constraints, such that:
  – for any sequence of constraints, its resulting document is m-contained in M.
• M can be computed using a greedy approach.
• Such a document is unique under m-equivalence.

Information leakage

• For a partial document P, if there exists a regulating query A such that the maximal inferred document M can produce a non-empty answer to the query A, then we say “P causes information leakage.”

[Figure: inference applied to partial document P, followed by a test against regulating query A.]
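The leakage test can be sketched on a flattened version of the running example. The sketch assumes a single constraint (patients in the same ward have the same disease), represents each patient as a dictionary, and hard-codes the regulating query "Alice's disease"; maximal_inferred, regulating_query, and causes_leakage are hypothetical names chosen for this illustration.

```python
def maximal_inferred(patients):
    """Greedily apply ward -> disease until no new value can be inferred."""
    doc = [dict(p) for p in patients]
    changed = True
    while changed:
        changed = False
        known = {p["ward"]: p["disease"] for p in doc if p.get("ward") and p.get("disease")}
        for p in doc:
            if not p.get("disease") and p.get("ward") in known:
                p["disease"] = known[p["ward"]]
                changed = True
    return doc


def regulating_query(doc, name="Alice"):
    """Return the disease values of the named patient, if any are present."""
    return [p["disease"] for p in doc if p.get("pname") == name and p.get("disease")]


def causes_leakage(partial):
    """P leaks if the regulating query has a non-empty answer on the inferred document."""
    return bool(regulating_query(maximal_inferred(partial)))


# D - A(D): Alice's disease removed, everything else published.
NAIVE = [
    {"pname": "Alice", "ward": "W305", "disease": None},
    {"pname": "Betty", "ward": "W305", "disease": "leukemia"},
    {"pname": "Cathy", "ward": "W305", "disease": "leukemia"},
]

# Additionally withholding Alice's ward blocks the inference.
SAFER = [dict(NAIVE[0], ward=None)] + NAIVE[1:]

if __name__ == "__main__":
    print(causes_leakage(NAIVE), causes_leakage(SAFER))
```

Removing only the sensitive node is not enough here: Betty's record together with the ward constraint lets a user recover Alice's disease.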


Roadmap

• Information Leakage
• Prevent information leakage

Formal Problem

• Given an XML document D, a regulating query A, and common knowledge represented as constraints C1, …, Ck:
  – How to find a partial document P without information leakage?
  – Called a valid partial document
• The empty document is a trivial one
• We want the published document to have as much data as possible

An algorithm

• We develop an algorithm for solving this problem

• We use the running example to illustrate the algorithm


Example

[Figure: the hospital document D with patient(1) Alice, patient(2) Betty, and patient(3) Cathy, each with disease leukemia and ward W305, next to regulating query A (a patient named Alice whose disease node is marked “*”).]

Functional dependency: //patient/ward → //patient/disease

Remove sensitive data A(D)

Remaining document: D - A(D)

[Figure: the hospital document with leukemia(1), the value of Alice’s disease, removed, next to regulating query A.]

Compute the maximal inferred document M of D-A(D)

Maximal inferred document: M

[Figure: the hospital document in which leukemia(1) reappears as an inferred value, next to regulating query A.]

Testing Information Leakage

[Figure: regulating query A mapped onto the maximal inferred document.]

There is a mapping from A to the maximal inferred document M of P, so information is leaked.

Computing a valid partial document

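[Figure: the document D split into A(D) and D - A(D). Inference on the remaining document yields a document onto which regulating query A maps; the mapping is broken and the inference steps are chased back, repeatedly.]

How to break the mappings?
How to chase back the inference steps?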

AND/OR Graphs

• A structure representing how a goal can be reached by solving subproblems.

• We use such graphs to formulate the process of finding a valid partial document

[Figure: the maximal inferred document, regulating query A, and the first level of the AND/OR graph: START has an OR connector to the nodes leukemia(1) and Alice(1).]

• Consider the mapping images of the leaf nodes in A.
• An “OR” connector shows that solving any of the subproblems can solve the parent problem.

[Figure: the AND/OR graph expanded one more level. START has an OR connector to Alice(1) and leukemia(1); leukemia(1) has an AND connector leading to OR connectors whose successors are disease and ward nodes such as leukemia(2), W305(1), and W305(3).]

• Multiple ways to infer the sensitive data.
• An “AND” connector shows that solving ALL the subproblems can solve the parent problem.

[Figure: the same AND/OR graph with further AND and OR connectors added below.]

• Continue expanding the AND/OR graph

AND/OR Graphs (cont)

• A special START node represents the goal of computing a valid partial document.

• The graph has nodes corresponding to nodes in the maximal inferred document M.

• Such a node represents the subproblem of hiding its corresponding node n in M:
  – This node n should be removed from M.
  – It cannot be inferred using the constraints and other nodes in M.

Solution graphs

• A connected subgraph (of the AND/OR graph) including the START node
• For each node in the subgraph, its successor connectors are also in the subgraph.

• If it contains an OR connector, it must also contain one of the connector's successors.

• If it contains an AND connector, it must also contain all the successors of the connector.
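The rules for solution graphs can be sketched as a recursive search over a small AND/OR graph. The graph below is a hand-built, truncated approximation of the running example, not the graph from the paper: the intermediate "via patient(n)" entries stand for the OR connectors under leukemia(1), and the ward and disease nodes under them are treated as leaves. The search picks the smallest option at each OR connector, which is a greedy choice and not guaranteed to be globally minimal.

```python
# node -> (connector kind, successors). Nodes without an entry are leaves.
GRAPH = {
    "START": ("OR", ["Alice(1)", "leukemia(1)"]),
    "leukemia(1)": ("AND", ["via patient(2)", "via patient(3)"]),
    "via patient(2)": ("OR", ["W305(1)", "W305(2)", "leukemia(2)"]),
    "via patient(3)": ("OR", ["W305(1)", "W305(3)", "leukemia(3)"]),
}

# Graph nodes that correspond to nodes of the maximal inferred document M.
DOCUMENT_NODES = {
    "Alice(1)", "leukemia(1)", "leukemia(2)", "leukemia(3)",
    "W305(1)", "W305(2)", "W305(3)",
}


def nodes_to_remove(node, graph=GRAPH):
    """Return the document nodes of one solution graph rooted at `node`."""
    removed = {node} if node in DOCUMENT_NODES else set()
    if node not in graph:
        return removed
    kind, successors = graph[node]
    options = [nodes_to_remove(s, graph) for s in successors]
    if kind == "AND":
        # An AND connector needs all of its successors.
        return removed.union(*options)
    # An OR connector needs only one successor; take the cheapest.
    return removed | min(options, key=len)


if __name__ == "__main__":
    print(nodes_to_remove("START"))
    print(nodes_to_remove("leukemia(1)"))
```

Under these assumptions the search from START chooses to remove Alice(1), while solving the leukemia(1) subproblem removes leukemia(1) together with W305(1); both match the shape of the example solution graphs in the slides.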


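Example solution graphs

[Figure: two solution graphs. In the first, START has an OR connector to Alice(1). In the second, START has an OR connector to leukemia(1), whose AND connector leads through two OR connectors to W305(1).]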

Computing a valid partial document using a solution graph

• For a solution graph G, for each node in G, we remove the corresponding node in M to get a valid partial document

[Figure: the two example solution graphs (START to Alice(1); START to leukemia(1) and W305(1)) shown next to the hospital document from which the corresponding nodes are removed.]

Constructing an AND/OR Graph

• We give an algorithm for computing an AND/OR graph.

• It considers the inference steps of different constraints.

• Many algorithms have been proposed for finding a solution graph; they are applicable here.

• No need to construct the entire AND/OR graph. Search for a solution graph “on the fly.”

Related work

Different scenarios of database security based on trust domains:

A. Single-user DBMS

B. C/S access control

C. Database as a service

D. Data publishing (our work)

[Figure: for each scenario, how Data, Execution, and Query are grouped into trust domains.]

Summary of 2nd paper

• Formulated problem of publishing XML document without information leakage due to data inference

• Showed the effect of constraints on inference

• Algorithm for finding a valid partial document of a given document


Outline

• Ranking Queries

• Data privacy:
  – XML Data publishing
  – Database-as-service (DAS) model