Query Flocks: A Generalization of Association-Rule Mining

D. Tsur, J.D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, A. Rosenthal

Motivations

• Market basket analysis has been successful, partially due to the a-priori optimization

• Extend this trick to a more general context

– efficiently mine large databases for patterns

– use parametrized queries with a filter condition

– spend most of the time evaluating the “interesting” cases

Query Flocks

• Two parts:

– generate parametrized queries (parameters are denoted by names starting with $)

– filter the results of the queries

• Result is the set of tuples which are “acceptable” assignments of values for the parameters

Market Basket Example

• Find all pairs of items that appear in at least 20 market baskets

• Result is all pairs of items ($1,$2) such that at least 20 baskets have both items

Datalog query:

answer(B) :-

baskets(B,$1) AND

baskets(B,$2)

Filter:

COUNT(answer.B) >= 20
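The flock above can be evaluated directly, without any optimization, as in the following minimal Python sketch. The function name `frequent_pairs`, the toy data, and the lowered threshold are illustrative assumptions, not part of the slides. The sketch also assumes unordered pairs of distinct items, which the Datalog rule itself does not require.

```python
from collections import defaultdict
from itertools import combinations


def frequent_pairs(baskets, support):
    """Return item pairs that occur together in at least `support` baskets.

    `baskets` is an iterable of (basket_id, item) tuples, mirroring the
    relation baskets(B, Item). Pairs are unordered and exclude ($1 = $2);
    that restriction is an assumption of this sketch.
    """
    items_by_basket = defaultdict(set)
    for basket_id, item in baskets:
        items_by_basket[basket_id].add(item)

    # For each candidate assignment ($1, $2), collect the baskets B in answer(B).
    baskets_by_pair = defaultdict(set)
    for basket_id, items in items_by_basket.items():
        for pair in combinations(sorted(items), 2):
            baskets_by_pair[pair].add(basket_id)

    # Filter: COUNT(answer.B) >= support.
    return {pair for pair, ids in baskets_by_pair.items() if len(ids) >= support}


# Toy data with a support threshold of 2 instead of 20.
toy_baskets = [
    (1, "beer"), (1, "diapers"),
    (2, "beer"), (2, "diapers"), (2, "milk"),
    (3, "beer"), (3, "milk"),
]
toy_result = frequent_pairs(toy_baskets, support=2)
```

Every pair in every basket is counted here, which is exactly the cost that the a-priori trick tries to avoid.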

Why Not SQL?

The same query in SQL:

SELECT i1.Item, i2.Item

FROM baskets i1, baskets i2

WHERE i1.Item < i2.Item AND i1.BID = i2.BID

GROUP BY i1.Item, i2.Item

HAVING 20 <= COUNT(i1.BID)

• A-Priori trick is not implemented by conventional optimizers

• Claim: necessary code optimizations could be implemented in SQL systems

Generalizing the A-Priori Technique

• First evaluate a less expensive query and eliminate certain answers

• Use a subset of the subgoals of the query

• This subset must form a safe query
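For the market basket flock, the cheaper query is the single-subgoal subquery answer(B) :- baskets(B,$1) with the same filter. The Python sketch below applies that step first and only then counts pairs. The function name `apriori_pairs` and the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def apriori_pairs(baskets, support):
    """Two-step evaluation of the market basket flock (illustrative sketch).

    Step 1 evaluates the subquery answer(B) :- baskets(B,$1) and keeps only
    items found in at least `support` baskets. Step 2 evaluates the full
    query, restricted to the surviving items.
    """
    items_by_basket = defaultdict(set)
    for basket_id, item in baskets:
        items_by_basket[basket_id].add(item)

    # Step 1: count baskets per single item and eliminate infrequent items.
    item_counts = Counter(
        item for items in items_by_basket.values() for item in items
    )
    frequent_items = {item for item, n in item_counts.items() if n >= support}

    # Step 2: count pairs, but only among items that passed step 1.
    pair_counts = Counter()
    for items in items_by_basket.values():
        kept = sorted(items & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair for pair, n in pair_counts.items() if n >= support}


# Toy data: "eggs" occurs in one basket, so step 1 removes it before pairing.
toy_baskets = [
    (1, "beer"), (1, "diapers"),
    (2, "beer"), (2, "diapers"), (2, "milk"),
    (3, "beer"), (3, "milk"), (3, "eggs"),
]
toy_result = apriori_pairs(toy_baskets, support=2)
```

The pruning is sound because a pair cannot appear in 20 baskets unless each of its items does.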

Safe Queries

• Every variable in the head appears in a nonnegated, nonarithmetic subgoal

• Every variable in a negated subgoal appears in a nonnegated subgoal

• Every variable in an arithmetic subgoal appears in a nonnegated, nonarithmetic subgoal

Example

Relations:

diagnosed(patient, disease)

exhibits(patient, symptom)

treatments(patient, medicine)

causes(disease, symptom)

Query:

answer(P) :-

exhibits(P,$s) AND

treatments(P,$m) AND

diagnosed(P,D) AND

NOT causes(D,$s)

Find symptoms $s and medicines $m such that many (at least 20) patients exhibit the symptom and are taking the medicine, but their disease does not explain the symptom

Some Safe Subqueries

• answer(P) :- exhibits(P,$s). 20+ patients exhibit the symptom

• answer(P) :- treatments(P,$m). 20+ patients were given the medicine

• answer(P) :- diagnosed(P,D) AND exhibits(P,$s) AND NOT causes(D,$s). 20+ patients have an unexplained symptom

• answer(P) :- exhibits(P,$s) AND treatments(P,$m). 20+ patients are taking the medicine and exhibit the symptom
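The safety conditions can be checked mechanically. The sketch below is one possible encoding, not the authors' implementation. The `Subgoal` class and `is_safe` function are hypothetical names. It treats $-parameters like ordinary variables, and it requires variables of negated subgoals to be bound by a nonnegated, nonarithmetic subgoal. Both choices are assumptions of this sketch.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Subgoal:
    """One subgoal of a rule body; every argument is a variable or $-parameter."""
    predicate: str
    args: tuple
    negated: bool = False
    arithmetic: bool = False


def is_safe(head_vars, subgoals):
    """Check the three safety conditions for a rule with the given body."""
    # Names bound by nonnegated, nonarithmetic subgoals.
    bound = {
        arg
        for goal in subgoals
        if not (goal.negated or goal.arithmetic)
        for arg in goal.args
    }
    # Names that need a binding: head, negated, and arithmetic occurrences.
    needed = set(head_vars)
    for goal in subgoals:
        if goal.negated or goal.arithmetic:
            needed.update(goal.args)
    return needed <= bound


EXHIBITS = Subgoal("exhibits", ("P", "$s"))
TREATMENTS = Subgoal("treatments", ("P", "$m"))
DIAGNOSED = Subgoal("diagnosed", ("P", "D"))
NOT_CAUSES = Subgoal("causes", ("D", "$s"), negated=True)
BODY = [EXHIBITS, TREATMENTS, DIAGNOSED, NOT_CAUSES]

# Enumerate every nonempty subset of the body and keep the safe ones.
safe_subqueries = [
    subset
    for size in range(1, len(BODY) + 1)
    for subset in combinations(BODY, size)
    if is_safe(["P"], subset)
]
```

Under these assumptions, a subset containing NOT causes(D,$s) is safe only when it also contains both diagnosed(P,D) and exhibits(P,$s), which is why the third subquery above needs all three subgoals.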

A Formal Query Plan Using A Sequence of Filter Steps

okS($s) := FILTER($s,

answer(P) :- exhibits(P,$s),

COUNT(answer.P) >= 20);

okM($m) := FILTER($m,

answer(P) :- treatments(P,$m),

COUNT(answer.P) >= 20);

ok($s,$m) := FILTER({$s,$m},

answer(P) :- okS($s) AND okM($m) AND

diagnosed(P,D) AND exhibits(P,$s) AND

treatments(P,$m) AND NOT causes(D,$s),

COUNT(answer.P) >= 20);
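The sequence of filter steps can be traced on toy data. The Python sketch below mirrors the okS, okM, and ok steps, under the assumption that each step uses the same support filter on distinct patients. The helper names `filter_step` and `run_plan`, the toy relations, and the threshold of 2 are all illustrative.

```python
from collections import defaultdict


def filter_step(assignment_patient_pairs, support):
    """Keep parameter assignments backed by at least `support` distinct patients."""
    patients = defaultdict(set)
    for assignment, patient in assignment_patient_pairs:
        patients[assignment].add(patient)
    return {a for a, ps in patients.items() if len(ps) >= support}


def run_plan(diagnosed, exhibits, treatments, causes, support):
    """Evaluate okS, then okM, then the full query restricted by both."""
    ok_s = filter_step(((s, p) for p, s in exhibits), support)
    ok_m = filter_step(((m, p) for p, m in treatments), support)

    diseases = defaultdict(set)
    for patient, disease in diagnosed:
        diseases[patient].add(disease)
    medicines = defaultdict(set)
    for patient, medicine in treatments:
        if medicine in ok_m:  # okM($m) prunes medicines early
            medicines[patient].add(medicine)

    def candidates():
        for patient, symptom in exhibits:
            if symptom not in ok_s:  # okS($s) prunes symptoms early
                continue
            # diagnosed(P,D) AND NOT causes(D,$s): some diagnosed disease
            # of the patient does not explain the symptom.
            if any((d, symptom) not in causes for d in diseases[patient]):
                for medicine in medicines[patient]:
                    yield (symptom, medicine), patient

    return filter_step(candidates(), support)


toy_diagnosed = {("p1", "flu"), ("p2", "flu"), ("p3", "cold")}
toy_exhibits = {("p1", "rash"), ("p2", "rash"), ("p3", "rash"),
                ("p1", "fever"), ("p2", "fever")}
toy_treatments = {("p1", "aspirin"), ("p2", "aspirin"), ("p3", "zinc")}
toy_causes = {("flu", "fever"), ("cold", "rash")}

toy_result = run_plan(toy_diagnosed, toy_exhibits, toy_treatments,
                      toy_causes, support=2)
```

In this toy run, "zinc" is dropped by the okM step, so no pair involving it reaches the expensive final step.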


But Which Subqueries Are Best?

• Depends on sizes of relations, and numbers of patients, diseases, etc.

• Use heuristics for restricting the search for a query plan

A Dynamic Technique

• Use the sizes of the intermediate relations, after computation, to decide whether to filter

– if the relation size gives an average number of tuples per value assignment that is much lower than previous steps, filter

– if the set of parameters has not been seen before, compare number of tuples per value assignment with support threshold
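One way to read these two rules is as a small decision function, sketched below in Python. The slides do not give exact thresholds, so the function name `should_filter`, the `drop_factor` parameter, and the direction of the comparison against the support threshold (filter when the average falls below it) are assumptions of this sketch rather than the authors' stated method.

```python
def should_filter(num_tuples, num_assignments, support,
                  previous_ratio=None, drop_factor=0.5):
    """Decide whether to apply a filter step to an intermediate relation.

    num_tuples       size of the intermediate relation after computation
    num_assignments  number of distinct parameter-value assignments in it
    support          the flock's support threshold (20 in the examples)
    previous_ratio   tuples per assignment at an earlier step with the same
                     parameter set, or None if this parameter set is new
    drop_factor      hypothetical cutoff for "much lower than previous steps"
    """
    if num_assignments == 0:
        return False  # nothing to eliminate
    ratio = num_tuples / num_assignments
    if previous_ratio is None:
        # New parameter set: compare against the support threshold.
        return ratio < support
    # Seen before: filter only if the average dropped sharply.
    return ratio < drop_factor * previous_ratio


# A new parameter set averaging 5 tuples per assignment, threshold 20.
decision_new = should_filter(1000, 200, support=20)
# A known parameter set whose average fell from 40 to 30.
decision_seen = should_filter(3000, 100, support=20, previous_ratio=40)
```

The intuition assumed here is that a low average means many assignments are likely to fall below the threshold, so filtering removes enough of them to repay its cost.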

Example

1. Compare the number of patients with the number of symptoms

2. Compare the number of patients with the number of medicines

3. Compare the size of the relation with symptoms * medicines

4. Compare the number of patients in the relation from 3 with the number of patients from the leaf

5. Must be done to get query result

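[Figure: query plan tree over the subgoals exhibits(P,$s), treatments(P,$m), diagnosed(P,D), and NOT causes(D,$s), with nodes numbered 1 to 5 to match the steps above; exhibits(P,$s) and treatments(P,$m) are the nodes labeled 1 and 2.]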

Summary

• This is a way of describing operations on large-scale databases

– flocks consist of parametrized queries and filters for the results of the queries

– exploit the a-priori algorithm with subqueries

– use techniques for limiting the search for query plans
