Query Flocks: A Generalization of Association-Rule Mining
D. Tsur, J.D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, A. Rosenthal
Motivations
• Market basket analysis has been successful, partially due to the a-priori optimization
• Extend this trick to a more general context
– efficiently mine large databases for patterns
– use parametrized queries with a filter condition
– spend most of the time evaluating the “interesting” cases
Query Flocks
• Two parts:
– generate parametrized queries (parameters are denoted by names starting with $)
– filter the results of the queries
• Result is the set of tuples which are “acceptable” assignments of values for the parameters
Market Basket Example
• Find all pairs of items that appear in at least 20 market baskets
• Result is all pairs of items ($1,$2) such that at least 20 baskets have both items
Datalog query:
answer(B) :-
baskets(B,$1) AND
baskets(B,$2)
Filter:
COUNT(answer.B) >= 20
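As a minimal sketch of what evaluating this flock means, the Python below tries every assignment of ($1,$2), computes answer(B) for it, and keeps the assignments that pass the filter. The toy data, the threshold of 2 instead of 20, and the function name `evaluate_flock` are assumptions for illustration; the restriction to pairs with $1 < $2 is added here only to avoid listing each pair twice.

```python
from itertools import combinations

def evaluate_flock(baskets, threshold):
    """Naive evaluation: for each parameter assignment ($1, $2), compute
    answer(B) and keep the assignment if COUNT(answer.B) >= threshold."""
    by_item = {}
    for bid, item in baskets:
        by_item.setdefault(item, set()).add(bid)
    accepted = []
    for i1, i2 in combinations(sorted(by_item), 2):  # only pairs with $1 < $2
        answer = by_item[i1] & by_item[i2]           # baskets holding both items
        if len(answer) >= threshold:
            accepted.append((i1, i2))
    return accepted

# Hypothetical toy data: (basket id, item) tuples.
baskets = [
    (1, "beer"), (1, "diapers"), (1, "milk"),
    (2, "beer"), (2, "diapers"),
    (3, "beer"), (3, "milk"),
    (4, "diapers"),
]
pairs = evaluate_flock(baskets, threshold=2)
print(pairs)
```

This naive loop evaluates the query for every candidate pair, which is exactly the cost the a-priori optimization is meant to avoid.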
Why Not SQL?
The same query in SQL:
SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND i1.BID = i2.BID
GROUP BY i1.Item, i2.Item
HAVING 20 <= COUNT(i1.BID)
• A-Priori trick is not implemented by conventional optimizers
• Claim: necessary code optimizations could be implemented in SQL systems
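To make the a-priori trick concrete, here is a sketch using Python's built-in sqlite3 module. It runs the pair query directly, then runs a hand-written rewrite that first finds the items appearing in enough baskets and joins only those. The table layout `baskets(BID, Item)`, the toy rows, the name `frequent`, and a threshold of 2 instead of 20 are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (BID INTEGER, Item TEXT)")
conn.executemany(
    "INSERT INTO baskets VALUES (?, ?)",
    [(1, "beer"), (1, "diapers"), (1, "milk"),
     (2, "beer"), (2, "diapers"),
     (3, "beer"), (3, "milk"),
     (4, "diapers"), (4, "eggs")],
)
THRESHOLD = 2  # the slides use 20; 2 keeps the toy data small

# The query as written: join all pairs, then filter by support.
direct = conn.execute("""
    SELECT i1.Item, i2.Item
    FROM baskets i1, baskets i2
    WHERE i1.Item < i2.Item AND i1.BID = i2.BID
    GROUP BY i1.Item, i2.Item
    HAVING ? <= COUNT(i1.BID)
    ORDER BY i1.Item, i2.Item
""", (THRESHOLD,)).fetchall()

# A-priori rewrite: an item below the threshold cannot be part of a
# frequent pair, so drop such items before the self-join.
apriori = conn.execute("""
    WITH frequent(Item) AS (
        SELECT Item FROM baskets GROUP BY Item HAVING COUNT(BID) >= ?
    )
    SELECT i1.Item, i2.Item
    FROM baskets i1, baskets i2
    WHERE i1.Item < i2.Item AND i1.BID = i2.BID
      AND i1.Item IN (SELECT Item FROM frequent)
      AND i2.Item IN (SELECT Item FROM frequent)
    GROUP BY i1.Item, i2.Item
    HAVING ? <= COUNT(i1.BID)
    ORDER BY i1.Item, i2.Item
""", (THRESHOLD, THRESHOLD)).fetchall()

print(direct)
print(apriori)
```

Both queries return the same pairs; the rewrite only shrinks the input to the self-join.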
Generalizing the A-Priori Technique
• First evaluate a less expensive query and eliminate certain answers
• Use a subset of the subgoals of the query
• This subset must form a safe query
Safe Queries
• A variable in the head appears in a nonnegated, nonarithmetic subgoal
• A variable in a negated subgoal appears in a nonnegated subgoal
• A variable in an arithmetic subgoal appears in a nonnegated, nonarithmetic subgoal
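The three conditions can be checked mechanically. The sketch below assumes a query is given as a set of head variables plus a list of subgoals tagged as "pos" (nonnegated, nonarithmetic), "neg" (negated), or "arith" (arithmetic); this encoding and the name `is_safe` are hypothetical. It also assumes that parameters such as $1 are treated like ordinary variables for the purpose of the check.

```python
def is_safe(head_vars, subgoals):
    """subgoals: list of (kind, variables), kind in {"pos", "neg", "arith"}.
    Returns True if every head, negated, and arithmetic variable also
    appears in some nonnegated, nonarithmetic subgoal."""
    positive = set()
    for kind, variables in subgoals:
        if kind == "pos":
            positive |= set(variables)
    if not set(head_vars) <= positive:
        return False
    return all(set(variables) <= positive
               for kind, variables in subgoals if kind != "pos")

# answer(B) :- baskets(B,$1) AND baskets(B,$2)
basket_query = [("pos", {"B", "$1"}), ("pos", {"B", "$2"})]
# answer(B) :- baskets(B,$1) AND $1 < $2   ($2 is bound nowhere)
unbound_arith = [("pos", {"B", "$1"}), ("arith", {"$1", "$2"})]

print(is_safe({"B"}, basket_query))   # safe
print(is_safe({"B"}, unbound_arith))  # not safe
```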
Example
Relations:
diagnosed(patient, disease)
exhibits(patient, symptom)
treatments(patient, medicine)
causes(disease, symptom)
Query:
answer(P) :-
exhibits(P,$s) AND
treatments(P,$m) AND
diagnosed(P,D) AND
NOT causes(D,$s)
Find symptoms $s and medicines $m such that many (at least 20) patients exhibit the symptom and are taking the medicine, but their disease does not explain the symptom
Some Safe Subqueries
• answer(P) :- exhibits(P,$s). 20+ patients exhibit the symptom
• answer(P) :- treatments(P,$m). 20+ patients were given the medicine
• answer(P) :- diagnosed(P,D) AND exhibits(P,$s) AND NOT causes(D,$s). 20+ patients have an unexplained symptom
• answer(P) :- exhibits(P,$s) AND treatments(P,$m). 20+ patients are taking the medicine and exhibit the symptom
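The list above can be reproduced by enumerating every nonempty subset of the four subgoals and keeping the safe ones. This is a sketch under the assumption that parameters count as variables in the safety check; the tuple encoding of subgoals is hypothetical.

```python
from itertools import combinations

HEAD = {"P"}
# (text, is_negated, variables and parameters used)
SUBGOALS = [
    ("exhibits(P,$s)", False, {"P", "$s"}),
    ("treatments(P,$m)", False, {"P", "$m"}),
    ("diagnosed(P,D)", False, {"P", "D"}),
    ("NOT causes(D,$s)", True, {"D", "$s"}),
]

def is_safe(subset):
    """Head and negated variables must occur in a nonnegated subgoal."""
    positive = set().union(*(v for _, neg, v in subset if not neg))
    negated_ok = all(v <= positive for _, neg, v in subset if neg)
    return HEAD <= positive and negated_ok

safe = [
    [text for text, _, _ in subset]
    for size in range(1, len(SUBGOALS) + 1)
    for subset in combinations(SUBGOALS, size)
    if is_safe(subset)
]
for body in safe:
    print("answer(P) :- " + " AND ".join(body))
```

On this query the enumeration yields nine safe bodies, including the full query itself; NOT causes(D,$s) is only allowed together with both diagnosed(P,D) and exhibits(P,$s).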
A Formal Query Plan Using A Sequence of Filter Steps
okS($s) := FILTER($s,
answer(P) :- exhibits(P,$s),
COUNT(answer.P) >= 20);
okM($m) := FILTER($m,
answer(P) :- treatments(P,$m),
COUNT(answer.P) >= 20);
ok($s,$m) := FILTER({$s,$m},
answer(P) :- okS($s) AND okM($m) AND
diagnosed(P,D) AND exhibits(P,$s) AND
treatments(P,$m) AND NOT causes(D,$s),
COUNT(answer.P) >= 20);
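The plan above can be traced on a small in-memory example. The sketch below is one possible reading of the three FILTER steps in Python; the helper `filter_step`, the toy relations, and the threshold of 2 instead of 20 are assumptions for illustration.

```python
from collections import defaultdict

def filter_step(rows, threshold):
    """rows: iterable of (parameter_assignment, patient). Keep assignments
    supported by at least `threshold` distinct patients."""
    support = defaultdict(set)
    for params, patient in rows:
        support[params].add(patient)
    return {params for params, patients in support.items()
            if len(patients) >= threshold}

# Hypothetical toy relations.
exhibits = {(1, "fever"), (2, "fever"), (3, "fever"), (3, "rash"), (4, "cough")}
treatments = {(1, "aspirin"), (2, "aspirin"), (3, "aspirin"),
              (4, "aspirin"), (1, "zinc")}
diagnosed = {(1, "cold"), (2, "cold"), (3, "flu"), (4, "cold")}
causes = {("flu", "fever"), ("cold", "cough")}
THRESHOLD = 2

# okS and okM: cheap single-subgoal filters.
ok_s = filter_step(((s, p) for p, s in exhibits), THRESHOLD)
ok_m = filter_step(((m, p) for p, m in treatments), THRESHOLD)

# ok: the full query, restricted to surviving symptoms and medicines.
rows = (
    ((s, m), p)
    for p, s in exhibits if s in ok_s
    for p2, m in treatments if p2 == p and m in ok_m
    for p3, d in diagnosed if p3 == p and (d, s) not in causes
)
ok = filter_step(rows, THRESHOLD)
print(ok_s, ok_m, ok)
```

Only "fever" and "aspirin" survive the first two steps, so the final join never touches "rash", "cough", or "zinc".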
But Which Subqueries Are Best?
• Depends on sizes of relations, and numbers of patients, diseases, etc.
• Use heuristics for restricting the search for a query plan
A Dynamic Technique
• Use the sizes of the intermediate relations, after computation, to decide whether to filter
– if the relation size gives an average number of tuples per value assignment that is much lower than previous steps, filter
– if the set of parameters has not been seen before, compare number of tuples per value assignment with support threshold
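One plausible reading of these two rules is sketched below. The slides give neither a cutoff for "much lower" nor the direction of the comparison with the support threshold, so both are assumptions here: `drop_ratio=0.5` is an arbitrary cutoff, and a new parameter set triggers a filter when the average falls below the support threshold (on the reasoning that many assignments would then fail the filter). The function name `should_filter` is hypothetical.

```python
def should_filter(num_tuples, num_assignments, support,
                  prev_avg=None, drop_ratio=0.5):
    """Decide whether to insert a filter step after an intermediate relation.

    num_tuples / num_assignments is the average number of tuples per
    parameter-value assignment. prev_avg is that average at the previous
    step for the same parameter set, or None if the set is new.
    """
    avg = num_tuples / num_assignments
    if prev_avg is None:
        # Assumption: new parameter set, compare with the support threshold.
        return avg < support
    # Assumption: "much lower" means below drop_ratio times the old average.
    return avg < drop_ratio * prev_avg

print(should_filter(100, 50, support=20))                 # new set, avg 2
print(should_filter(900, 10, support=20, prev_avg=100))   # avg 90 vs 100
```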
Example
1. Compare number of patients with number of symptoms
2. Compare number of patients with number of medicines
3. Compare size of relation with symptoms * medicines
4. Compare number of patients in relation from 3 with number of patients from leaf
5. Must be done to get the query result
[Figure: query plan tree over the subgoals exhibits(P,$s), treatments(P,$m), diagnosed(P,D), and NOT causes(D,$s), with nodes labeled 1 to 5 matching the steps above]
Summary
• This is a way of describing operations on large-scale databases
– flocks consist of parametrized queries and filters for the results of the queries
– exploit the a-priori algorithm with subqueries
– use techniques for limiting the search for query plans