Posted: 20-Dec-2015
Mining Approximate Functional Dependencies (AFDs) as
Condensed Representations of Association Rules
Master’s Thesis Defense
by Aravind Krishna Kalavagattu
Committee Members: Dr. Subbarao Kambhampati (chair), Dr. Yi Chen, Dr. Huan Liu
Database Systems
• Well-defined schema and method for querying (SQL)
• Query optimization
• Lately, some systems have started supporting IR-style answering of user queries
Data Mining
• Discovering useful patterns from data
• Rule learning is a well-researched method for discovering interesting relations between variables in large databases
• Association rules
Rule mining over databases, with several applications
Introduction to AFDs
• Approximate Functional Dependencies (AFDs) are rules denoting approximate determinations at the attribute level.
• AFDs are of the form (X ~~> Y), where X and Y are sets of attributes.
• X is the “determining set” and Y is the “dependent set”.
• Rules with singleton dependent sets are of high interest.
• A classic example of an AFD: (Nationality ~~> Language)
  • Indicates that we can approximately guess the language of a person if we know which country she is from.
• More examples: (Make ~~> Model), ((Job Title, Experience) ~~> Salary)
Introduction (contd.)
• Functional Dependency (FD): given a relation R, a set of attributes X in R is said to functionally determine another attribute Y, also in R (written X → Y), if and only if each X value is associated with precisely one Y value.
• AFDs can be loosely defined as FDs that approximately hold: some exception rows fail to satisfy the function over the current relation.
• Example: Make ~~> Model (with error = 0.3) means that 70% of the tuples satisfy the dependency.
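The FD definition above can be checked mechanically. The following is a minimal sketch, assuming the relation is held in memory as a list of dictionaries; the function name `fd_holds` and the toy rows are hypothetical and not taken from the thesis.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if each combination of lhs values maps to exactly one rhs value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores the first rhs value seen for this key and returns it.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical toy relation, made up for illustration only.
PEOPLE = [
    {"Nationality": "France", "Language": "French", "Capital": "Paris"},
    {"Nationality": "France", "Language": "French", "Capital": "Paris"},
    {"Nationality": "Canada", "Language": "English", "Capital": "Ottawa"},
    {"Nationality": "Canada", "Language": "French", "Capital": "Ottawa"},
]

print(fd_holds(PEOPLE, ["Nationality"], ["Capital"]))   # an exact FD on this data
print(fd_holds(PEOPLE, ["Nationality"], ["Language"]))  # fails: one exception row
```

In this toy data, Nationality → Language fails as an exact FD because of a single exception row, which is exactly the situation an AFD is meant to tolerate.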
Applications of AFDs
• Predicting missing values of attributes in relational tables (QPIAD)
  • Using the values of the attributes in the determining set of an AFD
• Query optimization (CORDS, BHUNT)
  • Maintaining correct selectivity estimates
• Query rewriting (AIMQ, QPIAD, QUIC)
  • Example: Model ~~> BodyStyle. Rewrite a query on Model = “RAV4” to retrieve tuples with BodyStyle = “SUV”
• Database design (database normalization, efficient storage)
  • Similar to the way FDs are used
FD Mining and Implications
• FD mining aims at finding a minimal cover
  • The minimum set of FDs from which the entire set of FDs can be generated
  • Example: if A → B is an FD, then ({A, C} → B) is considered redundant
• Can we likewise generate only minimal dependencies in the case of AFDs?
• No, because a non-minimal AFD (Z ~~> B), where Z contains A, may be interesting for the application, and we may prefer it to (A ~~> B).
  • Non-minimal dependencies perform better in QPIAD, QUIC, etc.
  • Example: AFD ((JobTitle, Experience) ~~> Salary) vs. (JobTitle ~~> Salary)
Performance Concerns
• AFD mining is costly
  • The pruning strategies of FDs are not applicable in the case of AFDs.
  • For datasets with a large number of attributes, the search space grows even larger.
  • The method for determining whether a dependency holds is costly.
  • Choosing how to traverse the search space is tricky: bottom-up vs. top-down?
Quality Concerns
• Before algorithms for discovering AFDs can be developed, AFDs need better interestingness measures.
  • AFDs used as feature selectors in classification are expected to give good accuracy.
  • AFDs used in query rewriting are expected to give a high throughput per query.
• (VIN ~~> Make) vs. (Model ~~> Make)
  • (VIN ~~> Make) looks good using the error metric.
  • But intuitively (as well as practically), (Model ~~> Make) is a better AFD.
Challenges in AFD Mining
1. Defining right interestingness measures
2. Performing an efficient traversal in the search space of possible rules
3. Employing effective pruning strategies
Agenda/Outline
• Introduction
• Related Work
• Provide a new perspective for AFDs
  • Roll-ups/condensed representations of association rules
• Define measures for AFDs
• Present the AFDMiner algorithm
• Experimental Results
  • Performance
  • Quality
Related Work
FD mining algorithms
• Aim at finding a minimal cover
• DepMiner, FUN, TANE, FD_Mine
Existing approximation measures for AFDs
• Tau, InD metrics
Grouping association rules
• Clustering association rules (v1 ~> u, v2 ~> u as (v1 ^ v2 ~> u))
Limitations
• Do not work well for AFDs
• Metrics do not seem to matter in practice
• No accompanying algorithm to mine AFDs
• No one combines them as AFDs
Existing AFD Miners
CORDS
• SoftFDs (C1 => C2)
• Uses |C1, C2| / (|C1| |C2|) as the approximation measure
AIMQ/QPIAD/QUIC
• TANE
• Post-processing over TANE
Limitations
• Restricted to a singleton determining set
• Works from a sample
• Measure used is not appropriate
• Highly inefficient
• Quality of some AFDs is bad
Condensing Association Rules
• Viewing database relations as transactions
  • Itemsets ≈ attribute-value pairs
• Association rules
  • Usually between itemsets, e.g. Beer ~> Diapers
  • Here, they are between attribute-value pairs
  • Example association rule: (Toyota, Camry) ~> Sedan
• AFDs are rules between attributes
  • Each corresponds to a lot of association rules sharing the same attributes
• Rolling up association rules as AFDs
  • Honda ~~> Accord, Toyota ~~> Camry, Tata ~~> Maruti800, … roll up to Make ~~> Model
Confidence
• Consider an association rule of the form (α → β).
  • Confidence denotes the conditional probability of β (head) given α (body).
• Similarly, for an AFD (X ~~> A), Confidence should denote the chance of finding the value of A, given the values of X.
• Define AFD Confidence in terms of the confidence of association rules.
  • Specifically, pick the best association rule for every distinct value-combination of the body, and add up the supports of the picked rules.
Confidence
• For the example carDB, Confidence(Make ~~> Model) = Support(Make:Honda ~~> Model:Accord) + Support(Make:Toyota ~~> Model:Camry) = 3/8 + 2/8 = 5/8
• Interestingly, this is equal to (1 - g3).
  • g3 has a natural interpretation as the fraction of tuples with exceptions affecting the dependency.
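The carDB arithmetic above can be reproduced with a few lines of code. This is a minimal sketch, not the thesis implementation: the function name `afd_confidence` is hypothetical, and the models other than Accord and Camry ("Civic", "Sienna") are made up so that the table has the 3/8 and 2/8 supports quoted on the slide.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# (Make, Model) pairs for eight tuples; "Civic" and "Sienna" are invented fillers.
CAR_DB = [
    ("Honda", "Accord"), ("Honda", "Accord"), ("Honda", "Accord"),
    ("Honda", "Civic"), ("Honda", "Civic"),
    ("Toyota", "Sienna"),
    ("Toyota", "Camry"), ("Toyota", "Camry"),
]


def afd_confidence(pairs):
    """Sum the support of the best association rule for each distinct body value."""
    heads_by_body = defaultdict(Counter)
    for body, head in pairs:
        heads_by_body[body][head] += 1
    # For each body value keep only its most frequent head value.
    best_total = sum(max(heads.values()) for heads in heads_by_body.values())
    return Fraction(best_total, len(pairs))


make_to_model = afd_confidence(CAR_DB)
model_to_make = afd_confidence([(model, make) for make, model in CAR_DB])
print(make_to_model, model_to_make)
```

Using exact fractions keeps the result directly comparable to the hand computation of 3/8 + 2/8.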
Specificity
• For an association rule (α → β), Support is the probability with which the conditioning event (i.e., α) occurs.
• A rule with high Confidence yet low Support is a bad rule!
• The presence of a lot of association rules with low supports makes the AFD bad.
  • In classification, this affects prediction accuracy.
  • For query rewriting tasks, per-query throughput is lower.
Types of AFDs
1. Model ~~> Make
   • Few branches, uniform distribution
   • Good, and might hold universally
2. VIN ~~> Make
   • Many branches, uniform distribution
   • Bad: the Confidence of each association rule is high, but the supports are bad
3. Model, Location ~~> Price
   • Many branches, skewed distribution
   • Few association rules with high support and many with low support
[Figure: association rules Accord ~~> Honda, Camry ~~> Toyota, Maruti800 ~~> Tata, … rolled up as Model ~~> Make]
Specificity
• The Specificity measure captures our intuition about the different types of AFDs.
• It is based on information entropy.
• The higher the Specificity (above a threshold), the worse the AFD!
• Shares similar motivations with the way SplitInfo is defined in decision trees while computing Information Gain Ratio.
• Follows monotonicity.
• Normalized with the worst-case Specificity, i.e., the case where X is a key.
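The slides do not give the Specificity formula, so the following sketch is only one plausible reading of the bullets above: the Shannon entropy of the value distribution of X, divided by log2(N), which is the entropy reached when X is a key over N tuples. The function name, the VIN values, and the "Civic"/"Sienna" rows are assumptions for illustration.

```python
import math
from collections import Counter

# Toy carDB; VIN values and the "Civic"/"Sienna" models are invented.
CAR_DB = [
    {"VIN": i, "Make": make, "Model": model}
    for i, (make, model) in enumerate(
        [("Honda", "Accord")] * 3 + [("Honda", "Civic")] * 2
        + [("Toyota", "Sienna")] + [("Toyota", "Camry")] * 2,
        start=1,
    )
]


def specificity(rows, attrs):
    """Assumed definition: entropy of the attrs value distribution over log2(N)."""
    n = len(rows)
    if n <= 1:
        return 0.0
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(n)


for attrs in (["VIN"], ["Model"], ["Make"], ["Make", "Model"]):
    print(attrs, round(specificity(CAR_DB, attrs), 3))
```

Under this reading a key such as VIN scores 1, a coarse attribute such as Make scores low, and adding attributes to X can never lower the score, which matches the monotonicity bullet.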
AFD Mining Problem
• Good AFDs are the ones within the desired thresholds of the Confidence and Specificity measures.
• Formally, the AFD mining problem can be stated as follows:
AFD Mining
• The problem of AFD mining is to learn all AFDs that hold over a given relational table.
• Two costs:
  1. The major cost is the combinatorial cost of traversing the search space.
  2. The cost of visiting the data to validate each rule (to compute the interestingness measures).
• The search process for AFDs is exponential in the number of attributes.
Pruning Strategies
1. Pruning by Specificity
   • Specificity(Y) ≥ Specificity(X), where Y is a superset of X.
   • If Specificity(X) > maxSpecificity, we can prune all AFDs with X and its supersets as the determining set.
2. Pruning (applicable to FDs)
   • If (X → A) is an FD, all AFDs of the form (Y → A), where Y is a superset of X, can be pruned.
3. Pruning keys
   • Needed for FDs.
   • But this is subsumed by case 1 in AFDMiner, because Specificity(X) = 1 means X is a key.
AFDMiner Algorithm
• The search starts from singleton sets of attributes and works its way to larger attribute sets through the set containment lattice, level by level.
• When the algorithm is processing a set X, it tests AFDs of the form ((X \ {A}) ~~> A), where A ∈ X.
• Information from previous levels is captured by maintaining RHS+ candidate sets for each set.
Traversal in the Search Space
• During the bottom-up breadth-first search, the stopping criteria at a node are:
  1. The AFD Confidence becomes 1, and thus it is an FD (FD-based pruning).
  2. The Specificity value of X is greater than the given maximum value (Specificity-based pruning).
• Example: A → C is an FD. Then C is removed from RHS+(ABC).
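The two stopping criteria can be put together in a simplified level-wise search. This is a sketch under stated assumptions, not AFDMiner itself: it treats each lattice node X as a determining set rather than maintaining RHS+ candidate sets, it recomputes measures from the raw rows, and it uses the assumed normalized-entropy form of Specificity. All names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

CAR_DB = [
    {"VIN": i, "Make": mk, "Model": md}
    for i, (mk, md) in enumerate(
        [("Honda", "Accord")] * 3 + [("Honda", "Civic")] * 2
        + [("Toyota", "Sienna")] + [("Toyota", "Camry")] * 2, start=1)
]


def confidence(rows, lhs, rhs):
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][r[rhs]] += 1
    return sum(max(c.values()) for c in groups.values()) / len(rows)


def specificity(rows, attrs):  # assumed definition: entropy / log2(N)
    n = len(rows)
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    h = -sum(c / n * math.log2(c / n) for c in counts.values())
    return h / math.log2(n) if n > 1 else 0.0


def mine_afds(rows, attrs, min_conf, max_spec, max_len):
    results = []
    fd_lhs = defaultdict(list)  # dependent attribute -> determining sets of exact FDs
    level, size = [frozenset([a]) for a in attrs], 1
    while level and size <= max_len:
        survivors = set()
        for x in level:
            xs = sorted(x)
            if specificity(rows, xs) > max_spec:
                continue  # Specificity-based pruning: x and its supersets are dropped
            survivors.add(x)
            for a in attrs:
                if a in x or any(f <= x for f in fd_lhs[a]):
                    continue  # FD-based pruning: a subset already determines a exactly
                conf = confidence(rows, xs, a)
                if conf == 1.0:
                    fd_lhs[a].append(x)
                if conf > min_conf:
                    results.append((tuple(xs), a, conf))
        # Next level: sets of size+1 whose subsets of the current size all survived.
        cands = {x | y for x in survivors for y in survivors if len(x | y) == size + 1}
        level = [c for c in cands
                 if all(frozenset(s) in survivors for s in combinations(c, size))]
        size += 1
    return results


print(mine_afds(CAR_DB, ["VIN", "Make", "Model"], 0.8, 0.7, 2))
```

With maxSpecificity set to 0.7 the key-like VIN is never used as a determining set, so only Model ~~> Make is reported; raising the threshold to 1.0 lets the VIN rules back in.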
Computing Confidence and Specificity
• The methods are based on representing attribute sets by equivalence class partitions of the set of tuples.
• ∏X is the collection of equivalence classes of tuples for attribute set X.
• Example:
  • ∏make = {{1, 2, 3, 4, 5}, {6, 7, 8}}
  • ∏model = {{1, 2, 3}, {4, 5}, {6}, {7, 8}}
  • ∏{make ∪ model} = {{1, 2, 3}, {4, 5}, {6}, {7, 8}}
• A functional dependency X → A holds if ∏X = ∏X∪A.
• For the AFD (X ~~> A), Confidence = 1 - g3(X ~~> A).
• In this example, Confidence(Model ~~> Make) = 1 and Confidence(Make ~~> Model) = 5/8.
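The partition example can be reproduced directly. The sketch below is a straightforward, unoptimized reading of the partition-based idea (it does not use TANE's stripped-partition tricks); the function names `partition` and `g3`, and the "Civic"/"Sienna" model values, are assumptions for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Eight tuples with ids 1..8; "Civic" and "Sienna" are invented filler models.
CAR_DB = [
    {"make": mk, "model": md}
    for mk, md in [("Honda", "Accord")] * 3 + [("Honda", "Civic")] * 2
    + [("Toyota", "Sienna")] + [("Toyota", "Camry")] * 2
]


def partition(rows, attrs):
    """Equivalence classes of 1-based tuple ids that agree on all of attrs."""
    classes = defaultdict(set)
    for tid, row in enumerate(rows, start=1):
        classes[tuple(row[a] for a in attrs)].add(tid)
    return {frozenset(c) for c in classes.values()}


def g3(rows, lhs, rhs):
    """Smallest fraction of tuples to remove so that lhs -> rhs holds exactly."""
    pi_x = partition(rows, lhs)
    pi_xa = partition(rows, lhs + [rhs])
    # In each class of pi_x keep only the largest sub-class of pi_xa.
    keep = sum(max(len(d) for d in pi_xa if d <= c) for c in pi_x)
    return Fraction(len(rows) - keep, len(rows))


print(partition(CAR_DB, ["make"]))
print(partition(CAR_DB, ["model"]) == partition(CAR_DB, ["make", "model"]))
print(1 - g3(CAR_DB, ["model"], "make"), 1 - g3(CAR_DB, ["make"], "model"))
```

Because ∏model equals ∏{make ∪ model}, the exact FD model → make holds on this data, while make → model leaves three exception tuples out of eight.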
Algorithms
Algorithm AFDMiner:
• Computes Confidence
• Applies FD-based pruning
• Computes Specificity and applies pruning
• Computes level L_{l+1}
  • L_{l+1} contains only those attribute sets of size l+1 which have all their subsets of size l in L_l
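The level-generation step can be illustrated in isolation. This is a minimal sketch assuming attribute sets are represented as frozensets; the function name `next_level` and the single-letter attributes are hypothetical.

```python
from itertools import combinations


def next_level(level, l):
    """Build L_{l+1}: sets of size l+1 whose size-l subsets are all in L_l."""
    current = set(level)
    # Join pairs of surviving sets that differ in exactly one attribute.
    candidates = {x | y for x in current for y in current if len(x | y) == l + 1}
    return {
        c for c in candidates
        if all(frozenset(s) in current for s in combinations(c, l))
    }


# Level 2 after pruning: BD and CD were pruned, so only ABC can be formed.
L2 = [frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("AD")]
print(next_level(L2, 2))
```

Here ABD and ACD are rejected because BD and CD are missing from the previous level, which is how pruning at one level shrinks every level above it.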
Empirical Evaluation
Experimental Setup
• Data sets
  • CensusDB (199523 tuples, 30 attributes)
  • MushroomDB (8124 tuples, 23 attributes)
• Parameters for AFDMiner
  • minConf
  • maxSpecificity
  • No. of tuples
  • No. of attributes
  • Max length of determining set
• The aim of the experiments is to show that the dual-measure approach (AFDMiner, using both Confidence and Specificity) outperforms the single-measure approach (No_Specificity, which uses Confidence alone).
• No_Specificity: a modified version of AFDMiner that uses only Confidence, not Specificity, for AFDs. Thus, it generates all AFDs (X ~~> A) with Confidence(X ~~> A) > minConf.
Evaluating Quality
• BestAFD: the AFD with the highest Confidence among all the AFDs with attribute A as their dependent attribute
• Classification task
  • The classifier is run with the determining set of BestAFD as features
  • Used 10-fold cross-validation and computed the average classification accuracy
  • Weka toolkit
• Evaluated over CensusDB
Evaluation Quality
[Figure: average classification accuracy for all attributes on CensusDB, No_Specificity (shown as No_InfoSupport in the chart legend) vs. AFDMiner; y-axis: Classification Accuracy, 82 to 93; minConf = 0.8, maxSpecificity = 0.4]
• Shows that Specificity is effective in generating better quality AFDs.
Choosing minConf !
Choosing maxSpecificity
• Classification accuracy (by varying maxSpecificity)
  • Threshold low => good rules are pruned
  • Threshold high => bad rules are not pruned
• Classification accuracy approximately forms a double-elbow-shaped curve.
[Figure: CensusDB plots against maxSpecificity (x-axis, 0 to 0.8); recoverable y-axis label: Time Taken (ms), 0 to 6000]
Choosing maxSpecificity
• Time to compute AFDs
  • Increases with increasing maxSpecificity
  • The rate of change varies
• A good threshold value for Specificity (i.e., maxSpecificity) is the value at the first elbow in the graph on quality.
[Figure: Time Taken (ms), 0 to 6000, vs. maxSpecificity, 0 to 0.8, with the best value marked]
Query Throughput
[Figure: No. of Tuples Retrieved (y-axis, 0 to 9000) per attribute (x-axis), AFDMiner vs. No_Specificity (shown as No_InfoSupport in the chart legend)]
• No. of tuples returned for the top-10 queries on each distinct determining set (denotes query throughput)
Discussion on TANE
• Primarily designed to generate FDs
• Modified version for generating approximate dependencies
• Uses the error metric g3 for AFDs
• Bottom-up search in the lattice
• Generates only minimal dependencies
• Pruning applicable to FDs
Comparison (AFDMiner vs. TANE)
• TANENOMINP is a modified version of TANE that does not stop with just minimal dependencies.
• minConf is 0.8 (thus, we set the g3 threshold to 0.2).
• AFDMiner outperforms both approaches, strengthening the argument that AFDs with high Confidence and reasonable Specificity are the best.
Evaluating Performance
• Time varies linearly with the number of tuples.
  • AFDMiner takes less time than No_Specificity.
• Time varies exponentially with the number of attributes.
  • AFDMiner completes much faster than No_Specificity.
[Figure: Time Taken (ms) vs. Number of Tuples, No_Specificity vs. AFDMiner, CensusDB]
[Figure: Time Taken (ms) vs. No. of Attributes, No_Specificity vs. AFDMiner, CensusDB]
Evaluating Performance
[Figure: Time Taken (ms) vs. No. of Tuples, No_Specificity vs. AFDMiner]
[Figure: Time Taken (ms) vs. No. of Attributes, No_Specificity vs. AFDMiner]
[Figure: Number of candidates visited vs. length of determining set in each AFD, No_Specificity vs. AFDMiner, CensusDB]
These experiments show that AFDMiner is fast
[Figure: Time Taken (ms) vs. length of determining set in each AFD, No_Specificity vs. AFDMiner; panel labels: MushroomDB, CensusDB]
Conclusion
• Introduced a novel perspective for AFDs
  • Condensed roll-ups of association rules
• Two metrics for AFDs
  • Confidence
  • Specificity
• Algorithm AFDMiner
  • Finds all AFDs (Confidence > minConf; Specificity < maxSpecificity)
  • Bottom-up search in a breadth-first manner in the set containment lattice of attributes
  • Pruning based on Specificity
• Experiments
  • AFDMiner generates high-quality AFDs faster
  • AFDs with high Confidence and reasonable Specificity
A version of this thesis is currently under review at ICDE ’09.
Future Direction
• Conditional Functional Dependencies (CFDs)
  • Dependencies of the form ({ZipCode → City} if Country = “England”), i.e., holding true only for certain values of one or more other attributes
  • CAFDs are the probabilistic counterpart of CFDs
• CFDs and CAFDs have recently been applied in data cleaning and value prediction, but mining these conditional rules is unexplored.
• Intuitively, CFDs are intermediate rules between association rules (value level) and FDs (attribute level). So, we believe that our approach can help in generating them!
Questions ?