Detecting Group Differences: Mining Contrast Sets Author: Stephen D. Bay Advisor: Dr. Hsu Graduate:...
-
Upload
mary-waters -
Category
Documents
-
view
218 -
download
1
Transcript of Detecting Group Differences: Mining Contrast Sets Author: Stephen D. Bay Advisor: Dr. Hsu Graduate:...
Detecting Group Differences: Mining Contrast Sets
Author: Stephen D. BayAdvisor: Dr. HsuGraduate: Yan-Cheng Lin
Outline
Motivation Objective Research Review Search for Contrast Sets Filtering for Summarizing Contrast
Set Evaluation Conclusion
Motivation
Learning group differences a central problem in many domains
Contrasting groups especially important in social science research
Objective
Automatically detect differences between contrasting groups from observational multivariate data
Research Review
time series research multiple observations
traditional statistical methods rule learner and decision tree
miss group differences association rule mining
multiple group and different search criteria
Problem Definition
itemset concept extends to contrast setDefinition 1:
Let A1,A2,...,Ak be a set of k variables called attributes.
Each Ai can take on values from the set {Vi1,Vi2,...Vim}.
Contrast set a conjunction of attribute –value pairs defined on groups G1,G2,...,Gn with no Ai occurring more than once.
Define support of contrast set Definition 2:
The support of a contrast set with respect to a group G is the percentage of examples in G where the contrast set is true.
minimum support difference δ user defined threshold
Search for Contrast Sets
find contrast sets meet our criteria though search
explore all possible contrast sets return only sets meet our criteria
STUCCO (Search and Testing for Understandable Consistent Contrasts): breadth-first search incorporates several efficiently mining techniques
Framework use set-enumeration trees use breadth-first search counting phase organize nodes into candidate
groups
Finding Significant Contrast Sets testing the null hypothesis across all groups support counts from contingency tables
Controlling Search Error
data mining test many hypotheses family of tests control Type I error Bonferroni inequality:given any set of events
e1,e2,...,en, the probability of their union is less than or equal to the sum of the individual probabilities
Pruning
prune when contrast sets fail to meet effect size or statistical significance criteria
prune when lead to uninteresting contrast sets
Effect Size Pruning prune nodes when bound maximum support differ
ence groups below δ Statistical Significance Pruning
pruned when too few data or maximum value X2 too small
Interest Based Pruning contrast sets are not interesting when have
identical support or relation between groups is fixed
Specializations with Identical Support marital-status=husband marital-status=husband ^ Sex = male
Fixed Relations
Fixed Relations prune node as contrast set specializations do
not add new information
Relation to Itemset Mining
minimum support difference criterion implies constraints support levels in individual groups
eliminate large portions of the search space based on:
subset infrequency pruning effect size pruning
superset frequency pruning interest based pruning
ab abc
Filtering for Summarizing Contrast Set
past approaches limit the rules shown by constraint the
variables or items compare discovered rules, show only
unexpected results new methods
expectation based statistical approach identify and select linear trend contrast
sets
Statistical Surprise
show most general contrast sets first, more complicated conjunctions if surprising based on previously shown sets
IPF(Iterative Proportional Fitting) find maximum likelihood estimates
Detecting Linear Trends identical to finding change over time detect significant contrast set by using the chi-
square test use regression techniques to find the portion of
the x2
Evaluation three research points:
low support difference few high support attribute-value pairs, lower bounds can’
t take advantage pruning rules
δ -> 0 statistical significance pruning is more important filtering rules
Conclusion
STUCCO algorithm combined statistical hypothesis testing with search for mining contrast sets
STUCOO has pruning rules efficient mining at low
support differences guaranteed control over false positives linear trend detection compact summarization of result