Mining Quantitative Correlated Patterns Using an Information-
Theoretic Approach
Yiping Ke, James Cheng, Wilfred Ng
Presented By:
Chibuike Muoh
Presentation Outline:
• Contributions of the paper
• Introduction
• What are QCPs? Definitions
• Background: Information Theory (entropy, MI, NMI)
• Mining QCPs: All-confidence; Discretization problem (interval combining); Attribute-level pruning; Interval-level pruning; QCoMine algorithm
Contributions of the paper
• Presents a new algorithm for mining patterns in databases, built on concepts borrowed from information theory: entropy and mutual information
• Achieves discretization of attribute domains using supervised interval combining, which preserves the dependency between attributes
Introduction
• Similar in principle to association rule mining, but evaluating association rules can be too expensive on very large databases (VLDBs)
• Trivial result sets: {pregnant} → {edema} and {pregnant, female} → {edema}
• Unproductive rules arise as a result of co-occurrence effects: {pregnant, dataminer} → {edema}
– So are occupation and the edema condition related?
• Unlike association mining, mining for QCPs considers the dependency among the attribute sets of the database to generate highly correlated patterns
– Similar to generating “maximal informative k-itemsets”, but here we consider dependency in the attribute sets
Introduction…contd.
• The idea behind mining QCPs:
– Evaluate the attribute set and look for ‘strong’ dependencies between attributes
– Then find correlated interval sets in the dependent attributes and generate patterns from them
• Thus, QCPs are not restricted to frequently co-occurring attributes
Definitions: Quantitative Database
• A pattern X is a set of attributes (random variables) {x1, x2, x3, …, xm} whose outcomes can be categorical or quantitative, with probabilities p(vx) = {p1, p2, p3, …, pm}
– An attribute xi is categorical if each value of dom(xi) corresponds to an interval [lx, ux] with lx = ux
– It is quantitative if xi takes an interval xi[lx, ux] with lx <= ux
– A pattern X is called a k-pattern if |attr(X)| = k
• Consider a quantitative database, D, as a set of transactions T, where each transaction in D is a vector of items <v1, v2, v3, …, vm> with vi ∈ dom(xi) for 1 <= i <= m
Definition…contd.
• We say a transaction T supports a pattern X if every attribute in X is represented in T
– The frequency of a pattern X in D, freq(X), is the number of transactions in D that support X
– The support of X, supp(X) = freq(X)/|D|, is the probability that a transaction T in D supports X
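These definitions can be made concrete with a small sketch (the database and values below are our own toy illustration, not Table 1 from the paper):

```python
# Toy quantitative database: each transaction maps attributes to values.
# A pattern maps attributes to intervals [lo, hi] (categorical: lo == hi).

def supports(transaction, pattern):
    """T supports X if every attribute of X falls inside its interval."""
    return all(attr in transaction and lo <= transaction[attr] <= hi
               for attr, (lo, hi) in pattern.items())

def supp(db, pattern):
    freq = sum(supports(t, pattern) for t in db)  # freq(X)
    return freq / len(db)                         # supp(X) = freq(X) / |D|

db = [{"age": 4, "gender": 1}, {"age": 5, "gender": 1},
      {"age": 2, "gender": 2}, {"age": 4, "gender": 2}]
X = {"age": (4, 5), "gender": (1, 1)}
print(supp(db, X))  # 0.5
```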
Example
• The database table above consists of six attributes, of which three are quantitative {age, salary, service years} and two are categorical {gender, married}
• The last column records the support of each transaction
• E.g., for the pattern X = age[4,5]gender[1,1], supp(X) = 0.25 + 0.19 = 0.44
Background: Information Theory
• Mining QCPs makes use of fundamental concepts in information theory
• Entropy: measures the information content/uncertainty of a random variable, x
Background: Information Theory…contd.
• Mutual information (MI): measures the average reduction in uncertainty about a random variable X, given the knowledge of Y (or vice versa)
– MI is a symmetric measure, so the greater the value of I(x; y), the more information x and y tell about each other.
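Both measures can be sketched directly from their definitions; a minimal illustration in Python (ours, not from the paper):

```python
import math

def entropy(dist):
    """H(x) = -sum_i p_i log2 p_i over the distribution of x."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def mutual_information(joint):
    """I(x;y) = sum_{vx,vy} p(vx,vy) log2[ p(vx,vy) / (p(vx) p(vy)) ]."""
    px = [sum(row) for row in joint]        # marginal distribution of x
    py = [sum(col) for col in zip(*joint)]  # marginal distribution of y
    return sum(joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
               for i in range(len(joint))
               for j in range(len(joint[0]))
               if joint[i][j] > 0)

print(entropy([0.5, 0.5]))                           # 1.0 (a fair coin: 1 bit)
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))  # 1.0 (y determines x)
```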
Example
• Consider the pattern X = (age, married) from Table 1. We can compute
I(age; married) = Σ_{v_age ∈ {1,2,3,4,5}} Σ_{v_married ∈ {1,2}} p(v_age, v_married) log [ p(v_age, v_married) / ( p(v_age) p(v_married) ) ] = 0.47
• The example shows that age causes a reduction of 0.47 in the uncertainty of married
• Similarly, as an exercise, we can compute I(gender; education) = 0.40
Normalized Mutual Information
• But by how much does X actually tell us about Y?
• The entropies of different attributes vary greatly, so MI only returns an absolute value, which is not so helpful in our case
• We can normalize the MI among our set of attributes to get a globally comparable relative measure
NMI…contd.
• Normalizing the MI measure among the attribute sets gives the minimum percentage of reduction in the uncertainty of one attribute given the knowledge of another:
Ĩ(x; y) = I(x; y) / max{ H(x), H(y) }
Example 2
• From the previous example we can compute Ĩ(age; married) = 0.47 / max{ H(age), H(married) } = 0.47 / 2.19 = 0.21
• Also we can determine Ĩ(gender; education) = 0.40 / 1.34 = 0.30
• Note that although I(age; married) > I(gender; education), its NMI is smaller. This can be attributed to the high entropy value of H(age) = 2.19 > H(education) = 1.34
• This implies that knowing age reduces a much larger absolute amount of uncertainty, but a smaller relative amount
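Assuming the normalization Ĩ(x; y) = I(x; y) / max{H(x), H(y)} as described above, the quoted figures can be reproduced in a few lines (married and gender are binary attributes, so their entropies are at most 1 bit and never dominate the max here; 1.0 is used as a placeholder bound):

```python
def nmi(i_xy, h_x, h_y):
    """I~(x;y) = I(x;y) / max{H(x), H(y)}: the minimum fractional
    reduction in the uncertainty of either attribute."""
    return i_xy / max(h_x, h_y)

# H(married) <= 1 bit and H(gender) <= 1 bit (binary attributes), so the
# exact values do not affect the max in either case.
print(round(nmi(0.47, 2.19, 1.0), 2))  # 0.21 = I~(age; married)
print(round(nmi(0.40, 1.0, 1.34), 2))  # 0.3  = I~(gender; education)
```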
Definition: Quantitative Pattern
• A more formal definition of a quantitative correlated pattern X follows: given a minimum information threshold (μ) and a minimum all-confidence threshold (ς), X is a QCP if Ĩ is at least μ for its attributes and allconf(X) is at least ς
• Thus a quantitative correlated pattern has strong co-dependency between its attributes and a high confidence level in the dataset
allconf(X)
• All-confidence is a correlation measure for determining the minimum confidence of the association rules that can be derived from a given pattern.
• For a quantitative pattern X, allconf(X) is defined as:
allconf(X) = supp(X) / MAX{ supp(x[lx, ux]) : x[lx, ux] ∈ X }
• This is different from association rule mining, where conf(X → Y) only indicates an implication from the sets on the left to the sets on the right
allconf(X)…contd.
• All-confidence has the downward closure property: if a pattern has all-confidence no less than ς, so do all its sub-patterns
Example
• For X = gender[1,1]education[1,1]:
allconf(X) = supp(gender[1,1]education[1,1]) / MAX{ supp(gender[1,1]), supp(education[1,1]) }
= (0.19 + 0.11 + 0.09 + 0.09) / MAX{ 0.25 + 0.19 + 0.11 + 0.09 + 0.09 + 0.09 + 0.08, 0.19 + 0.11 + 0.09 + 0.09 }
= 0.48 / 0.90 = 0.53
• Similarly, allconf(gender[1,1]married[1,1]) = 0.9
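The computation can be replayed in a short sketch, with the item supports read off the example's figures:

```python
def allconf(supp_x, item_supports):
    """allconf(X) = supp(X) / MAX over the supports of the items in X."""
    return supp_x / max(item_supports)

# Supports taken from the example above:
supp_xy = 0.19 + 0.11 + 0.09 + 0.09   # supp(gender[1,1]education[1,1])
supp_g = 0.25 + 0.19 + 0.11 + 0.09 + 0.09 + 0.09 + 0.08  # supp(gender[1,1])
supp_e = supp_xy                      # supp(education[1,1]) in this example
print(round(allconf(supp_xy, [supp_g, supp_e]), 2))  # 0.53
```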
allconf(X)
• A caveat about allconf: since it is applied at a fine granularity to intervals of attributes, it cannot solely be used as a measure for correlated patterns
– Quantitative attributes can span huge intervals, creating a co-occurrence problem
• The above points explain the need to first perform pruning at the attribute level
Example
• For the employee database in the previous example, we set μ = 0.2 and ς = 0.5. The pattern Y = gender[1,1]married[1,1] is not a QCP because Ĩ(gender; married) = 0 < μ, although allconf(Y) = 0.9
• This is because gender and married are independent of each other, while p(gender[1,1]) and p(married[1,1]) are both very high
QCP Mining
• Problem description:
– Given a quantitative database D, a minimum information threshold μ, and a minimum all-confidence threshold ς, the mining problem is to find all QCPs from D
QCP Mining: Process Outline
Quantitative Database
Attribute pruning
Interval pruning
Interval Combining/Discretization
QCoMine Algorithm
- Attribute pruning finds dependent attribute sets
- Interval pruning generates correlated patterns
Interval Combining
• When dealing with quantitative data (continuous attributes), we need to discretize the intervals of the attributes
• Challenges:
– Preventing the intervals from becoming too trivial
• E.g., age[0,2] vs. age[0,0], age[1,1], age[2,2]
– Considering the dependency of the attributes when combining their intervals
• Example: the pattern (age, gender) can produce different intervals than (age, married)
Interval combining…contd.
• Interval combining for quantitative patterns can be considered an optimization problem for an objective function Φ
• The goal of this stage is:
– Given two attributes x and y, where x is quantitative and y can be either quantitative or categorical, obtain the optimal combined intervals of x with respect to y
• Note that since this optimization is performed locally (between pairs of attributes), we use MI instead of NMI
Interval combining: Algorithm
• Let Φ[ix1, ix2](x, y) denote the value of Φ(x, y) when ix1 and ix2 are combined with respect to y
• At each step, two consecutive intervals ix1 and ix2 are considered for combination
• The idea is to pick, at each step, the maximum Φ[ix[j], ix[j+1]](x, y) among all pairs of consecutive intervals ix[j] and ix[j+1], and combine the corresponding ix[j] and ix[j+1] into a new interval
• To prevent the intervals from becoming too trivial, a termination condition is set as a specified minimum for the intervals
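The greedy loop described above can be sketched as follows. `phi` stands for the objective Φ (the slides use MI with respect to y); we leave it abstract here, and the toy `phi` at the bottom is purely illustrative:

```python
def combine_intervals(intervals, phi, min_intervals):
    """Greedily merge the consecutive pair maximizing phi until only
    min_intervals remain (the termination condition that keeps the
    intervals from becoming too trivial)."""
    intervals = list(intervals)
    while len(intervals) > min_intervals:
        # Evaluate phi for every pair of consecutive intervals...
        best = max(range(len(intervals) - 1),
                   key=lambda j: phi(intervals[j], intervals[j + 1]))
        lo, _ = intervals[best]
        _, hi = intervals[best + 1]
        # ...and merge the maximizing pair into one interval.
        intervals[best:best + 2] = [(lo, hi)]
    return intervals

# Purely illustrative objective: prefer merges that yield short spans.
phi = lambda a, b: -(b[1] - a[0])
base = [(0, 0), (1, 1), (2, 2), (3, 9)]
print(combine_intervals(base, phi, 2))  # [(0, 2), (3, 9)]
```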
Attribute level pruning
• At this stage, pruning at the attribute level is performed such that the attributes in a pattern have NMI of at least μ
• The definition above considers attributes as vertices in a graph (the NMI-graph), and cliques in the graph represent the attribute sets of QCPs
Attribute Level pruning…contd.
• From the previous definition, the attribute sets of QCPs are cliques in the NMI-graph with NMI >= μ
– Without pruning at the attribute level (i.e., μ = 0), the search space for cliques in the graph becomes more complex
– And enumerating cliques in a graph can be an exhaustive process
• The authors of the paper introduce a prefix tree structure for prefixing correlated attributes: the attribute prefix tree, Tattr
• Clique enumeration in the NMI-graph is done using the prefix tree
– The only extra action required when enumerating cliques using the prefix tree is to check whether (u,v) is an edge in G
Prefix tree construction
• To create the prefix tree:
1. First, a root node is created at level 0 of Tattr
2. Then, at level 1, we create a node for each attribute as a child of the root
3. For each node u at level k (k >= 1) and for each right sibling v of u, if (u,v) is an edge in G, we create a child node for u with the same attribute label as that of v
4. Repeat step 3 for u's children at level k+1
• Step 3 of the prefix tree construction creates the prefix tree in a depth-first manner
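Steps 1-4 can be sketched as a recursive construction. Restricting each node's candidates to its kept right siblings means every node passed the edge test against each ancestor, so every root-to-node path is a clique in G (the graph and attribute names below are illustrative, not from the paper):

```python
def build_tattr(attrs, edges):
    """Attribute prefix tree T_attr as nested dicts, following steps 1-4.
    `attrs` fixes the attribute order; `edges` is the NMI-graph G."""
    def adj(u, v):
        return (u, v) in edges or (v, u) in edges
    def expand(prefix, siblings):
        # Keep only right siblings adjacent to the prefix's last attribute;
        # siblings already passed every ancestor's test, so each path from
        # the root down to a node is a clique in G.
        kept = [v for v in siblings if not prefix or adj(prefix[-1], v)]
        return {v: expand(prefix + [v], kept[i + 1:])
                for i, v in enumerate(kept)}
    return expand([], list(attrs))

# Illustrative NMI-graph over four attributes (not the paper's data).
G = {("age", "married"), ("age", "salary"), ("married", "salary")}
tree = build_tattr(["age", "married", "salary", "gender"], G)
print(tree["age"])  # {'married': {'salary': {}}, 'salary': {}}
```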
Interval-level pruning
• Even though the cliques found using the NMI-graph have high NMI, they may differ on the intervals of their continuous attributes
– Since intervals are combined in a supervised way, the same attribute may have different sets of combined intervals with respect to different attributes
– Thus patterns with low all-confidence may still be generated from correlated attributes
• The interval-level pruning process uses all-confidence to ensure that only high-confidence patterns are generated from a pattern X and all its super-patterns
– This follows from the downward closure property of all-confidence
Interval-level pruning…contd.
• Note that an easy way to perform pruning at the interval level for a (k+1)-pattern is to compute the intersection of the prefixing (k-1) intervals of the two k-patterns
– Example: given age[30,40]married[1,1] and age[25,35]salary[2000,3000], intersect the intervals of age to obtain the new pattern age[30,35]married[1,1]salary[2000,3000]
• However, producing a new (k+1)-pattern using intersection violates the downward closure property of all-confidence
– Shrinking the intervals in the (k+1)-pattern may cause a great decrease in the support of a single item, so its all-confidence may be higher than that of its composing k-patterns
Interval-level pruning…contd.
• We can avoid intersection in interval pruning by enumerating all sub-intervals of the combined intervals Sx and Sy of the attribute set {x,y} at level 2 of Tattr, and pruning at that level before generating a pattern
• We need to consider all pairs of sub-intervals of x and y, as each of them represents a pattern
– Thus, for each interval set {i'x, i'y}, where i'x ⊆ ix, i'y ⊆ iy, and ix ∈ Sx, iy ∈ Sy, we create a QCP X = x[i'x]y[i'y] if allconf(X) >= ς
• This process of evaluating all possible sub-interval combinations for 2-patterns ensures downward closure on all k-patterns generated from them
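A minimal sketch of this level-2 enumeration over integer-valued intervals (the toy database and thresholds below are ours, not the paper's):

```python
def subintervals(lo, hi):
    """All sub-intervals [a, b] of a combined interval [lo, hi]."""
    return [(a, b) for a in range(lo, hi + 1) for b in range(a, hi + 1)]

def qcp_2patterns(db, x, ix, y, iy, min_allconf):
    """Enumerate all sub-interval pairs of {x, y}; keep a pair as a
    2-pattern only if its all-confidence reaches the threshold
    (attribute-level NMI pruning is assumed to have happened already)."""
    def supp(cond):  # cond maps attributes to [lo, hi] intervals
        return sum(all(lo <= t[a] <= hi for a, (lo, hi) in cond.items())
                   for t in db) / len(db)
    out = []
    for sx in subintervals(*ix):
        for sy in subintervals(*iy):
            s_xy = supp({x: sx, y: sy})
            if s_xy and s_xy / max(supp({x: sx}), supp({y: sy})) >= min_allconf:
                out.append((sx, sy))
    return out

db = [{"age": 1, "married": 1}, {"age": 1, "married": 1},
      {"age": 2, "married": 2}, {"age": 2, "married": 2}]
print(qcp_2patterns(db, "age", (1, 2), "married", (1, 2), 0.9))
# [((1, 1), (1, 1)), ((1, 2), (1, 2)), ((2, 2), (2, 2))]
```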
QCoMine Algorithm
• The algorithm first combines the base intervals of each quantitative attribute with respect to another attribute
• Steps 2-4 construct the NMI-graph G and use it to guide the construction of the attribute prefix tree Tattr to perform attribute pruning
• Steps 5-13 construct level 2 of Tattr and also perform interval pruning (steps 10-13), which produces all 2-pattern QCPs
• Tinterval is an interval-prefix tree that keeps the interval sets of all patterns generated by a node u in Tattr; it is used as a memoization structure for speedup and space saving
• Steps 14-15 invoke RecurMine on the child nodes of u to generate all k-QCPs for k > 2
QCoMine Algorithm…contd.
• The steps of the RecurMine algorithm continue to build the prefix tree Tattr for k > 2
• Interval pruning is aided by using the interval-prefix tree to speed up the joins of two k-patterns
• At step 6 of the algorithm, when two k-patterns are combined, it is ensured that all their prefixing (k-1) intervals are the same in both patterns, to avoid having to perform interval combining
Performance of QCoMine
• Performance tests of the QCoMine algorithm were performed to measure the efficiency of its three major components:
1. Supervised interval combining
2. Attribute-level pruning by NMI
3. Interval-level pruning by all-confidence
• Three variants of the algorithm were created:
a. QCoMine, which performs all operations as described originally in the paper
b. QCoMine-0, a control variant of the original algorithm which performs the interval combining process but sets μ = 0
c. QCoMine-1, another control variant that does not perform the interval combining process but utilizes μ as described originally in the paper
• The tests were performed with all-confidence from ς = 60% to 100%
Performance of QCoMine…contd.
• When interval combining is not applied, results on the dataset can only be obtained when ς = 100%; in all other cases the algorithm runs out of memory
• This is because QCoMine-1 is inefficient: it allows the interval of an item to become too trivial, so patterns can easily attain all-confidence > ς simply by co-occurrence
Performance of QCoMine…contd.
• The running time of both QCoMine and QCoMine-0 increases only slightly for smaller ς; this is because the majority of the time is spent on computing the 2-patterns
• No matter the value of ς, we need to test every 2-pattern to determine whether it is a QCP before we can employ the downward closure property of all-confidence to prune
References
1. Y. Ke, J. Cheng, and W. Ng. Mining quantitative correlated patterns using an information-theoretic approach. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
2. G. I. Webb. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
3. A. J. Knobbe and E. K. Y. Ho. Maximally informative k-itemsets and their efficient discovery. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.