Xiang Zhang, Feng Pan, Wei Wang, and Andrew Nobel VLDB2008 Mining Non-Redundant High Order...


Mining Non-Redundant High Order Correlations in Binary Data

Outline
- Motivation
- Properties related to NIFSs
- Pruning candidates by mutual information
- The algorithm
- Bounds based on pair-wise correlations
- Bounds based on Hamming distances
- Discussion

Motivation
Example: Suppose X, Y, and Z are binary features, where X and Y are disease SNPs and Z = X XOR Y is the complex disease trait. {X, Y, Z} has a strong correlation, but there is no correlation in {X, Y}, {X, Z}, or {Y, Z}.
Summary: The high order correlation pattern cannot be identified by examining only the pair-wise correlations.
Two aspects of the desired correlation patterns:
- The correlation involves more than two features.
- The correlation is non-redundant, i.e., removing any feature greatly reduces the correlation.

(Cont.)
Let I(X; Y) / H(Y) be the relative entropy reduction of Y based on X. Consider the three features X, Y, and Z with Z = X XOR Y: the relative entropy reduction of Z given X alone, or given Y alone, is small, i.e., X and Y jointly reduce the uncertainty of Z far more than they do separately. This strong correlation exists only when the three features are considered together.
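The XOR example can be checked numerically. The sketch below is my own illustration (not code from the paper): with Z = X XOR Y and X, Y independent fair bits, X alone gives zero relative entropy reduction of Z, while X and Y together remove all uncertainty about Z.

```python
import math
from collections import Counter
from itertools import product

def H(rows, cols):
    """Empirical joint entropy (bits) of the feature columns in cols."""
    freq = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum(f / n * math.log2(f / n) for f in freq.values())

# The four equally likely settings of (X, Y), with Z = X ^ Y.
data = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]

h_z = H(data, [2])
i_xz = H(data, [0]) + h_z - H(data, [0, 2])          # I(X; Z) = 0 bits
i_xyz = H(data, [0, 1]) + h_z - H(data, [0, 1, 2])   # I(X,Y; Z) = 1 bit

print(i_xz / h_z, i_xyz / h_z)   # relative entropy reduction: 0.0 vs 1.0
```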

(Cont.)
In this paper, the authors study the problem of finding non-redundant high order correlations in binary data.
NIFS (Non-redundant Interacting Feature Subset):
- The features in an NIFS together have high multi-information.
- All proper subsets of an NIFS have low multi-information.
The computational challenge of finding NIFSs:
- Feature combinations must be enumerated to find the feature subsets that have high correlation.
- For each such subset, all of its subsets must be checked to make sure there is no redundancy.

Properties related to NIFSs
- Downward closure property of WFSs: if a feature subset is a WFS, then all its subsets are WFSs. Advantage: this greatly reduces the complexity of the problem.
- Let V be an NIFS. Then no proper subset or proper superset of V is an NIFS.
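The multi-information underlying these definitions can be sketched as follows, assuming the standard total-correlation form C(V) = Σ H(Xi) − H(V); the code is an illustration, not the paper's implementation. On the XOR data, the triple has high multi-information while every pair has none:

```python
import math
from collections import Counter
from itertools import combinations

def H(rows, cols):
    """Empirical joint entropy (bits) of the feature columns in cols."""
    freq = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum(f / n * math.log2(f / n) for f in freq.values())

def multi_information(rows, cols):
    """C(V) = sum of marginal entropies minus the joint entropy of V."""
    return sum(H(rows, [c]) for c in cols) - H(rows, cols)

# XOR data from the motivation: columns X, Y, Z with Z = X ^ Y.
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

c_all = multi_information(data, [0, 1, 2])   # high: the triple is correlated
c_pairs = [multi_information(data, list(p)) for p in combinations(range(3), 2)]
print(c_all, c_pairs)   # the correlation is genuinely third-order
```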

Pruning candidates by mutual information
If a feature subset is not a WFS, then no superset of it can satisfy the NIFS requirement that all proper subsets be weakly correlated, so all of its supersets can be safely pruned.

Algorithm

Upper and lower bounds based on pair-wise correlations

Let h_k denote the average entropy in bits per symbol of a randomly drawn k-element subset of the features. By Han's inequality, h_k is non-increasing in k, which is what yields the upper and lower bounds.
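A sketch of this quantity, assuming the usual per-symbol form h_k = average over all k-subsets S of H(S) / k; Han's inequality guarantees h_1 ≥ h_2 ≥ … ≥ h_n, as the XOR data illustrates:

```python
import math
from collections import Counter
from itertools import combinations

def H(rows, cols):
    """Empirical joint entropy (bits) of the feature columns in cols."""
    freq = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum(f / n * math.log2(f / n) for f in freq.values())

def avg_entropy_per_symbol(rows, features, k):
    """h_k: average entropy in bits per symbol of a random k-element subset."""
    subs = list(combinations(features, k))
    return sum(H(rows, list(s)) / k for s in subs) / len(subs)

data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]   # Z = X XOR Y
hs = [avg_entropy_per_symbol(data, range(3), k) for k in (1, 2, 3)]
print(hs)   # non-increasing in k, by Han's inequality
```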

Algorithm (Cont.)
Suppose that the current candidate feature subset is V.
- First check whether all subsets of V of size (b - a - 1) are WFSs. If some subset is not a WFS, the subtree of V can be pruned.
- If the upper bound already shows that V is a WFS, there is no need to calculate C(V); proceed directly to its subtree. In case 2, C(V) must be calculated and all subsets of V checked.
- Use the adding proposition to get upper and lower bounds on the multi-information for each direct child node of V; if the bounds are inconclusive, C(V) must be calculated.
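The search can be sketched as a depth-first walk over the set-enumeration tree. This is a simplified illustration: delta1 (weak threshold) and delta2 (strong threshold) are names I introduce, and C(V) is computed directly rather than via the paper's pair-wise and Hamming-distance bounds.

```python
import math
from collections import Counter
from itertools import combinations

def H(rows, cols):
    """Empirical joint entropy (bits) of the feature columns in cols."""
    freq = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum(f / n * math.log2(f / n) for f in freq.values())

def multi_information(rows, cols):
    return sum(H(rows, [c]) for c in cols) - H(rows, cols)

def find_nifs(rows, n_features, delta1, delta2):
    """Return all NIFSs: C(V) >= delta2 and C(S) <= delta1 for every proper S."""
    results = []

    def dfs(prefix, next_feature):
        for f in range(next_feature, n_features):
            v = prefix + [f]
            if len(v) >= 2:
                c = multi_information(rows, v)
                if c > delta1:
                    # V is not a WFS, so no superset can have all-weak proper
                    # subsets: prune the subtree. V itself is an NIFS if it is
                    # an SFS and all its proper subsets are WFSs.
                    if c >= delta2 and all(
                        multi_information(rows, list(s)) <= delta1
                        for r in range(2, len(v))
                        for s in combinations(v, r)
                    ):
                        results.append(v)
                    continue
            dfs(v, f + 1)

    dfs([], 0)
    return results

data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]   # Z = X XOR Y
print(find_nifs(data, 3, delta1=0.1, delta2=0.9))     # the XOR triple
```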

Definition of NIFS
A subset of features V is an NIFS if the following two criteria are satisfied:
- V is an SFS.
- Every proper subset of V is a WFS.
Ex. With Z = X XOR Y as in the motivation, {X, Y, Z} is an NIFS: {X, Y, Z} is an SFS, and {X, Y}, {X, Z}, {Y, Z} are WFSs.
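The two criteria translate directly into a predicate. This is a sketch under the same assumptions as before: the total-correlation form of multi-information, and illustrative thresholds delta1 (weak) and delta2 (strong) of my own naming.

```python
import math
from collections import Counter
from itertools import combinations

def H(rows, cols):
    """Empirical joint entropy (bits) of the feature columns in cols."""
    freq = Counter(tuple(r[c] for c in cols) for r in rows)
    n = len(rows)
    return -sum(f / n * math.log2(f / n) for f in freq.values())

def multi_information(rows, cols):
    return sum(H(rows, [c]) for c in cols) - H(rows, cols)

def is_nifs(rows, v, delta1, delta2):
    """V is an NIFS iff V is an SFS and every proper subset is a WFS."""
    if multi_information(rows, v) < delta2:               # criterion 1: SFS
        return False
    return all(multi_information(rows, list(s)) <= delta1  # criterion 2: WFSs
               for k in range(2, len(v))
               for s in combinations(v, k))

data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]   # Z = X XOR Y
print(is_nifs(data, [0, 1, 2], delta1=0.1, delta2=0.9))   # True
```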

(Cont.)
If a candidate subset and one of its proper subsets were both SFSs, the larger set would be redundant: its strong correlation is already captured by the subset. This is why the definition requires that any proper subset of an NIFS be weakly correlated.