Cluster Subspace Identification Via
Conditional Entropy Calculations
James Diggans, George Mason University
Jeffrey L. Solka, George Mason University
Outline
• Subspace identification - why?
• Conditional entropy and clusters in $\mathbb{R}^2$.
• Ordering dimensions for easy subspace visualization and identification.
• Maximal cliques lead to automatic subspace identification.
Subspace identification
• Initial, high-level exploration of complex data can inform downstream analyses.
• Explore samples (observations) or genes (dimensions) depending on intent.
• Cluster structure in patients may only be revealed on a subset of genes, and vice versa (Getz et al).
• Uninformed feature selection can discard informative features.
Conditional entropy and clusters in $\mathbb{R}^2$
Use of conditional entropy gives us:
• Distribution-free
• Robust to outliers/extreme values
• Minimal nuisance parameters
• Robust to noise as long as the noise exists in all subspaces.
Adapted from a method proposed by Guo et al at the Geography department at Penn State.
Guo et al, Workshop on Clustering High-Dimensional Data and its Applications, 2003
Geography to … Microarrays?
Guo et al have data with many (~10,000) observations in a few (~50) dimensions (measurements):
[Figure: data matrix with axes labeled Dim. and Obs.]
We have the opposite problem: we have many more ‘dimensions’ (genes) than we do observations (‘samples’ or ‘patients’) on those dimensions. We flip Guo’s method on its ear: pretend that observations are dimensions and vice versa.
[Figure: data matrix with axes labeled Dim. and Obs., relabeled “Obs” and “Dim”.]
The method
[Figure: processing pipeline. Gene Expression Data → Nested Means Matrix → CE Distance Matrix → Minimal Spanning Tree → MST Order, and CE Distance Matrix → Clique Discovery → Cliques. Matrix dimensions in the diagram are labeled ng, ns, and nr.]
CE – what are we looking for?
Nested means discretization
• Resistant to extreme outliers, a property not seen in an equal-interval approach.
• We calculate nested mean vectors by:
  • Calculate the mean value of a dimension.
  • Divide the data into two halves on this mean.
  • Recursively divide each half into half again, calculating a vector of ‘nested mean’ boundaries.
  • Stop once we have the ‘required’ number of intervals (denoted r).
• We want enough intervals so that, on average, each cell contains ~35 points (Cheng et al, 1999). Guo uses (n is the number of observations, r is the number of intervals):

$$\frac{n}{r^2} \approx 35 \quad \text{and} \quad r = 2^k$$

• Example: For n = 10,000, r = 16 because 16 × 16 = 256 and 256 × 35 = 8,960 < 10,000.
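The discretization steps and the choice of r can be sketched in Python as follows. This is a minimal illustration, not the authors' R implementation: the function names are hypothetical, r is assumed to be a power of two, and ties with a mean are assumed to fall in the lower half.

```python
from bisect import bisect_left


def choose_r(n, target=35):
    """Largest power of two r with r * r * target <= n (never below 2)."""
    r = 2
    while (2 * r) ** 2 * target <= n:
        r *= 2
    return r


def nested_means_breaks(values, r):
    """Interior boundaries found by recursively splitting on the mean.

    Assumes r is a power of two. If a half becomes empty, fewer than
    r - 1 boundaries are returned.
    """
    def split(vals, depth):
        if depth == 0 or not vals:
            return []
        mean = sum(vals) / len(vals)
        low = [v for v in vals if v <= mean]   # assumption: ties go low
        high = [v for v in vals if v > mean]
        return split(low, depth - 1) + [mean] + split(high, depth - 1)

    depth = r.bit_length() - 1  # log2(r) for a power of two
    return split(list(values), depth)


def discretize(values, breaks):
    """Map each value to the index of the interval it falls in."""
    return [bisect_left(breaks, v) for v in values]


values = [1, 2, 3, 4, 5, 6, 7, 100]
breaks = nested_means_breaks(values, 4)
bins = discretize(values, breaks)
```

Because each split point is a mean rather than a fixed fraction of the range, the single extreme value (100) takes up one interval of its own and the remaining boundaries still divide the bulk of the data.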
Calculating CE
• For every pair of dimensions (X and Y), discretize the 2D subspace (using the nested means intervals); each cell is then represented in a table by the number of observations that fall in that cell.
• Calculate entropy for every row and column; weight each by the row or column sum divided by the total number of observations.
• Add up weighted row and column entropy values to get CE(Y|X) and CE(X|Y). The maximum of these two values is the final cluster tendency measure.
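The three steps above can be sketched in Python as follows. The function names are hypothetical and the inputs are assumed to be interval indices in the range 0 to r-1. Dividing each entropy by the log of the number of intervals is an assumption drawn from the worked example in Guo et al, not something the steps themselves state.

```python
import math


def count_table(x_bins, y_bins, r):
    """r-by-r table of how many observations fall in each 2D cell."""
    table = [[0] * r for _ in range(r)]
    for i, j in zip(x_bins, y_bins):
        table[i][j] += 1
    return table


def normalized_entropy(counts):
    """Entropy of one row or column, scaled to [0, 1] by log(len(counts)).

    Empty cells contribute nothing (0 log 0 is taken as 0).
    """
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(counts))


def weighted_entropy(vectors):
    """Sum of per-vector entropies, each weighted by its share of points."""
    n = sum(sum(v) for v in vectors)
    return sum(sum(v) / n * normalized_entropy(v) for v in vectors)


def ce_measure(table):
    """Return (row-wise CE, column-wise CE, max of the two)."""
    by_rows = weighted_entropy(table)
    by_cols = weighted_entropy([list(col) for col in zip(*table)])
    return by_rows, by_cols, max(by_rows, by_cols)


# Tiny usage example with two already-discretized dimensions.
table = count_table([0, 0, 1, 1], [0, 0, 1, 1], r=2)
by_rows, by_cols, ce_max = ce_measure(table)
```

Low values indicate that points concentrate in a few cells of the 2D subspace; values near 1 indicate points spread evenly across cells.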
Calculating CE

$$H(C) = -\sum_{x \in \chi} \frac{d(x)\,\log d(x)}{\log |\chi|}$$

| | X1 | X2 | X3 | X4 | X5 | X6 | Sum | Wt | CE |
|---|---|---|---|---|---|---|---|---|---|
| X1 | 0 | 1 | 3 | 0 | 0 | 0 | 4 | .03 | .314 |
| X2 | 1 | 9 | 1 | 0 | 1 | 2 | 14 | .09 | .629 |
| X3 | 7 | 14 | 3 | 7 | 6 | 0 | 37 | .25 | .835 |
| X4 | 7 | 6 | 13 | 19 | 12 | 5 | 62 | .41 | .939 |
| X5 | 0 | 4 | 14 | 5 | 1 | 1 | 25 | .17 | .668 |
| X6 | 1 | 2 | 3 | 2 | 0 | 0 | 8 | .05 | .737 |
| Sum | 16 | 36 | 37 | 33 | 20 | 8 | | | |
| Wt | .11 | .24 | .25 | .22 | .13 | .05 | | | |
| CE | .597 | .847 | .806 | .615 | .540 | .502 | | | |

CE(Y|X) = .700, CE(X|Y) = .812, CEmax = .812

Example taken from Guo et al: 150 total values, r = 6 intervals.
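As a check on the table, the first row (counts 0, 1, 3, 0, 0, 0; sum 4) can be worked by hand. This assumes d(x) is the fraction of the row's observations that fall in cell x, that the entropy is divided by the log of the number of cells per row (6 here), and that empty cells contribute zero. The second line combines the tabulated row CE values using their weights.

```latex
\begin{aligned}
H(\mathrm{X1}) &= -\frac{\tfrac{1}{4}\ln\tfrac{1}{4} + \tfrac{3}{4}\ln\tfrac{3}{4}}{\ln 6}
  \approx \frac{0.3466 + 0.2158}{1.7918} \approx 0.314 \\[4pt]
\sum_i \mathrm{Wt}_i\,\mathrm{CE}_i &= \tfrac{4}{150}(0.314) + \tfrac{14}{150}(0.629)
  + \tfrac{37}{150}(0.835) + \tfrac{62}{150}(0.939) \\
  &\quad + \tfrac{25}{150}(0.668) + \tfrac{8}{150}(0.737) \approx 0.812
\end{aligned}
```

Both results agree with the table: .314 for row X1, and .812 for the weighted sum over rows, which is also the reported CEmax.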
Graph-theoretic analysis
• CE calculation results in a distance matrix; visualizing the fully-connected graph is of little use.
• We can use graph theory to answer two questions:
  • Topologically, is there a linear order that, when sorted and imaged, can reveal cluster structure?
  • What fully-connected sub-graphs (cliques) exist in my data?
Sample ordering – the MST
• A minimum spanning tree (MST) is a spanning tree of a weighted graph whose total weight (the sum of the weights of its edges) is at a minimum.
• We can use the topological ordering of the MST to create a relative ordering of our samples. Sorting the samples in this way in a data image can reveal structure.
• We used Kruskal’s algorithm, a greedy approach to generating an MST, as implemented in the RBGL R library (mstree.kruskal()).
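A minimal Python sketch of this step follows. It is not the RBGL implementation: Kruskal's algorithm is written out with a small union-find, and the linear order is taken from a depth-first walk of the tree that visits the nearest neighbor first. That traversal is an assumption, since the slides do not say how the order is read off the tree, and the function names are hypothetical.

```python
def kruskal_mst(dist):
    """Edges (i, j, weight) of an MST of the complete graph on dist."""
    n = len(dist)
    parent = list(range(n))

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = sorted((dist[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # greedy step: keep edges that join components
            parent[ri] = rj
            tree.append((i, j, w))
            if len(tree) == n - 1:
                break
    return tree


def mst_order(tree, n, start=0):
    """Depth-first preorder of the tree, nearest neighbor first (assumed)."""
    adj = {v: [] for v in range(n)}
    for i, j, w in tree:
        adj[i].append((w, j))
        adj[j].append((w, i))
    order, seen, stack = [], set(), [start]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        # Push far neighbors first so the closest is popped next.
        for w, u in sorted(adj[v], reverse=True):
            if u not in seen:
                stack.append(u)
    return order


# Four samples at positions 0, 10, 1, 11 on a line.
pos = [0, 10, 1, 11]
dist = [[abs(a - b) for b in pos] for a in pos]
tree = kruskal_mst(dist)
order = mst_order(tree, len(pos))
reordered = [[dist[i][j] for j in order] for i in order]
```

The reordered matrix places samples that are close in the tree next to each other, which is what makes block structure visible when the matrix is drawn as an image.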
Use of the MST to Induce Orderings on the Dimensions
• similar to UPGMA tree-building
• the linear ordering can be viewed as a 1D compression of the resulting hierarchical tree
MST orderings on the image of the CE values
• After ordering the samples according to their MST order, use of R’s image() method can generate the image at right.
• This ordering can show us formerly-hidden cluster structure without any presupposition.
Ascertaining Clusters of Dimensions Based on the Maximal Cliques of the Complete CE Graph
• If we can see cluster structure, can we retrieve it in an automatic fashion?
• On the fully-connected graph, break all edges longer than a threshold distance (somewhat subjective; varies between data sets).
• On the resulting graph, find all cliques (fully-connected node sets).
• We used clique() from Dr. Marchette’s graph library.
• Future work: a more efficient method is required.
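The thresholding and clique search can be sketched in Python as below. This does not reproduce the clique() routine mentioned above; it substitutes the standard Bron-Kerbosch algorithm with pivoting to enumerate maximal cliques. The function names and the threshold value are hypothetical.

```python
def threshold_graph(dist, threshold):
    """Adjacency sets after breaking every edge longer than threshold."""
    n = len(dist)
    return {i: {j for j in range(n) if j != i and dist[i][j] <= threshold}
            for i in range(n)}


def maximal_cliques(adj):
    """All maximal cliques, via Bron-Kerbosch with pivoting."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices.
        if not p and not x:
            cliques.append(sorted(r))
            return
        # Pivot on the vertex with the most neighbors among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Samples 0-2 are mutually close, as are samples 3-4; the groups are far apart.
close = [{0, 1, 2}, {3, 4}]
dist = [[0 if i == j else
         (1 if any(i in g and j in g for g in close) else 10)
         for j in range(5)] for i in range(5)]
cliques = maximal_cliques(threshold_graph(dist, threshold=2))
```

Enumerating maximal cliques can take exponential time in the worst case, which is consistent with the note above that a more efficient method is needed.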
Implementation details
• Nested means discretization and calculation of conditional entropy written in R
• MST ordering and dot files (our graph format of choice) written in Perl
• Graphs visualized using AT&T’s Graphviz
• All input and output files are tab-delimited ASCII text
Anecdotal Results
Artificial Data Set
• 1000 observations in $\mathbb{R}^{100}$ distributed N(0,1) in each of the variates
• Observations 1-250 translated by +3 in dimensions {5,6,7,8}
• Observations 251-500 translated by –3 in dimensions {24,25,26,27,28,29,30}
• Observations 501-750 translated by +5 in dimensions {55,56,57,58,59,60,61,62,63,64,65,66,67}
• Observations 751-1000 translated by –5 in dimensions {10,11,12,13,14}
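A data set with this design can be generated with a short Python sketch such as the one below. The seed, the function name, and the use of Python's standard random module are assumptions; observation and dimension numbers are treated as 1-based, as in the description.

```python
import random

# (observations, shift, dimensions), all 1-based and inclusive as on the slide.
SHIFTS = [
    (range(1, 251), +3, range(5, 9)),
    (range(251, 501), -3, range(24, 31)),
    (range(501, 751), +5, range(55, 68)),
    (range(751, 1001), -5, range(10, 15)),
]


def make_artificial_data(n_obs=1000, n_dim=100, seed=0):
    """Rows are observations, columns are dimensions."""
    rng = random.Random(seed)  # seed is an arbitrary choice for repeatability
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_dim)]
            for _ in range(n_obs)]
    for obs_range, shift, dim_range in SHIFTS:
        for obs in obs_range:
            for dim in dim_range:
                data[obs - 1][dim - 1] += shift  # convert to 0-based indices
    return data


data = make_artificial_data()
```

Each block of 250 observations forms a cluster only within its own subset of dimensions, so the cluster structure is not visible in the remaining dimensions.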
Artificial dataset results - MST
Image of Sorted CE Values for the Artificial Dataset
Golub dataset
• An experiment to determine the ability of microarray data to separate acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL).
• Custom microarray, 7,129 genes
• 72 samples
  • 47 ALL samples (both B- and T-cell)
  • 25 AML samples
T.R. Golub et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, vol. 286, 531 (1999)
Golub Dataset - MST
• ALL samples
• AML samples
Image of Sorted CE Values for the Golub Dataset
ALL data set
• Acute lymphoblastic leukemia B and T-cell data set contributed to Bioconductor by the Dana Farber Cancer Institute.
• Affymetrix U95Av2 chip, 12,625 genes
• 128 samples
  • 95 B-cell samples
  • 33 T-cell samples
ALL - MST
• B-cell samples
• T-cell samples
Image of Sorted CE Values for the ALL Dataset
Summary/Conclusions
• An informative technique for initial high-level data exploration
• Future direction:
  • Concretely determine sensitivity to noise
  • Develop a visualization tool for the MST ordering
  • A more efficient clique-discovery method
References
• Cheng, C., A. Fu, and Y. Zhang (1999). Entropy-based subspace clustering for mining numerical data. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, USA.
• Getz, G., E. Levine, and E. Domany (2000). Coupled two-way clustering analysis of gene microarray data. PNAS 97:22, 12079.
• Guo, D. et al. Breaking Down Dimensionality: Effective and Efficient Feature Selection for High-Dimensional Clustering. [Name of Conference]. [date]
• Guo, D., D. Peuquet and M. Gahegan (2002). Opening the Black Box: Interactive Hierarchical Clustering for Multivariate Spatial Patterns. The 10th ACM International Symposium on Advances in Geographic Information Systems, McLean, VA, USA.