The MST-kNN with Paracliques (Presentation)
IntroductionThe MST-kNN with Paracliques
Conclusion and Future Research Directions
The MST-kNN with Paracliques
Ahmed Shamsul Arefin, Carlos Riveros, Regina Berretta, Pablo Moscato*
The Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-based Medicine
University of Newcastle
{Ahmed.Arefin, Carlos.Riveros, Regina.Berretta, Pablo.Moscato}@newcastle.edu.au.
February 7, 2015
Presented by Ahmed S. Arefin The MST-kNN + Paracliques — ACALCI 2015
Overview
1 Introduction
    Background
    The Problem
2 The MST-kNN with Paracliques
    Proposed Solution
    Implementation
    Results
3 Conclusion and Future Research Directions
    Conclusion
    Future Research Directions
Introduction
Data clustering is perhaps the most common and widely used approach in data analytics. Over the years, a large number of methods have been developed for clustering. Among those, the graph-based approaches are well known for their advantages in partitioning both real-world and artificial data [Jain et al., 1999].
Graph-based clustering
Graph-based methods generally take a distance matrix computed from the input and build a proximity graph G(V, E), where each vertex represents a data element, each edge represents the presence of a proximity relationship and the weight of the edge represents, in some way, the degree of proximity of the pair of vertices [Anders, 2003].
This is followed by the computation of some subgraphs [Berkhin, 2006], e.g., the Minimum Spanning Tree (MST), the k-Nearest Neighbour Graph (k-NNG), the Relative Neighbourhood Graph (RNG) and so forth.
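The pipeline described above (distance matrix, then proximity subgraphs) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `distance_matrix`, `prim_mst` and `knn_graph`, the Euclidean distance and the toy points are all hypothetical and are not taken from the presentation.

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances (assumed metric); this is the complete proximity graph."""
    return [[math.dist(p, q) for q in points] for p in points]

def prim_mst(dist):
    """Minimum spanning tree of the complete graph, as a set of (i, j) edges with i < j."""
    n = len(dist)
    # For every vertex outside the tree: (cheapest distance to the tree, tree endpoint).
    best = {v: (dist[0][v], 0) for v in range(1, n)}
    edges = set()
    while best:
        v = min(best, key=lambda u: best[u][0])
        _, parent = best.pop(v)
        edges.add((min(v, parent), max(v, parent)))
        for u in best:
            if dist[v][u] < best[u][0]:
                best[u] = (dist[v][u], v)
    return edges

def knn_graph(dist, k):
    """Undirected kNN graph: (i, j) is an edge if j is among the k nearest of i, or vice versa."""
    n = len(dist)
    edges = set()
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        edges.update((min(i, j), max(i, j)) for j in nearest)
    return edges

# Toy data: three points near the origin and two points far away.
points = [(0, 0), (0, 1), (2, 0), (10, 10), (10, 11)]
dist = distance_matrix(points)
mst_edges = prim_mst(dist)
knn_edges = knn_graph(dist, 1)
```

On this toy input the MST has to bridge the two groups with one long edge, while the 1-nearest-neighbour graph does not contain that edge; this difference is what MST/kNN-based clustering exploits.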
The MST-kNN
Among the various known graph clustering methods, the MST-kNN [Inostroza-Ponta et al., 2006] (see also [Gonzalez-Barrios and Quiroz, 2003]) is of interest for our work, as it does not require any ad hoc user-defined parameter.
Further, in terms of the homogeneity and separation indices [Sharan et al., 2003], it has been shown to perform better than classical clustering algorithms such as K-Means and SOMs [Inostroza-Ponta et al., 2007].
The MST-kNN
The MST-kNN’s scalability and performance have been demonstrated in its external-memory [Arefin et al., 2011] as well as in its data-parallel variants [Arefin et al., 2012a] and [Arefin et al., 2012c].
Furthermore, it has been employed in the analysis of large-scale real-world data of various kinds, such as:
- stock market time series data [Inostroza-Ponta et al., 2006],
- yeast gene expression data [Inostroza-Ponta et al., 2011],
- prostate cancer data [Capp et al., 2009],
- breast cancer data [Arefin et al., 2011],
- Alzheimer’s disease data [Arefin et al., 2012b] and so on.
The MST-kNN
The MST-kNN [Inostroza-Ponta et al., 2006] 1
1 The method in [Gonzalez-Barrios and Quiroz, 2003] does not have recursion or an automatic choice of k.
The MST-kNN (Demonstration)
A complete graph formed by 16 Indo-European languages, extracted from the 84 Indo-European languages distance matrix provided in [Dyen et al., 1992]
The MST-kNN (Demonstration)
The Minimum Spanning Tree (MST)
The MST-kNN (Demonstration) - Contd.
Application of the MST-kNN on the 16 Indo-European languages
Note that $k = \min\{\lfloor \ln(n) \rfloor,\ \min\{k : G_{kNN} \text{ is connected}\}\}$ (1)
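Equation (1) can be read as: try k = 1, 2, ... and stop at the first k whose kNN graph is connected, but never exceed the floor of ln(n). A minimal sketch of that reading follows; the helper names (`knn_edges`, `is_connected`, `choose_k`) and the one-dimensional toy distances are assumptions made for illustration, not code from the presentation.

```python
import math

def knn_edges(dist, nodes, k):
    """Undirected kNN edges among `nodes` for a given distance matrix."""
    edges = set()
    for i in nodes:
        nearest = sorted((j for j in nodes if j != i), key=lambda j: dist[i][j])[:k]
        edges.update((min(i, j), max(i, j)) for j in nearest)
    return edges

def is_connected(nodes, edges):
    """Depth-first search over the edge set; True if every node is reachable."""
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    stack, seen = [nodes[0]], {nodes[0]}
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(nodes)

def choose_k(dist, nodes):
    """k = min(floor(ln n), smallest k for which the kNN graph is connected)."""
    cap = math.floor(math.log(len(nodes)))
    for k in range(1, cap + 1):
        if is_connected(nodes, knn_edges(dist, nodes, k)):
            return k
    return cap  # no k up to the cap connects the graph

# Toy data on a line: evenly spaced points vs. two well-separated groups.
line = list(range(8))
two_groups = [0, 1, 2, 3, 100, 101, 102, 103]
nodes = list(range(8))
k_line = choose_k([[abs(a - b) for b in line] for a in line], nodes)
k_groups = choose_k([[abs(a - b) for b in two_groups] for a in two_groups], nodes)
```

With eight evenly spaced points the 1-nearest-neighbour graph is already a path, so k = 1; with two distant groups no small k connects the graph and the cap of floor(ln 8) = 2 applies.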
The MST-kNN (1st Iteration)
Application of the MST-kNN on the 84 Indo-European languages
The Problem
The MST-kNN’s limitation:
The MST-kNN’s outcome does not provide insight into the interactions among the core vertices within the MST-kNN partitions.
The MST-kNN with Paracliques
Proposed Solution:
We propose a modified version, termed the MST-kNN with Paracliques. It adopts the working procedure of the MST-kNN, but uses an iterative approach and integrates paraclique structures into the MST-kNN’s outcome.
Definitions
A clique is a set of vertices in which every vertex has an edge to every other vertex in the set.
A maximal clique is a clique that cannot be extended by adding another vertex. The maximum clique of a graph is a maximal clique that has the largest number of vertices and is arguably the most ‘natural’ cluster in a proximity graph [Ngomo, 2006].
However, finding the maximum clique is a well-known NP-hard problem.
In contrast, the identification of paracliques [Chesler and Langston, 2006] 2 provides a viable alternative.
2 See also ‘quasi-cliques’ in [Abello et al., 2002]
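To make the distinction between maximal and maximum cliques concrete, the sketch below enumerates maximal cliques with the classical Bron-Kerbosch procedure (with pivoting). It is a generic textbook technique chosen for illustration, not necessarily what the authors' implementation does; the function name `maximal_cliques` and the toy graph are assumptions.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph given as {vertex: set_of_neighbours}."""
    def expand(r, p, x):
        # r: current clique, p: candidates that can extend it, x: already-explored vertices.
        if not p and not x:
            yield sorted(r)  # nothing can extend r, so it is maximal
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())

# Toy graph: a triangle 0-1-2 with a tail 2-3-4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
cliques = sorted(maximal_cliques(adj))
maximum = max(cliques, key=len)
```

Here the edges 2-3 and 3-4 are maximal cliques (they cannot be extended) but only the triangle is the maximum clique. The worst-case running time of such enumeration grows exponentially with graph size, which is consistent with the hardness remark above.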
The MST-kNN with Paracliques (Contd.)
We identify paracliques via the identification of the maximal cliques of size 3 or higher present in the kNN graphs reconstructed from the MST-kNN components.
In other words, we collect as paracliques the neighbourhood networks that are present within the MST-kNN components but lack only a few edges to become cliques of a larger size.
This results in more insightful networks among the core vertices in each MST-kNN partition than the ones portrayed by the MST alone.
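The idea of a vertex set that "lacks only a few edges to become a clique" can be illustrated with a glom-style growth step in the spirit of [Chesler and Langston, 2006]: start from a clique and absorb any vertex that is adjacent to all but a few of the current members. This is a sketch of that general formulation under assumptions, not the kNN paraclique procedure shown on the slides; the function name `paraclique`, the `glom` parameter and the toy graph are hypothetical.

```python
def paraclique(adj, seed, glom=1):
    """Grow `seed` (a clique) by absorbing vertices that miss at most `glom` edges to it.

    adj  : {vertex: set_of_neighbours} for an undirected graph
    seed : iterable of vertices forming the starting clique
    glom : assumed tolerance, i.e. how many members a new vertex may be non-adjacent to
    """
    members = set(seed)
    grown = True
    while grown:
        grown = False
        for v in sorted(set(adj) - members):
            missing = len(members - adj[v])  # members that v is NOT connected to
            if missing <= glom:
                members.add(v)
                grown = True
    return members

# Toy graph: clique {0, 1, 2, 3}; vertex 4 misses only the edge to 3; vertex 5 hangs off 4.
adj = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3, 4},
    2: {0, 1, 3, 4},
    3: {0, 1, 2},
    4: {0, 1, 2, 5},
    5: {4},
}
dense_core = paraclique(adj, seed=[0, 1, 2, 3], glom=1)
strict_core = paraclique(adj, seed=[0, 1, 2, 3], glom=0)
```

With `glom=1` vertex 4 joins the core although the edge 3-4 is absent; with `glom=0` the result stays a strict clique. Vertex 5 is never absorbed because it touches only one member.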
The MST-kNN with Paracliques (Algorithm)
The Proposed Method (see lines 8 and 9)
The MST-kNN with Paracliques (Algorithm) - Contd.
The kNN Paracliques Method
The MST-kNN with Paracliques (Implementation)
The proposed method has been implemented in R using the igraph package [Csardi and Nepusz, 2006]. For example:
- For computing the MST and kNN we use the minimum.spanning.tree (Prim’s) and graph.adjacency functions, respectively.
- For finding the maximal cliques we use the decompose.graph, maximal.cliques and induced.subgraph functions, and for retrieving the k maximal cliques we use an order function.
- For merging the graphs, we compute the symmetric graph differences.
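The last two steps (keeping k maximal cliques and merging them into the clustering graph) can be mimicked on plain edge sets. The sketch below is written in Python rather than R and rests on assumptions: it takes "the k maximal cliques" to mean the k largest ones of size 3 or more, and it merges by adding only the clique edges that the base graph lacks (a plain set difference followed by a union), which may differ from the symmetric-difference computation used in the actual implementation. All function names are hypothetical.

```python
from itertools import combinations

def top_k_cliques(cliques, k, min_size=3):
    """Keep the k largest cliques with at least `min_size` vertices (assumed selection rule)."""
    eligible = [c for c in cliques if len(c) >= min_size]
    return sorted(eligible, key=len, reverse=True)[:k]

def clique_edges(clique):
    """All vertex pairs of a clique as (i, j) edges with i < j."""
    return {(min(a, b), max(a, b)) for a, b in combinations(clique, 2)}

def merge(base_edges, cliques):
    """Add to the base graph every clique edge it does not already contain."""
    merged = set(base_edges)
    for c in cliques:
        merged |= clique_edges(c) - merged
    return merged

found = [[0, 1, 2], [2, 3], [3, 4, 5, 6], [7, 8, 9]]
selected = top_k_cliques(found, k=2)
merged = merge({(0, 1), (1, 2), (2, 3)}, [[0, 1, 2]])
```

In the toy call the path 0-1-2-3 gains the single edge 0-2, so the triangle becomes visible while the tree edges are left untouched.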
The MST-kNN with Paracliques (Results)
The MST-kNN + Paracliques on the 84 Indo European Languages
The MST-kNN with Paracliques (Results) - Contd.
168 Shakespearean era plays in [Marsden et al., 2013] (see also [Arefin et al., 2014]: 256 Shakespearean era plays and poems)
The MST-kNN with Paracliques (Results) - Statistical Significance
Table: Significance of clusterings by the MST-kNN and its paraclique variant.

Data                                   Method                     Scoring class             Wilcoxon test p-value   Kruskal-Wallis test p-value
84 Indo-European languages data set    MST-kNN                    9 language groups         1.04E-07                2.09E-07
                                       MST-kNN with Paracliques                             1.02E-07                2.04E-07
168 Shakespearean era plays data set   MST-kNN                    39 authors of the plays   1.13E-10                2.26E-10
                                       MST-kNN with Paracliques                             8.10E-12                1.62E-11

* The Kruskal-Wallis test on the original vs. each of the 1000 random permutations resulted in p-values close to 0.
Conclusion and Future Research Directions
We presented a variant of the MST-kNN method, termed the MST-kNN with Paracliques, which provides more insight into the inter-relations among the partitioned elements.
We envision that the modified method will be a useful data clustering approach for the analysis of data sets in several areas, including bioinformatics, artificial intelligence, image and video analysis, creative arts, and finance.
Conclusion and Future Research Directions - Contd.
Issue 1
At the moment, on smaller data sets, our method’s running time is similar to that of the MST-kNN. At a large scale, however, e.g., with a data set having more than 10,000 elements, it is at least 10 times slower, mainly due to its maximal clique finding component.
Plan
We aim to re-implement this part using a data-parallel approach, which we expect to give a better speedup.
Conclusion and Future Research Directions - Contd.
Issue 2
So far we have only compared our outcomes against the MST-kNN. This is because we initially aimed at enhancing the MST-kNN only, and the original method has already been shown to perform better than traditional clustering methods such as CLICK and SOMs.
Plan
We aim to compare our outcomes against other data partitioning methods, such as DBSCAN for graphs, affinity propagation, spectral clustering, etc. This would also help us to identify the data types for which the proposed method is more appropriate.
Conclusion and Future Research Directions - Contd.
Issue 3
Currently the method is only available (as a beta) to the members of the CIBM research group at the CIBM website, http://cibm.newcastle.edu.au. It is part of a local R tool called CIBM-RUtils.
Plan
We aim to publish it as a data clustering package for R on CRAN, http://cran.r-project.org/ (once Issues 1 and 2 have been resolved).
Thanks + QA
Thanks:
1 Dr. Renato Vimieiro, Lecturer, Centro de Informatica, UFP, Brazil (former CIBM member).
2 All CIBM members, collaborators and users/testers of the CIBM-RUtils.
Thank you + QA?
Abello, J., Resende, M. G., and Sudarsky, S. (2002).
Massive quasi-clique detection.
In LATIN 2002: Theoretical Informatics, pages 598–612. Springer.
Anders, K.-H. (2003).
A hierarchical graph-clustering approach to find groups of objects.
In Proceedings 5th Workshop on Progress in Automated Map Generalization, pages 1–8.
Arefin, A., Riveros, C., Berretta, R., and Moscato, P. (2012a).
kNN-Boruvka-GPU: A fast and scalable MST construction from kNN graphs on GPU.
In Murgante, B., Gervasi, O., Misra, S., Nedjah, N., Rocha, A., Taniar, D., and Apduhan, B.,
editors, Computational Science and Its Applications ICCSA 2012, volume 7333 of Lecture Notes in
Computer Science, pages 71–86. Springer Berlin Heidelberg.
Arefin, A. S., Inostroza-Ponta, M., Mathieson, L., Berretta, R., and Moscato, P. (2011).
Clustering nodes in large-scale biological networks using external memory algorithms.
In Xiang, Y., Cuzzocrea, A., Hobbs, M., and Zhou, W., editors, Algorithms and Architectures for
Parallel Processing, volume 7017 of Lecture Notes in Computer Science, pages 375–386. Springer Berlin
Heidelberg.
Arefin, A. S., Mathieson, L., Johnstone, D., Berretta, R., and Moscato, P. (2012b).
Unveiling clusters of RNA transcript pairs associated with markers of Alzheimer’s disease progression.
PLoS ONE, 7(9):e45535.
Arefin, A. S., Riveros, C., Berretta, R., and Moscato, P. (2012c).
kNN-MST-Agglomerative: A fast and scalable graph-based data clustering approach on GPU.
In Computer Science & Education (ICCSE), 2012 7th International Conference on, pages 585–590. IEEE.
Arefin, A. S., Vimieiro, R., Riveros, C., Craig, H., and Moscato, P. (2014).
An Information Theoretic clustering approach for unveiling authorship affinities in
Shakespearean era plays and poems.
PLoS ONE, 9(10):e111445.
Berkhin, P. (2006).
A survey of clustering data mining techniques.
In Grouping multidimensional data, pages 25–71. Springer.
Capp, A., Inostroza-Ponta, M., Bill, D., Moscato, P., Lai, C., Christie, D., Lamb, D., Turner,
S., Joseph, D., and Matthews, J. (2009).
Is there more than one proctitis syndrome? a revisitation using data from the TROG 96.01
trial.
Radiotherapy and oncology, 90(3):400–407.
Chesler, E. and Langston, M. (2006).
Combinatorial genetic regulatory network analysis tools for high throughput transcriptomic
data.
In Eskin, E., Ideker, T., Raphael, B., and Workman, C., editors, Systems Biology and Regulatory
Genomics, volume 4023 of Lecture Notes in Computer Science, pages 150–165. Springer Berlin
Heidelberg.
Csardi, G. and Nepusz, T. (2006).
The igraph software package for complex network research.
InterJournal, Complex Systems, 1695(5).
Dyen, I., Kruskal, J. B., and Black, P. (1992).
An Indoeuropean classification: a lexicostatistical experiment.
Transactions of the American Philosophical Society, pages iii–132.
Gonzalez-Barrios, J. M. and Quiroz, A. J. (2003).
A clustering procedure based on the comparison between the k nearest neighbors graph and the
minimal spanning tree.
Statistics & Probability Letters, 62(1):23–34.
Inostroza-Ponta, M., Berretta, R., Mendes, A., and Moscato, P. (2006).
An automatic graph layout procedure to visualize correlated data.
In Artificial Intelligence in Theory and Practice, pages 179–188. Springer.
Inostroza-Ponta, M., Berretta, R., and Moscato, P. (2011).
QAPgrid: A two level QAP-based approach for large-scale data analysis and visualization.
PLoS ONE, 6(1):e14468.
Inostroza-Ponta, M., Mendes, A., Berretta, R., and Moscato, P. (2007).
An integrated QAP-based approach to visualize patterns of gene expression similarity.
In Progress in Artificial Life, pages 156–167. Springer.
Jain, A. K., Murty, M. N., and Flynn, P. J. (1999).
Data clustering: a review.
ACM computing surveys (CSUR), 31(3):264–323.
Marsden, J., Budden, D., Craig, H., and Moscato, P. (2013).
Language individuation and marker words: Shakespeare and his Maxwell’s demon.
PLoS ONE, 8(6):e66813.
Ngomo, A.-C. N. (2006).
Clique-based clustering.
Evaluation, 1:10.
Sharan, R., Maron-Katz, A., and Shamir, R. (2003).
CLICK and EXPANDER: A system for clustering and visualizing gene expression data.
Bioinformatics, 19(14):1787–1799.