The MST-kNN with Paracliques (Presentation)


The MST-kNN with Paracliques

Ahmed Shamsul Arefin, Carlos Riveros, Regina Berretta, Pablo Moscato*

The Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-based Medicine

University of Newcastle

{Ahmed.Arefin, Carlos.Riveros, Regina.Berretta, Pablo.Moscato}@newcastle.edu.au.

February 7, 2015

Presented by Ahmed S. Arefin The MST-kNN + Paracliques — ACALCI 2015



Overview

1 Introduction
    Background
    The Problem

2 The MST-kNN with Paracliques
    Proposed Solution
    Implementation
    Results

3 Conclusion and Future Research Directions
    Conclusion
    Future Research Directions





Introduction

Data clustering is perhaps the most common and widely used approach in data analytics. Over the years, a large number of methods have been developed for clustering. Among those, the graph-based approaches are well-known for their advantages in partitioning both real-world and artificial data [Jain et al., 1999].





Graph-based clustering

Graph-based methods generally take a distance matrix computed from the input and build a proximity graph G(V, E), where each vertex represents a data element, each edge represents the presence of a proximity relationship and the weight of the edge represents, in some way, the degree of proximity of the pair of vertices [Anders, 2003].

This is followed by the computation of some subgraphs [Berkhin, 2006], e.g., the Minimum Spanning Tree (MST), the k-Nearest Neighbour Graph (k-NNG), the Relative Neighbourhood Graph (RNG) and so forth.
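To make two of these subgraphs concrete, here is a minimal pure-Python sketch that builds an MST (using Prim's algorithm) and a k-NNG from a distance matrix. The function names, the toy data and the rule that an edge is kept when either endpoint lists the other among its k nearest neighbours are assumptions made for illustration; they are not taken from the cited works.

```python
import math


def minimum_spanning_tree(dist):
    """Prim's algorithm on a complete graph given as a symmetric distance matrix.

    Returns a set of undirected edges (i, j) with i < j.
    """
    n = len(dist)
    if n == 0:
        return set()
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known cost of attaching each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = set()
    for _ in range(n):
        # Pick the cheapest vertex that is not yet in the tree.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.add((min(u, parent[u]), max(u, parent[u])))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


def knn_graph(dist, k):
    """Undirected k-NNG (assumed rule: keep (i, j) if j is among the k
    nearest neighbours of i, or i is among those of j)."""
    n = len(dist)
    edges = set()
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: dist[i][j])[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return edges


# Toy data: four points on a line, with absolute difference as the distance.
points = [0.0, 1.0, 3.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
mst = minimum_spanning_tree(dist)
knn1 = knn_graph(dist, 1)
```

On this toy input both graphs are the path 0-1-2-3; on real data the two edge sets generally differ.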





The MST-kNN

Among the various known graph clustering methods, the MST-kNN [Inostroza-Ponta et al., 2006] (see also [Gonzalez-Barrios and Quiroz, 2003]) is of interest for our work, as it does not require any ad hoc user-defined parameter.

Further, in terms of homogeneity and separation index [Sharan et al., 2003], it has been shown that it performs better than the classical clustering algorithms such as K-Means and SOMs [Inostroza-Ponta et al., 2007].





The MST-kNN

The MST-kNN’s scalability and performance have been demonstrated in its external-memory [Arefin et al., 2011] as well as in data-parallel variants [Arefin et al., 2012a] and [Arefin et al., 2012c].

Furthermore, it has been employed in the analysis of large-scale real-world data of various kinds, such as:

- stock market time series data [Inostroza-Ponta et al., 2006],
- yeast gene expression data [Inostroza-Ponta et al., 2011],
- prostate cancer data [Capp et al., 2009],
- breast cancer data [Arefin et al., 2011],
- Alzheimer’s disease data [Arefin et al., 2012b] and so on.





The MST-kNN

The MST-kNN [Inostroza-Ponta et al., 2006] (1)

(1) The method in [Gonzalez-Barrios and Quiroz, 2003] does not have recursion and automatic k.





The MST-kNN (Demonstration)

A complete graph formed by 16 Indo-European languages, extracted from the 84 Indo-European languages distance matrix provided in [Dyen et al., 1992]





The MST-kNN (Demonstration)

The Minimum Spanning Tree (MST)





The MST-kNN (Demonstration) - Contd.

Application of the MST-kNN on the 16 Indo-European Languages

Note that

$$k = \min\{\lfloor \ln(n) \rfloor;\ \min k \mid G_{kNN} \text{ is connected}\} \qquad (1)$$
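A minimal Python sketch of Equation (1) and the recursive partitioning, under stated assumptions. It follows the commonly described MST-kNN procedure: intersect the MST with the kNN graph, then recurse on each connected component until no further split occurs. The function names, reading n as the size of the current component, the "either endpoint" kNN edge rule, and the choice not to split components with fewer than three vertices are all assumptions for illustration, not the authors' R implementation.

```python
import math


def prim_mst(nodes, dist):
    """MST edges (Prim) over `nodes`, reading weights from the matrix `dist`."""
    nodes = list(nodes)
    best = {v: (math.inf, None) for v in nodes[1:]}  # v -> (cost, parent)
    current, edges = nodes[0], set()
    while best:
        for v in best:
            if dist[current][v] < best[v][0]:
                best[v] = (dist[current][v], current)
        nxt = min(best, key=lambda v: best[v][0])
        edges.add(frozenset((nxt, best[nxt][1])))
        del best[nxt]
        current = nxt
    return edges


def knn_edges(nodes, dist, k):
    """kNN graph edges: (u, v) is kept if v is among the k nearest of u."""
    edges = set()
    for u in nodes:
        nearest = sorted((v for v in nodes if v != u),
                         key=lambda v: dist[u][v])[:k]
        edges.update(frozenset((u, v)) for v in nearest)
    return edges


def components(nodes, edges):
    """Connected components as sorted lists, found by depth-first search."""
    adj = {v: set() for v in nodes}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u] - seen:
                seen.add(w)
                stack.append(w)
        comps.append(sorted(comp))
    return comps


def choose_k(nodes, dist):
    """Equation (1): k = min{floor(ln n), smallest k with a connected kNN graph}."""
    n = len(nodes)
    k_log = math.floor(math.log(n))
    k_conn = next(k for k in range(1, n)
                  if len(components(nodes, knn_edges(nodes, dist, k))) == 1)
    return min(k_log, k_conn)


def mst_knn(nodes, dist):
    """Recursively partition `nodes`; returns a list of clusters."""
    nodes = list(nodes)
    if len(nodes) < 3:  # assumption: tiny components are not split further
        return [sorted(nodes)]
    k = choose_k(nodes, dist)
    kept = prim_mst(nodes, dist) & knn_edges(nodes, dist, k)
    comps = components(nodes, kept)
    if len(comps) == 1:  # no split occurred, so stop recursing
        return comps
    clusters = []
    for comp in comps:
        clusters.extend(mst_knn(comp, dist))
    return clusters


# Toy data: two well-separated groups of points on a line.
points = [0.0, 1.0, 2.5, 10.0, 11.0, 12.5]
dist = [[abs(a - b) for b in points] for a in points]
clusters = mst_knn(range(len(points)), dist)
```

For these six points, k evaluates to 1 and the long MST edge between the two groups is not a kNN edge, so the intersection splits the data into two clusters.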





The MST-kNN (1st Iteration)

Application of the MST-kNN on the 84 Indo-European Languages





The Problem

The MST-kNN’s Limitation:

The MST-kNN’s outcome does not provide insight into the interactions among the core vertices within the MST-kNN partitions.





The MST-kNN with Paracliques

Proposed Solution:

We propose a modified version termed the MST-kNN with Paracliques. It follows the working procedure of the MST-kNN, but uses an iterative approach and integrates paraclique structures into the MST-kNN’s outcome.





Definitions

A clique is a set of vertices in which every vertex has an edge to every other vertex in the set.

A maximal clique is a clique that cannot be extended by adding another vertex. The maximum clique of a graph is a maximal clique that has the largest number of vertices and is arguably the most ‘natural’ cluster in a proximity graph [Ngomo, 2006].

However, the problem of finding the maximum clique is a well-known NP-hard problem.
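The difference between maximal and maximum cliques can be illustrated with a short sketch. It enumerates all maximal cliques with the Bron-Kerbosch algorithm (with pivoting), a standard technique chosen here for illustration, and then picks a largest one. The function name and the toy graph are assumptions.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch and pivoting.

    `adj` maps each vertex to the set of its neighbours (no self-loops).
    Returns a sorted list of cliques, each as a sorted list of vertices.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that can extend it,
        # x: vertices already explored, used to avoid duplicates.
        if not p and not x:
            cliques.append(sorted(r))
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy graph: a triangle 0-1-2 with a tail 2-3-4.
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {3},
}
all_maximal = maximal_cliques(adj)
maximum = max(all_maximal, key=len)
```

Here the edges 2-3 and 3-4 are maximal cliques because no vertex can be added to them, but only the triangle is the maximum clique. Enumeration can take exponential time in the worst case, which is consistent with the hardness remark above.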

In contrast, the identification of paracliques [Chesler and Langston, 2006] (2) provides a viable alternative.

(2) See also ‘quasi-cliques’ in [Abello et al., 2002]




The MST-kNN with Paracliques (Contd.)

We identify paracliques via the identification of the maximal cliques of size 3 or higher present in the kNN graphs reconstructed from the MST-kNN components.

In other words, we collect the neighborhood networks as paracliques that are present within the MST-kNN components, but lack only a few edges to become cliques of a larger size.

This results in more insightful networks among the core vertices in each MST-kNN partition than the ones portrayed by the MST alone.
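The idea of a vertex set that lacks only a few edges to be a larger clique can be sketched as follows. This is one simple reading of a paraclique, in which a seed clique absorbs any vertex that is adjacent to all but a few of the current members. It is not the kNN Paracliques Method of these slides, and the function name, the `glom` parameter and the toy graph are hypothetical.

```python
def paraclique(adj, seed_clique, glom=1):
    """Grow `seed_clique` by adding vertices that miss at most `glom` edges.

    `adj` maps each vertex to the set of its neighbours. A vertex outside
    the current set is added when it is adjacent to at least
    len(members) - glom of the current members (assumed rule).
    """
    members = set(seed_clique)
    changed = True
    while changed:
        changed = False
        for v in sorted(set(adj) - members):
            if len(adj[v] & members) >= len(members) - glom:
                members.add(v)
                changed = True
    return sorted(members)


# Toy graph: triangle 0-1-2; vertex 3 is adjacent to 0 and 1 but not 2;
# vertex 4 hangs off vertex 3.
adj = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1},
    3: {0, 1, 4},
    4: {3},
}
relaxed = paraclique(adj, [0, 1, 2], glom=1)
strict = paraclique(adj, [0, 1, 2], glom=0)
```

With `glom=1`, vertex 3 is absorbed because it misses only the edge to vertex 2, while vertex 4 stays out. With `glom=0` the result is the seed clique itself.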





The MST-kNN with Paracliques (Algorithm)

The Proposed Method (see lines 8 and 9)





The MST-kNN with Paracliques (Algorithm) - Contd.

The kNN Paracliques Method





The MST-kNN with Paracliques (Implementation)

The proposed method has been implemented in R using the igraph package [Csardi and Nepusz, 2006]. For example,

For computing the MST and kNN we use the minimum.spanning.tree (Prim’s) and graph.adjacency functions, respectively.

For finding the maximal cliques we use the decompose.graph, maximal.cliques and induced.subgraph functions, and for retrieving the k maximal cliques we use an order function.

For merging the graphs, we compute the symmetric graph differences.





The MST-kNN with Paracliques (Results)

The MST-kNN + Paracliques on the 84 Indo-European Languages





The MST-kNN with Paracliques (Results) - Contd.

168 Shakespearean era plays in [Marsden et al., 2013] (see also [Arefin et al., 2014] - 256 Shakespearean era plays and poems)





The MST-kNN with Paracliques (Results) - Statistical Significance

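Table: Significance of clusterings by the MST-kNN and its paraclique variant.

Data                                   Method                     Scoring class             Wilcoxon test p-value   Kruskal-Wallis test p-value
-------------------------------------  -------------------------  ------------------------  ----------------------  ---------------------------
84 Indo-European languages data set    MST-kNN                    9 language groups         1.04E-07                2.09E-07
84 Indo-European languages data set    MST-kNN with Paracliques   9 language groups         1.02E-07                2.04E-07
168 Shakespearean era plays data set   MST-kNN                    39 authors of the plays   1.13E-10                2.26E-10
168 Shakespearean era plays data set   MST-kNN with Paracliques   39 authors of the plays   8.10E-12                1.62E-11

* The Kruskal-Wallis test 2, on the original vs. the individual 1000 random permutations, resulted in p-values close to 0.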




Conclusion and Future Research Directions


We presented an interesting variant of the MST-kNN method, termed the MST-kNN with Paracliques, which provides more insight into the inter-relations among the partitioned elements.

We envision that the modified method will be a useful data clustering approach for the analysis of data sets in several areas, including bioinformatics, artificial intelligence, image and video analysis, creative arts, and finance.





Conclusion and Future Research Directions - Contd.

Issue 1

At the moment, on smaller data sets, our method’s time performance is similar to the MST-kNN; however, at a large scale, e.g., with a data set having more than 10,000 elements, it is at least 10 times slower, which is mainly due to its maximal clique finding component.

Plan

We aim to re-implement this part using a data-parallel approach, which we expect to give a better speedup gain.






Issue 2

So far we have only compared our outcomes against the MST-kNN. This is because we initially aimed at enhancing the MST-kNN performance only, where the original method has already been shown to perform better than the traditional clustering methods, such as CLICK and SOMs.

Plan

We aim to compare our outcomes against the other data partitioning methods, such as DBSCAN for graphs, affinity propagation, spectral clustering, etc. This would also help us to identify the data types for which the proposed method is more appropriate.






Issue 3

Currently it is only available (beta) for the members of the CIBM research group at the CIBM website http://cibm.newcastle.edu.au. It is part of a local R tool called CIBM-RUtils.

Plan

We aim to publish it as a data clustering package for R on CRAN, http://cran.r-project.org/ (once Issues 1 and 2 have been resolved).






Thanks + QA

Thanks:

1 Dr. Renato Vimieiro, Lecturer, Centro de Informatica, UFP, Brazil (former CIBM member).

2 All CIBM members, collaborators and users/testers of the CIBM-RUtils.

Thank you + QA?





Abello, J., Resende, M. G., and Sudarsky, S. (2002).

Massive quasi-clique detection.

In LATIN 2002: Theoretical Informatics, pages 598–612. Springer.

Anders, K.-H. (2003).

A hierarchical graph-clustering approach to find groups of objects.

In Proceedings 5th Workshop on Progress in Automated Map Generalization, pages 1–8.

Arefin, A. S., Riveros, C., Berretta, R., and Moscato, P. (2012a).

kNN-Boruvka-GPU: A fast and scalable MST construction from kNN graphs on GPU.

In Murgante, B., Gervasi, O., Misra, S., Nedjah, N., Rocha, A., Taniar, D., and Apduhan, B., editors, Computational Science and Its Applications ICCSA 2012, volume 7333 of Lecture Notes in Computer Science, pages 71–86. Springer Berlin Heidelberg.

Arefin, A. S., Inostroza-Ponta, M., Mathieson, L., Berretta, R., and Moscato, P. (2011).

Clustering nodes in large-scale biological networks using external memory algorithms.

In Xiang, Y., Cuzzocrea, A., Hobbs, M., and Zhou, W., editors, Algorithms and Architectures for Parallel Processing, volume 7017 of Lecture Notes in Computer Science, pages 375–386. Springer Berlin Heidelberg.

Arefin, A. S., Mathieson, L., Johnstone, D., Berretta, R., and Moscato, P. (2012b).

Unveiling clusters of RNA transcript pairs associated with markers of Alzheimer’s disease progression.

PLoS ONE, 7(9):e45535.

Arefin, A. S., Riveros, C., Berretta, R., and Moscato, P. (2012c).

kNN-MST-Agglomerative: A fast and scalable graph-based data clustering approach on GPU.

In Computer Science & Education (ICCSE), 2012 7th International Conference on, pages 585–590. IEEE.





Arefin, A. S., Vimieiro, R., Riveros, C., Craig, H., and Moscato, P. (2014).

An Information Theoretic clustering approach for unveiling authorship affinities in Shakespearean era plays and poems.

PLoS ONE, 9(10):e111445.

Berkhin, P. (2006).

A survey of clustering data mining techniques.

In Grouping multidimensional data, pages 25–71. Springer.

Capp, A., Inostroza-Ponta, M., Bill, D., Moscato, P., Lai, C., Christie, D., Lamb, D., Turner, S., Joseph, D., and Matthews, J. (2009).

Is there more than one proctitis syndrome? A revisitation using data from the TROG 96.01 trial.

Radiotherapy and Oncology, 90(3):400–407.

Chesler, E. and Langston, M. (2006).

Combinatorial genetic regulatory network analysis tools for high throughput transcriptomic data.

In Eskin, E., Ideker, T., Raphael, B., and Workman, C., editors, Systems Biology and Regulatory Genomics, volume 4023 of Lecture Notes in Computer Science, pages 150–165. Springer Berlin Heidelberg.

Csardi, G. and Nepusz, T. (2006).

The igraph software package for complex network research.

InterJournal, Complex Systems, 1695(5).

Dyen, I., Kruskal, J. B., and Black, P. (1992).

An Indoeuropean classification: a lexicostatistical experiment.

Transactions of the American Philosophical Society, pages iii–132.





Gonzalez-Barrios, J. M. and Quiroz, A. J. (2003).

A clustering procedure based on the comparison between the k nearest neighbors graph and the minimal spanning tree.

Statistics & Probability Letters, 62(1):23–34.

Inostroza-Ponta, M., Berretta, R., Mendes, A., and Moscato, P. (2006).

An automatic graph layout procedure to visualize correlated data.

In Artificial Intelligence in Theory and Practice, pages 179–188. Springer.

Inostroza-Ponta, M., Berretta, R., and Moscato, P. (2011).

QAPgrid: A two level QAP-based approach for large-scale data analysis and visualization.

PLoS ONE, 6(1):e14468.

Inostroza-Ponta, M., Mendes, A., Berretta, R., and Moscato, P. (2007).

An integrated QAP-based approach to visualize patterns of gene expression similarity.

In Progress in Artificial Life, pages 156–167. Springer.

Jain, A. K., Murty, M. N., and Flynn, P. J. (1999).

Data clustering: a review.

ACM computing surveys (CSUR), 31(3):264–323.

Marsden, J., Budden, D., Craig, H., and Moscato, P. (2013).

Language individuation and marker words: Shakespeare and his Maxwell’s demon.

PLoS ONE, 8(6):e66813.

Ngomo, A.-C. N. (2006).

Clique-based clustering.

Evaluation, 1:10.





Sharan, R., Maron-Katz, A., and Shamir, R. (2003).

CLICK and EXPANDER: A system for clustering and visualizing gene expression data.

Bioinformatics, 19(14):1787–1799.

