Gene selection via significant subset using silhouette index

We choose compact and well-separated clusters of genes by computing the Silhouette Index of Compactness [1,2,3]. Ideally the index would be computed on every possible subset of the N genes, but this may take an impractical amount of time, since there are 2^N such sets. We therefore propose a sub-optimal search that restricts the computation of the index to the sets provided by the hierarchical clustering algorithm, not only at the final stage but at every intermediate step. If there are N genes, there are N such groupings, the first with N clusters (subsets) and the last with a single large cluster, for a total of N(N+1)/2 candidate subsets. Because the groupings overlap, only about 2N distinct subsets need to be processed (the N singletons plus the N - 1 clusters created by merges), and because of the way the clustering algorithm works, most of them will be compact. The Silhouette index then selects the groups that are also well separated from the others.
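The two counts quoted above follow directly from the shape of the merge tree. As a short worked derivation, assuming the usual agglomerative scheme in which every step merges exactly two clusters (so level k of the tree holds k clusters):

```latex
\[
\underbrace{\sum_{k=1}^{N} k}_{\text{cluster slots over all } N \text{ levels}}
  = \frac{N(N+1)}{2},
\qquad
\underbrace{N}_{\text{singletons}} + \underbrace{(N-1)}_{\text{merged clusters}}
  = 2N - 1 \le 2N .
\]
```

For example, N = 6 genes give 21 cluster slots across the six levels but only 11 distinct subsets on which the index has to be evaluated.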

Juan Ignacio Pastore (1,2), Guillermo Abras (1), Diego Sebastían Comas (1), Marcel Brun (1), Virginia Ballarin (1)

(1) Laboratorio de Procesos y Medición de Señales, Facultad de Ingeniería, UNMdP

(2) Comisión Nacional de Investigaciones Científicas y Técnicas CONICET

[email protected]


Gene selection is an important task in bioinformatics, in which significant genes are chosen according to some criterion of significance. In classification problems, such as disease vs. normal or tissue type, the criterion is the ability to provide good features for the classification task. In other cases it is of interest to select large groups of genes with similar behavior, regardless of the class. This task is usually carried out by a clustering algorithm, in which the whole family of genes, or a subset of them, is grouped into significant clusters. These techniques provide insight into possible co-regulation between genes, but they usually produce large, sometimes enormous, sets, depending on the number of clusters required. In this work we present a new algorithm that provides sets of genes with very similar expression. This is made possible by using the complete clustering tree produced by the hierarchical clustering algorithm, together with the Silhouette index for ranking the subsets.

[1] Peter J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics, 20 (1), pp. 53-65, 1987.

[2] John V. Pearson et al., "Identification of the Genetic Basis for Complex Disorders by Use of Pooling-Based Genomewide Single-Nucleotide-Polymorphism Association Studies", The American Journal of Human Genetics, 80, pp. 126-139, 2007.

[3] Jianping Hua, David W. Craig, Marcel Brun, Jennifer Webster, Victoria Zismann, Waibhav Tembe, Keta Joshipura, Matthew J. Huentelman, Edward R. Dougherty, Dietrich A. Stephan, "SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays", Bioinformatics, 23 (1), pp. 57-63, 2007.

Microarray Data

Performance results

Conclusion

The proposed software may be a powerful tool for biologists and computational biology researchers interested in generating new hypotheses about co-expressed genes that are not provided by more standard analysis tools.

Experimental Data

Experiment: E-GEOD-15653. Submitter(s): Patti Lab, Joslin Diabetes Center, Mary Elizabeth Patti.

(Generated description): Experiment with 18 hybridizations, using 18 samples of species [Homo sapiens], using 18 arrays of array design [Affymetrix GeneChip® Human Genome HG-U133A [HG-U133A]], producing 18 raw data files and 18 transformed and/or normalized data files.

(Submitter's description 1): Hepatic lipid accumulation is an important complication of obesity linked to risk for type 2 diabetes. To identify novel transcriptional changes in human liver which could contribute to hepatic lipid accumulation and associated insulin resistance and type 2 diabetes (DM2), we evaluated gene expression and gene set enrichment in surgical liver biopsies from 13 obese (9 with DM2) and 5 control subjects, obtained in the fasting state at the time of elective abdominal surgery for obesity or cholecystectomy. RNA was isolated for cRNA preparation and hybridized to Affymetrix U133A microarrays. Experiment Overall Design: Human liver samples were obtained from 5 lean control subjects undergoing elective cholecystectomy and 13 obese subjects (with or without type 2 diabetes) undergoing gastric bypass surgery. Subjects with diabetes were classified as either well-controlled or poorly-controlled.

Experiments

[Figure: three scatter plots with axes RAS1 and RAS2, each ranging from 0 to 1; panel titles QI: 0.69, QI: 0.29 and QI: 0.43.]

Silhouette Index

[Figure: Program’s interface for gene selection.]

$$S = \frac{1}{K} \sum_{k=1}^{K} \left[ \frac{1}{n_k} \sum_{x \in C_k} S(x) \right]$$

Algorithm

Hierarchical Clustering

References

$$S(x) = \frac{b(x) - a(x)}{\max\left[ a(x),\, b(x) \right]}$$

$$a(x) = \frac{1}{n_k - 1} \sum_{y \in C_k,\; y \neq x} d(x, y)$$

$$b(x) = \min_{h = 1, \dots, K,\; h \neq k} \left[ \frac{1}{n_h} \sum_{y \in C_h} d(x, y) \right]$$

The Silhouette index measures not only the compactness of the clusters, but also the distance between them. The higher the index, the more compact and the more separated from each other the clusters are.
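The definitions of a(x), b(x), S(x) and S can be turned into a few lines of Python. The following is a minimal sketch rather than the poster's software: the function name `silhouette_index`, the helper `d`, the toy data, and the conventions for singleton clusters and for all-zero distances are assumptions made for illustration.

```python
from statistics import mean


def silhouette_index(clusters, dist):
    """Silhouette index of a partition.

    clusters: list of K >= 2 non-empty lists of points (the sets C_k).
    dist: function d(x, y) returning a non-negative distance.
    Returns (S, per_cluster), where per_cluster[k] is the mean of S(x) over C_k.
    """
    if len(clusters) < 2:
        raise ValueError("b(x) needs at least one other cluster")

    def s_point(k, i):
        own = clusters[k]
        if len(own) == 1:
            return 0.0  # assumption: singletons score 0, since a(x) is undefined
        x = own[i]
        # a(x): mean distance to the other members of the same cluster
        a = sum(dist(x, y) for j, y in enumerate(own) if j != i) / (len(own) - 1)
        # b(x): smallest mean distance to the members of any other cluster
        b = min(mean(dist(x, y) for y in other)
                for h, other in enumerate(clusters) if h != k)
        top = max(a, b)
        return (b - a) / top if top > 0 else 0.0  # assumption: 0 if all distances vanish

    per_cluster = [mean(s_point(k, i) for i in range(len(c)))
                   for k, c in enumerate(clusters)]
    return mean(per_cluster), per_cluster


def d(u, v):
    """Toy one-dimensional distance used only for this example."""
    return abs(u - v)


good, _ = silhouette_index([[0.0, 1.0], [10.0, 11.0]], d)
bad, _ = silhouette_index([[0.0, 10.0], [1.0, 11.0]], d)
print(round(good, 3), round(bad, 3))
```

On this toy data the natural grouping scores about 0.90, while the grouping that mixes the two ends scores -0.45, which matches the reading that higher values mean compact, well-separated clusters. The per-cluster averages returned alongside S are one possible way to rank individual subsets.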

[Figure: processing pipeline with the stages Microarray Data, Hierarchical Clustering, Silhouette Index and Selected Sets.]

| Probe Set ID | Gene Title | Gene Symbol | Biological Process Term | Molecular Function Term | Cellular Component Term |
| --- | --- | --- | --- | --- | --- |
| 204550_x_at | glutathione S-transferase mu 1 | GSTM1 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm |
| 204418_x_at | glutathione S-transferase mu 2 (muscle) | GSTM2 | metabolic process | glutathione transferase activity | |
| 215333_x_at | glutathione S-transferase mu 1 | GSTM1 | metabolic process | glutathione transferase activity | |

Hierarchical clustering is an agglomerative partitioning algorithm that identifies compact subsets of the data through an iterative procedure. The result of the algorithm is a dendrogram, a tree structure that records every step of the grouping process.
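The iterative procedure can be sketched as follows. This is an illustration under stated assumptions, not the poster's implementation: the linkage rule is not specified in the text, so average linkage with Euclidean distance is assumed, and the names `agglomerate`, `profiles` and `levels` are hypothetical.

```python
from itertools import combinations
from math import dist


def agglomerate(profiles):
    """Agglomerative clustering that keeps every intermediate grouping.

    profiles: list of equal-length numeric tuples (one expression profile per gene).
    Returns a list of partitions; each cluster is a sorted tuple of gene indices.
    The first partition has N singletons, the last one a single cluster.
    """
    clusters = [(i,) for i in range(len(profiles))]
    levels = [list(clusters)]

    def linkage(pair):
        # Assumed rule: average pairwise Euclidean distance between two clusters.
        a, b = pair
        total = sum(dist(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the two closest clusters and record the new grouping.
        a, b = min(combinations(clusters, 2), key=linkage)
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        levels.append(list(clusters))
    return levels


# Toy profiles, invented for the example.
profiles = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (5.0, 5.1), (5.1, 5.0), (9.0, 0.0)]
levels = agglomerate(profiles)
n = len(profiles)
slots = sum(len(partition) for partition in levels)       # cluster slots over all levels
distinct = {c for partition in levels for c in partition}  # subsets actually to be scored
print(len(levels), slots, len(distinct))
```

Keeping the whole list of groupings, rather than a single cut of the dendrogram, is what exposes every intermediate subset as a candidate. For the six toy profiles this gives 6 levels, 21 cluster slots and 11 distinct subsets.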

Introduction

Software Implementation

Results

In the sample data used for testing, the top selected sets were consistent, and many of the genes in each group were related by function. The accompanying table shows one of the top sets, with a Silhouette index of 0.94. It consists of two probes for the same gene (GSTM1) and one probe for gene GSTM2; both genes are members of the mu class of enzymes, which functions in the detoxification of electrophilic compounds.
