Gene selection via significant subset using silhouette index


Transcript of Gene selection via significant subset using silhouette index

Ideally, we would choose compact and well-separated clusters of genes by computing the Silhouette Index of Compactness [1,2,3] on every possible subset of the N genes. This exhaustive approach takes an impractical amount of time, since there are 2^N such sets; therefore we propose a sub-optimal search that limits the computation of the index to the sets provided by the hierarchical clustering algorithm, not only at the final stage but at every intermediate step. If there are N genes, there will be N such groupings, the first one with N clusters (subsets) and the last one with only 1 large cluster, making a total of N(N+1)/2 candidate subsets. Because the groupings overlap, there are only about 2N distinct subsets to process (N single genes plus N-1 merged clusters), and because of the way the clustering algorithm works, most of them will be compact. The Silhouette index ensures that the selected groups are also separated from the others.
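The subset counts above can be checked on a toy example. The following is a minimal sketch, not the authors' implementation: it assumes SciPy's `linkage` as the hierarchical clustering routine and average linkage as the merge rule (the poster names neither), and `dendrogram_subsets` is a hypothetical helper introduced only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def dendrogram_subsets(Z, n):
    """Return every distinct cluster in the tree as a frozenset of gene indices."""
    members = {i: frozenset([i]) for i in range(n)}  # the N single-gene leaves
    for step, (a, b, _dist, _size) in enumerate(Z):
        # SciPy labels the cluster created at merge number `step` as n + step.
        members[n + step] = members[int(a)] | members[int(b)]
    return list(members.values())


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))        # toy data: 6 "genes" measured on 4 samples
n = len(X)
Z = linkage(X, method="average")   # assumption: average linkage, Euclidean distance
subsets = dendrogram_subsets(Z, n)

# distinct subsets in the tree, subsets counted with overlap, all possible subsets
print(len(subsets), n * (n + 1) // 2, 2 ** n)   # prints: 11 21 64
```

For N = 6 the tree holds 2N-1 = 11 distinct subsets, against 21 when every grouping is counted separately and 64 for the exhaustive search.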

Juan Ignacio Pastore (1,2), Guillermo Abras (1), Diego Sebastían Comas (1), Marcel Brun (1), Virginia Ballarin (1)

(1) Laboratorio de Procesos y Medición de Señales, Facultad de Ingeniería, UNMdP

(2) Comisión Nacional de Investigaciones Científicas y Técnicas, CONICET

[email protected]

Gene Selection via Significant Subset using Silhouette Index

Gene selection is an important task in bioinformatics, where significant genes are chosen using some criterion of significance. In the case of classification (disease vs. normal, tissue type, etc.), the criterion is the ability to provide good features for the classification task. In other cases it is interesting to select large groups of genes with similar behavior, regardless of the class. This task is usually carried out by a clustering algorithm, where the whole family of genes, or a subset of them, is grouped into significant clusters. These techniques provide insight into possible co-regulation between genes, but they usually produce large, sometimes enormous, sets, depending on the number of clusters required. In this work we present a new algorithm that provides sets of genes with very similar expression. This is possible by using the complete clustering tree provided by the hierarchical clustering algorithm, and the Silhouette index for ranking the subsets.

[1] Rousseeuw, Peter J., "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics, 20 (1), pp. 53-65, 1987.

[2] Pearson, John V. et al., "Identification of the Genetic Basis for Complex Disorders by Use of Pooling-Based Genomewide Single-Nucleotide-Polymorphism Association Studies", The American Journal of Human Genetics, 80, pp. 126-139, 2007.

[3] Jianping Hua, David W. Craig, Marcel Brun, Jennifer Webster, Victoria Zismann, Waibhav Tembe, Keta Joshipura, Matthew J. Huentelman, Edward R. Dougherty, Dietrich A. Stephan, "SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays", Bioinformatics, 23 (1), pp. 57-63, 2007.

Microarray Data

Performance results

Conclusion

The proposed tool may be valuable for biologists and computational biology researchers interested in generating new hypotheses about co-expressed genes, hypotheses that more standard analysis tools do not provide.

Experiment : E-GEOD-15653 Submitter(s) : Patti Lab : Joslin Diabetes Center Mary Elizabeth Patti.

(Generated description): Experiment with 18 hybridizations, using 18 samples of species [Homo sapiens], using 18 arrays of array design [Affymetrix GeneChip® Human Genome HG-U133A [HG-U133A]], producing 18 raw data files and 18 transformed and/or normalized data files.

(Submitter's description 1): Hepatic lipid accumulation is an important complication of obesity linked to risk for type 2 diabetes. To identify novel transcriptional changes in human liver which could contribute to hepatic lipid accumulation and associated insulin resistance and type 2 diabetes (DM2), we evaluated gene expression and gene set enrichment in surgical liver biopsies from 13 obese (9 with DM2) and 5 control subjects, obtained in the fasting state at the time of elective abdominal surgery for obesity or cholecystectomy. RNA was isolated for cRNA preparation and hybridized to Affymetrix U133A microarrays. Experiment Overall Design: Human liver samples were obtained from 5 lean control subjects undergoing elective cholecystectomy and 13 obese subjects (with or without Type 2 diabetes) undergoing gastric bypass surgery. Subjects with diabetes were classified as either well-controlled or poorly-controlled.

Experimental Data

Experiments

[Figure: three scatter plots with axes RAS1 and RAS2, each ranging from 0 to 1. Panel titles: QI: 0.69, QI: 0.29, QI: 0.43.]

Silhouette Index

Program’s interface for gene selection

\[ S = \frac{1}{K} \sum_{k=1}^{K} \left[ \frac{1}{n_k} \sum_{x \in C_k} S(x) \right] \]

Algorithm

Hierarchical Clustering

References

\[ S(x) = \frac{b(x) - a(x)}{\max\left[ a(x),\, b(x) \right]} \]

\[ a(x) = \frac{1}{n_k - 1} \sum_{y \in C_k,\ y \neq x} d(x, y) \]

\[ b(x) = \min_{h = 1, \dots, K,\ h \neq k} \left[ \frac{1}{n_h} \sum_{y \in C_h} d(x, y) \right] \]
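The silhouette definitions above translate directly into a few lines of NumPy. This is a minimal sketch under stated assumptions, not the poster's software: the function names are hypothetical, the input is a precomputed distance matrix `D` with a zero diagonal, and a point that is alone in its cluster is given S(x) = 0, following the convention in Rousseeuw's paper [1] (the poster does not say how singletons are handled). At least two clusters are required, and a(x) = b(x) = 0 is not handled.

```python
import numpy as np


def silhouette_values(D, labels):
    """Per-point silhouette S(x) from a distance matrix D and cluster labels."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    s = np.zeros(len(labels))
    for i, k in enumerate(labels):
        same = labels == k
        n_k = same.sum()
        if n_k == 1:
            continue  # assumption: S(x) = 0 for a single-member cluster
        # a(x): mean distance to the other members of the same cluster.
        # D[i, i] == 0, so summing over the whole cluster already skips y == x.
        a = D[i, same].sum() / (n_k - 1)
        # b(x): smallest mean distance to the members of any other cluster.
        b = min(D[i, labels == h].mean() for h in clusters if h != k)
        s[i] = (b - a) / max(a, b)
    return s


def silhouette_index(D, labels):
    """Global index S: average over clusters of the mean S(x) in each cluster."""
    labels = np.asarray(labels)
    s = silhouette_values(D, labels)
    return float(np.mean([s[labels == k].mean() for k in np.unique(labels)]))


# Toy example: four points on a line forming two tight, distant pairs.
points = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(points[:, None] - points[None, :])
labels = [0, 0, 1, 1]
print(round(silhouette_index(D, labels), 4))   # prints: 0.8997
```

Note that the global index here averages the per-cluster means, as in the formula for S, rather than averaging S(x) over all points at once; the two differ when cluster sizes are unequal.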

The Silhouette index measures not only the compactness of the clusters, but also the distance between them. The higher the index, the more compact and separated from each other the clusters are.

[Diagram blocks: Microarray Data; Hierarchical Clustering; Selected Sets; Silhouette Index]

Probe Set ID | Gene Title | Gene Symbol | Biological Process Term | Molecular Function Term | Cellular Component Term
204550_x_at | glutathione S-transferase mu 1 | GSTM1 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm
204418_x_at | glutathione S-transferase mu 2 (muscle) | GSTM2 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm
215333_x_at | glutathione S-transferase mu 1 | GSTM1 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm

Hierarchical clustering is an agglomerative partitioning algorithm that identifies compact subsets of the data through an iterative procedure. The result of the algorithm is a dendrogram, a tree structure that records all the steps of the grouping process.

Introduction

Software Implementation

Results

In the sample data used for testing, the top selected sets were consistent, and many of the genes within each group were related by function. One of the top sets, with a Silhouette index of 0.94, consists of two probes for the same gene (GSTM1) and one probe for gene GSTM2; both genes are members of the mu class of enzymes, which functions in the detoxification of electrophilic compounds.