Gene selection via significant subset using silhouette index
We choose compact and well-separated clusters of genes by computing the Silhouette Index of Compactness [1,2,3]. Ideally the index would be computed on every possible subset of the N genes, but this may take an impractical amount of time, since there are 2^N such sets. We therefore propose a sub-optimal search that limits the computation of the index to the sets provided by the Hierarchical Clustering algorithm, not only at the final stage but at every intermediate step. If there are N genes, there will be N such groupings, the first one with N clusters (subsets) and the last one with only 1 large cluster, making a total of N(N+1)/2 candidate subsets. Because consecutive groupings overlap, there are only about 2N distinct subsets to be processed, and because of the way the clustering algorithm works, most of them will be compact. The Silhouette index ensures that the selected groups are also separated from the other ones.
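The search described above can be sketched in a few lines of Python. This is a minimal illustration, not the poster's implementation: the function name `rank_subsets`, the toy expression profiles, the choice of Euclidean distance and average linkage, and the decision to score each newly formed cluster against the partition that exists at the step where it appears are all assumptions made here.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def rank_subsets(profiles):
    """Rank every cluster formed during agglomeration by its mean silhouette.

    Assumptions (not taken from the poster): Euclidean distance, average
    linkage, and scoring against the partition present when the cluster forms.
    """
    n = len(profiles)
    d = [[dist(p, q) for q in profiles] for p in profiles]

    def mean_dist(x, cluster):
        return sum(d[x][y] for y in cluster) / len(cluster)

    def between(c1, c2):
        # Average-linkage distance between two clusters.
        return sum(mean_dist(x, c2) for x in c1) / len(c1)

    def score(cluster, others):
        values = []
        for x in cluster:
            # a(x): mean distance to the other members of the same cluster.
            a = sum(d[x][y] for y in cluster if y != x) / (len(cluster) - 1)
            # b(x): smallest mean distance to any other cluster.
            b = min(mean_dist(x, other) for other in others)
            values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
        return sum(values) / len(values)

    active = [(i,) for i in range(n)]
    ranked = []
    while len(active) > 2:  # the final single cluster has nothing to compare to
        c1, c2 = min(combinations(active, 2), key=lambda pair: between(*pair))
        merged = tuple(sorted(c1 + c2))
        active = [c for c in active if c not in (c1, c2)] + [merged]
        others = [c for c in active if c != merged]
        ranked.append((score(merged, others), merged))
    return sorted(ranked, reverse=True)


# Hypothetical toy data: 5 genes measured on 3 samples.
profiles = [
    (1.0, 2.0, 3.0), (1.1, 2.0, 3.1),   # two nearly identical profiles
    (5.0, 1.0, 0.5), (5.5, 1.5, 0.0),   # a looser pair
    (9.0, 9.0, 9.0),                    # an isolated gene
]
ranked = rank_subsets(profiles)
for value, genes in ranked:
    print(f"{value:.2f}", genes)
```

In this toy run the nearly identical pair of genes is ranked first, which is the behaviour the method aims for: small, very tight sets that are far from everything else rise to the top.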
Juan Ignacio Pastore (1,2), Guillermo Abras (1), Diego Sebastián Comas (1), Marcel Brun (1), Virginia Ballarin (1)
(1) Laboratorio de Procesos y Medición de Señales, Facultad de Ingeniería, UNMdP
(2) Comisión Nacional de Investigaciones Científicas y Técnicas, CONICET
Gene selection is an important task in bioinformatics, in which significant genes are chosen according to some criterion of significance. In the case of classification (disease vs. normal, tissue type, etc.), the criterion is the ability to provide good features for the classification task. In other cases it is of interest to select large groups of genes with similar behavior, regardless of the class. This task is usually carried out by a clustering algorithm, where the whole family of genes, or a subset of them, is grouped into significant clusters. These techniques provide insight into possible co-regulation between genes, but they usually produce large, sometimes enormous, sets, depending on the number of clusters required. In this work we present a new algorithm that provides sets of genes with very similar expression. This is made possible by using the complete clustering tree provided by the hierarchical clustering algorithm, together with the Silhouette index for ranking the subsets.
[1] Rousseeuw, Peter J., "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics, 20 (1), pp. 53-65, 1987.
[2] Pearson, John V. et al., "Identification of the Genetic Basis for Complex Disorders by Use of Pooling-Based Genomewide Single-Nucleotide-Polymorphism Association Studies", The American Journal of Human Genetics, 80, pp. 126-139, 2007.
[3] Jianping Hua, David W. Craig, Marcel Brun, Jennifer Webster, Victoria Zismann, Waibhav Tembe, Keta Joshipura, Matthew J. Huentelman, Edward R. Dougherty, Dietrich A. Stephan, "SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays", Bioinformatics, 23 (1), pp. 57-63, 2007.
Microarray Data
Performance results
Conclusion
The proposed tool may be a powerful aid for biologists and computational biology researchers interested in generating new hypotheses about co-expressed genes, which more standard analysis tools do not provide.
Experiment: E-GEOD-15653. Submitter(s): Patti Lab, Joslin Diabetes Center, Mary Elizabeth Patti.
(Generated description): Experiment with 18 hybridizations, using 18 samples of species [Homo sapiens], using 18 arrays of array design [Affymetrix GeneChip® Human Genome HG-U133A [HG-U133A]], producing 18 raw data files and 18 transformed and/or normalized data files.
(Submitter's description 1): Hepatic lipid accumulation is an important complication of obesity linked to risk for type 2 diabetes. To identify novel transcriptional changes in human liver which could contribute to hepatic lipid accumulation and associated insulin resistance and type 2 diabetes (DM2), we evaluated gene expression and gene set enrichment in surgical liver biopsies from 13 obese (9 with DM2) and 5 control subjects, obtained in the fasting state at the time of elective abdominal surgery for obesity or cholecystectomy. RNA was isolated for cRNA preparation and hybridized to Affymetrix U133A microarrays. Experiment Overall Design: Human liver samples were obtained from 5 lean control subjects undergoing elective cholecystectomy and 13 obese subjects (with or without Type 2 diabetes) undergoing gastric bypass surgery. Subjects with diabetes were classified as either well-controlled or poorly-controlled.
Experimental Data
Experiments
[Figure: three scatter plots of RAS2 versus RAS1, both axes ranging from 0 to 1, labeled QI: 0.69, QI: 0.29 and QI: 0.43.]
Silhouette Index
[Figure: Program's interface for gene selection]
$$S = \frac{1}{K} \sum_{k=1}^{K} \left[ \frac{1}{n_k} \sum_{x \in C_k} S(x) \right]$$
Algorithm
Hierarchical Clustering
References
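For a point x in cluster C_k, with n_k the number of elements of C_k and d(x, y) the distance between x and y:

$$S(x) = \frac{b(x) - a(x)}{\max\left[a(x),\, b(x)\right]}$$

$$a(x) = \frac{1}{n_k - 1} \sum_{y \in C_k,\; y \neq x} d(x, y)$$

$$b(x) = \min_{h = 1, \ldots, K,\; h \neq k} \left[ \frac{1}{n_h} \sum_{y \in C_h} d(x, y) \right]$$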
The Silhouette index measures not only the compactness of the clusters, but also the distance between them. The higher the index, the more compact and the more separated from each other the clusters are.
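A short Python sketch may help make the index concrete. It computes S(x), the per-cluster averages S_k and the overall index S for a partition given as lists of points. The function name `silhouette_index`, the use of Euclidean distance, and the toy points are assumptions made for illustration; assigning S(x) = 0 to a point that is alone in its cluster follows the convention in Rousseeuw [1].

```python
from math import dist  # Euclidean distance, Python 3.8+ (assumed metric)


def silhouette_index(clusters):
    """Return (S, [S_1, ..., S_K]) for a partition with at least 2 clusters.

    `clusters` is a list of clusters, each a list of coordinate tuples.
    """
    per_cluster = []
    for k, ck in enumerate(clusters):
        values = []
        for i, x in enumerate(ck):
            if len(ck) == 1:
                values.append(0.0)  # singleton convention: S(x) = 0
                continue
            # a(x): mean distance from x to the other members of its cluster.
            a = sum(dist(x, y) for j, y in enumerate(ck) if j != i) / (len(ck) - 1)
            # b(x): smallest mean distance from x to any other cluster.
            b = min(
                sum(dist(x, y) for y in ch) / len(ch)
                for h, ch in enumerate(clusters)
                if h != k
            )
            denom = max(a, b)
            values.append((b - a) / denom if denom > 0 else 0.0)
        per_cluster.append(sum(values) / len(values))
    return sum(per_cluster) / len(per_cluster), per_cluster


# Hypothetical toy points: two tight pairs that are far apart.
good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
# The same four points, deliberately grouped the wrong way.
bad = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]

s_good, _ = silhouette_index(good)
s_bad, _ = silhouette_index(bad)
print(round(s_good, 3), round(s_bad, 3))
```

The well-formed partition scores close to 1, while the deliberately mixed one scores below 0, which matches the interpretation of the index as a joint measure of compactness and separation.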
[Diagram: Microarray Data → Hierarchical Clustering → Silhouette Index → Selected Sets]
Probe Set ID | Gene Title | Gene Symbol | Biological Process Term | Molecular Function Term | Cellular Component Term
204550_x_at | glutathione S-transferase mu 1 | GSTM1 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm
204418_x_at | glutathione S-transferase mu 2 (muscle) | GSTM2 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm
215333_x_at | glutathione S-transferase mu 1 | GSTM1 | metabolic process | glutathione transferase activity; transferase activity | cytoplasm
Hierarchical clustering is an agglomerative partitioning algorithm that identifies compact subsets of the data through an iterative procedure. The result of the algorithm is a dendrogram, a tree structure that records all the steps of the grouping process.
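As a rough illustration of the grouping process, the sketch below runs a naive agglomerative clustering and keeps every intermediate partition, which is the information a dendrogram encodes. The function name `agglomerate`, the average-linkage rule, the Euclidean distance and the toy points are assumptions made here; the poster does not state which linkage or metric was used.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+ (assumed metric)


def agglomerate(points):
    """Naive average-linkage agglomerative clustering.

    Returns the list of partitions, from N singletons down to one cluster.
    Each cluster is a frozenset of point indices.
    """
    def linkage(c1, c2):
        # Average distance between all cross-cluster pairs.
        total = sum(dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    current = [frozenset([i]) for i in range(len(points))]
    levels = [list(current)]
    while len(current) > 1:
        # Merge the two closest clusters and record the new partition.
        c1, c2 = min(combinations(current, 2), key=lambda pair: linkage(*pair))
        current = [c for c in current if c not in (c1, c2)] + [c1 | c2]
        levels.append(list(current))
    return levels


# Hypothetical toy data: two pairs of nearby points and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 0.0)]
levels = agglomerate(points)

n = len(points)
total = sum(len(level) for level in levels)            # clusters over all levels
distinct = {c for level in levels for c in level}      # clusters without repeats
print(len(levels), total, len(distinct))
```

For N = 5 points this yields N partitions containing N(N+1)/2 = 15 clusters in total, of which only 2N - 1 = 9 are distinct (the N singletons plus one new cluster per merge), so the number of subsets that actually need to be scored grows linearly with N.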
Introduction
Software Implementation
Results
In the sample data used for testing purposes, the top selected sets showed consistency, and many of the genes in each group were related by function. One of the top sets, with a Silhouette index of 0.94, consists of two probes for the same gene (GSTM1) and one probe for gene GSTM2. Both genes belong to the mu class of enzymes, which functions in the detoxification of electrophilic compounds.