Microarrays revolutionized research in biology and medicine
Before: one gene at a time. Now: tens of thousands simultaneously.
PROTEOMICS
Gene expression
Gene-disease relation
Gene-gene interaction
Finding co-regulated genes
Understanding gene regulatory networks
Many, many more
Slide 3
Basic idea of Microarray: probe, microchip, sample, hybridization
Slide 4
Basic idea of Microarray
Construction
  Place an array of probes on a microchip.
  A probe is, for example, an oligonucleotide ~25 bases long that characterizes a gene or genome.
  Each probe has many, many clones.
  The chip is about 2 cm by 2 cm.
Application principle
  Put a (liquid) sample containing genes on the microarray, allow probe and gene sequences to hybridize, and wash away the rest.
  Analyze the hybridization pattern.
Slide 5
cDNA microarray schema
Slide 6
Microarray analysis
Operation principle: samples are tagged with fluorescent material to show the pattern of sample-probe interaction (hybridization).
A microarray may have 60K probes.
Gene Expression Data
Gene expression data on p genes for n samples: rows are genes, columns are mRNA samples.
Gene expression level of gene i in mRNA sample j =
  Log(Red intensity / Green intensity) for two-color cDNA arrays, or
  Log(Avg. PM - Avg. MM) for oligonucleotide arrays (PM = perfect match, MM = mismatch)

Gene   sample1  sample2  sample3  sample4  sample5  ...
1        0.46     0.30     0.80     1.51     0.90   ...
2       -0.10     0.49     0.24     0.06     0.46   ...
3        0.15     0.74     0.04     0.10     0.20   ...
4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
5       -0.06     1.06     1.35     1.09    -1.09   ...
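The two-color expression level above can be sketched in a few lines of Python. This is only an illustration: the intensity values are hypothetical, and base 2 is an assumption, since the slide does not state the base of the logarithm.

```python
import math

def log_ratio(red, green):
    """Expression level of one spot: log2(red intensity / green intensity).

    Base 2 is an assumption; the slide only says "Log".
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical (red, green) intensities for three genes in one sample.
spots = [(1200.0, 600.0), (500.0, 500.0), (250.0, 1000.0)]
levels = [log_ratio(r, g) for r, g in spots]
print(levels)  # positive: more red than green; negative: more green than red
```

With base 2, a value of 1 means the red channel is twice as bright as the green channel, and a value of -2 means it is four times dimmer.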
Slide 9
Some possible applications
Take a sample from a specific organ to show which genes are expressed.
Compare samples from healthy and sick hosts to find gene-disease connections.
Use probes representing sets of human pathogens for disease detection.
Slide 10
The amount of data from a single microarray is huge
With just two colors, the amount of data on an array with N probes is 2 N.
Cannot analyze pixel by pixel.
Analyze by pattern: cluster analysis.
Slide 11
Major Data Mining Techniques
Link Analysis
Associations Discovery
Sequential Pattern Discovery
Similar Time Series Discovery
Predictive Modeling
Classification
Clustering
Slide 12
Cluster Analysis: grouping similarly expressed genes, cell samples, or both
Strengthens the signal when averages are taken within clusters of genes (Eisen).
Useful (essential?) when seeking new subclasses of cells, tumours, etc.
Leads to readily interpreted figures.
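Averaging within a cluster of genes can be sketched as follows. The gene names, profiles, and cluster membership are hypothetical; the point is only that per-gene noise around a shared pattern tends to cancel in the average.

```python
def cluster_average(profiles, members):
    """Average the expression profiles of the genes in `members`, sample by sample."""
    if not members:
        raise ValueError("cluster must contain at least one gene")
    n_samples = len(profiles[members[0]])
    return [
        sum(profiles[g][j] for g in members) / len(members)
        for j in range(n_samples)
    ]

# Hypothetical profiles: three genes following the same rising pattern, plus noise.
profiles = {
    "geneA": [0.9, 2.1, 2.9],
    "geneB": [1.1, 1.9, 3.1],
    "geneC": [1.0, 2.0, 3.0],
}
average_profile = cluster_average(profiles, ["geneA", "geneB", "geneC"])
print(average_profile)  # close to the shared pattern 1, 2, 3
```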
Slide 13
Some clustering methods and software
Partitioning: K-Means, K-Medoids, PAM, CLARA
Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK
Density-based: CAST, DBSCAN, OPTICS, CLIQUE
Grid-based: STING, CLIQUE, WaveCluster
Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass
Two-way Clustering
Block clustering
Slide 14
A review paper assessing various methods
"Algorithmic Approaches to Clustering Gene Expression Data", Ron Shamir, School of Computer Science, Tel-Aviv University, Tel-Aviv
http://citeseer.nj.nec.com/shamir01algorithmic.html
Conclusion: hierarchical clustering exceptional
Slide 15
Partitioning
Slide 16
Density-based clustering
Slide 17
Hierarchical (used most often): agglomerative or divisive
[Figure: Eisen et al., Proc. Natl. Acad. Sci. USA 95 (1998). Clustered data compared with data randomized by row, by column, and by both; axis: time]
Slide 21
Types of Similarity Measurements
Distance measurements
Correlation coefficients
Association coefficients
Probabilistic similarity coefficients
Slide 22
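Correlation Coefficients
The most popular correlation coefficient is the Pearson correlation coefficient (1892).
Correlation between $X = \{X_1, X_2, \ldots, X_n\}$ and $Y = \{Y_1, Y_2, \ldots, Y_n\}$:
$$s_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$.
$s_{XY}$ is the similarity between X and Y.
From: Shin-Mu Tseng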
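The Pearson correlation between two expression profiles can be computed directly from its definition. The sketch below uses only the standard library; the two profiles are genes 1 and 4 from the expression table earlier in these slides, and the function name `pearson` is an illustrative choice.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return s_xy / math.sqrt(s_xx * s_yy)

gene1 = [0.46, 0.30, 0.80, 1.51, 0.90]
gene4 = [-0.45, -1.03, -0.79, -0.56, -0.32]
print(round(pearson(gene1, gene4), 3))
```

The result always lies between -1 and 1, and a profile correlated with itself gives 1.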
Slide 23
Now can use similarity for tree construction
Normalize the similarity so that $s_{XX} = 1$.
Then have an $n \times n$ similarity matrix S whose diagonal elements are 1.
Define the distance matrix by (for example) $D = 1 - S$.
Diagonal elements of D are 0.
Now use the distance matrix to build a tree (using some tree-building software; recall the lecture on Phylogeny).
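As a rough sketch of this step, the code below builds a similarity matrix S from Pearson correlations and converts it to a distance matrix by subtracting each entry from 1. The helper names are illustrative, and the profiles are the five genes from the expression table earlier in these slides.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def similarity_matrix(profiles):
    """n x n matrix of pairwise Pearson correlations; the diagonal is 1."""
    n = len(profiles)
    return [[pearson(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]

def distance_matrix(similarity):
    """Entry-wise 1 - S, so identical profiles are at distance 0."""
    return [[1.0 - s for s in row] for row in similarity]

profiles = [
    [0.46, 0.30, 0.80, 1.51, 0.90],
    [-0.10, 0.49, 0.24, 0.06, 0.46],
    [0.15, 0.74, 0.04, 0.10, 0.20],
    [-0.45, -1.03, -0.79, -0.56, -0.32],
    [-0.06, 1.06, 1.35, 1.09, -1.09],
]
S = similarity_matrix(profiles)
D = distance_matrix(S)
for row in D:
    print([round(d, 2) for d in row])
```

With this definition the distances range from 0 (perfectly correlated) to 2 (perfectly anti-correlated).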
Slide 24
A dendrogram (tree) for clustered genes
Leaves: genes 1, 2, 3, 4, 5
Cluster 6 = (1,2)
Cluster 7 = (1,2,3)
Cluster 8 = (4,5)
Cluster 9 = (1,2,3,4,5)
Let p = number of genes.
1. Calculate within-class correlation.
2. Perform hierarchical clustering, which will produce (2p-1) clusters of genes. E.g. p = 5 gives 9 clusters, counting the 5 single genes.
3. Average within clusters of genes.
4. Perform testing on averages of clusters of genes as if they were single genes.
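The count of 2p-1 clusters can be checked with a small agglomerative clustering sketch. Average linkage is an assumption here (the slide does not name a linkage rule), and the distance matrix is hypothetical, chosen so that the merges reproduce the tree on this slide.

```python
def agglomerate(dist):
    """Average-linkage agglomerative clustering on a symmetric distance matrix.

    Returns a dict: cluster id -> tuple of gene labels (1-based).
    Ids 1..p are the single genes; ids p+1..2p-1 are merges, in merge order.
    """
    p = len(dist)
    clusters = {i + 1: (i + 1,) for i in range(p)}
    active = set(clusters)
    next_id = p + 1

    def linkage(a, b):
        # Mean distance over all gene pairs drawn from the two clusters.
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(dist[i - 1][j - 1] for i, j in pairs) / len(pairs)

    while len(active) > 1:
        candidates = [(a, b) for a in sorted(active) for b in sorted(active) if a < b]
        a, b = min(candidates, key=lambda ab: linkage(*ab))
        clusters[next_id] = tuple(sorted(clusters[a] + clusters[b]))
        active -= {a, b}
        active.add(next_id)
        next_id += 1
    return clusters

# Hypothetical distances for p = 5 genes.
D = [
    [0.0, 0.1, 0.2, 0.9, 0.9],
    [0.1, 0.0, 0.2, 0.9, 0.9],
    [0.2, 0.2, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.3],
    [0.9, 0.9, 0.9, 0.3, 0.0],
]
clusters = agglomerate(D)
for cluster_id in sorted(clusters):
    print(cluster_id, clusters[cluster_id])
```

Each merge adds one cluster and removes one from the active set, so p genes always need p-1 merges, giving p + (p-1) = 2p-1 clusters in total.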
Slide 25
A real case
Nature, Feb 2000. Paper by Alizadeh, A. A. et al.: "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling"
Slide 26
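Validation Techniques: Hubert's $\Gamma$ Statistic
$X = [X(i, j)]$ and $Y = [Y(i, j)]$ are two $n \times n$ matrices.
$X(i, j)$: similarity of gene i and gene j.
$Y(i, j) = 1$ if genes i and j are in the same cluster, $0$ otherwise.
Hubert's $\Gamma$ statistic represents the point serial correlation between X and Y:
$$\Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{\left(X(i,j) - \bar{X}\right)\left(Y(i,j) - \bar{Y}\right)}{\sigma_X \, \sigma_Y}$$
where $M = n(n-1)/2$, and $\bar{X}$, $\bar{Y}$, $\sigma_X$, $\sigma_Y$ are the means and standard deviations of the entries of X and Y over the M gene pairs.
A higher value of $\Gamma$ represents better clustering quality.
From: Shin-Mu Tseng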
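A minimal sketch of this validation step follows, assuming the normalized (correlation) form of Γ computed over the M = n(n-1)/2 gene pairs. The similarity matrix and the two candidate clusterings are hypothetical.

```python
import math

def hubert_gamma(similarity, labels):
    """Normalized Hubert's Gamma: the correlation, over all gene pairs i < j,
    between similarity X(i, j) and the indicator Y(i, j) that genes i and j
    share a cluster. The normalized form is an assumption of this sketch."""
    n = len(similarity)
    xs, ys = [], []
    for i in range(n - 1):
        for j in range(i + 1, n):
            xs.append(similarity[i][j])
            ys.append(1.0 if labels[i] == labels[j] else 0.0)
    m = len(xs)  # M = n(n-1)/2 pairs
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / m)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / m)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("Gamma is undefined when X or Y is constant")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / m
    return cov / (sd_x * sd_y)

# Hypothetical similarities: genes 0 and 1 are alike, genes 2 and 3 are alike.
S = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
good = hubert_gamma(S, [0, 0, 1, 1])  # clusters match the similarity structure
bad = hubert_gamma(S, [0, 1, 0, 1])   # clusters cut across it
print(round(good, 3), round(bad, 3))
```

The clustering that groups similar genes together scores higher, which is how the statistic is used to compare candidate clusterings.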
Slide 27
Discovering sub-groups
Slide 28
Time Course Data Gene Expression is time-dependent
Slide 29
Sample of time course of clustered genes (axis: time)
Slide 30
Limitations
Cluster analyses:
  Usually outside the normal framework of statistical inference
  Less appropriate when only a few genes are likely to change
  Need lots of experiments
Single-gene tests:
  May be too noisy in general to show much
  May not reveal coordinated effects of positively correlated genes
  Hard to relate to pathways
Slide 31
Some useful links
Affymetrix: www.affymetrix.com
Michael Eisen Lab at LBL (hierarchical clustering software Cluster and TreeView (Windows)): rana.lbl.gov/
Stanford MicroArray Database (Xcluster (Linux)): genome-www4.stanford.edu/MicroArray/SMD/
Review of Currently Available Microarray Software: www.the-scientist.com/yr2001/apr/profile1_010430.html
Microarray DB: www.biologie.ens.fr/en/genetiqu/puces/bddeng.html
Slide 32
Some papers
Eisen, M. B. et al. (1998). "Cluster analysis and display of genome-wide expression patterns." Proc Natl Acad Sci U S A 95(25): 14863-8.
Wen, X. et al. (1998). "Large-scale temporal gene expression mapping of central nervous system development." Proc Natl Acad Sci U S A 95(1): 334-9.
Alon, U. et al. (1999). "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays." Proc Natl Acad Sci U S A 96: 6745-6750.
Spellman, P. T. et al. (1998). "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization." Mol Biol Cell 9(12): 3273-97.