Microarray data analysis
David A. McClellan, Ph.D.
Introduction to Bioinformatics
david_mcclellan@byu.edu
Brigham Young University
Dept. Integrative Biology
25 January 2006
Inferential statistics
Inferential statistics are used to make inferences about a population from a sample.
Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference.
We use a test statistic to decide whether to reject the null hypothesis or fail to reject it. For many applications, we set the significance level to 0.05 and reject the null hypothesis when p < 0.05.
Page 199
Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\text{variability (noise)}} = \frac{\text{difference between mean values}}{\text{variability (noise)}}$$
Questions
Is the sample size (n) adequate?
Are the data normally distributed?
Is the variance of the data known?
Is the variance the same in the two groups?
Is it appropriate to set the significance level to p < 0.05?
Page 199
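As a rough illustration of the ratio above, the sketch below computes a t statistic for two small groups. The slide does not say how the "variability (noise)" term is estimated, so the Welch standard error used here is an assumption, and the intensity values are made up for the example.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(group1, group2):
    """Difference between group means divided by its standard error.

    Assumption: the noise term is the Welch (unequal-variance) standard error.
    """
    difference = mean(group1) - mean(group2)
    noise = sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return difference / noise

# Hypothetical signal intensities for one gene
normal = [10.0, 11.0, 9.0, 10.0]
diseased = [14.0, 15.0, 13.0, 14.0]
t = t_statistic(diseased, normal)
print(round(t, 3))
```

A large absolute value of t means the difference between the means is large relative to the noise within the groups.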
Inferential statistics
| Paradigm | Parametric test | Nonparametric |
| --- | --- | --- |
| Compare two unpaired groups | Unpaired t-test | Mann-Whitney test |
| Compare two paired groups | Paired t-test | Wilcoxon test |
| Compare 3 or more groups | ANOVA | |
Page 198-200
ANOVA
ANalysis Of VAriance
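ANOVA tests whether several conditions all come from the same distribution: it calculates the probability (p-value) of observing differences among the group means at least as large as those measured if they did.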
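A minimal sketch of the one-way ANOVA F statistic, which compares the variance between group means with the variance within groups. The three groups are hypothetical values; converting F to a p-value requires the F distribution and is left out here.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical expression values for one gene under three conditions
conditions = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = anova_f(conditions)
print(f_value)
```

When all conditions share the same mean, F stays near 1; larger values suggest that at least one condition differs.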
Parametric vs. Nonparametric
Parametric tests are applied to data sets that are sampled from a normal distribution (t-tests & ANOVAs)
Nonparametric tests do not make assumptions about the population distribution – they rank the outcome variable from low to high and analyze the ranks
Mann-Whitney test (a two-sample rank test)
Actual measurements are not employed; the ranks of the measurements are used instead
n1 and n2 are the number of observations in samples 1 and 2, and R1 is the sum of the ranks of the observations in sample 1
$$U = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$$
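The U statistic can be sketched in a few lines. This example ranks the pooled observations from low to high, gives tied values their average rank (a common convention, assumed here), and applies the formula with n1, n2, and R1 as defined above. The sample values are hypothetical.

```python
def average_ranks(values):
    """Map each distinct value to its rank (1 = lowest); ties share the average rank."""
    ordered = sorted(values)
    ranks = {}
    for value in set(ordered):
        first = ordered.index(value) + 1   # rank of the first occurrence
        count = ordered.count(value)       # number of tied observations
        ranks[value] = first + (count - 1) / 2
    return ranks

def mann_whitney_u(sample1, sample2):
    """U = n1*n2 + n1*(n1 + 1)/2 - R1, with R1 the rank sum of sample 1."""
    ranks = average_ranks(sample1 + sample2)
    n1, n2 = len(sample1), len(sample2)
    r1 = sum(ranks[v] for v in sample1)
    return n1 * n2 + n1 * (n1 + 1) / 2 - r1

u = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u)
```

The computed U is then compared against a critical value from a Mann-Whitney table for the given n1 and n2.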
Mann-Whitney example
Mann-Whitney table
Wilcoxon paired-sample test
A nonparametric analogue to the paired-sample t-test, just as the Mann-Whitney test is a nonparametric procedure analogous to the unpaired-sample t-test
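The slide gives no formula for this test, so the following is only a sketch of the usual signed-rank procedure: rank the absolute paired differences, then sum the ranks of the positive and of the negative differences. Dropping zero differences and averaging tied ranks are conventional choices assumed here, and the paired values are hypothetical.

```python
def signed_rank_sums(before, after):
    """Return (T+, T-): rank sums of the positive and negative paired differences."""
    differences = [a - b for b, a in zip(before, after) if a != b]  # drop zeros
    magnitudes = sorted(abs(d) for d in differences)

    def rank(magnitude):
        first = magnitudes.index(magnitude) + 1
        return first + (magnitudes.count(magnitude) - 1) / 2  # average rank for ties

    t_plus = sum(rank(abs(d)) for d in differences if d > 0)
    t_minus = sum(rank(abs(d)) for d in differences if d < 0)
    return t_plus, t_minus

t_plus, t_minus = signed_rank_sums([10, 12, 9, 15], [12, 11, 14, 19])
print(t_plus, t_minus)
```

The smaller of the two sums is typically the value looked up in a Wilcoxon table.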
Wilcoxon example
Wilcoxon table
Inferential statistics
Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05.
You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes:
p < (0.05)/10,000 or p < 5 x 10^-6
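The arithmetic on this slide can be checked with a short sketch. The function name is illustrative only; the numbers are the ones used above.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests

alpha, n_genes = 0.05, 10_000
expected_by_chance = alpha * n_genes              # genes expected at p < 0.05 by chance
threshold = bonferroni_threshold(alpha, n_genes)  # corrected per-gene criterion
print(expected_by_chance, threshold)
```

The correction is conservative: it keeps the chance of even one false positive across all 10,000 tests at about 0.05, at the cost of missing genes with modest but real changes.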
Page 199
Page 200
Significance analysis of microarrays (SAM)
SAM
-- an Excel plug-in
-- URL: www-stat.stanford.edu/~tibs/SAM
-- modified t-test
-- adjustable false discovery rate
[Figure: plot of observed versus expected values, with up-regulated and down-regulated genes labeled]
Page 202
Descriptive statistics
Microarray data are high-dimensional: there are many thousands of measurements made from a small number of samples.
Descriptive (exploratory) statistics help you to find meaningful patterns in the data.
A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:
-- Euclidean distance
-- Pearson coefficient of correlation
Page 203
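The two metrics named above can be sketched as follows. The expression profiles are hypothetical; they were chosen so that the two genes have the same shape across samples but very different absolute levels.

```python
from math import sqrt
from statistics import mean

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson coefficient of correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return covariance / spread

# Hypothetical profiles of two genes across four samples
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
distance = euclidean_distance(gene_a, gene_b)
correlation = pearson_correlation(gene_a, gene_b)
print(round(distance, 3), round(correlation, 3))
```

Here the Euclidean distance is large while the correlation is 1, which shows why the choice of metric changes which genes count as related.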
Euclidean Distance
Pearson Correlation Coefficient
Descriptive statistics: clustering
Clustering algorithms offer useful visual descriptions of microarray data.
Genes may be clustered, or samples, or both.
We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first).
In each case, we end up with a tree having branches and nodes.
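A minimal sketch of the agglomerative approach, assuming single linkage (the distance between two clusters is the distance between their closest members), which the slides do not specify. The five objects and their one-dimensional coordinates are hypothetical.

```python
from itertools import combinations

def single_linkage(points):
    """Repeatedly merge the two closest clusters; return each merge in order."""
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > 1:
        def linkage(pair):
            left, right = pair
            # Single linkage: distance between the closest pair of members
            return min(abs(points[p] - points[q]) for p in left for q in right)

        left, right = min(combinations(clusters, 2), key=linkage)
        merged = left | right
        clusters = [c for c in clusters if c not in (left, right)] + [merged]
        merges.append(sorted(merged))
    return merges

# Hypothetical positions of five objects on a line
positions = {"a": 1.0, "b": 1.5, "c": 4.5, "d": 6.0, "e": 6.7}
merges = single_linkage(positions)
for step, members in enumerate(merges, start=1):
    print(step, ",".join(members))
```

Each merge corresponds to one node of the tree; a divisive method would produce the same kind of tree by splitting from the top down instead.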
Page 204
Agglomerative clustering
[Figure, built up over four slides: objects a, b, c, d, e are joined step by step along a scale numbered 0 to 4: first a,b; then d,e; then c,d,e; and finally a,b,c,d,e]
…tree is constructed
Page 206
Divisive clustering
[Figure, built up over five slides: the full set a,b,c,d,e is split step by step along a scale numbered 4 to 0: first c,d,e; then d,e; then a,b; and finally the individual objects a, b, c, d, e]
…tree is constructed
Page 206
[Figure: the same tree of a, b, c, d, e, with the agglomerative and divisive directions marked along scales numbered 0 to 4]
Page 206
Page 207
Cluster and TreeView
[Screenshots of the Cluster and TreeView software; labeled analysis options: clustering, PCA, SOM, K means]
Page 208
Page 209
Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000)
Self-Organizing Maps (SOM)
To download GeneCluster:
http://www.genome.wi.mit.edu/MPR/software.html
Page 211
SOMs are unsupervised neural net algorithms that identify coregulated genes
Two pre-processing steps essential to apply SOMs
1. Variation Filtering:
Data are passed through a variation filter to eliminate those genes showing no significant change in expression across the k samples. This step is needed to prevent nodes from being attracted to large sets of invariant genes.
2. Normalization:
The expression level of each gene is normalized across experiments. This focuses attention on the 'shape' of expression patterns rather than absolute levels of expression.
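The two pre-processing steps can be sketched as below. The filter criterion (minimum range across samples) and the normalization (mean 0, standard deviation 1 per gene) are assumptions chosen for illustration; the slide does not state which criteria are used. Gene names and values are hypothetical.

```python
from statistics import mean, pstdev

def variation_filter(expression, min_range):
    """Keep genes whose expression range across samples is at least min_range."""
    return {gene: values for gene, values in expression.items()
            if max(values) - min(values) >= min_range}

def normalize(values):
    """Rescale one gene's profile to mean 0 and standard deviation 1.

    Invariant genes (zero spread) must be removed by the filter first.
    """
    centre, spread = mean(values), pstdev(values)
    return [(v - centre) / spread for v in values]

# Hypothetical expression levels of three genes across four samples
expression = {
    "flat": [5.0, 5.1, 5.0, 4.9],
    "rising": [1.0, 2.0, 3.0, 4.0],
    "rising_high": [101.0, 102.0, 103.0, 104.0],
}
kept = variation_filter(expression, min_range=1.0)
shapes = {gene: normalize(values) for gene, values in kept.items()}
print(sorted(shapes))
```

After normalization, "rising" and "rising_high" have identical profiles, so they would be attracted to the same node even though their absolute levels differ.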
Principal components analysis (PCA)
An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D
For a matrix of m genes x n samples, create a new covariance matrix of size n x n
Thus transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
Page 211
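A rough sketch of the idea, using only the standard library: build the n x n covariance matrix from an m genes x n samples matrix, then find the first principal component. Power iteration is used here as a simple stand-in for a full eigendecomposition (real analyses would normally call a linear algebra library), and the data values are hypothetical.

```python
import math

def covariance_matrix(rows):
    """n x n covariance between the columns of an m x n matrix."""
    m, n = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / m for j in range(n)]
    centred = [[row[j] - means[j] for j in range(n)] for row in rows]
    return [[sum(r[i] * r[j] for r in centred) / (m - 1) for j in range(n)]
            for i in range(n)]

def first_principal_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    n = len(cov)
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    eigenvalue = sum(vector[i] * sum(cov[i][j] * vector[j] for j in range(n))
                     for i in range(n))
    return vector, eigenvalue

# Hypothetical matrix: 4 genes (rows) x 2 samples (columns)
data = [[2.0, 1.9], [4.1, 4.0], [6.0, 6.2], [8.1, 7.9]]
cov = covariance_matrix(data)
pc1, variance_along_pc1 = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = variance_along_pc1 / total_variance
print(round(explained, 3))
```

The ratio `explained` plays the same role as the percentage attached to a principal component axis in a PCA plot.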
[Figure: PCA plot of control (C1 to C4), sodium (N2 to N4), and lead (P1 to P4) samples. Principal component axis #1 (87%), principal component axis #2 (10%), PC#3: 1%]
Principal components analysis (PCA), an exploratory technique that reduces data dimensionality, distinguishes lead-exposed from control cell lines
Principal components analysis (PCA): objectives
• to reduce dimensionality
• to determine the linear combination of variables
• to choose the most useful variables (features)
• to visualize multidimensional data
• to identify groups of objects (e.g. genes/samples)
• to identify outliers
Page 211
[Figures illustrating PCA; source: http://www.okstate.edu/artsci/botany/ordinate/PCA.htm]
Page 212
[Figure label: Chr 21]
Use of PCA to demonstrate increased levels of gene expression from Down syndrome (trisomy 21) brain