CONTRIBUTION TO STATISTICAL
TECHNIQUES FOR IDENTIFYING
DIFFERENTIALLY EXPRESSED GENES IN
MICROARRAY DATA
By
Ahmed Hossain
SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
AT
UNIVERSITY OF TORONTO
6TH FLOOR, HEALTH SCIENCES BUILDING, 155 COLLEGE STREET, TORONTO, ON
M5T 3M7, CANADA
MARCH 2011
© Copyright by Ahmed Hossain, 2011
Abstract
Contribution to Statistical Techniques for Identifying Differentially Expressed Genes in
Microarray Data
Ahmed Hossain
Doctor of Philosophy
Graduate Department of Dalla Lana School of Public Health
University of Toronto
2011
With the development of DNA microarray technology, scientists can now measure the
expression levels of thousands of genes (features or genomic biomarkers) simultaneously
in one single experiment. Robust and accurate gene selection methods are required to
identify differentially expressed genes across different samples for disease diagnosis or
prognosis. The problem of identifying significantly differentially expressed genes can be
stated as follows: Given gene expression measurements from an experiment of two (or
more) conditions, find a subset of all genes having significantly different expression levels
across these two (or more) conditions.
Analysis of genomic data is challenging due to high dimensionality of data and low
sample size. Currently several mathematical and statistical methods exist to identify
significantly differentially expressed genes. The methods typically focus on gene by gene
analysis within a parametric hypothesis testing framework. In this study, we propose
three flexible procedures for analyzing microarray data.
In the first method we propose a parametric approach based on a flexible
distribution, the Generalized Logistic Distribution of Type II (GLDII), and develop an
approximate likelihood ratio test (ALRT). Though the method considers gene-by-gene
analysis, the ALRT method with the GLDII distributional assumption appears to provide a
favourable fit to microarray data.
In the second method we propose a test statistic for testing whether the area under the
receiver operating characteristic curve (AUC) for each gene is greater than 0.5, allowing
different variances for each gene. This proposed method is computationally less intensive
and can identify genes that are reasonably stable with satisfactory prediction performance.
The third method is based on comparing the AUCs of a pair of genes and is designed
for selecting highly correlated genes in microarray datasets. We propose a
nonparametric procedure for selecting genes with expression levels correlated with that
of a “seed” gene in microarray experiments. The test proposed by DeLong et al. (1988)
is the conventional nonparametric procedure for comparing correlated AUCs. It uses a
consistent variance estimator and relies on asymptotic normality of the AUC estimator.
Our proposed method incorporates DeLong's variance estimation technique when comparing
pairs of genes and can identify genes with biologically sound implications.
In this thesis, we focus on the primary step in the gene selection process, namely, the
ranking of genes with respect to a statistical measure of differential expression. We assess
the proposed approaches by extensive simulation studies and demonstrate the methods
on real datasets. The simulation study indicates that the parametric method performs
well across a range of variance settings, sample sizes and treatment effects. Importantly,
the method is found to be less sensitive to contamination by noise. The proposed nonparametric
methods do not involve complicated formulas and do not require advanced programming
skills. Both methods can identify a large fraction of truly differentially expressed
(DE) genes, especially when sample sizes are large or outliers are present.
We conclude that the proposed methods offer good choices of analytical tools to identify
DE genes for further biological and clinical analysis.
To My Parents
Acknowledgements
The successful completion of this research work is not a result of only my own effort,
but is an aggregate of contributions from many others ranging from my family members
to teachers of this department.
First, I would like to acknowledge my debt of honor to ALLAH, the Almighty, for enabling
me to accomplish this research work successfully.
I would like to express my heartiest thanks to my supervisor Joseph Beyene, Ph.D.,
for his help and guidance, and for leading me into such an interesting area. Without him this
thesis wouldn't have been this thesis. My heartiest gratitude also goes to my honorable
teacher Professor Andrew R. Willan, Ph.D., for giving me the opportunity to do my Ph.D.
here and permitting me to undertake this research work as a partial fulfillment of my
Ph.D. degree in this department at the University of Toronto. I am also thankful to the Hospital
for Sick Children for their financial support, which helped me to proceed with this research.
I also gratefully acknowledge the overseas scholarship scheme of the University of
Toronto for paying my tuition fees, and the studentship from the school for providing
my living maintenance.
I also wish to thank Professor Angelo Canty, Ph.D., Laurent Briollais, Ph.D., and
David Tritchler, Ph.D., for reviewing the thesis and providing valuable input to
improve it.
Of course, I am grateful to my parents for their love and encouragement throughout
my studies. Without them this work would never have come into existence (literally).
Finally, I wish to thank the following: Ping Zhao Hu; Shahnaz (for changing my
life from worse to bad); Zafeera (for changing my life from bad to best); and my two
sisters.
Toronto, Ontario Ahmed Hossain (March 3, 2011)
Contents
1 Introduction to Microarray Technology 1
1.1 Measuring Gene Expression Using Microarrays . . . . . . . . . . . . . . . 1
1.1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Microarray Technologies . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Microarray Gene Expression Dataset . . . . . . . . . . . . . . . . 6
1.2 Background Adjustment and Normalization . . . . . . . . . . . . . . . . 8
1.3 Challenges of Microarray Expression Analysis . . . . . . . . . . . . . 9
1.4 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Assessing Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6.1 An Application to Calculate FDR . . . . . . . . . . . . . . . . . . 16
1.7 Importance of Replicates . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.8 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.8.1 Preprocessing packages . . . . . . . . . . . . . . . . . . . . . . . . 19
1.8.2 Testing packages . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.9 Aim of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.10 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Popular Statistical Methods for Identifying Differentially Expressed
Genes 25
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Fold Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 t-test and ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Nonparametric Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.1 Wilcoxon Rank Sum Test (RST) . . . . . . . . . . . . . . . . . . 28
2.4.2 ROC Methodology for Gene Expression Analysis . . . . . . . . . 29
2.5 SAM-Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6 Empirical Bayes Approach . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.7 Posterior Odds Statistic (LIMMA) . . . . . . . . . . . . . . . . . . . 35
2.8 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Approximate Likelihood Ratio Method 39
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Generalized Logistic Distribution of Type II (GLDII) . . . . . . . . . . . 41
3.3 Motivation and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Comparison between AMLE and MLE for location and scale parameters
of GLDII . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.6 FDR Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.7 Permutation based p-values and AUC Estimation . . . . . . . . . . . . . 55
3.8 Comparison with Other Methods . . . . . . . . . . . . . . . . . . . . . . 55
3.8.1 Simulation Experiment . . . . . . . . . . . . . . . . . . . . . . . 56
3.8.2 Duchenne Muscular Dystrophy (DMD) Data . . . . . . . . . . . 59
3.8.3 Golub Leukemia Data: Classification Between ALL and AML . . 63
3.9 Multiclass Microarray Data . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.9.1 Example of Multi-class microarray data: SRBCT Dataset . . . . . 67
3.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4 Nonparametric Method for Detecting Differentially Expressed Genes:
Single Gene Analysis 72
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Parametric versus Nonparametric Methods . . . . . . . . . . . . . . . . . 74
4.3 General Discussion on ROC analysis . . . . . . . . . . . . . . . . . . . . 76
4.4 Motivation of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5.1 Single Gene Analysis: AUC . . . . . . . . . . . . . . . . . . . . 78
4.6 FDR Estimation with dg statistic . . . . . . . . . . . . . . . . . . . . . . 82
4.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.7.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.7.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.8 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5 Nonparametric Method for Detecting Highly Correlated Differentially
Expressed Genes 93
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Ding’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.1 Correlation Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.1 Comparison of Two ROC Curves . . . . . . . . . . . . . . . . . . 96
5.3.2 Permuted P-values and FDR estimation with D(adj) statistic . . 98
5.3.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.4 Application: Colon Cancer Data . . . . . . . . . . . . . . . . . . . 102
5.3.5 Effect of Seed Gene: Affymetrix spike-in study . . . . . . . . . . . 106
5.4 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6 Conclusion 110
6.1 Thesis Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2.1 Improving the ALRT method . . . . . . . . . . . . . . . . . . . . 113
6.2.2 Possible extension for D(adj) statistic . . . . . . . . . . . . . . . 113
Bibliography 115
CHAPTER 1
Introduction to Microarray Technology
This chapter provides a concise overview of the data-analytic tasks associated with
microarray studies. We give a brief orientation before moving on to the methods for
identifying differentially expressed genes. Here we introduce the foundations of microarray
technology and describe the limitations, concepts and methods in microarray gene
expression analysis used in this thesis.
1.1. Measuring Gene Expression Using Microarrays
1.1.1. Background
The genome consists of long deoxyribonucleic acid (DNA) molecules which are neatly
packed up into chromosomes in the nucleus of each cell. A DNA molecule is a nucleic
acid that consists of two long chains of nucleotides (or strands) twisted together into
a double helix and joined by hydrogen bonds. Each strand is built up by a sequence
of the bases adenine (A), thymine (T), guanine (G) and cytosine (C). The bases are
paired so that an A in one strand can only bind to a T in the other, and a C can only
bind to a G. The two strands are called complementary, since the sequence of one strand
determines that of the other. DNA carries the cell's genetic information and hereditary
characteristics via its nucleotides and their sequence, and is capable of self-replication
and RNA synthesis. Some segments of the DNA sequence contain genetic information,
and these are loosely called genes.
Microarray technology has revolutionized modern biological research by permitting
the study of thousands of genes simultaneously. The central dogma of molecular
biology states that genetic information flows from DNA to messenger RNA (mRNA)
and from mRNA to proteins, which perform gene functions (Crick, 1970). The amount
of mRNA in this process indicates the level of gene expression; that is, the extent to
which a gene is used to produce proteins is known as gene expression. Note that there
are different levels of gene expression, one at the transcription level, where RNA is
made from DNA, and one at the protein level, where protein is made from mRNA.
Microarrays measure gene expression levels on a genomic scale by examining the
amount of mRNA in cell cultures or tissues, and they provide insight into gene function
by quantitatively studying gene expression.
1.1.2. Microarray Technologies
Microarray technology has been applied to many situations, including disease diagnosis,
drug discovery, and toxicology [Schulze and Downward (2001), Brown and
Botstein (1999), Debouck and Goodfellow (1999), Macoska (2002), Lobenhofer et al.
(2001)]. It is therefore important to measure gene expression from the sample under
study. Several techniques are available for measuring gene expression, including se-
rial analysis of gene expression (SAGE), cDNA library sequencing, differential display,
cDNA subtraction, multiplex quantitative RT-PCR, and gene expression microarrays.
Microarrays quantify gene expression by measuring the hybridization, or matching, of
DNA immobilized on a small glass, plastic or nylon matrix to mRNA representation
from the sample under study. There are methods for detecting mRNA
expression of a single gene or a few genes. The novelty of a microarray is that it
quantifies transcript levels on a global scale by quantifying transcript abundance of
thousands of genes simultaneously. This novelty has allowed biologists to take a global
3
perspective on life processes and to study the role of all genes or all proteins at once
(Nguyen et al. (2002)). A detailed explanation of how a microarray experiment is
done can be found in Sambrook and Russell (2001) and Dietz et al. (2003). Although
there are different types of microarrays, all follow these common basic procedures:
• Chip manufacture: A microarray is a small chip (made of chemically coated
glass, nylon membrane or silicon) onto which tens of thousands of DNA molecules
(probes) are attached in fixed grids. Each grid cell relates to a DNA sequence.
• mRNA preparation, labeling and hybridization: Typically, two mRNA
samples (a test sample and a control sample) are reverse transcribed into cDNAs
(targets), labeled using either fluorescent dyes or radioactive isotopes, and then
hybridized with the cloned sequences on the surface of the chip.
• Chip scanning: Chips are scanned to read the signal intensity that is emitted
from the labeled and hybridized targets.
Microarray technologies include several kinds of so-called cDNA arrays and oligonu-
cleotide arrays. Although both exploit hybridization, they differ in how DNA se-
quences are laid on the array and in the length of these sequences. Schena (2000)
reviews in detail the technical aspects of different microarray technologies. Overviews
of different microarray technologies can also be found in Nguyen et al. (2002). However,
most of the results in this thesis are applicable to the oligonucleotide systems developed by
Affymetrix. Here we give a brief overview of the Affymetrix array.
Oligonucleotide Array: The Affymetrix Array
Microarray experiments using Affymetrix technology are widely used. In Affymetrix arrays expression of each
gene is measured by comparing hybridization of the sample mRNA to a set
of probes, composed of 11-20 pairs of oligonucleotides, each of length 25 base
pairs. The first type of probe in each pair is known as perfect match (PM) and is
taken from the gene sequence. Each PM probe is paired with a mismatch (MM)
probe that is created by changing the middle (13th) base of the PM sequence
to reduce the rate of specific binding of mRNA for that gene. The goal of MMs
is controlling for experimental variation and nonspecific binding of mRNA from
other parts of the genome. These two probes (PM, MM) are referred to as a
probe pair. Under relatively ideal situations, when the gene is expressed in the
cell sample, high intensity is expected for the PM probe and low intensity for
the MM probe. In this procedure, an RNA sample is prepared, labeled with a
fluorescent dye, and hybridized to an array. Unlike two-channel arrays, a single
sample is hybridized on a given array. Arrays are then scanned and images are
produced and analyzed to obtain a fluorescence intensity value for each probe,
measuring hybridization for the corresponding oligonucleotide. The software
utilities provided with the Affymetrix suite summarize the probe set intensities
to form one expression measure for each probe set. Oligonucleotide arrays are
discussed by Lockhart et al. (1996); details on Affy arrays can also be found in
Affymetrix (1999).
As microarray technology evolves, study of as many as 20,000 genes is becoming
routine [Reyal et al. (2005), Harbig et al. (2005)]. With the capability to screen large
portions of the human genome, microarrays are typically used for screening large
numbers of genes. In order to obtain meaningful information for the organism being
studied, multiple levels of analysis are performed on the primary data. Figure 1.1
displays the flowchart for a typical microarray experiment. From the analytical point
of view we can separate the microarray experiment into two stages:
Probe-level analysis The first stage of the analysis is probe-level analysis (often
called low-level analysis), which summarizes the raw data to obtain a single expression
value for each gene or probe from the experimental data. The probe-level
analysis should provide reliable measurements of gene or probe expression
levels, leading us to a second stage of analysis, called high-level analysis.
The low level analysis, often associated with the pre-processing stage within the
Figure 1.1: Flowchart for a typical microarray experiment. This figure is modified from http://www.humgen.nl/microarray_analysis.html.
microarray, has increasingly become an area of active research, traditionally in-
volving techniques from classical statistics. Statisticians explore opportunities
for the application of various methods to several important probe-level microar-
ray analysis problems, such as monitoring gene expression, transcript discovery,
genotyping and resequencing.
High level analysis The high level analysis provides answers to the biological ques-
tions that motivate the microarray experiment. Different types of high level
analysis include clustering, classification and projection methods. We concen-
trate our thesis work on high level analysis.
1.1.3. Microarray Gene Expression Dataset
Microarrays produce massive amounts of data. These data, like genome sequence
data, can help us to gain insights into underlying biological processes where they can
be queried, compared and analyzed. A DataMatrix object stores experimental data
in a matrix, with rows typically corresponding to gene names or probe identifiers,
and columns typically corresponding to sample identifiers. A DataMatrix object also
stores metadata, including the gene names or probe identifiers (as the row names)
and sample identifiers (as the column names).
Conceptually, a gene expression dataset can be regarded as consisting of three
parts: the gene expression data matrix, gene annotation and sample annotation (see
Figure 1.2). Gene annotation may include the gene names and sequence information,
location in the genome, a description of the functional roles for known genes and
links to the respective entries in gene sequence databases. Sample annotation may
provide information about the part of the organism from which the sample was taken
or which cell type was used, or whether the sample was treated, and if so what
was the treatment (e.g. a chemical compound and concentration). Samples may
also be related: for instance, they may form a time course. In the Affymetrix raw
data matrix, each row corresponds to a perfect match (PM) probe, and each
column corresponds to an Affymetrix CEL file. Each CEL file is generated from a
separate chip of the same type for each sample, and the image processing software
stores the location and two summary statistics (a mean and standard deviation) for
each probe. It is also important to have information about the experiment itself.
This could include identification number in a database, experimental protocols used,
preprocessing information, and so forth.
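The three-part view described above (expression matrix, gene annotation, sample annotation) can be mimicked with a toy sketch in Python using pandas; all probe, gene and sample identifiers below are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Expression matrix: rows = probe identifiers, columns = sample identifiers
# (a DataMatrix-like layout; values and names are made up for illustration).
expr = pd.DataFrame(
    [[7.1, 6.8, 9.2],
     [5.4, 5.6, 5.3]],
    index=["probe_1001_at", "probe_1002_at"],      # gene/probe identifiers
    columns=["tumour_1", "tumour_2", "normal_1"],  # sample identifiers
)

# Gene annotation, keyed by the same probe identifiers as the matrix rows.
gene_annot = pd.DataFrame({"symbol": ["GENE_A", "GENE_B"]}, index=expr.index)

# Sample annotation, keyed by the same sample identifiers as the columns.
sample_annot = pd.DataFrame(
    {"group": ["tumour", "tumour", "normal"]}, index=expr.columns
)
```

Keeping the three tables keyed on shared row and column names is what lets expression values, gene descriptions and sample conditions be queried and compared together.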
Figure 1.2: Conceptual view of the gene expression data matrix. A gene expression database can be regarded as consisting of three parts: the gene expression data matrix, gene annotation and sample annotation. This figure is from http://www.ebi.ac.uk/2can/databases/microarray2.html.
1.2. Background Adjustment and Normalization
Microarray experiments are complicated tasks involving a number of stages, most of
which have the potential to introduce error. These errors can mask the biological
signal of interest. A component of this error may be systematic, i.e., bias is present,
and it is this component that needs to be removed in order to gain as much insight as
possible into the underlying biology in a microarray experiment. This can be achieved
by using a number of well-established statistical methods. It is difficult to account
for all characteristics of gene expression data in a single model. Most models handle
the experimental data in the following steps, although they may not be performed in this
order:
1. possible background correction, which removes background noise from signal
intensities;
2. possible normalization, which is intended to even out unwanted non-biological
variation across arrays; and
3. summarization, which gives the expression index.
Most classical normalization procedures are global approaches, based on normaliza-
tion of the overall mean or median array intensity to a common standard, such as those
implemented in the Affymetrix GeneChip software (Affymetrix Inc., Santa Clara,
California). Detailed descriptions of Affymetrix normalization methods can be found
in the Version 5.0 Affymetrix Microarray Suite, MAS 5.0 User Guide. Normaliza-
tion methods implemented are similar to scaling and enable comparison analysis of
an expression and baseline array. Ideally, expression indices should be both precise
(low variance) and accurate (low bias) after normalization. Until now a number of
methods of background correction and normalization have been proposed. These
methods include the Lowess normalization by Chen et al. (1997), global linear nor-
malization by Yang et al. (2001), and variance stabilization method of Huber et al.
(2002). Many researchers use ANOVA models for microarray data that can account
for experiment-wide systematic effects that could bias inferences made on the data
from the individual genes. Kerr et al. (2000) proposed an ANOVA model based
on the logarithms of the original fluorescence measurements that can incorporate ex-
perimental factors and gene-specific interaction effects. Robust multiarray average
(RMA), published by Irizarry et al. (2003), is a commonly used method for prepro-
cessing and normalizing Affymetrix data. One component of the RMA method, quantile
normalization, is applicable to all data types. A number of algorithms,
including the aforementioned, have been implemented in the Bioconductor R project
(http://www.bioconductor.org/).
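As a concrete illustration of one of these steps, quantile normalization forces every array to share the same empirical distribution: sort each array's values, average across arrays at each rank, and map the averages back by rank. The sketch below is a minimal version (ties are broken arbitrarily by the sort; production implementations such as those in Bioconductor handle ties more carefully).

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns of a (genes x arrays) matrix so that
    every array (column) ends up with the same empirical distribution."""
    # rank of each value within its own column (double argsort trick)
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # average expression at each rank, taken across arrays
    means = np.sort(x, axis=0).mean(axis=1)
    # substitute the rank-wise averages back into each column
    return means[ranks]
```

After normalization, sorting any two columns yields identical vectors, which is exactly the "common distribution" property the method is designed to enforce.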
Following normalization of the data, a statistical analysis is performed to identify
differentially expressed genes and to find similarities or patterns in gene expression profiles.
The overall results must then be applied to the biological model of interest for its
meaningful and appropriate interpretation.
1.3. Challenges of Microarray Expression Analysis
Gene expression data from microarrays present many challenging problems for ana-
lysts. The data exhibit complicated error structures which are not widely recognized.
High Dimensionality Owing to the large number of genes m and small number of
samples n (m ≫ n), microarray data pose big challenges for statistical
analysis. An obvious problem arising from the "large m, small n" setting is overfitting.
Moreover, with small sample sizes, parametric statistical tests of the differences
between mean levels of gene expression for each gene will be more sensitive
to the assumed distributional forms of the expression data, and the resulting p-values
may not be accurate.
Distribution The measured expression levels are often non-normally distributed.
Also, it is often assumed that all genes have the same-shaped distribution, which
may or may not hold in practice.
Unequal Sample Sizes Very often the sample sizes for the treatment group and the
control group are unequal. If the sizes of the two groups are substantially
different, the model is likely to fit the larger sample better. For example,
if the treatment group has 10 subjects and the control group has 50 subjects, the
model will likely fit the control group better. In this scenario there is a chance
that the two populations have different variances and that one population will be
more skewed than the other, in which case a t-test will not give robust answers
about the data. Using the t-test and assuming normality, if the population difference
between the two groups is 1 and the standard deviations are 1.5 and 1 respectively,
then at the 5% significance level the statistical power, i.e., the chance of
correctly rejecting the null hypothesis, is only about 52.4%
(from http://www.dssresearch.com/toolkit/spcalc/power.asp). On the
other hand, the power becomes 97.5% with 50 subjects per group.
Heterogeneity of variances Gene expression values contain variability caused by
technical noise as well as natural biological variation. A properly chosen transformation
can stabilize the variance and improve the statistical properties of the analysis.
Correlation Structure In reality, many genes are dependent to some degree, be-
cause they are biologically related, whereas others are totally independent. The
correlations in microarray data are not only strong but they are also long-
ranged, involving thousands of genes. The consequences of ignoring this fact
may produce unstable results in estimation or testing [Qiu et al. (2005), Qiu and
Yakovlev (2006), Klebanov and Yakovlev (2007)]. It has gradually become clear
that inter-gene correlations are not just a nuisance complicating
statistical inference from microarray gene expression data, but a rich source of
useful information.
1.4. Filtering
The concept of filtering can be visualized as taking a large matrix of data (possibly an
entire database) and making a smaller matrix. The microarray dataset is quite large
and a lot of the information corresponds to genes that do not show any interesting
changes during the experiment. If certain genes are not associated with the clinical
outcome, it is important to exclude them from the predictive model. After the removal
of irrelevant data, we are left with good-quality data, most of which is probably
still uninteresting. If the goal of the study is to find a couple of dozen genes for
further study of a biologically interesting phenomenon, it is a good idea to remove
the uninteresting part of the data before proceeding with the analysis. Uninteresting
data comprise the genes that do not show any expression changes during the
experiment. A widely used filtering procedure for Affymetrix data is described in the
following.
Affymetrix oligonucleotide chips (Affymetrix, 2002) represent each gene with a
probe set consisting of a number of probe pairs. A probe pair is composed of a
perfect match (PM) probe and a mismatch (MM) probe. An algorithm in MAS
attempts to identify probe sets that are expressed on a particular chip by reporting
a present/absent call. The algorithm compares the expression PM of the PM probe
to the expression MM of the MM probe using a relative expression value given by

(PM − MM) / (PM + MM).
For each probe set, the Wilcoxon signed-rank procedure (Mason et al. (1989)) is
used to test the null hypothesis that the median relative expression of the probe pairs
is less than or equal to some value τ . The MAS default value of τ is 0.015. The call
is then generated by comparing the p-value p from this test with two thresholds, ϵ1
and ϵ2, where ϵ1 < ϵ2. The probe set is declared “present” (i.e., expressed) if p < ϵ1,
“absent” (i.e., unexpressed) if p ≥ ϵ2, and “marginal” if ϵ1 ≤ p < ϵ2. The MAS
defaults for ϵ1 and ϵ2 are 0.04 and 0.06, respectively. These present-absent calls are
often used as part of a filtering criterion that removes genes that do not appear to be
expressed in the biological system under investigation. A widely used technique, the
m/n filter, removes all probe sets having fewer than m present calls among a set of n
chips. Pounds and Cheng (2005) derived the statistical properties of the commonly
used m/n filter and proposed the pooled p-value filter and the error-minimizing pooled
p-value filter. Additionally, they proposed and demonstrated an approach to estimate
the number of interesting probes removed by a filter and assess the influence of the
filter on subsequent analysis.
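The detection-call logic and the m/n filter described above can be sketched as follows. This is a simplified illustration under the stated defaults (τ = 0.015, ε₁ = 0.04, ε₂ = 0.06), not Affymetrix's exact implementation, which includes additional probe-level handling.

```python
import numpy as np
from scipy import stats

def detection_call(pm, mm, tau=0.015, eps1=0.04, eps2=0.06):
    """MAS-style present/absent call for one probe set on one chip.
    pm, mm: arrays of perfect-match and mismatch probe intensities."""
    r = (pm - mm) / (pm + mm)  # relative expression for each probe pair
    # One-sided Wilcoxon signed-rank test of H0: median relative expression <= tau
    p = stats.wilcoxon(r - tau, alternative="greater").pvalue
    if p < eps1:
        return "P"   # present (expressed)
    if p >= eps2:
        return "A"   # absent (unexpressed)
    return "M"       # marginal

def mn_filter(calls, m):
    """m/n filter: keep a probe set only if it earns at least m present
    calls among its n chips."""
    return sum(c == "P" for c in calls) >= m
```

A probe set whose PM intensities clearly dominate the MM intensities receives a "P" call, while one where MM is as bright as PM receives an "A", and the m/n filter then drops probe sets that are rarely called present across the experiment.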
1.5. Analysis
Probe Level Analysis The noisy nature of microarray experiment data has moti-
vated the development of numerous algorithms for estimating gene expression
values. Most models handle the experimental data in several steps, such as
background correction, normalization and summarization. There are several
normalization methods in the published literature:
• Affymetrix Microarray Suite Software version 5.0, MAS 5.0 (Affymetrix,
2002),
• Model-based Expression Index, MBEI (Li and Wong (2001)),
• Robust Multi-array Average, RMA (Irizarry et al. (2003)).
A further issue that needs to be addressed is the difference between the two most
commonly used microarray technologies: spotted cDNA microarrays, which re-
port relative differences in gene expression between two samples, and oligonu-
cleotide microarrays, which report absolute expression levels. Normalization
techniques for one microarray technology might not apply to another, owing to
differences in assumptions and the distributions of the output measurements.
Various microarray technologies for measuring expression values have created
some challenging statistical problems. There are many sources of variation in a
typical experiment and these can be accounted for using statistical design and
analysis-of-variance methodology, although careful attention has to be given
to the high-dimensionality and complicated interactions. Statistical methods
invoked early in the data analysis pipeline can remove systematic errors and
improve subsequent inferences.
High Level Analysis The goal of microarray data analysis is to find relationships
and patterns in the data, and ultimately achieve new insights into the underly-
ing biology. One way to obtain such insights is to determine which genes are
differentially expressed under different conditions. Another important
function of microarray experiments is the classification of biological samples
using gene expression data. The goal of classification is to identify the dif-
ferentially expressed genes that may be used to predict class membership for
new samples. The classes are predefined and the task is to understand the
basis for the classification from a set of labeled objects. Examples of classifi-
cation methods range from linear discriminant analysis (Mardia et al. (1979))
to support vector machines (Brown et al. (2000)) or classification and regression
trees (Breiman et al. (1984)). Clustering applies when the classes are unknown
a priori and need to be discovered from the data. This provides both a visual
representation of the data and a method for measuring similarity between bio-
logical conditions. Widely used methods for clustering microarray data include
hierarchical clustering, K-means and self-organizing maps. The most popular clustering
methods are nicely reviewed by Quackenbush (2001).
1.6. Assessing Significance
There are several approaches for reporting the degree of reliability of results in the
analysis aimed at selecting differentially expressed genes. Test statistics are usually
connected to p-values that are used to control type I error probabilities. The methods
are seen as providing rankings of the genes based on p-values of the parametric tests.
Even if p-values could be derived in gene expression analysis, they would be more
misleading than informative, because of the strong distributional assumptions they
would be based upon. Again conventional approaches based on gene-specific p-values
are generally criticized on the grounds of the multiplicity of comparisons involved
for identifying differentially expressed genes (Dudoit et al. (2002)). For example, a
p-value cutoff of 0.05 means a 5% probability of making a type I error (false
positive) on any one gene, so we expect 500 type I errors (false positive genes) if we
test 10000 genes. That is usually not acceptable. Multiplying the p-value by the
number of genes in the experiment, we can get an estimate of the number of false
positives. We can take this false positive rate into account when planning further
experiments.
Table 1.1: Possible outcomes of multiple hypothesis testing

                 Accepted   Rejected
True H0            T11        T12       m0
True H1            T21        T22       m1
                  m - R        R         m
Multiple testing approaches have relied on the Family-Wise Error Rate (FWER)
and the False Discovery Rate (FDR). The FWER is the probability of at least one false
positive among the genes selected as differentially expressed, that is,
FWER = Pr(T12 > 0)
where T12 is defined in Table 1.1. Bonferroni is perhaps the most widely used method
to control the FWER (see Hochberg and Tamhane (1987)). The Bonferroni correction
can be used to reduce the nominal level of significance. For example, if we allow
only a 5% probability of making one or more type I errors among all 10000 genes, then
with the Bonferroni correction the new cutoff becomes 0.05/10000. That is a very strict
cut-off, and for many purposes we can tolerate more type I errors in expectation.
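The arithmetic behind these two numbers is easy to verify; a short Python sketch mirroring the 10000-gene example above (illustrative only):

```python
# Mirror of the worked example in the text: with 10000 genes tested at
# alpha = 0.05 we expect about 500 false positives among true nulls, and the
# Bonferroni-corrected per-gene cutoff is 0.05/10000.
n_genes = 10000
alpha = 0.05

expected_false_positives = alpha * n_genes  # expected type I errors under the null
bonferroni_cutoff = alpha / n_genes         # per-gene cutoff controlling the FWER
print(expected_false_positives, bonferroni_cutoff)
```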
The false discovery rate (FDR) is another statistical method used in multiple
hypothesis testing to correct for multiple comparisons. In a list of rejected hypotheses,
the FDR controls the expected proportion of incorrectly rejected null hypotheses.
Thus it is the expected proportion of false positives among the genes declared to be
differentially expressed, that is,
FDR = E(T12/R | R > 0)
where T12 and R are defined in Table 1.1. Benjamini and Hochberg (1995) have
proposed a method for controlling the FDR at a specified level. After ranking the
genes according to significance (p-value) and starting at the top of the list, we accept
all genes where
p ≤ (T12/R) q
where T12 is the number of genes accepted so far, R is the total number of genes
tested and q is the desired FDR. For T12 > 1, this correction is less strict than a
Bonferroni correction. Because of this directly useful interpretation, FDR is a more
convenient scale to work on instead of the p-value scale. For example, if we declare
a collection of 100 genes with a maximum FDR 0.10 to be DE, then we expect a
maximum of 10 genes to be false positives. No such interpretation is available from
the p-value (Pawitan et al. (2005)). Tusher et al. (2001) and Storey and Tibshirani
(2003) select potential differentially expressed genes and then estimate the FDR of
the selected genes by resampling, given that at least one gene was selected. Storey
(2002) also introduced a q-value, which is a Bayesian concept and corresponds to
the posterior probability that a gene is not differentially expressed given the observed
data.
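The step-up rule described above can be sketched in a few lines of Python (the analyses in this thesis use R; the helper name `bh_reject` and the toy p-values are assumptions for illustration). It is a standard rendering of the Benjamini-Hochberg procedure: sort the p-values, find the largest rank k whose p-value is at most (k/m)q, and reject the k smallest.

```python
# A minimal sketch of the Benjamini-Hochberg step-up procedure: reject the k
# smallest p-values, where k is the largest rank with p_(k) <= (k/m) * q.

def bh_reject(pvalues, q):
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])  # indices of rejected hypotheses

pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.20, 0.35, 0.50, 0.62, 0.81]
print(bh_reject(pvals, q=0.10))  # genes 0 and 1 are rejected at q = 0.10
```

Note that p-values 0.039 and 0.041 survive a naive per-gene 0.05 cutoff but fail the BH thresholds 0.03 and 0.04 at q = 0.10.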
The FDR can also be estimated using permutations. Permutation is used to
estimate unadjusted p-values while avoiding parametric assumptions about the joint
distribution of the test statistics. An important assumption behind a permutation test
is that the observations are exchangeable under the null hypothesis. An important
consequence of this assumption is that tests of difference in locations require equal
variance. To estimate a p-value for the gth gene using permutation analysis, one first
calculates the observed test statistic Tg. Then one permutes the samples of the gene
expression data and recalculates the test statistic, T ∗g . Permuting entire samples of
this expression data creates a situation in which the response is independent of the
gene expression levels, while attempting to preserve the correlation structure and
distributional characteristics of the gene expression levels. The p-value is estimated
by counting the number of permuted statistics T*gb, b = 1, . . . , B, that are greater than or
equal to Tg and dividing by the total number of permutations, B. More details of
the permutation algorithm for calculating unadjusted p-values are given in Dudoit
et al. (2002), Storey and Tibshirani (2003) and also in Speed (2003). A more recent
paper by Lin et al. (2008) investigates the performance of the SAM method and the
Benjamini and Hochberg FDR procedure with regard to controlling the FDR.
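The permutation scheme just described can be sketched as follows (Python for illustration; the toy data, the number of permutations B, and the function names are assumptions, not from the thesis):

```python
# Sketch of a permutation p-value for one gene: permute the sample labels,
# recompute the test statistic T*_gb for each permutation b, and estimate the
# p-value as the fraction of permutations with |T*_gb| >= |T_g|.
import random
import statistics

def t_stat(x, y):
    # Welch-type statistic: difference in means over its standard error
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / ((vx / len(x) + vy / len(y)) ** 0.5)

def perm_pvalue(x, y, B=1000, seed=0):
    rng = random.Random(seed)
    observed = abs(t_stat(x, y))
    pooled = x + y
    count = 0
    for _ in range(B):
        rng.shuffle(pooled)  # permute the sample labels
        if abs(t_stat(pooled[:len(x)], pooled[len(x):])) >= observed:
            count += 1
    return count / B

x = [5.1, 4.8, 5.5, 5.0, 4.9]  # toy "control" expression values
y = [6.2, 6.8, 6.0, 6.5, 6.4]  # toy "treatment" expression values
p = perm_pvalue(x, y)
print(p)  # small: the two toy groups are well separated
```

With only n1 + n2 = 10 samples the permutation distribution is coarse, which illustrates the granularity of permutation p-values noted for small experiments.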
When controlling the FDR, it is important to be aware of the sensitivity or false
negative rate (FNR), as we may lose too many of the truly DE genes by setting
the FDR too low. Thus, the increasing use of the FDR needs to be accompanied by
an assessment of the sensitivity or FNR.
1.6.1. An Application to Calculate FDR
We obtained the gene expression data from the leukemia microarray study of Golub
et al. (1999). Pre-processing was done as described in Dudoit et al. (2002). The data
consist of 27 ALL and 11 AML subjects, with expression measurements on 3051 genes. Figure 1.3 is a histogram of
the 3051 two-sample t-statistics from the genes. The t-statistic values range from
-10.58 to 7.548. Suppose that we decide to reject all genes whose t-statistic is greater
than 3 in absolute value; there are 614 such genes.
Figure 1.3: Histogram of t-statistics from the Golub et al. (1999) leukemia dataset
To calculate the FDR among these 614 genes we do a random permutation of
the sample labels (27 ALL and 11 AML). We recompute the t-statistics and count
how many exceed ±3. Doing this for 100 permutations, we find that the average
number is 17.58. Thus a simple estimate of the FDR is 17.58/614=2.86%. This
simple estimate tends to be biased because the permutations make all the genes null,
but in the data only a proportion, π0, are null. Hence to improve the estimate of
FDR, we multiply it by an estimate of π0. To obtain π0, we look at small values of
the t-statistic (in absolute value), where null statistics are much more abundant than
alternatives. Looking, for example, at t-statistics below 0.15 in absolute value, we
find that 2865 of the observed t-statistics fall into that range, while on average 2993 of
the t-statistics from the permutations fall in that range. (The 0.15 cutoff is arbitrary
in this example, but it can be automatically chosen taking bias and variance into
account as in Storey and Tibshirani (2003).) Hence our estimate is π0 ≈ 0.95.
Finally, our revised estimate of the FDR is 0.95 × 2.86% ≈ 2.72%.
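The arithmetic of this example can be replayed directly; a short Python sketch (it uses the unrounded ratio 2865/2993 for π0, so the final figure differs slightly from the rounded value quoted in the text):

```python
# Replaying the FDR arithmetic of the Golub et al. example: 614 genes exceed
# |t| > 3, an average of 17.58 permutation statistics exceed the cutoff, and
# pi0 is estimated from the abundance of small |t| values (|t| < 0.15).
n_rejected = 614          # observed t-statistics with |t| > 3
avg_null_exceed = 17.58   # average count exceeding the cutoff over 100 permutations

raw_fdr = avg_null_exceed / n_rejected   # simple (biased) FDR estimate
pi0 = 2865 / 2993                        # observed vs. permutation counts below 0.15
adjusted_fdr = pi0 * raw_fdr             # bias-corrected FDR estimate
print(raw_fdr, pi0, adjusted_fdr)
```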
1.7. Importance of Replicates
Carefully designing and controlling experiments is as important as the execution of
the experiment itself. One approach that ensures greater experimental success in
gene expression studies using microarrays is the incorporation of replicates. There
are two primary types of replicate experiments: biological and technical. Biological
replicates refer broadly to analysis of RNA of the same type from different subjects
(for example, muscle tissue treated with the same drug in different mice or six different
human samples across six arrays); technical replicates refer to multiple-array analysis
performed using the same RNA (for example, multiple samples from the same tissue
or analyzing one sample six times across multiple arrays). It is important to consider
one or both of these types of replicates depending on the experimental design.
Any type of replicate provides a measure of experimental variation arising from sources
such as RNA isolation, labeling efficiency, or chip quality. The importance of biological
replication in microarray gene expression studies has been addressed by Lee et al. (2000).
They conducted a controlled experiment involving replication of cDNA hybridizations
on a single microarray to investigate inherent variability in gene expression data and
the extent to which replication in an experiment can affect consistent and reliable
findings. The importance of biological replicate microarray experiments was also re-
ported by Pritchard et al. (2001) based on mouse gene expression data collected from
different tissues, such as the kidney, liver, and testis. They demonstrated that even
for genetically identical mice of the same age housed under the same conditions, there
were genes that exhibited significant variation at the mouse level. In particular, their
data suggest that both specific genes and functional classes of genes will be consis-
tently variable, even in multiple tissue types. Genetically diverse populations such as
humans are likely to show even greater variability in gene expression. The advantages
of using replicates are summarized as follows:
• Replicates can be used to measure variation in the experiment so that statistical
tests can be applied to evaluate differences.
• Averaging across replicates increases the precision of gene expression measure-
ments and allows smaller changes to be detected.
• Replicates can be compared to locate outlier results that may occur due to
aberrations within the array, the sample, or the experimental procedure.
The optimal number of replicates in a general microarray study will depend on many
factors, including array equipment type, laboratory technique, and the condition
and preparation of samples. A study by Pan (2002) discussed how to calculate the
number of replicates (arrays or spots) in the context of applying a normal mixture
model approach to detect changes in gene expression. The estimated number depends on
several factors, including a given magnitude of expression change, a desired statistical
power to detect it, a specified type I error rate, and the statistical method being used
to detect it.
1.8. Software
In this thesis we concentrate on analyzing gene expression data to identify differ-
entially expressed genes. Microarrays generate large amounts of numeric data that
should be analyzed effectively. R statistical software (http://cran.r-project.org)
and its expansion packages from Bioconductor project (http://www.bioconductor.
org) provide flexible means to manage and analyze these data. Our analysis has two
parts: (1) preprocessing and (2) testing. The testing stage begins after the Affymetrix
data have been preprocessed.
1.8.1. Preprocessing packages
Functions for reading Affymetrix data are available in the package affy, which is
written by a group of authors and maintained by R. A. Irizarry. Functions for Affymetrix
normalization are distributed over several packages. The MAS5 method developed by
Affymetrix is available in the package affy, command mas5(). A newer method, Plier,
also developed by Affymetrix, is available in the package plier, command plier(). The
RMA method is implemented in the package affy (command rma()), but its adaptation
that takes the GC content of the probes into account (GCRMA) is available in a
separate package, gcrma (command gcrma()).
In this thesis, we will apply RMA preprocessing to the data. RMA was chosen
because it gives highly precise (low-variance) estimates of expression, which is
desirable, although it might not give results as accurate (low-bias) as MAS5
(Millenaar et al. (2006)); in particular, RMA tends to systematically underestimate
gene expression.
1.8.2. Testing packages
The following packages provide tools for identifying differentially expressed genes.
LIMMA One of the most widely used tools for statistical analysis is limma, which
implements linear models. Limma assumes that the data are normally dis-
tributed, an assumption that real-world data do not always satisfy. Limma itself
also provides input and normalization functions which support features espe-
cially useful for the linear modeling approach. The latest version of this package
is 3.6.9; it is written by a group of authors and maintained by Gordon Smyth.
A detailed discussion of limma is given in Smyth (2004).
SIGGENES The siggenes package is used to identify differentially expressed
genes and estimate the False Discovery Rate (FDR) using both the Significance
Analysis of Microarrays (SAM) and the Empirical Bayes Analyses of Microar-
rays (EBAM). Schwender (2009) is the author of this package and the current
version number is 1.24.0.
DEDS This library contains functions that calculate various statistics of differential
expression for microarray data, including t statistics, fold change, F statistics,
SAM, moderated t and F statistics and B statistics. It also implements a
new methodology called DEDS (Differential Expression via Distance Summary),
which selects differentially expressed genes by integrating and summarizing a
set of statistics using a weighted distance approach. The authors of this package
for version number 1.24.0 are Xiao and Yang (2007).
ROC The ROC library is a collection of functions related to receiver operating char-
acteristic (ROC) curves and is targeted at use in DNA microarray analysis.
Carey and Redestig (2010) introduced this package with the version number
1.26.0.
OCplus The R package OCplus contains functions to compute the proportion
of non-differentially expressed (non-DE) genes based on a mixture model,
the resulting FDR and other operating characteristics of microarray data. The
package includes tools both for planned experiments (for sample size assessment)
and for already collected data (identification of differentially expressed genes).
The authors of this package for the version number 1.24.0 are Pawitan and
Ploner (2010).
1.9. Aim of the Thesis
The goal of many controlled microarray experiments is to identify genes that are
regulated by modifying conditions of interest. For example, one may wish to compare
a drug with an alternative drug treatment. The goal of these experiments is generally
that of identifying as many of the genes as possible that are differentially expressed
across the conditions compared. Gene expression can often be thought of as the
response variable in statistical models.
Microarrays are often used to screen genes for further analysis by more reliable
assays, and the data analysis is best approached by ranking genes and/or by selecting
a subset of genes for further validation. Determining whether a gene is differentially
expressed under different conditions is an important statistical problem in genome
experiments. The standard practice is to test the hypothesis of no differential ex-
pression for each gene when comparison is made between two (or more) different
experimental conditions. Generally speaking, the statistical methods for testing the
hypothesis can be classified into two categories: the parametric method and the non-
parametric method. The most commonly used parametric method is the two sample
t-test and its variations. Although it seems to work well in some situations, the
parametric method has a big drawback: its strong dependence on model assumption.
Several authors pointed out that expression data from microarrays are often not dis-
tributed according to a normal distribution, even after some preprocessing (Hunter
et al. (2001), Thomas et al. (2001), Pan et al. (2003), Craig et al. (2003), Zhao et al.
(2003), Zhang et al. (2006)). According to Hunter et al. (2001) and Thomas et al.
(2001), microarray data are often noisy, and hence the parametric assumption is cer-
tainly inappropriate for a subset of genes under any given transformation. Therefore
it is challenging to construct a suitable statistical model applicable to all microarray
datasets. The main goal of this thesis is to explore methods for ranking genes in
order of likelihood of being differentially expressed. The output is a list of top-ranked
genes that captures the truly DE genes. In this context we propose three new methods:
Parametric Method: We develop an Approximate Likelihood Ratio Test (ALRT)
method assuming the expression levels follow a generalized logistic distribution
of Type II (GLDII). The ALRT method with the distributional assumption GLDII
appears to provide a suitable fit to the data, especially with large samples.
Nonparametric Method 1 We develop an improved test statistic for testing whether
the area under the receiver operating characteristic curve for each gene is greater
than 0.5, allowing a different variance for each gene. The method performs
reasonably well and is computationally efficient enough for practical applications.
Nonparametric Method 2 We develop a nonparametric procedure for selecting
genes with expression levels correlated with that of a “seed” gene or a marker
gene in microarray experiments. The marker gene is known in advance. The
proposed test statistic compares two Areas Under Receiver Operating Characteristic
Curves (AUC) for gene pairs, taking correlation into account.
1.10. Organization of the Thesis
The thesis is divided into a series of chapters, each devoted to a particular facet of
analysis and arranged in roughly the same order as the issues one might encounter
during a real experiment. It begins with a very brief overview of hybridization, which
nicely summarizes the microarray technology and highlights the current limitations
of the most commonly used methods. Based on the aims of the thesis described in
Section 1.9, the work includes mainly identifying differentially expressed genes. The
presentation of the work in this thesis is organized in the following chapters:
Chapter 2 outlines a few of the popular microarray data analysis tools that help to
detect differential gene expression. The chapter starts by describing how to
assess the significance of fold changes in expression. The primary concepts
and methods used for identifying differentially expressed genes are also intro-
duced. An empirical comparison with other methods is also discussed in this
chapter.
Chapter 3 proposes a parametric method based on an approximate likelihood
ratio test for the case where the underlying distribution of gene expression data
is skewed. Here we assume that the gene expression data follow the generalized
logistic distribution of Type II rather than the normal distribution. We also
compare results on simulated data and from real microarray
datasets. We conclude the chapter with a discussion summarizing the advan-
tages and disadvantages of using our method.
Chapter 4 provides a nonparametric method for testing whether the area under the
receiver operating characteristic curve (AUC) for each gene is equal to 0.5, allowing
a different variance for each gene. This chapter provides the implementation
of the method with real datasets and simulation procedures and shows the
improvement in identifying differentially expressed genes compared with the
rank sum test and the t-test.
Chapter 5 studies the problem of searching for genes correlated with a known candidate
gene or "seed" gene. We propose a test statistic that compares two Areas Under
Receiver Operating Characteristic Curves (AUC) for gene pairs, taking correlation
into account and identifying DE genes sequentially. We compare our method
with other well known methods.
Chapter 6 gives the conclusion of this work and proposes possible directions for
future research.
CHAPTER 2
Popular Statistical Methods for Identifying Differentially Expressed Genes
Having performed normalization we should now be able to compare the expression
level of any gene in the treatment to the expression level of the same gene in the
control. Many statistics have been developed for this purpose. The number of methods
for identifying differentially expressed genes from microarray experiments is now
large and ever increasing. There is no consensus, no strict guidelines and no real rules of
thumb on when to apply some tests and when to apply others. In this chapter we discuss
some of the well known methods that are used in identifying differentially expressed
genes.
2.1. Introduction
One of the well-known problems in high-dimensional microarray analysis is that of
ranking genes according to differential expression. For this purpose, many statis-
tical models have been proposed for the analysis of microarray gene expression data,
including generalized linear models (Kerr et al. (2000), Lee et al. (2000), Dudoit
et al. (2002)), mixed effects models (Wolfinger et al. (2001)), modified mixture model
methods (Pan et al. (2003)), significance analysis of microarrays (SAM) (Tusher et al.
(2001)), modified t-statistic method (Zhao et al. (2003)), Likelihood Methods (Ideker
et al. (2000),Wright and Simon (2003), Wu (2005)), Bayesian methods (Opgen-Rhein
and Strimmer (2007), Newton et al. (2001), Baldi and Long (2001), Lonnstedt and
Speed (2002), Newton et al. (2004), Smyth (2004), Cui et al. (2005), Fox and Dimmic
(2006), Scharpf et al. (2009)), nonparametric methods (Pepe et al. (2003)), Hotelling
T 2 method (Lu et al. (2005)), global test method (Goeman et al. (2004)), and Global
approach (Zhou et al. (2007)), among others. Some of these methods require dis-
tributional assumptions for the underlying gene expressions. The aim of this
chapter is to consider a few of these methods for analyzing microarray data to identify
differentially expressed genes.
2.2. Fold Change
The simplest method to detect differential gene expression is by ranking based on the
fold change (FC). The approach to calculate fold change is to divide the expression
level of a gene in the sample by the expression level of the same gene in the control.
Then we get the fold change, which is 1 for an unchanged expression, less than 1 for
a down-regulated gene, and larger than 1 for an up-regulated gene. The genes are
then ranked by this ratio.
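The ranking can be sketched in a few lines of Python (toy expression values; the analyses in this thesis use R). Working on the log2 scale is a common way to make up- and down-regulation symmetric around zero:

```python
# Sketch of fold-change ranking on the log2 scale: log2(FC) = 0 for an
# unchanged gene, negative for down-regulation, positive for up-regulation.
# Toy (treatment, control) expression pairs, purely illustrative.
import math

expr = {"geneA": (200.0, 100.0),   # 2-fold up-regulated
        "geneB": (50.0, 100.0),    # 2-fold down-regulated
        "geneC": (100.0, 100.0)}   # unchanged

log2fc = {g: math.log2(t / c) for g, (t, c) in expr.items()}
ranked = sorted(log2fc, key=lambda g: abs(log2fc[g]), reverse=True)
print(log2fc)  # geneA: 1.0, geneB: -1.0, geneC: 0.0
print(ranked)  # geneC, the unchanged gene, ranks last
```

On the log2 scale a 2-fold increase (+1) and a 2-fold decrease (−1) are symmetric, addressing the asymmetric raw-ratio scale discussed in the text.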
The problem with fold change emerges when one takes a look at its scale. Up-
regulated genes occupy the scale from 1 to infinity (or at least up to 1000 for a 1000-fold
up-regulated gene), whereas all down-regulated genes occupy only the scale from 0
to 1. This scale is highly asymmetric. Another disadvantage of the fold change is
that it does not take the variability of the gene expression values into account. The basic flaw of
this approach mentioned by Dudoit et al. (2002) is that genes with high fold change
might also exhibit high variability, and hence their differential expression between
experimental conditions may not be significant. Similarly, genes with less than two-fold
changes may have highly reproducible expression intensities, and hence significant
differential expression can be found across experimental conditions.
In order to assess differential expression in a way that controls both false positives
(genes declared to be differentially expressed when they are not) and false negatives
(genes declared to be not differentially expressed when they are), the standard ap-
proach emerging is one based on statistical significance and hypothesis testing, with
careful attention paid to multiple comparisons issues. The following are a few approaches
for assessing differential expression within a hypothesis testing framework.
2.3. t-test and ANOVA
We can use a t-test to determine whether the expression of a particular gene is sig-
nificantly different between control and treatment. The t-test uses the mean and
variance of the treatment and control distributions and calculates the probability
that the observed difference in mean occurs when the null hypothesis is true. The
formula for the t-statistic is the difference in the means over the standard error of the
difference. For two groups, this is equivalent to a one-way analysis of variance (Sahai
and Ageel (2000)).
When using the t-test it is often assumed that there is equal variance between
sample and control. That allows the treatment and control to be pooled for variance
estimation. If the variance cannot be assumed equal we can use Welch’s t-test which
assumes unequal variances of the two populations. If {xig}n1i=1 and {yjg}n2
j=1 defined
as the expression observed for the n1 control cases and n2 treatment cases in the gth
gene, then for each gene g, the test statistic is
tg =xg − yg√
s2x/n1 + s2y/n2
where xg and yg denote the sample average intensities in groups 1 and 2, and s2x and
s2y denote the sample variances for each group, respectively.
After calculating the t-test p-values for the replicated genes, the ones with the
lowest p-values can be saved into a new gene list and used in further analysis, for
example cluster analysis. These are the genes that most significantly differ between
two conditions.
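The per-gene Welch statistic can be computed directly; a Python sketch on toy expression values for a single gene (illustrative only; real analyses in this thesis use R):

```python
# Sketch of the per-gene Welch t-statistic:
# t_g = (xbar_g - ybar_g) / sqrt(s2x/n1 + s2y/n2), with per-group variances.
import statistics

def welch_t(x, y):
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    s2x, s2y = statistics.variance(x), statistics.variance(y)  # sample variances
    return (xbar - ybar) / (s2x / len(x) + s2y / len(y)) ** 0.5

control = [7.1, 6.9, 7.3, 7.0]    # toy log-scale expression, control group
treatment = [8.0, 8.4, 7.9, 8.3]  # toy log-scale expression, treatment group
print(welch_t(control, treatment))  # large negative: higher expression under treatment
```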
Early statistics for analyzing microarray data included gene-specific t-tests to find
differentially expressed genes. The t-statistic soon turned out to be unsuitable for
microarray data: because of the large number of genes included in microarray
experiments, there are always some genes with very small variances across replicates,
so that their (absolute) t-values are large regardless of whether or not the differences
in their averages are large. Some of these turn out to be false positives for the t-
statistic. Moreover, the sample average is highly influenced by extreme observations
and outliers, and hence unsuitable for data with as few replicates as microarray data.
Several alternative statistics have been proposed to overcome these problems, and
many of them draw on the idea of shrinking the variance.
If we have more than two conditions, the t-test may not be the method of choice,
because the number of comparisons grows quickly when all possible pairwise comparisons
between conditions are performed. The analysis of variance (ANOVA) method will, using the F
distribution, calculate the probability of finding the observed differences in means
between more than two conditions when the null hypothesis is true (when there is no
difference in means).
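The one-way ANOVA F statistic underlying this comparison can be sketched as follows (Python; toy data for one gene measured under three conditions, purely illustrative):

```python
# Sketch of the one-way ANOVA F statistic for comparing a gene across more
# than two conditions: between-group mean square over within-group mean square.
import statistics

def anova_f(groups):
    k = len(groups)                       # number of conditions
    n = sum(len(g) for g in groups)       # total number of observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((v - statistics.mean(g)) ** 2 for v in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy expression values for one gene under three conditions
groups = [[5.0, 5.2, 4.8], [6.1, 6.3, 5.9], [7.0, 7.2, 6.8]]
print(anova_f(groups))  # large F: group means clearly differ
```

A large F value relative to the F distribution with (k − 1, n − k) degrees of freedom indicates that the group means differ.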
2.4. Nonparametric Tests
Instead of the parametric test, which usually assumes that the expression values are
normally distributed, non-parametric tests like Wilcoxon rank sum test (two groups)
or Kruskal-Wallis test (two or more groups) can be applied, especially if the expression
values are not normally distributed. Here we describe the Wilcoxon rank sum test
and the ROC methodology, which are commonly used for analyzing microarray data.
2.4.1. Wilcoxon Rank Sum Test (RST)
Troyanskaya et al. (2002) applied the Wilcoxon rank sum test (RST) to gene expres-
sion analysis. The RST is a nonparametric alternative to the two-sample t-test based
solely on the ranks that the expression values from the two samples occupy in the
combined sample. The test is based upon ranking the n1 + n2 observations of the
combined sample, where n1 and n2 are the sample size from the two conditions. The
Wilcoxon rank-sum test statistic is the sum of the ranks for observations from one of
the conditions. That is, after ranking all expression values from the two conditions,
the best separation we can have is that all values from one condition rank higher
than all values from the other condition. This corresponds to two non-overlapping
distributions in parametric tests. But since the Wilcoxon test does not measure vari-
ance, the significance of this result is limited by the number of replicates in the two
conditions. It is for this reason that the Wilcoxon test for low numbers of replication
has low power and that the distribution of p-values is rather granular.
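The rank-sum computation can be sketched in a few lines (Python for illustration; the toy values contain no ties, so a simple rank dictionary suffices):

```python
# Sketch of the Wilcoxon rank-sum statistic: rank the combined n1 + n2
# observations and sum the ranks occupied by one condition. Toy values, no ties.

def rank_sum(x, y):
    combined = sorted(x + y)
    ranks = {v: r for r, v in enumerate(combined, start=1)}  # value -> rank
    return sum(ranks[v] for v in x)

x = [1.2, 2.3, 3.1]          # condition 1
y = [4.5, 5.0, 6.7, 7.1]     # condition 2
print(rank_sum(x, y))  # complete separation: ranks 1 + 2 + 3 = 6
```

With n1 = 3 and n2 = 4 the statistic can take only a handful of values, which illustrates the granular p-value distribution noted above for small numbers of replicates.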
2.4.2. ROC Methodology for Gene Expression Analysis
An assessment of the expression of a gene can be made through the use of a receiver
operating characteristic (ROC) curve. If Y and X represent the expression values
from the treatment population and the control population, respectively, then for any
real-valued threshold c the population of subjects can be classified into two groups
according to whether their expression values are greater or less than c. If a gene is
significant, then the treatment group will include proportionally higher expression
values than the control group (or vice versa). The agreement between the resulting
classification and the expression values can be characterized by two quantities,
sensitivity (True Positive Rate) and specificity (True Negative Rate), defined as follows:

    sens(c) = TPR(c) = P(Y > c),
    spec(c) = TNR(c) = 1 − FPR(c) = P(X ≤ c).
The ROC approach allows one to consider the agreement between expression values
and class membership at all thresholds simultaneously. The ROC curve is the plot
of sensitivity versus 1 − specificity, where each point on the graph corresponds to a
specific threshold c. Note that for every gene in the groups of control and treatment
subjects there is an ROC curve. If a gene could perfectly discriminate, there would be
an expression value above which all treatment expressions would fall and below which
all control expressions would fall, or vice versa. The curve would then pass through the
point (0,1) on the unit grid. The closer an ROC curve comes to this ideal point, the
better its discriminating ability. A gene with no discriminating ability will produce
a curve that follows the diagonal of the grid. Like the difference between the means
(µ1 − µ2), the probability P(Y > X) is a measure of the distance between the two
experimental conditions; the diagonal line with unit slope corresponds to two
distributions with the same location. Pepe et al. (2003)
argue that two measures related to the ROC curve are suitable for ranking genes with
regard to DE between two conditions: the Area Under the ROC Curve (AUC) and
the Partial AUC (pAUC). The AUC can be interpreted as the probability that a
randomly selected subject from the treatment group has greater expression values than
a randomly selected subject from the control group, i.e.,

    AUC = P(Yg > Xg) = ∫₀¹ ROC(t) dt,   where t = FPR(c).
Let {xig, i = 1, . . . , n1} and {yjg, j = 1, . . . , n2} denote the expression values observed
for the n1 control cases and the n2 treatment cases for the gth gene. In this notation an
unbiased estimator of the AUC for the gth gene is given by

    Ag = (1/(n1 n2)) Σ_{i=1}^{n1} Σ_{j=1}^{n2} ψ(xig, yjg) = ψ̄..
where,
ψ(x, y) = 1 if x < y,  0 if x > y.
The definition of ψ(x, y) does not allow for ties because of the continuous nature of
microarray data. Ties are nevertheless possible when quantile normalization is used,
and in this case ψ(x, y) = 0.5 can additionally be defined for x = y.
The partial AUC is simply the area under the ROC curve between t0 and t1, i.e.,

pAUC(t0, t1) = ∫_{t0}^{t1} ROC(t) dt,

where the interval (t0, t1) denotes the FPRs of interest.
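The AUC and pAUC estimators above can be sketched directly; a minimal illustration (the function names and the threshold grid are our own choices, not taken from any particular package):

```python
import numpy as np

def empirical_auc(x, y):
    """Mann-Whitney estimate of AUC = P(Y > X): compare every control
    value x_i with every treatment value y_j; ties count as 0.5."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    psi = (x[:, None] < y[None, :]).astype(float)
    psi[x[:, None] == y[None, :]] = 0.5      # possible after quantile normalization
    return psi.mean()

def empirical_pauc(x, y, t0=0.0, t1=0.1):
    """Partial AUC over the FPR interval (t0, t1): integrate the empirical
    ROC curve over a grid of thresholds by the trapezoid rule."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    grid = np.linspace(t0, t1, 201)          # FPR values of interest
    c = np.quantile(x, 1.0 - grid)           # thresholds with FPR(c) ~ t
    tpr = np.array([(y > ci).mean() for ci in c])
    return float(np.sum(0.5 * (tpr[1:] + tpr[:-1]) * np.diff(grid)))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 50)                 # control expression values
y = rng.normal(1.5, 1.0, 60)                 # treatment expression values
print(empirical_auc(x, y))                   # theoretical AUC = Phi(1.5/sqrt(2))
print(empirical_pauc(x, y, 0.0, 0.2))
```

The double loop implicit in Â_g is vectorized by broadcasting the comparison over the n1 × n2 grid of pairs.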
For continuous data, the nonparametric ROC curve may be preferred since it
passes through all observed points and provides unbiased estimates of sensitivity,
specificity, and AUC in large samples (Zweig et al. (1993)). More importantly, the
nonparametric approach does not require the data to be fitted by any particular
model. If the distributions of scores for true-positive and true-negative test subjects
are far from Gaussian, the parametric AUC and its corresponding standard error
(SE) derived from a directly fitted binormal model may be distorted (Godard et al.
(1990)). Convergence may also be an issue with data characterized by
extreme values. For these reasons, as well as its relative simplicity and ease of use,
the nonparametric approach continues to be popular among many researchers.
2.5. SAM-Statistic
Significance analysis of microarrays (SAM) is a statistical technique, established in
2001 by Tusher, Tibshirani and Chu, for determining whether changes in gene ex-
pression are statistically significant. To avoid the genes with low expression levels
dominating the results of the analysis, a small, strictly positive constant s0, the so
called fudge factor, is added to the denominator of the usual t-statistic.
d_g = (x̄_g − ȳ_g) / ( √(n/(n1 n2)) s_g + s0 ),  n = n1 + n2,
where s0 is some constant estimated from all the individual gene variances. Values
of s0 have a shrinkage effect producing a large decrease in the magnitude of the dg
statistic when the sample variances are small, and a smaller decrease when the sample
variances are large.
This analysis uses non-parametric statistics, since the null distribution of the
d_g-statistic is unknown. In this method, repeated permutations of the data are used
to determine whether the expression of any gene is significantly related to the response.
The use of permutation-based analysis accounts for correlations between genes and
avoids parametric assumptions about the distribution of individual genes. This is an
advantage over other techniques (for example ANOVA and Bonferroni), which assume
equal variance and/or independence of genes.
Each gene in a microarray experiment can have its own unique variance. This may
be a consequence of biological or technical factors but it is clear from our experience
that variances are variable across genes to a greater extent than expected due to
statistical errors of estimation. To derive stable gene specific variance estimates, we
can borrow information across genes by shrinking the variance estimates toward a
prior value or toward their bias-corrected geometric mean. When the true variances
are highly variable it is desirable to shrink less. When the true variances are similar
we should shrink more. In this way the new variance estimates adapt to the degree
of heterogeneity of variances.
In the following we describe the SAM procedure:
1. Compute the expression score d_g for each gene g, g = 1, · · · ,m, and order these
values to obtain the observed order statistics d_(1) ≤ · · · ≤ d_(m).
2. Draw B random permutations of the group labels. For each permutation b,
compute the permuted expression scores d^b_g, g = 1, · · · ,m, and order them.
Estimate the expected order statistics by d̄_(g) = ∑_b d^b_(g)/B, g = 1, · · · ,m.
3. Plot the observed order statistics d_(g) against the expected order statistics d̄_(g)
to obtain the SAM plot.
4. For a fixed threshold ∆ > 0, find the first data point (d̄_(g1), d_(g1)) to the right
of the origin for which d_(g) − d̄_(g) ≥ ∆, and set cut_up(∆) = d_(g1). Call any gene
g with d_g ≥ cut_up(∆) positive significant. Similarly, find the first data point
(d̄_(g2), d_(g2)) to the left of the origin for which d_(g) − d̄_(g) ≤ −∆, set cut_low(∆) =
d_(g2), and call any gene g with d_g ≤ cut_low(∆) negative significant. See Figure
2.1 for the SAM plot showing the positive and negative significant genes under
∆ = 2 that can be taken forward for further biological analysis.

Figure 2.1: SAM analysis for the two-class unpaired case assuming unequal variances and ∆ = 2.
5. Estimate the FDR by

F̂DR(∆) = π̂0 · [ (1/B) ∑_b #{d^b_g ∉ (cut_low(∆), cut_up(∆))} ] / max{# of significant genes at the chosen ∆, 1},

where π̂0 is an estimate of the prior probability π0 that a gene is not differentially
expressed. This estimate depends on the choice of another threshold ∆′ and is taken
as the ratio of the number of observed genes not significant at ∆′ to the expected
number of genes not significant at ∆′ when all null hypotheses are true. If this
proportion exceeds 1, then π̂0 is set to 1. A way of estimating π0 is described by
Storey and Tibshirani (2003).
6. Repeat steps 4 and 5 for several values of the threshold ∆. Choose the value of
∆ that provides the best balance between the number of identified genes and
the estimated FDR.
Computation of the Fudge Factor: In the SAM analysis, the fudge factor s0 is
specified as the quantile of the standard deviations s_g, g = 1, · · · ,m, of the
genes that fulfills a specific optimization criterion. Efron et al. (2001) suggest
specifying the fudge factor by selecting the value of s0 that leads to the most
differentially expressed genes, namely the 90th percentile of the s_g values. In
the SAM analysis, the fudge factor is computed by the following algorithm
provided by Chu et al. (2002).
1. Compute the 100 percentiles q_k, k = 1, · · · , 100, of the s_g values.

2. For α ∈ ℜ = {0, 0.05, 0.1, · · · , 1}:

• compute d^α_g = (x̄_g − ȳ_g) / ( √(n/(n1 n2)) s_g + s^α ), where s^α denotes the
α quantile of the s_g values, and s^0 = q_0 = min_g{s_g}, g = 1, · · · ,m;

• calculate ν^α_k = 1.4826 · MAD{d^α_g | s_g ∈ [q_{k−1}, q_k)}, k = 1, · · · , 100;

• compute the coefficient of variation CV(α) of the ν^α_k values.

3. Set α̂ = argmin_{α∈ℜ}{CV(α)}, and ŝ0 = s^α̂.
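The algorithm can be sketched as follows; an illustration of the coefficient-of-variation criterion only, not the siggenes implementation, and the simulated mean differences and standard errors are made up:

```python
import numpy as np

def fudge_factor(diff, sg):
    """Pick s0 = s^alpha, the alpha-quantile of the gene-wise standard
    errors sg, minimizing the coefficient of variation of the MAD of the
    d-statistics across the 100 percentile bins of sg."""
    q = np.percentile(sg, np.arange(101))               # q_0, ..., q_100
    bins = np.clip(np.searchsorted(q, sg, side='right') - 1, 0, 99)
    best_cv, best_s0 = np.inf, None
    for alpha in np.linspace(0.0, 1.0, 21):             # alpha in {0, 0.05, ..., 1}
        s_alpha = np.quantile(sg, alpha)                # alpha = 0 gives min(sg)
        d = diff / (sg + s_alpha)
        v = []
        for k in range(100):                            # 1.4826 * MAD within bin k
            dk = d[bins == k]
            if dk.size:
                v.append(1.4826 * np.median(np.abs(dk - np.median(dk))))
        v = np.asarray(v)
        cv = v.std(ddof=1) / v.mean() if v.mean() > 0 else np.inf
        if cv < best_cv:
            best_cv, best_s0 = cv, s_alpha
    return best_s0

rng = np.random.default_rng(0)
sg = np.sqrt(rng.chisquare(8, 2000) / 8) * 0.5          # made-up standard errors
diff = rng.normal(0.0, sg)                              # made-up mean differences
print(fudge_factor(diff, sg))
```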
R Package: siggenes is a Bioconductor package implementing the significance
analysis of microarrays technique of Tusher et al. (2001). The package
contains a function sam to calculate the statistic d_g for each gene g = 1, · · · ,m.
Additionally, the number of differentially expressed genes and the estimated
FDR is determined for an initial (set of) value(s) for the threshold ∆. The
fudge factor estimate is included in this package. The estimation of the prior
probability, π0, that a gene is not differentially expressed can be obtained by the
function pi0.est. It is estimated by the natural cubic splines based method of
Storey and Tibshirani (2003).
2.6. Empirical Bayes Approach
Efron et al. (2001), and Efron and Tibshirani (2002) model the distribution of the
expression scores dg, g = 1, · · · ,m, as a mixture of two components, one component
for the differentially expressed genes, and the other for the not differentially expressed
genes. Roughly speaking, the empirical Bayes method is based on the assumption that
there are two classes of genes, “Not Different” and “Different”, with prior proba-
bilities π0 and π1 = 1 − π0, respectively. Introducing the class indicator
variable J, we can write π0 = Pr{J = 0} and π1 = Pr{J = 1}. Denote the condi-
tional probability density of the test statistic T for a gene given J = 0 by
f0(t) and the corresponding density of T given J = 1 by f1(t). Then the density of
such a statistic, for example the two-sample t-statistic, for gene g is
f(t) = π0f0(t) + π1f1(t)
Then, to evaluate which genes are differentially expressed, one uses the posterior
probability that J = 0 for each gene:

τ0 = P(Not differentially expressed | T = t) = P(J = 0 | T = t)
   = P(T = t | J = 0) P(J = 0) / P(T = t)
   = π0 f0(t) / f(t).
Small values of the posterior probability, τ0, indicate possibly differentially expressed
genes. The posterior probability τ0 has been termed the local FDR by Efron and
Tibshirani (2002). Therefore,

P(t) = P(J = 1 | T = t) = 1 − π0 f0(t)/f(t).

The simplest Bayes rule is to select a gene with T = t if P(t) ≥ C for this gene, where
C < 1 is a preselected threshold level.
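As a toy illustration of this two-class mixture, suppose f0 and f1 are known Gaussian densities (in practice Efron's approach estimates these quantities from the data; the particular means, variances and π0 below are made up):

```python
import math

def npdf(t, mu, sd):
    """Normal density, used here as a stand-in for f0 and f1."""
    return math.exp(-0.5 * ((t - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

pi0 = 0.9                                    # prior P(not differentially expressed)

def local_fdr(t):
    """tau0 = P(J = 0 | T = t) = pi0 f0(t) / f(t)."""
    f0 = npdf(t, 0.0, 1.0)                   # null density (assumed known here)
    f1 = npdf(t, 3.0, 1.0)                   # alternative density (assumed known)
    f = pi0 * f0 + (1.0 - pi0) * f1          # mixture density f(t)
    return pi0 * f0 / f

# Bayes rule: call the gene DE when P(t) = 1 - tau0 >= C for a threshold C < 1
for t in (0.0, 2.0, 4.0):
    print(t, round(local_fdr(t), 4))
```

Statistics near the null mode give τ0 close to 1 (almost certainly not DE), while extreme statistics give τ0 close to 0.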
2.7. Posterior Odds Statistic(LIMMA)
Lonnstedt and Speed (2002) proposed a parametric empirical Bayes method for the
problem of identifying differentially expressed genes. They assumed a normal mixture
model for gene expression data and defined a B-statistic, which is an estimate of the
log posterior odds-ratio that each gene is differentially expressed. The B-statistic is
equivalent, for the purpose of ranking genes, to the penalized t-statistic

t_g = (x̄_g − ȳ_g) / √((s0 + s²_g)/n)
where the penalty s0 is estimated from the mean and standard deviation of the sample
variances s²_g. Using the same notation and assumptions, the SAM statistic can be
written in the form

d_g = (x̄_g − ȳ_g) / ( (s0 + s_g)/√n )

when assessing differential expression of the gth gene. The SAM statistic differs slightly
from the t_g statistic in that the penalty is applied to the sample standard deviation
s_g rather than to the sample variance s²_g. Smyth (2004) reformulated the posterior
odds statistic in terms of an empirical Bayes (E-B) t-statistic in which posterior
residual standard deviations are used in place of ordinary standard deviations, and
implemented the log odds-ratio method as the function ebayes() in the freely available
R package Limma. This empirical Bayes statistic is described as equivalent to shrinkage
of the estimated sample variances towards a pooled estimate, resulting in far more
stable inference when the number of arrays is small. Smyth (2004) also suggested
another direction in which the t-statistic can be generalized: replacing the sample
mean difference and sample standard deviation with location and scale estimators
that are robust against outliers.
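The shrinkage idea behind the penalized t can be illustrated with a small sketch; this is our own simplification (s0 taken as the median sample variance, equal group sizes n assumed), not limma's empirical-Bayes fit:

```python
import numpy as np

def penalized_t(x, y, s0=None):
    """Penalized t with the penalty applied to the sample variance,
    assuming equal group sizes n as in the formula above.  Here s0 is
    simply the median gene-wise variance, a crude stand-in for the
    empirical-Bayes estimate used by limma."""
    n = x.shape[1]
    s2 = 0.5 * (x.var(axis=1, ddof=1) + y.var(axis=1, ddof=1))  # pooled variance
    if s0 is None:
        s0 = np.median(s2)                   # shrinks tiny variances upward
    return (x.mean(axis=1) - y.mean(axis=1)) / np.sqrt((s0 + s2) / n)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, (100, 8))
y = rng.normal(0.0, 1.0, (100, 8))
x[0] = 0.05 + rng.normal(0.0, 0.001, 8)      # near-zero variance, tiny effect
y[0] = 0.0
t_pen = penalized_t(x, y)                    # penalized: gene 0 no longer extreme
t_raw = penalized_t(x, y, s0=0.0)            # unpenalized: gene 0 dominates
print(abs(t_raw[0]), abs(t_pen[0]))
```

The example shows the stabilizing effect: a gene with a tiny sample variance and a negligible mean difference produces a huge ordinary t-statistic but a modest penalized one.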
2.8. Other Methods
Cui et al. (2005) FS Statistic: Cui et al. (2005) proposed a shrinkage estimator
for gene-specific variance components based on the James-Stein estimator and
used it to construct a test statistic called FS. They showed that the FS test is
robust and performs well under a wide range of assumptions about variance
heterogeneity. They compared the FS statistic with several other statistics, such
as B, SAM and the regularized t, and found that FS has comparable or better power
for identifying differentially expressed genes.
Wu (2005) Penalized Linear Regression Model: Wu (2005) used a linear regres-
sion model for detecting differential gene expression, exploring penalized linear
regression. He proposed penalized t/F statistics for two-class microarray data
based on a penalty model and showed that the model is intuitive and performs
favorably in applications.
Pan et al. (2003) Empirical Bayes Approach: Pan et al. (2003) incorporated
biological knowledge into a mixture model to analyze microarray data. They
proposed a mixture model that allows the genes in different groups to have
different distributions, with the grouping of the genes reflecting biological informa-
tion. The groups can be obtained from analyzing gene expression data, such as a
set of differentially expressed genes or a cluster of genes with similar expression
patterns. In these approaches, if there is statistically significant enrichment
of the genes in one or more Gene Ontology (GO) categories, the group of
supplied genes is regarded as biologically more meaningful. They found that
their approach reduces the false discovery rate (FDR) compared with other
standard approaches.
Lu et al. (2005) Hotelling's T² Statistic: Lu et al. (2005) proposed the use of
Hotelling's T² statistic with a multiple forward search (MFS) algorithm designed
for selecting a subset of genes in high-dimensional microarray datasets. Hotelling's
T² statistic is a natural multidimensional extension of the t-statistic and hence
can take into account the multidimensional structure of microarray data. They
found that their approach gave fewer false positives and false negatives than the
t-test.
Seng et al. (2008) Generalized likelihood ratio Method: Parameters of a sta-
tistical data model, which account for potential error sources, can be estimated
using the maximum-likelihood estimation (MLE) method. A generalized likeli-
hood ratio (GLR) test can then be applied to identify genes whose expression
levels are statistically different. A crucial step in the GLR test lies in the se-
lection of the underlying error structure summarizing the influence of multiple
sources of variation in microarray studies. Seng et al. (2008) compared the ef-
fects of different underlying statistical error structures on the GLR test’s power
in identifying differentially expressed genes in microarray data. They also evalu-
ated variants of the GLR test as well as the one sample t-test based on simulated
data by means of receiver operating characteristic (ROC) curves.
Scharpf et al. (2009) Hierarchical Bayesian Model: Scharpf et al. (2009) de-
fined a hierarchical Bayesian model for microarray expression data collected
from several studies and used it to identify differentially expressed genes between
two conditions. They showed that this flexible modeling allows for interactions
between platforms and the estimated effect, while including shrinkage across both
genes and studies. They also provided guidelines for when the Bayesian model
is most likely to be useful.
CHAPTER 3
Approximate Likelihood Ratio Method
The original work of this chapter is taken from Hossain et al. (2009). In this chap-
ter we propose a test assuming the expression levels follow a Generalized Logistic
Distribution of Type II (GLDII). The motivation for this assumed distribution is
to allow longer tails than normal distributions since extreme values are common in
microarray experiments. The shape parameter of this distribution provides a wide
range of flexibility in modeling different shaped distributions. Given the computa-
tional complexity of carrying out Likelihood Ratio (LR) testing for many thousands
of genes, an Approximate LR Test (ALRT) is proposed instead. We also generalize
the two-class ALRT method to multi-class microarray data. The performance of the
ALRT method for the GLDII is compared to methods based on Wald-type statistics
through the use of simulation. The simulation results show that our method performs
quite well compared to the SAM analysis using standardized Wilcoxon rank statistics
and Empircal Bayes t-statistics at any settings of variance, sample size and treatment
effect. Our model is also found less sensitive to contamination. We apply our method
to a real microarray data comparing normal muscle to muscle from patients with
Duchenne muscular dystrophy (DMD), in which a set of truly DEGs are known. We
also illustrate our method in a two class classification problem of Golub et al. (1999)
leukemia dataset.
3.1. Background
As complex and robust as the available analysis methods for microarray data cur-
rently are, there is always room for error and many inherent problems in identifying
differentially expressed genes (DEGs). The statistical methods used to detect DEGs
can be classified into two categories: parametric methods and nonparametric meth-
ods. The most commonly used parametric method is the two sample t-test and its
variations, which are based on Wald statistics. Although Wald-type statistics may
work well in some situations, they have two main drawbacks. First, they depend
strongly on the model assumptions. Second, for a large difference in group means,
the standard error is inflated, lowering the Wald statistic and leading to inflation of
the Type II error (false negatives).
It is common practice in microarray studies to assume that the underlying distribution
of a gene's expression level is normal, although in fact this is not always true. According
to Hunter et al. (2001) and Thomas et al. (2001), microarray data are often noisy,
and hence parametric assumptions are certainly inappropriate for a subset of genes.
Therefore it is challenging to construct a suitable statistical model applicable to all
microarray datasets. More recently, Bhowmick et al. (2006) proposed a Laplace
mixture model approach for the identification of DEGs in microarray experiments.
Lonnstedt and Speed (2002) proposed a normal mixture model for gene expression
data and defined a log posterior odds statistic. Smyth (2004) reformulated the
posterior odds statistic in terms of an empirical Bayes (E-B) t-statistic in which
posterior residual standard deviations are used in place of ordinary standard deviations.
The Bioconductor package LIMMA provides functions for calculating the E-B t-statistic.
Ghosh (2004) also assumed a mixture model for assessing differential expression.
Recently, Jeffery et al. (2006) compared the efficiency of 10 gene selection methods
and made recommendations for the choice of method based on different characteristics
of the expression values. All of these methods assume that all genes have the same
distribution, but this assumption may or may not hold in practice. Rather than
assuming a common distribution for all genes, one can partition genes into several
groups based on their expression levels and then assume that a different distribution
holds for each group.
In this chapter we propose assuming that the underlying distribution of gene
expressions follows the GLDII, which is indexed by a shape parameter α. The
GLDII is chosen because of its great flexibility in modeling distributions of different
shapes on the interval (−∞,∞); a symmetric distribution is the special case of the
GLDII with α = 1. Brief details about the GLDII are given in Section 3.2. In order
to estimate the statistical significance, we focus on Likelihood Ratio (LR) test. Ideker
et al. (2000) proposed a generalized LR test which was performed for each gene to
identify DEGs. More recently, Bokka and Mathur (2006) proposed a nonparametric
LR test to identify differentially expressed genes from microarray data. Carrying out
the LR test for thousands of genes is computationally intensive, and for this reason
we propose an Approximate Likelihood Ratio Test (ALRT) that circumvents these
heavy computations.
3.2. Generalized Logistic Distribution of Type II
(GLDII)
Balakrishnan and Leung (1988) defined the generalized logistic distributions of Type
II (GLDII) by compounding a reduced log-Weibull distribution with a gamma distri-
bution. Recently Balakrishnan and Hossain (2007) studied this distribution under
progressive type II censoring. The GLDII with location parameter µ, scale parameter
σ and shape parameter α has density given by
f(x | µ, σ, α) = (α/σ) exp(−α(x − µ)/σ) / [1 + exp(−(x − µ)/σ)]^{α+1},  −∞ < x < ∞,
and the corresponding cumulative distribution function is given by

F(x | α, µ, σ) = 1 − [ exp(−(x − µ)/σ) / (1 + exp(−(x − µ)/σ)) ]^α,  −∞ < x < ∞.

Figure 3.1: Type II generalized logistic density for different values of α and σ = 0.5.
Let Z = (X − µ)/σ. Then Z has the standard Type II generalized logistic distribution
with pdf

f(z | α) = α e^{−αz} / (1 + e^{−z})^{α+1},  −∞ < z < ∞,

and the corresponding cumulative distribution function is

F(z | α) = 1 − ( e^{−z} / (1 + e^{−z}) )^α,  −∞ < z < ∞.
The standard Type II generalized logistic density function is plotted in Figure
3.1 for different values of α. It is apparent from the figure that the distribution is
negatively skewed for α < 1 and positively skewed for α > 1; that is, skewness is
an increasing function of α for the GLDII. It is also apparent that the GLDII has
heavier tails than the normal distribution, meaning that there is a higher probability
of extreme values than under a normal distribution. Therefore, the GLDII can be
used as an alternative to the normal distribution or to skewed distributions.
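For concreteness, the density and distribution function above can be evaluated directly; a small sketch (the function names are ours):

```python
import math

def gld2_pdf(x, alpha=1.0, mu=0.0, sigma=1.0):
    """GLDII density: (alpha/sigma) exp(-alpha z) / (1 + exp(-z))^(alpha+1),
    with z = (x - mu)/sigma."""
    z = (x - mu) / sigma
    return (alpha / sigma) * math.exp(-alpha * z) / (1.0 + math.exp(-z)) ** (alpha + 1)

def gld2_cdf(x, alpha=1.0, mu=0.0, sigma=1.0):
    """GLDII cdf: 1 - (exp(-z) / (1 + exp(-z)))^alpha."""
    z = (x - mu) / sigma
    return 1.0 - (math.exp(-z) / (1.0 + math.exp(-z))) ** alpha

print(gld2_cdf(0.0, alpha=1.0))  # 0.5: alpha = 1 recovers the symmetric logistic case
print(gld2_pdf(0.0, alpha=1.0))  # 0.25: the standard logistic density at zero
```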
3.3. Motivation and objectives
Figure 3.2 shows the density plots (Gaussian kernel smoothing by the density()
function in R) of 3 genes randomly selected from the Golub et al. (1999) leukemia
dataset. The dataset involves 7129 genes in 38 ALL and 34 AML subjects. It is seen
from Figure 3.2 that the genes follow asymmetric distributions of different shapes.
The following three statistics have been calculated to assess the performance of
the GLDII against the normal distribution:
Kurtosis If x_i denotes the non-missing observations of x, n their number, µ
their mean, and m_r = ∑_i (x_i − µ)^r / n the sample moment of order r, then
kurtosis is defined as

Kurtosis = m4/m2² − 3.
Distributions with zero excess kurtosis belong to a symmetric distribution fam-
ily. A distribution with positive excess kurtosis has a more acute peak around
the mean (that is, a lower probability than a normally distributed variable of
values near the mean) and fatter tails (that is, a higher probability than a
normally distributed variable of extreme values). A distribution with negative
excess kurtosis has a lower, wider peak around the mean (that is, a higher
probability than a normally distributed variable of values near the mean) and
thinner tails (that is, a lower probability than a normally distributed variable of
extreme values).

Figure 3.2: Density plots of 3 genes from the leukemia dataset: black solid line ALL and red dashed line AML.
Skewness In probability theory and statistics, skewness is a measure of the asym-
metry of the probability distribution of a real-valued random variable. It is
defined as,
Skewness = m3 / m2^{3/2}.
Qualitatively, a negative skewness indicates that the tail on the left side of
probability density function is longer than the right side and the bulk of the
values (including the median) lies to the right of the mean. A positive skew
indicates that the tail on the right side is longer than the left side and the bulk
of the values lies to the left of the mean. A zero value indicates that the values
are relatively evenly distributed on both sides of the mean, typically but not
necessarily implying a symmetric distribution.
Akaike Information Criterion Akaike’s information criterion (AIC) is a measure
of the goodness of fit of an estimated statistical model. Given a data set, several
competing models may be ranked according to their AIC, with the one having
the lowest AIC being the best. In the general case, the AIC is
AIC = 2k − 2 lnL
where k is the number of parameters in the statistical model, and L is the max-
imized value of the likelihood function for the estimated model. Increasing the
number of free parameters to be estimated improves the goodness of fit, regardless
of whether the additional parameters are meaningful. Hence the AIC not only rewards
goodness of fit, but also includes a penalty that is an increasing function of the
number of estimated parameters. This penalty discourages overfitting. The
preferred model is the one with the lowest AIC value. The AIC methodology
attempts to find the model that best explains the data with a minimum of free
parameters.
Table 3.1 presents the skewness, kurtosis and AIC values for the 3 genes randomly
selected from the Golub et al. (1999) leukemia dataset. It is seen from the results
that the GLDII is preferable to the normal distribution in terms of AIC, since the
GLDII produces smaller AIC values for most of the genes, and it performs best for
genes with higher skewness values. Therefore, the GLDII can be used for robustness
studies instead of classical procedures in microarray studies, since extreme values
are often observed in real microarray data.
Table 3.1: Model selection from GLDII and Normal Distribution based on AIC values

Gene Name        Condition  Skewness  Kurtosis  AIC (GLDII)  AIC (Normal)
U97502_rna1_at   ALL        -0.3710   -0.3981   539.07       541.37
U97502_rna1_at   AML         0.4327   -0.8372   495.63       498.92
D14874_at        ALL        -0.1153   -0.6149   537.81       537.37
D14874_at        AML         0.8593    0.9328   480.46       485.16
U82970_at        ALL        -1.0236    1.2732   567.08       573.52
U82970_at        AML         0.3549   -1.2963   493.02       494.63
In this chapter, we obtain the receiver operating characteristic (ROC) curve by
computing sensitivity and specificity for a range of p-value cutpoints. The area under
the curve (AUC) is used to compare the performance of different methods. A very
good method has a high true positive rate for a given false positive rate, so that the
ROC curve occupies the upper left hand side of the graph with AUC approaching the
ideal of 1.0.
The present chapter makes four main contributions: (1) we propose that the underlying
distribution of a given gene expression profile follows a GLDII, because it provides
a long-tailed alternative to the normal distribution; (2) we propose an approximate
likelihood ratio test (ALRT) method which is more robust than Wald tests, especially
for large sample sizes; (3) the ALRT method is extended to multiclass microarray
data; (4) we perform a comparative study to elucidate key features of different
methods in microarray studies.
3.4. Method
A random variable Y on (−∞,∞) which follows a GLDII has a density function of
the form
f_Y(y) = (α/b) exp(−α(y − µ)/b) / [1 + exp(−(y − µ)/b)]^{α+1};  −∞ < y < ∞  (3.4.1)
where α, µ ∈ (−∞,∞) and b > 0 are shape, location and scale parameters, respec-
tively. When the shape parameter α = 1, the density in (3.4.1) corresponds to the
usual logistic density function. Moments of the GLDII are conveniently obtained via
the moment generating function (MGF). For the standard GLDII, the MGF is
M(t) = Γ(1 + t) Γ(α − t) / Γ(α),

which gives the mean and variance as ψ(α) − ψ(1) and ψ′(α) + ψ′(1), respectively,
where ψ(α) = (d/dα) ln Γ(α). Suppose {x_ig, i = 1, …, n1} and {y_jg, j = 1, …, n2}
denote the expression values observed for the n1 control cases and n2 treatment
cases for the gth gene from the
GLDII. Here we aim to test that the means of the two groups are equal. To test
the hypothesis here we assume that the variances and shape of the two distributions
are equal, which implies α1 = α2 = α (say) and b1 = b2 = b (say) where α1 and α2,
are the shape parameters for the two groups, respectively and b1 and b2 are the scale
parameters for the two groups, respectively. This is not a crucial assumption, because
we assume only that within a gene the variances and shapes of the different groups
are the same while they may differ between genes, and this assumption gives us
computational flexibility. Therefore the null hypothesis becomes H0 : µ1 = µ2, where µ1 and µ2
are the location parameters of the two groups, respectively. For simplicity of
notation, we will write x_i (i = 1, · · · , n1) instead of x_ig and y_j (j = 1, · · · , n2)
instead of y_jg. Following Hossain and Willan (2007), we first estimate the location
and scale parameters in the case where the shape parameter α is known. Denoting
z_i = (x_i − µ1)/b and z_j = (y_j − µ2)/b, the log
likelihood function for the GLDII is
log L = −n1 ln b + ∑_{i=1}^{n1} ln f(z_i) − n2 ln b + ∑_{j=1}^{n2} ln f(z_j),  (3.4.2)
where f(zi) and f(zj) are the density function of the treatment group and control
group, respectively. Let

Φ1(z_i) = ∂/∂z_i ln f(z_i) = (1 − e^{z_i})/(1 + e^{z_i})

and

Φ2(z_j) = ∂/∂z_j ln f(z_j) = (1 − e^{z_j})/(1 + e^{z_j}).
Then, we obtain the likelihood equations for µ1, µ2 and b, from (3.4.2), as follows:

∂log L/∂µ1 = −(1/b) ∑_{i=1}^{n1} Φ1(z_i) = 0,  (3.4.3)

∂log L/∂µ2 = −(1/b) ∑_{j=1}^{n2} Φ2(z_j) = 0,  (3.4.4)

∂log L/∂b = −n1/b − (1/b) ∑_{i=1}^{n1} z_i Φ1(z_i) − n2/b − (1/b) ∑_{j=1}^{n2} z_j Φ2(z_j) = 0.  (3.4.5)
The likelihood equations in (3.4.3) to (3.4.5) are non-linear and do not admit explicit
solutions because of the presence of the terms Φ1(z_i) and Φ2(z_j). Consequently,
numerical methods have to be employed to obtain the MLEs of the parameters. The
potential problem in using numerical methods is that they require starting values
near the global maximum, which makes them difficult to use in microarray data
analysis, where thousands of gene-by-gene tests are performed. Here we propose an
approximate likelihood estimation procedure. A trade-off between approximate
likelihood estimates and full likelihood estimates is given in the paper by Hossain
and Willan (2007). Following Hossain and Willan (2007), we approximate the
functions Φ1(z_i) and Φ2(z_j) by expanding them in a Taylor series around
F^{−1}(p_i) = ν_i and F^{−1}(p_j) = ν_j, respectively; keeping only the first two
terms gives the approximations:
Φ1(z_i) ≈ Φ1(ν_i) + Φ1′(ν_i)(z_i − ν_i)
        = Φ1(ν_i) − ν_i Φ1′(ν_i) + z_i Φ1′(ν_i)
        = A_{1i} − B_{1i} z_i,  (3.4.6)

Φ2(z_j) ≈ Φ2(ν_j) + Φ2′(ν_j)(z_j − ν_j)
        = Φ2(ν_j) − ν_j Φ2′(ν_j) + z_j Φ2′(ν_j)
        = A_{2j} − B_{2j} z_j,  (3.4.7)

where

A_{1i} = Φ1(ν_i) − ν_i Φ1′(ν_i),  B_{1i} = −Φ1′(ν_i),
A_{2j} = Φ2(ν_j) − ν_j Φ2′(ν_j),  B_{2j} = −Φ2′(ν_j),

ν_i = ln(−ln q_i),  ν_j = ln(−ln q_j),

and

p_i = i/(n1 + 1),  q_i = 1 − p_i,  i = 1, 2, · · · , n1,
p_j = j/(n2 + 1),  q_j = 1 − p_j,  j = 1, 2, · · · , n2.
Substituting the approximations (3.4.6) and (3.4.7) into (3.4.3) and (3.4.4), we get
two approximate normal equations:

∂log L/∂µ1 ≈ −(1/b) ∑_{i=1}^{n1} (A_{1i} − B_{1i} z_i) = 0,  (3.4.8)

∂log L/∂µ2 ≈ −(1/b) ∑_{j=1}^{n2} (A_{2j} − B_{2j} z_j) = 0.  (3.4.9)
Now, solving equation (3.4.8), the AMLE of µ1 can be obtained as µ̂1 = V1 − W1 b̂,
where

V1 = ∑_{i=1}^{n1} B_{1i} x_i / ∑_{i=1}^{n1} B_{1i},  W1 = ∑_{i=1}^{n1} A_{1i} / ∑_{i=1}^{n1} B_{1i}.

Similarly, we can obtain the AMLE of µ2 by solving the second normal equation
(3.4.9): µ̂2 = V2 − W2 b̂. Substituting the values of µ̂1 and µ̂2 into the third
approximate normal equation gives a quadratic equation with two roots. Since b
must be positive, the AMLE of b is obtained as

b̂ = [ −λ1 + √(λ1² + 4(n1 + n2) λ2) ] / ( 2(n1 + n2) ),  (3.4.10)
where

λ1 = ∑_{i=1}^{n1} A_{1i}(x_i − V1) + ∑_{j=1}^{n2} A_{2j}(y_j − V2),

λ2 = ∑_{i=1}^{n1} B_{1i}(x_i − V1)² + ∑_{j=1}^{n2} B_{2j}(y_j − V2)².
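The closed-form estimates (3.4.6) to (3.4.10) are easy to compute directly; the following sketch is our own translation of the formulas, with Φ(z) = (1 − e^z)/(1 + e^z) and ν_i = ln(−ln q_i) as above, and returns the AMLEs of µ1, µ2 and b:

```python
import math
import random

def amle_two_sample(x, y):
    """Closed-form AMLEs of mu1, mu2 and b from (3.4.6)-(3.4.10), with
    Phi(z) = (1 - e^z)/(1 + e^z) linearized around nu_i = ln(-ln q_i)."""
    def coeffs(sample):
        n = len(sample)
        A, B = [], []
        for i in range(1, n + 1):
            q = 1.0 - i / (n + 1.0)           # q_i = 1 - p_i, p_i = i/(n+1)
            nu = math.log(-math.log(q))
            e = math.exp(nu)
            phi = (1.0 - e) / (1.0 + e)       # Phi(nu)
            dphi = -2.0 * e / (1.0 + e) ** 2  # Phi'(nu)
            A.append(phi - nu * dphi)         # A = Phi(nu) - nu Phi'(nu)
            B.append(-dphi)                   # B = -Phi'(nu)
        return sorted(sample), A, B           # coefficients go with order statistics
    xs, A1, B1 = coeffs(x)
    ys, A2, B2 = coeffs(y)
    V1 = sum(b * v for b, v in zip(B1, xs)) / sum(B1)
    W1 = sum(A1) / sum(B1)
    V2 = sum(b * v for b, v in zip(B2, ys)) / sum(B2)
    W2 = sum(A2) / sum(B2)
    lam1 = sum(a * (v - V1) for a, v in zip(A1, xs)) + \
           sum(a * (v - V2) for a, v in zip(A2, ys))
    lam2 = sum(b * (v - V1) ** 2 for b, v in zip(B1, xs)) + \
           sum(b * (v - V2) ** 2 for b, v in zip(B2, ys))
    n = len(xs) + len(ys)
    b_hat = (-lam1 + math.sqrt(lam1 ** 2 + 4.0 * n * lam2)) / (2.0 * n)
    return V1 - W1 * b_hat, V2 - W2 * b_hat, b_hat

random.seed(3)
x = [random.gauss(0.0, 1.0) for _ in range(30)]
y = [random.gauss(1.0, 1.0) for _ in range(30)]
mu1, mu2, b = amle_two_sample(x, y)
print(round(mu1, 3), round(mu2, 3), round(b, 3))
```

Because the coefficients depend only on ranks, the estimates are computed in closed form for every gene, which is what makes the gene-by-gene ALRT feasible.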
Now, under H0, we have µ1 = µ2 = µ0 (say). We obtain the likelihood equations for
µ0 and b0, from (3.4.2), as follows:

∂log L/∂µ0 = −(1/b0) ∑_{i=1}^{n1} Φ1(z_i) − (1/b0) ∑_{j=1}^{n2} Φ2(z_j) = 0,  (3.4.11)

∂log L/∂b0 = −n1/b0 − (1/b0) ∑_{i=1}^{n1} z_i Φ1(z_i) − n2/b0 − (1/b0) ∑_{j=1}^{n2} z_j Φ2(z_j) = 0.  (3.4.12)
Following similar steps as above with likelihood equations (3.4.11) and (3.4.12),
the AMLE of µ0 is obtained as

µ̂0 = V0 − W0 b̂0,  (3.4.13)

where

V0 = ( ∑_{i=1}^{n1} B_{1i} x_i + ∑_{j=1}^{n2} B_{2j} y_j ) / ( ∑_{i=1}^{n1} B_{1i} + ∑_{j=1}^{n2} B_{2j} ),

W0 = ( ∑_{i=1}^{n1} A_{1i} + ∑_{j=1}^{n2} A_{2j} ) / ( ∑_{i=1}^{n1} B_{1i} + ∑_{j=1}^{n2} B_{2j} ),

and the AMLE of b0 is obtained as

b̂0 = [ −λ10 + √(λ10² + 4(n1 + n2) λ20) ] / ( 2(n1 + n2) ),  (3.4.14)

where

λ10 = ∑_{i=1}^{n1} A_{1i}(x_i − V0) + ∑_{j=1}^{n2} A_{2j}(y_j − V0),

λ20 = ∑_{i=1}^{n1} B_{1i}(x_i − V0)² + ∑_{j=1}^{n2} B_{2j}(y_j − V0)².
Now, for the parameter α, the estimate is obtained by maximizing the likelihood
over α with the parameters µ1, µ2 and b replaced by µ̂1, µ̂2 and b̂, i.e.
α̂ = argmax_α L(µ̂1, µ̂2, b̂, α) (the profile likelihood method; Diciccio and Tibshirani
(1991)). The ALRT statistic can then be obtained as

Λ = −2 log [ L(µ̂0, b̂0, α̂0) / L(µ̂1, µ̂2, b̂, α̂) ],  (3.4.15)
which is asymptotically $\chi^2_1$. However, the exact distribution of the likelihood ratio corresponding to specific hypotheses is very difficult to determine. A convenient result, though, says that as the sample size $n$ approaches infinity, the test statistic $\Lambda$ for a nested model is asymptotically $\chi^2$ distributed with degrees of freedom equal to the difference in dimensionality of $L(\hat{\mu}_0, \hat{b}_0, \hat{\alpha}_0)$ and $L(\hat{\mu}_1, \hat{\mu}_2, \hat{b}, \hat{\alpha})$ (Cox and Hinkley (1974)). A simulation study has been conducted with 5000 samples from the GLDII under the two conditions. Under the null hypothesis we set both location parameters equal to 0, and both the scale and shape parameters equal to 1; the samples per condition are therefore generated from $Y \sim \mathrm{GLDII}(1, 0, 1)$, with a sample size of 50 per condition. Figure 3.3 displays the histogram of the likelihood ratio statistic values with density scaling. It is seen from the figure that the shape is approximately that of the chi-square distribution with 1 degree of freedom (red dotted line).
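Given the asymptotic $\chi^2_1$ null distribution, a gene-level p-value can be read off the upper tail of the chi-square distribution. A minimal sketch in Python (the $\Lambda$ values below are hypothetical, not taken from the thesis data):

```python
from scipy.stats import chi2

def alrt_pvalue(lam, df=1):
    """Asymptotic p-value for an ALRT statistic: upper chi-square tail."""
    return chi2.sf(lam, df)

# Hypothetical Lambda values for three genes in a two-class comparison (df = 1)
pvals = [alrt_pvalue(lam) for lam in (0.12, 3.84, 10.8)]
```

For a $k$-class comparison the same call applies with df = $k - 1$.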
Figure 3.3: Histogram of likelihood ratio values with density scaling under the null hypothesis
This ALRT method can also be extended to scenarios with more than two classes. A detailed discussion for multiclass microarray data is given in section 3.9. If the measured expression levels for a gene have $k > 2$ classes which are taken to be independent and distributed as GLDII with location parameters $\mu_1, \mu_2, \cdots, \mu_k$ and common scale parameter $b$, the ALRT statistic becomes
\[
\Lambda = -2\log\left[\frac{L(\hat{\mu}_0, \hat{b}_0, \hat{\alpha}_0)}{L(\hat{\mu}_1, \hat{\mu}_2, \cdots, \hat{\mu}_k, \hat{b}, \hat{\alpha})}\right], \tag{3.4.16}
\]
which is asymptotically $\chi^2_{k-1}$.
3.5. Comparison between AMLE and MLE for location and scale parameters of GLDII
A simulation study is conducted with three choices of sample size to compare the AMLE and MLE for the location and scale parameters of GLDII. Table 3.2 shows the average values and variances of the MLEs and AMLEs for sample sizes 10, 20 and 50, determined by simulating data from the standard GLDII with α = 1.5. All averages were computed over 1000 simulations. Comparing the AMLE values in the table with the corresponding entries for the MLEs, we observe that the AMLEs are almost as efficient as the MLEs even for small sample sizes. For the larger sample sizes, the AMLE of the scale parameter shows slightly smaller bias, although its variance remains marginally larger than that of the MLE.
Table 3.2: Average values and variances of MLEs and AMLEs when the data are simulated from GLDII(1.5, 0, 1).

                      MLE                                    AMLE
size     µ̂        b̂        var(µ̂)   var(b̂)     µ̂        b̂        var(µ̂)   var(b̂)
10    -0.02146  0.95327  0.32105  0.06635   -0.02058  0.97748  0.32906  0.07849
20     0.00475  0.98148  0.15893  0.03114    0.00477  0.98988  0.16194  0.03406
50    -0.00125  0.99346  0.06078  0.01069   -0.00123  0.99943  0.06374  0.01393
3.6. FDR Estimation
A nice property of the ALRT method is its connection with the FDR. Because of the extensive computation involved in the ALRT method, we used 500 permutations to estimate the FDR. Though it is possible to get gene-specific p-values from the asymptotic chi-squared distribution of the ALRT statistic, here we propose to estimate the FDR with the test statistic (3.4.15). Similar to the SAM analysis (Tusher et al. (2001)), we evaluated the statistical significance of the ALRT method. We can use the following permutation algorithm to select significant genes and estimate the FDR:
1. For the original data, calculate the $\Lambda_g$ statistic for the $g$th gene ($g = 1, \cdots, m$), and denote the ordered values as $\Lambda^*_{(g)}$.

2. For the $r$th permutation, calculate the $\Lambda_g$ statistics and denote their ordered values as $\Lambda^r_{(g)}$, $r = 1, \cdots, 500$. Denote their averages across all permutations as
\[
\bar{\Lambda}_{(g)} = \frac{1}{500}\sum_{r=1}^{500}\Lambda^r_{(g)}.
\]

3. For a cutoff value $\Delta$, identify as significant the genes with
\[
|\Lambda^*_{(g)} - \bar{\Lambda}_{(g)}| \geq \Delta.
\]
Denote
\[
\Lambda_0 = \max_{\Lambda^*_{(g)} \leq \bar{\Lambda}_{(g)} - \Delta} \Lambda^*_{(g)}, \qquad \Lambda_1 = \min_{\Lambda^*_{(g)} \geq \bar{\Lambda}_{(g)} + \Delta} \Lambda^*_{(g)},
\]
and estimate the expected number of false positives by chance for the $\Lambda_g$ statistic as
\[
V(\Delta) = \frac{\sum_{r=1}^{500}\sum_g \left[ I\{\Lambda^r_{(g)} \geq \Lambda_1\} + I\{\Lambda^r_{(g)} \leq \Lambda_0\} \right]}{500},
\]
where $I\{\cdot\}$ is the indicator function, and the estimated FDR is
\[
\widehat{\mathrm{FDR}}(\Delta) = \frac{V(\Delta)}{R(\Delta)},
\]
where
\[
R(\Delta) = \sum_g I\{|\Lambda^*_{(g)} - \bar{\Lambda}_{(g)}| \geq \Delta\}
\]
is the total number of significant genes.
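The three steps above translate directly into code. This is an illustrative sketch (function and variable names are ours, not from the thesis; `lam_obs` holds the observed Λ values and `lam_perm` the permuted ones):

```python
import numpy as np

def estimate_fdr(lam_obs, lam_perm, delta):
    """Permutation-based FDR estimate for the ALRT statistics.

    lam_obs  -- shape (m,): observed statistics, one per gene
    lam_perm -- shape (R, m): statistics from R permutations of class labels
    delta    -- cutoff applied to |ordered observed - mean ordered permuted|
    """
    R = lam_perm.shape[0]
    obs = np.sort(lam_obs)                 # Lambda*_(g)
    perm = np.sort(lam_perm, axis=1)       # Lambda^r_(g), ordered per permutation
    perm_mean = perm.mean(axis=0)          # Lambda-bar_(g)
    sig = np.abs(obs - perm_mean) >= delta
    r_delta = int(sig.sum())               # R(delta): genes called significant
    if r_delta == 0:
        return 0.0, 0
    lower = obs[obs <= perm_mean - delta]
    upper = obs[obs >= perm_mean + delta]
    lam0 = lower.max() if lower.size else -np.inf   # Lambda_0
    lam1 = upper.min() if upper.size else np.inf    # Lambda_1
    # V(delta): average count of permuted statistics beyond the cut points
    v = ((lam_perm >= lam1).sum() + (lam_perm <= lam0).sum()) / R
    return v / r_delta, r_delta
```

With 500 permutations and the observed Λ values, `estimate_fdr(lam_obs, lam_perm, delta)` returns the estimated FDR and the number of significant genes for a given ∆.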
3.7. Permutation based p-values and AUC Estimation
In cases where we are unwilling to assume a null distribution, or are not able to identify the distribution of the test statistic, we can obtain an assessment of the p-value via permutation. The permuted p-values for each gene can be calculated using permutations of the class labels. For a range of cutoff values, the numbers of false-positive and false-negative results are calculated for each method from the permuted p-values. For each method, a sufficient range of cutoffs was chosen to be able to calculate the area under the receiver operating characteristic (ROC) curve. Seng et al. (2008) also evaluated their methods on simulated data by means of ROC curves, as did Hu et al. (2006). The ROC curve is a plot of the true positive fraction (sensitivity) versus the false positive fraction (1 − specificity) over a continuously varying decision threshold. A method with good discrimination will rank the true positives above the true negatives and will therefore have a true positive fraction greater than or equal to the false positive fraction at all points. The AUCi function from the ROC package of R is used to calculate the AUC after supplying the sensitivity and specificity values to the function.
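Both steps, permutation p-values and the ROC summary, can be sketched as follows. The code computes a permuted p-value per gene and then the AUC by the trapezoidal rule over p-value cutoffs (names are illustrative; `is_de` marks the genes known to be DE in a simulation):

```python
import numpy as np

def perm_pvalues(t_obs, t_perm):
    """Permutation p-value per gene: fraction of permuted |statistics|
    at least as extreme as the observed one (with the +1 correction)."""
    R = t_perm.shape[0]
    return ((np.abs(t_perm) >= np.abs(t_obs)).sum(axis=0) + 1.0) / (R + 1.0)

def roc_auc(p_values, is_de):
    """AUC of the ROC curve traced by thresholding the p-values,
    given the true DE status of each gene (known in simulations)."""
    cuts = np.concatenate(([0.0], np.sort(np.unique(p_values)), [1.0]))
    tpr = np.array([(p_values[is_de] <= c).mean() for c in cuts])
    fpr = np.array([(p_values[~is_de] <= c).mean() for c in cuts])
    # trapezoidal rule over the (FPR, TPR) points
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

A method that gives all DE genes smaller p-values than all non-DE genes attains an AUC of 1; random ranking gives 0.5.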
3.8. Comparison with Other Methods

We evaluate the performance of the ALRT method using simulated data as well as two published datasets. Here we consider SAM using the Wilcoxon statistic (S-W) and empirical Bayes t-statistics (E-B) for comparison, because these two methods are widely used in the microarray gene selection literature.
3.8.1. Simulation Experiment

The performance of our proposed method is evaluated using simulated data generated from three distributions, incorporating variability, treatment effects, shape effects and sample size effects. We consider scenarios having two conditions, treatment versus control, and sample sizes of 10 and 25 per condition. We generate data for 1000 genes and set the proportion of DE genes at 0.1.

First we simulate data for the control and treatment groups from a GLDII; that is, the expression values for a gene are generated from $Y \sim \mathrm{GLDII}_m(\alpha, 0, b)$. The GLDII covers symmetric as well as asymmetric distributions, which is practical for microarray experiments. Although in reality genes do interact with each other, the independence assumption is a useful simplification.
1. We consider two types of variability by sampling $b$ for each gene from an inverse Gamma distribution. Note that when $b \sim 1/\mathrm{Gamma}(a_0, a_0)$, the mean is $E(b) = \frac{a_0}{a_0-1}$ and the variance is $\mathrm{Var}(b) = \frac{a_0^2}{(a_0-1)^2(a_0-2)}$. For moderate variances we use $a_0 = 30$, and for high variances we use $a_0 = 5$.

2. We consider four types of shape by sampling $\alpha$ for each gene from a Gamma distribution. For negatively skewed expression we use $\alpha \sim \mathrm{Gamma}(50, 100)$, for symmetric expression $\alpha \sim \mathrm{Gamma}(50, 50)$, for moderately positively skewed expression $\alpha \sim \mathrm{Gamma}(6, 2)$, and for highly positively skewed expression $\alpha \sim \mathrm{Gamma}(10, 2)$.

3. Treatment group data for DE genes are drawn from a $\mathrm{GLDII}_m(\alpha, \delta, b)$ distribution, where the treatment effects $\delta$ are sampled from a $\mathrm{Gamma}(a_1, b_1)$ distribution. To allow the variance of $\delta$ ($= a_1/b_1^2$) to increase with the mean of $\delta$ ($= a_1/b_1$), and the mean of $\delta$ to be 0.5, 1 and 2, respectively, we sample $\delta \sim \mathrm{Gamma}(4, 8)$, $\delta \sim \mathrm{Gamma}(4, 4)$ and $\delta \sim \mathrm{Gamma}(8, 4)$.
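The gene-level parameter draws in steps 1–3 can be sketched as follows. The GLDII expression values themselves are not sampled here, only the per-gene parameters; all settings follow the (shape, rate) Gamma convention used above, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2011)
m = 1000                      # genes
n_de = int(0.1 * m)           # proportion of DE genes = 0.1

# Step 1: scale b ~ 1/Gamma(a0, a0); a0 = 5 gives the high-variability setting
a0 = 5
b = 1.0 / rng.gamma(shape=a0, scale=1.0 / a0, size=m)

# Step 2: shape alpha ~ Gamma(50, 50), the symmetric-expression setting
alpha = rng.gamma(shape=50, scale=1.0 / 50, size=m)

# Step 3: treatment effects for DE genes only, delta ~ Gamma(4, 8), E(delta) = 0.5
delta = np.zeros(m)
delta[:n_de] = rng.gamma(shape=4, scale=1.0 / 8, size=n_de)
```

With these draws, control expressions for gene $g$ would come from $\mathrm{GLDII}(\alpha_g, 0, b_g)$ and treatment expressions from $\mathrm{GLDII}(\alpha_g, \delta_g, b_g)$.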
To compare performance, we use receiver operating characteristic (ROC) curves, where the test sensitivities and specificities (true positive and true negative proportions) for a range of p-value cutoffs are averaged over 500 simulated datasets. The AUC can be interpreted as the true positive rate averaged uniformly over the range of false positive rates. Tables 3.3 and 3.4 show the AUCs for the different simulation models under the two types of variability, respectively. As seen in Tables 3.3 and 3.4, the ALRT assuming GLDII (ALRT(G)) provides the best results across all settings of treatment effects, variability, shape effects and sample sizes. For instance, when the simulation is made from GLDII with the scale parameter from inverse Gamma(5, 5), a sample size of 10 per group, an expected treatment effect of 0.5 and an expected shape parameter of 5, the ALRT(G), E-B and S-W statistics give AUCs of 0.6123, 0.5814 and 0.5763, respectively, and the standard errors of these AUCs are 0.0931, 0.1006 and 0.1148, respectively. ALRT(G) performs consistently and provides a more reliable method for identifying differential expression of genes. As expected, all the methods perform well when the sample sizes and treatment effects are large. Again, when the expression of a gene comes from a symmetric distribution, i.e., $E(\alpha) = 1$, ALRT(G), ALRT with the logistic distribution (ALRT(L)), and E-B provide similar results and outperform the S-W statistic. When the expression of a gene is skewed, the AUCs of the S-W statistic increase across all simulation settings.
Secondly, we generate data from the extreme value distribution; that is, expressions for a gene are generated from $Y \sim \mathrm{EV}_m(0, b')$. We consider scale parameters $b' = 1$ and $b' = 2$ in the simulation to allow extreme values on the left and the right side of the distribution, respectively. We allow the treatment effect ($\delta$) and the sample size effect ($n$) to vary as in the previous scenario. One added comparison is made by considering unequal sample sizes ($n_1 = 10$, $n_2 = 15$). The AUCs under the different simulation settings are given in Table 3.5. It appears that all the methods perform better as the sample sizes increase. Overall, it is apparent from the table that the ALRT(G) gives the best results across all settings of treatment effects, scale effects and sample sizes.
Thirdly, we use the normal distribution to generate the data; that is, expressions for a gene are generated from $Y \sim \mathrm{normal}_m(0, \sigma^2)$. Here we also consider 1000 genes, of which 100 are differentially expressed with a constant treatment effect ($\delta$) equal to 1.0. We sample $\sigma^2$ for each gene from an inverse Gamma distribution, i.e., $\sigma^2 \sim 1/\mathrm{Gamma}(a_{00}, a_{00})$. For moderate variances we use $a_{00} = 21$, and for high variances we use $a_{00} = 3$. We also randomly add noise, $\eta$, to 20% of the samples and 20% of the genes. To allow for different noise effects in the expression values we consider two settings, $\eta \sim \mathrm{Gamma}(4, 8)$ and $\eta \sim \mathrm{Gamma}(4, 4)$. The AUCs under the different simulation settings are summarized in Table 3.6. It can be seen that the results from the ALRT(G) method and the E-B t-statistic are close, though in the presence of noise there may be a slight loss of power (i.e., a reduced true positive rate) in using E-B compared to the ALRT(G) method. This is because the differences between mean levels used in the numerator of the E-B t-statistic are more sensitive to noise, and the resulting p-values may not be accurate. The ALRT method is not much affected by the noise. It is seen from the table that the relative performance of the ALRT method improves when the noise effect in the data is high, i.e., when $E(\eta) = 1$. It should be noted that the presence of noise is very common in real microarray data. Therefore, even when the normality assumption holds for a given microarray dataset, the ALRT method with the GLDII distributional assumption performs well for identifying differentially expressed genes.
Table 3.3: AUCs when the assumed simulation model is GLDII with high variance of expression values (i.e., b ∼ 1/Gamma(5, 5))

                 n1 = 10, n2 = 10                     n1 = 25, n2 = 25
E(δ)  E(α)  ALRT(G)  ALRT(L)  E-B     S-W      ALRT(G)  ALRT(L)  E-B     S-W
0.5   0.5   0.5528   0.5415   0.5425  0.5337   0.5796   0.5684   0.5697  0.5535
      1     0.5809   0.5707   0.5711  0.5559   0.6351   0.6247   0.6223  0.6136
      3     0.5999   0.5792   0.5832  0.5715   0.6948   0.6736   0.6742  0.6619
      5     0.6123   0.5799   0.5814  0.5763   0.7224   0.6838   0.6862  0.6803
1     0.5   0.5978   0.5923   0.5889  0.5768   0.7051   0.6840   0.6813  0.6907
      1     0.6428   0.6289   0.6386  0.6262   0.7988   0.7886   0.7934  0.7768
      3     0.6955   0.6806   0.6826  0.6481   0.8657   0.8364   0.8483  0.8458
      5     0.7071   0.6788   0.6817  0.6631   0.8826   0.8347   0.8511  0.8509
2     0.5   0.6937   0.6898   0.6871  0.6815   0.8821   0.8582   0.8634  0.8631
      1     0.7783   0.7579   0.7611  0.7462   0.9289   0.9253   0.9262  0.9217
      3     0.8419   0.8065   0.8278  0.7939   0.9574   0.9449   0.9500  0.9485
      5     0.8443   0.8101   0.8265  0.8007   0.9639   0.9487   0.9494  0.9468
Table 3.4: AUCs when the assumed simulation model is GLDII with moderate variance of expression values (i.e., b ∼ 1/Gamma(30, 30))

                 n1 = 10, n2 = 10                     n1 = 25, n2 = 25
E(δ)  E(α)  ALRT(G)  ALRT(L)  E-B     S-W      ALRT(G)  ALRT(L)  E-B     S-W
0.5   0.5   0.5545   0.5453   0.5421  0.5398   0.5842   0.5651   0.5623  0.5694
      1     0.5653   0.5601   0.5629  0.5414   0.6431   0.6314   0.6313  0.6207
      3     0.6007   0.5805   0.5888  0.5563   0.7015   0.6736   0.6750  0.6633
      5     0.6176   0.5891   0.5912  0.5674   0.7329   0.6867   0.6866  0.6833
1     0.5   0.6029   0.5920   0.5831  0.5793   0.7512   0.7348   0.7191  0.7421
      1     0.6432   0.6391   0.6409  0.6187   0.8246   0.8168   0.8173  0.7937
      3     0.6979   0.6733   0.6784  0.6441   0.9056   0.8720   0.8762  0.8648
      5     0.7134   0.6917   0.6983  0.6639   0.9245   0.8784   0.8797  0.8726
2     0.5   0.7221   0.7153   0.7142  0.7008   0.9396   0.9219   0.9198  0.9247
      1     0.8037   0.7982   0.8031  0.7898   0.9669   0.9633   0.9622  0.9551
      3     0.8771   0.8595   0.8577  0.8223   0.9775   0.9562   0.9607  0.9566
      5     0.8948   0.8624   0.8614  0.8517   0.9808   0.9699   0.9735  0.9729
Table 3.5: AUCs when the assumed simulation model is Extreme Value

              n1 = 10, n2 = 10               n1 = 10, n2 = 15               n1 = 25, n2 = 25
E(δ)  b′  ALRT(G) ALRT(L) E-B   S-W    ALRT(G) ALRT(L) E-B   S-W    ALRT(G) ALRT(L) E-B   S-W
0.5   1   0.564   0.558   0.554 0.543  0.587   0.568   0.567 0.560  0.609   0.579   0.581 0.576
      2   0.572   0.559   0.557 0.551  0.601   0.577   0.578 0.569  0.618   0.585   0.586 0.581
1     1   0.641   0.623   0.624 0.614  0.659   0.628   0.630 0.621  0.797   0.771   0.773 0.764
      2   0.649   0.629   0.626 0.619  0.665   0.632   0.634 0.627  0.814   0.785   0.786 0.774
2     1   0.707   0.682   0.685 0.676  0.724   0.695   0.694 0.686  0.845   0.811   0.813 0.807
      2   0.724   0.704   0.701 0.698  0.733   0.708   0.716 0.700  0.867   0.835   0.838 0.831
Table 3.6: AUCs when the assumed simulation model is normal

                n1 = 10, n2 = 10               n1 = 10, n2 = 15               n1 = 25, n2 = 25
E(η)  E(σ²) ALRT(G) ALRT(L) E-B   S-W    ALRT(G) ALRT(L) E-B   S-W    ALRT(G) ALRT(L) E-B   S-W
0.5   1.5   0.623   0.621   0.623 0.609  0.648   0.639   0.647 0.633  0.805   0.801   0.804 0.791
      1.05  0.643   0.641   0.641 0.629  0.673   0.668   0.672 0.663  0.810   0.802   0.807 0.801
1     1.5   0.626   0.622   0.625 0.610  0.646   0.640   0.647 0.638  0.795   0.791   0.793 0.788
      1.05  0.646   0.639   0.642 0.628  0.677   0.671   0.675 0.661  0.817   0.810   0.812 0.800
3.8.2. Duchenne Muscular Dystrophy (DMD) Data

Haslett et al. (2002) examined the pathogenic pathways and identified new or modifying factors involved in Duchenne Muscular Dystrophy (DMD). They used expression microarrays to compare individual gene expression profiles of skeletal muscle biopsies from 12 DMD patients and 12 unaffected control patients. Affymetrix GeneChip Ver. 5.0 software (MAS5.0) was used for raw data processing to obtain signal intensities, which were normalized with a linear regression. They used the geometric fold change method to test for differential expression. The differential expression of 12 genes (13 probesets) was confirmed by quantitative RT-PCR analysis of seven DMD biopsies and four unaffected biopsies. In this re-analysis we use only 23 arrays (only 11 DMD arrays), since one file was truncated. Therefore, we have 12625 genes in the dataset, with 12 samples in the normal group and 11 samples in the DMD patient group.
Raw data are converted to signal estimates using MAS5 by Affymetrix Inc. (2002), which is implemented in the affy package in Bioconductor written by Gautier et al. (2004). It is also possible to use dChip, RMA or GCRMA for normalization, and to perform additional normalization to make the underlying distribution more symmetric. Here we used MAS5 because our focus is to compare the performance of the methods on noisy data. An analysis of these data is also given in Hu et al. (2006).
The estimated shape parameter of the GLDII for 50% of the genes lies between 0.838 (1st quartile) and 1.180 (3rd quartile); for these genes the underlying distribution of the gene expressions is therefore close to symmetric. The lowest and highest estimated shape parameters for the GLDII are 0.498 and 2.310, respectively. Figure 3.4 depicts the ROC curves for each of the four methods: E-B, SAM, ALRT(L) and ALRT(G). For a fair comparison of these four methods we calculate the p-values after 500 permutations, and different cutoffs are chosen so as to be able to calculate the area under the ROC curve. For the selected cutoffs, all four methods produce sets of false positive genes which are mostly similar or differ by only one or two genes. We can see from the ROC curves that overall they have very similar performance. Therefore, comparing all the DE methods in the analysis of the DMD data, ALRT(G) performs competitively with other widely used methods. A method claiming a large number of differentially expressed genes is not considered superior unless it also produces a relatively small number of false positives. All 12 genes are found by all four methods at a fixed cutoff p-value of 0.05. Histograms of permuted p-values from each of the four methods are provided in Figure 3.5. Compared with the other methods, the ALRT(G) method gives fewer significant genes for a fixed p-value cutoff. Figure 3.6 displays the number of significant genes corresponding to the estimated FDR values for the SAM and ALRT(G) methods. It is seen from the figure that the ALRT(G) method produces fewer significant genes for a fixed estimated FDR value, which suggests that ALRT(G) produces fewer false positive genes compared to the SAM method. This dataset therefore suggests that it may be beneficial to use the ALRT(G) method for marker or gene identification when higher confidence in the selected genes is desired.
Figure 3.4: ROC plots for test of DE in the Duchenne Muscular Dystrophy data
Figure 3.5: Histogram of permuted p-values for test of DE in the Duchenne Muscular Dystrophy data
Figure 3.6: Number of significant genes against estimated FDR values for the Duchenne Muscular Dystrophy data
3.8.3. Golub Leukemia Data: Classification Between ALL and AML

Golub et al. (1999) used gene expression to discriminate between two types of leukemia, ALL and AML. Many authors have since analyzed their data using different methodologies. The training dataset consists of 27 ALL and 11 AML subjects, and the test dataset consists of 20 ALL and 14 AML subjects. The expression of 7129 genes was originally measured. Here we have merged the training and testing samples, giving a total of 72 samples for our analysis. Geman et al. (2004) also combined the test and training sets to estimate the error by leave-one-out cross-validation. A normalized version of the Golub leukemia data is taken from the R package hddplot. Our primary interest is to select important genes and use them to classify the two types of leukemia.
Figure 3.7 compares the expected number of false positives for the four methods corresponding to different values of the estimated FDR. We can see that the t statistic and the ALRT methods have very similar performance. Again, the SAM statistic and
Figure 3.7: Expected number of false positives corresponding to estimated FDR values
the ALRT method with the generalized distribution provide very similar performance until the estimated FDR reaches 0.5, with the ALRT(G) method performing better than the SAM statistic at higher FDR values (> 0.5).
Selecting a small number of relevant genes for accurate classification of samples is essential for the development of diagnostic tests. Different gene selection algorithms can select different relevant genes and lead to different classification accuracies. We assessed the performance of the four gene selection methods, E-B, SAM, ALRT(L) and ALRT(G), by selecting different numbers of genes and using these genes for classification with a simple Gaussian maximum likelihood discriminant rule with diagonal class covariance matrices. A detailed discussion of this discriminant rule is given in Dudoit et al. (2002). All the approaches involve a split of the data into training and test samples: the classification rule is developed on the training sample, and its performance is determined on the test sample. We separate the whole dataset into three folds, train each classifier on two folds and test it on the remaining
Figure 3.8: Average misclassification error rate for the leukemia dataset, shown against the number of top-ranked genes
one. The top-ranking genes, of a specified number between 5 and 40, are used to create the classification rule. Errors for a given classification relative to the known truth are then calculated by the classError function of the R package mclust. The top-ranked genes are selected and classification errors are measured within cross-validation. The performance of the methods is evaluated by taking the average misclassification error on the test samples. The average misclassification error is shown against the number of top-ranked genes in Figure 3.8. It is seen from the figure that the ALRT(G) method performs better than the E-B and SAM methods when very few genes are used for classification. The classification performance improves for all methods when more genes are used in the classification procedure. The results also indicate that the performance of the ALRT(G), E-B and SAM methods agree closely when many of the top 25 or 30 significant genes are of interest. Therefore the ALRT(G) can effectively be used for the problem of identifying important genes in these microarray data.
3.9. Multiclass Microarray Data

Multiclass microarray analysis, in which the data consist of more than two classes, is rapidly gaining attention in the literature (Yeung et al. (2005)). Suppose the measured expression levels for a gene have $k > 2$ classes which are taken to be independent and distributed as GLDII with location parameters $\mu_1, \mu_2, \cdots, \mu_k$ and common scale parameter $b$. For $k$ classes the ALRT statistic becomes
\[
\Lambda = -2\log\left[\frac{L(\hat{\mu}_0, \hat{b}_0, \hat{\alpha}_0)}{L(\hat{\mu}_1, \hat{\mu}_2, \cdots, \hat{\mu}_k, \hat{b}, \hat{\alpha})}\right],
\]
which is asymptotically $\chi^2_{k-1}$, and
\[
\hat{\mu}_0 = K_0 - L_0\hat{b}_0,
\]
where
\[
K_0 = \frac{\sum_{i=1}^{n_1} B_{1i}x_i + \sum_{j=1}^{n_2} B_{2j}y_j + \cdots + \sum_{l=1}^{n_k} B_{kl}w_l}{\sum_{i=1}^{n_1} B_{1i} + \sum_{j=1}^{n_2} B_{2j} + \cdots + \sum_{l=1}^{n_k} B_{kl}}
\]
and
\[
L_0 = \frac{\sum_{i=1}^{n_1} A_{1i} + \sum_{j=1}^{n_2} A_{2j} + \cdots + \sum_{l=1}^{n_k} A_{kl}}{\sum_{i=1}^{n_1} B_{1i} + \sum_{j=1}^{n_2} B_{2j} + \cdots + \sum_{l=1}^{n_k} B_{kl}}.
\]
Also
\[
\hat{b}_0 = \frac{-\lambda_{10} + \sqrt{\lambda_{10}^2 + 4(n_1 + n_2 + \cdots + n_k)\lambda_{20}}}{2(n_1 + n_2 + \cdots + n_k)},
\]
where
\[
\lambda_{10} = \sum_{i=1}^{n_1} A_{1i}(x_i - K_0) + \sum_{j=1}^{n_2} A_{2j}(y_j - K_0) + \cdots + \sum_{l=1}^{n_k} A_{kl}(w_l - K_0),
\]
\[
\lambda_{20} = \sum_{i=1}^{n_1} B_{1i}(x_i - K_0)^2 + \sum_{j=1}^{n_2} B_{2j}(y_j - K_0)^2 + \cdots + \sum_{l=1}^{n_k} B_{kl}(w_l - K_0)^2.
\]
Under the alternative,
\[
\hat{\mu}_u = K_u - L_u\hat{b}, \qquad u = 1, 2, \cdots, k,
\]
where
\[
K_u = \frac{B_{u1}y_{u1} + B_{u2}y_{u2} + \cdots + B_{un_u}y_{un_u}}{B_{u1} + B_{u2} + \cdots + B_{un_u}}
\]
and
\[
L_u = \frac{A_{u1} + A_{u2} + \cdots + A_{un_u}}{B_{u1} + B_{u2} + \cdots + B_{un_u}},
\]
with $y_{u1}, \cdots, y_{un_u}$ denoting the observations of class $u$. Finally,
\[
\hat{b} = \frac{-\lambda_1 + \sqrt{\lambda_1^2 + 4(n_1 + n_2 + \cdots + n_k)\lambda_2}}{2(n_1 + n_2 + \cdots + n_k)},
\]
where
\[
\lambda_1 = \sum_{i=1}^{n_1} A_{1i}(x_i - K_1) + \sum_{j=1}^{n_2} A_{2j}(y_j - K_2) + \cdots + \sum_{l=1}^{n_k} A_{kl}(w_l - K_k),
\]
\[
\lambda_2 = \sum_{i=1}^{n_1} B_{1i}(x_i - K_1)^2 + \sum_{j=1}^{n_2} B_{2j}(y_j - K_2)^2 + \cdots + \sum_{l=1}^{n_k} B_{kl}(w_l - K_k)^2,
\]
and $\hat{\alpha}$ is obtained from the likelihood function by maximizing over $\alpha$ with the parameters $\mu_1, \cdots, \mu_k$ and $b$ replaced by $\hat{\mu}_1, \cdots, \hat{\mu}_k$ and $\hat{b}$, i.e., $\hat{\alpha} = \arg\max_\alpha L(\hat{\mu}_1, \cdots, \hat{\mu}_k, \hat{b}, \alpha)$.
3.9.1. Example of Multi-class Microarray Data: SRBCT Dataset

We have applied the gene selection methods to the small round blue cell tumor (SRBCT) dataset (Khan et al. (2001)). The dataset is analyzed from the R package sda. The SRBCT microarray data measured the expression levels of 2308 genes for 88 samples: four tumor types, Burkitt lymphoma (BL, 11 samples), Ewing sarcoma (EWS, 29 samples), neuroblastoma (NB, 18 samples) and rhabdomyosarcoma (RMS, 25 samples), plus 5 other (non-SRBCT) samples. We excluded the 5 non-SRBCT samples and analyzed the remaining dataset of 83 samples from the four tumor types. This dataset was also analyzed by Yang et al. (2006). Here we compare our methods with the F statistic and a multivariate version of the SAM statistic. The R package DEDS provides the function comp.F for the computation of F statistics, and the samr function from the SAMR package is used to calculate the SAM statistic for the multiclass problem type. The misclassification errors were found by choosing the top K genes (K = 10, 15, 20, 25 and 30) according to the statistic values of the different methods. We evaluate the classification performance using a simple Gaussian maximum likelihood discriminant rule with diagonal class covariance matrices. A 5-fold cross-validation is used to obtain the classification errors. The form of the algorithm is as follows:
1. Randomly divide the samples into 5 partitions (subsamples).

2. Of the 5 partitions, a single subsample is retained as the validation data for testing the model, and the remaining 4 subsamples are used as training data.

3. The cross-validation process is repeated 5 times (the folds), with each of the 5 subsamples used exactly once as the validation data.

4. Select the given number of top-ranked genes from the training data and use these genes for classification with a simple Gaussian maximum likelihood discriminant rule with diagonal class covariance matrices.

5. Errors for a given classification relative to the known truth are then calculated by the classError function of the R package mclust.

6. Report the average error over all 5 test sets.
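The steps above can be sketched as follows. The F statistic stands in for any of the ranking methods, and the diagonal Gaussian rule is the discriminant described in the text; all function names are illustrative, and the fold assignment is simplified to striding over a random permutation:

```python
import numpy as np

def diag_gaussian_fit(X, y):
    """Per-class means and diagonal variances (diagonal class covariances)."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    vars_ = {c: X[y == c].var(axis=0) + 1e-8 for c in classes}  # guard zero variance
    return classes, means, vars_

def diag_gaussian_predict(X, classes, means, vars_):
    """Gaussian ML discriminant rule: pick the class with highest log-likelihood."""
    ll = np.stack([
        -0.5 * (np.log(vars_[c]) + (X - means[c]) ** 2 / vars_[c]).sum(axis=1)
        for c in classes
    ])
    return classes[np.argmax(ll, axis=0)]

def f_statistic(X, y):
    """One-way ANOVA F per gene, used here as the gene-ranking score."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = sum((y == c).sum() * (X[y == c].mean(axis=0) - overall) ** 2
                  for c in classes) / (len(classes) - 1)
    within = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
                 for c in classes) / (len(y) - len(classes))
    return between / (within + 1e-12)

def cv_error(X, y, k_top=10, folds=5, seed=0):
    """Average misclassification error, ranking genes inside each training fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errs = []
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        top = np.argsort(f_statistic(X[train], y[train]))[::-1][:k_top]
        cls, mu, va = diag_gaussian_fit(X[train][:, top], y[train])
        pred = diag_gaussian_predict(X[test][:, top], cls, mu, va)
        errs.append(np.mean(pred != y[test]))
    return float(np.mean(errs))
```

Note that the gene ranking is recomputed inside each training fold; ranking on the full data before cross-validation would leak information from the test folds and bias the error estimate downward.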
The mean misclassification errors for the different methods are summarized in Table 3.7. It appears from the results that the ALRT method with GLDII provides the best results among all the methods, producing the lowest misclassification error. Therefore the ALRT method with GLDII can be used effectively on this dataset for identifying the important genes in microarray data analysis.
Table 3.7: Average misclassification error for the SRBCT dataset

K    F (DEDS)  SAM (SAMR)  ALRT(L)  ALRT(GLDII)
10   0.325     0.311       0.316    0.305
15   0.297     0.290       0.292    0.285
20   0.271     0.268       0.267    0.265
25   0.265     0.260       0.263    0.258
30   0.258     0.252       0.256    0.253
3.10. Discussion

In this chapter, we propose a new method for detecting differentially expressed genes in microarray data. The method is based on a flexible distribution, the GLDII, for which an approximate likelihood ratio test (ALRT) is developed. We discuss the use of the ALRT method under the assumption that the underlying distribution is GLDII, and show comparable performance to the SAM method, SAM using the Wilcoxon rank statistic, and the E-B t-statistic, with applications to simulated data and two real datasets. The ALRT method with the GLDII distributional assumption appears to provide a favorable fit to the data across all simulation settings. In the presence of noise in gene expression data, Wald-type statistical tests of the differences between mean levels for each gene are more sensitive to the assumed distributional form, and the resulting p-values may not be accurate. Based on our simulations, our method appears more favorable in applications than the S-W method and the E-B t-statistic even when the sample sizes for the two groups are small (n = 10), and appears slightly more powerful in large samples. An overfitting problem can arise in using our method, with its three GLDII parameters, when the sample size is very small; a large sample size is therefore preferable, since the desirable properties of the estimators are justified in large-sample situations. The results from the simulation studies are considerably affected by the shape of the distribution. It is therefore critical to generate an underlying null distribution as close as possible to real microarray data, because a gene's statistical significance can be dramatically different under different underlying null distributions. The strength of the ALRT(G) method is its flexibility in adapting the shape to the underlying distribution of the expression values.
The DMD data analysis and the Golub leukemia data analysis show that our method performs well in comparison to other methods. SAM statistics or E-B t-statistics are often used for testing each gene's differential expression, but these methods require a large number of samples to produce reasonable estimates. Furthermore, the comparison of means can be greatly influenced by outliers with dramatically smaller or larger expression intensities. Gene discovery based on these Wald-type statistics can be misleading due to different error variances under different biological conditions and/or over different intensity ranges of microarray expression. Our ALRT(G) method is found to be favorable across all settings of treatment effect, variability, sample size and noise effect. The ALRT method also has the advantage that it can handle multiclass microarray datasets.
The main motivation for using the GLDII is its closed-form solution for microarray data. The use of the GLDII is particularly desirable in microarray data analysis for the stability of its tail probabilities, which play an important role in assessing statistical significance. The ALRT method using GLDII is part of an emerging literature that attempts to improve statistical tests for DE, and this approach may prove useful in a number of other genome-wide estimation and inference problems.
The ALRT method performs as well as or better than the Wald-type test statistic for testing the differences between two experimental conditions. The ALRT method with GLDII has the added advantage that it provides a flexible model that can accommodate both symmetric and asymmetric structures in the data. Due to this flexibility, the ALRT method with GLDII presents a viable alternative for finding differentially expressed genes in microarray studies. We assumed that the data from different genes are independent, which is unlikely to hold in microarray data, since the functions of many genes are interrelated in varying degrees. However, the ALRT method can be adjusted by borrowing information from all the genes. It may be possible to get more robust results with the ALRT method by borrowing strength from genes in local intensity regions for the estimation of the shape and scale parameters of the GLDII. For example, one could use a Wald test for comparing the two location parameters of the GLDII and smooth the variance by adding an offset to the denominator, similar to what is done for the SAM statistic. The test statistic would then be
\[
\frac{\hat{\mu}_1 - \hat{\mu}_2}{\mathrm{SE}(\hat{\mu}_1 - \hat{\mu}_2) + s_0},
\]
where $\hat{\mu}_1$ and $\hat{\mu}_2$ are the approximate location parameter estimates of the GLDII for the treatment and control groups, respectively, and $\mathrm{SE}(\cdot)$ is the standard error of the difference between the two approximate location parameter estimates. We can take the offset $s_0$ as the 90th percentile of the standard errors of the mean difference between the two conditions, following the approach of Efron et al. (2001).
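This moderated statistic is simple to express in code. A sketch follows; the location estimates and standard errors would come from the GLDII fits, and here they are placeholder arrays:

```python
import numpy as np

def moderated_stat(mu1, mu2, se, s0=None):
    """Difference of location estimates with an offset s0 in the denominator.
    By default s0 is the 90th percentile of the standard errors, following
    the Efron et al. (2001) choice described in the text."""
    mu1, mu2, se = map(np.asarray, (mu1, mu2, se))
    if s0 is None:
        s0 = np.percentile(se, 90)
    return (mu1 - mu2) / (se + s0)
```

The effect of the offset is that genes with very small standard errors are no longer inflated into large statistics, because $s_0$ bounds the denominator away from zero.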
CHAPTER 4
Nonparametric Method for Detecting
Differentially Expressed Genes: Single
Gene Analysis
In this chapter, we propose a flexible rank-based nonparametric procedure for analyzing microarray data. In the method we propose a statistic for testing whether the area under the receiver operating characteristic curve (AUC) for each gene is equal to 0.5, allowing a different variance for each gene. The contribution of this “single-gene” statistic is the studentization of the empirical AUC, which takes into account the variance associated with each gene in the experiment. DeLong et al. (1988) proposed a nonparametric procedure for calculating a consistent variance estimator of the AUC. We use their variance estimation technique to obtain a test statistic, and we focus on the primary step in the gene selection process, namely the ranking of genes with respect to a statistical measure of differential expression. Two real datasets are analyzed to illustrate the methods, and a simulation study is carried out to assess the relative performance of different statistical gene ranking measures. The work includes how to use the variance information to produce a list of significant targets and assess differential gene expression under two conditions. The proposed method does not involve complicated formulas and does not require advanced programming skills. We conclude that the proposed methods offer useful analytical tools for identifying differentially expressed genes for further biological and clinical analysis.
4.1. Introduction

A variety of gene selection methods have been developed in the last few years. Among them, some methods assume explicit statistical models for the gene expression data; these are called parametric methods. Other methods do not assume any specific distributional model for the gene expression data, and they are referred to as nonparametric gene selection methods. For example, Pepe et al. (2003) proposed two measures related to the ROC curve for ranking genes (or proteins) with regard to differential expression between tissues: the AUC and the partial AUC(t0) (i.e., pAUC), where t0 is some small false positive rate. Xiong et al. (2001) suggested a method to select genes by searching the space of feature subsets using classification errors. Recently, Jeffery et al. (2006) compared the efficiency of 10 gene selection methods, including both parametric and nonparametric methods. It has been reported that the results of nonparametric gene selection methods may be influenced by the classification methods chosen for scoring the genes (Troyanskaya et al. (2002)). Nonetheless, model-based gene selection methods lack adaptability, because it is often impossible to construct a universal probabilistic model that is suitable for all kinds of gene expression data, where noise and variance may vary dramatically across different datasets (Troyanskaya et al. (2002)). In this sense, nonparametric gene selection methods are more desirable than model-based ones. Here we propose a new gene selection method which does not assume any explicit statistical model for the gene expression values.
An assessment of the expression of a gene can be made through the use of a
receiver operating characteristic (ROC) curve. If a gene could perfectly discriminate
between two conditions, then there would be an expression level above which the entire
treatment population would fall and below which all control expressions would fall, or
vice versa. The curve would then pass through the point (0,1) on the unit grid. The
closer an ROC curve comes to this ideal point, the better its discriminating ability.
A gene with
no discriminating ability will produce a curve that follows the diagonal of the grid.
Pepe et al. (2003) argue that two measures related to the ROC curve are suitable
for ranking genes with regard to differential expression between two conditions: the
area under the ROC curve (AUC) and the partial AUC (pAUC).
For continuous data, the nonparametric ROC curve may be preferred since it
passes through all observed points and provides unbiased estimates of sensitivity,
specificity, and AUC in large samples (Zweig et al. (1993)). More importantly, the
nonparametric approach does not require that data be fitted to any particular model.
If the distributions of scores for true-positive and true-negative test subjects are
far from Gaussian, the parametric AUC and its corresponding standard error (SE)
derived from a directly fitted binormal model may be distorted (Godard et al. (1990)).
Convergence may also be an issue with expression data, since extreme values
are common in such data. For these reasons, as well as its relative simplicity
and ease of use, the nonparametric approach continues to be popular among many
researchers.
The remainder of the chapter is organized as follows. Section 4.2 contains a general
discussion comparing parametric and nonparametric methods. A brief discussion of ROC
analysis is given in Section 4.3. We discuss the motivation and related work in
Section 4.4. We describe our proposed method in Section 4.5 and FDR estimation in
Section 4.6. In Section 4.7, we present simulation results and illustrate the methods
using two real microarray datasets. We discuss the advantages and disadvantages of our
method and provide conclusions in Section 4.8.
4.2. Parametric versus Nonparametric Methods
Theoretical distributions are described by quantities called parameters, notably the
mean and standard deviation. Methods that use distributional assumptions are called
parametric methods, because we estimate the parameters of the distribution assumed
for the data. Frequently used parametric methods include t tests and analysis of vari-
ance for comparing groups, and least squares regression and correlation for studying
the relation between variables. All of the common parametric methods (“t meth-
ods”) assume that in some way the data follow a normal distribution and also that
the spread of the data (variance) is uniform either between groups or across the range
being studied. For example, the two sample t test assumes that the two samples of
observations come from populations that have normal distributions with the same
standard deviation. The importance of the assumptions for t methods diminishes as
sample size increases.
Alternative methods, such as the Mann-Whitney test, and rank correlation, do
not require the data to follow a particular distribution. They work by using the rank
order of observations rather than the measurements themselves. Methods which do
not require any distributional assumptions about the data, such as the rank meth-
ods, are called non-parametric methods. The term non-parametric applies to the
statistical method used to analyze data, and is not a property of the data. As tests
of significance, rank methods have almost as much power as t methods to detect a
real difference when samples are large, even for data which meet the distributional
requirements (Sawilowsky (1993)).
Non-parametric methods are most often used to analyze data which do not meet
the distributional requirements of parametric methods. In particular, skewed data are
frequently analyzed by non-parametric methods, although data transformation can
often make the data suitable for parametric analyses. Sawilowsky (1993) concluded
that “the t-test was more powerful only under a distribution that was relatively
symmetric, although the magnitude of the differences was trivial. In contrast, the
Mann-Whitney held huge power advantages for data sets which presented skewness”.
In exchange for being free of assumptions about the distribution of the data, rank
methods have the disadvantage that they are mainly suited to hypothesis testing: no
useful estimate, such as the average difference between two groups, is obtained.
Estimates and confidence intervals are easy to find with parametric methods.
Non-parametric estimates and confidence intervals can be calculated, but they depend on
extra assumptions which are almost as strong as those for t methods. Rank methods have
the added disadvantage of not generalizing to more complex situations, most obviously
when we wish to use regression methods to adjust for several other factors.
The choice of an approach may also be related to sample size, as the distributional
assumptions are more important for small samples.
4.3. General Discussion on ROC analysis
Receiver operating characteristic (ROC) analysis provides a comprehensive picture of
the ability of a test to make the distinction being examined over all decision thresh-
olds. Several different methods have been developed for the analysis of ROC curves.
The area under an ROC curve (AUC) is indicative of the overall accuracy of a test
and represents the probability that a randomly selected true-positive individual will
score higher on the test than a randomly selected true-negative individual. AUC can
be estimated both parametrically and nonparametrically. The parametric methods
usually model the ROC curves by assuming a particular underlying distribution of
subject outcomes (usually assuming that a bivariate distribution of outcomes is
transformable to a binormal one). The binormal ROC curves have been shown to be quite
robust for a wide class of curves encountered in practice (Hanley (1988)), a property
that is due in part to the variety of distributions that can be approximated by a
monotone transformation of a binormal distribution. One of the best known parametric
approaches to the analysis of ROC curves is the maximum likelihood approach introduced
by Dorfman and Alf Jr. (1969).
Nonparametric methods utilize empirical ROC points by connecting them with
straight lines, step functions or sometimes by fitting a smooth curve. The main
advantage of nonparametric methods compared to parametric ones is the absence of
specific assumptions about the shape of the curve or the underlying distribution of
outcomes. Furthermore, unlike many parametric procedures, iterative algorithms are
not needed for the implementation of most nonparametric methods. For continuous data,
the nonparametric ROC curve may be preferred since it passes through all observed
points and provides unbiased estimates of sensitivity, specificity, and AUC in large
samples (Zweig et al. (1993)). In this chapter we apply nonparametric ROC techniques
to the analysis of microarray gene expression data.
4.4. Motivation of this Chapter
The motivation for this chapter comes from two published papers containing
nonparametric approaches for identifying differentially expressed genes. Pepe et al.
(2003) proposed two measures related to the ROC curve for ranking genes (or proteins)
with regard to differential expression between tissues: AUC and partial AUC(t0) (i.e.,
pAUC), where t0 is some small false positive rate. The nonparametric AUC is equal to
the numerator of the Mann-Whitney U statistic and hence equivalent to the Wilcoxon
rank sum test (RST). The pAUC is not recommended for small sample sizes (Jeffery et al.
(2006)). Troyanskaya et al. (2002) compared three model-free approaches
and assessed their performances under varying noise levels. The three model-free ap-
proaches were: (1) nonparametric t-test, (2) RST, and (3) a heuristic method based
on high Pearson correlation to a perfectly differentiating gene (“ideal discriminator
method”). The RST is used as an alternative to the t-test to avoid the parametric
assumptions.
Figure 4.1 displays the distributions (Gaussian kernel smoothing via the density()
function in R) of four randomly selected genes from the Golub et al. (1999) leukemia
dataset.
Our interest here is to build a separation between two cancer types: acute lym-
phoblastic leukemia (ALL) and acute myeloid leukemia (AML). It is apparent from
the distributions of the four genes that the underlying distributions of the genes lack
symmetry. Therefore, building a method under the assumption of normality may be
invalid. Figure 4.2 displays the corresponding ROC curves for the four genes. The
AUC and pAUC(0.1) are calculated using the R package ROC. The figure indicates that
the gene "D14874_at" separates the two conditions more clearly than the other three
genes. The expression of gene "D14874_at" produces an AUC of 0.946 and a pAUC(0.1) of
0.048, which clearly indicates that it is the most differentially expressed (DE) gene.
Comparing the other three genes with respect to their AUC values indicates that gene
"X93512_at" is the second most DE gene. The remaining two genes, however, are not
comparable because they have equal values of AUC and pAUC. One feature of the pAUC and
AUC (or RST) is that they do not account for gene-specific variability of the
expression values. To improve this situation in microarray analysis, we suggest a
statistic which takes gene-specific variances under the two conditions into account.
Figure 4.3 shows the estimated AUC and its corresponding variance for each gene in the
Golub leukemia dataset. The estimation procedure is given in the next section. We can
see from the figure that genes with AUC values close to 0.5 have higher variances and
genes with AUC values close to 0 or 1 have lower variances. It is therefore important
to take gene-specific variances into account in the test statistic for testing equal
expression under the two conditions.
4.5. Materials and Methods
4.5.1. Single Gene Analysis: AUC
For simplicity, it is assumed that higher expression values are associated with the
treatment group.

Figure 4.1: Density plots of 4 genes (U97502_rna1_at, D14874_at, U82970_at,
X93512_at) from the leukemia dataset: solid line AML and dashed line ALL.

Figure 4.2: ROC curves of the 4 randomly selected genes from the leukemia dataset:
U97502_rna1_at (A = 0.785, pA = 0.0273), D14874_at (A = 0.946, pA = 0.048),
U82970_at (A = 0.785, pA = 0.0273), X93512_at (A = 0.805, pA = 0.0559).

Figure 4.3: AUC and corresponding variance of the AUC for each gene in the leukemia
dataset.

Let $\{x_{ig}\}_{i=1}^{n_1}$ and $\{y_{jg}\}_{j=1}^{n_2}$ denote the expression values
for the $n_1$ control and $n_2$ treatment subjects for the $g$th gene
($g = 1, \dots, m$). An unbiased estimator of the AUC for the $g$th gene, $A_g$, is
given by
\[
A_g = \frac{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \psi(x_{ig}, y_{jg})}{n_1 n_2}
    = \bar{\psi}_{..},
\]
where
\[
\psi(x, y) =
\begin{cases}
1 & x < y \\
0 & x > y.
\end{cases}
\]
The definition of $\psi(x, y)$ does not allow for ties because of the continuous nature
of microarray data. Ties may nevertheless occur when quantile normalization is used, in
which case $\psi(x, x) = 0.5$ can additionally be defined. Note that $A_g$ is equal to
the numerator of the Mann-Whitney U statistic and hence equivalent to the Wilcoxon RST.
The Wilcoxon RST is a nonparametric alternative to the two-sample t-test and is based
solely on the order in which the observations from the two samples fall. Troyanskaya
et al. (2002) described the RST and applied the method to microarray data analysis. The
estimate ($u_1$), mean and variance of the RST in Troyanskaya et al. (2002) are defined
as follows:
\[
w_1 = \sum \text{(ranks of the control group sample)}, \qquad
u_1 = w_1 - \frac{n_1(n_1 + 1)}{2},
\]
\[
E(u_1) = \frac{n_1 n_2}{2}, \qquad
\operatorname{Var}(u_1) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}.
\]
It is apparent that the mean and variance of the $u_1$ statistic are constants depending
only on the sizes of the two groups, and therefore do not take gene-specific variability
into account.
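The equivalence between $A_g$ and the Mann-Whitney/Wilcoxon statistic is easy to check
numerically. The short Python sketch below (illustrative code, not part of the thesis;
NumPy only) computes the empirical AUC for one gene both directly from the $\psi$
kernel and via the rank-sum route $u_1 = w_1 - n_1(n_1+1)/2$:

```python
import numpy as np

x = np.array([1.2, 0.7, 2.5, 1.9, 0.3])   # control expressions for one gene
y = np.array([2.1, 3.0, 1.5, 2.8])        # treatment expressions
n1, n2 = len(x), len(y)

# Direct route: A_g is the proportion of pairs with x_i < y_j (the psi kernel)
A = (x[:, None] < y[None, :]).mean()

# Rank-sum route (no ties here, so a simple double-argsort gives the ranks)
z = np.concatenate([x, y])
ranks = z.argsort().argsort() + 1.0
w1 = ranks[:n1].sum()                     # sum of control-group ranks
u1 = w1 - n1 * (n1 + 1) / 2               # Mann-Whitney U: pairs with x_i > y_j
print(A, 1 - u1 / (n1 * n2))              # prints 0.85 0.85: the two routes agree
```

Since $u_1$ counts the pairs with $x_i > y_j$, its per-pair complement equals the AUC,
confirming that ranking genes by $A_g$ and by the Wilcoxon rank sum are equivalent.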
The AUC can be used as an alternative to a t-test when the data are not normally
distributed. The AUC index reflects the inherent discriminative ability of a diagnostic
procedure and has a nice interpretation as the probability of correct discrimination
between the treatment and control groups. The estimator $A_g$ is approximately normally
distributed under quite general assumptions (Hoeffding (1948)). Hence, knowing the
variance of the estimator is essential for constructing a test statistic for testing
the hypothesis $H_0: A_g = 0.5$ against the alternative $H_1: A_g \neq 0.5$. An $A_g$
value of 0.5 represents no predictive or discriminative ability. The two-sided
alternative is important in cases where expressions in the control group are expected
to score higher than in the treatment group.
Several methods (Hanley et al. (1983); DeLong et al. (1988); Efron et al. (1993);
Dorfman et al. (1992)) have been proposed for computing the variances and covariances
of nonparametric AUC estimates derived from the same sample of cases. These may be
used to facilitate statistical tests of AUC differences between measures. Consistent,
completely nonparametric estimators of the covariance matrix of AUC estimators were
developed by DeLong et al. (1988). The conventional variance estimator proposed by
DeLong et al. (1988) can also be shown to be equivalent to the two-sample jackknife
estimator (Arvesen (1969)) of the variance. Because of the
structure of the nonparametric estimator of AUC, its variance estimator is easy to
compute:

1. Compute the treatment and control group components:
\[
\psi_{i.} = \frac{1}{n_2}\sum_{j=1}^{n_2} \psi(x_i, y_j), \qquad
\psi_{.j} = \frac{1}{n_1}\sum_{i=1}^{n_1} \psi(x_i, y_j).
\]

2. Calculate
\[
s_{10} = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1}\left[\psi_{i.} - \bar{\psi}_{..}\right]^2,
\qquad
s_{01} = \frac{1}{n_2 - 1}\sum_{j=1}^{n_2}\left[\psi_{.j} - \bar{\psi}_{..}\right]^2.
\]

3. The consistent estimator of the variance for the $g$th gene is
\[
V(A_g) = \frac{s_{10}}{n_1} + \frac{s_{01}}{n_2}.
\]
Now, for testing the hypothesis $H_0: A_g = 0.5$, the test statistic becomes
\[
Z_g = \frac{A_g - 0.5}{SE(A_g)},
\]
which is approximately standard normally distributed. We can rank genes according to
the values of $Z_g$.
However, when there are only a small number of arrays in each group, the estimate of
the standard error (SE) for each gene can be unstable. Some genes might by chance have
very small SEs and therefore appear highly significant. If a gene discriminates
perfectly between the two conditions, i.e., $A_g = 1$, then $SE(A_g)$ becomes 0, which
makes the $Z_g$ statistic arbitrarily large. To address this problem we smooth the
variance estimates by borrowing information from the ensemble of genes; this can assist
in inference about each gene individually. This technique of smoothing variances is not
new in microarray studies. For example, Tusher et al. (2001), Efron et al. (2001) and
Broberg et al. (2003) used t-statistics with an offset added to the standard deviation,
while Smyth (2004) proposed a t-statistic with a Bayesian adjustment to the denominator.
We take the offset $s_0$ as the quantile of the gene-wise standard errors that minimizes
the coefficient of variation of the $Z_g$ statistic. We can therefore calculate the
$d_g$ statistic to test for a treatment effect:
\[
d_g = \frac{A_g - 0.5}{SE(A_g) + s_0}.
\]
Similar adjustments for computing a test statistic were also used by Garrett et al.
(2004) and Hu et al. (2006).
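As a concrete sketch of the computations above (the $\psi$ kernel, the DeLong variance,
and the smoothed $d_g$ statistic), the following illustrative Python/NumPy code ranks
genes. The function names are my own, and for simplicity the offset $s_0$ is taken as a
fixed 90th percentile of the gene-wise SEs (an assumption; the text instead selects the
quantile minimizing the coefficient of variation of $Z_g$):

```python
import numpy as np

def delong_auc(x, y):
    """Empirical AUC A_g and DeLong variance V(A_g) for one gene.
    x: control expressions (length n1), y: treatment expressions (length n2)."""
    psi = (x[:, None] < y[None, :]) + 0.5 * (x[:, None] == y[None, :])
    a = psi.mean()                              # A_g = psi-bar_{..}
    s10 = psi.mean(axis=1).var(ddof=1)          # spread of psi_{i.}
    s01 = psi.mean(axis=0).var(ddof=1)          # spread of psi_{.j}
    return a, s10 / len(x) + s01 / len(y)       # V(A_g) = s10/n1 + s01/n2

def dg_statistic(X, Y, q=0.9):
    """Smoothed statistic d_g = (A_g - 0.5) / (SE(A_g) + s0) for every gene.
    X, Y: (genes x arrays) matrices for the control and treatment groups."""
    auc, var = np.array([delong_auc(x, y) for x, y in zip(X, Y)]).T
    se = np.sqrt(var)
    s0 = np.quantile(se, q)                     # offset from the ensemble of genes
    return (auc - 0.5) / (se + s0)

# Toy data: 100 genes on 10 control and 12 treatment arrays; first 10 genes DE
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Y = rng.normal(size=(100, 12))
Y[:10] += 2.0
d = dg_statistic(X, Y)
top = np.argsort(-np.abs(d))[:10]               # top-ranked genes by |d_g|
```

Because $s_0 > 0$, a gene with a perfect separation ($A_g = 1$, hence $SE(A_g) = 0$)
no longer produces an unbounded statistic.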
4.6. FDR Estimation with the dg statistic
In our application, the FDR is estimated using permutations and thresholding of the
test statistics. Alternative estimation methods using p-values can also be applied
(Storey and Tibshirani (2003)). We use the following permutation algorithm to select
significant genes and estimate the FDR:
1. For the original data calculate the $d_g$ statistics and denote their ordered values
by $d_{(g)}$.

2. For the $b$-th permutation, calculate the $d_g$ statistics and denote their ordered
values by $d^b_{(g)}$, $b = 1, \dots, B$. Denote their averages across all permutations
by
\[
\bar{d}_{(g)} = \frac{1}{B}\sum_{b=1}^{B} d^b_{(g)}.
\]

3. For a cutoff value $\Delta$, identify as significant the genes with
\[
\left| d_{(g)} - \bar{d}_{(g)} \right| \geq \Delta.
\]
Denote
\[
d_0 = \max_{\{g:\; d_{(g)} \leq \bar{d}_{(g)} - \Delta\}} d_{(g)}, \qquad
d_1 = \min_{\{g:\; d_{(g)} \geq \bar{d}_{(g)} + \Delta\}} d_{(g)},
\]
and estimate the expected number of false positives by chance for the $d_g$ statistic
as
\[
V(\Delta) = \frac{1}{B}\sum_{b=1}^{B}\sum_{g}
\left[ I\{d^b_{(g)} \geq d_1\} + I\{d^b_{(g)} \leq d_0\} \right],
\]
where $I\{\cdot\}$ is the indicator function. The estimated FDR is
\[
\widehat{\mathrm{FDR}}(\Delta) = \frac{V(\Delta)}{R(\Delta)},
\]
where
\[
R(\Delta) = \sum_{g} I\left\{\left| d_{(g)} - \bar{d}_{(g)} \right| \geq \Delta\right\}
\]
is the total number of significant genes. We can similarly calculate the expected
number of false positives and FDR for the SAM, t, AUC and pAUC statistics.
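The permutation algorithm above can be sketched directly. The code below is an
illustrative Python/NumPy outline (helper names are my own), using a simple
smoothed-AUC statistic as the plug-in $d_g$:

```python
import numpy as np

def dg_stat(data, labels, q=0.9):
    """Smoothed AUC statistic per gene; data is (genes x samples),
    labels holds 0 (control) and 1 (treatment)."""
    X, Y = data[:, labels == 0], data[:, labels == 1]
    psi = (X[:, :, None] < Y[:, None, :]).astype(float)   # genes x n1 x n2
    auc = psi.mean(axis=(1, 2))
    v = (psi.mean(axis=2).var(axis=1, ddof=1) / X.shape[1]
         + psi.mean(axis=1).var(axis=1, ddof=1) / Y.shape[1])
    se = np.sqrt(v)
    return (auc - 0.5) / (se + np.quantile(se, q))

def permutation_fdr(data, labels, delta, B=30, seed=1):
    """Estimate FDR(delta): compare ordered observed d_(g) with the
    permutation-averaged ordered null values, then count null exceedances."""
    rng = np.random.default_rng(seed)
    d_obs = np.sort(dg_stat(data, labels))                       # d_(g)
    perm = np.sort([dg_stat(data, rng.permutation(labels))
                    for _ in range(B)], axis=1)                  # d^b_(g)
    d_bar = perm.mean(axis=0)                                    # bar d_(g)
    R = int((np.abs(d_obs - d_bar) >= delta).sum())              # significant genes
    lo, hi = d_obs <= d_bar - delta, d_obs >= d_bar + delta
    d0 = d_obs[lo].max() if lo.any() else -np.inf
    d1 = d_obs[hi].min() if hi.any() else np.inf
    V = ((perm >= d1) | (perm <= d0)).sum() / B                  # E[false positives]
    return (V / R if R else 0.0), R

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 16))
labels = np.repeat([0, 1], 8)
data[:20, labels == 1] += 2.0            # 20 truly DE genes in the toy data
fdr, R = permutation_fdr(data, labels, delta=0.5)
```

Varying the cutoff delta traces out the trade-off between the number of genes called
significant, R, and the estimated FDR.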
4.7. Results
In this section, we evaluate the performance of our methods using simulated data as
well as two published datasets.
4.7.1. Simulation
We implemented and evaluated 4 methods for identifying differentially expressed
genes: SAM, RST, t statistic, and dg statistic. The performance of our proposed
methods is evaluated using simulated data generated from two distributions incorpo-
rating variability, treatment effect and sample size effect. We consider the scenarios
having two conditions, treatment versus control, and sample sizes of 10, 20 and 40 per
condition. We generated data from 1000 genes and set the proportion of DE genes
as 0.1. We consider equal sample sizes for each scenario. The following simulation
scenarios are considered here:
Sim1 : Sim1 simulates normal data with gene-specific variances drawn from a standard
exponential distribution. We consider a fixed effect size (D) of 1; the effect is
added to the second group.
Sim2 : Sim2 generates data from the extreme value distribution (EV); that is,
expressions for a gene are generated from EV(0, b′). We set the scale parameter
b′ = 2 in the simulation to allow extreme values in the right tail of the
distribution. The treatment effect (D) is as described for the previous scenario.
The numbers of true positives and false positives are estimated based on 200
simulations. The numbers of false positives corresponding to the number of significant
genes under Sim1 are given in Figures 4.4-4.6 for the different sample sizes per
condition. The left and right panels use different scales for the number of significant
genes. The number of significant genes is defined by ranking the FDR values
corresponding to each gene. It is apparent from the figures that with smaller sample
sizes per condition all the methods produce a higher number of false positives for a
fixed number of significant genes. With small sample sizes per condition
(n1 = n2 = 10), SAM performs better than the other methods. For moderate sample sizes
per condition (n1 = n2 = 20), the SAM and dg statistics produce very close results in
terms of the number of false positives for a fixed number of significant genes; the
same holds for the larger sample size (n1 = n2 = 40) per condition. Figures 4.7-4.9
provide the numbers of false positives corresponding to the number of significant genes
under the Sim2 scenario. With large sample sizes per condition, both the dg and SAM
statistics perform better than the RST and t-statistic, with the t-statistic worst.
4.7.2. Applications
A detailed evaluation of gene selection methods on real biological data is challenging
due to the difficulty of defining a gold standard. Here we have evaluated and applied
Figure 4.4: Comparison of the t, RST, SAM and dg statistics at sample size 10 under
the Sim1 scenario.

Figure 4.5: Comparison of the t, RST, SAM and dg statistics at sample size 20 under
the Sim1 scenario.

Figure 4.6: Comparison of the t, RST, SAM and dg statistics at sample size 40 under
the Sim1 scenario.
Figure 4.7: Comparison of the t, RST, SAM and dg statistics at sample size 10 under
the Sim2 scenario.

Figure 4.8: Comparison of the t, RST, SAM and dg statistics at sample size 20 under
the Sim2 scenario.

Figure 4.9: Comparison of the t, RST, SAM and dg statistics at sample size 40 under
the Sim2 scenario.
all the methods to two publicly available datasets.
The first dataset is the well-known Affymetrix spike-in study, which contains 12,626
genes, 12 replicates in each group, and 16 known differentially expressed genes (Cope
et al. (2004)). The dataset is contained in the R package "DEDS". Among the 16 truly
differentially expressed genes, 14 are identified by the t, RST and dg statistics and
15 by the SAM method within the 20 top-ranked genes. The ranking of the genes is made
by permuted p-values from the different methods. All 16 genes are identified by SAM
when 74 significant genes are considered; the corresponding numbers are 153, 156 and
162 for the dg, t and RST statistics, respectively. We also examined the Affymetrix
dataset by calculating the average number of truly identified genes by the t, SAM,
RST, and dg statistics among the 20 top-ranked genes, where the average is taken over
200 randomly drawn samples of equal size (n1 = n2) from each condition. We examined
samples of sizes 7 and 10 from each condition. With samples of size 7 from each
condition, on average 13.40, 14.25, 13.44 and 13.63 genes are truly identified by the
t, SAM, RST, and dg statistics, respectively. The average numbers are 14.15, 14.86,
14.34, and 14.52, respectively, for samples of size 10 from each condition. Therefore
SAM performs best among the methods, and the performance of the dg statistic improves
over the RST for all sample sizes.
Figure 4.10 shows the results for the Affymetrix spike-in data comparing the methods
in terms of concordance of genes with SAM. Concordance is defined as the number of
genes in the gene list produced by one method that are also present in the gene list
produced by another method. Here we compute the concordance between the list of most
DE genes produced by SAM and the list of most DE genes produced by each other method,
including the t statistic (the unequal-variance t-test). Comparing with the SAM gene
list of 100 genes, it appears that 74, 73, and 72 genes are concordant with the
Figure 4.10: Examining the methods in terms of concordance with the SAM statistic.
methods t-statistic, RST, and dg statistic, respectively. The four methods applied to
the Affymetrix study produce different gene lists, but we have found that the agreement
between the gene lists produced by SAM and the other statistics is quite similar for
this dataset.
The second dataset is from the Golub et al. (1999) leukemia study, which was used to
classify two types of leukemia: ALL and AML. The dataset is contained in the R package
golubEsets. Many authors have analyzed these data using different methodologies (Pan
(2002); Zhao et al. (2003)). The training dataset consists of 27 ALL and 11 AML
subjects, and the test dataset consists of 20 ALL and 14 AML subjects. The expression
levels of 7129 genes were measured. Here we have merged the training and test samples,
giving a total of 72 samples for our analysis. Figure 4.11 provides the expected
number of false positives corresponding to different values of the estimated FDR. It
is seen from the figure that the SAM statistic performs best
Figure 4.11: Expected number of false positives corresponding to estimated FDR values
for the Golub leukemia dataset.
among all the methods. Both the dg and RST statistics perform very similarly, while
the t-statistic is worse than the others.

Table 4.1: Average classification errors (with standard errors in parentheses) for the
leukemia dataset.

k0    SAM             RST             pAUC(0.1)       dg
 5    0.221 (0.113)   0.241 (0.134)   0.266 (0.173)   0.235 (0.129)
10    0.196 (0.126)   0.216 (0.157)   0.238 (0.162)   0.208 (0.123)
15    0.176 (0.118)   0.208 (0.151)   0.232 (0.195)   0.201 (0.131)
20    0.139 (0.124)   0.159 (0.162)   0.162 (0.156)   0.156 (0.137)
25    0.107 (0.119)   0.137 (0.148)   0.141 (0.168)   0.133 (0.127)
30    0.097 (0.134)   0.129 (0.153)   0.133 (0.159)   0.114 (0.137)
It is hard to compare the methods when the truly significant genes are not known. In
this case, the best method should be the one with the lowest classification error. Our
interest in the Golub leukemia dataset is to select important genes and use them to
classify the two types of leukemia. Geman et al. (2004) also combined the test and
training sets of the leukemia dataset to estimate the error by leave-one-out
cross-validation. Here we evaluate the effectiveness of a gene list in forming a
classifier that can predict the class of a test sample. In using classification to
compare the methods, we assume that a better gene list should discriminate between the
groups more effectively. We therefore evaluate the classification performance of the
four methods SAM, RST, pAUC(0.1), and the dg statistic using a simple Gaussian maximum
likelihood discriminant rule with diagonal class covariance matrices. pAUC(0.1)
denotes the partial area under the curve at the 0.1 false positive threshold. A
detailed discussion of the discriminant rule is given in Dudoit et al. (2002). All the
approaches involve splitting the data into training and test samples: the
classification rule is developed on the training sample and its performance is
determined on the test samples. We used 6-fold cross-validation, where the data are
divided into 6 subsets; on each iteration of the cross-validation a different
collection of 4 subsets serves as the training sample and the remaining 2 subsets
serve as the test sample. The performance of the methods is evaluated by the average
misclassification error in the test samples. The top-ranked genes, of a specified
number between 5 and 30, are used to create the classification rule. The means of the
misclassification errors and their corresponding standard errors for the different
methods are summarized in Table 4.1. It appears that the performance of the dg
statistic agrees closely with that of SAM and is better than the RST and pAUC(0.1).
The standard errors also indicate that the dg statistic produces less variable gene
lists, meaning that its use gives more reproducible genes than the RST and pAUC(0.1).
In fact, compared to the other methods, the SAM and dg statistics produce more
consistent results. Therefore the dg statistic can be used effectively on this dataset
for identification of the important genes in microarray data analysis.
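The evaluation scheme just described (rank genes on the training folds only, then
classify held-out arrays with a diagonal-covariance Gaussian discriminant rule) can be
outlined as follows. This is an illustrative Python/NumPy sketch, not the thesis code:
it uses a plain Welch-type score in place of the ranking statistics, holds out one of
the six folds at a time rather than two, and all function names are my own.

```python
import numpy as np

def dlda_cv_error(data, labels, k=20, folds=6, seed=0):
    """Average misclassification error of a diagonal-covariance Gaussian
    (DLDA) rule built on the top-k genes ranked within each training fold."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    errors = []
    for f in range(folds):
        test = order[f::folds]
        train = np.setdiff1d(order, test)
        Xtr, ytr = data[:, train], labels[train]
        g0, g1 = Xtr[:, ytr == 0], Xtr[:, ytr == 1]
        # Rank genes on the training data only (Welch-type score, an assumption)
        score = np.abs(g0.mean(1) - g1.mean(1)) / np.sqrt(
            g0.var(1, ddof=1) / g0.shape[1] + g1.var(1, ddof=1) / g1.shape[1])
        top = np.argsort(-score)[:k]
        # DLDA: class means and pooled per-gene variances on the top-k genes
        mu = [g0[top].mean(1), g1[top].mean(1)]
        n0, n1 = g0.shape[1], g1.shape[1]
        var = ((n0 - 1) * g0[top].var(1, ddof=1)
               + (n1 - 1) * g1[top].var(1, ddof=1)) / (n0 + n1 - 2) + 1e-12
        Xte = data[top][:, test]
        dist = [(((Xte - m[:, None]) ** 2) / var[:, None]).sum(0) for m in mu]
        pred = (dist[1] < dist[0]).astype(int)
        errors.append((pred != labels[test]).mean())
    return float(np.mean(errors))

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 72))          # 500 genes, 72 arrays (as in the merged data)
labels = (np.arange(72) >= 36).astype(int) # two classes
data[:30, labels == 1] += 1.5              # 30 informative genes in the toy data
err = dlda_cv_error(data, labels, k=20)
```

Ranking the genes inside each training fold, rather than once on the full data, avoids
selection bias in the cross-validated error estimate.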
4.8. Discussion and Conclusion
One aim of microarray studies is to provide a list of differentially expressed genes
in a given experimental system. To this end, we have proposed a nonparametric
rank-based approach for ranking genes. We presented a dg statistic, a modified form of
the RST that takes gene-specific variance information into account to produce a list
of significant targets and assess differential gene expression. Both perform extremely
well in ideal situations. For large sample sizes (40 per condition), the dg statistic
gives better results, yielding fewer false positive genes (or more true positives, TP)
than the other methods.
Previous studies suggest that using rank-transformed data in microarray analysis is
advantageous (Raychaudhuri et al. (2000); Tsodikov et al. (2002)). Theory predicts
that rank-based methods will be optimal for extremely noisy data. Heteroscedasticity
is common in gene expression data (Thomas et al. (2001); Craig et al. (2003); Pepe et
al. (2003)). The presence of outliers is also very common in microarray data and may
result in different variances in the two experimental conditions. The current study
shows that when variances differ, the tests, in particular the dg statistic, are
useful for testing for differences between the two conditions. In comparing the RST
and the dg statistic, the version of the dg statistic that allows a different variance
for each gene is likely to give more reliable results in microarray gene expression
analysis.
Without a large number of subjects from each condition, it can be difficult to
identify the underlying distribution from the data alone. In such cases, domain
knowledge and good judgement about the nature of the distribution are required, and in
these circumstances we recommend the dg statistic. Another instance where we recommend
our method is when the gene-specific variances differ. Moreover, the use of the dg
statistic is not computationally intensive. While the proposed statistical approach
does not always outperform all the other methods, it is always comparable and
sometimes superior.
In the analysis of real microarray data, there is no single correct answer as to which
method or statistic should be used, as the choice of statistic can dramatically affect
the set of genes that is selected. A researcher should choose the measure of
differential expression based on the biological system of interest. If changes in
expression relative to the underlying noise are important and samples are large, then
our method is preferable, since it provides useful and robust analytical tools for
gene selection with fewer requirements on the underlying features of the datasets of
interest than most existing methods in the microarray literature.
CHAPTER 5
Nonparametric Method for Detecting
Highly Correlated Differentially
Expressed Genes
Very often biologists are interested in knowing the biological function of a particular
gene. Its true biological function may depend on other genes, and finding other genes in
the same biological pathway may enhance the understanding of its biological
function. We are therefore interested in finding other differentially expressed
genes whose expression values are highly correlated with that of a “seed” gene. We
propose a nonparametric procedure for selecting differentially expressed genes with
expression levels correlated with that of a “seed” gene in microarray experiments. The
proposed test statistic compares two Area Under Receiver Operating Characteristic
Curves (AUC) for gene pairs taking correlation into account. DeLong et al. (1988)
proposed a nonparametric procedure for calculating a consistent variance estimator
of the difference between two AUCs. We use their variance estimation technique for
comparing pairs of genes, and we focus on the correlated gene selection process with
respect to a particular gene of interest. The performance of our method is compared
to the other methods through the use of simulation and real data analysis.
5.1. Introduction
It is important to find a novel and efficient statistical technique that will help identify
those genes whose expressions are correlated such that they may belong to the same
molecular pathway. For grouping similar genes using gene expression data, many
methods have been proposed: cluster analysis (Eisen et al. (1998)), self-organizing
maps (Tamayo et al. (1999)), singular value decomposition (Alter et al. (2000)), support
vector machines (Furey et al. (2000)), and the genetic algorithm/k-nearest neighbor
method (Li et al. (2001)).
Often, the biologist has already acquired some prior knowledge of the molecular
basis of the disease, and therefore knows one or more functional candidates in
the disease pathogenesis in advance. In this scenario, the naive method is simply to
select the genes Gg (g = 1, ..., m; g ≠ s) whose expression levels have the highest Pearson
correlation coefficients with the expression levels of Gs, the known, pre-specified
marker gene. Here we propose an algorithm designed to select a target number of
highly correlated genes sequentially when at least one candidate/marker/differentially
expressed gene is known in advance.
The motivation for this chapter comes from the paper by Ding et al. (2008), who
proposed a gene selection procedure that targets finding other genes whose expression
patterns correlate significantly with a gene of known biological significance.
Ding's method does not address selecting differentially expressed genes when
the known gene is differentially expressed, nor does it take the underlying
conditions (e.g., treatment and control) into account. The problem of searching
for differentially expressed genes correlated with a known candidate gene has not been
studied extensively in the literature. In this chapter we propose a statistical procedure
to solve this problem which makes use of a nonparametric estimator of the AUC
difference of two genes and is based on the concept of estimating the variability with
the method by DeLong et al. (1988). Most currently available methods ignore correlation
among genes when in fact they are correlated: expression profiles of multiple
genes are often correlated and are thus more suitably modeled as mutually dependent
variables. Zhao et al. (2003) showed that lack of independence can lead to a largely
inflated Type I error (false positives). Recently, several methods (Chilingaryan et al.
(2002), Szabo et al. (2002), Lu et al. (2005), Zhou et al. (2007)) have been proposed
that use multivariate statistical techniques in expression data analysis. Here we propose
a test statistic for selecting pairs of highly correlated genes which depends on the
correlation between the expression values of the two genes being compared. We adjust
the test statistic and calculate the variance following the method of DeLong et al. (1988).
5.2. Ding’s Method
Ding et al. (2008) studied the problem of searching genes correlated to a known can-
didate gene or a “seed” gene. They proposed a statistically valid two-stage procedure
for selecting genes with expression levels correlated with that of a “seed” gene in
microarray experiments:
Stage I: perform array-normal-scores (ANS) transformation of the raw microarray
data and calculate the Pearson correlation coefficients using ANS-transformed
data. The ANS transforms the raw microarray data approximately to a Gaus-
sian distribution.
Stage II: pick correlated genes by employing the false discovery rate (FDR) ap-
proach. They proposed calculating a resampling-based least significant false
discovery rate (LS-FDR) to select the most correlated genes based on a given
LS-FDR threshold.
5.2.1. Correlation Test
Let {x_ig, i = 1, ..., n1} and {y_jg, j = 1, ..., n2} denote the expression values for the
n1 control and n2 treatment subjects for the gth gene (g = 1, ..., m). We arrange the
data in a matrix with entries w_gh (g = 1, ..., m; h = 1, ..., n, where n = n1 + n2),
with w_gh = x_hg for h = 1, ..., n1 and w_gh = y_(h−n1),g for h = n1 + 1, ..., n; the
rows of this matrix are the expression profiles of the m genes G1, G2, ..., Gm. We
assume that the expression values are normalized and denote the marker gene by Gs,
so the measurement for Gs on the hth array is w_sh. The question of finding genes Gg
related to Gs can therefore be formulated statistically as a hypothesis test of whether
the Pearson correlation coefficient between W_gh and W_sh is zero, that is, a test of
H0 : ρ_g = 0, where ρ_g = cov(Gg, Gs)/√[var(Gg) var(Gs)]. Based on the matrix
{w_gh}, the correlation coefficient ρ_g is estimated by the sample Pearson correlation
coefficient

    ρ̂_g = Σ_{h=1}^{n} (w_gh − w̄_g)(w_sh − w̄_s) / √[ Σ_{h=1}^{n} (w_gh − w̄_g)² · Σ_{h=1}^{n} (w_sh − w̄_s)² ],

where w̄_g = Σ_{h=1}^{n} w_gh / n.
Genes that are truly correlated with Gs also tend to have higher sample correlations,
so the standard Pearson correlation test of H0 : ρ_g = 0 rejects H0 when | ρ̂_g | is
large. The standard two-sided Pearson correlation test calculates the p-value as

    P_g = 2[ 1 − T_{n−2}( √(n − 2) | ρ̂_g | / √(1 − ρ̂_g²) ) ],

where T_{n−2}(·) is the cdf of the Student's t distribution with n − 2 degrees of freedom
(Larsen and Marx (2001)). The gth gene is declared to be related to Gs
if the corresponding p-value P_g is smaller than a pre-specified Type I error rate α.
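As a minimal illustration of this correlation test, the following Python sketch (function names are ours, not from the thesis) computes ρ̂_g and the t statistic √(n − 2)|ρ̂_g|/√(1 − ρ̂_g²); the two-sided p-value then follows by evaluating the Student's t upper tail with n − 2 degrees of freedom (in practice, e.g., scipy.stats.t.sf; omitted here to keep the sketch dependency-free).

```python
import math

def pearson_r(w_g, w_s):
    """Sample Pearson correlation between the expression profiles of two genes."""
    n = len(w_g)
    mg = sum(w_g) / n
    ms = sum(w_s) / n
    num = sum((a - mg) * (b - ms) for a, b in zip(w_g, w_s))
    den = math.sqrt(sum((a - mg) ** 2 for a in w_g) *
                    sum((b - ms) ** 2 for b in w_s))
    return num / den

def t_statistic(r, n):
    """t = sqrt(n - 2)|r| / sqrt(1 - r^2); p-value = 2[1 - T_{n-2}(t)]."""
    return math.sqrt(n - 2) * abs(r) / math.sqrt(1 - r ** 2)

# toy expression profiles for one gene pair across n = 4 arrays
r = pearson_r([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
t = t_statistic(r, 4)
```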
5.3. Materials and Methods
5.3.1. Comparison of Two ROC Curves
The comparison of two genes can be based on two ROC curves or their summary
measures. Recent methods for assessing the difference in the AUCs in a paired setting
can be found in Braun et al. (2008). DeLong et al. (1988) developed a consistent
nonparametric estimator of the covariance matrix for several AUC estimators taking
pairwise correlation into account. The covariance for the estimators (Ag, As) can be
computed as follows:
1. Compute the treatment and control group components for the gth gene,

       ψ_{gi·} = (1/n2) Σ_{j=1}^{n2} ψ(x_ig, y_jg),    ψ_{g·j} = (1/n1) Σ_{i=1}^{n1} ψ(x_ig, y_jg),

   where ψ(x, y) is the Mann–Whitney kernel: ψ(x, y) = 1 if y > x, 1/2 if y = x, and 0
   if y < x.

2. Compute the treatment and control group components for the sth gene,

       ψ_{si·} = (1/n2) Σ_{j=1}^{n2} ψ(x_is, y_js),    ψ_{s·j} = (1/n1) Σ_{i=1}^{n1} ψ(x_is, y_js).

3. Compute the two terms s_{10}^{g,s} and s_{01}^{g,s}:

       s_{10}^{g,s} = [1/(n1 − 1)] Σ_{i=1}^{n1} [ψ_{gi·} − ψ_{g··}][ψ_{si·} − ψ_{s··}],

       s_{01}^{g,s} = [1/(n2 − 1)] Σ_{j=1}^{n2} [ψ_{g·j} − ψ_{g··}][ψ_{s·j} − ψ_{s··}],

   where ψ_{g··} is the grand mean of the kernel values for gene g, which equals the
   AUC estimate Ag.

4. A consistent estimator of the covariance is then

       cov(Ag, As) = s_{10}^{g,s}/n1 + s_{01}^{g,s}/n2.
Since the nonparametric estimator of the AUC is unbiased, the expectation of the
AUC difference is 0 when the two AUCs are equal. Hence, under the assumption
of asymptotic normality of the AUC-statistic and the additional assumption of ex-
changeability of within gene rank-ratings, we can construct a test statistic for testing
no difference between two genes in terms of discrimination.
    D = (Ag − As) / SE(Ag − As),

where

    SE(Ag − As) = √[ Var(Ag) + Var(As) − 2 cov(Ag, As) ].
For each pair of genes, the statistic D is computed to determine the closest gene
to a candidate gene or marker gene in terms of the values of AUC. Suppose the
candidate gene is Gs and we want to compare Gs with every other gene Gg,
g = 1, ..., m; g ≠ s; we then obtain m − 1 standard errors, i.e., SE(As − Ag), g =
1, ..., m; g ≠ s. The intent of using this test statistic is to incorporate pairwise
correlation between genes individually, so that the AUC difference for each pair of
genes in a microarray experiment can have its own unique variance. This may be a
consequence of biological or technical factors, but it is well known that variances differ
across genes. To derive stable pairwise gene-specific estimates of the variance of the
difference, we can borrow information across genes by shrinking the variance of the
difference estimators. We therefore propose adding a constant a0 to stabilize
the denominator of D. Here we follow Efron et al. (2001) and take a0 as the 90th
percentile of the values SE(Ag − As):

    D(adj) = (Ag − As) / [ SE(Ag − As) + a0 ].
The D(adj) statistic possesses a prominent statistical property for microarray data
analysis: it takes pairwise correlation into account and can find genes whose
differential expression is not marginally detectable by single-gene testing methods.
Here we also propose a search algorithm that sequentially identifies genes by examining
the distance between the AUCs of two compared genes. The lowest values of the D(adj)
statistic indicate the greatest similarity in terms of discrimination between a pair of
genes.
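Steps 1–4 and the adjusted statistic can be sketched in Python as follows; the kernel ψ and all function names are our illustrative choices, not code from the thesis, and a0 defaults to 0 here rather than the 90th percentile over all gene pairs.

```python
# Sketch of the DeLong et al. (1988) covariance and the D(adj) statistic for one
# gene pair; x_* are control values, y_* are treatment values for a gene.

def psi(x, y):
    """Mann-Whitney kernel: 1 if y > x, 1/2 for ties, 0 otherwise."""
    return 1.0 if y > x else (0.5 if y == x else 0.0)

def auc_components(x, y):
    """Row means psi_{i.}, column means psi_{.j}, and the AUC (their grand mean)."""
    n1, n2 = len(x), len(y)
    v10 = [sum(psi(xi, yj) for yj in y) / n2 for xi in x]
    v01 = [sum(psi(xi, yj) for xi in x) / n1 for yj in y]
    return v10, v01, sum(v10) / n1

def delong_cov(v10_g, v01_g, v10_s, v01_s):
    """Consistent covariance estimate of (A_g, A_s): s10/n1 + s01/n2."""
    n1, n2 = len(v10_g), len(v01_g)
    a_g, a_s = sum(v10_g) / n1, sum(v10_s) / n1  # grand means = AUC estimates
    s10 = sum((u - a_g) * (v - a_s) for u, v in zip(v10_g, v10_s)) / (n1 - 1)
    s01 = sum((u - a_g) * (v - a_s) for u, v in zip(v01_g, v01_s)) / (n2 - 1)
    return s10 / n1 + s01 / n2

def d_adj(x_g, y_g, x_s, y_s, a0=0.0):
    """(A_g - A_s) / (SE(A_g - A_s) + a0), with SE from the DeLong covariance."""
    v10_g, v01_g, a_g = auc_components(x_g, y_g)
    v10_s, v01_s, a_s = auc_components(x_s, y_s)
    var_g = delong_cov(v10_g, v01_g, v10_g, v01_g)
    var_s = delong_cov(v10_s, v01_s, v10_s, v01_s)
    cov_gs = delong_cov(v10_g, v01_g, v10_s, v01_s)
    se = max(var_g + var_s - 2.0 * cov_gs, 0.0) ** 0.5  # guard tiny negatives
    return (a_g - a_s) / (se + a0)
```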
5.3.2. Permuted P-values and FDR estimation with D(adj)
statistic
The False Discovery Rate (FDR) can be calculated following the same method described
in the previous chapter. Here we describe FDR estimation based on
estimated p-values. The following steps are used to calculate the FDR
by resampling the data:

1. For the original data, calculate the D(adj) statistic between the gth gene and the
   marker gene for every g, and denote the kth ordered value as D(adj)_k.

2. Permute the gene expression levels of the marker gene, w_sh = (x_is, y_js); i =
   1, ..., n1; j = 1, ..., n2, across microarrays. That is, we create a permuted
   data set w_gh^(1) (g = 1, ..., m; h = 1, ..., n), where w_gh^(1) = w_gh for g ≠ s and
   (w_s1^(1), ..., w_sn^(1)) is a random permutation of (w_s1, ..., w_sn). The first n1
   observations are then treated as the control group and the next n2 observations as
   the treatment group. This permuted data set preserves the dependence
   structure of the genes except for Gs.

3. For the new permuted data set, calculate the statistics D(adj)_g, g = 1, ..., m − 1.

4. Repeat this resampling procedure B times, recalculating D(adj)_g^(b) (b =
   1, ..., B; g = 1, ..., m − 1) from each resampled, permuted data set.

5. The p-value of the correlation between the gth gene and the marker gene
   can then be estimated by

       p̂_g = #{ (b, k) : | D(adj)_k^(b) | ≥ | D(adj)_g | } / [ B(m − 1) ].

6. The Benjamini and Hochberg (1995) formula can now be applied to the estimated
   p-values to obtain the resampling-based FDR. Genes are selected based
   on lower values of FDR.
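The resampling steps above can be sketched as follows, with the D(adj) computation abstracted away: `permutation_pvalues` implements step 5 and `bh_fdr` the standard Benjamini-Hochberg step-up adjustment (function names are ours, not from the thesis).

```python
def permutation_pvalues(stats_obs, stats_perm):
    """Step 5: p_g = #{(b,k): |D_perm| >= |D_obs,g|} / (B(m-1)).

    stats_obs:  list of m-1 observed D(adj) values.
    stats_perm: list of B lists, each holding m-1 permuted D(adj) values.
    """
    B, m1 = len(stats_perm), len(stats_obs)
    flat = [abs(v) for row in stats_perm for v in row]
    return [sum(1 for v in flat if v >= abs(d)) / (B * m1) for d in stats_obs]

def bh_fdr(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (q-values)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        q[i] = prev
    return q

# toy example: 2 genes, B = 2 permutations
p = permutation_pvalues([2.0, 0.4], [[1.0, -1.5], [0.5, 3.0]])
q = bh_fdr(p)
```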
5.3.3. Simulation
We evaluate the performance of the D(adj) statistic using simulated data. Here
we consider SAM, RST and the empirical Bayes moderated t-statistic (Mod-t) for
comparison because these three methods are widely used in the microarray gene
selection literature. The simulated data are generated following Allison et al. (2002),
whose approach allows us to incorporate correlation into the simulated microarray
data. The following steps are used in simulating the data:
1. We focus on two conditions, treatment versus control. We consider sample sizes
of 10, 20 and 40 per condition.
2. We generate the datasets comprising 1000 genes.
3. The data for the n1 + n2 samples are multivariate normal and generated
   independently:

       X ∼ N_1000(µ, Σ).

4. µ is a constant vector of length 1000 with all entries equal to 0.

5. Σ = σ²(I_10 ⊗ B), where B = ρ 1_100 1′_100 + (1 − ρ) I_100, 1_100 = (1, 1, ..., 1)′
   is the vector of ones of length 100, I_10 is the 10 × 10 identity matrix, and ⊗ is
   the Kronecker product. We therefore have 10 independent groups of genes, each
   group consisting of 100 equicorrelated genes.
6. The common variance is set at σ2 = 1.
7. We varied ρ over three values: 0.4 (weak dependence), 0.6 (moderate dependence)
   and 0.8 (strong dependence). Figure 5.1 presents a correlation plot for
   a simulated dataset when the correlation coefficient within a group of genes is 0.8.
   Increasingly positive correlations are represented by reds of increasing intensity, and
   increasingly negative correlations by greens of increasing intensity.
Figure 5.1: Correlation plot for a simulated dataset when the correlation coefficient within a group of genes is 0.8
8. We consider the first gene (G1) to be the marker gene. The 99 other genes in its
   group (G2, ..., G100) are therefore correlated with the marker gene.

9. We set the treatment effect, d, at 1 and added it to the first 20% of the genes,
   which provides only a location shift between the two conditions for those genes.
   Our interest is in how many of the first 200 genes are selected by the
   methods under different correlations.

10. We performed 500 simulations and calculated the average number of genes
    correctly identified among the top 200 ranked genes.
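The block covariance in step 5 can be constructed explicitly. The following Python sketch builds the equicorrelation matrix B and Σ = σ²(I_k ⊗ B); it uses 3 blocks of 4 genes instead of the thesis's 10 blocks of 100, and all names are our illustrative choices.

```python
def equicorrelation(m, rho):
    """B = rho * 1 1' + (1 - rho) * I: m x m with unit diagonal, off-diagonal rho."""
    return [[1.0 if i == j else rho for j in range(m)] for i in range(m)]

def kron_identity(k, B, sigma2=1.0):
    """Sigma = sigma^2 (I_k kron B): k independent blocks of len(B) correlated genes."""
    m = len(B)
    n = k * m
    S = [[0.0] * n for _ in range(n)]
    for b in range(k):                      # copy B onto the b-th diagonal block
        for i in range(m):
            for j in range(m):
                S[b * m + i][b * m + j] = sigma2 * B[i][j]
    return S

# miniature version of the thesis setup: 3 blocks of 4 genes, rho = 0.8
Sigma = kron_identity(3, equicorrelation(4, 0.8))
```

Sampling X ∼ N(0, Σ) from this block structure is then straightforward with any multivariate-normal generator (e.g. numpy.random.multivariate_normal).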
Table 5.1: Average number of genes truly DE (correlated DE genes) out of the top ranked 200 genes after 500 simulations.

 ρ    Sample size      SAM            RST            Mod-t          D(adj)
 0.4  n1 = n2 = 10     39.58 (19.37)  37.94 (19.65)  39.25 (19.21)  45.21 (25.34)
      n1 = n2 = 20     40.70 (20.26)  38.02 (19.67)  40.38 (20.10)  44.79 (25.38)
      n1 = n2 = 40     41.85 (20.31)  39.76 (19.55)  41.97 (20.26)  45.70 (27.23)
 0.6  n1 = n2 = 10     43.56 (23.54)  44.69 (22.63)  43.82 (23.60)  46.62 (28.24)
      n1 = n2 = 20     41.10 (19.92)  39.42 (21.97)  41.11 (19.83)  46.11 (27.50)
      n1 = n2 = 40     41.25 (19.48)  38.89 (18.48)  41.16 (19.49)  47.71 (28.63)
 0.8  n1 = n2 = 10     44.64 (23.37)  43.43 (24.00)  44.79 (23.38)  52.91 (35.01)
      n1 = n2 = 20     42.77 (19.04)  37.21 (21.63)  42.44 (19.16)  53.95 (35.38)
      n1 = n2 = 40     37.48 (21.06)  32.42 (16.51)  39.61 (22.14)  53.30 (37.01)
Table 5.1 shows the average number of genes and the average number of correlated
genes correctly identified from the set of 200 top ranked genes by each of
the methods. For example, when applying the D(adj) statistic to the simulated dataset
with sample size 20 per condition and correlation 0.8, we observe that, on average, 53.95
genes are correctly identified from the top 200 ranked genes, and among them 35.38 genes are
correlated with the marker gene G1. In contrast, SAM, RST and the moderated-t
method produced, on average, 42.77 (19.04), 37.21 (21.63) and 42.44 (19.16) genes,
respectively.
When we consider the performance of the methods in terms of correlation, it is
easy to declare a “winner” from the results in Table 5.1, because there is a general
trend indicating that the D(adj) statistic gives noticeably better answers across all three
correlations. This encourages the use of the D(adj) statistic to recover the pathway of
a seed gene showing differential expression. Overall, the D(adj) statistic identifies more
differentially expressed genes than any of the other methods and, among the differentially
expressed genes, it identifies the most correlated genes.
5.3.4. Application: Colon Cancer Data
A detailed evaluation of selection methods on real biological data is challenging due to
the difficulty of defining a gold standard. Here we have evaluated and applied all the
methods to a publicly available dataset, the Alon et al. (1999) colon cancer data.
In this experiment, 62 samples from colon-cancer patients were analyzed with an
Affymetrix Hum6000 oligonucleotide array: 40 samples are from tumors (labelled
“negative”) and 22 are from normal (labelled “positive”) biopsies taken from healthy
parts of the colons of the same patients. The dataset is available from the R package
colonCA. Figure 5.2 presents the correlation plot of the first 50 differentially expressed
genes identified by SAM. It is seen from the figure that a number of the differentially
expressed genes are pairwise correlated. Considering the most differentially expressed
gene identified by SAM, Hsa.8147, we find that 22 other genes are moderately to highly
correlated (absolute value of the correlation coefficient ≥ 0.6) with the gene Hsa.8147.
Table 5.2 presents the top ranked 20 genes by the RST, SAM, Moderated t and
D(adj) statistic. We considered the seed gene as Hsa.8147. The results indicate that
Figure 5.2: Correlation plot for the first 50 differentially expressed genes defined by SAM from the colon cancer data
15 of the genes found by the D(adj) statistic are concordant with the RST method,
while 7 and 14 genes are concordant with SAM and the Mod-t statistic,
respectively. The D(adj) statistic therefore shares the most concordant genes with the
RST method for this dataset.
Table 5.2: Top ranked 20 genes by different methods

 rank  RST         SAM (7)     Mod-t (14)  D(adj) (15)
  1    Hsa.37937   Hsa.8147    Hsa.8147    Hsa.8147
  2    Hsa.6814    Hsa.692.2   Hsa.692.2   Hsa.36952
  3    Hsa.549     Hsa.692     Hsa.37937   Hsa.462
  4    Hsa.831     Hsa.1131    Hsa.1832    Hsa.3306
  5    Hsa.627     Hsa.692.1   Hsa.692     Hsa.3331
  6    Hsa.773     Hsa.1832    Hsa.692.1   Hsa.1832
  7    Hsa.2928    Hsa.37937   Hsa.36689   Hsa.821
  8    Hsa.601     Hsa.4689    Hsa.1131    Hsa.692.2
  9    Hsa.3306    Hsa.8125    Hsa.2456    Hsa.601
 10    Hsa.36689   Hsa.957     Hsa.8125    Hsa.36689
 11    Hsa.1832    Hsa.8068    Hsa.6814    Hsa.2097
 12    Hsa.462     Hsa.5398    Hsa.36952   Hsa.2928
 13    Hsa.36952   Hsa.1221    Hsa.601     Hsa.3016
 14    Hsa.8147    Hsa.10755   Hsa.773     Hsa.5971
 15    Hsa.692.2   Hsa.1130    Hsa.2928    Hsa.957
 16    Hsa.2097    Hsa.831     Hsa.957     Hsa.773
 17    Hsa.3331    Hsa.36952   Hsa.2344    Hsa.1047
 18    Hsa.821     Hsa.43279   Hsa.3306    Hsa.692
 19    Hsa.692     Hsa.878     Hsa.831     Hsa.1205
 20    Hsa.3016    Hsa.5444    Hsa.2097    Hsa.2645
Different gene selection algorithms can potentially select different relevant genes
and lead to different classification accuracies. The misclassification errors were found
by choosing 20 to 70 top ranked genes according to the statistical values of different
methods. We evaluate the classification performance using a simple Gaussian max-
imum likelihood discriminant rule, for diagonal class covariance matrices. A 5 fold
cross-validation used to get the classification errors. The form of the algorithm is as
follows:
1. Randomly divide the samples into 5 partitions (subsamples).
2. Of the 5 partitions, a single subsample is retained as the validation data for
testing the model, and the remaining 4 subsamples are used as training data.
3. The cross-validation process is then repeated 5 times (the folds), with each of
the 5 subsamples used exactly once as the validation data.
Figure 5.3: Average misclassification error rate for the colon cancer dataset shown against the number of top ranked genes (RST, SAM, Mod-t, D(adj))
4. Record the number of top ranked genes from the training data and use these
genes for classification with a simple Gaussian maximum likelihood discriminant
rule, for diagonal class covariance matrices.
5. The error for a given classification relative to the known truth is then calculated by
the classError function of the R package mclust.
6. Report the average error over all 5 test sets.
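Steps 1–6 can be sketched with a hand-rolled diagonal Gaussian classifier standing in for the diagonal maximum likelihood discriminant rule; we do not reproduce the gene-ranking step or mclust's classError here, and all function names are our illustrative choices.

```python
import math
import random

def fit_diag_gaussian(X, y):
    """Per-class feature means and variances (diagonal class covariances)."""
    params = {}
    for c in set(y):
        Xc = [x for x, lab in zip(X, y) if lab == c]
        cols = list(zip(*Xc))
        means = [sum(col) / len(Xc) for col in cols]
        vars_ = [max(sum((v - mu) ** 2 for v in col) / len(Xc), 1e-6)
                 for col, mu in zip(cols, means)]
        params[c] = (means, vars_)
    return params

def predict(params, x):
    """Assign the class with the highest diagonal-Gaussian log-likelihood."""
    best, best_ll = None, -float("inf")
    for c, (means, vars_) in params.items():
        ll = sum(-0.5 * math.log(2 * math.pi * v) - (xi - mu) ** 2 / (2 * v)
                 for xi, mu, v in zip(x, means, vars_))
        if ll > best_ll:
            best, best_ll = c, ll
    return best

def cv_error(X, y, k=5, seed=0):
    """Average misclassification error over k cross-validation folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]        # each sample used once as test
    errors = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        p = fit_diag_gaussian([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(p, X[i]) != y[i] for i in fold)
        errors.append(wrong / len(fold))
    return sum(errors) / k
```

In the actual analysis, the features passed to the classifier would be the expression values of the top ranked genes selected from the training folds.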
The average misclassification errors for the different methods are plotted in Figure 5.3.
It appears from the figure that D(adj) provides better results than the RST
method and comparable results with the SAM and moderated-t statistics. In particular,
the proposed D(adj) statistic performs best when a small number of genes is of interest.
Therefore D(adj) can be applied effectively to this dataset for identification of the
important genes in microarray data analysis.
5.3.5. Effect of Seed Gene: Affymetrix spike-in study
The Affymetrix spike-in study contains 12,626 genes, 12 replicates in each group,
and 16 known differentially expressed genes (Cope et al. (2004)). The dataset is
taken from the Bioconductor package DEDS. The correlation plot with the 16 known
differentially expressed genes has been presented in the Figure 5.4. It is apparent from
the figure that all the 16 truly expressed genes are are highly correlated. Now our
interest is to see the seed gene effect on the D(adj) statistic. We have selected 4 genes
which are considered as seed gene: 684 at, 32660 at, 1552 i at and 1032 at which are
ranked as 1st, 10th, 15th and 20th according to the FDR values of empirical Bayes
moderated-t statistic. Among the seed genes first two are truly differentially expressed
genes and next two are not truly DE. The selection of first 20 genes according to the
permuted p-values when considering different seed gene have been given in the Table
5.3. Xrepresents truly differentially expressed genes. It is seen from the table that
considering 684 at and 32660 at as a seed gene the D(adj) statistic can identify 15
truly expressed genes from the 20 top ranked genes. On the other hand considering
1032 at as a seed gene the proposed statistic can identify only 2 truly expressed genes
from the 20 top ranked genes. Therefore detecting true changes in individual gene
expressions is highly affected by choosing a suitable seed gene in the use of D(adj)
statistic. It is suggested that gene expressions might be altered in related groups
defined by pathways, functions or localizations rather than individually (Segal et al.
(2004)). In such case, genes with distinguished expression changes could be detected,
but many other genes showing correlated but weak changes may be easily missed.
Therefore we are recommending our method in cases when a truly expressed genes is
known in advance from an experiment.
Figure 5.4: Correlation plot for the Affymetrix spike-in data
5.4. Discussion and Conclusion
We introduce a new gene selection procedure that takes correlation into account
given a known “seed” gene. Among efforts to rank genes, the proposed
D(adj) statistic gives satisfactory results.
Most methods in microarray studies examine one gene at a time, rank
genes according to their discriminating ability, and select only the high-ranking genes
for further study. Some information may therefore be lost by not considering genes
jointly; univariate selection can be an inadequate approach from both statistical
and biological points of view. The theory behind the D(adj) statistic is easily
understood, and its results have been shown to be more biologically relevant than
those of other methods. The relative benefit of a paired gene analysis compared to a
single gene analysis depends on the correlation between the observations
for the two genes being compared. In general, statistical correlations can be a hint
Table 5.3: Top ranked 20 genes by different seed genes (an X marks a truly differentially expressed gene)

 rank  684_at      32660_at    1552_i_at   1032_at
  1    1024_at X   1024_at X   1552_i_at   1032_at
  2    1091_at X   1091_at X   38254_at    34249_at
  3    32660_at    32660_at    39058_at X  39000_at
  4    33818_at X  33818_at X  32115_r_at  AFFX-YEL021w/URA3_at
  5    36085_at X  36085_at X  684_at X    39446_s_at
  6    36202_at X  36202_at X  1024_at X   966_at
  7    36311_at X  36311_at X  1091_at X   35939_s_at
  8    36889_at X  36889_at X  32660_at    37492_at
  9    37777_at X  37777_at X  33818_at X  32955_at
 10    38502_at    38502_at    36085_at X  41184_s_at
 11    38734_at X  38734_at X  36202_at X  37420_i_at
 12    40322_at X  40322_at X  36311_at X  1708_at X
 13    407_at X    407_at X    36889_at X  38406_f_at
 14    546_at X    546_at X    37777_at X  40276_at
 15    684_at X    684_at X    38502_at    41764_at
 16    1708_at X   38734_at X  31412_at
 17    39058_at X  1552_i_at   40322_at X  32650_at
 18    38254_at    39058_at X  407_at X    1253_at
 19    1552_i_at   32115_r_at  546_at X    35359_at
 20    37492_at    35339_at    35339_at    546_at X
for the fact that two genes belong to the same pathway, so that we expect a high
statistical correlation of expression values to have a meaningful biological interpretation.
The D(adj) statistic therefore provides a simple yet powerful tool for detecting DE
genes between the two experimental conditions.
If the “seed” gene is not known in advance, then the first ranked gene plays an
important role in the proposed method; that is, the method depends heavily on the
first ranked gene identified by a univariate gene approach such as SAM. If the “seed”
gene is not correctly identified, the remaining significant genes in the pathway may
be hard to detect correctly. We therefore suggest verifying that the first ranked gene
is a biologically relevant DE gene that allows clear separation of the two conditions.
It is also possible to use the second ranked gene (or third, and so on) as the “seed”
gene in the proposed method, provided it is identified as differentially expressed under
the two experimental conditions. The D(adj) statistic performs better when a small
number of genes is biologically interesting. The method offers a useful approach for
extracting biological information from expression data, and it can show its full
analytical power when the seed gene is identified correctly, from which many biological
processes can be determined.
It is not expected that a single method will produce the best result under all
simulation scenarios. However, whichever method produced the best estimate in a
particular case, the proposed nonparametric method usually came close to it. The
D(adj) statistic exhibits results comparable with SAM and the moderated-t in most
simulation scenarios, but it finds more highly correlated genes than any of the other
methods. We therefore propose employing the D(adj) statistic to pick correlated
genes with higher confidence.
In the field of microarray data analysis, there is no appropriate answer as to which
method or which statistic should be used, as the choice of statistic can dramatically
affect the set of genes that is selected. One should choose the measure of differential
expression based on the biological system of interest. If our interest is to find differ-
entially expressed genes correlated with a “seed” gene, then our proposed method is
preferable since it provides useful analytical tools taking correlation into account.
CHAPTER 6
Conclusion
6.1. Thesis Summary
Prognostic and predictive factors are indispensable tools in the treatment of patients.
Traditional methods of characterization are often limited and do not have the abil-
ity to discern subtle differences that may be of importance for developing a better
understanding of the disease and advancing therapeutic strategies for the treatment
of disease. Gene expression assays have the potential to supplement a few distinct
genes with data from many thousands of genes. We have developed three statistical
techniques that provide predictive capability based on gene expression data derived
from DNA microarray analysis. We have taken both parametric and nonparametric
approaches in creating statistical methods that provide good probabilistic prediction
and classification of two or more conditions based on gene expression data.
The noisy nature of gene expression data has motivated the development of nu-
merous algorithms for identifying differentially expressed genes. Most available para-
metric methods in the study of microarray data analysis rely on a normal distribution
assumption and are based on a Wald statistic. These methods are inefficient, espe-
cially when expression levels follow a skewed distribution. To deal with violations of
the normality assumption, we propose an approximate likelihood ratio test (ALRT)
assuming Generalized Logistic Distribution of Type II (GLDII). The strength of using
GLDII is that the distribution has a shape parameter which will help to consider both
the symmetric and asymmetric distributions for the data. Our simulation and data
analysis suggest that the t-statistic, SAM or empirical Bayes t can perform poorly
if there is in fact significant variation between gene variances, which
is expected in real data. Our ALRT(G) method for the GLDII is found to perform
comparably well at all settings of treatment effect, variability, sample size and noise
effect. Our study shows that this method performs well with small numbers of
samples and with noisy datasets. An added advantage of the ALRT method is that
it can handle multiclass microarray datasets.
We borrow the idea of the receiver operating characteristic (ROC) from clinical
biostatistics and demonstrate its application to microarray analysis. We propose a
nonparametric method in an AUC setting that is based on a novel and model-free
estimate of the variance vector across genes. The contribution to this “single-gene”
statistic is the studentization of the empirical AUC which takes into account the
variances associated with each gene in each experiment. The method suggests how
to use the variance information to produce a list of significant targets and assess
differential gene expressions under two experimental conditions. It is also possible to
obtain p-values using this method. The essence of this method is to take advantage
of using the variance of empirical AUC for each gene to achieve the goal of selecting
important genes. Given the high noise effect, together with small sample sizes, the
proposed method outperformed pAUC and standard ROC methods for detecting
differential gene expression. This underlines the importance of incorporating and
modeling the variance structure of microarray data during the development of future
statistical tests. The added advantage of the proposed method is that it doesn’t
involve complicated formulas and so it is computationally easy to understand.
Another important nonparametric technique suggested in this study provides a way
of taking into account pairwise correlation with a “seed” gene. The statistic in this
method compares a pair of genes and selects correlated differentially expressed genes
sequentially; the method can be used effectively for correlated gene selection. The
simulation study shows that the nonparametric method performed
well with datasets that had low noise effects and large sample size. Using simulated
data sets and real microarray data sets, we showed that this novel algorithm is compa-
rable with other gene selection procedures. In particular, the D(adj) statistic exhibits
a significantly greater power of detecting correlated genes.
The study aims to improve the statistical analysis procedure for identifying differ-
entially expressed genes effectively. The ALRT method with underlying distribution
of GLDII assumption gives flexibility by considering shape of the distribution. Using
a simulation study and real datasets, we show that this parametric method outper-
forms other gene selection procedures especially for large sample sizes. The proposed
nonparametric methods do not involve complicated formulas. Both nonparametric
methods can identify a comparable fraction of truly differentially expressed genes,
particularly when sample sizes are large or outliers are present. The method that
examines statistical correlations can suggest that two genes belong to the same
pathway, since a high statistical correlation of expression values is expected to have
a meaningful biological interpretation. This is an important benefit of the method
and shows its potential for identifying gene pathways underlying an observed
phenotype. A key point is the capacity to identify not just highly expressed genes
but genes whose expression correlates strongly with the phenotype, thus providing
additional insight into the underlying biological pathways.
In practice, choosing a gene selection algorithm for selecting a list of differentially
expressed genes usually depends on characteristics of the expression data such as
sample size, treatment effects, variability, and correlation structure. To identify
important genes one may need to select a number of different gene subsets that can all
solve the classification problem with similar high accuracy. For this purpose, we feel
that different selection algorithms should also be employed to render a comprehensive
exploration of the useful genes. In this case, our methods can be used together with
many other approaches. In the context of machine learning, this approach is referred
to as ensembling, which has also been used in gene selection problems. Therefore,
our work can be viewed as providing new choices for building an ensemble system.
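As a toy illustration of the ensemble idea, one simple aggregation scheme converts each method's statistic into per-gene ranks and averages them; the two score vectors below are invented stand-ins for, say, t-type and SAM-type statistics.

```r
# Rank aggregation across several gene-selection statistics: a toy sketch.
# stats: list of per-gene score vectors (larger |score| = more significant).
ensemble.rank <- function(stats) {
  ranks <- sapply(stats, function(s) rank(-abs(s)))   # rank 1 = top gene
  order(rowMeans(ranks))                              # best-first gene indices
}

s1 <- c(0.2, 3.1, 0.5, 2.8)   # e.g. t-type statistics (invented)
s2 <- c(0.1, 2.5, 0.9, 3.0)   # e.g. SAM-type statistics (invented)
top <- ensemble.rank(list(s1, s2))   # genes 2 and 4 come out on top
```

In practice each score vector would come from a different selection procedure, such as the ALRT, SAM, and AUC-based statistics discussed above.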
6.2. Future Work
6.2.1. Improving the ALRT method
The ALRT method can be adjusted by borrowing information from all the genes. It
may be possible to get more robust results from the ALRT method by borrowing
strength from genes in local intensity regions for estimation of shape and scale pa-
rameters of GLDII. For example, we can use Wald test for comparing two location
parameters of GLDII and smooth the variance by adding an offset in the denomina-
tor similar to what is done in the SAM statistic. However this approach is based on
specific data modeling assumptions. We can also make the procedure nonparametric
by estimating p-values from permuted datasets (random column permutations).
With the proposed approach, the test statistic becomes

    d = (µ1 − µ2) / (SE(µ1 − µ2) + s0),

where µ1 and µ2 are the approximate location parameters of the GLDII for the
treatment and control groups, respectively, and SE(·) is the standard error of the
difference between the two approximate location parameters. Following the approach
of Efron et al. (2001), we can take the offset s0 to be the 90th percentile of the
standard errors of the location differences between the two conditions.
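A small R sketch of this proposal follows. For illustration only, ordinary group means and a pooled per-gene standard error stand in for the approximate GLDII location estimates and SE(µ1 − µ2); those stand-ins, and the function names, are assumptions of the sketch rather than the thesis's estimators.

```r
# Moderated statistic d = (mean1 - mean2) / (SE + s0) with permutation p-values.
# X: genes-by-samples matrix; L: 0/1 vector of condition labels.
mod.stat <- function(X, L) {
  G1 <- X[, L == 0, drop = FALSE]; G2 <- X[, L == 1, drop = FALSE]
  d  <- rowMeans(G1) - rowMeans(G2)
  se <- sqrt(apply(G1, 1, var) / ncol(G1) + apply(G2, 1, var) / ncol(G2))
  s0 <- quantile(se, 0.9)          # offset: 90th percentile of the SEs
  d / (se + s0)
}

perm.pvalues <- function(X, L, B = 200) {
  obs <- mod.stat(X, L)
  exceed <- numeric(nrow(X))
  for (b in 1:B) {                  # random column (label) permutations
    perm <- mod.stat(X, sample(L))
    exceed <- exceed + (abs(perm) >= abs(obs))
  }
  (exceed + 1) / (B + 1)            # add-one permutation p-value
}

set.seed(3)
X <- matrix(rnorm(50 * 10), nrow = 50)
L <- rep(0:1, each = 5)
X[1, L == 1] <- X[1, L == 1] + 3    # one truly shifted gene
p <- perm.pvalues(X, L, B = 100)    # p[1] should be small
```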
6.2.2. Possible extension for D(adj) statistic
We can develop a score test for testing whether or not some pre-specified groups of
genes are differentially expressed by the D(adj) statistic. The groups of genes can
be those that are involved in a particular biochemical pathway or a genomic region
of interest, and should be specified before testing. This method would be valuable
for testing whether known pathways, in combination with groups of genes, affect a
clinical outcome. We can cluster datasets using the performance of genes and assess
their ability to separate groups of interest through the D(adj) statistic. It is also of
interest to see how a further candidate gene, one potentially less correlated with the
first, can contribute new information in microarray analysis.
Bibliography
Affymetrix Inc. (2002). Statistical Algorithms Description Document, http://www.
affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf.
Allison et al.(2002). A mixture model approach for the analysis of microarray gene
expression data, Computational Statistics and Data Analysis, 39:1, 1-20.
Alon et al.(1999). Broad patterns of gene expression revealed by clustering analysis of
tumor and normal colon tissue probed by oligonucleotide arrays, Proc. Natl. Acad.
Sci. USA, 96: 6745-6750.
Alter, O., Brown, P. O. and Botstein, D. (2000). Singular value decomposition for
genome-wide expression data processing and modeling, Proc. Natl. Acad. Sci.,
97:10101-10106.
Arvesen, J.N. (1969). Jackknifing U-statistics, Annals of Mathematical Statistics,
40(6): 2076-2100.
Balakrishnan, N. and Hossain, A. (2007). Inference for the Type II generalized logistic
distribution under progressive Type II censoring, Journal of Statistical Computa-
tion and Simulation, 77, 12:1013-1031.
Balakrishnan, N. and Leung, M.Y., (1988). Order statistics from the Type I general-
ized logistic distribution, Communications in Statistics Simulation and Computa-
tion, 17:25-50.
Baldi, P. and A. D. Long (2001). A Bayesian framework for the analysis of microar-
ray expression data: regularized t-test and statistical inferences of gene changes,
Bioinformatics, 17:509-519.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a
practical and powerful approach to multiple testing, J. R. Statistical Society B,
57:289-300.
Bhowmick, D., Davison, A.C., Goldstein, D. R., and Ruffieux, Y. (2006). A Laplace
mixture model for identification of differential expression in microarray experi-
ments, Biostatistics, 7(4):630-641.
Bokka, S. and Mathur, S.K. (2006). A nonparametric likelihood ratio test to iden-
tify differentially expressed genes from microarray data, Applied Bioinformatics,
5(4):267-276.
Braun, T.M. and Alonzo, T.A. (2008). A modified sign test for comparing paired
ROC curves, Biostatistics, 9(2):364-372.
Breiman, L., Friedman, J. H., Olsen, R. A., and Stone, C. J. (1984). Classification
and Regression Trees. Wadsworth, Monterey, CA.
Breitling, R. and Herzyk, P.(2005). Rank-based methods as a non-parametric alterna-
tive of the T-statistic for the analysis of biological microarray data, J. Bioinform.
Comput. Biol., 3(5):1171-89.
Broberg, P. (2003). Statistical methods for ranking differentially expressed genes,
Genome Biology, 4:41.
Brown, P.O. and Botstein, D. (1999). Exploring the new world of the genome with
DNA microarrays, Nat Genet., 21:33-37.
Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T.
S., Ares Jr, M. and Haussler, D. (2000). Knowledge-based analysis of microarray
gene expression data by using support vector machines, Proceedings of the National
Academy of Sciences, 97:262- 267.
Chen, Y., Dougherty, E.R. and Bittner, M.L. (1997). Ratio-based decisions and the
quantitative analysis of cDNA microarray images, J. Biomed. Optics, 2: 364-374.
Chilingaryan, A. et al. (2002). Multivariate approach for selecting sets of differentially
expressed genes, Math. Biosci., 176:59-69.
Chu, G., Narasimhan, B., Tibshirani, R. and Tusher, V. (2002). SAM, significance
analysis of microarrays, users guide and technical document, Technical report, Stan-
ford University, http://www-stat.stanford.edu/tibs/SAM.
Cope, L. M., Irizarray, R.A., Jaffee, H., Wu, Z. and Speed T.P. (2004). A benchmark
for Affymetrix GeneChip expression measures, Bioinformatics, 20:323-331.
Cox, D. R. and Hinkley, D. V. (1974).Theoretical Statistics, Chapman and Hall, (page
92).
Craig, B.A., Black, M.A. and Doerge, R.W. (2003). Gene expression data: the tech-
nology and statistical analysis, Journal of Agricultural, Biological, and Environ-
mental Statistics, 8:1-28.
Cui, X., Hwang, J.T.G., Jing, Q. , Natalie J. Blades, and Churchill, G. A. (2005).
Improved statistical tests for differential gene expression by shrinking variance com-
ponents estimates, Biostatistics , 6:59-75.
Dean, N. and Raftery, A. E. (2005). Normal uniform mixture differential gene expres-
sion detection for cDNA microarrays, BMC Bioinformatics, 6:173-186.
Debouck, C. and Goodfellow, PN. (1999). DNA microarrays in drug discovery and
development, Nat. Genet., 21:48-50.
Dietz, K., Gail, M., Krickeberg, K., Samet, J., and Tsiatis, A. (2003). Statistics for
Biology and Health, Springer, 21:4850.
Ding, A. A., LIN, J., and Niu, T. (2008). A Statistical Procedure for Detecting Highly
Correlated Genes with a Pre-Specified Candidate Gene in Microarray Analysis,
Communications in Statistics - Theory and Methods, 37(18):2991-3007.
Diciccio, T. and Tibshirani, R. (1991). On the implementation of profile likelihood
methods, Technical report no. 9107, University of Toronto.
DeLong, E.R., DeLong, D.M. and Clarke-Pearson, D.L. (1988). Comparing the Area
under Two or More Correlated Receiver Operating Characteristic Curves: A Non-
parametric Approach, Biometrics , 44(3): 837-845.
Dorfman, D.D. and Alf Jr., E. (1969). Maximum likelihood estimation of parameters
of signal detection theory and determination of confidence intervals - rating-method
data, Journal of Mathematical Psychology, 6: 487-496.
Dorfman, D., Berbaum, K., and Metz, C. (1992). Receiver operating characteristic
rating analysis: generalization to the population of readers and patients with the
jackknife method, Investigative Radiology, 27: 723-731.
Dudoit, S., Fridlyand, J. and Speed, TP. (2002). Comparison of Discrimination Meth-
ods for the Classification of Tumors Using Gene Expression Data, Journal of the
American Statistical Association, 97(457):77-87.
Dudoit, S., Yang, YH, Callow, MJ. and Speed, TP. ( 2002). Statistical methods for
identifying differentially expressed genes in replicated cDNA microarray experi-
ments, Statistica Sinica, 12:111-139.
Efron, B. and Tibshirani, R (1993). An Introduction to the Bootstrap, NY: Chapman
and Hall.
Efron, B., Tibshirani, R., Storey, J. and Tusher, V. (2001). Empirical Bayes Analysis
of a Microarray Experiment, JASA , 96:1151-1160.
Efron, B. and Tibshirani, R. (2002). Empirical Bayes methods and false discovery
rates for microarrays, Genet. Epidemiology, 23(1):70-86.
Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster
analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci.,
95:14863-14868.
Fox, R. J. and M. W. Dimmic (2006). A two-sample Bayesian t-test for microarray
data, BMC Bioinformatics, 7:126.
Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M. and Haussler,
D. (2000). Support vector machine classification and validation of cancer tissue
samples using microarray expression data, Bioinformatics, 16:906-914.
Garrett-Mayer, E., Parmigiani, G., Zhong, X., Cope, L. and Gabrielson, E. (2004).
Cross-study Validation and Combined Analysis of Gene Expression Microarray
Data, Technical Report, Johns Hopkins University, Department of Biostatistics.
Gautier, L., Cope, L., Bolstad B.M., and Irizarry, R.A. (2004). affy-analysis of
Affymetrix GeneChip data at the probe level, Bioinformatics, 20:307-315.
Ge, Y., Dudoit, S. and Speed, T.P. (2003). Resampling-based multiple testing for
microarray data analysis, TEST, 12:1-44.
Geman, D., Christian d’Avignon, Naiman, D.Q., and Winslow, R.L., (2004). Classify-
ing Gene Expression Profiles from Pairwise mRNA Comparisons, Stat Appl Genet
Mol Biol., 3:1 Article 19.
Goddard MJ, Hinberg I. (1990). Receiver operating characteristic (ROC) curves and
non-normal data: an empirical study, Statistics in Medicine, 9:325-337.
Golub et al. (1999). Molecular classification of cancer: class discovery and class pre-
diction by gene expression monitoring, Science, 286:531-537.
Ghosh, D. (2004). Mixture models for assessing differential expression in complex
tissues using microarray data, Bioinformatics 20:1663-1669.
Goeman, J.J., Geer, S., Kort, F. and Houwelingen, H. (2004). A global test for groups
of genes: testing association with a clinical outcome, Bioinformatics, 20(1):93-99.
Hanley JA. (1988). The Robustness of the Binormal Assumption Used in Fitting ROC
Curves, Medical Decision Making, 8(3):197-203.
Hanley, J. and McNeil B.(1983). A method for comparing the areas under receiver
operating characteristic curves derived from the same cases, Radiology, 148:839-843.
Harbig, J., Sprinkle, R. and Enkemann, S.A. (2005). A sequence-based identification of the
genes detected by probesets on the Affymetrix U133 plus 2.0 array, Nucleic Acids
Res., 33:e31.
Haslett, J.N., Sanoudou, D., Kho, A.T., Bennett, R., Greenberg, S.A., Kohane, I.S.,
Beggs, A.H., Kunkel, L.M. (2002). Gene expression comparison of biopsies from
Duchenne muscular dystrophy (DMD) and normal skeletal muscle, Proceedings of
the National Academy of Sciences, USA, 99:15000-15005.
Hoeffding W. (1948). A class of statistics with asymptotically normal distribution,
Annals of Mathematical Statistics, 19(3):293-325.
Hochberg, Y., and Tamhane, A. (1987). Multiple Comparison Procedures, Wiley.
Hossain, A., Beyene, J., Willan, A., and Hu, P. (2009). A flexible approximate likeli-
hood ratio test for detecting differential expression in microarray data, Computa-
tional Statistics and Data Analysis, 53:3685-3695.
Hossain, A., and Willan, A. (2007). Approximate MLEs of the Parameters of
Location-Scale Models under Type II Censoring, Statistics: A Journal of Theo-
retical and Applied Statistics, 41(5):385.
Hu, P., Beyene, J. and Greenwood, CMT (2006). Tests for differential gene expression
using weights in oligonucleotide microarray experiments, BMC Genomics, 7:33.
Huber, W., Von Heydebreck, A., Sultmann, H., Poustka, A. and Vingron, M. (2002).
Variance stabilization applied to microarray data calibration and to the quantification
of differential expression, Bioinformatics, 18(Suppl. 1):S96-S104.
Hunter, L., Taylor, R.C., Leach, S.M. and Simon, R. (2001). GEST: a gene expression
search tool based on a novel Bayesian similarity metric, Bioinformatics, 17(1):S115-
S122.
Ideker, T., Thorsson, V., Siegel, A.F. and Hood, L.E. (2000). Testing for
Differentially-Expressed Genes by Maximum-Likelihood Analysis of Microarray
Data, Journal of Computational Biology, 7(6):805-817.
Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf
U. and Speed, T.P., (2003). Exploration, normalization, and summaries of high
density oligonucleotide array probe level data, Biostatistics, 4(2):249-64.
Jeffery, I.B., Higgins, D. G. and Culhane A.C. (2006). Comparison and evaluation
of methods for generating differentially expressed gene lists from microarray data,
BMC Bioinformatics, 7:359.
Kerr, M.K., Martin, M. and Churchill, G.A. (2000). Analysis of variance for gene
expression microarray data, Journal of Computational Biology, 7(6):819-37.
Khan J., Wei, J.S., Ringnér, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold,
F., Schwab, M., Antonescu, C.R., Peterson, C. and Meltzer, P.S. (2001). Classi-
fication and diagnostic prediction of cancers using gene expression profiling and
artificial neural networks, Nature Medicine, 7(6):658-9.
Klebanov, L. and Yakovlev, A. (2007). Diverse correlation structures in gene expression
data and their utility in improving statistical inference, The Annals of Applied
Statistics, 1(2):538-559.
Larsen, R. J. and Marx, M. L. (2001). An Introduction to Mathematical Statistics and
Its Applications, 3rd ed. Englewood Cliffs, NJ: Prentice-Hall.
Lee, M.L.T., Kuo, F.C., Whitmore, G.C. and Sklar, J. (2000). Importance of replica-
tion in microarray gene expression studies: Statistical methods and evidence from
repetitive cDNA hybridizations, PNAS, 97(18):9834-9839.
Li, C. and Wong, W.H. (2001). Model-based analysis of oligonucleotide arrays: Ex-
pression index computation and outlier detection, PNAS, 98(1):31-36.
Li, L., Weinberg, C., Darden, T. and Pedersen, L. (2001). Gene selection for sam-
ple classification based on gene expression data: study of sensitivity to choice of
parameters of the GA/KNN method, Bioinformatics, 17:1131-1142.
Lin,D., Shkedy, Z., Burzykowski,T., Ion,T., Gohlmann, H., Bondt, A., Perera, T.,
Geerts, T., Wyngaert, V., and Bijnens, L. (2008). An Investigation on Perfor-
mance of Significance Analysis of Microarray (SAM) for the Comparisons of Several
Treatments with one Control in the Presence of Small-variance Genes, Biometrical
Journal, 50(5):801-823.
Lobenhofer, E.K., Bushel, P.R., Afshari, C.A. and Hamadeh, H.K.(2001). Progress in
the application of DNA microarrays, Environ. Health Perspect., 109:881-891.
Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., Gallo, M.V., Chee, M.S.,
Mittmann, M., Wang, C., Kobayashi, M., Horton, H. and Brown E.L. (1996). Ex-
pression monitoring by hybridization to high-density oligonucleotide arrays, Nature
Biotechnology, 14:1675-1680.
Lonnstedt, I. and Speed, T. P. (2002). Replicated microarray data, Statistica Sinica,
12:31-46.
Lu, Y., Liu, P., Xiao, P. and Deng H. (2005). Hotelling’s T 2 multivariate profiling for
detecting differential expression in microarrays, Bioinformatics, 21(14):3105-3113.
Macoska, JA. (2002). The progressing clinical utility of DNA microarrays, CA Cancer
J Clin., 52:50-59.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate Analysis, Academic
Press,London.
Mason, R.L., Gunst, R.F., and Hess, J.L. (1989). Statistical Design and Analysis
of Experiments: With Applications to Engineering and Science, John Wiley, New
York.
Millenaar, F.F., Okyere, J., May, S.T., Zanten, M.V., Voesenek, L.A., and Peeter,
A.J. (2006). How to decide? Different methods of calculating gene expression from
short oligonucleotide array data will give different results, BMC Bioinformatics,
7:137.
Newton, M. A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., and Tsui,
K.W.(2001). On differential variability of expression ratios: improving statistical
inference about gene expression changes from microarray data, J. Comp. Biol.,
8:37-52.
Newton, M. A., Noueiry, A., Sarkar, D. and Ahlquist, P. (2004). Detecting differential
gene expression with a semiparametric hierarchical mixture method, Biostatistics,
5:155-176.
Nguyen, D.V., Arpat, A.B., Wang, N., Carroll, R.J. (2002). DNA microarray experi-
ments: biological and technological aspects, Biometrics, 58(4):701-17.
Pan, W. (2002). A comparative review of statistical methods for discovering dif-
ferentially expressed genes in replicated microarray experiments, Bioinformatics,
12:546-554.
Pan, W., Lin, J. and Lee, C. (2003). A mixture model approach to detecting differen-
tially expressed genes with microarray data, Function Interg. Genomics, 3:117-124.
Pawitan, Y., Michiels, S., Koscienlny, S., Gusnanto, A. and Ploner, A. (2005). False
discovery rate, sensitivity and sample size for microarray studies, Bioinformatics,
21(13):3017-3024.
Pepe, M.S., Longton, G. Anderson, G.L., and Schummer, M. (2003). Selecting Differ-
entially Expressed Genes from Microarray Experiments, Biometrics, 59:133-142.
Piper, P., Lewis, D.P. and Noble, WS. (2002). Exploring gene expression data with
class scores, Pac. Symp. Biocomput., p474-85.
Pounds, S. and Cheng, C. (2005). Statistical Development and Evaluation of Microar-
ray Gene Expression Data Filters, Journal of Computational Biology, 12(4):482-495.
Qiu, X., Klebanov, L. and Yakovlev, A. Y. (2005). Correlation between gene expres-
sion levels and limitations of the empirical bayes methodology for finding differen-
tially expressed genes, Statistical Applications in Genetics and Molecular Biology,
4:34.
Qiu, X. and Yakovlev, A. (2006). Some comments on instability of false discovery
rate estimation, J. Bioinformatics and Computational Biology, 4:1057-1068.
Opgen-Rhein, R. and Strimmer, K. (2007). Accurate Ranking of Differentially Ex-
pressed Genes by a Distribution-Free Shrinkage Approach, Statistical Applications
in Genetics and Molecular Biology, 6(1), Article 9.
Quackenbush, J. (2001). Computational analysis of microarray data, Nature Review
Genetics, 2:418-427.
Raychaudhuri, S., Stuart, J.M., Liu, X., Small, P.M. and Altman, R.B.(2000). Pattern
recognition of genomic features with microarrays: site typing of Mycobacterium
tuberculosis strains, Proc. Int. Conf. Intell. Syst. Mol. Biol., 8:286-295.
Reyal, F., Stransky, N., Bernard-Pierrot, I., Vincent-Salomon, A., de Rycke, Y. and
Elvin, P. (2005). Visualizing chromosomes as transcriptome correlation maps: Evi-
dence of chromosomal domains containing co-expressed genes: a study of 130 invasive
ductal breast carcinomas, Cancer Res., 65:1376-1383.
Sahai, H. and Agell, M. (2000). Analysis of variance: Fixed, Random and Mixed
models, Boston: Birkhauser.
Sambrook and Russell (2001). Molecular Cloning: A Laboratory Manual, 3rd edition,
Cold Spring Harbor Laboratory Press.
Schulze, A. and Downward, J. (2001). Navigating gene expression using microarrays: a
technology review, Nat Cell Biol., 3:190-195.
Scharpf, R.B., Tjelmeland, H., Parmigiani, G. and Nobel, A.B. (2009). A Bayesian
Model for Cross-Study Differential Gene Expression, Journal of the American
Statistical Association, 104(488):1295-1310.
Schena, M. (2000). Microarray Biochip Technology. Westborough,MA: BioTechniques
Press.
Segal, E., Friedman, N., Koller, D. and Regev, A. (2004). A module map showing
conditional activity of expression modules in cancer, Nat. Genet., 36:1090-1098.
Seng, K., Glenny, R.W., Madtes, D.K., Spilker, M.E., Vicini, P. and Gharib, S.A.
(2008). Comparison of Statistical Data Models for Identifying Differentially Ex-
pressed Genes Using a Generalized Likelihood Ratio Test, Gene Regulation and
Systems Biology, 2:125-139.
Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing dif-
ferential expression in microarray experiments, Statistical Applications in Genetics
and Molecular Biology, 3(1), Article 3.
Smyth, G. K., Thorne, N. P., and Wettenhall, J. (2004). Limma: Linear models
for microarray data user’s guide, Software manual available from http://www.
bioconductor.org.
Speed, T. (eds) (2003). Statistical Analysis of Gene Expression Microarray Data.
Chapman & Hall/CRC, USA.
Storey, J. D. (2002). A direct approach to false discovery rates, J. R. Statist. Soc. B
, 64(3):479-498.
Storey, J. and Tibshirani, R. (2003). SAM thresholding and false discovery rates
for detecting differential gene expression in DNA microarrays, In Parmigiani, G.,
Garrett, E.S., Irizarry, R.A. and Zeger, S.L. (eds), The Analysis of Gene Expression
Data: Methods and Software, New York: Springer.
Sawilowsky S.S. (1993). Comments on using alternatives to normal theory statistics
in social and behavioural science, Canadian Psychology, 34:432-439.
Szabo, A. et al. (2002). Variable selection and pattern recognition with gene expres-
sion data generated by the microarray technology, Math. Biosci., 176:71-98.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander,
E. S. and Golub, T. R. (1999). Interpreting patterns of gene expression with self-
organizing maps: methods and application to hematopoietic differentiation, Proc.
Natl. Acad. Sci. U.S.A., 96:2907-2912.
Thomas, J.G., Olson, J.M., Tapscott, S.J. and Zhao, L.P. (2001). An efficient and
robust statistical modeling approach to discover differentially expressed genes using
genomic expression profiles, Genome Research, 11:1227-1236.
Troyanskaya, O.G., Garber, M., Brown, P., Botstein, D., Altman, R.B. (2002).
Nonparametric methods for identifying differentially expressed genes in microarray
data, Bioinformatics, 18(11):1454-1461.
Tsodikov, A., Szabo, A. and Jones, D. (2002). Adjustments and measures of differ-
ential expression for microarray data, Bioinformatics, 18:251-260.
Tusher, V.G., Tibshirani, R., Chu, G., (2001). Significance analysis of microarray
applied to the ionizing radiation response, PNAS, 98(9):5116-5121.
Van’t Wout, A. B., G.K. Lehrman, S. A. Mikheeva, G. C. O’Keeffe, M. G. Katze, R.
E. Bumgarner, G. K. Geiss, and J.I. Mullins (2003). Cellular gene expression upon
human immunodeficiency virus type 1 infection of CD4(+) T-cell lines, Journal of
Virology, 77(2):1392-1402.
Westfall, P.H., and Young, S.S. (1993). Re-sampling based multiple testing, Wiley,
NY.
Wolfinger, R., Gibson, G., Wolfinger, E., Bennett, L., Hamadeh, H., Bushel P, Af-
shari, C. and Paules, R. (2001). Assessing Gene Significance from cDNA Microarray
Expression Data via Mixed Models, Journal of Computational Biology, 8:625-637.
Wright, G.W. and Simon, R.M. (2003). A random variance model for detection of dif-
ferential gene expression in small microarray experiments, Bioinformatics, 19:2448-
2455.
Wu, B. (2005). Differential gene expression detection using penalized linear regression
models: the improved SAM statistics, Bioinformatics, 21:1565-1571.
Yang, K., Cai, Z., Li, J. and Lin, G. (2006). A stable gene selection in microarray
data analysis, BMC Bioinformatics, 7:228.
Xiong, M., Fang, X., Zhao, J. (2001). Biomarker Identification by Feature Wrappers,
Genome Research, 11:1878-1887.
Yekutieli, D., Benjamini, Y. (1999), Resampling-based false discovery rate controlling
multiple test procedures for correlated test statistics, Journal of Statistical Planning
and Inference, 82:171-196.
Yang, Y.H., Dudoit, S.D., Luu, P. and Speed, T.P. (2001). Normalization for cDNA
Microarray Data, In Spie Bioe.
Yeung, K.Y., Bumgarner, R.E. and Raftery, A.E. (2005). Bayesian model averaging:
development of an improved multi-class, gene selection and classification tool for
microarray data, Bioinformatics, 21:2394.
Zhao, Y. and Pan, W. (2003). Modified nonparametric approaches to detecting dif-
ferentially expressed genes in replicated microarray experiments, Bioinformatics,
19:1046-1054.
Zhang, S. (2006). An Improved Nonparametric Approach for Detecting Differentially
Expressed Genes with Replicated Microarray Data, Statistical Applications in Ge-
netics and Molecular Biology, 5(1).
Zweig, M.H., Campbell, G. (1993). Receiver operating characteristic (ROC) plots: a
fundamental evaluation tool in clinical medicine, Clinical Chemistry, 39(4):561-77.
Zhou, Y., Corentin C., Ohsugi, M., Stormo, G. and Permutt, M.(2007). A global
approach to identify differentially expressed genes in cDNA (two-color) microarray
experiments, Bioinformatics, 23(16):2073-2079.
Appendix
A. R Code for Chapter 1
####
library(Biobase)
library(multtest)
library(DEDS)
# The following code computes t-statistic values from the golub data
# and draws the histogram
data(golub)
t.fun <- comp.t(golub.cl)
t.X <- t.fun(golub)
hx <- hist(t.X, breaks=100, plot=FALSE)
plot(hx, col=ifelse(abs(hx$breaks) < 3, 4, 2), xlab="T statistic", ylab="", main="")
# The following function computes t-statistic after permutation
t.twoclass<-
function(X, L, B = 2, Delta = 2)
{
# L A vector of integers corresponding to observation (column) class
# labels.
# X A matrix, with m rows corresponding to variables
# (hypotheses) and n columns corresponding to observations.
# B The number of permutations.
# Delta The threshold value for the statistic; see Tusher et al. (2001).
ng <- as.vector(table(L))
n <- sum(ng)
t.fun <- comp.t(L, var.equal = TRUE)
tX <- t.fun(as.matrix(X))
tXb <- matrix(nr = nrow(X), nc = B)
for(i in 1:B) {
id <- sample(c(1:n), n, replace = FALSE)
Xb <- X[, id]
t.fun <- comp.t(L, var.equal = TRUE)
tXb[, i] <- t.fun(as.matrix(Xb))
}
p.t1 <- apply(abs(tXb) >= Delta, 2, sum)
p.t <- mean(p.t1)
return(p.t)
}
B. R Code for Chapter 3
### t-statistics and FDR calculation
t.ah.func <- function(X, L, B = 100) {
# L A vector of integers corresponding to observation (column) class
# labels.
# X A matrix, with m rows corresponding to variables
# (hypotheses) and n columns corresponding to observations.
# B The number of permutations.
G1 <- X[, L == 0]
G2 <- X[, L == 1]
n1 <- ncol(G1)
n2 <- ncol(G2)
r <- rowMeans(G1, na.rm = TRUE) - rowMeans(G2, na.rm = TRUE)
ss <- function(x)
{
sum((as.numeric(x) - mean(as.numeric(x), na.rm = TRUE))^2, na.rm = TRUE)
}
s <- sqrt(((apply(G1, 1, ss) + apply(G2, 1, ss)) * (1/n1 + 1/n2))/(n1 + n2 - 2))
t <- r/s
order.t <- order(t)
sort.t <- sort(t)
tB <- matrix(nr = nrow(X), nc = B)
for(i in 1:B) {
id <- sample(c(1:(n1 + n2)), n1 + n2)
G1 <- X[, id[1:n1]]
G2 <- X[, id[(n1 + 1):(n1 + n2)]]
rb <- rowMeans(G2, na.rm = TRUE) - rowMeans(G1, na.rm = TRUE)
sb <- sqrt(((apply(G1, 1, ss) + apply(G2, 1, ss)) * (1/n1 + 1/n2))/(n1 + n2 - 2))
tb <- rb/sb
tB[, i] <- sort(tb)
}
return(cbind(index = order.t, t = sort.t, tB))
}
t.ah.fdr<- function (order.t, ordertB, deltas) {
n <- length(order.t)
ndelta <- length(deltas)
tB <- rowMeans(ordertB)
diff <- order.t - tB
tmp <- quantile(as.vector(ordertB), c(0.25, 0.75), na.rm = TRUE)
pi <- sum(order.t < tmp[2] & order.t > tmp[1], na.rm = TRUE)/(n/2)
pi <- min(pi, 1)
table <- c()
for (i in 1:ndelta) {
delta <- deltas[i]
if (sum((diff > delta) & (order.t > 0)) != 0) {
pos <- min(which((diff > delta) & (order.t > 0))):n
n.pos <- length(pos)
}
else n.pos <- 0
if (sum((diff < (-delta)) & (order.t < 0)) != 0) {
neg <- 1:max(which((diff < (-delta)) & (order.t <
0)))
n.neg <- length(neg)
}
else n.neg <- 0
n.total <- n.pos + n.neg
max <- ifelse(n.pos == 0, Inf, order.t[n - n.pos + 1])
min <- ifelse(n.neg == 0, -Inf, order.t[n.neg])
fp <- median(apply(ordertB, 2, function(z) {
sum(z >= max | z <= min)
}), na.rm = TRUE)
median.fp <- ifelse(pi * fp/n.total > 1, 1, pi *
fp/n.total)
table <- rbind(table, c(delta, n.total, n.pos, n.neg, fp,
median.fp))
}
colnames(table) <- c("delta", "no.significance", "no.positive",
"no.negative", "FP", "FDR")
return(table)
}
comp.ah.t<- function (X,L, B = 200, deltas) {
t <- t.ah.func(X, L, B)
n <- ncol(t)
fdr.table <- t.ah.fdr(t[, 2], t[, 3:n], deltas)
return(list(geneOrder = t[, 1], t.stat = t[, 2], fdr.table = fdr.table))
}
### SAM statistic and FDR calculation
sam.ah.func <- function (X, L, prob = NULL, B = 200, stat.only = FALSE,
s.step = 0.01, alpha.step = 0.01)
{
G1 <- X[, L == 0]
G2 <- X[, L == 1]
n1 <- ncol(G1)
n2 <- ncol(G2)
r <- rowMeans(G1, na.rm = TRUE) - rowMeans(G2, na.rm = TRUE)
ss <- function(x) {
sum((as.numeric(x) - mean(as.numeric(x), na.rm = TRUE))^2,
na.rm = TRUE)
}
s <- sqrt((apply(G1, 1, ss) + apply(G2, 1, ss)) * (1/n1 +
1/n2)/(n1 + n2 - 2))
if (!is.null(prob))
s0 <- quantile(s, prob)
else s0 <- sam.ah.s0(r, s, s.step = s.step, alpha.step = alpha.step)
d <- r/(s + s0)
if (stat.only)
return(d)
else {
order.d <- order(d)
sort.d <- sort(d)
B <- deds.checkB(L, B)
dB <- matrix(nr = nrow(X), nc = B)
for(i in 1:B){
id <- sample(c(1:(n1 + n2)), n1 + n2)
G1 <- X[, id[1:n1]]
G2 <- X[, id[(n1 + 1):(n1 + n2)]]
rb <- rowMeans(G2, na.rm = TRUE) - rowMeans(G1, na.rm = TRUE)
sb <- sqrt((apply(G1, 1, ss) + apply(G2, 1, ss)) *
(1/n1 + 1/n2)/(n1 + n2 - 2))
db <- rb/(sb + s0)
dB[, i] <- sort(db)
}
return(cbind(index = order.d, t = sort.d, dB))
}
}
sam.ah.s0<- function(r, s, s.step = 0.01, alpha.step = 0.01) {
if(1/s.step >= length(s))
s.step <- 2 * (1/length(s))
q <- quantile(s, seq(0, 1, by = s.step))
while(any(duplicated(q))) {
s.step <- s.step * 2
q <- quantile(s, seq(0, 1, by = s.step))
}
q.indices <- cut(s, q, labels = FALSE, right = FALSE)
q.indices[which(is.na(q.indices))] <- 1/s.step
sam.d.alpha <- function(r, s, alpha)
{
s.alpha <- quantile(s, alpha)
r/(s + s.alpha)
}
cv.alpha <- function(alpha)
{
d.alpha <- sam.d.alpha(r, s, alpha)
v.j <- tapply(d.alpha, q.indices, mad)
sd(v.j, na.rm = TRUE)/mean(v.j, na.rm = TRUE)
}
alpha.seq <- seq(0, 1, by = alpha.step)
cva <- sapply(alpha.seq, cv.alpha)
alpha0 <- alpha.seq[which(cva == min(cva))]
s0 <- quantile(s, alpha0)
i <- 2
while(s0 == 0) {
alpha0 <- alpha.seq[which(cva == cva[order(cva)[i]])]
i <- i + 1
s0 <- quantile(s, alpha0)
}
s0
}
sam.ah.fdr<- function (order.d, ordertB, deltas) {
n <- length(order.d)
ndelta <- length(deltas)
tB <- rowMeans(ordertB)
diff <- order.d - tB
tmp <- quantile(as.vector(ordertB), c(0.25, 0.75), na.rm = TRUE)
pi <- sum(order.d < tmp[2] & order.d > tmp[1], na.rm = TRUE)/(n/2)
pi <- min(pi, 1)
table <- c()
for (i in 1:ndelta) {
delta <- deltas[i]
if (sum((diff > delta) & (order.d > 0)) != 0) {
pos <- min(which((diff > delta) & (order.d > 0))):n
n.pos <- length(pos)
}
else n.pos <- 0
if (sum((diff < (-delta)) & (order.d < 0)) != 0) {
neg <- 1:max(which((diff < (-delta)) & (order.d <
0)))
n.neg <- length(neg)
}
else n.neg <- 0
n.total <- n.pos + n.neg
max <- ifelse(n.pos == 0, Inf, order.d[n - n.pos + 1])
min <- ifelse(n.neg == 0, -Inf, order.d[n.neg])
fp <- median(apply(ordertB, 2, function(z) {
sum(z >= max | z <= min)
}), na.rm = TRUE)
FDR <- ifelse(pi * fp/n.total > 1, 1, pi *
fp/n.total)
table <- rbind(table, c(delta, n.total, n.pos, n.neg, fp,
FDR))
}
colnames(table) <- c("delta", "no.significance", "no.positive",
"no.negative", "fp", "FDR")
return(table)
}
comp.ah.SAM <- function(X, L, prob = NULL, B = 200, stat.only = FALSE, deltas,
s.step = 0.01, alpha.step = 0.01)
{
d <- sam.ah.func(X, L, prob, B, stat.only, s.step = s.step,
alpha.step = alpha.step)
if(stat.only)
return(d)
else {
n <- ncol(d)
fdr.table <- sam.ah.fdr(d[, 2], d[, 3:n], deltas)
return(list(geneOrder = d[, 1], sam = d[, 2], fdr.table =
fdr.table))
}
}
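For orientation, here is a tiny deterministic illustration (not part of the thesis code) of the SAM-type statistic d = r/(s + s0) that sam.ah.func computes. For simplicity the fudge factor s0 is taken here as the median of s, whereas sam.ah.s0 chooses it by minimizing a coefficient-of-variation criterion:

```r
# Two genes, three samples per group; gene g1 is shifted, g2 is null.
X <- rbind(g1 = c(1, 2, 3, 7, 8, 9),
           g2 = c(4, 5, 6, 5, 6, 4))
L <- c(0, 0, 0, 1, 1, 1)
G1 <- X[, L == 0]; G2 <- X[, L == 1]
n1 <- ncol(G1); n2 <- ncol(G2)
r <- rowMeans(G1) - rowMeans(G2)          # group mean difference
ss <- function(x) sum((x - mean(x))^2)    # within-group sum of squares
s <- sqrt((apply(G1, 1, ss) + apply(G2, 1, ss)) * (1/n1 + 1/n2)/(n1 + n2 - 2))
s0 <- unname(quantile(s, 0.5))            # simplified fudge factor
d <- r/(s + s0)
```

The shifted gene g1 gets a large (negative, since r is group 0 minus group 1) statistic, while the null gene g2 sits at zero.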
ALRT method
## Fitting GLDII with 2 class microarray data
# AMLE for b, mu1 and mu2 for GLDII
# y1 expression values for treatment condition
# y2 expression values for control condition
# alpha shape parameter for GLDII
GLogistic2.amle2<-
function(alpha=1,y1,y2) {
n1 <- length(y1)
i1 <- 1:n1
pi1 <- i1/(n1 + 1)
qi1 <- 1 - pi1
nui1 <- log( - log(qi1))
n2 <- length(y2)
i2 <- 1:n2
pi2 <- i2/(n2 + 1)
qi2 <- 1 - pi2
nui2 <- log( - log(qi2))
D1 <- function(x)
{
(alpha*(exp(-x)-1))/(1+exp(-x))
}
D1prime <- function(x)
{
-alpha*(2*exp(-x))/(1+exp(-x))^2
}
Ai <- D1(nui1) - nui1 * D1prime(nui1)
Bi <- - D1prime(nui1)
Cj <- D1(nui2) - nui2 * D1prime(nui2)
Dj <- - D1prime(nui2)
K1 <- (sum(Bi * y1) )/(sum(Bi) )
L1 <- sum(Ai) /sum(Bi)
K2 <- (sum(Dj * y2) )/(sum(Dj) )
L2 <- sum(Cj) /sum(Dj)
lambda1 <- sum((y1 - K1) * Ai)+sum((y2 - K2) * Cj)
lambda2 <- sum(Bi * (y1 - K1)^2) +sum(Dj * (y2 - K2)^2)
b <- ( - lambda1 + sqrt(lambda1^2 + 4 * (n1+n2) * lambda2))/(2 * (n1+n2))
mu1 <- K1 - L1 * b
mu2 <- K2 - L2 * b
c(b, mu1,mu2)
}
# AMLE for b, mu0 under H0 for GLDII
GLogistic2.amle.h0<-
function(alpha=1, y1,y2) {
n1 <- length(y1)
i1 <- 1:n1
pi1 <- i1/(n1 + 1)
qi1 <- 1 - pi1
nui1 <- log( - log(qi1))
n2 <- length(y2)
i2 <- 1:n2
pi2 <- i2/(n2 + 1)
qi2 <- 1 - pi2
nui2 <- log( - log(qi2))
D1 <- function(x)
{
(alpha*(exp(-x)-1))/(1+exp(-x))
}
D1prime <- function(x)
{
-alpha*(2*exp(-x))/(1+exp(-x))^2
}
Ai <- D1(nui1) - nui1 * D1prime(nui1)
Bi <- - D1prime(nui1)
Cj <- D1(nui2) - nui2 * D1prime(nui2)
Dj <- - D1prime(nui2)
K0 <- (sum(Bi * y1) +sum(Dj * y2))/(sum(Bi)+sum(Dj) )
L0 <- (sum(Ai)+sum(Cj)) /(sum(Bi)+sum(Dj))
lambda1 <- sum((y1 - K0) * Ai)+sum((y2 - K0) * Cj)
lambda2 <- sum(Bi * (y1 - K0)^2) +sum(Dj * (y2 - K0)^2)
b <- ( - lambda1 + sqrt(lambda1^2 + 4 * (n1+n2) * lambda2))/(2 * (n1+n2))
mu0 <- K0 - L0 * b
c(b, mu0)
}
## Estimating alpha values for GLDII
alpha.Glogistic2<-
function(x1, x2, lower, upper)
{
# lower, upper The least and greatest value at which to evaluate
# the profile likelihood.
alphas <- seq(lower, upper, by = 0.001)
m2 <- length(alphas)
L.est <- NA
for(i in 1:m2) {
alpha <- alphas[i]
est <- GLogistic2.amle2(alpha, x1, x2)
b.est <- est[1]
mu1 <- est[2]
mu2 <- est[3]
pdfglogis <- function(x, alpha, mu, b)
{
(alpha/b) * (exp( - ((x - mu)/b) * alpha)/(1 +
exp( - ((x - mu)/b)))^(alpha + 1))
}
L.est[i] <- sum(log(pdfglogis(x1, alpha, mu1, b.est))) +
sum(log(pdfglogis(x2, alpha, mu2, b.est)))
}
alphas[which.max(L.est)]
}
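As a numerical sanity check (illustrative only, not thesis code), the GLDII density used in the profile likelihood above integrates to one; a finite integration range is used to avoid overflow of exp() in the far tails:

```r
pdfglogis <- function(x, alpha, mu, b) {
    z <- (x - mu)/b
    (alpha/b) * exp(-z * alpha)/(1 + exp(-z))^(alpha + 1)
}
# mass outside (-40, 40) is negligible for these parameter values
total <- integrate(pdfglogis, -40, 40, alpha = 2.5, mu = 1, b = 0.7)$value
```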
get.alpha<-function(dataf,index1,index2,datafilter=as.numeric){
f<-function(i){
return(alpha.Glogistic2(datafilter(dataf[i,index1]),
datafilter(dataf[i,index2]), lower=0, upper=5))
}
return(sapply(1:length(dataf[,1]),f))
}
# ALRT
LR.Glogistic<-
function(alpha = 1, x1, x2)
{
est <- GLogistic2.amle2(alpha, x1, x2)
b.est <- est[1]
mu1 <- est[2]
mu2 <- est[3]
est.ho <- GLogistic2.amle.h0(alpha, x1, x2)
b0.est <- est.ho[1]
mu0 <- est.ho[2]
pdfglogis <- function(x, alpha, mu, b)
{
(gamma(2 * alpha)/(gamma(alpha))^2) * (1/b) * (exp( - (
(x - mu)/b) * alpha)/(1 + exp( - ((x - mu)/b)))^
(2 * alpha))
}
L.est <- sum(log(pdfglogis(x1, alpha, mu1, b.est))) + sum(log(
pdfglogis(x2, alpha, mu2, b.est)))
L.ho <- sum(log(pdfglogis(x1, alpha, mu0, b0.est))) + sum(log(
pdfglogis(x2, alpha, mu0, b0.est)))
LR <- 2 * L.est - 2 * L.ho
LR
}
get.GLR<-
function(dataf, alpha = 1, index1, index2, datafilter = as.numeric)
{
f <- function(i)
{
return(LR.Glogistic(alpha, datafilter(dataf[i, index1]),
datafilter(dataf[i, index2])))
}
return(sapply(1:length(dataf[, 1]), f))
}
# Permutation results
Galrt1.twoclass<-
function (X, L, B = 2,alpha=1)
{
ng<-as.vector(table(L))
n<-sum(ng)
alrt<-get.GLR(X,alpha,1:ng[1],(ng[1]+1):n)
alrtb <- matrix(nrow = nrow(X), ncol = B)
for (i in 1:B) {
id <- sample(c(1:n), n, replace=FALSE)
Xb<-X[,id]
alrtb[,i] <- get.GLR(Xb,alpha,1:ng[1],(ng[1]+1):n)
}
p.alrt<-apply(abs(alrtb)>=abs(alrt),1,sum)/B
return(p.alrt)
}
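The permutation scheme in Galrt1.twoclass can be sketched self-containedly. This toy version (hypothetical data, with a plain mean difference standing in for the ALRT statistic) shows the p-value construction:

```r
set.seed(1)
x <- c(0.1, 0.4, 0.3, 5.2, 5.8, 5.5)      # two groups of three, strong shift
L <- rep(0:1, each = 3)
stat <- function(x, L) mean(x[L == 1]) - mean(x[L == 0])
obs <- stat(x, L)
B <- 500
perm <- replicate(B, stat(x, sample(L)))  # relabel and recompute
p <- mean(abs(perm) >= abs(obs))          # two-sided permutation p-value
```

With only 20 distinct 3-vs-3 splits, the smallest attainable p-value here is about 0.1; real microarray sample sizes give a finer grid.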
# Number of significant genes and false positives corresponding to a
# cutoff p-value
nosigp<- function(alphas, pvalue, id.DE) {
np <- length(alphas)
table <- c()
for(i in 1:np) {
p <- alphas[i]
sig.t <- sum(pvalue < p, na.rm = T)
fp <- sum(pvalue[ - id.DE] < p, na.rm = T)
fn <- sum(pvalue[id.DE] >= p, na.rm = T)
tp <- sum(pvalue[id.DE] < p, na.rm = T)
tn <- sum(pvalue[ - id.DE] >= p, na.rm = T)
sens <- tp/(tp + fn)
spec <- tn/(fp + tn)
table <- rbind(table, c(p, sig.t, fp, fn, tp, tn, sens, spec))
}
colnames(table) <- c("alpha", "sig", "fp", "fn", "tp", "tn", "sens", "spec")
return(table)
}
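The confusion-table arithmetic inside nosigp, worked by hand on a hypothetical vector of six p-values in which the first two genes are truly differentially expressed:

```r
pvalue <- c(0.01, 0.20, 0.03, 0.50, 0.04, 0.90)
id.DE <- 1:2                       # indices of the truly DE genes
p <- 0.05                          # cutoff
tp <- sum(pvalue[id.DE] < p)       # gene 1 only
fn <- sum(pvalue[id.DE] >= p)      # gene 2
fp <- sum(pvalue[-id.DE] < p)      # genes 3 and 5
tn <- sum(pvalue[-id.DE] >= p)     # genes 4 and 6
sens <- tp/(tp + fn)
spec <- tn/(fp + tn)
```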
# Calculating Area under the ROC curve from simulation
sim.logis <- function(R = 100, m, alpha0 = 6, beta0 = 2, a0 = 5, a = 4, b = 8,
B = 10, n1 = 5, n2 = 5, pde = 0.1)
{
{
auc.LR <- NA
auc.GLR <- NA
auc.sam <- NA
# auc.t<- NA
auc.willcox <- NA
auc.modt <- NA
n <- n1 + n2
for(r in 1:R) {
simml <- function(m, n, a0)
{
bs <- 1/rgamma(m, shape = a0, rate = a0)
alphas <- rgamma(m, shape = alpha0, rate = beta0)
table1 <- c()
for(i in 1:m) {
b <- bs[i]
alpha1 <- alphas[i]
X <- glgsimn2(r = n, alpha1, mu = 0, b)
table1 <- rbind(table1, c(alpha1, X))
}
return(table1)
}
X1 <- simml(m, n, a0)
alpha <- X1[, 1]
X <- X1[, 2:(n + 1)]
mug <- function(m, n2, pde, a, b)
{
table2 <- c()
de <- pde * m
for(i in 1:de) {
mug <- rgamma(n2, shape = a, rate = b)
table2 <- rbind(table2, c(mug))
}
return(table2)
}
mug1 <- mug(m, n2, pde, a, b)
X[1:nrow(mug1), (n1 + 1):n] <- X[1:nrow(mug1), (n1 + 1):n] + mug1
L <- rep(0:1, c(n1, n2))
genenames <- paste(c("g"), 1:m, sep = "")
row.names(X) <- genenames
alpha2 <- seq(0, 1, by = 0.05)
DEGenes <- 1:(nrow(mug1))
auci <- function(pvalues)
{
res <- nosigp(alpha2, pvalues, DEGenes)
se <- res[, 7]
sp <- res[, 8]
rocobj1 <- r.sca(se, sp)
AUCi(rocobj1)
}
GLR.sim <- G2alrt1.twoclass(X, L, B, alpha)
auc.GLR[r] <- auci(GLR.sim)
alrtX.sim <- alrt1.twoclass(X, L, B)
auc.LR[r] <- auci(alrtX.sim)
samX.sim <- sam1.twoclass(X, L, B)
auc.sam[r] <- auci(samX.sim)
# tX.sim<-t1.twoclass(X,L,B)
# auc.t[r]<-auci(tX.sim)
modt.sim <- modt.twoclass(X, L, B)
auc.modt[r] <- auci(modt.sim)
wilcox.sim <- samwilc.twoclass(X, L, B)
auc.willcox[r] <- auci(wilcox.sim)
}
GLRT <- mean(auc.GLR)
sd.GLRT <- sqrt(var(auc.GLR))
ALRT <- mean(auc.LR)
sd.ALRT <- sqrt(var(auc.LR))
sam1 <- mean(auc.sam)
sd.sam <- sqrt(var(auc.sam))
#t1<-mean(auc.t)
#sd.t<-sqrt(var(auc.t))
modt <- mean(auc.modt)
sd.modt <- sqrt(var(auc.modt))
willcox <- mean(auc.willcox)
sd.willcox <- sqrt(var(auc.willcox))
table1 <- data.frame(GLRT, ALRT, sam1, modt, sd.GLRT, sd.ALRT,
sd.sam, sd.modt, willcox, sd.willcox)
return(table1)
}
## Generating GLDII
glgsim1<-
function(alpha) {
W <- runif(1)
invglg3<-function(x, alpha, na.action = na.omit ) {
u <- qbeta(x, 1, alpha)
log(u/(1-u))
}
invglg3(W, alpha)
}
glgsimn2<-
function(r,alpha=2,mu,b,na.action = na.omit)
{
i <- 1
Z <- NULL
while(i <= r) {
x<-glgsim1(alpha)
Z[i] <- mu+b*x
i <- i + 1
}
Z
}
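glgsim1 draws from GLDII by inverse-CDF sampling through a Beta(1, alpha) quantile. As an illustrative check, the empirical distribution of such draws should match the GLDII distribution function F(x) = 1 - exp(-alpha*x)/(1 + exp(-x))^alpha:

```r
set.seed(42)
alpha <- 2
u <- qbeta(runif(20000), 1, alpha)   # same construction as glgsim1, vectorized
draws <- log(u/(1 - u))
Fgld <- function(x, alpha) 1 - exp(-alpha * x)/(1 + exp(-x))^alpha
x0 <- 0.5
err <- abs(mean(draws <= x0) - Fgld(x0, alpha))
```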
# Prediction by Discriminant Rule
# (sum.na, mean.na, var.na and order.na are helper functions from the sma package)
stat.diag.da<-
function (ls, cll, ts, pool = 1)
{
ls <- as.matrix(ls)
ts <- as.matrix(ts)
n <- nrow(ls)
p <- ncol(ls)
nk <- rep(0, max(cll) - min(cll) + 1)
K <- length(nk)
m <- matrix(0, K, p)
v <- matrix(0, K, p)
disc <- matrix(0, nrow(ts), K)
for (k in (1:K)) {
which <- (cll == k + min(cll) - 1)
nk[k] <- sum.na(which)
m[k, ] <- apply(ls[which, ], 2, mean.na)
v[k, ] <- apply(ls[which, ], 2, var.na)
}
vp <- apply(v, 2, function(z) sum.na((nk - 1) * z)/(n - K))
if (pool == 1) {
for (k in (1:K)) disc[, k] <- apply(ts, 1, function(z) sum.na((z -
m[k, ])^2/vp))
}
if (pool == 0) {
for (k in (1:K)) disc[, k] <- apply(ts, 1, function(z) (sum.na((z -
m[k, ])^2/v[k, ]) + sum.na(log(v[k, ]))))
}
pred <- apply(disc, 1, function(z) (min(cll):max(cll))[order.na(z)[1]])
list(pred = pred)
}
B. R Code for Chapter 4
### CALCULATION OF ROC
W1<-
function(x, y) {
Ai <- w.new(x, y)$A
if(Ai >= 0.5) {
wval1 <- w.new(d1 = x, d2 = y)
A <- wval1$A
s00 <- wval1$s00
s11 <- wval1$s11
var1A <- wval1$var1A
psi.idot <- wval1$psi.idot
psi.dotj <- wval1$psi.dotj
psi.idot.m.A <- psi.idot - A
psi.dotj.m.A <- psi.dotj - A
}
else {
wval2 <- w.new(d1 = y, d2 = x)
A <- wval2$A
s00 <- wval2$s00
s11 <- wval2$s11
var1A <- wval2$var1A
psi.idot <- wval2$psi.idot
psi.dotj <- wval2$psi.dotj
}
list(psi.idot = psi.idot, psi.dotj = psi.dotj, s00 = s00, s11 =
s11, var1A = var1A, A = A)
}
get.A <- function(dataf, index1, index2, datafilter = as.numeric)
{
f <- function(i)
{
return(W1(datafilter(dataf[i, index1]), datafilter(dataf[i,
index2]))$A)
}
return(sapply(1:length(dataf[, 1]), f))
}
get.var <- function(dataf, index1, index2, datafilter = as.numeric)
{
f <- function(i)
{
return(W1(datafilter(dataf[i, index1]), datafilter(dataf[i,
index2]))$var1A)
}
return(sapply(1:length(dataf[, 1]), f))
}
# Concordance with SAM statistic
# (X, L and genenames are assumed to exist in the calling environment)
no.concodance<-
function(bs)
{
nb <- length(bs)
table <- c()
for(i in 1:nb) {
b <- bs[i]
t1 <- t.stat(X, L)
a2.t <- statd.gnames(t1, genenames, crit = b)
a3.t <- data.frame(rank = 1:b, gnames = a2.t$gnames, d =
a2.t$d)
sam1 <- sam.ah.func(X, L, stat.only = TRUE)
a2.samr <- statd.gnames(sam1, genenames, crit = b)
a3.samr <- data.frame(rank = 1:b, gnames = a2.samr$gnames,
d = a2.samr$d)
aucd1 <- AUC.ah.func(X, L, stat.only = TRUE)
a2.aucd <- statd.gnames(aucd1, genenames, crit = b)
a3.aucd <- data.frame(rank = 1:b, gnames = a2.aucd$gnames,
d = a2.aucd$d)
auc1 <- AUC5.ah.func(X, L, stat.only = TRUE)
a2.auc <- statd.gnames(auc1, genenames, crit = b)
a3.auc <- data.frame(rank = 1:b, gnames = a2.auc$gnames,
d = a2.auc$d)
no.t <- length(intersect(as.vector(a3.t$gnames), as.vector(
a3.samr$gnames)))
no.aucd <- length(intersect(as.vector(a3.aucd$gnames),
as.vector(a3.samr$gnames)))
no.auc <- length(intersect(as.vector(a3.auc$gnames),
as.vector(a3.samr$gnames)))
table <- rbind(table, c(b, no.t, no.aucd, no.auc))
}
colnames(table) <- c("b", "no.t", "no.aucd", "no.auc")
return(table)
}
AUC.pair<-
function(sigs, dataf, index1, index2)
{
nb <- length(sigs)
fal <- matrix(0, ncol = nb, nrow = dim(dataf)[1])
for(i in 1:nb) {
fal[, i] <- AUC1EQ2(sigs[i], dataf, index1, index2)
}
meana2 <- function(x)
{
mean(abs(x), na.rm = T)
}
apply(fal, 1, meana2)
}
# Simulation Code
# R= Number of simulation
# m= number of genes
# n1=number of samples for treatment group
# n2=number of samples for control group
# d= treatment effect
# pde= proportion of differentially expressed genes
# r=correlation coefficient
# nclass= number of correlated class
frac.genes <- function(R, m = 100, n1 = 10, n2 = 10, r = 0.8, nclass = 5,
sigma2 = 4, d = 2, pde = 0.1)
{
degenes.corauc <- NA
degenes.modt <- NA
degenes.auc <- NA
# degenes.auca2<-NA
for(b in 1:R) {
X1 <- simulated(m = m, n1 = n1, n2 = n2, r, nclass, sigma2,
d, pde)
X <- X1$X
degenes <- paste(c("g"), X1$degenes, sep = "")
genenames <- row.names(X)
L <- rep(0:1, c(n1, n2))
degenes.corauc[b] <- corauc(X, L, genenames, degenes)
A.stat <- get.W(X, 1:n1, (n1 + 1):(n1 + n2))
rA <- statd.gnames(A.stat, genenames, crit = 10)
degenes.auc[b] <- length(intersect(rA$gnames, degenes))
design <- model.matrix( ~ L)
fit1 <- lmFit(X, design, method = "ls")
fit21 <- contrasts.fit(fit1, c(0, 1))
fiteb1 <- eBayes(fit21)
topt5 <- toptable(number = 10, genelist = genenames, fit
= fit21, eb = fiteb1, adjust = "fdr")
degenes.modt[b] <- length(intersect(topt5$ID, degenes))
}
DEgenes.corauc <- mean(degenes.corauc)
DEgenes.modt <- mean(degenes.modt)
DEgenes.auc <- mean(degenes.auc)
# DEgenes.auca2 <- mean(degenes.auca2)
list(DEgenes.corauc = DEgenes.corauc, DEgenes.modt = DEgenes.modt,
DEgenes.auc = DEgenes.auc)
}
B. R Code for Chapter 5
### Correlation Plot
rgbcolor<-function (n = 50)
{
k <- round(n/2)
r <- c(rep(0, k), seq(0, 1, length = k))
g <- c(rev(seq(0, 1, length = k)), rep(0, k))
res <- rgb(r, g, rep(0, 2 * k))
res
}
plot.cor<-
function (x, nrgcols = 50, labels = FALSE, labcols = 1,
title = "")
{
n <- ncol(x)
corr <- cor(x)
image(1:n, 1:n, corr[, n:1], col = rgbcolor(nrgcols),
axes = FALSE, xlab = "", ylab = "")
if (length(labcols) == 1) {
axis(2, at = n:1, labels = labels, las = 2, cex.axis = 0.6,
col.axis = labcols)
axis(3, at = 1:n, labels = labels, las = 2, cex.axis = 0.6,
col.axis = labcols)
}
mtext(title, side = 3, line = 3)
box()
}
### Simulated Dataset for Correlated Samples
# (mvrnorm requires the MASS package)
simn<-function(m, nclass, n1, n2, r)
{
ngenes <- m/nclass
# ngenes is the number of genes per group
# nclass is Total number of groups
n <- n1 + n2
rmv <- function(mn, r)
{
u <- as.matrix(rep(1, ngenes))
B <- u %*% t(u) * r + (1 - r) * diag(rep(1, ngenes))
Sigma <- B * 4
x1 <- mvrnorm(n1, mn, Sigma)
x0 <- mvrnorm(n2, rep(0,ngenes), Sigma)
mat <- t(rbind(x1, x0))
return(mat)
}
mu<-rep(0,ngenes)
X <- c()
for(i in 1:nclass)
X <- rbind(X, rmv(mn = mu, r))
add1 <- rep(1, n1)
# add a unit shift to the first gene in the treatment samples,
# keeping X as an m-by-n matrix
X1 <- X[, 1:n1]
X1[1, ] <- X1[1, ] + add1
X2 <- X[, (n1 + 1):n]
X <- data.frame(X1, X2)
genenames <- paste(c("g"), 1:m, sep = "")
row.names(X) <- genenames
return(X)
}
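The covariance matrix built inside simn's rmv is compound symmetry (equal correlation r, variance 4), constructed as (u u' r + (1 - r) I) * 4. A small deterministic check of that structure:

```r
ngenes <- 4; r <- 0.8
u <- as.matrix(rep(1, ngenes))
B <- u %*% t(u) * r + (1 - r) * diag(rep(1, ngenes))
Sigma <- B * 4   # variance 4 on the diagonal, covariance r*4 = 3.2 off it
```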
D(adj) statistic
# d1= expression values for treatment group
# d2= expression values for control group
w.new<-
function(d1, d2)
{
M <- length(d2)
N <- length(d1)
psi <- matrix(0, nrow = N, ncol = M)
for(i in c(1:N)) {
psi[i, ] <- ifelse(d1[i] > d2, 1, 0)
}
A <- sum(psi)/(M * N)
A <- max(A, 1 - A)   # fold so that the larger of A and 1 - A is reported
psi.idot <- apply(psi, 1, sum)/M
psi.dotj <- apply(psi, 2, sum)/N
s00 <- sum((psi.idot - A)^2)/(N - 1)
s11 <- sum((psi.dotj - A)^2)/(M - 1)
var1A <- s00/N + s11/M
psi.idot.m.A <- psi.idot - A
psi.dotj.m.A <- psi.dotj - A
list(M = M, N = N, psi.idot = psi.idot, psi.dotj = psi.dotj,
psi.idot.m.A = psi.idot.m.A, psi.dotj.m.A = psi.dotj.m.A,
s00 = s00, s11 = s11, var1A = var1A, A = A)
}
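The A statistic in w.new is the empirical AUC, the proportion of (treatment, control) pairs with d1 > d2, and therefore equals the Mann-Whitney U statistic divided by M*N (before the fold toward values above 0.5). A quick self-contained check on toy values:

```r
d1 <- c(2.1, 3.5, 4.2, 5.0)           # "treatment" values
d2 <- c(1.0, 2.5, 3.0)                # "control" values
A <- mean(outer(d1, d2, ">"))         # empirical P(d1 > d2)
U <- as.numeric(wilcox.test(d1, d2)$statistic)
U.scaled <- U/(length(d1) * length(d2))
```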
psi.m.A<-
function(x, y) {
M <- length(y)
N <- length(x)
psi <- matrix(0, nrow = N, ncol = M)
for(i in c(1:N)) {
psi[i, ] <- ifelse(x[i] > y, 1, 0)
}
Ai <- sum(psi)/(M * N)
if(Ai >= 0.5) {
A <- Ai
psi.idot <- apply(psi, 1, sum)/M
psi.dotj <- apply(psi, 2, sum)/N
psi.idot.m.A <- psi.idot - A
psi.dotj.m.A <- psi.dotj - A
s00 <- sum((psi.idot - A)^2)/(N - 1)
s11 <- sum((psi.dotj - A)^2)/(M - 1)
var1A <- s00/N + s11/M
}
else {
# flipped orientation: rows of psi now index y, columns index x
psi <- matrix(0, nrow = M, ncol = N)
for(i in c(1:M)) {
psi[i, ] <- ifelse(y[i] > x, 1, 0)
}
A <- sum(psi)/(M * N)
# row means of the flipped psi play the role of psi.dotj in the
# original orientation, and column means the role of psi.idot
psi.dotj <- apply(psi, 1, sum)/N
psi.idot <- apply(psi, 2, sum)/M
psi.idot.m.A <- psi.idot - A
psi.dotj.m.A <- psi.dotj - A
s00 <- sum((psi.idot - A)^2)/(N - 1)
s11 <- sum((psi.dotj - A)^2)/(M - 1)
var1A <- s00/N + s11/M
}
list(psi.idot = psi.idot, psi.dotj = psi.dotj, psi.idot.m.A =
psi.idot.m.A, psi.dotj.m.A = psi.dotj.m.A)
}
get.A <-
function(dataf, index1, index2, datafilter = as.numeric)
{
f <- function(i)
{
return(w.new(datafilter(dataf[i, index1]), datafilter(
dataf[i, index2]))$A)
}
return(sapply(1:length(dataf[, 1]), f))
}
get.var <-
function(dataf, index1, index2, datafilter = as.numeric) {
f <- function(i)
{
return(w.new(datafilter(dataf[i, index1]), datafilter(
dataf[i, index2]))$var1A)
}
return(sapply(1:length(dataf[, 1]), f))
}
get.psi.idot.m.A <-
function(dataf, index1, index2, datafilter = as.numeric) {
f <- function(i)
{
return(w.new(datafilter(dataf[i, index1]), datafilter(
dataf[i, index2]))$psi.idot.m.A)
}
return(sapply(1:length(dataf[, 1]), f))
}
get.psi.dotj.m.A <-
function(dataf, index1, index2, datafilter = as.numeric) {
f <- function(i)
{
return(w.new(datafilter(dataf[i, index1]), datafilter(
dataf[i, index2]))$psi.dotj.m.A)
}
return(sapply(1:length(dataf[, 1]), f))
}
COVA1A2<-
function(dataf, topindex, index1, index2)
{
values.psi.idot.m.A <- get.psi.idot.m.A(dataf, index1, index2)
values.psi.dotj.m.A <- get.psi.dotj.m.A(dataf, index1, index2)
v1 <- as.vector(values.psi.idot.m.A[, topindex])
v2 <- as.matrix(values.psi.idot.m.A)
N <- dim(v2)[1]
s10 <- as.vector((v1 %*% v2)/(N - 1))
v3 <- as.vector(values.psi.dotj.m.A[, topindex])
v4 <- as.matrix(values.psi.dotj.m.A)
M <- dim(v4)[1]
s01 <- as.vector((v3 %*% v4)/(M - 1))
covariance <- s10/N + s01/M
covariance
}
cov12<-
function(x, y)
{
fj<-
function(psi, psi.idot, psi.dotj, A)
{
fi <- function(psi, psi.idot)
{
B <- matrix(0, nrow = nrow(psi), ncol = ncol(psi))
for(i in c(1:nrow(psi))) {
B[i, ] <- psi[i, ] - psi.idot[i]
}
B
}
B <- fi(psi, psi.idot)
C <- matrix(0, nrow = nrow(psi), ncol = ncol(psi))
for(j in c(1:ncol(psi))) {
C[, j] <- B[, j] - psi.dotj[j]
}
C + A
}
# rebuild psi, its marginal means, and A in the unflipped orientation,
# since psi.m.A does not return psi or A
psi <- 1 * outer(x, y, ">")
A <- mean(psi)
psi.idot <- rowMeans(psi)
psi.dotj <- colMeans(psi)
fj(psi, psi.idot, psi.dotj, A)
}
s.2genes<-
function(index, dataf,index1,index2)
{
varA <- get.var(dataf, index1, index2)
varA1 <- varA[index]
get.cov<-COVA1A2(dataf,index,index1,index2)
table <- c()
for(i in 1:length(varA)) {
s <- sqrt(round(varA1+varA[i]-2*get.cov[i],5))
table <- cbind(table, c(s))
}
as.vector(table)
}
AUC1EQ2<-
function(index, dataf,index1,index2)
{
A <- get.A(dataf, index1, index2)
A1 <- A[index]
get.s<-s.2genes(index,dataf,index1,index2)
s0 <- quantile(get.s, .9)
table <- c()
for(i in 1:length(A)) {
stat <- (A1 - A[i])/(get.s[i]+s0)
table <- cbind(table, c(stat))
}
as.vector(table)
}