CONTRIBUTION TO STATISTICAL
TECHNIQUES FOR IDENTIFYING
DIFFERENTIALLY EXPRESSED GENES IN
MICROARRAY DATA
By
Ahmed Hossain
SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
AT
UNIVERSITY OF TORONTO
6TH FLOOR, HEALTH SCIENCES BUILDING, 155 COLLEGE STREET, TORONTO, ON
M5T 3M7, CANADA
MARCH 2011
© Copyright by Ahmed Hossain, 2011
Abstract
Contribution to Statistical Techniques for Identifying Differentially Expressed Genes in
Microarray Data
Ahmed Hossain
Doctor of Philosophy
Graduate Department of Dalla Lana School of Public Health
University of Toronto
2011
With the development of DNA microarray technology, scientists can now measure the
expression levels of thousands of genes (features or genomic biomarkers) simultaneously
in one single experiment. Robust and accurate gene selection methods are required to
identify differentially expressed genes across different samples for disease diagnosis or
prognosis. The problem of identifying significantly differentially expressed genes can be
stated as follows: Given gene expression measurements from an experiment of two (or
more) conditions, find a subset of all genes having significantly different expression levels
across these two (or more) conditions.
Analysis of genomic data is challenging due to high dimensionality of data and low
sample size. Currently several mathematical and statistical methods exist to identify
significantly differentially expressed genes. The methods typically focus on gene by gene
analysis within a parametric hypothesis testing framework. In this study, we propose
three flexible procedures for analyzing microarray data.
In the first method we propose a parametric approach based on a flexible
distribution, the Generalized Logistic Distribution of Type II (GLDII), and develop an
approximate likelihood ratio test (ALRT). Though the method considers gene-by-gene
analysis, the ALRT method with the GLDII distributional assumption appears to provide a
favourable fit to microarray data.
In the second method we propose a test statistic for testing whether the area under the
receiver operating characteristic curve (AUC) for each gene is greater than 0.5, allowing
different variances for each gene. This proposed method is computationally less intensive
and can identify genes that are reasonably stable with satisfactory prediction performance.
The third method is based on comparing the AUCs of a pair of genes and is designed
for selecting highly correlated genes in microarray datasets. We propose a
nonparametric procedure for selecting genes with expression levels correlated with that
of a “seed” gene in microarray experiments. The test proposed by DeLong et al. (1988)
is the conventional nonparametric procedure for comparing correlated AUCs. It uses a
consistent variance estimator and relies on asymptotic normality of the AUC estimator.
Our proposed method incorporates DeLong's variance estimation technique when comparing
pairs of genes and can identify genes with biologically sound implications.
In this thesis, we focus on the primary step in the gene selection process, namely, the
ranking of genes with respect to a statistical measure of differential expression. We assess
the proposed approaches by extensive simulation studies and demonstrate the methods
on real datasets. The simulation study indicates that the parametric method performs
well across a range of variance settings, sample sizes and treatment effects. Importantly,
the method is found to be less sensitive to contamination by noise. The proposed nonparametric
methods do not involve complicated formulas and do not require advanced programming
skills. Both methods can identify a large fraction of truly differentially expressed
(DE) genes, especially when sample sizes are large or outliers are present.
We conclude that the proposed methods offer good choices of analytical tools to identify
DE genes for further biological and clinical analysis.
To My Parents
Acknowledgements
The successful completion of this research work is not a result of only my own effort,
but is an aggregate of contributions from many others ranging from my family members
to teachers of this department.
First, I would like to acknowledge my debt of honor to ALLAH, the Almighty, for enabling
me to accomplish this research work successfully.
I would like to express my heartiest thanks to my supervisor Joseph Beyene, Ph.D.,
for his help and guidance, and for leading me into such an interesting area. Without him this
thesis wouldn't have been this thesis. My heartiest gratitude also goes to my honorable
teacher Professor Andrew R. Willan, Ph.D., for giving me the opportunity to do my Ph.D.
here and permitting me to undertake this research work as a partial fulfillment of my
Ph.D. degree in this department at the University of Toronto. I am also thankful to the Hospital
for Sick Children for their financial support, which helped me to proceed with this research.
I also gratefully acknowledge the overseas scholarship scheme of the University of
Toronto for paying my tuition fees, and the studentship from the school for providing
my living maintenance.
I also wish to thank Professor Angelo Canty, Ph.D., Laurent Briollais, Ph.D., and
David Tritchler, Ph.D., for reviewing the thesis and providing valuable input to
improve it.
Of course, I am grateful to my parents for their love and encouragement throughout
my studies. Without them this work would never have come into existence (literally).
Finally, I wish to thank the following: Ping Zhao Hu; Shahnaz (for changing my
life from worse to bad); Zafeera (for changing my life from bad to best); and my two
sisters.
Toronto, Ontario Ahmed Hossain (March 3, 2011)
Contents
1 Introduction to Microarray Technology 1
1.1 Measuring Gene Expression Using Microarrays . . . . . . . . . . . . . . . 1
1.1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Microarray Technologies . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Microarray Gene Expression Dataset . . . . . . . . . . . . . . . . 6
1.2 Background Adjustment and Normalization . . . . . . . . . . . . . . . . 8
1.3 Challenges of Microarray Expression Analysis . . . . . . . . . . . . . 9
1.4 Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Assessing Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6.1 An Application to Calculate FDR . . . . . . . . . . . . . . . . . . 16
1.7 Importance of Replicates . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.8 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.8.1 Preprocessing packages . . . . . . . . . . . . . . . . . . . . . . . . 19
1.8.2 Testing packages . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.9 Aim of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.10 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Popular Statistical Methods for Identifying Differentially Expressed
Genes 25
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Fold Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 t-test and ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Nonparametric Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.1 Wilcoxon Rank Sum Test (RST) . . . . . . . . . . . . . . . . . . 28
2.4.2 ROC Methodology for Gene Expression Analysis . . . . . . . . . 29
2.5 SAM-Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6 Empirical Bayes Approach . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.7 Posterior Odds Statistic (LIMMA) . . . . . . . . . . . . . . . . . . . 35
2.8 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Approximate Likelihood Ratio Method 39
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Generalized Logistic Distribution of Type II (GLDII) . . . . . . . . . . . 41
3.3 Motivation and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5 Comparison between AMLE and MLE for location and scale parameters
of GLDII . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.6 FDR Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.7 Permutation based p-values and AUC Estimation . . . . . . . . . . . . . 55
3.8 Comparison with Other Methods . . . . . . . . . . . . . . . . . . . . . . 55
3.8.1 Simulation Experiment . . . . . . . . . . . . . . . . . . . . . . . 56
3.8.2 Duchenne Muscular Dystrophy (DMD) Data . . . . . . . . . . . 59
3.8.3 Golub Leukemia Data: Classification Between ALL and AML . . 63
3.9 Multiclass Microarray Data . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.9.1 Example of Multi-class microarray data: SRBCT Dataset . . . . . 67
3.10 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4 Nonparametric Method for Detecting Differentially Expressed Genes:
Single Gene Analysis 72
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Parametric versus Nonparametric Methods . . . . . . . . . . . . . . . . . 74
4.3 General Discussion on ROC analysis . . . . . . . . . . . . . . . . . . . . 76
4.4 Motivation of this Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5.1 Single Gene Analysis: AUC . . . . . . . . . . . . . . . . . . . . 78
4.6 FDR Estimation with dg statistic . . . . . . . . . . . . . . . . . . . . . . 82
4.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.7.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.7.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.8 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5 Nonparametric Method for Detecting Highly Correlated Differentially
Expressed Genes 93
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Ding’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.1 Correlation Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.1 Comparison of Two ROC Curves . . . . . . . . . . . . . . . . . . 96
5.3.2 Permuted P-values and FDR estimation with D(adj) statistic . . 98
5.3.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.4 Application: Colon Cancer Data . . . . . . . . . . . . . . . . . . . 102
5.3.5 Effect of Seed Gene: Affymetrix spike-in study . . . . . . . . . . . 106
5.4 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6 Conclusion 110
6.1 Thesis Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2.1 Improving the ALRT method . . . . . . . . . . . . . . . . . . . . 113
6.2.2 Possible extension for D(adj) statistic . . . . . . . . . . . . . . . 113
Bibliography 115
CHAPTER 1
Introduction to Microarray Technology
This chapter provides a concise overview of the data-analytic tasks associated with
microarray studies. We give a brief orientation before moving on to the methods for
identifying differentially expressed genes. Here we introduce the foundations of microarray
technology and describe the limitations, concepts and methods in microarray gene
expression analysis used in this thesis.
1.1. Measuring Gene Expression Using Microarrays
1.1.1. Background
The genome consists of long deoxyribonucleic acid (DNA) molecules which are neatly
packed up into chromosomes in the nucleus of each cell. A DNA molecule is a nucleic
acid that consists of two long chains of nucleotides (or strands) twisted together into
a double helix and joined by hydrogen bonds. Each strand is built up by a sequence
of the bases adenine (A), thymine (T), guanine (G) and cytosine (C). The bases are
paired so that an A in one strand can only bind to a T in the other, and a C can only
bind to a G. The two strands are called complementary, since the sequence of one strand
determines that of the other. DNA carries the cell's genetic information and hereditary
characteristics via its nucleotides and their sequence, and is capable of self-replication
and RNA synthesis. Some segments of the DNA sequence contain genetic information,
and these are loosely called genes.
Microarray technology has revolutionized modern biological research by permitting
the study of thousands of genes simultaneously. The central dogma of molecular
biology states that genetic information flows from DNA to messenger RNA (mRNA)
and from mRNA to proteins, which perform gene functions (Crick, 1970). The amount
of mRNA in this process indicates the level of gene expression; that is, the extent to
which a gene is used to produce proteins is known as gene expression. Note that there
are different levels of gene expression, one at the transcription level, where RNA is
made from DNA, and one at the protein level, where protein is made from mRNA.
Microarrays measure gene expression levels on a genomic scale by examining the
amount of mRNA in cell cultures or tissues, and they provide insight into gene function
by quantitatively studying gene expression.
1.1.2. Microarray Technologies
Microarray technology has been applied to many situations, including disease diagnosis,
drug discovery, and toxicology [Schulze and Downward (2001), Brown and
Botstein (1999), Debouck and Goodfellow (1999), Macoska (2002), Lobenhofer et al.
(2001)]. It is therefore important to measure gene expression from the sample under
study. Several techniques are available for measuring gene expression, including se-
rial analysis of gene expression (SAGE), cDNA library sequencing, differential display,
cDNA subtraction, multiplex quantitative RT-PCR, and gene expression microarrays.
Microarrays quantify gene expression by measuring the hybridization, or matching, of
DNA immobilized on a small glass, plastic or nylon matrix to mRNA representation
from the sample under study. There are methods for detecting mRNA
expression of a single gene or a few genes. The novelty of a microarray is that it
quantifies transcript levels on a global scale by quantifying transcript abundance of
thousands of genes simultaneously. This novelty has allowed biologists to take a global
3
perspective on life processes and to study the role of all genes or all proteins at once
(Nguyen et al. (2002)). A detailed explanation of how a microarray experiment is
done can be found in Sambrook and Russell (2001) and Dietz et al. (2003). Although
there are different types of microarrays, all follow these common basic procedures:
• Chip manufacture: A microarray is a small chip (made of chemically coated
glass, nylon membrane or silicon) onto which tens of thousands of DNA molecules
(probes) are attached in fixed grids. Each grid cell relates to a DNA sequence.
• mRNA preparation, labeling and hybridization: Typically, two mRNA
samples (a test sample and a control sample) are reverse transcribed into cDNAs
(targets), labeled using either fluorescent dyes or radioactive isotopes, and then
hybridized with the cloned sequences on the surface of the chip.
• Chip scanning: Chips are scanned to read the signal intensity that is emitted
from the labeled and hybridized targets.
Microarray technologies include several kinds of so-called cDNA arrays and oligonu-
cleotide arrays. Although both exploit hybridization, they differ in how DNA se-
quences are laid on the array and in the length of these sequences. Schena (2000)
reviews in detail the technical aspects of different microarray technologies. Overviews
of different microarray technologies can also be found in Nguyen et al. (2002). However,
most of the results in this thesis are applicable to the oligonucleotide systems developed by
Affymetrix. Here we give a brief overview of the Affymetrix array.
Oligonucleotide Array: The Affymetrix Array
Microarray experiments using Affymetrix technology are widely used. In Affymetrix arrays expression of each
gene is measured by comparing hybridization of the sample mRNA to a set
of probes, composed of 11-20 pairs of oligonucleotides, each of length 25 base
pairs. The first type of probe in each pair is known as perfect match (PM) and is
taken from the gene sequence. Each PM probe is paired with a mismatch (MM)
probe that is created by changing the middle (13th) base of the PM sequence
to reduce the rate of specific binding of mRNA for that gene. The goal of MMs
is controlling for experimental variation and nonspecific binding of mRNA from
other parts of the genome. These two probes (PM, MM) are referred to as a
probe pair. Under relatively ideal situations, when the gene is expressed in the
cell sample, high intensity is expected for the PM probe and low intensity for
the MM probe. In this procedure, an RNA sample is prepared, labeled with a
fluorescent dye, and hybridized to an array. Unlike two-channel arrays, a single
sample is hybridized on a given array. Arrays are then scanned and images are
produced and analyzed to obtain a fluorescence intensity value for each probe,
measuring hybridization for the corresponding oligonucleotide. The software
utilities provided with the Affymetrix suite summarize the probe set intensities
to form one expression measure for each probe set. Oligonucleotide arrays are
discussed by Lockhart et al. (1996); details on Affy arrays can also be found in
Affymetrix (1999).
As microarray technology evolves, study of as many as 20,000 genes is becoming
routine [Reyal et al. (2005), Harbig et al. (2005)]. With the capability to screen large
portions of the human genome, microarrays are typically used for screening large
numbers of genes. In order to obtain meaningful information for the organism being
studied, multiple levels of analysis are performed on the primary data. Figure 1.1
displays the flowchart for a typical microarray experiment. From the analytical point
of view we can separate the microarray experiment into two stages:
Probe-level analysis The first stage of the analysis is probe-level analysis (often
called low-level analysis), which summarizes the raw data to obtain a single expression
value for each gene or probe from the experimental data. The probe-level
analysis should provide reliable measurements of gene or probe expression
levels, leading us to a second stage of analysis, called high-level analysis.
The low level analysis, often associated with the pre-processing stage within the
Figure 1.1: Flowchart for a typical microarray experiment. This figure is modified from http://www.humgen.nl/microarray_analysis.html.
microarray, has increasingly become an area of active research, traditionally in-
volving techniques from classical statistics. Statisticians explore opportunities
for the application of various methods to several important probe-level microar-
ray analysis problems, such as monitoring gene expression, transcript discovery,
genotyping and resequencing.
High level analysis The high level analysis provides answers to the biological ques-
tions that motivate the microarray experiment. Different types of high level
analysis include clustering, classification and projection methods. We concen-
trate our thesis work on high level analysis.
1.1.3. Microarray Gene Expression Dataset
Microarrays produce massive amounts of data. These data, like genome sequence
data, can help us to gain insights into underlying biological processes where they can
be queried, compared and analyzed. A DataMatrix object stores experimental data
in a matrix, with rows typically corresponding to gene names or probe identifiers,
and columns typically corresponding to sample identifiers. A DataMatrix object also
stores metadata, including the gene names or probe identifiers (as the row names)
and sample identifiers (as the column names).
Conceptually, a gene expression dataset can be regarded as consisting of three
parts: the gene expression data matrix, gene annotation and sample annotation (see
Figure 1.2). Gene annotation may include the gene names and sequence information,
location in the genome, a description of the functional roles for known genes and
links to the respective entries in gene sequence databases. Sample annotation may
provide information about the part of the organism from which the sample was taken
or which cell type was used, or whether the sample was treated, and if so what
was the treatment (e.g. a chemical compound and concentration). Samples may
also be related: for instance, they may form a time course. In the Affymetrix raw
data matrix, each row corresponds to a perfect match (PM) probe, and each
column corresponds to an Affymetrix CEL file. Each CEL file is generated from a
separate chip of the same type for each sample, and the image processing software
stores the location and two summary statistics (a mean and standard deviation) for
each probe. It is also important to have information about the experiment itself.
This could include identification number in a database, experimental protocols used,
preprocessing information, and so forth.
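The three-part view described above (expression matrix, gene annotation, sample annotation) can be mimicked with a toy sketch in Python using pandas; all probe, gene and sample identifiers below are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Expression matrix: rows = probe identifiers, columns = sample identifiers
# (a DataMatrix-like layout; values and names are made up for illustration).
expr = pd.DataFrame(
    [[7.1, 6.8, 9.2],
     [5.4, 5.6, 5.3]],
    index=["probe_1001_at", "probe_1002_at"],      # gene/probe identifiers
    columns=["tumour_1", "tumour_2", "normal_1"],  # sample identifiers
)

# Gene annotation, keyed by the same probe identifiers as the matrix rows.
gene_annot = pd.DataFrame({"symbol": ["GENE_A", "GENE_B"]}, index=expr.index)

# Sample annotation, keyed by the same sample identifiers as the columns.
sample_annot = pd.DataFrame(
    {"group": ["tumour", "tumour", "normal"]}, index=expr.columns
)
```

Keeping the three tables keyed on shared row and column names is what lets expression values, gene descriptions and sample conditions be queried and compared together.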
Figure 1.2: Conceptual view of the gene expression data matrix. A gene expression database can be regarded as consisting of three parts: the gene expression data matrix, gene annotation and sample annotation. This figure is from http://www.ebi.ac.uk/2can/databases/microarray2.html.
1.2. Background Adjustment and Normalization
Microarray experiments are complicated tasks involving a number of stages, most of
which have the potential to introduce error. These errors can mask the biological
signal of interest. A component of this error may be systematic, i.e., bias is present,
and it is this component that needs to be removed in order to gain as much insight as
possible into the underlying biology in a microarray experiment. This can be achieved
by using a number of well-established statistical methods. It is difficult to account
for all characteristics of gene expression data in a single model. Most models handle
the experimental data in the following steps, although they may not be performed in this
order:
1. possible background correction, which removes background noise from signal
intensities;
2. possible normalization, which is intended to even out unwanted non-biological
variation across arrays; and
3. summarization, which gives the expression index.
Most classical normalization procedures are global approaches, based on normaliza-
tion of the overall mean or median array intensity to a common standard, such as those
implemented in the Affymetrix GeneChip software (Affymetrix Inc., Santa Clara,
California). Detailed descriptions of Affymetrix normalization methods can be found
in the Version 5.0 Affymetrix Microarray Suite, MAS 5.0 User Guide. Normaliza-
tion methods implemented are similar to scaling and enable comparison analysis of
an expression and baseline array. Ideally, expression indices should be both precise
(low variance) and accurate (low bias) after normalization. Until now a number of
methods of background correction and normalization have been proposed. These
methods include the Lowess normalization by Chen et al. (1997), global linear nor-
malization by Yang et al. (2001), and variance stabilization method of Huber et al.
(2002). Many researchers use ANOVA models for microarray data that can account
for experiment-wide systematic effects that could bias inferences made on the data
from the individual genes. Kerr et al. (2000) proposed an ANOVA model based
on the logarithms of the original fluorescence measurements that can incorporate ex-
perimental factors and gene-specific interaction effects. Robust multiarray average
(RMA), published by Irizarry et al. (2003), is a commonly used method for prepro-
cessing and normalizing Affymetrix data. One component of the RMA method, quantile
normalization, is applicable to all data types. A number of algorithms,
including the aforementioned, have been implemented in the Bioconductor R project
(http://www.bioconductor.org/).
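As a concrete illustration of one of these steps, quantile normalization forces every array to share the same empirical distribution: sort each array's values, average across arrays at each rank, and map the averages back by rank. The sketch below is a minimal version (ties are broken arbitrarily by the sort; production implementations such as those in Bioconductor handle ties more carefully).

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns of a (genes x arrays) matrix so that
    every array (column) ends up with the same empirical distribution."""
    # rank of each value within its own column (double argsort trick)
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # average expression at each rank, taken across arrays
    means = np.sort(x, axis=0).mean(axis=1)
    # substitute the rank-wise averages back into each column
    return means[ranks]
```

After normalization, sorting any two columns yields identical vectors, which is exactly the "common distribution" property the method is designed to enforce.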
Following normalization of the data, a statistical analysis is performed to identify
differentially expressed genes and to find similarities or patterns in gene expression profiles.
The overall results must then be applied to the biological model of interest for its
meaningful and appropriate interpretation.
1.3. Challenges of Microarray Expression Analysis
Gene expression data from microarrays present many challenging problems for ana-
lysts. The data exhibit complicated error structures which are not widely recognized.
High Dimensionality Owing to the large number of genes m and small number of
samples n (m ≫ n), microarray data pose big challenges for statistical
analysis. An obvious problem arising from the "large m, small n" setting is overfitting.
Moreover, with small sample sizes, parametric statistical tests of the differences
between mean levels of gene expression for each gene will be more sensitive
to the assumed distributional forms of the expression data, and the resulting p-values
may not be accurate.
Distribution The measured expression levels are often non-normally distributed.
Also, it is often assumed that all genes have the same-shaped distribution, which
may or may not hold in practice.
Unequal Sample Sizes Very often the sample sizes for the treatment group and the
control group are unequal. If the sizes of the two groups are substantially
different, the model is likely to fit the larger sample better. For example,
if the treatment group has 10 subjects and the control group has 50 subjects, the
model will likely fit the control group better. In this scenario there is a chance
that the two populations have different variances and that one population will be
more skewed than the other, in which case a t-test will not give robust answers
about the data. Using the t-test and assuming normality, if the population difference
between the two groups is 1 and the standard deviations are 1.5 and 1 respectively,
then at the 5% significance level the statistical power, i.e., the chance of
correctly rejecting the null hypothesis, is only about 52.4%
(from http://www.dssresearch.com/toolkit/spcalc/power.asp). On the
other hand, the power becomes 97.5% with 50 subjects per group.
Heterogeneity of variances Gene expression values contain variability caused by
technical noise as well as natural biological variation. A properly chosen transformation
can stabilize the variance and improve the statistical properties of the analysis.
Correlation Structure In reality, many genes are dependent to some degree, be-
cause they are biologically related, whereas others are totally independent. The
correlations in microarray data are not only strong but they are also long-
ranged, involving thousands of genes. The consequences of ignoring this fact
may produce unstable results in estimation or testing [Qiu et al. (2005), Qiu and
Yakovlev (2006), Klebanov and Yakovlev (2007)]. It has gradually become clear
that inter-gene correlations are not just a nuisance complicating
statistical inference from microarray gene expression data, but a rich source of
useful information.
1.4. Filtering
The concept of filtering can be visualized as taking a large matrix of data (possibly an
entire database) and making a smaller matrix. The microarray dataset is quite large
and a lot of the information corresponds to genes that do not show any interesting
changes during the experiment. If certain genes are not associated with the clinical
outcome, it is important to exclude them from the predictive model. After the removal
of irrelevant data, we are left with good-quality data, most of which is probably
still uninteresting. If the goal of the study is to find a couple of dozen genes for
further study of a biologically interesting phenomenon, it is a good idea to remove
the uninteresting part of the data before proceeding with the analysis. Uninteresting
data comprise the genes that do not show any expression changes during the
experiment. A widely used filtering procedure for Affymetrix data is described in the
following.
Affymetrix oligonucleotide chips (Affymetrix, 2002) represent each gene with a
probe set consisting of a number of probe pairs. A probe pair is composed of a
perfect match (PM) probe and a mismatch (MM) probe. An algorithm in MAS
attempts to identify probe sets that are expressed on a particular chip by reporting
a present/absent call. The algorithm compares the expression PM of the PM probe
to the expression MM of the MM probe using a relative expression value given by

(PM − MM) / (PM + MM).
For each probe set, the Wilcoxon signed-rank procedure (Mason et al. (1989)) is
used to test the null hypothesis that the median relative expression of the probe pairs
is less than or equal to some value τ . The MAS default value of τ is 0.015. The call
is then generated by comparing the p-value p from this test with two thresholds, ϵ1
and ϵ2, where ϵ1 < ϵ2. The probe set is declared “present” (i.e., expressed) if p < ϵ1,
“absent” (i.e., unexpressed) if p ≥ ϵ2, and “marginal” if ϵ1 ≤ p < ϵ2. The MAS
defaults for ϵ1 and ϵ2 are 0.04 and 0.06, respectively. These present-absent calls are
often used as part of a filtering criterion that removes genes that do not appear to be
expressed in the biological system under investigation. A widely used technique, the
m/n filter, removes all probe sets having fewer than m present calls among a set of n
chips. Pounds and Cheng (2005) derived the statistical properties of the commonly
used m/n filter and proposed the pooled p-value filter and the error-minimizing pooled
p-value filter. Additionally, they proposed and demonstrated an approach to estimate
the number of interesting probes removed by a filter and assess the influence of the
filter on subsequent analysis.
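The detection-call logic and the m/n filter described above can be sketched as follows. This is a simplified illustration under the stated defaults (τ = 0.015, ε₁ = 0.04, ε₂ = 0.06), not Affymetrix's exact implementation, which includes additional probe-level handling.

```python
import numpy as np
from scipy import stats

def detection_call(pm, mm, tau=0.015, eps1=0.04, eps2=0.06):
    """MAS-style present/absent call for one probe set on one chip.
    pm, mm: arrays of perfect-match and mismatch probe intensities."""
    r = (pm - mm) / (pm + mm)  # relative expression for each probe pair
    # One-sided Wilcoxon signed-rank test of H0: median relative expression <= tau
    p = stats.wilcoxon(r - tau, alternative="greater").pvalue
    if p < eps1:
        return "P"   # present (expressed)
    if p >= eps2:
        return "A"   # absent (unexpressed)
    return "M"       # marginal

def mn_filter(calls, m):
    """m/n filter: keep a probe set only if it earns at least m present
    calls among its n chips."""
    return sum(c == "P" for c in calls) >= m
```

A probe set whose PM intensities clearly dominate the MM intensities receives a "P" call, while one where MM is as bright as PM receives an "A", and the m/n filter then drops probe sets that are rarely called present across the experiment.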
1.5. Analysis
Probe Level Analysis The noisy nature of microarray experiment data has moti-
vated the development of numerous algorithms for estimating gene expression
values. Most models handle the experimental data in several steps, such as
background correction, normalization and summarization. There are several
normalization methods in the published literature:
• Affymetrix Microarray Suite Software version 5.0, MAS 5.0 (Affymetrix,
2002),
• Model-based Expression Index, MBEI (Li and Wong (2001)),
• Robust Multi-array Average, RMA (Irizarry et al. (2003)).
A further issue that needs to be addressed is the difference between the two most
commonly used microarray technologies: spotted cDNA microarrays, which re-
port relative differences in gene expression between two samples, and oligonu-
cleotide microarrays, which report absolute expression levels. Normalization
techniques for one microarray technology might not apply to another, owing to
differences in assumptions and the distributions of the output measurements.
Various microarray technologies for measuring expression values have created
some challenging statistical problems. There are many sources of variation in a
typical experiment and these can be accounted for using statistical design and
analysis-of-variance methodology, although careful attention has to be given
to the high-dimensionality and complicated interactions. Statistical methods
invoked early in the data analysis pipeline can remove systematic errors and
improve subsequent inferences.
High Level Analysis The goal of microarray data analysis is to find relationships
and patterns in the data, and ultimately achieve new insights into the underly-
ing biology. One way to obtain such insights is to determine which genes are
differentially expressed under different conditions. Another important
function of microarray experiments is the classification of biological samples
using gene expression data. The goal of classification is to identify the dif-
ferentially expressed genes that may be used to predict class membership for
new samples. The classes are predefined and the task is to understand the
basis for the classification from a set of labeled objects. Examples of classifi-
cation methods range from linear discriminant analysis (Mardia et al. (1979))
to support vector machines (Brown et al. (2000)) or classification and regression
trees (Breiman et al. (1984)). Clustering applies when the classes are unknown
a priori and need to be discovered from the data. This provides both a visual
representation of the data and a method for measuring similarity between bio-
logical conditions. Widely used methods for clustering microarray data include
hierarchical clustering, K-means and self-organizing maps. The most popular clustering
methods are nicely reviewed by Quackenbush (2001).
1.6. Assessing Significance
There are several approaches for reporting the degree of reliability of results in the
analysis aimed at selecting differentially expressed genes. Test statistics are usually
connected to p-values that are used to control type I error probabilities. The methods
are seen as providing rankings of the genes based on p-values of the parametric tests.
Even if p-values could be derived in gene expression analysis, they would be more
misleading than informative, because of the strong distributional assumptions they
would be based upon. Again conventional approaches based on gene-specific p-values
are generally criticized on the grounds of the multiplicity of comparisons involved
for identifying differentially expressed genes (Dudoit et al. (2002)). For example, a
p-value cutoff of 0.05 means a 5% probability of making a type I error (false
positive) on any one gene, so we expect 500 type I errors (false positive genes) if we
test 10000 genes. That is usually not acceptable. Multiplying the p-value by the
number of genes in the experiment, we can get an estimate of the number of false
positives. We can take this false positive rate into account when planning further
experiments.
Table 1.1: Possible outcomes of multiple hypothesis testing

                 Accepted   Rejected
True H0            T11        T12       m0
True H1            T21        T22       m1
                  m - R        R         m
Multiple testing approaches have relied on the Family-Wise Error Rate (FWER)
and the False Discovery Rate (FDR). The FWER is the probability of at least one false
positive among the genes selected as differentially expressed, that is,
FWER = Pr(T12 > 0)
where T12 is defined in Table 1.1. Bonferroni is perhaps the most widely used method
to control the FWER (see Hochberg and Tamhane (1987)). The Bonferroni correction
can be used to reduce the nominal level of significance. For example, if we allow
only a 5% probability of making one or more type I errors among all 10000 genes, then
with the Bonferroni correction the new cutoff becomes 0.05/10000. That is a very strict
cut-off, and for many purposes we can tolerate more type I errors in expectation.
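The arithmetic behind these two numbers is easy to verify; a short Python sketch mirroring the 10000-gene example above (illustrative only):

```python
# Mirror of the worked example in the text: with 10000 genes tested at
# alpha = 0.05 we expect about 500 false positives among true nulls, and the
# Bonferroni-corrected per-gene cutoff is 0.05/10000.
n_genes = 10000
alpha = 0.05

expected_false_positives = alpha * n_genes  # expected type I errors under the null
bonferroni_cutoff = alpha / n_genes         # per-gene cutoff controlling the FWER
print(expected_false_positives, bonferroni_cutoff)
```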
The false discovery rate (FDR) is another statistical method used in multiple
hypothesis testing to correct for multiple comparisons. In a list of rejected hypotheses,
the FDR controls the expected proportion of incorrectly rejected null hypotheses.
Thus it is the expected proportion of false positives among the genes declared to be
differentially expressed, that is,
FDR = E(T12/R | R > 0)
where T12 and R are defined in Table 1.1. Benjamini and Hochberg (1995) have
proposed a method for controlling the FDR at a specified level. After ranking the
genes according to significance (p-value) and starting at the top of the list, we accept
all genes where
p ≤ (T12/R) q
where T12 is the number of genes accepted so far, R is the total number of genes
tested and q is the desired FDR. For T12 > 1, this correction is less strict than a
Bonferroni correction. Because of this directly useful interpretation, FDR is a more
convenient scale to work on instead of the p-value scale. For example, if we declare
a collection of 100 genes with a maximum FDR 0.10 to be DE, then we expect a
maximum of 10 genes to be false positives. No such interpretation is available from
the p-value (Pawitan et al. (2005)). Tusher et al. (2001) and Storey and Tibshirani
(2003) select potential differentially expressed genes and then estimate the FDR of
the selected genes by resampling, given that at least one gene was selected. Storey
(2002) also introduced a q-value, which is a Bayesian concept and corresponds to
the posterior probability that a gene is not differentially expressed given the observed
data.
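The step-up rule described above can be sketched in a few lines of Python (the analyses in this thesis use R; the helper name `bh_reject` and the toy p-values are assumptions for illustration). It is a standard rendering of the Benjamini-Hochberg procedure: sort the p-values, find the largest rank k whose p-value is at most (k/m)q, and reject the k smallest.

```python
# A minimal sketch of the Benjamini-Hochberg step-up procedure: reject the k
# smallest p-values, where k is the largest rank with p_(k) <= (k/m) * q.

def bh_reject(pvalues, q):
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])  # indices of rejected hypotheses

pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.20, 0.35, 0.50, 0.62, 0.81]
print(bh_reject(pvals, q=0.10))  # genes 0 and 1 are rejected at q = 0.10
```

Note that p-values 0.039 and 0.041 survive a naive per-gene 0.05 cutoff but fail the BH thresholds 0.03 and 0.04 at q = 0.10.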
The FDR can also be estimated using permutations. Permutation is used to
estimate unadjusted p-values while avoiding parametric assumptions about the joint
distribution of the test statistics. An important assumption behind a permutation test
is that the observations are exchangeable under the null hypothesis. An important
consequence of this assumption is that tests of difference in locations require equal
variance. To estimate a p-value for the gth gene using permutation analysis, one first
calculates the observed test statistic Tg. Then one permutes the samples of the gene
expression data and recalculates the test statistic, T ∗g . Permuting entire samples of
this expression data creates a situation in which the response is independent of the
gene expression levels, while attempting to preserve the correlation structure and
distributional characteristics of the gene expression levels. The p-value is estimated
by counting the number of permuted statistics T*gb, b = 1, . . . , B, that are greater than or
equal to Tg and dividing by the total number of permutations, B. More details of
the permutation algorithm for calculating unadjusted p-values are given in Dudoit
et al. (2002), Storey and Tibshirani (2003) and also in Speed (2003). A more recent
paper by Lin et al. (2008) investigates the performance of the SAM method and the
Benjamini and Hochberg FDR procedure with regard to controlling the FDR.
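The permutation scheme just described can be sketched as follows (Python for illustration; the toy data, the number of permutations B, and the function names are assumptions, not from the thesis):

```python
# Sketch of a permutation p-value for one gene: permute the sample labels,
# recompute the test statistic T*_gb for each permutation b, and estimate the
# p-value as the fraction of permutations with |T*_gb| >= |T_g|.
import random
import statistics

def t_stat(x, y):
    # Welch-type statistic: difference in means over its standard error
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / ((vx / len(x) + vy / len(y)) ** 0.5)

def perm_pvalue(x, y, B=1000, seed=0):
    rng = random.Random(seed)
    observed = abs(t_stat(x, y))
    pooled = x + y
    count = 0
    for _ in range(B):
        rng.shuffle(pooled)  # permute the sample labels
        if abs(t_stat(pooled[:len(x)], pooled[len(x):])) >= observed:
            count += 1
    return count / B

x = [5.1, 4.8, 5.5, 5.0, 4.9]  # toy "control" expression values
y = [6.2, 6.8, 6.0, 6.5, 6.4]  # toy "treatment" expression values
p = perm_pvalue(x, y)
print(p)  # small: the two toy groups are well separated
```

With only n1 + n2 = 10 samples the permutation distribution is coarse, which illustrates the granularity of permutation p-values noted for small experiments.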
When controlling the FDR, it is important to be aware of the sensitivity or false
negative rate (FNR), as we may lose too many of the truly DE genes by setting
the FDR too low. Thus, the increasing use of the FDR needs to be accompanied by
an assessment of the sensitivity or FNR.
1.6.1. An Application to Calculate FDR
We obtained the gene expression data from the leukemia microarray study of Golub
et al. (1999). Pre-processing was done as described in Dudoit et al. (2002). The data
consist of 27 ALL and 11 AML subjects, with expression measurements on 3051 genes. Figure 1.3 is a histogram of
the 3051 two-sample t-statistics from the genes. The t-statistic values range from
-10.58 to 7.548. Suppose that we decide to reject all genes whose t-statistic is greater
than 3 in absolute value; there are 614 such genes.
Figure 1.3: Histogram of t-statistics from the Golub et al. (1999) leukemia dataset
To calculate the FDR among these 614 genes we do a random permutation of
the sample labels (27 ALL and 11 AML). We recompute the t-statistics and count
how many exceed ±3. Doing this for 100 permutations, we find that the average
number is 17.58. Thus a simple estimate of the FDR is 17.58/614=2.86%. This
simple estimate tends to be biased because the permutations make all the genes null,
but in the data only a proportion, π0, are null. Hence to improve the estimate of
FDR, we multiply it by an estimate of π0. To obtain π0, we look at small values of
the t-statistic (in absolute value), where null statistics are much more abundant than
alternatives. Looking, for example, at t-statistics below 0.15 in absolute value, we
find that 2865 of the observed t-statistics fall into that range, while on average 2993 of
the t-statistics from the permutations fall in that range. (The 0.15 cutoff is arbitrary
in this example, but it can be automatically chosen taking bias and variance into
account as in Storey and Tibshirani (2003).) Hence our estimate is π0 ≈ 0.95.
Finally, our revised estimate of the FDR is 0.95 × 2.86% ≈ 2.72%.
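The arithmetic of this example can be replayed directly; a short Python sketch (it uses the unrounded ratio 2865/2993 for π0, so the final figure differs slightly from the rounded value quoted in the text):

```python
# Replaying the FDR arithmetic of the Golub et al. example: 614 genes exceed
# |t| > 3, an average of 17.58 permutation statistics exceed the cutoff, and
# pi0 is estimated from the abundance of small |t| values (|t| < 0.15).
n_rejected = 614          # observed t-statistics with |t| > 3
avg_null_exceed = 17.58   # average count exceeding the cutoff over 100 permutations

raw_fdr = avg_null_exceed / n_rejected   # simple (biased) FDR estimate
pi0 = 2865 / 2993                        # observed vs. permutation counts below 0.15
adjusted_fdr = pi0 * raw_fdr             # bias-corrected FDR estimate
print(raw_fdr, pi0, adjusted_fdr)
```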
1.7. Importance of Replicates
Carefully designing and controlling experiments is as important as the execution of
the experiment itself. One approach that ensures greater experimental success in
gene expression studies using microarrays is the incorporation of replicates. There
are two primary types of replicate experiments: biological and technical. Biological
replicates refer broadly to analysis of RNA of the same type from different subjects
(for example, muscle tissue treated with the same drug in different mice or six different
human samples across six arrays); technical replicates refer to multiple-array analysis
performed using the same RNA (for example, multiple samples from the same tissue
or analyzing one sample six times across multiple arrays). It is important to consider
one or both of these types of replicates depending on the experimental design.
Any type of replicate provides a measure of experimental variation arising from sources
such as RNA isolation, labeling efficiency, or chip quality. The importance of biological
replication in microarray gene expression studies has been addressed by Lee et al. (2000).
They conducted a controlled experiment involving replication of cDNA hybridizations
on a single microarray to investigate inherent variability in gene expression data and
the extent to which replication in an experiment can affect consistent and reliable
findings. The importance of biological replicate microarray experiments was also re-
ported by Pritchard et al. (2001) based on mouse gene expression data collected from
different tissues, such as the kidney, liver, and testis. They demonstrated that even
for genetically identical mice of the same age housed under the same conditions, there
were genes that exhibited significant variation at the mouse level. In particular, their
data suggest that both specific genes and functional classes of genes will be consis-
tently variable, even in multiple tissue types. Genetically diverse populations such as
humans are likely to show even greater variability in gene expression. The advantages
of using replicates are summarized as follows:
• Replicates can be used to measure variation in the experiment so that statistical
tests can be applied to evaluate differences.
• Averaging across replicates increases the precision of gene expression measure-
ments and allows smaller changes to be detected.
• Replicates can be compared to locate outlier results that may occur due to
aberrations within the array, the sample, or the experimental procedure.
The optimal number of replicates in a general microarray study will depend on many
factors, including array equipment type, laboratory technique, and the condition
and preparation of samples. A study by Pan (2002) discussed how to calculate the
number of replicates (arrays or spots) in the context of applying a normal mixture
model approach to detect changes in gene expression. The estimated number depends on
several factors, including a given magnitude of expression change, a desired statistical
power to detect it, a specified type I error rate, and the statistical method being used
to detect it.
1.8. Software
In this thesis we concentrate on analyzing gene expression data to identify differ-
entially expressed genes. Microarrays generate large amounts of numeric data that
should be analyzed effectively. R statistical software (http://cran.r-project.org)
and its expansion packages from Bioconductor project (http://www.bioconductor.
org) provide flexible means to manage and analyze these data. Our analysis has two
parts: (1) preprocessing and (2) testing. The testing stage begins after the Affymetrix
data have been preprocessed.
1.8.1. Preprocessing packages
Functions for reading Affymetrix data are available in the package affy, which is
written by a group of authors and maintained by R. A. Irizarry. Functions for Affymetrix
normalization are distributed over several packages. The MAS5 method developed by
Affymetrix is available in the package affy, command mas5(). A newer method, Plier,
also developed by Affymetrix, is available in the package plier, command plier(). The
RMA method is implemented in the package affy (command rma()), but its adaptation
that takes the GC content of the probes into account (GCRMA) is available in a
separate package, gcrma (command gcrma()).
In this thesis, we will apply RMA preprocessing to the data. RMA was chosen
because it gives highly precise (low-variance) estimates of expression, which is
desirable, although it might not give results as accurate (low-bias) as MAS5
(Millenaar et al. (2006)); in particular, RMA tends to systematically underestimate
gene expression.
1.8.2. Testing packages
The following packages provide tools for identifying differentially expressed genes.
LIMMA One of the most widely used tools for statistical analysis is limma, which
implements linear models. Limma assumes that the data are normally dis-
tributed, an assumption that real-world data do not always satisfy. Limma itself
also provides input and normalization functions which support features espe-
cially useful for the linear modeling approach. The latest version of this package
is 3.6.9; it is written by a group of authors and maintained by Gordon Smyth.
A detailed discussion of limma is given in Smyth (2004).
SIGGENES The siggenes package is used to identify differentially expressed
genes and estimate the False Discovery Rate (FDR) using both the Significance
Analysis of Microarrays (SAM) and the Empirical Bayes Analyses of Microar-
rays (EBAM). Schwender (2009) is the author of this package and the current
version number is 1.24.0.
DEDS This library contains functions that calculate various statistics of differential
expression for microarray data, including t statistics, fold change, F statistics,
SAM, moderated t and F statistics and B statistics. It also implements a
new methodology called DEDS (Differential Expression via Distance Summary),
which selects differentially expressed genes by integrating and summarizing a
set of statistics using a weighted distance approach. The authors of this package
for version number 1.24.0 are Xiao and Yang (2007).
ROC The ROC library is a collection of functions related to receiver operating char-
acteristic (ROC) curves and is targeted at use in DNA microarray analysis.
Carey and Redestig (2010) introduced this package with the version number
1.26.0.
OCplus The R package OCplus contains functions to compute the proportion
of non-differentially expressed (non-DE) genes based on a mixture model,
the resulting FDR and other operating characteristics of microarray data. The
package includes tools both for planned experiments (for sample size assessment)
and for already collected data (identification of differentially expressed genes).
The authors of this package for the version number 1.24.0 are Pawitan and
Ploner (2010).
1.9. Aim of the Thesis
The goal of many controlled microarray experiments is to identify genes that are
regulated by modifying conditions of interest. For example, one may wish to compare
a drug with an alternative drug treatment. The goal of these experiments is generally
that of identifying as many of the genes as possible that are differentially expressed
across the conditions compared. Gene expression can often be thought of as the
response variable in statistical models.
Microarrays are often used to screen genes for further analysis by more reliable
assays, and the data analysis is best approached by ranking genes and/or by selecting
a subset of genes for further validation. Determining whether a gene is differentially
expressed under different conditions is an important statistical problem in genome
experiments. The standard practice is to test the hypothesis of no differential ex-
pression for each gene when comparison is made between two (or more) different
experimental conditions. Generally speaking, the statistical methods for testing the
hypothesis can be classified into two categories: the parametric method and the non-
parametric method. The most commonly used parametric method is the two sample
t-test and its variations. Although it seems to work well in some situations, the
parametric method has a big drawback: its strong dependence on model assumption.
Several authors pointed out that expression data from microarrays are often not dis-
tributed according to a normal distribution, even after some preprocessing (Hunter
et al. (2001), Thomas et al. (2001), Pan et al. (2003), Craig et al. (2003), Zhao et al.
(2003), Zhang et al. (2006)). According to Hunter et al. (2001) and Thomas et al.
(2001), microarray data are often noisy, and hence the parametric assumption is cer-
tainly inappropriate for a subset of genes under any given transformation. Therefore
it is challenging to construct a suitable statistical model applicable to all microarray
datasets. The main goal of this thesis is to explore methods for ranking genes in
order of likelihood of being differentially expressed. The output is a list of top-ranked
genes that captures the truly DE genes. In this context we propose three new methods:
Parametric Method: We develop an Approximate Likelihood Ratio Test (ALRT)
method assuming the expression levels follow a generalized logistic distribution
of Type II (GLDII). The ALRT method with the distributional assumption GLDII
appears to provide a suitable fit to the data, especially with large samples.
Nonparametric Method 1 We develop an improved test statistic for testing whether
the area under the receiver operating characteristic curve for each gene is greater
than 0.5, allowing a different variance for each gene. The method performs
reasonably well and is computationally efficient enough for practical applications.
Nonparametric Method 2 We develop a nonparametric procedure for selecting
genes with expression levels correlated with that of a “seed” gene or a marker
gene in microarray experiments. The marker gene is known in advance. The
proposed test statistic compares two Areas Under Receiver Operating Characteristic
Curves (AUC) for gene pairs, taking correlation into account.
1.10. Organization of the Thesis
The thesis is divided into a series of chapters, each devoted to a particular facet of
analysis and arranged in roughly the same order as the issues one might encounter
during a real experiment. It begins with a very brief overview of hybridization, which
nicely summarizes the microarray technology and highlights the current limitations
of the most commonly used methods. Based on the aims of the thesis described in
Section 1.9, the work includes mainly identifying differentially expressed genes. The
presentation of the work in this thesis is organized in the following chapters:
Chapter 2 outlines a few of the popular microarray data analysis tools that help to
detect differential gene expression. The chapter starts by describing how to
assess the significance of fold changes in expression. The primary concepts
and methods used for identifying differentially expressed genes are also intro-
duced. An empirical comparison with other methods is also discussed in this
chapter.
Chapter 3 proposes a parametric method based on an approximate likelihood
ratio test for the case where the underlying distribution of gene expression data
is skewed. Here we assume that the gene expression data follow the generalized
logistic distribution of Type II rather than the normal distribution. We also
compare results on simulated data and from real microarray
datasets. We conclude the chapter with a discussion summarizing the advan-
tages and disadvantages of using our method.
Chapter 4 provides a nonparametric method for testing whether the area under the
receiver operating characteristic curve (AUC) for each gene is equal to 0.5, allowing
a different variance for each gene. This chapter provides the implementation
of the method with real datasets and simulation procedures and shows the
improvement in identifying differentially expressed genes compared with the
rank sum test and the t-test.
Chapter 5 studies the problem of searching for genes correlated with a known candidate
gene or "seed" gene. We propose a test statistic that compares two Areas Under
Receiver Operating Characteristic Curves (AUC) for gene pairs, taking correlation
into account and identifying DE genes sequentially. We compare our method
with other well known methods.
Chapter 6 gives the conclusion of this work and proposes possible directions for
future research.
CHAPTER 2
Popular Statistical Methods for Identifying Differentially Expressed Genes
Having performed normalization we should now be able to compare the expression
level of any gene in the treatment to the expression level of the same gene in the
control. Many statistics have been developed for this purpose. The number of methods
for identifying differentially expressed genes from microarray experiments is now
large and ever increasing. There is no consensus, no strict guidelines and no real rules of
thumb on when to apply some tests and when to apply others. In this chapter we discuss
some of the well known methods that are used in identifying differentially expressed
genes.
2.1. Introduction
One of the well-known problems in high-dimensional microarray analysis is that of
ranking genes according to differential expression. For this purpose, many statis-
tical models have been proposed for the analysis of microarray gene expression data,
including generalized linear models (Kerr et al. (2000), Lee et al. (2000), Dudoit
et al. (2002)), mixed effects models (Wolfinger et al. (2001)), modified mixture model
methods (Pan et al. (2003)), significance analysis of microarrays (SAM) (Tusher et al.
(2001)), modified t-statistic method (Zhao et al. (2003)), Likelihood Methods (Ideker
et al. (2000),Wright and Simon (2003), Wu (2005)), Bayesian methods (Opgen-Rhein
and Strimmer (2007), Newton et al. (2001), Baldi and Long (2001), Lonnstedt and
Speed (2002), Newton et al. (2004), Smyth (2004), Cui et al. (2005), Fox and Dimmic
(2006), Scharpf et al. (2009)), nonparametric methods (Pepe et al. (2003)), Hotelling
T 2 method (Lu et al. (2005)), global test method (Goeman et al. (2004)), and Global
approach (Zhou et al. (2007)), among others. Some of these methods require dis-
tributional assumptions for the underlying gene expressions. The aim of this
chapter is to consider a few of these methods for analyzing microarray data to identify
differentially expressed genes.
2.2. Fold Change
The simplest method to detect differential gene expression is by ranking based on the
fold change (FC). The approach to calculate fold change is to divide the expression
level of a gene in the sample by the expression level of the same gene in the control.
Then we get the fold change, which is 1 for an unchanged expression, less than 1 for
a down-regulated gene, and larger than 1 for an up-regulated gene. The genes are
then ranked by this ratio.
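The ranking can be sketched in a few lines of Python (toy expression values; the analyses in this thesis use R). Working on the log2 scale is a common way to make up- and down-regulation symmetric around zero:

```python
# Sketch of fold-change ranking on the log2 scale: log2(FC) = 0 for an
# unchanged gene, negative for down-regulation, positive for up-regulation.
# Toy (treatment, control) expression pairs, purely illustrative.
import math

expr = {"geneA": (200.0, 100.0),   # 2-fold up-regulated
        "geneB": (50.0, 100.0),    # 2-fold down-regulated
        "geneC": (100.0, 100.0)}   # unchanged

log2fc = {g: math.log2(t / c) for g, (t, c) in expr.items()}
ranked = sorted(log2fc, key=lambda g: abs(log2fc[g]), reverse=True)
print(log2fc)  # geneA: 1.0, geneB: -1.0, geneC: 0.0
print(ranked)  # geneC, the unchanged gene, ranks last
```

On the log2 scale a 2-fold increase (+1) and a 2-fold decrease (−1) are symmetric, addressing the asymmetric raw-ratio scale discussed in the text.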
The problem with fold change emerges when one takes a look at its scale. Up-
regulated genes occupy the scale from 1 to infinity (or at least up to 1000 for a 1000-fold
up-regulated gene), whereas all down-regulated genes occupy only the scale from 0
to 1. This scale is highly asymmetric. Another disadvantage of the fold change is
that it does not take the variability of the gene expression values into account. The basic flaw of
this approach mentioned by Dudoit et al. (2002) is that genes with high fold change
might also exhibit high variability, and hence their differential expression between
experimental conditions may not be significant. Similarly, genes with less than two-fold
changes may have highly reproducible expression intensities, and hence significant
differential expression can be found across experimental conditions.
In order to assess differential expression in a way that controls both false positives
(genes declared to be differentially expressed when they are not) and false negatives
(genes declared to be not differentially expressed when they are), the standard ap-
proach emerging is one based on statistical significance and hypothesis testing, with
careful attention paid to multiple comparisons issues. The following are a few approaches
for assessing differential expression within a hypothesis testing framework.
2.3. t-test and ANOVA
We can use a t-test to determine whether the expression of a particular gene is sig-
nificantly different between control and treatment. The t-test uses the mean and
variance of the treatment and control distributions and calculates the probability
that the observed difference in mean occurs when the null hypothesis is true. The
formula for the t-statistic is the difference in the means over the standard error of the
difference. For two groups, this is equivalent to a one-way analysis of variance (Sahai
and Ageel (2000)).
When using the t-test it is often assumed that there is equal variance between
sample and control. That allows the treatment and control to be pooled for variance
estimation. If the variance cannot be assumed equal we can use Welch’s t-test which
assumes unequal variances of the two populations. If {xig}n1i=1 and {yjg}n2
j=1 defined
as the expression observed for the n1 control cases and n2 treatment cases in the gth
gene, then for each gene g, the test statistic is
tg =xg − yg√
s2x/n1 + s2y/n2
where xg and yg denote the sample average intensities in groups 1 and 2, and s2x and
s2y denote the sample variances for each group, respectively.
After calculating the t-test p-values for the replicated genes, the ones with the
lowest p-values can be saved into a new gene list and used in further analysis, for
example cluster analysis. These are the genes that most significantly differ between
two conditions.
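The per-gene Welch statistic can be computed directly; a Python sketch on toy expression values for a single gene (illustrative only; real analyses in this thesis use R):

```python
# Sketch of the per-gene Welch t-statistic:
# t_g = (xbar_g - ybar_g) / sqrt(s2x/n1 + s2y/n2), with per-group variances.
import statistics

def welch_t(x, y):
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    s2x, s2y = statistics.variance(x), statistics.variance(y)  # sample variances
    return (xbar - ybar) / (s2x / len(x) + s2y / len(y)) ** 0.5

control = [7.1, 6.9, 7.3, 7.0]    # toy log-scale expression, control group
treatment = [8.0, 8.4, 7.9, 8.3]  # toy log-scale expression, treatment group
print(welch_t(control, treatment))  # large negative: higher expression under treatment
```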
Early statistics for analyzing microarray data included gene-specific t-tests to find
differentially expressed genes. The t-statistic soon turned out to be unsuitable for
microarray data: because of the large number of genes included in microarray
experiments, there are always some genes with very small variances across replicates,
so that their (absolute) t-values are large regardless of whether or not the differences
in their averages are large. Some of these turn out to be false positives for the t-
statistic. Moreover, the sample average is highly influenced by extreme observations
and outliers, and hence unsuitable for data with as few replicates as microarray data.
Several alternative statistics have been proposed to overcome these problems, and
many of them draw on the idea of shrinking the variance.
If we have more than two conditions, the t-test may not be the method of choice,
because the number of comparisons grows quickly when all possible pairwise comparisons
between conditions are performed. The analysis of variance (ANOVA) method will, using the F
distribution, calculate the probability of finding the observed differences in means
between more than two conditions when the null hypothesis is true (when there is no
difference in means).
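The one-way ANOVA F statistic underlying this comparison can be sketched as follows (Python; toy data for one gene measured under three conditions, purely illustrative):

```python
# Sketch of the one-way ANOVA F statistic for comparing a gene across more
# than two conditions: between-group mean square over within-group mean square.
import statistics

def anova_f(groups):
    k = len(groups)                       # number of conditions
    n = sum(len(g) for g in groups)       # total number of observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((v - statistics.mean(g)) ** 2 for v in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy expression values for one gene under three conditions
groups = [[5.0, 5.2, 4.8], [6.1, 6.3, 5.9], [7.0, 7.2, 6.8]]
print(anova_f(groups))  # large F: group means clearly differ
```

A large F value relative to the F distribution with (k − 1, n − k) degrees of freedom indicates that the group means differ.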
2.4. Nonparametric Tests
Instead of the parametric test, which usually assumes that the expression values are
normally distributed, non-parametric tests like Wilcoxon rank sum test (two groups)
or Kruskal-Wallis test (two or more groups) can be applied, especially if the expression
values are not normally distributed. Here we describe the Wilcoxon rank sum test
and the ROC methodology, which are commonly used for analyzing microarray data.
2.4.1. Wilcoxon Rank Sum Test (RST)
Troyanskaya et al. (2002) applied the Wilcoxon rank sum test (RST) to gene expres-
sion analysis. The RST is a nonparametric alternative to the two-sample t-test based
solely on the ranks that the expression values from the two samples occupy in the
combined sample. The test is based upon ranking the n1 + n2 observations of the
combined sample, where n1 and n2 are the sample size from the two conditions. The
Wilcoxon rank-sum test statistic is the sum of the ranks for observations from one of
the conditions. That is, after ranking all expression values from the two conditions,
the best separation we can have is that all values from one condition rank higher
than all values from the other condition. This corresponds to two non-overlapping
distributions in parametric tests. But since the Wilcoxon test does not measure vari-
ance, the significance of this result is limited by the number of replicates in the two
conditions. It is for this reason that the Wilcoxon test for low numbers of replication
has low power and that the distribution of p-values is rather granular.
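The rank-sum computation can be sketched in a few lines (Python for illustration; the toy values contain no ties, so a simple rank dictionary suffices):

```python
# Sketch of the Wilcoxon rank-sum statistic: rank the combined n1 + n2
# observations and sum the ranks occupied by one condition. Toy values, no ties.

def rank_sum(x, y):
    combined = sorted(x + y)
    ranks = {v: r for r, v in enumerate(combined, start=1)}  # value -> rank
    return sum(ranks[v] for v in x)

x = [1.2, 2.3, 3.1]          # condition 1
y = [4.5, 5.0, 6.7, 7.1]     # condition 2
print(rank_sum(x, y))  # complete separation: ranks 1 + 2 + 3 = 6
```

With n1 = 3 and n2 = 4 the statistic can take only a handful of values, which illustrates the granular p-value distribution noted above for small numbers of replicates.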
2.4.2. ROC Methodology for Gene Expression Analysis
An assessment of the expression of a gene can be made through the use of a receiver
operating characteristic (ROC) curve. If Y and X represent the expression values
from the treatment population and the control population, respectively, then for any
real-valued threshold c the population of subjects can be classified into two groups
according to whether their expression values are greater or less than c. If a gene is
significant, then the treatment group will include proportionally higher expression
values than the control group (or vice versa). The agreement between the resulting
classification and the expression values can be characterized by two quantities,
sensitivity (True Positive Rate) and specificity (True Negative Rate), defined as follows:

    sens(c) = TPR(c) = P(Y > c),
    spec(c) = TNR(c) = 1 − FPR(c) = P(X ≤ c).
The ROC approach allows one to consider the agreement between expression values
and class membership at all thresholds simultaneously. The ROC curve is the plot
of sensitivity versus 1 − specificity, where each point on the graph corresponds to a
specific threshold c. Note that for every gene in the groups of control and treatment
subjects there is an ROC curve. If a gene could perfectly discriminate, there would be
an expression value above which all treatment expressions would fall and below which
all control expressions would fall, or vice versa. The curve would then pass through the
point (0,1) on the unit grid. The closer an ROC curve comes to this ideal point, the
better its discriminating ability. A gene with no discriminating ability will produce
a curve that follows the diagonal of the grid. Like the difference between the means
(µ1 − µ2), the probability P(Y > X) is a measure of the distance between the two
experimental conditions; the diagonal line with unit slope corresponds to two
distributions with the same location. Pepe et al. (2003)
argue that two measures related to the ROC curve are suitable for ranking genes with
regard to DE between two conditions: the Area Under the ROC Curve (AUC) and
the Partial AUC (pAUC). The AUC can be interpreted as the probability that a
randomly selected subject from the treatment group has greater expression values than
a randomly selected subject from the control group, i.e.,

    AUC = P(Yg > Xg) = ∫₀¹ ROC(t) dt,   where t = FPR(c).
Let {xig, i = 1, . . . , n1} and {yjg, j = 1, . . . , n2} denote the expression values observed
for the n1 control cases and the n2 treatment cases for the gth gene. In this notation an
unbiased estimator of the AUC for the gth gene is given by

    Ag = (1/(n1 n2)) Σ_{i=1}^{n1} Σ_{j=1}^{n2} ψ(xig, yjg) = ψ̄..
where,
ψ(x, y) = 1 if x < y,  0 if x > y.
The definition of ψ(x, y) does not allow for ties because of the continuous nature of
microarray data. Ties are nevertheless possible when quantile normalization is used,
and in this case ψ(x, y) = 0.5 can additionally be defined for x = y.
The partial AUC is simply the area under the ROC curve between t0 and t1, i.e.,

pAUC(t0, t1) = ∫_{t0}^{t1} ROC(t) dt,

where the interval (t0, t1) denotes the FPRs of interest.
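The AUC and pAUC estimators above can be sketched directly; a minimal illustration (the function names and the threshold grid are our own choices, not taken from any particular package):

```python
import numpy as np

def empirical_auc(x, y):
    """Mann-Whitney estimate of AUC = P(Y > X): compare every control
    value x_i with every treatment value y_j; ties count as 0.5."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    psi = (x[:, None] < y[None, :]).astype(float)
    psi[x[:, None] == y[None, :]] = 0.5      # possible after quantile normalization
    return psi.mean()

def empirical_pauc(x, y, t0=0.0, t1=0.1):
    """Partial AUC over the FPR interval (t0, t1): integrate the empirical
    ROC curve over a grid of thresholds by the trapezoid rule."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    grid = np.linspace(t0, t1, 201)          # FPR values of interest
    c = np.quantile(x, 1.0 - grid)           # thresholds with FPR(c) ~ t
    tpr = np.array([(y > ci).mean() for ci in c])
    return float(np.sum(0.5 * (tpr[1:] + tpr[:-1]) * np.diff(grid)))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 50)                 # control expression values
y = rng.normal(1.5, 1.0, 60)                 # treatment expression values
print(empirical_auc(x, y))                   # theoretical AUC = Phi(1.5/sqrt(2))
print(empirical_pauc(x, y, 0.0, 0.2))
```

The double loop implicit in Â_g is vectorized by broadcasting the comparison over the n1 × n2 grid of pairs.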
For continuous data, the nonparametric ROC curve may be preferred since it
passes through all observed points and provides unbiased estimates of sensitivity,
specificity, and AUC in large samples (Zweig et al. (1993)). More importantly, the
nonparametric approach does not require the data to be fitted by any particular
model. If the distributions of scores for true-positive and true-negative test subjects
are far from Gaussian, the parametric AUC and its corresponding standard error
(SE) derived from a directly fitted binormal model may be distorted (Godard et al.
(1990)). Convergence may also be an issue with data characterized by
extreme values. For these reasons, as well as its relative simplicity and ease of use,
the nonparametric approach continues to be popular among many researchers.
2.5. SAM-Statistic
Significance analysis of microarrays (SAM) is a statistical technique, established in
2001 by Tusher, Tibshirani and Chu, for determining whether changes in gene ex-
pression are statistically significant. To avoid the genes with low expression levels
dominating the results of the analysis, a small, strictly positive constant s0, the so
called fudge factor, is added to the denominator of the usual t-statistic.
d_g = (x̄_g − ȳ_g) / ( √(n/(n1 n2)) s_g + s0 ),  n = n1 + n2,
where s0 is some constant estimated from all the individual gene variances. Values
of s0 have a shrinkage effect producing a large decrease in the magnitude of the dg
statistic when the sample variances are small, and a smaller decrease when the sample
variances are large.
This analysis uses non-parametric statistics, since the null distribution of the
d_g-statistic is unknown. In this method, repeated permutations of the data are used
to determine whether the expression of any gene is significantly related to the response.
The use of permutation-based analysis accounts for correlations between genes and
avoids parametric assumptions about the distribution of individual genes. This is an
advantage over other techniques (for example ANOVA and Bonferroni), which assume
equal variance and/or independence of genes.
Each gene in a microarray experiment can have its own unique variance. This may
be a consequence of biological or technical factors but it is clear from our experience
that variances are variable across genes to a greater extent than expected due to
statistical errors of estimation. To derive stable gene specific variance estimates, we
can borrow information across genes by shrinking the variance estimates toward a
prior value or toward their bias-corrected geometric mean. When the true variances
are highly variable it is desirable to shrink less. When the true variances are similar
we should shrink more. In this way the new variance estimates adapt to the degree
of heterogeneity of variances.
In the following we describe the SAM procedure:
1. Compute the expression score d_g for each gene g, g = 1, · · · ,m, and order these
values to obtain the observed order statistics d_(1) ≤ · · · ≤ d_(m).
2. Draw B random permutations of the group labels. For each permutation b,
compute the permuted expression scores d^b_g, g = 1, · · · ,m, and order them.
Estimate the expected order statistics by d̄_(g) = ∑_b d^b_(g)/B, g = 1, · · · ,m.
3. Plot the observed order statistics d_(g) against the expected order statistics d̄_(g)
to obtain the SAM plot.
4. For a fixed threshold ∆ > 0, find the first data point (d̄_(g1), d_(g1)) to the right
of the origin for which d_(g) − d̄_(g) ≥ ∆, and set cut_up(∆) = d_(g1). Call any gene
g with d_g ≥ cut_up(∆) positive significant. Similarly, find the first data point
(d̄_(g2), d_(g2)) to the left of the origin for which d_(g) − d̄_(g) ≤ −∆, set cut_low(∆) =
d_(g2), and call any gene g with d_g ≤ cut_low(∆) negative significant. See Figure
2.1 for the SAM plot showing the positive and negative significant genes under
∆ = 2 that can be taken forward for further biological analysis.

Figure 2.1: SAM analysis for the two-class unpaired case assuming unequal variances and ∆ = 2.
5. Estimate the FDR by

F̂DR(∆) = π̂0 · [ (1/B) ∑_b #{d^b_g ∉ (cut_low(∆), cut_up(∆))} ] / max{# of significant genes at the chosen ∆, 1},

where π̂0 is an estimate of the prior probability π0 that a gene is not differentially
expressed. This estimate depends on the choice of another threshold ∆′ and is taken
as the ratio of the number of observed genes not significant at ∆′ to the expected
number of genes not significant at ∆′ when all null hypotheses are true. If this
proportion exceeds 1, then π̂0 is set to 1. A way of estimating π0 is described by
Storey and Tibshirani (2003).
6. Repeat steps 4 and 5 for several values of the threshold ∆. Choose the value of
∆ that provides the best balance between the number of identified genes and
the estimated FDR.
Computation of the Fudge Factor: In the SAM analysis, the fudge factor s0 is
specified as the quantile of the standard deviations s_g, g = 1, · · · ,m, of the
genes that fulfills a specific optimization criterion. Efron et al. (2001) suggest
specifying the fudge factor by selecting the value of s0 that leads to the most
differentially expressed genes, namely the 90th percentile of the s_g values. In
the SAM analysis, the fudge factor is computed by the following algorithm
provided by Chu et al. (2002).
1. Compute the 100 percentiles q_k, k = 1, · · · , 100, of the s_g values.

2. For α ∈ ℜ = {0, 0.05, 0.1, · · · , 1}:

• compute d^α_g = (x̄_g − ȳ_g) / ( √(n/(n1 n2)) s_g + s^α ), where s^α denotes the
α quantile of the s_g values, and s^0 = q_0 = min_g{s_g}, g = 1, · · · ,m;

• calculate ν^α_k = 1.4826 · MAD{d^α_g | s_g ∈ [q_{k−1}, q_k)}, k = 1, · · · , 100;

• compute the coefficient of variation CV(α) of the ν^α_k values.

3. Set α̂ = argmin_{α∈ℜ}{CV(α)}, and ŝ0 = s^α̂.
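The algorithm can be sketched as follows; an illustration of the coefficient-of-variation criterion only, not the siggenes implementation, and the simulated mean differences and standard errors are made up:

```python
import numpy as np

def fudge_factor(diff, sg):
    """Pick s0 = s^alpha, the alpha-quantile of the gene-wise standard
    errors sg, minimizing the coefficient of variation of the MAD of the
    d-statistics across the 100 percentile bins of sg."""
    q = np.percentile(sg, np.arange(101))               # q_0, ..., q_100
    bins = np.clip(np.searchsorted(q, sg, side='right') - 1, 0, 99)
    best_cv, best_s0 = np.inf, None
    for alpha in np.linspace(0.0, 1.0, 21):             # alpha in {0, 0.05, ..., 1}
        s_alpha = np.quantile(sg, alpha)                # alpha = 0 gives min(sg)
        d = diff / (sg + s_alpha)
        v = []
        for k in range(100):                            # 1.4826 * MAD within bin k
            dk = d[bins == k]
            if dk.size:
                v.append(1.4826 * np.median(np.abs(dk - np.median(dk))))
        v = np.asarray(v)
        cv = v.std(ddof=1) / v.mean() if v.mean() > 0 else np.inf
        if cv < best_cv:
            best_cv, best_s0 = cv, s_alpha
    return best_s0

rng = np.random.default_rng(0)
sg = np.sqrt(rng.chisquare(8, 2000) / 8) * 0.5          # made-up standard errors
diff = rng.normal(0.0, sg)                              # made-up mean differences
print(fudge_factor(diff, sg))
```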
R Package: siggenes is a Bioconductor package implementing the significance
analysis of microarrays technique of Tusher et al. (2001). The package
contains a function sam to calculate the statistic d_g for each gene g = 1, · · · ,m.
Additionally, the number of differentially expressed genes and the estimated
FDR is determined for an initial (set of) value(s) for the threshold ∆. The
fudge factor estimate is included in this package. The estimation of the prior
probability, π0, that a gene is not differentially expressed can be obtained by the
function pi0.est. It is estimated by the natural cubic splines based method of
Storey and Tibshirani (2003).
2.6. Empirical Bayes Approach
Efron et al. (2001), and Efron and Tibshirani (2002) model the distribution of the
expression scores dg, g = 1, · · · ,m, as a mixture of two components, one component
for the differentially expressed genes, and the other for the not differentially expressed
genes. Roughly speaking, the empirical Bayes method is based on the assumption that
there are two classes of genes, “Not Different” and “Different”, with prior proba-
bilities π0 and π1 = 1 − π0, respectively. Introducing the class indicator
variable J, we can write π0 = Pr{J = 0} and π1 = Pr{J = 1}. Denote the condi-
tional probability density of the test statistic T for a gene given J = 0 by
f0(t) and the corresponding density of T given J = 1 by f1(t). Then the density of
such a statistic, for example the two-sample t-statistic, for gene g is
f(t) = π0f0(t) + π1f1(t)
Then, to evaluate which genes are differentially expressed, one uses the posterior
probability that J = 0 for each gene:

τ0 = P(Not differentially expressed | T = t) = P(J = 0 | T = t)
   = P(T = t | J = 0) P(J = 0) / P(T = t)
   = π0 f0(t) / f(t).
Small values of the posterior probability, τ0, indicate possibly differentially expressed
genes. The posterior probability τ0 has been termed the local FDR by Efron and
Tibshirani (2002). Therefore,

P(t) = P(J = 1 | T = t) = 1 − π0 f0(t)/f(t).

The simplest Bayes rule is to select a gene with T = t if P(t) ≥ C for this gene, where
C < 1 is a preselected threshold level.
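As a toy illustration of this two-class mixture, suppose f0 and f1 are known Gaussian densities (in practice Efron's approach estimates these quantities from the data; the particular means, variances and π0 below are made up):

```python
import math

def npdf(t, mu, sd):
    """Normal density, used here as a stand-in for f0 and f1."""
    return math.exp(-0.5 * ((t - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

pi0 = 0.9                                    # prior P(not differentially expressed)

def local_fdr(t):
    """tau0 = P(J = 0 | T = t) = pi0 f0(t) / f(t)."""
    f0 = npdf(t, 0.0, 1.0)                   # null density (assumed known here)
    f1 = npdf(t, 3.0, 1.0)                   # alternative density (assumed known)
    f = pi0 * f0 + (1.0 - pi0) * f1          # mixture density f(t)
    return pi0 * f0 / f

# Bayes rule: call the gene DE when P(t) = 1 - tau0 >= C for a threshold C < 1
for t in (0.0, 2.0, 4.0):
    print(t, round(local_fdr(t), 4))
```

Statistics near the null mode give τ0 close to 1 (almost certainly not DE), while extreme statistics give τ0 close to 0.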
2.7. Posterior Odds Statistic(LIMMA)
Lonnstedt and Speed (2002) proposed a parametric empirical Bayes method for the
problem of identifying differentially expressed genes. They assumed a normal mixture
model for gene expression data and defined a B-statistic, which is an estimate of the
log posterior odds-ratio that each gene is differentially expressed. The B-statistic is
equivalent, for the purpose of ranking genes, to the penalized t-statistic

t_g = (x̄_g − ȳ_g) / √((s0 + s²_g)/n)
where the penalty s0 is estimated from the mean and standard deviation of the sample
variances s²_g. Using the same notation and assumptions, the SAM statistic can be
written in the form

d_g = (x̄_g − ȳ_g) / ( (s0 + s_g)/√n )

when assessing differential expression of the gth gene. The SAM statistic differs slightly
from the t_g statistic in that the penalty is applied to the sample standard deviation
s_g rather than to the sample variance s²_g. Smyth (2004) reformulated the posterior
odds statistic in terms of an empirical Bayes (E-B) t-statistic in which posterior
residual standard deviations are used in place of ordinary standard deviations, and
implemented the log odds-ratio method as the function ebayes() in the freely available
R package Limma. This empirical Bayes statistic is described as equivalent to shrinkage
of the estimated sample variances towards a pooled estimate, resulting in far more
stable inference when the number of arrays is small. Smyth (2004) also suggested
another direction in which the t-statistic can be generalized: replacing the sample
mean difference and sample standard deviation with location and scale estimators
that are robust against outliers.
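The shrinkage idea behind the penalized t can be illustrated with a small sketch; this is our own simplification (s0 taken as the median sample variance, equal group sizes n assumed), not limma's empirical-Bayes fit:

```python
import numpy as np

def penalized_t(x, y, s0=None):
    """Penalized t with the penalty applied to the sample variance,
    assuming equal group sizes n as in the formula above.  Here s0 is
    simply the median gene-wise variance, a crude stand-in for the
    empirical-Bayes estimate used by limma."""
    n = x.shape[1]
    s2 = 0.5 * (x.var(axis=1, ddof=1) + y.var(axis=1, ddof=1))  # pooled variance
    if s0 is None:
        s0 = np.median(s2)                   # shrinks tiny variances upward
    return (x.mean(axis=1) - y.mean(axis=1)) / np.sqrt((s0 + s2) / n)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, (100, 8))
y = rng.normal(0.0, 1.0, (100, 8))
x[0] = 0.05 + rng.normal(0.0, 0.001, 8)      # near-zero variance, tiny effect
y[0] = 0.0
t_pen = penalized_t(x, y)                    # penalized: gene 0 no longer extreme
t_raw = penalized_t(x, y, s0=0.0)            # unpenalized: gene 0 dominates
print(abs(t_raw[0]), abs(t_pen[0]))
```

The example shows the stabilizing effect: a gene with a tiny sample variance and a negligible mean difference produces a huge ordinary t-statistic but a modest penalized one.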
2.8. Other Methods
Cui et al. (2005) FS Statistic: Cui et al. (2005) proposed a shrinkage estimator
for gene-specific variance components based on the James-Stein estimator and
used it to construct a test statistic called FS. They showed that the FS test is
robust and performs well under a wide range of assumptions about variance
heterogeneity. They compared the FS statistic with several other statistics, such
as B, SAM and the regularized t, and found that FS has comparable or better power
for identifying differentially expressed genes.
Wu (2005) Penalized Linear Regression Model: Wu (2005) used a linear regres-
sion model for detecting differential gene expression, exploring penalized linear
regression. He proposed penalized t/F statistics for two-class microarray data
based on a penalty model and showed that the model is intuitive and performs
favorably in applications.
Pan et al. (2003) Empirical Bayes Approach: Pan et al. (2003) incorporated
biological knowledge into a mixture model to analyze microarray data. They
proposed a mixture model that allows the genes in different groups to have
different distributions, with the grouping of the genes reflecting biological informa-
tion. The groups can be obtained from analyzing gene expression data, such as a
set of differentially expressed genes or a cluster of genes with similar expression
patterns. In these approaches, if there is statistically significant enrichment
of the genes in one or more Gene Ontology (GO) categories, the group of
supplied genes is regarded as biologically more meaningful. They found that
their approach reduces the false discovery rate (FDR) compared with other
standard approaches.
Lu et al. (2005) Hotelling's T² Statistic: Lu et al. (2005) proposed the use of
Hotelling's T² statistic with a multiple forward search (MFS) algorithm designed
for selecting a subset of genes in high-dimensional microarray datasets. Hotelling's
T² statistic is a natural multidimensional extension of the t-statistic and hence
can take into account the multidimensional structure of microarray data. They
found that their approach gave fewer false positives and false negatives than the
t-test.
Seng et al. (2008) Generalized likelihood ratio Method: Parameters of a sta-
tistical data model, which account for potential error sources, can be estimated
using the maximum-likelihood estimation (MLE) method. A generalized likeli-
hood ratio (GLR) test can then be applied to identify genes whose expression
levels are statistically different. A crucial step in the GLR test lies in the se-
lection of the underlying error structure summarizing the influence of multiple
sources of variation in microarray studies. Seng et al. (2008) compared the ef-
fects of different underlying statistical error structures on the GLR test’s power
in identifying differentially expressed genes in microarray data. They also evalu-
ated variants of the GLR test as well as the one sample t-test based on simulated
data by means of receiver operating characteristic (ROC) curves.
Scharpf et al. (2009) Hierarchical Bayesian Model: Scharpf et al. (2009) de-
fined a hierarchical Bayesian model for microarray expression data collected
from several studies and used it to identify differentially expressed genes between
two conditions. They showed that this flexible modeling allows for interactions
between platforms and the estimated effect, while including shrinkage across both
genes and studies. They also provided guidelines for when the Bayesian model
is most likely to be useful.
CHAPTER 3
Approximate Likelihood Ratio Method
The original work of this chapter is taken from Hossain et al. (2009). In this chap-
ter we propose a test assuming the expression levels follow a Generalized Logistic
Distribution of Type II (GLDII). The motivation for this assumed distribution is
to allow longer tails than normal distributions since extreme values are common in
microarray experiments. The shape parameter of this distribution provides a wide
range of flexibility in modeling different shaped distributions. Given the computa-
tional complexity of carrying out Likelihood Ratio (LR) testing for many thousands
of genes, an Approximate LR Test (ALRT) is proposed instead. We also generalize
the two-class ALRT method to multi-class microarray data. The performance of the
ALRT method for the GLDII is compared to methods based on Wald-type statistics
through the use of simulation. The simulation results show that our method performs
quite well compared to the SAM analysis using standardized Wilcoxon rank statistics
and Empircal Bayes t-statistics at any settings of variance, sample size and treatment
effect. Our model is also found less sensitive to contamination. We apply our method
to a real microarray data comparing normal muscle to muscle from patients with
Duchenne muscular dystrophy (DMD), in which a set of truly DEGs are known. We
also illustrate our method in a two class classification problem of Golub et al. (1999)
leukemia dataset.
3.1. Background
As complex and robust as the available analysis methods for microarray data cur-
rently are, there is always room for error and many inherent problems in identifying
differentially expressed genes (DEGs). The statistical methods used to detect DEGs
can be classified into two categories: parametric methods and nonparametric meth-
ods. The most commonly used parametric method is the two sample t-test and its
variations, which are based on Wald statistics. Although Wald-type statistics may
work well in some situations, they have two main drawbacks. First, they depend
strongly on the model assumptions. Second, for a large difference in group means,
the standard error is inflated, lowering the Wald statistic and leading to inflation of
the Type II error (false negatives).
It is common practice in microarray studies to assume that the underlying distribution
of a gene's expression level is normal, although in fact this is not always true. According
to Hunter et al. (2001) and Thomas et al. (2001), microarray data are often noisy,
and hence parametric assumptions are certainly inappropriate for a subset of genes.
Therefore it is challenging to construct a suitable statistical model applicable to all
microarray datasets. More recently, Bhowmick et al. (2006) proposed a Laplace
mixture model approach for the identification of DEGs in microarray experiments.
Lonnstedt and Speed (2002) proposed a normal mixture model for gene expression
data and defined a log posterior odds statistic. Smyth (2004) reformulated the
posterior odds statistic in terms of an empirical Bayes (E-B) t-statistic in which
posterior residual standard deviations are used in place of ordinary standard deviations.
The Bioconductor package LIMMA provides functions for calculating the E-B t-statistic.
Ghosh (2004) also assumed a mixture model for assessing differential expression.
Recently, Jeffery et al. (2006) compared the efficiency of 10 gene selection methods
and made recommendations for the choice of method based on different characteristics
of the expression values. All of these methods assume that all genes have the same
distribution, but this assumption may or may not hold in practice. Rather than
assuming a common distribution for all genes, one can partition genes into several
groups based on their expression levels and then assume that a different distribution
holds for each group.
In this chapter we propose assuming that the underlying distribution of gene
expressions follows the GLDII, which is indexed by a shape parameter α. The
GLDII is chosen because of its great flexibility in modeling distributions of different
shapes on the interval (−∞,∞); a symmetric distribution is the special case of the
GLDII with α = 1. Brief details about the GLDII are given in Section 3.2. In order
to estimate the statistical significance, we focus on Likelihood Ratio (LR) test. Ideker
et al. (2000) proposed a generalized LR test which was performed for each gene to
identify DEGs. More recently, Bokka and Mathur (2006) proposed a nonparametric
LR test to identify differentially expressed genes from microarray data. Carrying out
the LR test for thousands of genes is computationally intensive, and for this reason
we propose an Approximate Likelihood Ratio Test (ALRT) that circumvents these
heavy computations.
3.2. Generalized Logistic Distribution of Type II
(GLDII)
Balakrishnan and Leung (1988) defined the generalized logistic distributions of Type
II (GLDII) by compounding a reduced log-Weibull distribution with a gamma distri-
bution. Recently Balakrishnan and Hossain (2007) studied this distribution under
progressive type II censoring. The GLDII with location parameter µ, scale parameter
σ and shape parameter α has density given by
f(x | µ, σ, α) = (α/σ) exp(−α(x − µ)/σ) / [1 + exp(−(x − µ)/σ)]^{α+1},  −∞ < x < ∞,
and the corresponding cumulative distribution function is given by

F(x | α, µ, σ) = 1 − [ exp(−(x − µ)/σ) / (1 + exp(−(x − µ)/σ)) ]^α,  −∞ < x < ∞.

Figure 3.1: Type II generalized logistic density for different values of α and σ = 0.5.
Let Z = (X − µ)/σ. Then Z has the standard Type II generalized logistic distribution
with pdf

f(z | α) = α e^{−αz} / (1 + e^{−z})^{α+1},  −∞ < z < ∞,

and the corresponding cumulative distribution function is

F(z | α) = 1 − ( e^{−z} / (1 + e^{−z}) )^α,  −∞ < z < ∞.
The standard Type II generalized logistic density function is plotted in Figure
3.1 for different values of α. It is apparent from the figure that the distribution is
negatively skewed for α < 1 and positively skewed for α > 1; that is, skewness is
an increasing function of α for the GLDII. It is also apparent that the GLDII has
heavier tails than the normal distribution, meaning that there is a higher probability
of extreme values than under a normal distribution. Therefore, the GLDII can be
used as an alternative to the normal distribution or to skewed distributions.
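For concreteness, the density and distribution function above can be evaluated directly; a small sketch (the function names are ours):

```python
import math

def gld2_pdf(x, alpha=1.0, mu=0.0, sigma=1.0):
    """GLDII density: (alpha/sigma) exp(-alpha z) / (1 + exp(-z))^(alpha+1),
    with z = (x - mu)/sigma."""
    z = (x - mu) / sigma
    return (alpha / sigma) * math.exp(-alpha * z) / (1.0 + math.exp(-z)) ** (alpha + 1)

def gld2_cdf(x, alpha=1.0, mu=0.0, sigma=1.0):
    """GLDII cdf: 1 - (exp(-z) / (1 + exp(-z)))^alpha."""
    z = (x - mu) / sigma
    return 1.0 - (math.exp(-z) / (1.0 + math.exp(-z))) ** alpha

print(gld2_cdf(0.0, alpha=1.0))  # 0.5: alpha = 1 recovers the symmetric logistic case
print(gld2_pdf(0.0, alpha=1.0))  # 0.25: the standard logistic density at zero
```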
3.3. Motivation and objectives
Figure 3.2 shows the density plots (Gaussian kernel smoothing by the density()
function in R) of 3 genes randomly selected from the Golub et al. (1999) leukemia
dataset. The dataset involves 7129 genes in 38 ALL and 34 AML subjects. It is seen
from Figure 3.2 that the genes follow asymmetric distributions of different shapes.
The following three statistics have been calculated to assess the performance of
the GLDII against the normal distribution:
Kurtosis If x_i denotes the non-missing observations of x, n their number, µ
their mean, and m_r = ∑_i (x_i − µ)^r / n the sample moment of order r, then
kurtosis is defined as

Kurtosis = m4/m2² − 3.
Distributions with zero excess kurtosis belong to a symmetric distribution fam-
ily. A distribution with positive excess kurtosis has a more acute peak around
the mean (that is, a lower probability than a normally distributed variable of
values near the mean) and fatter tails (that is, a higher probability than a
normally distributed variable of extreme values). A distribution with negative
excess kurtosis has a lower, wider peak around the mean (that is, a higher
probability than a normally distributed variable of values near the mean) and
thinner tails (that is, a lower probability than a normally distributed variable of
extreme values).

Figure 3.2: Density plots of 3 genes from the leukemia dataset: black solid line ALL and red dashed line AML.
Skewness In probability theory and statistics, skewness is a measure of the asym-
metry of the probability distribution of a real-valued random variable. It is
defined as,
Skewness = m3 / m2^{3/2}.
Qualitatively, a negative skewness indicates that the tail on the left side of
probability density function is longer than the right side and the bulk of the
values (including the median) lies to the right of the mean. A positive skew
indicates that the tail on the right side is longer than the left side and the bulk
of the values lies to the left of the mean. A zero value indicates that the values
are relatively evenly distributed on both sides of the mean, typically but not
necessarily implying a symmetric distribution.
Akaike Information Criterion Akaike’s information criterion (AIC) is a measure
of the goodness of fit of an estimated statistical model. Given a data set, several
competing models may be ranked according to their AIC, with the one having
the lowest AIC being the best. In the general case, the AIC is
AIC = 2k − 2 lnL
where k is the number of parameters in the statistical model, and L is the max-
imized value of the likelihood function for the estimated model. Increasing the
number of free parameters to be estimated improves the goodness of fit, regardless
of whether the additional parameters are meaningful. Hence the AIC not only rewards
goodness of fit, but also includes a penalty that is an increasing function of the
number of estimated parameters. This penalty discourages overfitting. The
preferred model is the one with the lowest AIC value. The AIC methodology
attempts to find the model that best explains the data with a minimum of free
parameters.
Table 3.1 presents the skewness, kurtosis and AIC values for the 3 genes randomly
selected from the Golub et al. (1999) leukemia dataset. It is seen from the results
that the GLDII is preferable to the normal distribution in terms of AIC, since the
GLDII produces smaller AIC values for most of the genes, and it performs best for
genes with higher skewness values. Therefore, the GLDII can be used for robustness
studies instead of classical procedures in microarray studies, since extreme values
are often observed in real microarray data.
Table 3.1: Model selection from GLDII and Normal Distribution based on AIC values

Gene Name        Condition  Skewness  Kurtosis  AIC (GLDII)  AIC (Normal)
U97502_rna1_at   ALL        -0.3710   -0.3981   539.07       541.37
U97502_rna1_at   AML         0.4327   -0.8372   495.63       498.92
D14874_at        ALL        -0.1153   -0.6149   537.81       537.37
D14874_at        AML         0.8593    0.9328   480.46       485.16
U82970_at        ALL        -1.0236    1.2732   567.08       573.52
U82970_at        AML         0.3549   -1.2963   493.02       494.63
In this chapter, we obtain the receiver operating characteristic (ROC) curve by
computing sensitivity and specificity for a range of p-value cutpoints. The area under
the curve (AUC) is used to compare the performance of different methods. A very
good method has a high true positive rate for a given false positive rate, so that the
ROC curve occupies the upper left hand side of the graph with AUC approaching the
ideal of 1.0.
The present chapter makes four main contributions: (1) we propose that the underlying
distribution of a given gene expression profile follows a GLDII, because it provides
a long-tailed alternative to the normal distribution; (2) we propose an approximate
likelihood ratio test (ALRT) method which is more robust than Wald tests, especially
for large sample sizes; (3) the ALRT method is extended to multiclass microarray
data; (4) we perform a comparative study to elucidate key features of different
methods in microarray studies.
3.4. Method
A random variable Y on (−∞,∞) which follows a GLDII has a density function of
the form
f_Y(y) = (α/b) exp(−α(y − µ)/b) / [1 + exp(−(y − µ)/b)]^{α+1};  −∞ < y < ∞  (3.4.1)
where α, µ ∈ (−∞,∞) and b > 0 are shape, location and scale parameters, respec-
tively. When the shape parameter α = 1, the density in (3.4.1) corresponds to the
usual logistic density function. Moments of the GLDII are conveniently obtained via
the moment generating function (MGF). For the standard GLDII, the MGF is
M(t) = Γ(1 + t) Γ(α − t) / Γ(α),

which gives the mean and variance as ψ(α) − ψ(1) and ψ′(α) + ψ′(1), respectively,
where ψ(α) = (d/dα) ln Γ(α). Suppose {x_ig, i = 1, …, n1} and {y_jg, j = 1, …, n2}
denote the expression values observed for the n1 control cases and n2 treatment
cases for the gth gene from the
GLDII. Here we aim to test that the means of the two groups are equal. To test
the hypothesis here we assume that the variances and shape of the two distributions
are equal, which implies α1 = α2 = α (say) and b1 = b2 = b (say) where α1 and α2,
are the shape parameters for the two groups, respectively and b1 and b2 are the scale
parameters for the two groups, respectively. This is not a crucial assumption, because
we assume only that within a gene the variances and shapes of the different groups
are the same while they may differ between genes, and this assumption gives us
computational flexibility. Therefore the null hypothesis becomes H0 : µ1 = µ2, where µ1 and µ2
are the location parameters of the two groups, respectively. For simplicity of
notation, we will write x_i (i = 1, · · · , n1) instead of x_ig and y_j (j = 1, · · · , n2)
instead of y_jg. Following Hossain and Willan (2007), we first estimate the location
and scale parameters in the case where the shape parameter α is known. Denoting
z_i = (x_i − µ1)/b and z_j = (y_j − µ2)/b, the log
likelihood function for the GLDII is
log L = −n1 ln b + ∑_{i=1}^{n1} ln f(z_i) − n2 ln b + ∑_{j=1}^{n2} ln f(z_j),  (3.4.2)
where f(zi) and f(zj) are the density function of the treatment group and control
group, respectively. Let

Φ1(z_i) = ∂/∂z_i ln f(z_i) = (1 − e^{z_i})/(1 + e^{z_i})

and

Φ2(z_j) = ∂/∂z_j ln f(z_j) = (1 − e^{z_j})/(1 + e^{z_j}).
Then, we obtain the likelihood equations for µ1, µ2 and b, from (3.4.2), as follows:

∂log L/∂µ1 = −(1/b) ∑_{i=1}^{n1} Φ1(z_i) = 0,  (3.4.3)

∂log L/∂µ2 = −(1/b) ∑_{j=1}^{n2} Φ2(z_j) = 0,  (3.4.4)

∂log L/∂b = −n1/b − (1/b) ∑_{i=1}^{n1} z_i Φ1(z_i) − n2/b − (1/b) ∑_{j=1}^{n2} z_j Φ2(z_j) = 0.  (3.4.5)
The likelihood equations in (3.4.3) to (3.4.5) are non-linear and do not admit explicit
solutions because of the presence of the terms Φ1(z_i) and Φ2(z_j). Consequently,
numerical methods have to be employed to obtain the MLEs of the parameters. The
potential problem in using numerical methods is that they require starting values
near the global maximum, which makes them difficult to use in microarray data
analysis, where thousands of gene-by-gene tests are performed. Here we propose an
approximate likelihood estimation procedure. A trade-off between approximate
likelihood estimates and full likelihood estimates is given in the paper by Hossain
and Willan (2007). Following Hossain and Willan (2007), we approximate the
functions Φ1(z_i) and Φ2(z_j) by expanding them in a Taylor series around
F^{−1}(p_i) = ν_i and F^{−1}(p_j) = ν_j, respectively; keeping only the first two
terms gives the approximations:
Φ1(z_i) ≈ Φ1(ν_i) + Φ1′(ν_i)(z_i − ν_i)
        = Φ1(ν_i) − ν_i Φ1′(ν_i) + z_i Φ1′(ν_i)
        = A_{1i} − B_{1i} z_i,  (3.4.6)

Φ2(z_j) ≈ Φ2(ν_j) + Φ2′(ν_j)(z_j − ν_j)
        = Φ2(ν_j) − ν_j Φ2′(ν_j) + z_j Φ2′(ν_j)
        = A_{2j} − B_{2j} z_j,  (3.4.7)

where

A_{1i} = Φ1(ν_i) − ν_i Φ1′(ν_i),  B_{1i} = −Φ1′(ν_i),
A_{2j} = Φ2(ν_j) − ν_j Φ2′(ν_j),  B_{2j} = −Φ2′(ν_j),

ν_i = ln(−ln q_i),  ν_j = ln(−ln q_j),

and

p_i = i/(n1 + 1),  q_i = 1 − p_i,  i = 1, 2, · · · , n1,
p_j = j/(n2 + 1),  q_j = 1 − p_j,  j = 1, 2, · · · , n2.
Substituting the approximations (3.4.6) and (3.4.7) into (3.4.3) and (3.4.4), we get
two approximate normal equations:

∂log L/∂µ1 ≈ −(1/b) ∑_{i=1}^{n1} (A_{1i} − B_{1i} z_i) = 0,  (3.4.8)

∂log L/∂µ2 ≈ −(1/b) ∑_{j=1}^{n2} (A_{2j} − B_{2j} z_j) = 0.  (3.4.9)
Now, solving equation (3.4.8), the AMLE of µ1 can be obtained as µ̂1 = V1 − W1 b̂,
where

V1 = ∑_{i=1}^{n1} B_{1i} x_i / ∑_{i=1}^{n1} B_{1i},  W1 = ∑_{i=1}^{n1} A_{1i} / ∑_{i=1}^{n1} B_{1i}.

Similarly, we can obtain the AMLE of µ2 by solving the second normal equation
(3.4.9): µ̂2 = V2 − W2 b̂. Substituting the values of µ̂1 and µ̂2 into the third
approximate normal equation gives a quadratic equation with two roots. Since b
must be positive, the AMLE of b is obtained as

b̂ = [ −λ1 + √(λ1² + 4(n1 + n2) λ2) ] / ( 2(n1 + n2) ),  (3.4.10)
where

λ1 = ∑_{i=1}^{n1} A_{1i}(x_i − V1) + ∑_{j=1}^{n2} A_{2j}(y_j − V2),

λ2 = ∑_{i=1}^{n1} B_{1i}(x_i − V1)² + ∑_{j=1}^{n2} B_{2j}(y_j − V2)².
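The closed-form estimates (3.4.6) to (3.4.10) are easy to compute directly; the following sketch is our own translation of the formulas, with Φ(z) = (1 − e^z)/(1 + e^z) and ν_i = ln(−ln q_i) as above, and returns the AMLEs of µ1, µ2 and b:

```python
import math
import random

def amle_two_sample(x, y):
    """Closed-form AMLEs of mu1, mu2 and b from (3.4.6)-(3.4.10), with
    Phi(z) = (1 - e^z)/(1 + e^z) linearized around nu_i = ln(-ln q_i)."""
    def coeffs(sample):
        n = len(sample)
        A, B = [], []
        for i in range(1, n + 1):
            q = 1.0 - i / (n + 1.0)           # q_i = 1 - p_i, p_i = i/(n+1)
            nu = math.log(-math.log(q))
            e = math.exp(nu)
            phi = (1.0 - e) / (1.0 + e)       # Phi(nu)
            dphi = -2.0 * e / (1.0 + e) ** 2  # Phi'(nu)
            A.append(phi - nu * dphi)         # A = Phi(nu) - nu Phi'(nu)
            B.append(-dphi)                   # B = -Phi'(nu)
        return sorted(sample), A, B           # coefficients go with order statistics
    xs, A1, B1 = coeffs(x)
    ys, A2, B2 = coeffs(y)
    V1 = sum(b * v for b, v in zip(B1, xs)) / sum(B1)
    W1 = sum(A1) / sum(B1)
    V2 = sum(b * v for b, v in zip(B2, ys)) / sum(B2)
    W2 = sum(A2) / sum(B2)
    lam1 = sum(a * (v - V1) for a, v in zip(A1, xs)) + \
           sum(a * (v - V2) for a, v in zip(A2, ys))
    lam2 = sum(b * (v - V1) ** 2 for b, v in zip(B1, xs)) + \
           sum(b * (v - V2) ** 2 for b, v in zip(B2, ys))
    n = len(xs) + len(ys)
    b_hat = (-lam1 + math.sqrt(lam1 ** 2 + 4.0 * n * lam2)) / (2.0 * n)
    return V1 - W1 * b_hat, V2 - W2 * b_hat, b_hat

random.seed(3)
x = [random.gauss(0.0, 1.0) for _ in range(30)]
y = [random.gauss(1.0, 1.0) for _ in range(30)]
mu1, mu2, b = amle_two_sample(x, y)
print(round(mu1, 3), round(mu2, 3), round(b, 3))
```

Because the coefficients depend only on ranks, the estimates are computed in closed form for every gene, which is what makes the gene-by-gene ALRT feasible.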
Now, under H0, we have µ1 = µ2 = µ0 (say). We obtain the likelihood equations for
µ0 and b0, from (3.4.2), as follows:

∂log L/∂µ0 = −(1/b0) ∑_{i=1}^{n1} Φ1(z_i) − (1/b0) ∑_{j=1}^{n2} Φ2(z_j) = 0,  (3.4.11)

∂log L/∂b0 = −n1/b0 − (1/b0) ∑_{i=1}^{n1} z_i Φ1(z_i) − n2/b0 − (1/b0) ∑_{j=1}^{n2} z_j Φ2(z_j) = 0.  (3.4.12)
Following similar steps as above with likelihood equations (3.4.11) and (3.4.12),
the AMLE of µ0 is obtained as

µ̂0 = V0 − W0 b̂0,  (3.4.13)

where

V0 = ( ∑_{i=1}^{n1} B_{1i} x_i + ∑_{j=1}^{n2} B_{2j} y_j ) / ( ∑_{i=1}^{n1} B_{1i} + ∑_{j=1}^{n2} B_{2j} ),

W0 = ( ∑_{i=1}^{n1} A_{1i} + ∑_{j=1}^{n2} A_{2j} ) / ( ∑_{i=1}^{n1} B_{1i} + ∑_{j=1}^{n2} B_{2j} ),

and the AMLE of b0 is obtained as

b̂0 = [ −λ10 + √(λ10² + 4(n1 + n2) λ20) ] / ( 2(n1 + n2) ),  (3.4.14)

where

λ10 = ∑_{i=1}^{n1} A_{1i}(x_i − V0) + ∑_{j=1}^{n2} A_{2j}(y_j − V0),

λ20 = ∑_{i=1}^{n1} B_{1i}(x_i − V0)² + ∑_{j=1}^{n2} B_{2j}(y_j − V0)².
Now, for the parameter α, the estimate is obtained by maximizing the likelihood
over α with the parameters µ1, µ2 and b replaced by µ̂1, µ̂2 and b̂, i.e.
α̂ = argmax_α L(µ̂1, µ̂2, b̂, α) (the profile likelihood method; Diciccio and Tibshirani
(1991)). The ALRT statistic can then be obtained as

Λ = −2 log [ L(µ̂0, b̂0, α̂0) / L(µ̂1, µ̂2, b̂, α̂) ],  (3.4.15)
which is asymptotically $\chi^2_1$. However, the exact distribution of the likelihood ratio corresponding to specific hypotheses is very difficult to determine. A convenient result, though, says that as the sample size $n$ approaches infinity, the test statistic $\Lambda$ for a nested model is asymptotically $\chi^2$ distributed with degrees of freedom equal to the difference in dimensionality of $L(\hat{\mu}_0, \hat{b}_0, \hat{\alpha}_0)$ and $L(\hat{\mu}_1, \hat{\mu}_2, \hat{b}, \hat{\alpha})$ (Cox and Hinkley (1974)). A simulation study has been conducted with 5000 samples from the GLDII under the two conditions. Under the null hypothesis we set both location parameters equal to 0, and both the scale and shape parameters equal to 1; the samples per condition are therefore generated from $Y \sim \mathrm{GLDII}(1, 0, 1)$, with a sample size of 50 per condition. Figure 3.3 displays the histogram of the likelihood ratio statistic values with density scaling. It is seen from the figure that the shape is approximately that of the chi-square distribution with 1 degree of freedom (red dotted line).
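Given the asymptotic $\chi^2_1$ null distribution, a gene-level p-value can be read off the upper tail of the chi-square distribution. A minimal sketch in Python (the $\Lambda$ values below are hypothetical, not taken from the thesis data):

```python
from scipy.stats import chi2

def alrt_pvalue(lam, df=1):
    """Asymptotic p-value for an ALRT statistic: upper chi-square tail."""
    return chi2.sf(lam, df)

# Hypothetical Lambda values for three genes in a two-class comparison (df = 1)
pvals = [alrt_pvalue(lam) for lam in (0.12, 3.84, 10.8)]
```

For a $k$-class comparison the same call applies with df = $k - 1$.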
Figure 3.3: Histogram of likelihood ratio values with density scaling under the null hypothesis
This ALRT method can also be extended to scenarios with more than two classes. A detailed discussion for multiclass microarray data is given in section 3.9. If the measured expression levels for a gene have $k > 2$ classes which are taken to be independent and distributed as GLDII with location parameters $\mu_1, \mu_2, \cdots, \mu_k$ and common scale parameter $b$, the ALRT statistic becomes
\[
\Lambda = -2\log\left[\frac{L(\hat{\mu}_0, \hat{b}_0, \hat{\alpha}_0)}{L(\hat{\mu}_1, \hat{\mu}_2, \cdots, \hat{\mu}_k, \hat{b}, \hat{\alpha})}\right], \tag{3.4.16}
\]
which is asymptotically $\chi^2_{k-1}$.
3.5. Comparison between AMLE and MLE for location and scale parameters of GLDII
A simulation study is conducted with three choices of sample size to compare the AMLE and MLE for the location and scale parameters of GLDII. Table 3.2 shows the average values and variances of the MLEs and AMLEs for sample sizes 10, 20 and 50, determined by simulating data from the standard GLDII with α = 1.5. All averages were computed over 1000 simulations. Comparing the AMLE values in the table with the corresponding entries for the MLEs, we observe that the AMLEs are almost as efficient as the MLEs even for small sample sizes. For the larger sample sizes, the AMLE of the scale parameter shows slightly smaller bias, although its variance remains marginally larger than that of the MLE.
Table 3.2: Average values and variances of MLEs and AMLEs when the data are simulated from GLDII(1.5, 0, 1).

                      MLE                                    AMLE
size     µ̂        b̂        var(µ̂)   var(b̂)     µ̂        b̂        var(µ̂)   var(b̂)
10    -0.02146  0.95327  0.32105  0.06635   -0.02058  0.97748  0.32906  0.07849
20     0.00475  0.98148  0.15893  0.03114    0.00477  0.98988  0.16194  0.03406
50    -0.00125  0.99346  0.06078  0.01069   -0.00123  0.99943  0.06374  0.01393
3.6. FDR Estimation
A nice property of the ALRT method is its connection with the FDR. Because of the extensive computation involved in the ALRT method, we used 500 permutations to estimate the FDR. Though it is possible to get gene-specific p-values from the asymptotic chi-squared distribution of the ALRT statistic, here we propose to estimate the FDR with the test statistic (3.4.15). Similar to the SAM analysis (Tusher et al. (2001)), we evaluated the statistical significance of the ALRT method. We can use the following permutation algorithm to select significant genes and estimate the FDR:
1. For the original data, calculate the $\Lambda_g$ statistic for the $g$th gene ($g = 1, \cdots, m$), and denote the ordered values as $\Lambda^*_{(g)}$.

2. For the $r$th permutation, calculate the $\Lambda_g$ statistics and denote their ordered values as $\Lambda^r_{(g)}$, $r = 1, \cdots, 500$. Denote their averages across all permutations as
\[
\bar{\Lambda}_{(g)} = \frac{1}{500}\sum_{r=1}^{500}\Lambda^r_{(g)}.
\]

3. For a cutoff value $\Delta$, identify as significant the genes with
\[
|\Lambda^*_{(g)} - \bar{\Lambda}_{(g)}| \geq \Delta.
\]
Denote
\[
\Lambda_0 = \max_{\Lambda^*_{(g)} \leq \bar{\Lambda}_{(g)} - \Delta} \Lambda^*_{(g)}, \qquad \Lambda_1 = \min_{\Lambda^*_{(g)} \geq \bar{\Lambda}_{(g)} + \Delta} \Lambda^*_{(g)},
\]
and estimate the expected number of false positives by chance for the $\Lambda_g$ statistic as
\[
V(\Delta) = \frac{\sum_{r=1}^{500}\sum_g \left[ I\{\Lambda^r_{(g)} \geq \Lambda_1\} + I\{\Lambda^r_{(g)} \leq \Lambda_0\} \right]}{500},
\]
where $I\{\cdot\}$ is the indicator function, and the estimated FDR is
\[
\widehat{\mathrm{FDR}}(\Delta) = \frac{V(\Delta)}{R(\Delta)},
\]
where
\[
R(\Delta) = \sum_g I\{|\Lambda^*_{(g)} - \bar{\Lambda}_{(g)}| \geq \Delta\}
\]
is the total number of significant genes.
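The three steps above translate directly into code. This is an illustrative sketch (function and variable names are ours, not from the thesis; `lam_obs` holds the observed Λ values and `lam_perm` the permuted ones):

```python
import numpy as np

def estimate_fdr(lam_obs, lam_perm, delta):
    """Permutation-based FDR estimate for the ALRT statistics.

    lam_obs  -- shape (m,): observed statistics, one per gene
    lam_perm -- shape (R, m): statistics from R permutations of class labels
    delta    -- cutoff applied to |ordered observed - mean ordered permuted|
    """
    R = lam_perm.shape[0]
    obs = np.sort(lam_obs)                 # Lambda*_(g)
    perm = np.sort(lam_perm, axis=1)       # Lambda^r_(g), ordered per permutation
    perm_mean = perm.mean(axis=0)          # Lambda-bar_(g)
    sig = np.abs(obs - perm_mean) >= delta
    r_delta = int(sig.sum())               # R(delta): genes called significant
    if r_delta == 0:
        return 0.0, 0
    lower = obs[obs <= perm_mean - delta]
    upper = obs[obs >= perm_mean + delta]
    lam0 = lower.max() if lower.size else -np.inf   # Lambda_0
    lam1 = upper.min() if upper.size else np.inf    # Lambda_1
    # V(delta): average count of permuted statistics beyond the cut points
    v = ((lam_perm >= lam1).sum() + (lam_perm <= lam0).sum()) / R
    return v / r_delta, r_delta
```

With 500 permutations and the observed Λ values, `estimate_fdr(lam_obs, lam_perm, delta)` returns the estimated FDR and the number of significant genes for a given ∆.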
3.7. Permutation based p-values and AUC Estimation
In cases where we are unwilling to assume a null distribution, or are not able to identify the distribution of the test statistic, we can obtain an assessment of the p-value via permutation. The permuted p-values for each gene can be calculated using permutations of the class labels. For a range of cutoff values, the numbers of false-positive and false-negative results are calculated for each method from the permuted p-values. For each method, a sufficient range of cutoffs was chosen to be able to calculate the area under the receiver operating characteristic (ROC) curve. Seng et al. (2008) also evaluated their methods on simulated data by means of ROC curves, as did Hu et al. (2006). The ROC curve is a plot of the true positive fraction (sensitivity) versus the false positive fraction (1 − specificity) over a continuously varying decision threshold. A method with good discrimination will rank the true positives above the true negatives and will therefore have a true positive fraction greater than or equal to the false positive fraction at all points. The AUCi function from the ROC package of R is used to calculate the AUC after supplying the sensitivity and specificity values to the function.
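Both steps, permutation p-values and the ROC summary, can be sketched as follows. The code computes a permuted p-value per gene and then the AUC by the trapezoidal rule over p-value cutoffs (names are illustrative; `is_de` marks the genes known to be DE in a simulation):

```python
import numpy as np

def perm_pvalues(t_obs, t_perm):
    """Permutation p-value per gene: fraction of permuted |statistics|
    at least as extreme as the observed one (with the +1 correction)."""
    R = t_perm.shape[0]
    return ((np.abs(t_perm) >= np.abs(t_obs)).sum(axis=0) + 1.0) / (R + 1.0)

def roc_auc(p_values, is_de):
    """AUC of the ROC curve traced by thresholding the p-values,
    given the true DE status of each gene (known in simulations)."""
    cuts = np.concatenate(([0.0], np.sort(np.unique(p_values)), [1.0]))
    tpr = np.array([(p_values[is_de] <= c).mean() for c in cuts])
    fpr = np.array([(p_values[~is_de] <= c).mean() for c in cuts])
    # trapezoidal rule over the (FPR, TPR) points
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

A method that gives all DE genes smaller p-values than all non-DE genes attains an AUC of 1; random ranking gives 0.5.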
3.8. Comparison with Other Methods

We evaluate the performance of the ALRT method using simulated data as well as two published datasets. Here we consider SAM using the Wilcoxon statistic (S-W) and empirical Bayes t-statistics (E-B) for comparison, because these two methods are widely used in the microarray gene selection literature.
3.8.1. Simulation Experiment

The performance of our proposed method is evaluated using simulated data generated from three distributions, incorporating variability, treatment effects, shape effects and sample size effects. We consider scenarios having two conditions, treatment versus control, and sample sizes of 10 and 25 per condition. We generate data for 1000 genes and set the proportion of DE genes at 0.1.

First we simulate data for the control and treatment groups from a GLDII; that is, the expression values for a gene are generated from $Y \sim \mathrm{GLDII}_m(\alpha, 0, b)$. The GLDII covers symmetric as well as asymmetric distributions, which is practical for microarray experiments. Although in reality genes do interact with each other, the independence assumption is a useful simplification.
1. We consider two types of variability by sampling $b$ for each gene from an inverse Gamma distribution. Note that when $b \sim 1/\mathrm{Gamma}(a_0, a_0)$, the mean is $E(b) = \frac{a_0}{a_0-1}$ and the variance is $\mathrm{Var}(b) = \frac{a_0^2}{(a_0-1)^2(a_0-2)}$. For moderate variances we use $a_0 = 30$, and for high variances we use $a_0 = 5$.

2. We consider four types of shape by sampling $\alpha$ for each gene from a Gamma distribution. For negatively skewed expression we use $\alpha \sim \mathrm{Gamma}(50, 100)$, for symmetric expression $\alpha \sim \mathrm{Gamma}(50, 50)$, for moderately positively skewed expression $\alpha \sim \mathrm{Gamma}(6, 2)$, and for highly positively skewed expression $\alpha \sim \mathrm{Gamma}(10, 2)$.

3. Treatment group data for DE genes are drawn from a $\mathrm{GLDII}_m(\alpha, \delta, b)$ distribution, where the treatment effects $\delta$ are sampled from a $\mathrm{Gamma}(a_1, b_1)$ distribution. To allow the variance of $\delta$ ($= a_1/b_1^2$) to increase with the mean of $\delta$ ($= a_1/b_1$), and the mean of $\delta$ to be 0.5, 1 and 2, respectively, we sample $\delta \sim \mathrm{Gamma}(4, 8)$, $\delta \sim \mathrm{Gamma}(4, 4)$ and $\delta \sim \mathrm{Gamma}(8, 4)$.
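The gene-level parameter draws in steps 1–3 can be sketched as follows. The GLDII expression values themselves are not sampled here, only the per-gene parameters; all settings follow the (shape, rate) Gamma convention used above, and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2011)
m = 1000                      # genes
n_de = int(0.1 * m)           # proportion of DE genes = 0.1

# Step 1: scale b ~ 1/Gamma(a0, a0); a0 = 5 gives the high-variability setting
a0 = 5
b = 1.0 / rng.gamma(shape=a0, scale=1.0 / a0, size=m)

# Step 2: shape alpha ~ Gamma(50, 50), the symmetric-expression setting
alpha = rng.gamma(shape=50, scale=1.0 / 50, size=m)

# Step 3: treatment effects for DE genes only, delta ~ Gamma(4, 8), E(delta) = 0.5
delta = np.zeros(m)
delta[:n_de] = rng.gamma(shape=4, scale=1.0 / 8, size=n_de)
```

With these draws, control expressions for gene $g$ would come from $\mathrm{GLDII}(\alpha_g, 0, b_g)$ and treatment expressions from $\mathrm{GLDII}(\alpha_g, \delta_g, b_g)$.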
To compare performance, we use receiver operating characteristic (ROC) curves, where the test sensitivities and specificities (true positive and true negative proportions) for a range of p-value cutoffs are averaged over 500 simulated datasets. The AUC can be interpreted as the true positive rate averaged uniformly over the range of false positive rates. Tables 3.3 and 3.4 show the AUCs for the different simulation models under the two types of variability, respectively. As seen in Tables 3.3 and 3.4, the ALRT assuming GLDII (ALRT(G)) provides the best results across all settings of treatment effects, variability, shape effects and sample sizes. For instance, when the simulation is made from GLDII with the scale parameter from inverse Gamma(5, 5), a sample size of 10 per group, an expected treatment effect of 0.5 and an expected shape parameter of 5, the ALRT(G), E-B and S-W statistics give AUCs of 0.6123, 0.5814 and 0.5763, respectively, and the standard errors of these AUCs are 0.0931, 0.1006 and 0.1148, respectively. ALRT(G) performs consistently and provides a more reliable method for identifying differential expression of genes. As expected, all the methods perform well when the sample sizes and treatment effects are large. Again, when the expression of a gene comes from a symmetric distribution, i.e., $E(\alpha) = 1$, ALRT(G), ALRT with the logistic distribution (ALRT(L)), and E-B provide similar results and outperform the S-W statistic. When the expression of a gene is skewed, the AUCs of the S-W statistic increase across all simulation settings.
Secondly, we generate data from the extreme value distribution; that is, expressions for a gene are generated from $Y \sim \mathrm{EV}_m(0, b')$. We consider scale parameters $b' = 1$ and $b' = 2$ in the simulation to allow extreme values on the left and the right side of the distribution, respectively. We allow the treatment effect ($\delta$) and the sample size effect ($n$) to vary as in the previous scenario. One added comparison is made by considering unequal sample sizes ($n_1 = 10$, $n_2 = 15$). The AUCs under the different simulation settings are given in Table 3.5. It appears that all the methods perform better as the sample sizes increase. Overall, it is apparent from the table that the ALRT(G) gives the best results across all settings of treatment effects, scale effects and sample sizes.
Thirdly, we use the normal distribution to generate the data; that is, expressions for a gene are generated from $Y \sim \mathrm{normal}_m(0, \sigma^2)$. Here we also consider 1000 genes, of which 100 are differentially expressed with a constant treatment effect ($\delta$) equal to 1.0. We sample $\sigma^2$ for each gene from an inverse Gamma distribution, i.e., $\sigma^2 \sim 1/\mathrm{Gamma}(a_{00}, a_{00})$. For moderate variances we use $a_{00} = 21$, and for high variances we use $a_{00} = 3$. We also randomly add noise, $\eta$, to 20% of the samples and 20% of the genes. To allow for different noise effects in the expression values we consider two settings, $\eta \sim \mathrm{Gamma}(4, 8)$ and $\eta \sim \mathrm{Gamma}(4, 4)$. The AUCs under the different simulation settings are summarized in Table 3.6. It can be seen that the results from the ALRT(G) method and the E-B t-statistic are close, though in the presence of noise there may be a slight loss of power (i.e., a reduced true positive rate) in using E-B compared to the ALRT(G) method. This is because the differences between mean levels used in the numerator of the E-B t-statistic are more sensitive to noise, and the resulting p-values may not be accurate. The ALRT method is not much affected by the noise. It is seen from the table that the relative performance of the ALRT method improves when the noise effect in the data is high, i.e., when $E(\eta) = 1$. It should be noted that the presence of noise is very common in real microarray data. Therefore, even when the normality assumption holds for a given microarray dataset, the ALRT method with the GLDII distributional assumption performs well for identifying differentially expressed genes.
Table 3.3: AUCs when the assumed simulation model is GLDII with high variance of expression values (i.e., b ∼ 1/Gamma(5, 5))

                 n1 = 10, n2 = 10                     n1 = 25, n2 = 25
E(δ)  E(α)  ALRT(G)  ALRT(L)  E-B     S-W      ALRT(G)  ALRT(L)  E-B     S-W
0.5   0.5   0.5528   0.5415   0.5425  0.5337   0.5796   0.5684   0.5697  0.5535
      1     0.5809   0.5707   0.5711  0.5559   0.6351   0.6247   0.6223  0.6136
      3     0.5999   0.5792   0.5832  0.5715   0.6948   0.6736   0.6742  0.6619
      5     0.6123   0.5799   0.5814  0.5763   0.7224   0.6838   0.6862  0.6803
1     0.5   0.5978   0.5923   0.5889  0.5768   0.7051   0.6840   0.6813  0.6907
      1     0.6428   0.6289   0.6386  0.6262   0.7988   0.7886   0.7934  0.7768
      3     0.6955   0.6806   0.6826  0.6481   0.8657   0.8364   0.8483  0.8458
      5     0.7071   0.6788   0.6817  0.6631   0.8826   0.8347   0.8511  0.8509
2     0.5   0.6937   0.6898   0.6871  0.6815   0.8821   0.8582   0.8634  0.8631
      1     0.7783   0.7579   0.7611  0.7462   0.9289   0.9253   0.9262  0.9217
      3     0.8419   0.8065   0.8278  0.7939   0.9574   0.9449   0.9500  0.9485
      5     0.8443   0.8101   0.8265  0.8007   0.9639   0.9487   0.9494  0.9468
Table 3.4: AUCs when the assumed simulation model is GLDII with moderate variance of expression values (i.e., b ∼ 1/Gamma(30, 30))

                 n1 = 10, n2 = 10                     n1 = 25, n2 = 25
E(δ)  E(α)  ALRT(G)  ALRT(L)  E-B     S-W      ALRT(G)  ALRT(L)  E-B     S-W
0.5   0.5   0.5545   0.5453   0.5421  0.5398   0.5842   0.5651   0.5623  0.5694
      1     0.5653   0.5601   0.5629  0.5414   0.6431   0.6314   0.6313  0.6207
      3     0.6007   0.5805   0.5888  0.5563   0.7015   0.6736   0.6750  0.6633
      5     0.6176   0.5891   0.5912  0.5674   0.7329   0.6867   0.6866  0.6833
1     0.5   0.6029   0.5920   0.5831  0.5793   0.7512   0.7348   0.7191  0.7421
      1     0.6432   0.6391   0.6409  0.6187   0.8246   0.8168   0.8173  0.7937
      3     0.6979   0.6733   0.6784  0.6441   0.9056   0.8720   0.8762  0.8648
      5     0.7134   0.6917   0.6983  0.6639   0.9245   0.8784   0.8797  0.8726
2     0.5   0.7221   0.7153   0.7142  0.7008   0.9396   0.9219   0.9198  0.9247
      1     0.8037   0.7982   0.8031  0.7898   0.9669   0.9633   0.9622  0.9551
      3     0.8771   0.8595   0.8577  0.8223   0.9775   0.9562   0.9607  0.9566
      5     0.8948   0.8624   0.8614  0.8517   0.9808   0.9699   0.9735  0.9729
Table 3.5: AUCs when the assumed simulation model is Extreme Value

              n1 = 10, n2 = 10               n1 = 10, n2 = 15               n1 = 25, n2 = 25
E(δ)  b′  ALRT(G) ALRT(L) E-B   S-W    ALRT(G) ALRT(L) E-B   S-W    ALRT(G) ALRT(L) E-B   S-W
0.5   1   0.564   0.558   0.554 0.543  0.587   0.568   0.567 0.560  0.609   0.579   0.581 0.576
      2   0.572   0.559   0.557 0.551  0.601   0.577   0.578 0.569  0.618   0.585   0.586 0.581
1     1   0.641   0.623   0.624 0.614  0.659   0.628   0.630 0.621  0.797   0.771   0.773 0.764
      2   0.649   0.629   0.626 0.619  0.665   0.632   0.634 0.627  0.814   0.785   0.786 0.774
2     1   0.707   0.682   0.685 0.676  0.724   0.695   0.694 0.686  0.845   0.811   0.813 0.807
      2   0.724   0.704   0.701 0.698  0.733   0.708   0.716 0.700  0.867   0.835   0.838 0.831
Table 3.6: AUCs when the assumed simulation model is normal

                n1 = 10, n2 = 10               n1 = 10, n2 = 15               n1 = 25, n2 = 25
E(η)  E(σ²) ALRT(G) ALRT(L) E-B   S-W    ALRT(G) ALRT(L) E-B   S-W    ALRT(G) ALRT(L) E-B   S-W
0.5   1.5   0.623   0.621   0.623 0.609  0.648   0.639   0.647 0.633  0.805   0.801   0.804 0.791
      1.05  0.643   0.641   0.641 0.629  0.673   0.668   0.672 0.663  0.810   0.802   0.807 0.801
1     1.5   0.626   0.622   0.625 0.610  0.646   0.640   0.647 0.638  0.795   0.791   0.793 0.788
      1.05  0.646   0.639   0.642 0.628  0.677   0.671   0.675 0.661  0.817   0.810   0.812 0.800
3.8.2. Duchenne Muscular Dystrophy (DMD) Data

Haslett et al. (2002) examined the pathogenic pathways and identified new or modifying factors involved in Duchenne Muscular Dystrophy (DMD). They used expression microarrays to compare individual gene expression profiles of skeletal muscle biopsies from 12 DMD patients and 12 unaffected control patients. Affymetrix GeneChip Ver. 5.0 software (MAS5.0) was used for raw data processing to obtain signal intensities, which were normalized with a linear regression. They used the geometric fold change method to test for differential expression. The differential expression of 12 genes (13 probesets) was confirmed by quantitative RT-PCR analysis of seven DMD biopsies and four unaffected biopsies. In this re-analysis we use only 23 arrays (only 11 DMD arrays), since one file was truncated. Therefore, we have 12625 genes in the dataset, with 12 samples in the normal group and 11 samples in the DMD patient group.
Raw data are converted to signal estimates using MAS5 by Affymetrix Inc. (2002), which is implemented in the affy package in Bioconductor written by Gautier et al. (2004). It is also possible to use dChip, RMA or GCRMA for normalization, and to perform additional normalization to make the underlying distribution more symmetric. Here we used MAS5 because our focus is to compare the performance of the methods on noisy data. An analysis of these data is also given in Hu et al. (2006).
The estimated shape parameter of the GLDII for 50% of the genes lies between 0.838 (1st quartile) and 1.180 (3rd quartile); for these genes the underlying distribution of the gene expressions is therefore close to symmetric. The lowest and highest estimated shape parameters for the GLDII are 0.498 and 2.310, respectively. Figure 3.4 depicts the ROC curves for each of the four methods: E-B, SAM, ALRT(L) and ALRT(G). For a fair comparison of these four methods we calculate the p-values after 500 permutations, and different cutoffs are chosen so as to be able to calculate the area under the ROC curve. For the selected cutoffs, all four methods produce sets of false positive genes which are mostly similar or differ by only one or two genes. We can see from the ROC curves that overall they have very similar performance. Therefore, comparing all the DE methods in the analysis of the DMD data, ALRT(G) performs competitively with other widely used methods. A method claiming a large number of differentially expressed genes is not considered superior unless it also produces a relatively small number of false positives. All 12 genes are found by all four methods at a fixed cutoff p-value of 0.05. Histograms of permuted p-values from each of the four methods are provided in Figure 3.5. Compared with the other methods, the ALRT(G) method gives fewer significant genes for a fixed p-value cutoff. Figure 3.6 displays the number of significant genes corresponding to the estimated FDR values for the SAM and ALRT(G) methods. It is seen from the figure that the ALRT(G) method produces fewer significant genes for a fixed estimated FDR value, which suggests that ALRT(G) produces fewer false positive genes compared to the SAM method. This dataset therefore suggests that it may be beneficial to use the ALRT(G) method for marker or gene identification when higher confidence in the selected genes is desired.
Figure 3.4: ROC plots for test of DE in the Duchenne Muscular Dystrophy data
Figure 3.5: Histogram of permuted p-values for test of DE in the Duchenne Muscular Dystrophy data
Figure 3.6: Number of significant genes against estimated FDR values for the Duchenne Muscular Dystrophy data
3.8.3. Golub Leukemia Data: Classification Between ALL and AML

Golub et al. (1999) used gene expression to discriminate between two types of leukemia, ALL and AML. Many authors have since analyzed their data using different methodologies. The training dataset consists of 27 ALL and 11 AML subjects, and the test dataset consists of 20 ALL and 14 AML subjects. The expression of 7129 genes was originally measured. Here we have merged the training and testing samples, giving a total of 72 samples for our analysis. Geman et al. (2004) also combined the test and training sets to estimate the error by leave-one-out cross-validation. A normalized version of the Golub leukemia data is taken from the R package hddplot. Our primary interest is to select important genes and use them to classify the two types of leukemia.
Figure 3.7 compares the expected number of false positives for the four methods corresponding to different values of the estimated FDR. We can see that the t statistic and the ALRT methods have very similar performance. Again, the SAM statistic and
Figure 3.7: Expected number of false positives corresponding to estimated FDR values
the ALRT method with the generalized distribution provide very similar performance until the estimated FDR reaches 0.5, with the ALRT(G) method performing better than the SAM statistic at higher FDR values (> 0.5).
Selecting a small number of relevant genes for accurate classification of samples is essential for the development of diagnostic tests. Different gene selection algorithms can select different relevant genes and lead to different classification accuracies. We assessed the performance of the four gene selection methods, E-B, SAM, ALRT(L) and ALRT(G), by selecting different numbers of genes and using these genes for classification with a simple Gaussian maximum likelihood discriminant rule with diagonal class covariance matrices. A detailed discussion of this discriminant rule is given in Dudoit et al. (2002). All the approaches involve a split of the data into training and test samples: the classification rule is developed on the training sample, and its performance is determined on the test sample. We separate the whole dataset into three folds, train each classifier on two folds and test it on the remaining
Figure 3.8: Average misclassification error rate for the leukemia dataset, shown against the number of top-ranked genes
one. The top-ranking genes, of a specified number between 5 and 40, are used to create the classification rule. Errors for a given classification relative to the known truth are then calculated by the classError function of the R package mclust. The top-ranked genes are selected and classification errors are measured within cross-validation. The performance of the methods is evaluated by taking the average misclassification error on the test samples. The average misclassification error is shown against the number of top-ranked genes in Figure 3.8. It is seen from the figure that the ALRT(G) method performs better than the E-B and SAM methods when very few genes are used for classification. The classification performance improves for all methods when more genes are used in the classification procedure. The results also indicate that the performance of the ALRT(G), E-B and SAM methods agree closely when many of the top 25 or 30 significant genes are of interest. Therefore the ALRT(G) can effectively be used for the problem of identifying important genes in these microarray data.
3.9. Multiclass Microarray Data

Multiclass microarray analysis, in which the data consist of more than two classes, is rapidly gaining attention in the literature (Yeung et al. (2005)). Suppose the measured expression levels for a gene have $k > 2$ classes which are taken to be independent and distributed as GLDII with location parameters $\mu_1, \mu_2, \cdots, \mu_k$ and common scale parameter $b$. For $k$ classes the ALRT statistic becomes
\[
\Lambda = -2\log\left[\frac{L(\hat{\mu}_0, \hat{b}_0, \hat{\alpha}_0)}{L(\hat{\mu}_1, \hat{\mu}_2, \cdots, \hat{\mu}_k, \hat{b}, \hat{\alpha})}\right],
\]
which is asymptotically $\chi^2_{k-1}$, and
\[
\hat{\mu}_0 = K_0 - L_0\hat{b}_0,
\]
where
\[
K_0 = \frac{\sum_{i=1}^{n_1} B_{1i}x_i + \sum_{j=1}^{n_2} B_{2j}y_j + \cdots + \sum_{l=1}^{n_k} B_{kl}w_l}{\sum_{i=1}^{n_1} B_{1i} + \sum_{j=1}^{n_2} B_{2j} + \cdots + \sum_{l=1}^{n_k} B_{kl}}
\]
and
\[
L_0 = \frac{\sum_{i=1}^{n_1} A_{1i} + \sum_{j=1}^{n_2} A_{2j} + \cdots + \sum_{l=1}^{n_k} A_{kl}}{\sum_{i=1}^{n_1} B_{1i} + \sum_{j=1}^{n_2} B_{2j} + \cdots + \sum_{l=1}^{n_k} B_{kl}}.
\]
Also
\[
\hat{b}_0 = \frac{-\lambda_{10} + \sqrt{\lambda_{10}^2 + 4(n_1 + n_2 + \cdots + n_k)\lambda_{20}}}{2(n_1 + n_2 + \cdots + n_k)},
\]
where
\[
\lambda_{10} = \sum_{i=1}^{n_1} A_{1i}(x_i - K_0) + \sum_{j=1}^{n_2} A_{2j}(y_j - K_0) + \cdots + \sum_{l=1}^{n_k} A_{kl}(w_l - K_0),
\]
\[
\lambda_{20} = \sum_{i=1}^{n_1} B_{1i}(x_i - K_0)^2 + \sum_{j=1}^{n_2} B_{2j}(y_j - K_0)^2 + \cdots + \sum_{l=1}^{n_k} B_{kl}(w_l - K_0)^2.
\]
Under the alternative,
\[
\hat{\mu}_u = K_u - L_u\hat{b}, \qquad u = 1, 2, \cdots, k,
\]
where
\[
K_u = \frac{B_{u1}y_{u1} + B_{u2}y_{u2} + \cdots + B_{un_u}y_{un_u}}{B_{u1} + B_{u2} + \cdots + B_{un_u}}
\]
and
\[
L_u = \frac{A_{u1} + A_{u2} + \cdots + A_{un_u}}{B_{u1} + B_{u2} + \cdots + B_{un_u}},
\]
with $y_{u1}, \cdots, y_{un_u}$ denoting the observations of class $u$. Finally,
\[
\hat{b} = \frac{-\lambda_1 + \sqrt{\lambda_1^2 + 4(n_1 + n_2 + \cdots + n_k)\lambda_2}}{2(n_1 + n_2 + \cdots + n_k)},
\]
where
\[
\lambda_1 = \sum_{i=1}^{n_1} A_{1i}(x_i - K_1) + \sum_{j=1}^{n_2} A_{2j}(y_j - K_2) + \cdots + \sum_{l=1}^{n_k} A_{kl}(w_l - K_k),
\]
\[
\lambda_2 = \sum_{i=1}^{n_1} B_{1i}(x_i - K_1)^2 + \sum_{j=1}^{n_2} B_{2j}(y_j - K_2)^2 + \cdots + \sum_{l=1}^{n_k} B_{kl}(w_l - K_k)^2,
\]
and $\hat{\alpha}$ is obtained from the likelihood function by maximizing over $\alpha$ with the parameters $\mu_1, \cdots, \mu_k$ and $b$ replaced by $\hat{\mu}_1, \cdots, \hat{\mu}_k$ and $\hat{b}$, i.e., $\hat{\alpha} = \arg\max_\alpha L(\hat{\mu}_1, \cdots, \hat{\mu}_k, \hat{b}, \alpha)$.
3.9.1. Example of Multi-class Microarray Data: SRBCT Dataset

We have applied the gene selection methods to the small round blue cell tumor (SRBCT) dataset (Khan et al. (2001)). The dataset is analyzed from the R package sda. The SRBCT microarray data measured the expression levels of 2308 genes for 88 samples: four tumor types, Burkitt lymphoma (BL, 11 samples), Ewing sarcoma (EWS, 29 samples), neuroblastoma (NB, 18 samples) and rhabdomyosarcoma (RMS, 25 samples), plus 5 other (non-SRBCT) samples. We excluded the 5 non-SRBCT samples and analyzed the remaining dataset of 83 samples from the four tumor types. This dataset was also analyzed by Yang et al. (2006). Here we compare our methods with the F statistic and a multivariate version of the SAM statistic. The R package DEDS provides the function comp.F for the computation of F statistics, and the samr function from the SAMR package is used to calculate the SAM statistic for the multiclass problem type. The misclassification errors were found by choosing the top K genes (K = 10, 15, 20, 25 and 30) according to the statistic values of the different methods. We evaluate the classification performance using a simple Gaussian maximum likelihood discriminant rule with diagonal class covariance matrices. A 5-fold cross-validation is used to obtain the classification errors. The form of the algorithm is as follows:
1. Randomly divide the samples into 5 partitions (subsamples).

2. Of the 5 partitions, a single subsample is retained as the validation data for testing the model, and the remaining 4 subsamples are used as training data.

3. The cross-validation process is repeated 5 times (the folds), with each of the 5 subsamples used exactly once as the validation data.

4. Select the given number of top-ranked genes from the training data and use these genes for classification with a simple Gaussian maximum likelihood discriminant rule with diagonal class covariance matrices.

5. Errors for a given classification relative to the known truth are then calculated by the classError function of the R package mclust.

6. Report the average error over all 5 test sets.
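The steps above can be sketched as follows. The F statistic stands in for any of the ranking methods, and the diagonal Gaussian rule is the discriminant described in the text; all function names are illustrative, and the fold assignment is simplified to striding over a random permutation:

```python
import numpy as np

def diag_gaussian_fit(X, y):
    """Per-class means and diagonal variances (diagonal class covariances)."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    vars_ = {c: X[y == c].var(axis=0) + 1e-8 for c in classes}  # guard zero variance
    return classes, means, vars_

def diag_gaussian_predict(X, classes, means, vars_):
    """Gaussian ML discriminant rule: pick the class with highest log-likelihood."""
    ll = np.stack([
        -0.5 * (np.log(vars_[c]) + (X - means[c]) ** 2 / vars_[c]).sum(axis=1)
        for c in classes
    ])
    return classes[np.argmax(ll, axis=0)]

def f_statistic(X, y):
    """One-way ANOVA F per gene, used here as the gene-ranking score."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = sum((y == c).sum() * (X[y == c].mean(axis=0) - overall) ** 2
                  for c in classes) / (len(classes) - 1)
    within = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
                 for c in classes) / (len(y) - len(classes))
    return between / (within + 1e-12)

def cv_error(X, y, k_top=10, folds=5, seed=0):
    """Average misclassification error, ranking genes inside each training fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errs = []
    for f in range(folds):
        test = idx[f::folds]
        train = np.setdiff1d(idx, test)
        top = np.argsort(f_statistic(X[train], y[train]))[::-1][:k_top]
        cls, mu, va = diag_gaussian_fit(X[train][:, top], y[train])
        pred = diag_gaussian_predict(X[test][:, top], cls, mu, va)
        errs.append(np.mean(pred != y[test]))
    return float(np.mean(errs))
```

Note that the gene ranking is recomputed inside each training fold; ranking on the full data before cross-validation would leak information from the test folds and bias the error estimate downward.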
The mean misclassification errors for the different methods are summarized in Table 3.7. It appears from the results that the ALRT method with GLDII provides the best results among all the methods, producing the lowest misclassification error. Therefore the ALRT method with GLDII can be used effectively on this dataset for identifying the important genes in microarray data analysis.
Table 3.7: Average misclassification error for the SRBCT dataset

K    F (DEDS)  SAM (SAMR)  ALRT(L)  ALRT(GLDII)
10   0.325     0.311       0.316    0.305
15   0.297     0.290       0.292    0.285
20   0.271     0.268       0.267    0.265
25   0.265     0.260       0.263    0.258
30   0.258     0.252       0.256    0.253
3.10. Discussion

In this chapter, we propose a new method for detecting differentially expressed genes in microarray data. The method is based on a flexible distribution, the GLDII, for which an approximate likelihood ratio test (ALRT) is developed. We discuss the use of the ALRT method under the assumption that the underlying distribution is GLDII, and show comparable performance to the SAM method, SAM using the Wilcoxon rank statistic, and the E-B t-statistic, with applications to simulated data and two real datasets. The ALRT method with the GLDII distributional assumption appears to provide a favorable fit to the data across all simulation settings. In the presence of noise in gene expression data, Wald-type statistical tests of the differences between mean levels for each gene are more sensitive to the assumed distributional form, and the resulting p-values may not be accurate. Based on our simulations, our method appears more favorable in applications than the S-W method and the E-B t-statistic even when the sample sizes for the two groups are small (n = 10), and appears slightly more powerful in large samples. An overfitting problem can arise in using our method, with its three GLDII parameters, when the sample size is very small; a large sample size is therefore preferable, since the desirable properties of the estimators are justified in large-sample situations. The results from the simulation studies are considerably affected by the shape of the distribution. It is therefore critical to generate an underlying null distribution as close as possible to real microarray data, because a gene's statistical significance can be dramatically different under different underlying null distributions. The strength of the ALRT(G) method is its flexibility in adapting the shape to the underlying distribution of the expression values.
The DMD data analysis and the Golub leukemia data analysis show that our method performs well in comparison to other methods. SAM statistics or E-B t-statistics are often used for testing each gene's differential expression, but these methods require a large number of samples to produce reasonable estimates. Furthermore, the comparison of means can be greatly influenced by outliers with dramatically smaller or larger expression intensities. Gene discovery based on these Wald-type statistics can be misleading due to different error variances under different biological conditions and/or over different intensity ranges of microarray expression. Our ALRT(G) method is found to be favorable across all settings of treatment effect, variability, sample size and noise effect. The ALRT method also has the advantage that it can handle multiclass microarray datasets.
The main motivation for using the GLDII is its closed-form solution for microarray data. The use of the GLDII is particularly desirable in microarray data analysis for the stability of its tail probabilities, which play an important role in assessing statistical significance. The ALRT method using GLDII is part of an emerging literature that attempts to improve statistical tests for DE, and this approach may prove useful in a number of other genome-wide estimation and inference problems.
The ALRT method performs as well as or better than the Wald-type test statistic for testing the differences between two experimental conditions. The ALRT method with GLDII has the added advantage that it provides a flexible model that can accommodate both symmetric and asymmetric structures in the data. Due to this flexibility, the ALRT method with GLDII presents a viable alternative for finding differentially expressed genes in microarray studies. We assumed that the data from different genes are independent, which is unlikely to hold in microarray data, since the functions of many genes are interrelated in varying degrees. However, the ALRT method can be adjusted by borrowing information from all the genes. It may be possible to get more robust results with the ALRT method by borrowing strength from genes in local intensity regions for the estimation of the shape and scale parameters of the GLDII. For example, one could use a Wald test for comparing the two location parameters of the GLDII and smooth the variance by adding an offset to the denominator, similar to what is done for the SAM statistic. The test statistic would then be
\[
\frac{\hat{\mu}_1 - \hat{\mu}_2}{\mathrm{SE}(\hat{\mu}_1 - \hat{\mu}_2) + s_0},
\]
where $\hat{\mu}_1$ and $\hat{\mu}_2$ are the approximate location parameter estimates of the GLDII for the treatment and control groups, respectively, and $\mathrm{SE}(\cdot)$ is the standard error of the difference between the two approximate location parameter estimates. We can take the offset $s_0$ as the 90th percentile of the standard errors of the mean difference between the two conditions, following the approach of Efron et al. (2001).
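This moderated statistic is simple to express in code. A sketch follows; the location estimates and standard errors would come from the GLDII fits, and here they are placeholder arrays:

```python
import numpy as np

def moderated_stat(mu1, mu2, se, s0=None):
    """Difference of location estimates with an offset s0 in the denominator.
    By default s0 is the 90th percentile of the standard errors, following
    the Efron et al. (2001) choice described in the text."""
    mu1, mu2, se = map(np.asarray, (mu1, mu2, se))
    if s0 is None:
        s0 = np.percentile(se, 90)
    return (mu1 - mu2) / (se + s0)
```

The effect of the offset is that genes with very small standard errors are no longer inflated into large statistics, because $s_0$ bounds the denominator away from zero.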
CHAPTER 4
Nonparametric Method for Detecting
Differentially Expressed Genes: Single
Gene Analysis
In this chapter, we propose a flexible rank-based nonparametric procedure for analyzing microarray data. In the method we propose a statistic for testing whether the area under the receiver operating characteristic curve (AUC) for each gene is equal to 0.5, allowing a different variance for each gene. The contribution of this “single-gene” statistic is the studentization of the empirical AUC, which takes into account the variance associated with each gene in the experiment. DeLong et al. (1988) proposed a nonparametric procedure for calculating a consistent variance estimator of the AUC. We use their variance estimation technique to obtain a test statistic, and we focus on the primary step in the gene selection process, namely the ranking of genes with respect to a statistical measure of differential expression. Two real datasets are analyzed to illustrate the methods, and a simulation study is carried out to assess the relative performance of different statistical gene ranking measures. The work includes how to use the variance information to produce a list of significant targets and assess differential gene expression under two conditions. The proposed method does not involve complicated formulas and does not require advanced programming skills. We conclude that the proposed methods offer useful analytical tools for identifying differentially expressed genes for further biological and clinical analysis.
4.1. Introduction

A variety of gene selection methods have been developed in the last few years. Among them, some methods assume explicit statistical models for the gene expression data; these are called parametric methods. Other methods do not assume any specific distributional model for the gene expression data, and they are referred to as nonparametric gene selection methods. For example, Pepe et al. (2003) proposed two measures related to the ROC curve for ranking genes (or proteins) with regard to differential expression between tissues: the AUC and the partial AUC(t0) (i.e., pAUC), where t0 is some small false positive rate. Xiong et al. (2001) suggested a method to select genes by searching the space of feature subsets using classification errors. Recently, Jeffery et al. (2006) compared the efficiency of 10 gene selection methods, including both parametric and nonparametric methods. It has been reported that the results of nonparametric gene selection methods may be influenced by the classification methods chosen for scoring the genes (Troyanskaya et al. (2002)). Nonetheless, model-based gene selection methods lack adaptability, because it is often impossible to construct a universal probabilistic model that is suitable for all kinds of gene expression data, where noise and variance may vary dramatically across different datasets (Troyanskaya et al. (2002)). In this sense, nonparametric gene selection methods are more desirable than model-based ones. Here we propose a new gene selection method which does not assume any explicit statistical model for the gene expression values.
An assessment of the expression of a gene can be made through the use of a
receiver operating characteristic (ROC) curve. If a gene could perfectly discriminate
between two conditions, then there would be an expression level above which the entire
treatment population would fall and below which all control expressions would fall, or
vice versa. The curve would then pass through the point (0,1) on the unit grid. The
closer an ROC curve comes to this ideal point, the better its discriminating ability.
A gene with
no discriminating ability will produce a curve that follows the diagonal of the grid.
Pepe et al. (2003) argue that two measures related to the ROC curve are suitable
for ranking genes with regard to differential expression between two conditions: the
area under the ROC curve (AUC) and the partial AUC (pAUC).
For continuous data, the nonparametric ROC curve may be preferred since it
passes through all observed points and provides unbiased estimates of sensitivity,
specificity, and AUC in large samples (Zweig et al. (1993)). More importantly, the
nonparametric approach does not require that data be fitted to any particular model.
If the distributions of scores for true-positive and true-negative test subjects are
far from Gaussian, the parametric AUC and its corresponding standard error (SE)
derived from a directly fitted binormal model may be distorted (Godard et al. (1990)).
Convergence may also be an issue with expression data, since extreme values
are common in such data. For these reasons, as well as its relative simplicity
and ease of use, the nonparametric approach continues to be popular among many
researchers.
The remainder of the chapter is organized as follows. Section 4.2 contains a general
discussion comparing parametric and nonparametric methods. A brief discussion of ROC
analysis is given in Section 4.3. We discuss the motivation and related work in
Section 4.4. We describe our proposed method in Section 4.5 and FDR estimation in
Section 4.6. In Section 4.7, we present simulation results and illustrate the methods
using two real microarray datasets. We discuss the advantages and disadvantages of our
method and provide conclusions in Section 4.8.
4.2. Parametric versus Nonparametric Methods
Theoretical distributions are described by quantities called parameters, notably the
mean and standard deviation. Methods that use distributional assumptions are called
parametric methods, because we estimate the parameters of the distribution assumed
for the data. Frequently used parametric methods include t tests and analysis of vari-
ance for comparing groups, and least squares regression and correlation for studying
the relation between variables. All of the common parametric methods (“t meth-
ods”) assume that in some way the data follow a normal distribution and also that
the spread of the data (variance) is uniform either between groups or across the range
being studied. For example, the two sample t test assumes that the two samples of
observations come from populations that have normal distributions with the same
standard deviation. The importance of the assumptions for t methods diminishes as
sample size increases.
Alternative methods, such as the Mann-Whitney test, and rank correlation, do
not require the data to follow a particular distribution. They work by using the rank
order of observations rather than the measurements themselves. Methods which do
not require any distributional assumptions about the data, such as the rank meth-
ods, are called non-parametric methods. The term non-parametric applies to the
statistical method used to analyze data, and is not a property of the data. As tests
of significance, rank methods have almost as much power as t methods to detect a
real difference when samples are large, even for data which meet the distributional
requirements (Sawilowsky (1993)).
Non-parametric methods are most often used to analyze data which do not meet
the distributional requirements of parametric methods. In particular, skewed data are
frequently analyzed by non-parametric methods, although data transformation can
often make the data suitable for parametric analyses. Sawilowsky (1993) concluded
that “the t-test was more powerful only under a distribution that was relatively
symmetric, although the magnitude of the differences was trivial. In contrast, the
Mann-Whitney held huge power advantages for data sets which presented skewness”.
In exchange for being free of assumptions about the distribution of the data, rank
methods have the disadvantage that they are mainly suited to hypothesis testing: no
useful estimate, such as the average difference between two groups, is obtained.
Estimates and confidence intervals are easy to find with parametric methods.
Non-parametric estimates and confidence intervals can be calculated, but they depend on
extra assumptions which are almost as strong as those for t methods. Rank methods have
the added disadvantage of not generalizing to more complex situations, most obviously
when we wish to use regression methods to adjust for several other factors.
The choice of an approach may also be related to sample size, as the distributional
assumptions are more important for small samples.
4.3. General Discussion on ROC analysis
Receiver operating characteristic (ROC) analysis provides a comprehensive picture of
the ability of a test to make the distinction being examined over all decision thresh-
olds. Several different methods have been developed for the analysis of ROC curves.
The area under an ROC curve (AUC) is indicative of the overall accuracy of a test
and represents the probability that a randomly selected true-positive individual will
score higher on the test than a randomly selected true-negative individual. AUC can
be estimated both parametrically and nonparametrically. The parametric methods
usually model the ROC curves by assuming a particular underlying distribution of
subject outcomes (usually assuming that a bivariate distribution of outcomes is
transformable to a binormal one). The binormal ROC curves have been shown to be quite
robust for a wide class of curves encountered in practice (Hanley (1988)), a property
that is due in part to the variety of distributions that can be approximated by a
monotone transformation of a binormal distribution. One of the best known parametric
approaches to the analysis of ROC curves is the maximum likelihood approach introduced
by Dorfman and Alf Jr. (1969).
Nonparametric methods utilize empirical ROC points by connecting them with
straight lines, step functions or sometimes by fitting a smooth curve. The main
advantage of nonparametric methods compared to parametric ones is the absence of
specific assumptions about the shape of the curve or the underlying distribution of
outcomes. Furthermore, unlike many parametric procedures, iterative algorithms are
not needed for the implementation of most nonparametric methods. For continuous data,
the nonparametric ROC curve may be preferred since it passes through all observed
points and provides unbiased estimates of sensitivity, specificity, and AUC in large
samples (Zweig et al. (1993)). In this chapter we apply nonparametric ROC techniques
to the analysis of microarray gene expression data.
4.4. Motivation of this Chapter
The motivation for this chapter comes from two published papers containing
nonparametric approaches for identifying differentially expressed genes. Pepe et al.
(2003) proposed two measures related to the ROC curve for ranking genes (or proteins)
with regard to differential expression between tissues: AUC and partial AUC(t0) (i.e.,
pAUC), where t0 is some small false positive rate. The nonparametric AUC is equal to
the numerator of the Mann-Whitney U statistic and hence equivalent to the Wilcoxon
rank sum test (RST). The pAUC is not recommended for small sample sizes (Jeffery et al.
(2006)). Troyanskaya et al. (2002) compared three model-free approaches
and assessed their performances under varying noise levels. The three model-free ap-
proaches were: (1) nonparametric t-test, (2) RST, and (3) a heuristic method based
on high Pearson correlation to a perfectly differentiating gene (“ideal discriminator
method”). The RST is used as an alternative to the t-test to avoid the parametric
assumptions.
Figure 4.1 displays the distributions (Gaussian kernel smoothing via the density()
function in R) of four randomly selected genes from the Golub et al. (1999) leukemia
dataset.
Our interest here is to build a separation between two cancer types: acute lym-
phoblastic leukemia (ALL) and acute myeloid leukemia (AML). It is apparent from
the distributions of the four genes that the underlying distributions of the genes lack
symmetry. Therefore, building a method under the assumption of normality may be
invalid. Figure 4.2 displays the corresponding ROC curves for the four genes. The
AUC and pAUC(0.1) are calculated using the R package ROC. The figure indicates that
the gene "D14874_at" separates the two conditions more clearly than the other three
genes. The expression of gene "D14874_at" produces an AUC of 0.946 and a pAUC(0.1) of
0.048, which clearly indicates that it is the most differentially expressed (DE) gene.
Comparing the other three genes with respect to their AUC values indicates that gene
"X93512_at" is the second most DE gene. The remaining two genes, however, are not
comparable because they have equal values of AUC and pAUC. One feature of the pAUC and
AUC (or RST) is that they do not account for gene-specific variability of the
expression values. To improve this situation in microarray analysis, we suggest a
statistic which takes gene-specific variances under the two conditions into account.
Figure 4.3 shows the estimated AUC and its corresponding variance for each gene in the
Golub leukemia dataset. The estimation procedure is given in the next section. We can
see from the figure that genes with AUC values close to 0.5 have higher variances and
genes with AUC values close to 0 or 1 have lower variances. It is therefore important
to take gene-specific variances into account in the test statistic for testing equal
expression under the two conditions.
4.5. Materials and Methods
4.5.1. Single Gene Analysis: AUC
For simplicity, it is assumed that higher expression values are associated with the
treatment group.

Figure 4.1: Density plots of 4 genes (U97502_rna1_at, D14874_at, U82970_at,
X93512_at) from the leukemia dataset: solid line AML and dashed line ALL.

Figure 4.2: ROC curves of the 4 randomly selected genes from the leukemia dataset:
U97502_rna1_at (A = 0.785, pA = 0.0273), D14874_at (A = 0.946, pA = 0.048),
U82970_at (A = 0.785, pA = 0.0273), X93512_at (A = 0.805, pA = 0.0559).

Figure 4.3: AUC and corresponding variance of the AUC for each gene in the leukemia
dataset.

Let $\{x_{ig}\}_{i=1}^{n_1}$ and $\{y_{jg}\}_{j=1}^{n_2}$ denote the expression values
for the $n_1$ control and $n_2$ treatment subjects for the $g$th gene
($g = 1, \dots, m$). An unbiased estimator of the AUC for the $g$th gene, $A_g$, is
given by
\[
A_g = \frac{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \psi(x_{ig}, y_{jg})}{n_1 n_2}
    = \bar{\psi}_{..},
\]
where
\[
\psi(x, y) =
\begin{cases}
1 & x < y \\
0 & x > y.
\end{cases}
\]
The definition of $\psi(x, y)$ does not allow for ties because of the continuous nature
of microarray data. Ties may nevertheless occur when quantile normalization is used, in
which case $\psi(x, x) = 0.5$ can additionally be defined. Note that $A_g$ is equal to
the numerator of the Mann-Whitney U statistic and hence equivalent to the Wilcoxon RST.
The Wilcoxon RST is a nonparametric alternative to the two-sample t-test and is based
solely on the order in which the observations from the two samples fall. Troyanskaya
et al. (2002) described the RST and applied the method to microarray data analysis. The
estimate ($u_1$), mean and variance of the RST in Troyanskaya et al. (2002) are defined
as follows:
\[
w_1 = \sum \text{(ranks of the control group sample)}, \qquad
u_1 = w_1 - \frac{n_1(n_1 + 1)}{2},
\]
\[
E(u_1) = \frac{n_1 n_2}{2}, \qquad
\operatorname{Var}(u_1) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}.
\]
It is apparent that the mean and variance of the $u_1$ statistic are constants depending
only on the sizes of the two groups, and therefore do not take gene-specific variability
into account.
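The equivalence between $A_g$ and the Mann-Whitney/Wilcoxon statistic is easy to check
numerically. The short Python sketch below (illustrative code, not part of the thesis;
NumPy only) computes the empirical AUC for one gene both directly from the $\psi$
kernel and via the rank-sum route $u_1 = w_1 - n_1(n_1+1)/2$:

```python
import numpy as np

x = np.array([1.2, 0.7, 2.5, 1.9, 0.3])   # control expressions for one gene
y = np.array([2.1, 3.0, 1.5, 2.8])        # treatment expressions
n1, n2 = len(x), len(y)

# Direct route: A_g is the proportion of pairs with x_i < y_j (the psi kernel)
A = (x[:, None] < y[None, :]).mean()

# Rank-sum route (no ties here, so a simple double-argsort gives the ranks)
z = np.concatenate([x, y])
ranks = z.argsort().argsort() + 1.0
w1 = ranks[:n1].sum()                     # sum of control-group ranks
u1 = w1 - n1 * (n1 + 1) / 2               # Mann-Whitney U: pairs with x_i > y_j
print(A, 1 - u1 / (n1 * n2))              # prints 0.85 0.85: the two routes agree
```

Since $u_1$ counts the pairs with $x_i > y_j$, its per-pair complement equals the AUC,
confirming that ranking genes by $A_g$ and by the Wilcoxon rank sum are equivalent.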
The AUC can be used as an alternative to a t-test when the data are not normally
distributed. The AUC index reflects the inherent discriminative ability of a diagnostic
procedure and has a nice interpretation as the probability of correct discrimination
between the treatment and control groups. The estimator $A_g$ is approximately normally
distributed under quite general assumptions (Hoeffding (1948)). Hence, knowing the
variance of the estimator is essential for constructing a test statistic for testing
the hypothesis $H_0: A_g = 0.5$ against the alternative $H_1: A_g \neq 0.5$. An $A_g$
value of 0.5 represents no predictive or discriminative ability. The two-sided
alternative is important in cases where expressions in the control group are expected
to score higher than in the treatment group.
Several methods (Hanley et al. (1983); DeLong et al. (1988); Efron et al. (1993);
Dorfman et al. (1992)) have been proposed for computing the variances and covariances
of nonparametric AUC estimates derived from the same sample of cases. These may be
used to facilitate statistical tests of AUC differences between measures. Consistent,
completely nonparametric estimators of the covariance matrix of AUC estimators were
developed by DeLong et al. (1988). The conventional variance estimator proposed by
DeLong et al. (1988) can also be shown to be equivalent to the two-sample jackknife
estimator (Arvesen (1969)) of the variance. Because of the
structure of the nonparametric estimator of AUC, its variance estimator is easy to
compute:

1. Compute the treatment and control group components:
\[
\psi_{i.} = \frac{1}{n_2}\sum_{j=1}^{n_2} \psi(x_i, y_j), \qquad
\psi_{.j} = \frac{1}{n_1}\sum_{i=1}^{n_1} \psi(x_i, y_j).
\]

2. Calculate
\[
s_{10} = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1}\left[\psi_{i.} - \bar{\psi}_{..}\right]^2,
\qquad
s_{01} = \frac{1}{n_2 - 1}\sum_{j=1}^{n_2}\left[\psi_{.j} - \bar{\psi}_{..}\right]^2.
\]

3. The consistent estimator of the variance for the $g$th gene is
\[
V(A_g) = \frac{s_{10}}{n_1} + \frac{s_{01}}{n_2}.
\]
Now, for testing the hypothesis $H_0: A_g = 0.5$, the test statistic becomes
\[
Z_g = \frac{A_g - 0.5}{SE(A_g)},
\]
which is approximately standard normally distributed. We can rank genes according to
the values of $Z_g$.
However, when there are only a small number of arrays in each group, the estimate of
the standard error (SE) for each gene can be unstable. Some genes might by chance have
very small SEs and therefore appear highly significant. If a gene discriminates
perfectly between the two conditions, i.e., $A_g = 1$, then $SE(A_g)$ becomes 0, which
makes the $Z_g$ statistic arbitrarily large. To address this problem we smooth the
variance estimates by borrowing information from the ensemble of genes; this can assist
in inference about each gene individually. This technique of smoothing variances is not
new in microarray studies. For example, Tusher et al. (2001), Efron et al. (2001) and
Broberg et al. (2003) used t-statistics with an offset added to the standard deviation,
while Smyth (2004) proposed a t-statistic with a Bayesian adjustment to the denominator.
We take the offset $s_0$ as the quantile of the gene-wise standard errors that minimizes
the coefficient of variation of the $Z_g$ statistic. We can therefore calculate the
$d_g$ statistic to test for a treatment effect:
\[
d_g = \frac{A_g - 0.5}{SE(A_g) + s_0}.
\]
Similar adjustments for computing a test statistic were also used by Garrett et al.
(2004) and Hu et al. (2006).
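As a concrete sketch of the computations above (the $\psi$ kernel, the DeLong variance,
and the smoothed $d_g$ statistic), the following illustrative Python/NumPy code ranks
genes. The function names are my own, and for simplicity the offset $s_0$ is taken as a
fixed 90th percentile of the gene-wise SEs (an assumption; the text instead selects the
quantile minimizing the coefficient of variation of $Z_g$):

```python
import numpy as np

def delong_auc(x, y):
    """Empirical AUC A_g and DeLong variance V(A_g) for one gene.
    x: control expressions (length n1), y: treatment expressions (length n2)."""
    psi = (x[:, None] < y[None, :]) + 0.5 * (x[:, None] == y[None, :])
    a = psi.mean()                              # A_g = psi-bar_{..}
    s10 = psi.mean(axis=1).var(ddof=1)          # spread of psi_{i.}
    s01 = psi.mean(axis=0).var(ddof=1)          # spread of psi_{.j}
    return a, s10 / len(x) + s01 / len(y)       # V(A_g) = s10/n1 + s01/n2

def dg_statistic(X, Y, q=0.9):
    """Smoothed statistic d_g = (A_g - 0.5) / (SE(A_g) + s0) for every gene.
    X, Y: (genes x arrays) matrices for the control and treatment groups."""
    auc, var = np.array([delong_auc(x, y) for x, y in zip(X, Y)]).T
    se = np.sqrt(var)
    s0 = np.quantile(se, q)                     # offset from the ensemble of genes
    return (auc - 0.5) / (se + s0)

# Toy data: 100 genes on 10 control and 12 treatment arrays; first 10 genes DE
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Y = rng.normal(size=(100, 12))
Y[:10] += 2.0
d = dg_statistic(X, Y)
top = np.argsort(-np.abs(d))[:10]               # top-ranked genes by |d_g|
```

Because $s_0 > 0$, a gene with a perfect separation ($A_g = 1$, hence $SE(A_g) = 0$)
no longer produces an unbounded statistic.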
4.6. FDR Estimation with the dg statistic
In our application, the FDR is estimated using permutations and thresholding of the
test statistics. Alternative estimation methods using p-values can also be applied
(Storey and Tibshirani (2003)). We use the following permutation algorithm to select
significant genes and estimate the FDR:
1. For the original data calculate the $d_g$ statistics and denote their ordered values
by $d_{(g)}$.

2. For the $b$-th permutation, calculate the $d_g$ statistics and denote their ordered
values by $d^b_{(g)}$, $b = 1, \dots, B$. Denote their averages across all permutations
by
\[
\bar{d}_{(g)} = \frac{1}{B}\sum_{b=1}^{B} d^b_{(g)}.
\]

3. For a cutoff value $\Delta$, identify as significant the genes with
\[
\left| d_{(g)} - \bar{d}_{(g)} \right| \geq \Delta.
\]
Denote
\[
d_0 = \max_{\{g:\; d_{(g)} \leq \bar{d}_{(g)} - \Delta\}} d_{(g)}, \qquad
d_1 = \min_{\{g:\; d_{(g)} \geq \bar{d}_{(g)} + \Delta\}} d_{(g)},
\]
and estimate the expected number of false positives by chance for the $d_g$ statistic
as
\[
V(\Delta) = \frac{1}{B}\sum_{b=1}^{B}\sum_{g}
\left[ I\{d^b_{(g)} \geq d_1\} + I\{d^b_{(g)} \leq d_0\} \right],
\]
where $I\{\cdot\}$ is the indicator function. The estimated FDR is
\[
\widehat{\mathrm{FDR}}(\Delta) = \frac{V(\Delta)}{R(\Delta)},
\]
where
\[
R(\Delta) = \sum_{g} I\left\{\left| d_{(g)} - \bar{d}_{(g)} \right| \geq \Delta\right\}
\]
is the total number of significant genes. We can similarly calculate the expected
number of false positives and FDR for the SAM, t, AUC and pAUC statistics.
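The permutation algorithm above can be sketched directly. The code below is an
illustrative Python/NumPy outline (helper names are my own), using a simple
smoothed-AUC statistic as the plug-in $d_g$:

```python
import numpy as np

def dg_stat(data, labels, q=0.9):
    """Smoothed AUC statistic per gene; data is (genes x samples),
    labels holds 0 (control) and 1 (treatment)."""
    X, Y = data[:, labels == 0], data[:, labels == 1]
    psi = (X[:, :, None] < Y[:, None, :]).astype(float)   # genes x n1 x n2
    auc = psi.mean(axis=(1, 2))
    v = (psi.mean(axis=2).var(axis=1, ddof=1) / X.shape[1]
         + psi.mean(axis=1).var(axis=1, ddof=1) / Y.shape[1])
    se = np.sqrt(v)
    return (auc - 0.5) / (se + np.quantile(se, q))

def permutation_fdr(data, labels, delta, B=30, seed=1):
    """Estimate FDR(delta): compare ordered observed d_(g) with the
    permutation-averaged ordered null values, then count null exceedances."""
    rng = np.random.default_rng(seed)
    d_obs = np.sort(dg_stat(data, labels))                       # d_(g)
    perm = np.sort([dg_stat(data, rng.permutation(labels))
                    for _ in range(B)], axis=1)                  # d^b_(g)
    d_bar = perm.mean(axis=0)                                    # bar d_(g)
    R = int((np.abs(d_obs - d_bar) >= delta).sum())              # significant genes
    lo, hi = d_obs <= d_bar - delta, d_obs >= d_bar + delta
    d0 = d_obs[lo].max() if lo.any() else -np.inf
    d1 = d_obs[hi].min() if hi.any() else np.inf
    V = ((perm >= d1) | (perm <= d0)).sum() / B                  # E[false positives]
    return (V / R if R else 0.0), R

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 16))
labels = np.repeat([0, 1], 8)
data[:20, labels == 1] += 2.0            # 20 truly DE genes in the toy data
fdr, R = permutation_fdr(data, labels, delta=0.5)
```

Varying the cutoff delta traces out the trade-off between the number of genes called
significant, R, and the estimated FDR.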
4.7. Results
In this section, we evaluate the performance of our methods using simulated data as
well as two published datasets.
4.7.1. Simulation
We implemented and evaluated 4 methods for identifying differentially expressed
genes: SAM, RST, t statistic, and dg statistic. The performance of our proposed
methods is evaluated using simulated data generated from two distributions incorpo-
rating variability, treatment effect and sample size effect. We consider the scenarios
having two conditions, treatment versus control, and sample sizes of 10, 20 and 40 per
condition. We generated data from 1000 genes and set the proportion of DE genes
as 0.1. We consider equal sample sizes for each scenario. The following simulation
scenarios are considered here:
Sim1 : Sim1 simulates normal data with gene-specific variances drawn from a standard
exponential distribution. We consider a fixed effect size (D) of 1; the effect is
added to the second group.
Sim2 : Sim2 generates data from the extreme value distribution (EV); that is,
expressions for a gene are generated from EV(0, b′). We set the scale parameter
b′ = 2 in the simulation to allow extreme values in the right tail of the
distribution. The treatment effect (D) is as described for the previous scenario.
The numbers of true positives and false positives are estimated based on 200
simulations. The numbers of false positives corresponding to the number of significant
genes under Sim1 are given in Figures 4.4-4.6 for the different sample sizes per
condition. The left and right panels use different scales for the number of significant
genes. The number of significant genes is defined by ranking the FDR values
corresponding to each gene. It is apparent from the figures that with smaller sample
sizes per condition all the methods produce a higher number of false positives for a
fixed number of significant genes. With small sample sizes per condition
(n1 = n2 = 10), SAM performs better than the other methods. For moderate sample sizes
per condition (n1 = n2 = 20), the SAM and dg statistics produce very close results in
terms of the number of false positives for a fixed number of significant genes; the
same holds for the larger sample size (n1 = n2 = 40) per condition. Figures 4.7-4.9
provide the numbers of false positives corresponding to the number of significant genes
under the Sim2 scenario. With large sample sizes per condition, both the dg and SAM
statistics perform better than the RST and t-statistic, with the t-statistic worst.
4.7.2. Applications
A detailed evaluation of gene selection methods on real biological data is challenging
due to the difficulty of defining a gold standard. Here we have evaluated and applied
Figure 4.4: Comparison of the t, RST, SAM and dg statistics at sample size 10 under
the Sim1 scenario.

Figure 4.5: Comparison of the t, RST, SAM and dg statistics at sample size 20 under
the Sim1 scenario.

Figure 4.6: Comparison of the t, RST, SAM and dg statistics at sample size 40 under
the Sim1 scenario.
Figure 4.7: Comparison of the t, RST, SAM and dg statistics at sample size 10 under
the Sim2 scenario.

Figure 4.8: Comparison of the t, RST, SAM and dg statistics at sample size 20 under
the Sim2 scenario.

Figure 4.9: Comparison of the t, RST, SAM and dg statistics at sample size 40 under
the Sim2 scenario.
all the methods to two publicly available datasets.
The first dataset is the well-known Affymetrix spike-in study, which contains 12,626
genes, 12 replicates in each group, and 16 known differentially expressed genes (Cope
et al. (2004)). The dataset is contained in the R package "DEDS". Among the 16 truly
differentially expressed genes, 14 are identified by the t, RST and dg statistics and
15 by the SAM method within the 20 top-ranked genes. The ranking of the genes is made
by permuted p-values from the different methods. All 16 genes are identified by SAM
when 74 significant genes are considered; the corresponding numbers are 153, 156 and
162 for the dg, t and RST statistics, respectively. We also examined the Affymetrix
dataset by calculating the average number of truly identified genes by the t, SAM,
RST, and dg statistics among the 20 top-ranked genes, where the average is taken over
200 randomly drawn samples of equal size (n1 = n2) from each condition. We examined
samples of sizes 7 and 10 from each condition. With samples of size 7 from each
condition, on average 13.40, 14.25, 13.44 and 13.63 genes are truly identified by the
t, SAM, RST, and dg statistics, respectively. The average numbers are 14.15, 14.86,
14.34, and 14.52, respectively, for samples of size 10 from each condition. Therefore
SAM performs best among the methods, and the performance of the dg statistic improves
over the RST for all sample sizes.
Figure 4.10 shows the results for the Affymetrix spike-in data comparing the methods
in terms of concordance of genes with SAM. Concordance is defined as the number of
genes in the gene list produced by one method that are also present in the gene list
produced by another method. Here we compute the concordance between the list of most
DE genes produced by SAM and the list of most DE genes produced by each other method,
including the t statistic (the unequal-variance t-test). Comparing with the SAM gene
list of 100 genes, it appears that 74, 73, and 72 genes are concordant with the
Figure 4.10: Examining the methods in terms of concordance with the SAM statistic.
methods t-statistic, RST, and dg statistic, respectively. The four methods applied to
the Affymetrix study produce different gene lists, but we have found that the agreement
between the gene lists produced by SAM and the other statistics is quite similar for
this dataset.
The second dataset is from the Golub et al. (1999) leukemia study, which was used to
classify two types of leukemia: ALL and AML. The dataset is contained in the R package
golubEsets. Many authors have analyzed these data using different methodologies (Pan
(2002); Zhao et al. (2003)). The training dataset consists of 27 ALL and 11 AML
subjects, and the test dataset consists of 20 ALL and 14 AML subjects. The expression
levels of 7129 genes were measured. Here we have merged the training and test samples,
giving a total of 72 samples for our analysis. Figure 4.11 provides the expected
number of false positives corresponding to different values of the estimated FDR. It
is seen from the figure that the SAM statistic performs best
Figure 4.11: Expected number of false positives corresponding to estimated FDR values
for the Golub leukemia dataset.
among all the methods. Both the dg and RST statistics perform very similarly, while
the t-statistic is worse than the others.

Table 4.1: Average classification errors (with standard errors in parentheses) for the
leukemia dataset.

k0    SAM             RST             pAUC(0.1)       dg
 5    0.221 (0.113)   0.241 (0.134)   0.266 (0.173)   0.235 (0.129)
10    0.196 (0.126)   0.216 (0.157)   0.238 (0.162)   0.208 (0.123)
15    0.176 (0.118)   0.208 (0.151)   0.232 (0.195)   0.201 (0.131)
20    0.139 (0.124)   0.159 (0.162)   0.162 (0.156)   0.156 (0.137)
25    0.107 (0.119)   0.137 (0.148)   0.141 (0.168)   0.133 (0.127)
30    0.097 (0.134)   0.129 (0.153)   0.133 (0.159)   0.114 (0.137)
It is hard to compare the methods when the truly significant genes are not known. In
this case, the best method should be the one with the lowest classification error. Our
interest in the Golub leukemia dataset is to select important genes and use them to
classify the two types of leukemia. Geman et al. (2004) also combined the test and
training sets of the leukemia dataset to estimate the error by leave-one-out
cross-validation. Here we evaluate the effectiveness of a gene list in forming a
classifier that can predict the class of a test sample. In using classification to
compare the methods, we assume that a better gene list should discriminate between the
groups more effectively. We therefore evaluate the classification performance of the
four methods SAM, RST, pAUC(0.1), and the dg statistic using a simple Gaussian maximum
likelihood discriminant rule with diagonal class covariance matrices. pAUC(0.1)
denotes the partial area under the curve at the 0.1 false positive threshold. A
detailed discussion of the discriminant rule is given in Dudoit et al. (2002). All the
approaches involve splitting the data into training and test samples: the
classification rule is developed on the training sample and its performance is
determined on the test samples. We used 6-fold cross-validation, where the data are
divided into 6 subsets; on each iteration of the cross-validation a different
collection of 4 subsets serves as the training sample and the remaining 2 subsets
serve as the test sample. The performance of the methods is evaluated by the average
misclassification error in the test samples. The top-ranked genes, of a specified
number between 5 and 30, are used to create the classification rule. The means of the
misclassification errors and their corresponding standard errors for the different
methods are summarized in Table 4.1. It appears that the performance of the dg
statistic agrees closely with that of SAM and is better than the RST and pAUC(0.1).
The standard errors also indicate that the dg statistic produces less variable gene
lists, meaning that its use gives more reproducible genes than the RST and pAUC(0.1).
In fact, compared to the other methods, the SAM and dg statistics produce more
consistent results. Therefore the dg statistic can be used effectively on this dataset
for identification of the important genes in microarray data analysis.
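The evaluation scheme just described (rank genes on the training folds only, then
classify held-out arrays with a diagonal-covariance Gaussian discriminant rule) can be
outlined as follows. This is an illustrative Python/NumPy sketch, not the thesis code:
it uses a plain Welch-type score in place of the ranking statistics, holds out one of
the six folds at a time rather than two, and all function names are my own.

```python
import numpy as np

def dlda_cv_error(data, labels, k=20, folds=6, seed=0):
    """Average misclassification error of a diagonal-covariance Gaussian
    (DLDA) rule built on the top-k genes ranked within each training fold."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    errors = []
    for f in range(folds):
        test = order[f::folds]
        train = np.setdiff1d(order, test)
        Xtr, ytr = data[:, train], labels[train]
        g0, g1 = Xtr[:, ytr == 0], Xtr[:, ytr == 1]
        # Rank genes on the training data only (Welch-type score, an assumption)
        score = np.abs(g0.mean(1) - g1.mean(1)) / np.sqrt(
            g0.var(1, ddof=1) / g0.shape[1] + g1.var(1, ddof=1) / g1.shape[1])
        top = np.argsort(-score)[:k]
        # DLDA: class means and pooled per-gene variances on the top-k genes
        mu = [g0[top].mean(1), g1[top].mean(1)]
        n0, n1 = g0.shape[1], g1.shape[1]
        var = ((n0 - 1) * g0[top].var(1, ddof=1)
               + (n1 - 1) * g1[top].var(1, ddof=1)) / (n0 + n1 - 2) + 1e-12
        Xte = data[top][:, test]
        dist = [(((Xte - m[:, None]) ** 2) / var[:, None]).sum(0) for m in mu]
        pred = (dist[1] < dist[0]).astype(int)
        errors.append((pred != labels[test]).mean())
    return float(np.mean(errors))

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 72))          # 500 genes, 72 arrays (as in the merged data)
labels = (np.arange(72) >= 36).astype(int) # two classes
data[:30, labels == 1] += 1.5              # 30 informative genes in the toy data
err = dlda_cv_error(data, labels, k=20)
```

Ranking the genes inside each training fold, rather than once on the full data, avoids
selection bias in the cross-validated error estimate.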
4.8. Discussion and Conclusion
One aim of microarray studies is to provide a list of differentially expressed genes
in a given experimental system. To this end, we have proposed a nonparametric
rank-based approach for ranking genes. We presented a dg statistic, a modified form of
the RST that takes gene-specific variance information into account to produce a list
of significant targets and assess differential gene expression. Both perform extremely
well in ideal situations. For large sample sizes (40 per condition), the dg statistic
gives better results, yielding fewer false positive genes (or more true positives, TP)
than the other methods.
Previous studies suggest that using rank-transformed data in microarray analysis is
advantageous (Raychaudhuri et al. (2000); Tsodikov et al. (2002)). Theory predicts
that rank-based methods will be optimal for extremely noisy data. Heteroscedasticity
is common in gene expression data (Thomas et al. (2001); Craig et al. (2003); Pepe et
al. (2003)). The presence of outliers is also very common in microarray data and may
result in different variances in the two experimental conditions. The current study
shows that when variances differ, the tests, in particular the dg statistic, are
useful for testing for differences between the two conditions. In comparing the RST
and the dg statistic, the version of the dg statistic that allows a different variance
for each gene is likely to give more reliable results in microarray gene expression
analysis.
Without a large number of subjects from each condition, it can be difficult to
identify the underlying distribution from the data alone. In such cases, domain
knowledge and good judgement about the nature of the distribution are required, and in
these circumstances we recommend the dg statistic. Another instance where we recommend
our method is when the gene-specific variances differ. Moreover, the use of the dg
statistic is not computationally intensive. While the proposed statistical approach
does not always outperform all the other methods, it is always comparable and
sometimes superior.
In the analysis of real microarray data, there is no single correct answer as to which
method or statistic should be used, as the choice of statistic can dramatically affect
the set of genes that is selected. A researcher should choose the measure of
differential expression based on the biological system of interest. If changes in
expression relative to the underlying noise are important and samples are large, then
our method is preferable, since it provides useful and robust analytical tools for
gene selection with fewer requirements on the underlying features of the datasets of
interest than most existing methods in the microarray literature.
CHAPTER 5
Nonparametric Method for Detecting
Highly Correlated Differentially
Expressed Genes
Very often biologists are interested in knowing the biological function of a particular
gene. Its true biological function may depend on other genes, and finding other genes in
the same biological pathway may enhance the understanding of its biological
function. We are therefore interested in finding other differentially expressed
genes whose expression values are highly correlated with that of a “seed” gene. We
propose a nonparametric procedure for selecting differentially expressed genes with
expression levels correlated with that of a “seed” gene in microarray experiments. The
proposed test statistic compares two Area Under Receiver Operating Characteristic
Curves (AUC) for gene pairs taking correlation into account. DeLong et al. (1988)
proposed a nonparametric procedure for calculating a consistent variance estimator
of the difference between two AUCs. We use their variance estimation technique for
comparing pairs of genes, and we focus on the correlated gene selection process with
respect to a particular gene of interest. The performance of our method is compared
to the other methods through the use of simulation and real data analysis.
5.1. Introduction
It is important to find a novel and efficient statistical technique that will help identify
those genes whose expressions are correlated such that they may belong to the same
molecular pathway. For grouping similar genes using gene expression data, many
methods have been proposed: cluster analysis (Eisen et al. (1998)), self-organizing
maps (Tamayo et al. (1999)), singular value decomposition (Alter et al. (2000)), support
vector machines (Furey et al. (2000)), and the genetic algorithm/k-nearest neighbor
method (Li et al. (2001)).
Often, the biologist has already acquired some prior knowledge of the molecular
basis of the disease, and therefore knows one or more functional candidates in
the disease pathogenesis in advance. In this scenario, the naive method is simply to
select the genes Gg (g = 1, ..., m; g ≠ s) whose expression levels have the highest Pearson
correlation coefficients with the expression levels of Gs, the known, pre-specified
marker gene. Here we propose an algorithm designed to select a target number of
highly correlated genes sequentially when at least one candidate/marker/differentially
expressed gene is known in advance.
The motivation for this chapter comes from the paper by Ding et al. (2008), who
proposed a gene selection procedure that targets finding other genes whose expression
patterns correlate significantly with a gene of known biological significance.
Ding's method does not address selecting differentially expressed genes when
the known gene is differentially expressed, nor does it take the underlying
conditions (e.g., treatment and control) into account. The problem of searching
for differentially expressed genes correlated with a known candidate gene has not been
studied extensively in the literature. In this chapter we propose a statistical procedure
to solve this problem which makes use of a nonparametric estimator of the AUC
difference of two genes and is based on the concept of estimating the variability with
the method by DeLong et al. (1988). Most currently available methods ignore correlation
among genes when in fact they are correlated: expression profiles of multiple
genes are often correlated and are thus more suitably modeled as mutually dependent
variables. Zhao et al. (2003) showed that lack of independence can lead to a largely
inflated Type I error (false positives). Recently, several methods (Chilingaryan et al.
(2002), Szabo et al. (2002), Lu et al. (2005), Zhou et al. (2007)) have been proposed
that use multivariate statistical techniques in expression data analysis. Here we propose
a test statistic for selecting pairs of highly correlated genes which depends on the
correlation between the expression values of the two genes being compared. We adjust
the test statistic and calculate the variance following the method of DeLong et al. (1988).
5.2. Ding’s Method
Ding et al. (2008) studied the problem of searching genes correlated to a known can-
didate gene or a “seed” gene. They proposed a statistically valid two-stage procedure
for selecting genes with expression levels correlated with that of a “seed” gene in
microarray experiments:
Stage I: perform array-normal-scores (ANS) transformation of the raw microarray
data and calculate the Pearson correlation coefficients using ANS-transformed
data. The ANS transforms the raw microarray data approximately to a Gaus-
sian distribution.
Stage II: pick correlated genes by employing the false discovery rate (FDR) ap-
proach. They proposed calculating a resampling-based least significant false
discovery rate (LS-FDR) to select the most correlated genes based on a given
LS-FDR threshold.
5.2.1. Correlation Test
Let {x_ig, i = 1, ..., n1} and {y_jg, j = 1, ..., n2} denote the expression values for the
n1 control and n2 treatment subjects for the gth gene (g = 1, ..., m). We arrange the
data in a matrix with entries w_gh (g = 1, ..., m; h = 1, ..., n, where n = n1 + n2),
with w_gh = x_hg for h = 1, ..., n1 and w_gh = y_(h−n1),g for h = n1 + 1, ..., n; the
rows of this matrix are the expression profiles of the m genes G1, G2, ..., Gm. We
assume that the expression values are normalized and denote the marker gene by Gs,
so the measurement for Gs on the hth array is w_sh. The question of finding genes Gg
related to Gs can therefore be formulated statistically as a hypothesis test of whether
the Pearson correlation coefficient between W_gh and W_sh is zero, that is, a test of
H0 : ρ_g = 0, where ρ_g = cov(Gg, Gs)/√[var(Gg) var(Gs)]. Based on the matrix
{w_gh}, the correlation coefficient ρ_g is estimated by the sample Pearson correlation
coefficient

    ρ̂_g = Σ_{h=1}^{n} (w_gh − w̄_g)(w_sh − w̄_s) / √[ Σ_{h=1}^{n} (w_gh − w̄_g)² · Σ_{h=1}^{n} (w_sh − w̄_s)² ],

where w̄_g = Σ_{h=1}^{n} w_gh / n.
Genes that are truly correlated with Gs also tend to have higher sample correlations,
so the standard Pearson correlation test of H0 : ρ_g = 0 rejects H0 when | ρ̂_g | is
large. The standard two-sided Pearson correlation test calculates the p-value as

    P_g = 2[ 1 − T_{n−2}( √(n − 2) | ρ̂_g | / √(1 − ρ̂_g²) ) ],

where T_{n−2}(·) is the cdf of the Student's t distribution with n − 2 degrees of freedom
(Larsen and Marx (2001)). The gth gene is declared to be related to Gs
if the corresponding p-value P_g is smaller than a pre-specified Type I error rate α.
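As a minimal illustration of this correlation test, the following Python sketch (function names are ours, not from the thesis) computes ρ̂_g and the t statistic √(n − 2)|ρ̂_g|/√(1 − ρ̂_g²); the two-sided p-value then follows by evaluating the Student's t upper tail with n − 2 degrees of freedom (in practice, e.g., scipy.stats.t.sf; omitted here to keep the sketch dependency-free).

```python
import math

def pearson_r(w_g, w_s):
    """Sample Pearson correlation between the expression profiles of two genes."""
    n = len(w_g)
    mg = sum(w_g) / n
    ms = sum(w_s) / n
    num = sum((a - mg) * (b - ms) for a, b in zip(w_g, w_s))
    den = math.sqrt(sum((a - mg) ** 2 for a in w_g) *
                    sum((b - ms) ** 2 for b in w_s))
    return num / den

def t_statistic(r, n):
    """t = sqrt(n - 2)|r| / sqrt(1 - r^2); p-value = 2[1 - T_{n-2}(t)]."""
    return math.sqrt(n - 2) * abs(r) / math.sqrt(1 - r ** 2)

# toy expression profiles for one gene pair across n = 4 arrays
r = pearson_r([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
t = t_statistic(r, 4)
```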
5.3. Materials and Methods
5.3.1. Comparison of Two ROC Curves
The comparison of two genes can be based on two ROC curves or their summary
measures. Recent methods for assessing the difference in the AUCs in a paired setting
can be found in Braun et al. (2008). DeLong et al. (1988) developed a consistent
nonparametric estimator of the covariance matrix for several AUC estimators taking
pairwise correlation into account. The covariance for the estimators (Ag, As) can be
computed as follows:
1. Compute the treatment and control group components for the gth gene,

       ψ_{gi·} = (1/n2) Σ_{j=1}^{n2} ψ(x_ig, y_jg),    ψ_{g·j} = (1/n1) Σ_{i=1}^{n1} ψ(x_ig, y_jg),

   where ψ(x, y) is the Mann–Whitney kernel: ψ(x, y) = 1 if y > x, 1/2 if y = x, and 0
   if y < x.

2. Compute the treatment and control group components for the sth gene,

       ψ_{si·} = (1/n2) Σ_{j=1}^{n2} ψ(x_is, y_js),    ψ_{s·j} = (1/n1) Σ_{i=1}^{n1} ψ(x_is, y_js).

3. Compute the two terms s_{10}^{g,s} and s_{01}^{g,s}:

       s_{10}^{g,s} = [1/(n1 − 1)] Σ_{i=1}^{n1} [ψ_{gi·} − ψ_{g··}][ψ_{si·} − ψ_{s··}],

       s_{01}^{g,s} = [1/(n2 − 1)] Σ_{j=1}^{n2} [ψ_{g·j} − ψ_{g··}][ψ_{s·j} − ψ_{s··}],

   where ψ_{g··} is the grand mean of the kernel values for gene g, which equals the
   AUC estimate Ag.

4. A consistent estimator of the covariance is then

       cov(Ag, As) = s_{10}^{g,s}/n1 + s_{01}^{g,s}/n2.
Since the nonparametric estimator of the AUC is unbiased, the expectation of the
AUC difference is 0 when the two AUCs are equal. Hence, under the assumption
of asymptotic normality of the AUC-statistic and the additional assumption of ex-
changeability of within gene rank-ratings, we can construct a test statistic for testing
no difference between two genes in terms of discrimination.
    D = (Ag − As) / SE(Ag − As),

where

    SE(Ag − As) = √[ Var(Ag) + Var(As) − 2 cov(Ag, As) ].
For each pair of genes, the statistic D is computed to determine the closest gene
to a candidate gene or marker gene in terms of the values of AUC. Suppose the
candidate gene is Gs and we want to compare Gs with every other gene Gg,
g = 1, ..., m; g ≠ s; we then obtain m − 1 standard errors, i.e., SE(As − Ag), g =
1, ..., m; g ≠ s. The intent of using this test statistic is to incorporate pairwise
correlation between genes individually, so that the AUC difference for each pair of
genes in a microarray experiment can have its own unique variance. This may be a
consequence of biological or technical factors, but it is well known that variances differ
across genes. To derive stable pairwise gene-specific estimates of the variance of the
difference, we can borrow information across genes by shrinking the variance of the
difference estimators. We therefore propose adding a constant a0 to stabilize
the denominator of D. Here we follow Efron et al. (2001) and take a0 as the 90th
percentile of the values SE(Ag − As):

    D(adj) = (Ag − As) / [ SE(Ag − As) + a0 ].
The D(adj) statistic possesses a prominent statistical property for microarray data
analysis: it takes pairwise correlation into account and can find genes whose
differential expression is not marginally detectable by single-gene testing methods.
Here we also propose a search algorithm that sequentially identifies genes by examining
the distance between the AUCs of two compared genes. The lowest values of the D(adj)
statistic indicate the greatest similarity in terms of discrimination between a pair of
genes.
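Steps 1–4 and the adjusted statistic can be sketched in Python as follows; the kernel ψ and all function names are our illustrative choices, not code from the thesis, and a0 defaults to 0 here rather than the 90th percentile over all gene pairs.

```python
# Sketch of the DeLong et al. (1988) covariance and the D(adj) statistic for one
# gene pair; x_* are control values, y_* are treatment values for a gene.

def psi(x, y):
    """Mann-Whitney kernel: 1 if y > x, 1/2 for ties, 0 otherwise."""
    return 1.0 if y > x else (0.5 if y == x else 0.0)

def auc_components(x, y):
    """Row means psi_{i.}, column means psi_{.j}, and the AUC (their grand mean)."""
    n1, n2 = len(x), len(y)
    v10 = [sum(psi(xi, yj) for yj in y) / n2 for xi in x]
    v01 = [sum(psi(xi, yj) for xi in x) / n1 for yj in y]
    return v10, v01, sum(v10) / n1

def delong_cov(v10_g, v01_g, v10_s, v01_s):
    """Consistent covariance estimate of (A_g, A_s): s10/n1 + s01/n2."""
    n1, n2 = len(v10_g), len(v01_g)
    a_g, a_s = sum(v10_g) / n1, sum(v10_s) / n1  # grand means = AUC estimates
    s10 = sum((u - a_g) * (v - a_s) for u, v in zip(v10_g, v10_s)) / (n1 - 1)
    s01 = sum((u - a_g) * (v - a_s) for u, v in zip(v01_g, v01_s)) / (n2 - 1)
    return s10 / n1 + s01 / n2

def d_adj(x_g, y_g, x_s, y_s, a0=0.0):
    """(A_g - A_s) / (SE(A_g - A_s) + a0), with SE from the DeLong covariance."""
    v10_g, v01_g, a_g = auc_components(x_g, y_g)
    v10_s, v01_s, a_s = auc_components(x_s, y_s)
    var_g = delong_cov(v10_g, v01_g, v10_g, v01_g)
    var_s = delong_cov(v10_s, v01_s, v10_s, v01_s)
    cov_gs = delong_cov(v10_g, v01_g, v10_s, v01_s)
    se = max(var_g + var_s - 2.0 * cov_gs, 0.0) ** 0.5  # guard tiny negatives
    return (a_g - a_s) / (se + a0)
```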
5.3.2. Permuted P-values and FDR estimation with D(adj)
statistic
The False Discovery Rate (FDR) can be calculated following the same method described
in the previous chapter. Here we describe FDR estimation based on
estimated p-values. The following steps are used to calculate the FDR
by resampling the data:

1. For the original data, calculate the D(adj) statistic between the gth gene and the
   marker gene for every g, and denote the kth ordered value as D(adj)_k.

2. Permute the gene expression levels of the marker gene, w_sh = (x_is, y_js); i =
   1, ..., n1; j = 1, ..., n2, across microarrays. That is, we create a permuted
   data set w_gh^(1) (g = 1, ..., m; h = 1, ..., n), where w_gh^(1) = w_gh for g ≠ s and
   (w_s1^(1), ..., w_sn^(1)) is a random permutation of (w_s1, ..., w_sn). The first n1
   observations are then treated as the control group and the next n2 observations as
   the treatment group. This permuted data set preserves the dependence
   structure of the genes except for Gs.

3. For the new permuted data set, calculate the statistics D(adj)_g, g = 1, ..., m − 1.

4. Repeat this resampling procedure B times, recalculating D(adj)_g^(b) (b =
   1, ..., B; g = 1, ..., m − 1) from each resampled, permuted data set.

5. The p-value of the correlation between the gth gene and the marker gene
   can then be estimated by

       p̂_g = #{ (b, k) : | D(adj)_k^(b) | ≥ | D(adj)_g | } / [ B(m − 1) ].

6. The Benjamini and Hochberg (1995) formula can now be applied to the estimated
   p-values to obtain the resampling-based FDR. Genes are selected based
   on lower values of FDR.
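The resampling steps above can be sketched as follows, with the D(adj) computation abstracted away: `permutation_pvalues` implements step 5 and `bh_fdr` the standard Benjamini-Hochberg step-up adjustment (function names are ours, not from the thesis).

```python
def permutation_pvalues(stats_obs, stats_perm):
    """Step 5: p_g = #{(b,k): |D_perm| >= |D_obs,g|} / (B(m-1)).

    stats_obs:  list of m-1 observed D(adj) values.
    stats_perm: list of B lists, each holding m-1 permuted D(adj) values.
    """
    B, m1 = len(stats_perm), len(stats_obs)
    flat = [abs(v) for row in stats_perm for v in row]
    return [sum(1 for v in flat if v >= abs(d)) / (B * m1) for d in stats_obs]

def bh_fdr(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (q-values)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        q[i] = prev
    return q

# toy example: 2 genes, B = 2 permutations
p = permutation_pvalues([2.0, 0.4], [[1.0, -1.5], [0.5, 3.0]])
q = bh_fdr(p)
```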
5.3.3. Simulation
We evaluate the performance of the D(adj) statistic using simulated data. Here
we consider SAM, RST and the empirical Bayes moderated t-statistic (Mod-t) for
comparison because these three methods are widely used in the microarray gene
selection literature. The simulated data are generated following Allison et al. (2002),
whose approach allows us to incorporate correlation into the simulated microarray
data. The following steps are used in simulating the data:
1. We focus on two conditions, treatment versus control. We consider sample sizes
of 10, 20 and 40 per condition.
2. We generate the datasets comprising 1000 genes.
3. The data for the n1 + n2 samples are multivariate normal and generated
   independently:

       X ∼ N_1000(µ, Σ).

4. µ is a constant vector of length 1000 with all entries equal to 0.

5. Σ = σ²(I_10 ⊗ B), where B = ρ 1_100 1′_100 + (1 − ρ) I_100, 1_100 = (1, 1, ..., 1)′
   is the vector of ones of length 100, I_10 is the 10 × 10 identity matrix, and ⊗ is
   the Kronecker product. We therefore have 10 independent groups of genes, each
   group consisting of 100 equicorrelated genes.
6. The common variance is set at σ2 = 1.
7. We varied ρ over three values: 0.4 (weak dependence), 0.6 (moderate dependence)
   and 0.8 (strong dependence). Figure 5.1 presents a correlation plot for
   a simulated dataset when the correlation coefficient within a group of genes is 0.8.
   Increasingly positive correlations are represented by reds of increasing intensity, and
   increasingly negative correlations by greens of increasing intensity.
Figure 5.1: Correlation plot for a simulated dataset when the correlation coefficient within a group of genes is 0.8
8. We consider the first gene (G1) to be the marker gene. The 99 other genes in its
   group (G2, ..., G100) are therefore correlated with the marker gene.

9. We set the treatment effect, d, at 1 and added it to the first 20% of the genes,
   which provides only a location shift between the two conditions for those genes.
   Our interest is in how many of the first 200 genes are selected by the
   methods under different correlations.

10. We performed 500 simulations and calculated the average number of genes
    correctly identified among the top 200 ranked genes.
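The block covariance in step 5 can be constructed explicitly. The following Python sketch builds the equicorrelation matrix B and Σ = σ²(I_k ⊗ B); it uses 3 blocks of 4 genes instead of the thesis's 10 blocks of 100, and all names are our illustrative choices.

```python
def equicorrelation(m, rho):
    """B = rho * 1 1' + (1 - rho) * I: m x m with unit diagonal, off-diagonal rho."""
    return [[1.0 if i == j else rho for j in range(m)] for i in range(m)]

def kron_identity(k, B, sigma2=1.0):
    """Sigma = sigma^2 (I_k kron B): k independent blocks of len(B) correlated genes."""
    m = len(B)
    n = k * m
    S = [[0.0] * n for _ in range(n)]
    for b in range(k):                      # copy B onto the b-th diagonal block
        for i in range(m):
            for j in range(m):
                S[b * m + i][b * m + j] = sigma2 * B[i][j]
    return S

# miniature version of the thesis setup: 3 blocks of 4 genes, rho = 0.8
Sigma = kron_identity(3, equicorrelation(4, 0.8))
```

Sampling X ∼ N(0, Σ) from this block structure is then straightforward with any multivariate-normal generator (e.g. numpy.random.multivariate_normal).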
Table 5.1: Average number of genes truly DE (correlated DE genes) out of the top ranked 200 genes after 500 simulations.

 ρ    Sample size      SAM            RST            Mod-t          D(adj)
 0.4  n1 = n2 = 10     39.58 (19.37)  37.94 (19.65)  39.25 (19.21)  45.21 (25.34)
      n1 = n2 = 20     40.70 (20.26)  38.02 (19.67)  40.38 (20.10)  44.79 (25.38)
      n1 = n2 = 40     41.85 (20.31)  39.76 (19.55)  41.97 (20.26)  45.70 (27.23)
 0.6  n1 = n2 = 10     43.56 (23.54)  44.69 (22.63)  43.82 (23.60)  46.62 (28.24)
      n1 = n2 = 20     41.10 (19.92)  39.42 (21.97)  41.11 (19.83)  46.11 (27.50)
      n1 = n2 = 40     41.25 (19.48)  38.89 (18.48)  41.16 (19.49)  47.71 (28.63)
 0.8  n1 = n2 = 10     44.64 (23.37)  43.43 (24.00)  44.79 (23.38)  52.91 (35.01)
      n1 = n2 = 20     42.77 (19.04)  37.21 (21.63)  42.44 (19.16)  53.95 (35.38)
      n1 = n2 = 40     37.48 (21.06)  32.42 (16.51)  39.61 (22.14)  53.30 (37.01)
Table 5.1 shows the average number of genes and the average number of correlated
genes correctly identified from the set of 200 top ranked genes by each of
the methods. For example, when applying the D(adj) statistic to the simulated dataset
with sample size 20 per condition and correlation 0.8, we observe that, on average, 53.95
genes are correctly identified from the top 200 ranked genes, and among them 35.38 genes are
correlated with the marker gene G1. In contrast, SAM, RST and the moderated-t
method produced, on average, 42.77 (19.04), 37.21 (21.63) and 42.44 (19.16) genes,
respectively.
When we consider the performance of the methods in terms of correlation, it is
easy to declare a “winner” from the results in Table 5.1, because there is a general
trend indicating that the D(adj) statistic gives noticeably better answers across all three
correlations. This encourages the use of the D(adj) statistic to recover the pathway of
a seed gene showing differential expression. Overall, the D(adj) statistic identifies more
differentially expressed genes than any of the other methods and, among the differentially
expressed genes, it identifies the most correlated genes.
5.3.4. Application: Colon Cancer Data
A detailed evaluation of selection methods on real biological data is challenging due to
the difficulty of defining a gold standard. Here we have evaluated and applied all the
methods to a publicly available dataset, the Alon et al. (1999) colon cancer data.
In this experiment, 62 samples from colon-cancer patients were analyzed with an
Affymetrix Hum6000 oligonucleotide array: 40 samples are from tumors (labelled
“negative”) and 22 are from normal (labelled “positive”) biopsies taken from healthy
parts of the colons of the same patients. The dataset is available from the R package
colonCA. Figure 5.2 presents the correlation plot of the first 50 differentially expressed
genes identified by SAM. It is seen from the figure that a number of the differentially
expressed genes are pairwise correlated. Considering the most differentially expressed
gene identified by SAM, Hsa.8147, we find that 22 other genes are moderately to highly
correlated (absolute value of the correlation coefficient ≥ 0.6) with the gene Hsa.8147.
Table 5.2 presents the top ranked 20 genes by the RST, SAM, Moderated t and
D(adj) statistic. We considered the seed gene as Hsa.8147. The results indicate that
Figure 5.2: Correlation plot for the first 50 differentially expressed genes defined by SAM from the colon cancer data
15 of the genes found by the D(adj) statistic are concordant with the RST method,
while 7 and 14 genes are concordant with SAM and the Mod-t statistic,
respectively. The D(adj) statistic therefore shares the most concordant genes with the
RST method for this dataset.
Table 5.2: Top ranked 20 genes by different methods

 rank  RST         SAM (7)     Mod-t (14)  D(adj) (15)
  1    Hsa.37937   Hsa.8147    Hsa.8147    Hsa.8147
  2    Hsa.6814    Hsa.692.2   Hsa.692.2   Hsa.36952
  3    Hsa.549     Hsa.692     Hsa.37937   Hsa.462
  4    Hsa.831     Hsa.1131    Hsa.1832    Hsa.3306
  5    Hsa.627     Hsa.692.1   Hsa.692     Hsa.3331
  6    Hsa.773     Hsa.1832    Hsa.692.1   Hsa.1832
  7    Hsa.2928    Hsa.37937   Hsa.36689   Hsa.821
  8    Hsa.601     Hsa.4689    Hsa.1131    Hsa.692.2
  9    Hsa.3306    Hsa.8125    Hsa.2456    Hsa.601
 10    Hsa.36689   Hsa.957     Hsa.8125    Hsa.36689
 11    Hsa.1832    Hsa.8068    Hsa.6814    Hsa.2097
 12    Hsa.462     Hsa.5398    Hsa.36952   Hsa.2928
 13    Hsa.36952   Hsa.1221    Hsa.601     Hsa.3016
 14    Hsa.8147    Hsa.10755   Hsa.773     Hsa.5971
 15    Hsa.692.2   Hsa.1130    Hsa.2928    Hsa.957
 16    Hsa.2097    Hsa.831     Hsa.957     Hsa.773
 17    Hsa.3331    Hsa.36952   Hsa.2344    Hsa.1047
 18    Hsa.821     Hsa.43279   Hsa.3306    Hsa.692
 19    Hsa.692     Hsa.878     Hsa.831     Hsa.1205
 20    Hsa.3016    Hsa.5444    Hsa.2097    Hsa.2645
Different gene selection algorithms can potentially select different relevant genes
and lead to different classification accuracies. The misclassification errors were found
by choosing 20 to 70 top ranked genes according to the statistical values of different
methods. We evaluate the classification performance using a simple Gaussian max-
imum likelihood discriminant rule, for diagonal class covariance matrices. A 5 fold
cross-validation used to get the classification errors. The form of the algorithm is as
follows:
1. Randomly divide the samples into 5 partitions (subsamples).
2. Of the 5 partitions, a single subsample is retained as the validation data for
testing the model, and the remaining 4 subsamples are used as training data.
3. The cross-validation process is then repeated 5 times (the folds), with each of
the 5 subsamples used exactly once as the validation data.
Figure 5.3: Average misclassification error rate for the colon cancer dataset shown against the number of top ranked genes (RST, SAM, Mod-t, D(adj))
4. Record the number of top ranked genes from the training data and use these
genes for classification with a simple Gaussian maximum likelihood discriminant
rule, for diagonal class covariance matrices.
5. The error for a given classification relative to the known truth is then calculated by
the classError function of the R package mclust.
6. Report the average error over all 5 test sets.
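Steps 1–6 can be sketched with a hand-rolled diagonal Gaussian classifier standing in for the diagonal maximum likelihood discriminant rule; we do not reproduce the gene-ranking step or mclust's classError here, and all function names are our illustrative choices.

```python
import math
import random

def fit_diag_gaussian(X, y):
    """Per-class feature means and variances (diagonal class covariances)."""
    params = {}
    for c in set(y):
        Xc = [x for x, lab in zip(X, y) if lab == c]
        cols = list(zip(*Xc))
        means = [sum(col) / len(Xc) for col in cols]
        vars_ = [max(sum((v - mu) ** 2 for v in col) / len(Xc), 1e-6)
                 for col, mu in zip(cols, means)]
        params[c] = (means, vars_)
    return params

def predict(params, x):
    """Assign the class with the highest diagonal-Gaussian log-likelihood."""
    best, best_ll = None, -float("inf")
    for c, (means, vars_) in params.items():
        ll = sum(-0.5 * math.log(2 * math.pi * v) - (xi - mu) ** 2 / (2 * v)
                 for xi, mu, v in zip(x, means, vars_))
        if ll > best_ll:
            best, best_ll = c, ll
    return best

def cv_error(X, y, k=5, seed=0):
    """Average misclassification error over k cross-validation folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]        # each sample used once as test
    errors = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        p = fit_diag_gaussian([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(p, X[i]) != y[i] for i in fold)
        errors.append(wrong / len(fold))
    return sum(errors) / k
```

In the actual analysis, the features passed to the classifier would be the expression values of the top ranked genes selected from the training folds.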
The average misclassification errors for the different methods are plotted in Figure 5.3.
It appears from the figure that D(adj) provides better results than the RST
method and comparable results with the SAM and moderated-t statistics. In particular,
the proposed D(adj) statistic performs best when a small number of genes is of interest.
Therefore D(adj) can be applied effectively to this dataset for identification of the
important genes in microarray data analysis.
5.3.5. Effect of Seed Gene: Affymetrix spike-in study
The Affymetrix spike-in study contains 12,626 genes, 12 replicates in each group,
and 16 known differentially expressed genes (Cope et al. (2004)). The dataset is
taken from the Bioconductor package DEDS. The correlation plot with the 16 known
differentially expressed genes has been presented in the Figure 5.4. It is apparent from
the figure that all the 16 truly expressed genes are are highly correlated. Now our
interest is to see the seed gene effect on the D(adj) statistic. We have selected 4 genes
which are considered as seed gene: 684 at, 32660 at, 1552 i at and 1032 at which are
ranked as 1st, 10th, 15th and 20th according to the FDR values of empirical Bayes
moderated-t statistic. Among the seed genes first two are truly differentially expressed
genes and next two are not truly DE. The selection of first 20 genes according to the
permuted p-values when considering different seed gene have been given in the Table
5.3. Xrepresents truly differentially expressed genes. It is seen from the table that
considering 684 at and 32660 at as a seed gene the D(adj) statistic can identify 15
truly expressed genes from the 20 top ranked genes. On the other hand considering
1032 at as a seed gene the proposed statistic can identify only 2 truly expressed genes
from the 20 top ranked genes. Therefore detecting true changes in individual gene
expressions is highly affected by choosing a suitable seed gene in the use of D(adj)
statistic. It is suggested that gene expressions might be altered in related groups
defined by pathways, functions or localizations rather than individually (Segal et al.
(2004)). In such case, genes with distinguished expression changes could be detected,
but many other genes showing correlated but weak changes may be easily missed.
Therefore we are recommending our method in cases when a truly expressed genes is
known in advance from an experiment.
Figure 5.4: Correlation plot for the Affymetrix spike-in data
5.4. Discussion and Conclusion
We introduce a new gene selection procedure that takes correlation into account
given a known “seed” gene. Among efforts to rank genes, the proposed
D(adj) statistic gives satisfactory results.
Most methods in microarray studies examine one gene at a time, rank
genes according to their discriminating ability, and select only the high-ranking genes
for further study. Some information may therefore be lost by not considering genes
jointly; univariate selection can be an inadequate approach from both statistical
and biological points of view. The theory behind the D(adj) statistic is easily
understood, and its results have been shown to be more biologically relevant than
those of other methods. The relative benefit of a paired gene analysis compared to a
single gene analysis depends on the correlation between the observations
for the two genes being compared. In general, statistical correlations can be a hint
Table 5.3: Top ranked 20 genes by different seed genes (an X marks a truly differentially expressed gene)

 rank  684_at      32660_at    1552_i_at   1032_at
  1    1024_at X   1024_at X   1552_i_at   1032_at
  2    1091_at X   1091_at X   38254_at    34249_at
  3    32660_at    32660_at    39058_at X  39000_at
  4    33818_at X  33818_at X  32115_r_at  AFFX-YEL021w/URA3_at
  5    36085_at X  36085_at X  684_at X    39446_s_at
  6    36202_at X  36202_at X  1024_at X   966_at
  7    36311_at X  36311_at X  1091_at X   35939_s_at
  8    36889_at X  36889_at X  32660_at    37492_at
  9    37777_at X  37777_at X  33818_at X  32955_at
 10    38502_at    38502_at    36085_at X  41184_s_at
 11    38734_at X  38734_at X  36202_at X  37420_i_at
 12    40322_at X  40322_at X  36311_at X  1708_at X
 13    407_at X    407_at X    36889_at X  38406_f_at
 14    546_at X    546_at X    37777_at X  40276_at
 15    684_at X    684_at X    38502_at    41764_at
 16    1708_at X   38734_at X  31412_at
 17    39058_at X  1552_i_at   40322_at X  32650_at
 18    38254_at    39058_at X  407_at X    1253_at
 19    1552_i_at   32115_r_at  546_at X    35359_at
 20    37492_at    35339_at    35339_at    546_at X
for the fact that two genes belong to the same pathway, so that we expect a high
statistical correlation of expression values to have a meaningful biological interpretation.
The D(adj) statistic therefore provides a simple yet powerful tool for detecting DE
genes between the two experimental conditions.
If the “seed” gene is not known in advance, then the first ranked gene plays an
important role in the proposed method; that is, the method depends heavily on the
first ranked gene identified by a univariate gene approach such as SAM. If the “seed”
gene is not correctly identified, the remaining significant genes in the pathway may
be hard to detect correctly. We therefore suggest verifying that the first ranked gene
is a biologically relevant DE gene that allows clear separation of the two conditions.
It is also possible to use the second ranked gene (or third, and so on) as the “seed”
gene in the proposed method, provided it is identified as differentially expressed under
the two experimental conditions. The D(adj) statistic performs better when a small
number of genes is biologically interesting. The method offers a useful approach for
extracting biological information from expression data, and it can show its full
analytical power when the seed gene is identified correctly, from which many biological
processes can be determined.
It is not expected that a single method will produce the best result under all
simulation scenarios. However, whichever method produced the best estimate in a
particular case, the proposed nonparametric method usually came close to it. The
D(adj) statistic exhibits results comparable with SAM and the moderated-t in most
simulation scenarios, but it finds more highly correlated genes than any of the other
methods. We therefore propose employing the D(adj) statistic to pick correlated
genes with higher confidence.
In the field of microarray data analysis, there is no appropriate answer as to which
method or which statistic should be used, as the choice of statistic can dramatically
affect the set of genes that is selected. One should choose the measure of differential
expression based on the biological system of interest. If our interest is to find differ-
entially expressed genes correlated with a “seed” gene, then our proposed method is
preferable since it provides useful analytical tools taking correlation into account.
CHAPTER 6
Conclusion
6.1. Thesis Summary
Prognostic and predictive factors are indispensable tools in the treatment of patients.
Traditional methods of characterization are often limited and do not have the abil-
ity to discern subtle differences that may be of importance for developing a better
understanding of the disease and advancing therapeutic strategies for the treatment
of disease. Gene expression assays have the potential to supplement a few distinct
genes with data from many thousands of genes. We have developed three statistical
techniques that provide predictive capability based on gene expression data derived
from DNA microarray analysis. We have taken both parametric and nonparametric
approaches in creating statistical methods that provide good probabilistic prediction
and classification of two or more conditions based on gene expression data.
The noisy nature of gene expression data has motivated the development of nu-
merous algorithms for identifying differentially expressed genes. Most available para-
metric methods in the study of microarray data analysis rely on a normal distribution
assumption and are based on a Wald statistic. These methods are inefficient, espe-
cially when expression levels follow a skewed distribution. To deal with violations of
the normality assumption, we propose an approximate likelihood ratio test (ALRT)
assuming Generalized Logistic Distribution of Type II (GLDII). The strength of using
GLDII is that the distribution has a shape parameter which will help to consider both
the symmetric and asymmetric distributions for the data. Our simulation and data
analysis suggest that the t-statistic, SAM or empirical Bayes t can perform poorly
if there is in fact significant variation between gene variances, which
is expected in real data. Our ALRT(G) method for the GLDII is found to perform
comparably well at all settings of treatment effect, variability, sample size and noise
effect. Our study shows that this method performs well with small numbers of
samples and with noisy datasets. An added advantage of the ALRT method is that
it can handle multiclass microarray datasets.
We borrow the idea of the receiver operating characteristic (ROC) from clinical
biostatistics and demonstrate its application to microarray analysis. We propose a
nonparametric method in an AUC setting that is based on a novel and model-free
estimate of the variance vector across genes. The contribution to this “single-gene”
statistic is the studentization of the empirical AUC which takes into account the
variances associated with each gene in each experiment. The method suggests how
to use the variance information to produce a list of significant targets and assess
differential gene expressions under two experimental conditions. It is also possible to
obtain p-values using this method. The essence of this method is to take advantage
of using the variance of empirical AUC for each gene to achieve the goal of selecting
important genes. Given the high noise effect, together with small sample sizes, the
proposed method outperformed pAUC and standard ROC methods for detecting
differential gene expression. This underlines the importance of incorporating and
modeling the variance structure of microarray data during the development of future
statistical tests. The added advantage of the proposed method is that it doesn’t
involve complicated formulas and so it is computationally easy to understand.
Another important nonparametric technique suggested in this study provides a way
of taking into account pairwise correlation with a “seed” gene. The statistic in this
method compares a pair of genes and selects correlated differentially expressed genes
sequentially; the method can be used effectively for correlated gene selection. The
simulation study shows that the nonparametric method performed
well with datasets that had low noise effects and large sample size. Using simulated
data sets and real microarray data sets, we showed that this novel algorithm is compa-
rable with other gene selection procedures. In particular, the D(adj) statistic exhibits
a significantly greater power of detecting correlated genes.
The study aims to improve the statistical analysis procedure for identifying differ-
entially expressed genes effectively. The ALRT method with underlying distribution
of GLDII assumption gives flexibility by considering shape of the distribution. Using
a simulation study and real datasets, we show that this parametric method outper-
forms other gene selection procedures especially for large sample sizes. The proposed
nonparametric methods do not involve complicated formulas. Both nonparametric
methods can identify a comparable fraction of truly differentially expressed genes,
particularly when sample sizes are large or outliers are present. The method that
examines statistical correlations can suggest that two genes belong to the same
pathway, since a high statistical correlation of expression values is expected to have
a meaningful biological interpretation. This is an important benefit of the method
and shows its potential for identifying gene pathways underlying an observed
phenotype. A key point is the capacity to identify not just highly expressed genes
but genes whose expression correlates strongly with the phenotype, thus providing
additional insight into the underlying biological pathways.
In practice, choosing a gene selection algorithm for selecting a list of differentially
expressed genes usually depends on characteristics of the expression data such as
sample size, treatment effects, variability, and correlation structure. To identify
important genes one may need to select a number of different gene subsets that can all
solve the classification problem with similar high accuracy. For this purpose, we feel
that different selection algorithms should also be employed to render a comprehensive
exploration of the useful genes. In this case, our methods can be used together with
many other approaches. In the context of machine learning, this approach is referred
to as ensembling, which has also been used in gene selection problems. Therefore,
our work can be viewed as providing new choices for building an ensemble system.
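As a toy illustration of the ensemble idea, one simple aggregation scheme converts each method's statistic into per-gene ranks and averages them; the two score vectors below are invented stand-ins for, say, t-type and SAM-type statistics.

```r
# Rank aggregation across several gene-selection statistics: a toy sketch.
# stats: list of per-gene score vectors (larger |score| = more significant).
ensemble.rank <- function(stats) {
  ranks <- sapply(stats, function(s) rank(-abs(s)))   # rank 1 = top gene
  order(rowMeans(ranks))                              # best-first gene indices
}

s1 <- c(0.2, 3.1, 0.5, 2.8)   # e.g. t-type statistics (invented)
s2 <- c(0.1, 2.5, 0.9, 3.0)   # e.g. SAM-type statistics (invented)
top <- ensemble.rank(list(s1, s2))   # genes 2 and 4 come out on top
```

In practice each score vector would come from a different selection procedure, such as the ALRT, SAM, and AUC-based statistics discussed above.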
6.2. Future Work
6.2.1. Improving the ALRT method
The ALRT method can be adjusted by borrowing information from all the genes. It
may be possible to get more robust results from the ALRT method by borrowing
strength from genes in local intensity regions for estimation of shape and scale pa-
rameters of GLDII. For example, we can use Wald test for comparing two location
parameters of GLDII and smooth the variance by adding an offset in the denomina-
tor similar to what is done in the SAM statistic. However this approach is based on
specific data modeling assumptions. We can also make the procedure nonparametric
by estimating p-values from permuted datasets (random column permutations).
With the proposed approach, the test statistic becomes

    d = (µ1 − µ2) / (SE(µ1 − µ2) + s0),

where µ1 and µ2 are the approximate location parameters of the GLDII for the
treatment and control groups, respectively, and SE(·) is the standard error of the
difference between the two approximate location parameters. Following the approach
of Efron et al. (2001), we can take the offset s0 to be the 90th percentile of the
standard errors of the location differences between the two conditions.
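A small R sketch of this proposal follows. For illustration only, ordinary group means and a pooled per-gene standard error stand in for the approximate GLDII location estimates and SE(µ1 − µ2); those stand-ins, and the function names, are assumptions of the sketch rather than the thesis's estimators.

```r
# Moderated statistic d = (mean1 - mean2) / (SE + s0) with permutation p-values.
# X: genes-by-samples matrix; L: 0/1 vector of condition labels.
mod.stat <- function(X, L) {
  G1 <- X[, L == 0, drop = FALSE]; G2 <- X[, L == 1, drop = FALSE]
  d  <- rowMeans(G1) - rowMeans(G2)
  se <- sqrt(apply(G1, 1, var) / ncol(G1) + apply(G2, 1, var) / ncol(G2))
  s0 <- quantile(se, 0.9)          # offset: 90th percentile of the SEs
  d / (se + s0)
}

perm.pvalues <- function(X, L, B = 200) {
  obs <- mod.stat(X, L)
  exceed <- numeric(nrow(X))
  for (b in 1:B) {                  # random column (label) permutations
    perm <- mod.stat(X, sample(L))
    exceed <- exceed + (abs(perm) >= abs(obs))
  }
  (exceed + 1) / (B + 1)            # add-one permutation p-value
}

set.seed(3)
X <- matrix(rnorm(50 * 10), nrow = 50)
L <- rep(0:1, each = 5)
X[1, L == 1] <- X[1, L == 1] + 3    # one truly shifted gene
p <- perm.pvalues(X, L, B = 100)    # p[1] should be small
```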
6.2.2. Possible extension for D(adj) statistic
We can develop a score test for testing whether or not some pre-specified groups of
genes are differentially expressed by the D(adj) statistic. The groups of genes can
be those that are involved in a particular biochemical pathway or a genomic region
of interest, and should be specified before testing. This method would be valuable
for testing whether known pathways, in combination with groups of genes, affect a
clinical outcome. We can cluster datasets using the performance of genes and assess
their ability to separate groups of interest through the D(adj) statistic. It is also of
interest to see how a further candidate gene, one potentially less correlated with the
first, can contribute new information in microarray analysis.
Bibliography
Affymetrix Inc. (2002). Statistical Algorithms Description Document, http://www.
affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf.
Allison et al.(2002). A mixture model approach for the analysis of microarray gene
expression data, Computational Statistics and Data Analysis, 39:1, 1-20.
Alon et al.(1999). Broad patterns of gene expression revealed by clustering analysis of
tumor and normal colon tissue probed by oligonucleotide arrays, Proc. Natl. Acad.
Sci. USA, 96: 6745-6750.
Alter, O., Brown, P. O. and Botstein, D. (2000). Singular value decomposition for
genome-wide expression data processing and modeling, Proc. Natl. Acad. Sci.,
97:10101-10106.
Arvesen, J.N. (1969). Jackknifing U-statistics, Annals of Mathematical Statistics,
40(6): 2076-2100.
Balakrishnan, N. and Hossain, A. (2007). Inference for the Type II generalized logistic
distribution under progressive Type II censoring, Journal of Statistical Computa-
tion and Simulation, 77, 12:1013-1031.
Balakrishnan, N. and Leung, M.Y., (1988). Order statistics from the Type I general-
ized logistic distribution, Communications in Statistics Simulation and Computa-
tion, 17:25-50.
Baldi, P. and A. D. Long (2001). A Bayesian framework for the analysis of microar-
ray expression data: regularized t-test and statistical inferences of gene changes,
Bioinformatics, 17:509-519.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a
practical and powerful approach to multiple testing, J. R. Statistical Society B,
57:289-300.
Bhowmick, D., Davison, A.C., Goldstein, D. R., and Ruffieux, Y. (2006). A Laplace
mixture model for identification of differential expression in microarray experi-
ments, Biostatistics, 7(4):630-641.
Bokka, S. and Mathur, S.K. (2006). A nonparametric likelihood ratio test to iden-
tify differentially expressed genes from microarray data, Applied Bioinformatics,
5(4):267-276.
Braun, T.M. and Alonzo, T.A. (2008). A modified sign test for comparing paired
ROC curves, Biostatistics, 9(2):364-372.
Breiman, L., Friedman, J. H., Olsen, R. A., and Stone, C. J. (1984). Classification
and Regression Trees. Wadsworth, Monterey, CA.
Breitling, R. and Herzyk, P.(2005). Rank-based methods as a non-parametric alterna-
tive of the T-statistic for the analysis of biological microarray data, J. Bioinform.
Comput. Biol., 3(5):1171-89.
Broberg, P. (2003). Statistical methods for ranking differentially expressed genes,
Genome Biology, 4:41.
Brown, P.O. and Botstein, D. (1999). Exploring the new world of the genome with
DNA microarrays, Nat Genet., 21:33-37.
Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T.
S., Ares Jr, M. and Haussler, D. (2000). Knowledge-based analysis of microarray
gene expression data by using support vector machines, Proceedings of the National
Academy of Sciences, 97:262- 267.
Chen, Y., Dougherty, E.R. and Bittner, M.L. (1997). Ratio-based decisions and the
quantitative analysis of cDNA microarray images, J. Biomed. Optics, 2: 364-374.
Chilingaryan, A. et al. (2002). Multivariate approach for selecting sets of differentially
expressed genes, Math. Biosci., 176:59-69.
Chu, G., Narasimhan, B., Tibshirani, R. and Tusher, V. (2002). SAM, significance
analysis of microarrays, users guide and technical document, Technical report, Stan-
ford University, http://www-stat.stanford.edu/tibs/SAM.
Cope, L. M., Irizarray, R.A., Jaffee, H., Wu, Z. and Speed T.P. (2004). A benchmark
for Affymetrix GeneChip expression measures, Bioinformatics, 20:323-331.
Cox, D. R. and Hinkley, D. V. (1974).Theoretical Statistics, Chapman and Hall, (page
92).
Craig, B.A., Black, M.A. and Doerge, R.W. (2003). Gene expression data: the tech-
nology and statistical analysis, Journal of Agricultural, Biological, and Environ-
mental Statistics, 8:1-28.
Cui, X., Hwang, J.T.G., Jing, Q. , Natalie J. Blades, and Churchill, G. A. (2005).
Improved statistical tests for differential gene expression by shrinking variance com-
ponents estimates, Biostatistics , 6:59-75.
Dean, N. and Raftery, A. E. (2005). Normal uniform mixture differential gene expres-
sion detection for cDNA microarrays, BMC Bioinformatics, 6:173-186.
Debouck, C. and Goodfellow, PN. (1999). DNA microarrays in drug discovery and
development, Nat. Genet., 21:48-50.
Dietz, K., Gail, M., Krickeberg, K., Samet, J., and Tsiatis, A. (2003). Statistics for
Biology and Health, Springer, 21:4850.
Ding, A. A., LIN, J., and Niu, T. (2008). A Statistical Procedure for Detecting Highly
Correlated Genes with a Pre-Specified Candidate Gene in Microarray Analysis,
Communications in Statistics - Theory and Methods, 37(18):2991-3007.
Diciccio, T. and Tibshirani, R. (1991). On the implementation of profile likelihood
methods, Technical report no. 9107, University of Toronto.
DeLong, E.R., DeLong, D.M. and Clarke-Pearson, D.L. (1988). Comparing the Area
under Two or More Correlated Receiver Operating Characteristic Curves: A Non-
parametric Approach, Biometrics , 44(3): 837-845.
Dorfman, D.D. and Alf Jr., E. (1969). Maximum likelihood estimation of parameters
of signal detection theory and determination of confidence intervals - rating-method
data, Journal of Mathematical Psychology, 6: 487-496.
Dorfman, D., Berbaum, K., and Metz, C. (1992). Receiver operating characteristic
rating analysis: generalization to the population of readers and patients with the
jackknife method, Investigative Radiology, 27: 723-731.
Dudoit, S., Fridlyand, J. and Speed, TP. (2002). Comparison of Discrimination Meth-
ods for the Classification of Tumors Using Gene Expression Data, Journal of the
American Statistical Association, 97(457):77-87.
Dudoit, S., Yang, YH, Callow, MJ. and Speed, TP. ( 2002). Statistical methods for
identifying differentially expressed genes in replicated cDNA microarray experi-
ments, Statistica Sinica, 12:111-139.
Efron, B. and Tibshirani, R (1993). An Introduction to the Bootstrap, NY: Chapman
and Hall.
Efron, B., Tibshirani, R., Storey, J. and Tusher, V. (2001). Empirical Bayes Analysis
of a Microarray Experiment, JASA , 96:1151-1160.
Efron, B. and Tibshirani, R. (2002). Empirical Bayes methods and false discovery
rates for microarrays, Genet. Epidemiology, 23(1):70-86.
Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster
analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci.,
95:14863-14868.
Fox, R. J. and M. W. Dimmic (2006). A two-sample Bayesian t-test for microarray
data, BMC Bioinformatics, 7:126.
Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M. and Haussler,
D. (2000). Support vector machine classification and validation of cancer tissue
samples using microarray expression data, Bioinformatics, 16:906-914.
Garrett-Mayer, E., Parmigiani, G., Zhong, X., Cope, L. and Gabrielson, E. (2004).
Cross-study Validation and Combined Analysis of Gene Expression Microarray
Data, Technical Report, Johns Hopkins University, Department of Biostatistics.
Gautier, L., Cope, L., Bolstad B.M., and Irizarry, R.A. (2004). affy-analysis of
Affymetrix GeneChip data at the probe level, Bioinformatics, 20:307-315.
Ge, Y., Dudoit, S. and Speed, T.P. (2003). Resampling-based multiple testing for
microarray data analysis, TEST, 12:1-44.
Geman, D., Christian d’Avignon, Naiman, D.Q., and Winslow, R.L., (2004). Classify-
ing Gene Expression Profiles from Pairwise mRNA Comparisons, Stat Appl Genet
Mol Biol., 3:1 Article 19.
Goddard MJ, Hinberg I. (1990). Receiver operating characteristic (ROC) curves and
non-normal data: an empirical study, Statistics in Medicine, 9:325-337.
Golub et al. (1999). Molecular classification of cancer: class discovery and class pre-
diction by gene expression monitoring, Science, 286:531-537.
Ghosh, D. (2004). Mixture models for assessing differential expression in complex
tissues using microarray data, Bioinformatics 20:1663-1669.
Goeman, J.J., Geer, S., Kort, F. and Houwelingen, H. (2004). A global test for groups
of genes: testing association with a clinical outcome, Bioinformatics, 20(1):93-99.
Hanley JA. (1988). The Robustness of the Binormal Assumption Used in Fitting ROC
Curves, Medical Decision Making, 8(3):197-203.
Hanley, J. and McNeil B.(1983). A method for comparing the areas under receiver
operating characteristic curves derived from the same cases, Radiology, 148:839-843.
Harbig, J., Sprinkle, R. and Enkemann, S.A. (2005). A sequence-based identification of the
genes detected by probesets on the Affymetrix U133 plus 2.0 array, Nucleic Acids
Res., 33:e31.
Haslett, J.N., Sanoudou, D., Kho, A.T., Bennett, R., Greenberg, S.A., Kohane, I.S.,
Beggs, A.H., Kunkel, L.M. (2002). Gene expression comparison of biopsies from
Duchenne muscular dystrophy (DMD) and normal skeletal muscle, Proceedings of
the National Academy of Sciences, USA, 99:15000-15005.
Hoeffding W. (1948). A class of statistics with asymptotically normal distribution,
Annals of Mathematical Statistics, 19(3):293-325.
Hochberg, Y., and Tamhane, A. (1987). Multiple Comparison Procedures, Wiley.
Hossain, A., Beyene, J., Willan, A., and Hu, P. (2009). A flexible approximate likeli-
hood ratio test for detecting differential expression in microarray data, Computa-
tional Statistics and Data Analysis, 53:3685-3695.
Hossain, A., and Willan, A. (2007). Approximate MLEs of the Parameters of
Location-Scale Models under Type II Censoring, Statistics: A Journal of Theo-
retical and Applied Statistics, 41(5):385.
Hu, P., Beyene, J. and Greenwood, CMT (2006). Tests for differential gene expression
using weights in oligonucleotide microarray experiments, BMC Genomics, 7:33.
Huber, W., Von Heydebreck, A., Sultmann, H., Poustka, A. and Vingron, M. (2002).
Variance stabilization applied to microarray data calibration and to the quantification
of differential expression, Bioinformatics, 18(Suppl. 1):S96-S104.
Hunter, L., Taylor, R.C., Leach, S.M. and Simon, R. (2001). GEST: a gene expression
search tool based on a novel Bayesian similarity metric, Bioinformatics, 17(1):S115-
S122.
Ideker, T., Thorsson, V., Siegel, A.F. and Hood, L.E. (2000). Testing for
Differentially-Expressed Genes by Maximum-Likelihood Analysis of Microarray
Data, Journal of Computational Biology, 7(6):805-817.
Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf
U. and Speed, T.P., (2003). Exploration, normalization, and summaries of high
density oligonucleotide array probe level data, Biostatistics, 4(2):249-64.
Jeffery, I.B., Higgins, D. G. and Culhane A.C. (2006). Comparison and evaluation
of methods for generating differentially expressed gene lists from microarray data,
BMC Bioinformatics, 7:359.
Kerr, M.K., Martin, M. and Churchill, G.A. (2000). Analysis of variance for gene
expression microarray data, Journal of Computational Biology, 7(6):819-37.
Khan J., Wei, J.S., Ringnér, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold,
F., Schwab, M., Antonescu, C.R., Peterson, C. and Meltzer, P.S. (2001). Classi-
fication and diagnostic prediction of cancers using gene expression profiling and
artificial neural networks, Nature Medicine, 7(6):658-9.
Klebanov, L. and Yakovlev, A. (2007). Diverse correlation structures in gene expression
data and their utility in improving statistical inference, The Annals of Applied
Statistics, 1(2):538-559.
Larsen, R. J. and Marx, M. L. (2001). An Introduction to Mathematical Statistics and
Its Applications, 3rd ed. Englewood Cliffs, NJ: Prentice-Hall.
Lee, M.L.T., Kuo, F.C., Whitmore, G.C. and Sklar, J. (2000). Importance of replica-
tion in microarray gene expression studies: Statistical methods and evidence from
repetitive cDNA hybridizations, PNAS, 97(18):9834-9839.
Li, C. and Wong, W.H. (2001). Model-based analysis of oligonucleotide arrays: Ex-
pression index computation and outlier detection, PNAS, 98(1):31-36.
Li, L., Weinberg, C., Darden, T. and Pedersen, L. (2001). Gene selection for sam-
ple classification based on gene expression data: study of sensitivity to choice of
parameters of the GA/KNN method, Bioinformatics, 17:1131-1142.
Lin,D., Shkedy, Z., Burzykowski,T., Ion,T., Gohlmann, H., Bondt, A., Perera, T.,
Geerts, T., Wyngaert, V., and Bijnens, L. (2008). An Investigation on Perfor-
mance of Significance Analysis of Microarray (SAM) for the Comparisons of Several
Treatments with one Control in the Presence of Small-variance Genes, Biometrical
Journal, 50(5):801-823.
Lobenhofer, E.K., Bushel, P.R., Afshari, C.A. and Hamadeh, H.K.(2001). Progress in
the application of DNA microarrays, Environ. Health Perspect., 109:881-891.
Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., Gallo, M.V., Chee, M.S.,
Mittmann, M., Wang, C., Kobayashi, M., Horton, H. and Brown E.L. (1996). Ex-
pression monitoring by hybridization to high-density oligonucleotide arrays, Nature
Biotechnology, 14:1675-1680.
Lonnstedt, I. and Speed, T. P. (2002). Replicated microarray data, Statistica Sinica,
12:31-46.
Lu, Y., Liu, P., Xiao, P. and Deng H. (2005). Hotelling’s T 2 multivariate profiling for
detecting differential expression in microarrays, Bioinformatics, 21(14):3105-3113.
Macoska, JA. (2002). The progressing clinical utility of DNA microarrays, CA Cancer
J Clin., 52:50-59.
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979). Multivariate Analysis, Academic
Press,London.
Mason, R.L., Gunst, R.F., and Hess, J.L. (1989). Statistical Design and Analysis
of Experiments: With Applications to Engineering and Science, John Wiley, New
York.
Millenaar, F.F., Okyere, J., May, S.T., Zanten, M.V., Voesenek, L.A., and Peeter,
A.J. (2006). How to decide? Different methods of calculating gene expression from
short oligonucleotide array data will give different results, BMC Bioinformatics,
7:137.
Newton, M. A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., and Tsui,
K.W.(2001). On differential variability of expression ratios: improving statistical
inference about gene expression changes from microarray data, J. Comp. Biol.,
8:37-52.
Newton, M. A., Noueiry, A., Sarkar, D. and Ahlquist, P. (2004). Detecting differential
gene expression with a semiparametric hierarchical mixture method, Biostatistics,
5:155-176.
Nguyen, D.V., Arpat, A.B., Wang, N., Carroll, R.J. (2002). DNA microarray experi-
ments: biological and technological aspects, Biometrics, 58(4):701-17.
Pan, W. (2002). A comparative review of statistical methods for discovering dif-
ferentially expressed genes in replicated microarray experiments, Bioinformatics,
12:546-554.
Pan, W., Lin, J. and Lee, C. (2003). A mixture model approach to detecting differen-
tially expressed genes with microarray data, Function Interg. Genomics, 3:117-124.
Pawitan, Y., Michiels, S., Koscienlny, S., Gusnanto, A. and Ploner, A. (2005). False
discovery rate, sensitivity and sample size for microarray studies, Bioinformatics,
21(13):3017-3024.
Pepe, M.S., Longton, G. Anderson, G.L., and Schummer, M. (2003). Selecting Differ-
entially Expressed Genes from Microarray Experiments, Biometrics, 59:133-142.
Piper, P., Lewis, D.P. and Noble, WS. (2002). Exploring gene expression data with
class scores, Pac. Symp. Biocomput., p474-85.
Pounds, S. and Cheng, C. (2005). Statistical Development and Evaluation of Microar-
ray Gene Expression Data Filters, Journal of Computational Biology, 12(4):482-495.
Qiu, X., Klebanov, L. and Yakovlev, A. Y. (2005). Correlation between gene expres-
sion levels and limitations of the empirical bayes methodology for finding differen-
tially expressed genes, Statistical Applications in Genetics and Molecular Biology,
4:34.
Qiu, X. and Yakovlev, A. (2006). Some comments on instability of false discovery
rate estimation, J. Bioinformatics and Computational Biology, 4:1057-1068.
Opgen-Rhein, R. and Strimmer, K. (2007). Accurate Ranking of Differentially Ex-
pressed Genes by a Distribution-Free Shrinkage Approach, Statistical Applications
in Genetics and Molecular Biology, 6(1), Article 9.
Quackenbush, J. (2001). Computational analysis of microarray data, Nature Review
Genetics, 2:418-427.
Raychaudhuri, S., Stuart, J.M., Liu, X., Small, P.M. and Altman, R.B.(2000). Pattern
recognition of genomic features with microarrays: site typing of Mycobacterium
tuberculosis strains, Proc. Int. Conf. Intell. Syst. Mol. Biol., 8:286-295.
Reyal, F., Stransky, N., Bernard-Pierrot, I., Vincent-Salomon, A., de Rycke, Y. and
Elvin, P. (2005). Visualizing chromosomes as transcriptome correlation maps: Evi-
dence of chromosomal domains containing co-expressed genes: a study of 130 invasive
ductal breast carcinomas, Cancer Res., 65:1376-1383.
Sahai, H. and Agell, M. (2000). Analysis of variance: Fixed, Random and Mixed
models, Boston: Birkhauser.
Sambrook and Russell (2001). Molecular Cloning: A Laboratory Manual, 3rd edition,
Cold Spring Harbor Laboratory Press.
Schulze, A. and Downward, J. (2001). Navigating gene expression using microarrays: a
technology review, Nat Cell Biol., 3:190-195.
Scharpf, R.B., Tjelmeland, H., Parmigiani, G. and Nobel, A.B. (2009). A Bayesian
Model for Cross-Study Differential Gene Expression, Journal of the American
Statistical Association, 104(488):1295-1310.
Schena, M. (2000). Microarray Biochip Technology. Westborough,MA: BioTechniques
Press.
Segal, E., Friedman, N., Koller, D. and Regev, A. (2004). A module map showing
conditional activity of expression modules in cancer, Nat. Genet., 36:1090-1098.
Seng, K., Glenny, R.W., Madtes, D.K., Spilker, M.E., Vicini, P. and Gharib, S.A.
(2008). Comparison of Statistical Data Models for Identifying Differentially Ex-
pressed Genes Using a Generalized Likelihood Ratio Test, Gene Regulation and
Systems Biology, 2:125-139.
Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing dif-
ferential expression in microarray experiments, Statistical Applications in Genetics
and Molecular Biology, 3(1), Article 3.
Smyth, G. K., Thorne, N. P., and Wettenhall, J. (2004). Limma: Linear models
for microarray data user’s guide, Software manual available from http://www.
bioconductor.org.
Speed, T. (eds) (2003). Statistical Analysis of Gene Expression Microarray Data.
Chapman & Hall/CRC, USA.
Storey, J. D. (2002). A direct approach to false discovery rates, J. R. Statist. Soc. B
, 64(3):479-498.
Storey, J. and Tibshirani, R. (2003). SAM thresholding and false discovery rates
for detecting differential gene expression in DNA microarrays, In Parmigiani, G.,
Garrett, E.S., Irizarry, R.A. and Zeger, S.L. (eds), The Analysis of Gene Expression
Data: Methods and Software, New York: Springer.
Sawilowsky S.S. (1993). Comments on using alternatives to normal theory statistics
in social and behavioural science, Canadian Psychology, 34:432-439.
Szabo, A. et al. (2002). Variable selection and pattern recognition with gene expres-
sion data generated by the microarray technology, Math. Biosci., 176:71-98.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander,
E. S. and Golub, T. R. (1999). Interpreting patterns of gene expression with self-
organizing maps: methods and application to hematopoietic differentiation, Proc.
Natl. Acad. Sci. U.S.A., 96:2907-2912.
Thomas, J.G., Olson, J.M., Tapscott, S.J. and Zhao, L.P. (2001). An efficient and
robust statistical modeling approach to discover differentially expressed genes using
genomic expression profiles, Genome Research, 11:1227-1236.
Troyanskaya, O.G., Garber, M., Brown, P., Botstein, D., Altman, R.B. (2002).
Nonparametric methods for identifying differentially expressed genes in microarray
data, Bioinformatics, 18(11):1454-1461.
Tsodikov, A., Szabo, A. and Jones, D. (2002). Adjustments and measures of differ-
ential expression for microarray data, Bioinformatics, 18:251-260.
Tusher, V.G., Tibshirani, R., Chu, G., (2001). Significance analysis of microarray
applied to the ionizing radiation response, PNAS, 98(9):5116-5121.
Van’t Wout, A. B., G.K. Lehrman, S. A. Mikheeva, G. C. O’Keeffe, M. G. Katze, R.
E. Bumgarner, G. K. Geiss, and J.I. Mullins (2003). Cellular gene expression upon
human immunodeficiency virus type 1 infection of CD4(+) T-cell lines, Journal of
Virology, 77(2):1392-1402.
Westfall, P.H., and Young, S.S. (1993). Re-sampling based multiple testing, Wiley,
NY.
Wolfinger, R., Gibson, G., Wolfinger, E., Bennett, L., Hamadeh, H., Bushel P, Af-
shari, C. and Paules, R. (2001). Assessing Gene Significance from cDNA Microarray
Expression Data via Mixed Models, Journal of Computational Biology, 8:625-637.
Wright, G.W. and Simon, R.M. (2003). A random variance model for detection of dif-
ferential gene expression in small microarray experiments, Bioinformatics, 19:2448-
2455.
Wu, B. (2005). Differential gene expression detection using penalized linear regression
models: the improved SAM statistics, Bioinformatics, 21:1565-1571.
Yang, K., Cai, Z., Li, J. and Lin, G. (2006). A stable gene selection in microarray
data analysis, BMC Bioinformatics, 7:228.
Xiong, M., Fang, X., Zhao, J. (2001). Biomarker Identification by Feature Wrappers,
Genome Research, 11:1878-1887.
Yekutieli, D., Benjamini, Y. (1999), Resampling-based false discovery rate controlling
multiple test procedures for correlated test statistics, Journal of Statistical Planning
and Inference, 82:171-196.
Yang, Y.H., Dudoit, S.D., Luu, P. and Speed, T.P. (2001). Normalization for cDNA
Microarray Data, In Spie Bioe.
Yeung, K.Y., Bumgarner, R.E. and Raftery, A.E. (2005). Bayesian model averaging:
development of an improved multi-class, gene selection and classification tool for
microarray data, Bioinformatics, 21:2394.
Zhao, Y. and Pan, W. (2003). Modified nonparametric approaches to detecting dif-
ferentially expressed genes in replicated microarray experiments, Bioinformatics,
19:1046-1054.
Zhang, S. (2006). An Improved Nonparametric Approach for Detecting Differentially
Expressed Genes with Replicated Microarray Data, Statistical Applications in Ge-
netics and Molecular Biology, 5(1).
Zweig, M.H., Campbell, G. (1993). Receiver operating characteristic (ROC) plots: a
fundamental evaluation tool in clinical medicine, Clinical Chemistry, 39(4):561-77.
Zhou, Y., Corentin C., Ohsugi, M., Stormo, G. and Permutt, M.(2007). A global
approach to identify differentially expressed genes in cDNA (two-color) microarray
experiments, Bioinformatics, 23(16):2073-2079.
Appendix
A. R Code for Chapter 1
####
library(Biobase)
library(multtest)
library(DEDS)
# The following code computes t-statistic values from the golub data
# and draws the histogram
data(golub)
t.fun <- comp.t(golub.cl)
t.X <- t.fun(golub)
hx <- hist(t.X, breaks=100, plot=FALSE)
plot(hx, col=ifelse(abs(hx$breaks) < 3, 4, 2), xlab="T statistic", ylab="", main="")
# The following function computes t-statistic after permutation
t.twoclass<-
function(X, L, B = 2, Delta = 2)
{
# L A vector of integers corresponding to observation (column) class
# labels.
# X A matrix, with m rows corresponding to variables
# (hypotheses) and n columns corresponding to observations.
# B The number of permutations.
# Delta The threshold value for the statistic; see Tusher et al. (2001).
ng <- as.vector(table(L))
n <- sum(ng)
t.fun <- comp.t(L, var.equal = TRUE)
tX <- t.fun(as.matrix(X))
tXb <- matrix(nr = nrow(X), nc = B)
for(i in 1:B) {
id <- sample(c(1:n), n, replace = FALSE)
Xb <- X[, id]
t.fun <- comp.t(L, var.equal = TRUE)
tXb[, i] <- t.fun(as.matrix(Xb))
}
p.t1 <- apply(abs(tXb) >= Delta, 2, sum)
p.t <- mean(p.t1)
return(p.t)
}
B. R Code for Chapter 3
### t-statistics and FDR calculation
t.ah.func <- function(X, L, B = 100) {
# L A vector of integers corresponding to observation (column) class
# labels.
# X A matrix, with m rows corresponding to variables
# (hypotheses) and n columns corresponding to observations.
# B The number of permutations.
G1 <- X[, L == 0]
G2 <- X[, L == 1]
n1 <- ncol(G1)
n2 <- ncol(G2)
r <- rowMeans(G1, na.rm = TRUE) - rowMeans(G2, na.rm = TRUE)
ss <- function(x)
{
sum((as.numeric(x) - mean(as.numeric(x), na.rm = TRUE))^2, na.rm = TRUE)
}
s <- sqrt(((apply(G1, 1, ss) + apply(G2, 1, ss)) * (1/n1 + 1/n2))/(n1 + n2 - 2))
t <- r/s
order.t <- order(t)
sort.t <- sort(t)
tB <- matrix(nr = nrow(X), nc = B)
for(i in 1:B) {
id <- sample(c(1:(n1 + n2)), n1 + n2)
G1 <- X[, id[1:n1]]
G2 <- X[, id[(n1 + 1):(n1 + n2)]]
rb <- rowMeans(G2, na.rm = TRUE) - rowMeans(G1, na.rm = TRUE)
sb <- sqrt(((apply(G1, 1, ss) + apply(G2, 1, ss)) * (1/n1 + 1/n2))/(n1 + n2 - 2))
tb <- rb/sb
tB[, i] <- sort(tb)
}
return(cbind(index = order.t, t = sort.t, tB))
}
t.ah.fdr<- function (order.t, ordertB, deltas) {
n <- length(order.t)
ndelta <- length(deltas)
tB <- rowMeans(ordertB)
diff <- order.t - tB
tmp <- quantile(as.vector(ordertB), c(0.25, 0.75), na.rm = TRUE)
pi <- sum(order.t < tmp[2] & order.t > tmp[1], na.rm = TRUE)/(n/2)
pi <- min(pi, 1)
table <- c()
for (i in 1:ndelta) {
delta <- deltas[i]
if (sum((diff > delta) & (order.t > 0)) != 0) {
pos <- min(which((diff > delta) & (order.t > 0))):n
n.pos <- length(pos)
}
else n.pos <- 0
if (sum((diff < (-delta)) & (order.t < 0)) != 0) {
neg <- 1:max(which((diff < (-delta)) & (order.t <
0)))
n.neg <- length(neg)
}
else n.neg <- 0
n.total <- n.pos + n.neg
max <- ifelse(n.pos == 0, Inf, order.t[n - n.pos + 1])
min <- ifelse(n.neg == 0, -Inf, order.t[n.neg])
fp <- median(apply(ordertB, 2, function(z) {
sum(z >= max | z <= min)
}), na.rm = TRUE)
median.fp <- ifelse(pi * fp/n.total > 1, 1, pi *
fp/n.total)
table <- rbind(table, c(delta, n.total, n.pos, n.neg, fp,
median.fp))
}
colnames(table) <- c("delta", "no.significance", "no.positive",
"no.negative", "FP", "FDR")
return(table)
}
comp.ah.t<- function (X,L, B = 200, deltas) {
t <- t.ah.func(X, L, B)
n <- ncol(t)
fdr.table <- t.ah.fdr(t[, 2], t[, 3:n], deltas)
return(list(geneOrder = t[, 1], t.stat = t[, 2], fdr.table = fdr.table))
}
### SAM statistic and FDR calculation
sam.ah.func <- function (X, L, prob = NULL, B = 200, stat.only = FALSE,
s.step = 0.01, alpha.step = 0.01)
{
G1 <- X[, L == 0]
G2 <- X[, L == 1]
n1 <- ncol(G1)
n2 <- ncol(G2)
r <- rowMeans(G1, na.rm = TRUE) - rowMeans(G2, na.rm = TRUE)
ss <- function(x) {
sum((as.numeric(x) - mean(as.numeric(x), na.rm = TRUE))^2,
na.rm = TRUE)
}
s <- sqrt((apply(G1, 1, ss) + apply(G2, 1, ss)) * (1/n1 +
1/n2)/(n1 + n2 - 2))
if (!is.null(prob))
s0 <- quantile(s, prob)
else s0 <- sam.ah.s0(r, s, s.step = s.step, alpha.step = alpha.step)
d <- r/(s + s0)
if (stat.only)
return(d)
else {
order.d <- order(d)
sort.d <- sort(d)
B <- deds.checkB(L, B)
dB <- matrix(nr = nrow(X), nc = B)
for(i in 1:B){
id <- sample(c(1:(n1 + n2)), n1 + n2)
G1 <- X[, id[1:n1]]
G2 <- X[, id[(n1 + 1):(n1 + n2)]]
rb <- rowMeans(G2, na.rm = TRUE) - rowMeans(G1, na.rm = TRUE)
sb <- sqrt((apply(G1, 1, ss) + apply(G2, 1, ss)) *
(1/n1 + 1/n2)/(n1 + n2 - 2))
db <- rb/(sb + s0)
dB[, i] <- sort(db)
}
return(cbind(index = order.d, t = sort.d, dB))
}
}
sam.ah.s0<- function(r, s, s.step = 0.01, alpha.step = 0.01) {
if(1/s.step >= length(s))
s.step <- 2 * (1/length(s))
q <- quantile(s, seq(0, 1, by = s.step))
while(any(duplicated(q))) {
s.step <- s.step * 2
q <- quantile(s, seq(0, 1, by = s.step))
}
q.indices <- cut(s, q, labels = FALSE, right = FALSE)
q.indices[which(is.na(q.indices))] <- 1/s.step
sam.d.alpha <- function(r, s, alpha)
{
s.alpha <- quantile(s, alpha)
r/(s + s.alpha)
}
cv.alpha <- function(alpha)
{
d.alpha <- sam.d.alpha(r, s, alpha)
v.j <- tapply(d.alpha, q.indices, mad)
sd(v.j, na.rm = TRUE)/mean(v.j, na.rm = TRUE)
}
alpha.seq <- seq(0, 1, by = alpha.step)
cva <- sapply(alpha.seq, cv.alpha)
alpha0 <- alpha.seq[which(cva == min(cva))]
s0 <- quantile(s, alpha0)
i <- 2
while(s0 == 0) {
alpha0 <- alpha.seq[which(cva == cva[order(cva)[i]])]
i <- i + 1
s0 <- quantile(s, alpha0)
}
s0
}
sam.ah.fdr<- function (order.d, ordertB, deltas) {
n <- length(order.d)
ndelta <- length(deltas)
tB <- rowMeans(ordertB)
diff <- order.d - tB
tmp <- quantile(as.vector(ordertB), c(0.25, 0.75), na.rm = TRUE)
pi <- sum(order.d < tmp[2] & order.d > tmp[1], na.rm = TRUE)/(n/2)
pi <- min(pi, 1)
table <- c()
for (i in 1:ndelta) {
delta <- deltas[i]
if (sum((diff > delta) & (order.d > 0)) != 0) {
pos <- min(which((diff > delta) & (order.d > 0))):n
n.pos <- length(pos)
}
else n.pos <- 0
if (sum((diff < (-delta)) & (order.d < 0)) != 0) {
neg <- 1:max(which((diff < (-delta)) & (order.d <
0)))
n.neg <- length(neg)
}
else n.neg <- 0
n.total <- n.pos + n.neg
max <- ifelse(n.pos == 0, Inf, order.d[n - n.pos + 1])
min <- ifelse(n.neg == 0, -Inf, order.d[n.neg])
fp <- median(apply(ordertB, 2, function(z) {
sum(z >= max | z <= min)
}), na.rm = TRUE)
FDR <- ifelse(pi * fp/n.total > 1, 1, pi *
fp/n.total)
table <- rbind(table, c(delta, n.total, n.pos, n.neg, fp,
FDR))
}
colnames(table) <- c("delta", "no.significance", "no.positive",
"no.negative", "fp", "FDR")
return(table)
}
comp.ah.SAM <- function(X, L, prob = NULL, B = 200, stat.only = FALSE, deltas,
s.step = 0.01, alpha.step = 0.01)
{
d <- sam.ah.func(X, L, prob, B, stat.only, s.step = s.step,
alpha.step = alpha.step)
if(stat.only)
return(d)
else {
n <- ncol(d)
fdr.table <- sam.ah.fdr(d[, 2], d[, 3:n], deltas)
return(list(geneOrder = d[, 1], sam = d[, 2], fdr.table =
fdr.table))
}
}
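For orientation, here is a tiny deterministic illustration (not part of the thesis code) of the SAM-type statistic d = r/(s + s0) that sam.ah.func computes. For simplicity the fudge factor s0 is taken here as the median of s, whereas sam.ah.s0 chooses it by minimizing a coefficient-of-variation criterion:

```r
# Two genes, three samples per group; gene g1 is shifted, g2 is null.
X <- rbind(g1 = c(1, 2, 3, 7, 8, 9),
           g2 = c(4, 5, 6, 5, 6, 4))
L <- c(0, 0, 0, 1, 1, 1)
G1 <- X[, L == 0]; G2 <- X[, L == 1]
n1 <- ncol(G1); n2 <- ncol(G2)
r <- rowMeans(G1) - rowMeans(G2)          # group mean difference
ss <- function(x) sum((x - mean(x))^2)    # within-group sum of squares
s <- sqrt((apply(G1, 1, ss) + apply(G2, 1, ss)) * (1/n1 + 1/n2)/(n1 + n2 - 2))
s0 <- unname(quantile(s, 0.5))            # simplified fudge factor
d <- r/(s + s0)
```

The shifted gene g1 gets a large (negative, since r is group 0 minus group 1) statistic, while the null gene g2 sits at zero.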
ALRT method
## Fitting GLDII with 2 class microarray data
# AMLE for b, mu1 and mu2 for GLDII
# y1 expression values for treatment condition
# y2 expression values for control condition
# alpha shape parameter for GLDII
GLogistic2.amle2<-
function(alpha=1,y1,y2) {
n1 <- length(y1)
i1 <- 1:n1
pi1 <- i1/(n1 + 1)
qi1 <- 1 - pi1
nui1 <- log( - log(qi1))
n2 <- length(y2)
i2 <- 1:n2
pi2 <- i2/(n2 + 1)
qi2 <- 1 - pi2
nui2 <- log( - log(qi2))
D1 <- function(x)
{
(alpha*(exp(-x)-1))/(1+exp(-x))
}
D1prime <- function(x)
{
-alpha*(2*exp(-x))/(1+exp(-x))^2
}
Ai <- D1(nui1) - nui1 * D1prime(nui1)
Bi <- - D1prime(nui1)
Cj <- D1(nui2) - nui2 * D1prime(nui2)
Dj <- - D1prime(nui2)
K1 <- (sum(Bi * y1) )/(sum(Bi) )
L1 <- sum(Ai) /sum(Bi)
K2 <- (sum(Dj * y2) )/(sum(Dj) )
L2 <- sum(Cj) /sum(Dj)
lambda1 <- sum((y1 - K1) * Ai)+sum((y2 - K2) * Cj)
lambda2 <- sum(Bi * (y1 - K1)^2) +sum(Dj * (y2 - K2)^2)
b <- ( - lambda1 + sqrt(lambda1^2 + 4 * (n1+n2) * lambda2))/(2 * (n1+n2))
mu1 <- K1 - L1 * b
mu2 <- K2 - L2 * b
c(b, mu1,mu2)
}
# AMLE for b, mu0 under H0 for GLDII
GLogistic2.amle.h0<-
function(alpha=1, y1,y2) {
n1 <- length(y1)
i1 <- 1:n1
pi1 <- i1/(n1 + 1)
qi1 <- 1 - pi1
nui1 <- log( - log(qi1))
n2 <- length(y2)
i2 <- 1:n2
pi2 <- i2/(n2 + 1)
qi2 <- 1 - pi2
nui2 <- log( - log(qi2))
D1 <- function(x)
{
(alpha*(exp(-x)-1))/(1+exp(-x))
}
D1prime <- function(x)
{
-alpha*(2*exp(-x))/(1+exp(-x))^2
}
Ai <- D1(nui1) - nui1 * D1prime(nui1)
Bi <- - D1prime(nui1)
Cj <- D1(nui2) - nui2 * D1prime(nui2)
Dj <- - D1prime(nui2)
K0 <- (sum(Bi * y1) +sum(Dj * y2))/(sum(Bi)+sum(Dj) )
L0 <- (sum(Ai)+sum(Cj)) /(sum(Bi)+sum(Dj))
lambda1 <- sum((y1 - K0) * Ai)+sum((y2 - K0) * Cj)
lambda2 <- sum(Bi * (y1 - K0)^2) +sum(Dj * (y2 - K0)^2)
b <- ( - lambda1 + sqrt(lambda1^2 + 4 * (n1+n2) * lambda2))/(2 * (n1+n2))
mu0 <- K0 - L0 * b
c(b, mu0)
}
## Estimating alpha values for GLDII
alpha.Glogistic2<-
function(x1, x2, lower, upper)
{
# lower, upper The least and greatest value at which to evaluate
# the profile likelihood.
alphas <- seq(lower, upper, by = 0.001)
m2 <- length(alphas)
L.est <- NA
for(i in 1:m2) {
alpha <- alphas[i]
est <- GLogistic2.amle2(alpha, x1, x2)
b.est <- est[1]
mu1 <- est[2]
mu2 <- est[3]
pdfglogis <- function(x, alpha, mu, b)
{
(alpha/b) * (exp( - ((x - mu)/b) * alpha)/(1 +
exp( - ((x - mu)/b)))^(alpha + 1))
}
L.est[i] <- sum(log(pdfglogis(x1, alpha, mu1, b.est))) +
sum(log(pdfglogis(x2, alpha, mu2, b.est)))
}
alphas[which.max(L.est)]
}
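As a numerical sanity check (illustrative only, not thesis code), the GLDII density used in the profile likelihood above integrates to one; a finite integration range is used to avoid overflow of exp() in the far tails:

```r
pdfglogis <- function(x, alpha, mu, b) {
    z <- (x - mu)/b
    (alpha/b) * exp(-z * alpha)/(1 + exp(-z))^(alpha + 1)
}
# mass outside (-40, 40) is negligible for these parameter values
total <- integrate(pdfglogis, -40, 40, alpha = 2.5, mu = 1, b = 0.7)$value
```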
get.alpha<-function(dataf,index1,index2,datafilter=as.numeric){
f<-function(i){
return(alpha.Glogistic2(datafilter(dataf[i,index1]),
datafilter(dataf[i,index2]), lower=0, upper=5))
}
return(sapply(1:length(dataf[,1]),f))
}
# ALRT
LR.Glogistic<-
function(alpha = 1, x1, x2)
{
est <- GLogistic2.amle2(alpha, x1, x2)
b.est <- est[1]
mu1 <- est[2]
mu2 <- est[3]
est.ho <- GLogistic2.amle.h0(alpha, x1, x2)
b0.est <- est.ho[1]
mu0 <- est.ho[2]
pdfglogis <- function(x, alpha, mu, b)
{
(gamma(2 * alpha)/(gamma(alpha))^2) * (1/b) * (exp( - (
(x - mu)/b) * alpha)/(1 + exp( - ((x - mu)/b)))^
(2 * alpha))
}
L.est <- sum(log(pdfglogis(x1, alpha, mu1, b.est))) + sum(log(
pdfglogis(x2, alpha, mu2, b.est)))
L.ho <- sum(log(pdfglogis(x1, alpha, mu0, b0.est))) + sum(log(
pdfglogis(x2, alpha, mu0, b0.est)))
LR <- 2 * L.est - 2 * L.ho
LR
}
get.GLR<-
function(dataf, alpha = 1, index1, index2, datafilter = as.numeric)
{
f <- function(i)
{
return(LR.Glogistic(alpha, datafilter(dataf[i, index1]),
datafilter(dataf[i, index2])))
}
return(sapply(1:length(dataf[, 1]), f))
}
# Permutation results
Galrt1.twoclass<-
function (X, L, B = 2,alpha=1)
{
ng<-as.vector(table(L))
n<-sum(ng)
alrt<-get.GLR(X,alpha,1:ng[1],(ng[1]+1):n)
alrtb <- matrix(nrow = nrow(X), ncol = B)
for (i in 1:B) {
id <- sample(c(1:n), n, replace=FALSE)
Xb<-X[,id]
alrtb[,i] <- get.GLR(Xb,alpha,1:ng[1],(ng[1]+1):n)
}
p.alrt<-apply(abs(alrtb)>=abs(alrt),1,sum)/B
return(p.alrt)
}
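The permutation scheme in Galrt1.twoclass can be sketched self-containedly. This toy version (hypothetical data, with a plain mean difference standing in for the ALRT statistic) shows the p-value construction:

```r
set.seed(1)
x <- c(0.1, 0.4, 0.3, 5.2, 5.8, 5.5)      # two groups of three, strong shift
L <- rep(0:1, each = 3)
stat <- function(x, L) mean(x[L == 1]) - mean(x[L == 0])
obs <- stat(x, L)
B <- 500
perm <- replicate(B, stat(x, sample(L)))  # relabel and recompute
p <- mean(abs(perm) >= abs(obs))          # two-sided permutation p-value
```

With only 20 distinct 3-vs-3 splits, the smallest attainable p-value here is about 0.1; real microarray sample sizes give a finer grid.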
# Number of significant genes and false positives corresponding to a
# cutoff p-value
nosigp<- function(alphas, pvalue, id.DE) {
np <- length(alphas)
table <- c()
for(i in 1:np) {
p <- alphas[i]
sig.t <- sum(pvalue < p, na.rm = T)
fp <- sum(pvalue[ - id.DE] < p, na.rm = T)
fn <- sum(pvalue[id.DE] >= p, na.rm = T)
tp <- sum(pvalue[id.DE] < p, na.rm = T)
tn <- sum(pvalue[ - id.DE] >= p, na.rm = T)
sens <- tp/(tp + fn)
spec <- tn/(fp + tn)
table <- rbind(table, c(p, sig.t, fp, fn, tp, tn, sens, spec))
}
colnames(table) <- c("alpha", "sig", "fp", "fn", "tp", "tn", "sens", "spec")
return(table)
}
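The confusion-table arithmetic inside nosigp, worked by hand on a hypothetical vector of six p-values in which the first two genes are truly differentially expressed:

```r
pvalue <- c(0.01, 0.20, 0.03, 0.50, 0.04, 0.90)
id.DE <- 1:2                       # indices of the truly DE genes
p <- 0.05                          # cutoff
tp <- sum(pvalue[id.DE] < p)       # gene 1 only
fn <- sum(pvalue[id.DE] >= p)      # gene 2
fp <- sum(pvalue[-id.DE] < p)      # genes 3 and 5
tn <- sum(pvalue[-id.DE] >= p)     # genes 4 and 6
sens <- tp/(tp + fn)
spec <- tn/(fp + tn)
```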
# Calculating Area under the ROC curve from simulation
sim.logis <- function(R = 100, m, alpha0 = 6, beta0 = 2, a0 = 5, a = 4, b = 8,
B = 10, n1 = 5, n2 = 5, pde = 0.1)
{
{
auc.LR <- NA
auc.GLR <- NA
auc.sam <- NA
# auc.t<- NA
auc.willcox <- NA
auc.modt <- NA
n <- n1 + n2
for(r in 1:R) {
simml <- function(m, n, a0)
{
bs <- 1/rgamma(m, shape = a0, rate = a0)
alphas <- rgamma(m, shape = alpha0, rate = beta0)
table1 <- c()
for(i in 1:m) {
b <- bs[i]
alpha1 <- alphas[i]
X <- glgsimn2(r = n, alpha1, mu = 0, b)
table1 <- rbind(table1, c(alpha1, X))
}
return(table1)
}
X1 <- simml(m, n, a0)
alpha <- X1[, 1]
X <- X1[, 2:(n + 1)]
mug <- function(m, n2, pde, a, b)
{
table2 <- c()
de <- pde * m
for(i in 1:de) {
mug <- rgamma(n2, shape = a, rate = b)
table2 <- rbind(table2, c(mug))
}
return(table2)
}
mug1 <- mug(m, n2, pde, a, b)
X[1:nrow(mug1), (n1 + 1):n] <- X[1:nrow(mug1), (n1 + 1):n] + mug1
L <- rep(0:1, c(n1, n2))
genenames <- paste(c("g"), 1:m, sep = "")
row.names(X) <- genenames
alpha2 <- seq(0, 1, by = 0.05)
DEGenes <- 1:(nrow(mug1))
auci <- function(pvalues)
{
res <- nosigp(alpha2, pvalues, DEGenes)
se <- res[, 7]
sp <- res[, 8]
rocobj1 <- r.sca(se, sp)
AUCi(rocobj1)
}
GLR.sim <- G2alrt1.twoclass(X, L, B, alpha)
auc.GLR[r] <- auci(GLR.sim)
alrtX.sim <- alrt1.twoclass(X, L, B)
auc.LR[r] <- auci(alrtX.sim)
samX.sim <- sam1.twoclass(X, L, B)
auc.sam[r] <- auci(samX.sim)
# tX.sim<-t1.twoclass(X,L,B)
# auc.t[r]<-auci(tX.sim)
modt.sim <- modt.twoclass(X, L, B)
auc.modt[r] <- auci(modt.sim)
wilcox.sim <- samwilc.twoclass(X, L, B)
auc.willcox[r] <- auci(wilcox.sim)
}
GLRT <- mean(auc.GLR)
sd.GLRT <- sqrt(var(auc.GLR))
ALRT <- mean(auc.LR)
sd.ALRT <- sqrt(var(auc.LR))
sam1 <- mean(auc.sam)
sd.sam <- sqrt(var(auc.sam))
#t1<-mean(auc.t)
#sd.t<-sqrt(var(auc.t))
modt <- mean(auc.modt)
sd.modt <- sqrt(var(auc.modt))
willcox <- mean(auc.willcox)
sd.willcox <- sqrt(var(auc.willcox))
table1 <- data.frame(GLRT, ALRT, sam1, modt, sd.GLRT, sd.ALRT,
sd.sam, sd.modt, willcox, sd.willcox)
return(table1)
}
## Generating GLDII
glgsim1<-
function(alpha) {
W <- runif(1)
invglg3<-function(x, alpha, na.action = na.omit ) {
u <- qbeta(x, 1, alpha)
log(u/(1-u))
}
invglg3(W, alpha)
}
glgsimn2<-
function(r,alpha=2,mu,b,na.action = na.omit)
{
i <- 1
Z <- NULL
while(i <= r) {
x<-glgsim1(alpha)
Z[i] <- mu+b*x
i <- i + 1
}
Z
}
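glgsim1 draws from GLDII by inverse-CDF sampling through a Beta(1, alpha) quantile. As an illustrative check, the empirical distribution of such draws should match the GLDII distribution function F(x) = 1 - exp(-alpha*x)/(1 + exp(-x))^alpha:

```r
set.seed(42)
alpha <- 2
u <- qbeta(runif(20000), 1, alpha)   # same construction as glgsim1, vectorized
draws <- log(u/(1 - u))
Fgld <- function(x, alpha) 1 - exp(-alpha * x)/(1 + exp(-x))^alpha
x0 <- 0.5
err <- abs(mean(draws <= x0) - Fgld(x0, alpha))
```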
# Prediction by Discriminant Rule
# (sum.na, mean.na, var.na and order.na are helper functions from the sma package)
stat.diag.da<-
function (ls, cll, ts, pool = 1)
{
ls <- as.matrix(ls)
ts <- as.matrix(ts)
n <- nrow(ls)
p <- ncol(ls)
nk <- rep(0, max(cll) - min(cll) + 1)
K <- length(nk)
m <- matrix(0, K, p)
v <- matrix(0, K, p)
disc <- matrix(0, nrow(ts), K)
for (k in (1:K)) {
which <- (cll == k + min(cll) - 1)
nk[k] <- sum.na(which)
m[k, ] <- apply(ls[which, ], 2, mean.na)
v[k, ] <- apply(ls[which, ], 2, var.na)
}
vp <- apply(v, 2, function(z) sum.na((nk - 1) * z)/(n - K))
if (pool == 1) {
for (k in (1:K)) disc[, k] <- apply(ts, 1, function(z) sum.na((z -
m[k, ])^2/vp))
}
if (pool == 0) {
for (k in (1:K)) disc[, k] <- apply(ts, 1, function(z) (sum.na((z -
m[k, ])^2/v[k, ]) + sum.na(log(v[k, ]))))
}
pred <- apply(disc, 1, function(z) (min(cll):max(cll))[order.na(z)[1]])
list(pred = pred)
}
B. R Code for Chapter 4
### CALCULATION OF ROC
W1<-
function(x, y) {
Ai <- w.new(x, y)$A
if(Ai >= 0.5) {
wval1 <- w.new(d1 = x, d2 = y)
A <- wval1$A
s00 <- wval1$s00
s11 <- wval1$s11
var1A <- wval1$var1A
psi.idot <- wval1$psi.idot
psi.dotj <- wval1$psi.dotj
psi.idot.m.A <- psi.idot - A
psi.dotj.m.A <- psi.dotj - A
}
else {
wval2 <- w.new(d1 = y, d2 = x)
A <- wval2$A
s00 <- wval2$s00
s11 <- wval2$s11
var1A <- wval2$var1A
psi.idot <- wval2$psi.idot
psi.dotj <- wval2$psi.dotj
}
list(psi.idot = psi.idot, psi.dotj = psi.dotj, s00 = s00, s11 =
s11, var1A = var1A, A = A)
}
get.A <- function(dataf, index1, index2, datafilter = as.numeric)
{
f <- function(i)
{
return(W1(datafilter(dataf[i, index1]), datafilter(dataf[i,
index2]))$A)
}
return(sapply(1:length(dataf[, 1]), f))
}
get.var <- function(dataf, index1, index2, datafilter = as.numeric)
{
f <- function(i)
{
return(W1(datafilter(dataf[i, index1]), datafilter(dataf[i,
index2]))$var1A)
}
return(sapply(1:length(dataf[, 1]), f))
}
# Concordance with SAM statistic
# (X, L and genenames are assumed to exist in the calling environment)
no.concodance<-
function(bs)
{
nb <- length(bs)
table <- c()
for(i in 1:nb) {
b <- bs[i]
t1 <- t.stat(X, L)
a2.t <- statd.gnames(t1, genenames, crit = b)
a3.t <- data.frame(rank = 1:b, gnames = a2.t$gnames, d =
a2.t$d)
sam1 <- sam.ah.func(X, L, stat.only = TRUE)
a2.samr <- statd.gnames(sam1, genenames, crit = b)
a3.samr <- data.frame(rank = 1:b, gnames = a2.samr$gnames,
d = a2.samr$d)
aucd1 <- AUC.ah.func(X, L, stat.only = TRUE)
a2.aucd <- statd.gnames(aucd1, genenames, crit = b)
a3.aucd <- data.frame(rank = 1:b, gnames = a2.aucd$gnames,
d = a2.aucd$d)
auc1 <- AUC5.ah.func(X, L, stat.only = TRUE)
a2.auc <- statd.gnames(auc1, genenames, crit = b)
a3.auc <- data.frame(rank = 1:b, gnames = a2.auc$gnames,
d = a2.auc$d)
no.t <- length(intersect(as.vector(a3.t$gnames), as.vector(
a3.samr$gnames)))
no.aucd <- length(intersect(as.vector(a3.aucd$gnames),
as.vector(a3.samr$gnames)))
no.auc <- length(intersect(as.vector(a3.auc$gnames),
as.vector(a3.samr$gnames)))
table <- rbind(table, c(b, no.t, no.aucd, no.auc))
}
colnames(table) <- c("b", "no.t", "no.aucd", "no.auc")
return(table)
}
AUC.pair<-
function(sigs, dataf, index1, index2)
{
nb <- length(sigs)
fal <- matrix(0, ncol = nb, nrow = dim(dataf)[1])
for(i in 1:nb) {
fal[, i] <- AUC1EQ2(sigs[i], dataf, index1, index2)
}
meana2 <- function(x)
{
mean(abs(x), na.rm = T)
}
apply(fal, 1, meana2)
}
# Simulation Code
# R= Number of simulation
# m= number of genes
# n1=number of samples for treatment group
# n2=number of samples for control group
# d= treatment effect
# pde= proportion of differentially expressed genes
# r=correlation coefficient
# nclass= number of correlated class
frac.genes <- function(R, m = 100, n1 = 10, n2 = 10, r = 0.8, nclass = 5,
sigma2 = 4, d = 2, pde = 0.1)
{
degenes.corauc <- NA
degenes.modt <- NA
degenes.auc <- NA
# degenes.auca2<-NA
for(b in 1:R) {
X1 <- simulated(m = m, n1 = n1, n2 = n2, r, nclass, sigma2,
d, pde)
X <- X1$X
degenes <- paste(c("g"), X1$degenes, sep = "")
genenames <- row.names(X)
L <- rep(0:1, c(n1, n2))
degenes.corauc[b] <- corauc(X, L, genenames, degenes)
A.stat <- get.W(X, 1:n1, (n1 + 1):(n1 + n2))
rA <- statd.gnames(A.stat, genenames, crit = 10)
degenes.auc[b] <- length(intersect(rA$gnames, degenes))
design <- model.matrix( ~ L)
fit1 <- lmFit(X, design, method = "ls")
fit21 <- contrasts.fit(fit1, c(0, 1))
fiteb1 <- eBayes(fit21)
topt5 <- toptable(number = 10, genelist = genenames, fit
= fit21, eb = fiteb1, adjust = "fdr")
degenes.modt[b] <- length(intersect(topt5$ID, degenes))
}
DEgenes.corauc <- mean(degenes.corauc)
DEgenes.modt <- mean(degenes.modt)
DEgenes.auc <- mean(degenes.auc)
# DEgenes.auca2 <- mean(degenes.auca2)
list(DEgenes.corauc = DEgenes.corauc, DEgenes.modt = DEgenes.modt,
DEgenes.auc = DEgenes.auc)
}
B. R Code for Chapter 5
### Correlation Plot
rgbcolor<-function (n = 50)
{
k <- round(n/2)
r <- c(rep(0, k), seq(0, 1, length = k))
g <- c(rev(seq(0, 1, length = k)), rep(0, k))
res <- rgb(r, g, rep(0, 2 * k))
res
}
plot.cor<-
function (x, nrgcols = 50, labels = FALSE, labcols = 1,
title = "")
{
n <- ncol(x)
corr <- cor(x)
image(1:n, 1:n, corr[, n:1], col = rgbcolor(nrgcols),
axes = FALSE, xlab = "", ylab = "")
if (length(labcols) == 1) {
axis(2, at = n:1, labels = labels, las = 2, cex.axis = 0.6,
col.axis = labcols)
axis(3, at = 1:n, labels = labels, las = 2, cex.axis = 0.6,
col.axis = labcols)
}
mtext(title, side = 3, line = 3)
box()
}
### Simulated Dataset for Correlated Samples
# (mvrnorm requires the MASS package)
simn<-function(m, nclass, n1, n2, r)
{
ngenes <- m/nclass
# ngenes is the number of genes per group
# nclass is Total number of groups
n <- n1 + n2
rmv <- function(mn, r)
{
u <- as.matrix(rep(1, ngenes))
B <- u %*% t(u) * r + (1 - r) * diag(rep(1, ngenes))
Sigma <- B * 4
x1 <- mvrnorm(n1, mn, Sigma)
x0 <- mvrnorm(n2, rep(0,ngenes), Sigma)
mat <- t(rbind(x1, x0))
return(mat)
}
mu<-rep(0,ngenes)
X <- c()
for(i in 1:nclass)
X <- rbind(X, rmv(mn = mu, r))
add1 <- rep(1, n1)
# add a unit shift to the first gene in the treatment samples,
# keeping X as an m-by-n matrix
X1 <- X[, 1:n1]
X1[1, ] <- X1[1, ] + add1
X2 <- X[, (n1 + 1):n]
X <- data.frame(X1, X2)
genenames <- paste(c("g"), 1:m, sep = "")
row.names(X) <- genenames
return(X)
}
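The covariance matrix built inside simn's rmv is compound symmetry (equal correlation r, variance 4), constructed as (u u' r + (1 - r) I) * 4. A small deterministic check of that structure:

```r
ngenes <- 4; r <- 0.8
u <- as.matrix(rep(1, ngenes))
B <- u %*% t(u) * r + (1 - r) * diag(rep(1, ngenes))
Sigma <- B * 4   # variance 4 on the diagonal, covariance r*4 = 3.2 off it
```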
D(adj) statistic
# d1= expression values for treatment group
# d2= expression values for control group
w.new<-
function(d1, d2)
{
M <- length(d2)
N <- length(d1)
psi <- matrix(0, nrow = N, ncol = M)
for(i in c(1:N)) {
psi[i, ] <- ifelse(d1[i] > d2, 1, 0)
}
A <- sum(psi)/(M * N)
A <- max(A, 1 - A)   # fold so that the larger of A and 1 - A is reported
psi.idot <- apply(psi, 1, sum)/M
psi.dotj <- apply(psi, 2, sum)/N
s00 <- sum((psi.idot - A)^2)/(N - 1)
s11 <- sum((psi.dotj - A)^2)/(M - 1)
var1A <- s00/N + s11/M
psi.idot.m.A <- psi.idot - A
psi.dotj.m.A <- psi.dotj - A
list(M = M, N = N, psi.idot = psi.idot, psi.dotj = psi.dotj,
psi.idot.m.A = psi.idot.m.A, psi.dotj.m.A = psi.dotj.m.A,
s00 = s00, s11 = s11, var1A = var1A, A = A)
}
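The A statistic in w.new is the empirical AUC, the proportion of (treatment, control) pairs with d1 > d2, and therefore equals the Mann-Whitney U statistic divided by M*N (before the fold toward values above 0.5). A quick self-contained check on toy values:

```r
d1 <- c(2.1, 3.5, 4.2, 5.0)           # "treatment" values
d2 <- c(1.0, 2.5, 3.0)                # "control" values
A <- mean(outer(d1, d2, ">"))         # empirical P(d1 > d2)
U <- as.numeric(wilcox.test(d1, d2)$statistic)
U.scaled <- U/(length(d1) * length(d2))
```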
psi.m.A<-
function(x, y) {
M <- length(y)
N <- length(x)
psi <- matrix(0, nrow = N, ncol = M)
for(i in c(1:N)) {
psi[i, ] <- ifelse(x[i] > y, 1, 0)
}
Ai <- sum(psi)/(M * N)
if(Ai >= 0.5) {
A <- Ai
psi.idot <- apply(psi, 1, sum)/M
psi.dotj <- apply(psi, 2, sum)/N
psi.idot.m.A <- psi.idot - A
psi.dotj.m.A <- psi.dotj - A
s00 <- sum((psi.idot - A)^2)/(N - 1)
s11 <- sum((psi.dotj - A)^2)/(M - 1)
var1A <- s00/N + s11/M
}
else {
# flipped orientation: rows of psi now index y, columns index x
psi <- matrix(0, nrow = M, ncol = N)
for(i in c(1:M)) {
psi[i, ] <- ifelse(y[i] > x, 1, 0)
}
A <- sum(psi)/(M * N)
# row means of the flipped psi play the role of psi.dotj in the
# original orientation, and column means the role of psi.idot
psi.dotj <- apply(psi, 1, sum)/N
psi.idot <- apply(psi, 2, sum)/M
psi.idot.m.A <- psi.idot - A
psi.dotj.m.A <- psi.dotj - A
s00 <- sum((psi.idot - A)^2)/(N - 1)
s11 <- sum((psi.dotj - A)^2)/(M - 1)
var1A <- s00/N + s11/M
}
list(psi.idot = psi.idot, psi.dotj = psi.dotj, psi.idot.m.A =
psi.idot.m.A, psi.dotj.m.A = psi.dotj.m.A)
}
get.A <-
function(dataf, index1, index2, datafilter = as.numeric)
{
f <- function(i)
{
return(w.new(datafilter(dataf[i, index1]), datafilter(
dataf[i, index2]))$A)
}
return(sapply(1:length(dataf[, 1]), f))
}
get.var <-
function(dataf, index1, index2, datafilter = as.numeric) {
f <- function(i)
{
return(w.new(datafilter(dataf[i, index1]), datafilter(
dataf[i, index2]))$var1A)
}
return(sapply(1:length(dataf[, 1]), f))
}
get.psi.idot.m.A <-
function(dataf, index1, index2, datafilter = as.numeric) {
f <- function(i)
{
return(w.new(datafilter(dataf[i, index1]), datafilter(
dataf[i, index2]))$psi.idot.m.A)
}
return(sapply(1:length(dataf[, 1]), f))
}
get.psi.dotj.m.A <-
function(dataf, index1, index2, datafilter = as.numeric) {
f <- function(i)
{
return(w.new(datafilter(dataf[i, index1]), datafilter(
dataf[i, index2]))$psi.dotj.m.A)
}
return(sapply(1:length(dataf[, 1]), f))
}
COVA1A2<-
function(dataf, topindex, index1, index2)
{
values.psi.idot.m.A <- get.psi.idot.m.A(dataf, index1, index2)
values.psi.dotj.m.A <- get.psi.dotj.m.A(dataf, index1, index2)
v1 <- as.vector(values.psi.idot.m.A[, topindex])
v2 <- as.matrix(values.psi.idot.m.A)
N <- dim(v2)[1]
s10 <- as.vector((v1 %*% v2)/(N - 1))
v3 <- as.vector(values.psi.dotj.m.A[, topindex])
v4 <- as.matrix(values.psi.dotj.m.A)
M <- dim(v4)[1]
s01 <- as.vector((v3 %*% v4)/(M - 1))
covariance <- s10/N + s01/M
covariance
}
cov12<-
function(x, y)
{
fj<-
function(psi, psi.idot, psi.dotj, A)
{
fi <- function(psi, psi.idot)
{
B <- matrix(0, nrow = nrow(psi), ncol = ncol(psi))
for(i in c(1:nrow(psi))) {
B[i, ] <- psi[i, ] - psi.idot[i]
}
B
}
B <- fi(psi, psi.idot)
C <- matrix(0, nrow = nrow(psi), ncol = ncol(psi))
for(j in c(1:ncol(psi))) {
C[, j] <- B[, j] - psi.dotj[j]
}
C + A
}
# rebuild psi, its marginal means, and A in the unflipped orientation,
# since psi.m.A does not return psi or A
psi <- 1 * outer(x, y, ">")
A <- mean(psi)
psi.idot <- rowMeans(psi)
psi.dotj <- colMeans(psi)
fj(psi, psi.idot, psi.dotj, A)
}
s.2genes<-
function(index, dataf,index1,index2)
{
varA <- get.var(dataf, index1, index2)
varA1 <- varA[index]
get.cov<-COVA1A2(dataf,index,index1,index2)
table <- c()
for(i in 1:length(varA)) {
s <- sqrt(round(varA1+varA[i]-2*get.cov[i],5))
table <- cbind(table, c(s))
}
as.vector(table)
}
AUC1EQ2<-
function(index, dataf,index1,index2)
{
A <- get.A(dataf, index1, index2)
A1 <- A[index]
get.s<-s.2genes(index,dataf,index1,index2)
s0 <- quantile(get.s, .9)
table <- c()
for(i in 1:length(A)) {
stat <- (A1 - A[i])/(get.s[i]+s0)
table <- cbind(table, c(stat))
}
as.vector(table)
}