Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially...



Gene Set Analysis

09/24/07

From individual gene to gene sets

• Finding a list of differentially expressed genes is only the starting point. Suppose we have identified 500 genes that are differentially expressed, then what do we do about it? Can we learn something about the underlying biological pathway?

• Sometimes one cannot find a single gene that is differentially expressed, as the statistical criteria are too stringent and/or the data is too noisy. Can we still learn something useful from the microarray experiment?

(Mootha 2003)

Gene set

• A gene set contains genes that are functionally related. The gene set assignment is independent of the microarray data at hand. We want to know whether a gene set is differentially expressed.

• Functional annotation is usually obtained from the following sources.

– Kyoto Encyclopedia of Genes and Genomes (KEGG)

– Gene Ontology (GO)

KEGG

• KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for:

– 1. Metabolism

– 2. Genetic Information Processing

– 3. Environmental Information Processing

– 4. Cellular Processes

– 5. Human Diseases

and also on the structure relationships (KEGG drug structure maps) in:

– 6. Drug Development

• Website: http://www.genome.jp/kegg/

GO terms

• Ontologies are 'specifications of a relational vocabulary'.

• GO contains three structured vocabularies: cellular component, biological process and molecular function.

• GO is not a database of gene sequences, nor a catalog of gene products. Rather, GO describes how gene products behave in a cellular context.

• Website: http://www.geneontology.org/

(Khatri and Draghici 2005)

Li

Over-representation analysis

• Null hypothesis: The genes in S are at most as often differentially expressed as the genes in Sc.

          Differentially expressed   Not differentially expressed   Total
in S      O1 = a                     O2 = b                         a + b
in Sc     O3 = c                     O4 = d                         c + d
Total     a + c                      b + d                          n

Compare a/(a + b) with (a + c)/n.

Statistical significance

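• Chi-square test: $\chi^2 = \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i}$, with $\chi^2$ distribution on 1 d.o.f., where $E_i$ is the expected count for cell i under the null hypothesis.

• Fisher’s exact test: hypergeometric distribution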
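Both tests can be sketched in a few lines for the 2×2 table with counts a, b, c, d. This is a minimal illustration using only the Python standard library, not the code of any particular tool; the function names and the example counts are made up. The chi-square p-value uses the identity that the upper tail of a chi-square variable with 1 d.o.f. equals erfc(sqrt(x/2)), and the Fisher p-value is one-sided (over-representation in S only).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 d.o.f.) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected counts under independence: row total * column total / n.
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 d.o.f. the upper tail probability is erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

def fisher_one_sided(a, b, c, d):
    """P(at least `a` differentially expressed genes fall in S) under the hypergeometric null."""
    n = a + b + c + d
    in_set = a + b      # number of genes in S
    de_total = a + c    # number of differentially expressed genes
    upper = min(in_set, de_total)
    tail = sum(
        math.comb(de_total, k) * math.comb(n - de_total, in_set - k)
        for k in range(a, upper + 1)
    )
    return tail / math.comb(n, in_set)

# Hypothetical counts: 12 of 40 genes in S are differentially expressed,
# versus 100 of 1000 genes overall.
stat, p_chi2 = chi_square_2x2(12, 28, 88, 872)
p_fisher = fisher_one_sided(12, 28, 88, 872)
```

With small cell counts the chi-square approximation is unreliable, which is the usual reason to prefer the exact test.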

Testing multiple GO nodes simultaneously

Determine significance level for each node

Then adjust for multiple hypothesis testing: FWER, FDR, etc.
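As an illustration of the two kinds of adjustment, the sketch below applies a Bonferroni correction (which controls the FWER) and the Benjamini-Hochberg step-up procedure (which controls the FDR) to a list of per-node p-values. These are two common choices, picked here as assumptions; the slides do not say which procedures were used, and the function names are illustrative.

```python
def bonferroni(p_values):
    """FWER control: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """FDR control: Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four GO nodes.
node_p = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = bonferroni(node_p)
fdr_adjusted = benjamini_hochberg(node_p)
```

Neither procedure accounts for the nested structure of GO, where a parent node and its children share genes; that dependence is ignored in this sketch.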

(GOSurfer)


Problems with using differentially expressed genes

• The result is sensitive to the criteria for calling genes differentially expressed, and is useless if the criteria are too stringent.

• Reducing a continuous variable to a binary variable loses useful quantitative information.

ErmineJ

• Called functional class scoring (FCS) in Pavlidis et al. 2004.

• The mean of –log(p-value) over all genes in a gene set is used as an aggregate score.

• Use a permutation test (permuting genes) to obtain the p-value corresponding to the aggregate score.

• Correction for multiple occurrence of a single gene.

• Adjust for multiple-hypothesis testing by controlling FDR.
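The scoring and gene-permutation steps can be sketched as follows. This is an illustration of the idea only, not ErmineJ's implementation: it assumes natural logarithms, draws random gene sets of the same size without replacement, and skips both the correction for genes that occur multiple times and the FDR step. The function names are hypothetical.

```python
import math
import random

def aggregate_score(p_values, gene_set):
    """Mean of -log(p) over the genes in the set (natural log assumed)."""
    return sum(-math.log(p_values[g]) for g in gene_set) / len(gene_set)

def gene_permutation_p_value(p_values, gene_set, n_perm=2000, seed=0):
    """Compare the observed score with scores of random gene sets of equal size."""
    rng = random.Random(seed)
    genes = list(p_values)
    observed = aggregate_score(p_values, gene_set)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(genes, len(gene_set))
        if aggregate_score(p_values, random_set) >= observed:
            hits += 1
    # Add-one estimate keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Here `p_values` is a dict mapping gene name to its per-gene p-value, and `gene_set` is any collection of gene names present in that dict.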

ErmineJ

Permute genes

Randomize genes or arrays?

Li

Permute array labels

Interpretation of p-values

• In the gene-sampling setup (e.g., Chi-square test), inference is about a new sample of genes. Expressions of different genes are assumed to be independent.

• In the subject-sampling setup (e.g., permutation test), inference is about a new subject. The label of a subject (treatment or control) is assumed to be independent. Expressions of different genes may be correlated.

It is more biologically meaningful to use subject-sampling methods.

Gene Set Enrichment Analysis (GSEA)

• Consider all genes instead of differentially expressed genes.

• Permute class labels

• Steps:

– 1: Calculation of an enrichment score (ES).

– 2: Estimation of significance level of ES.

– 3: Adjustment for multiple hypothesis testing.

(Mootha 2003)

A B

Basic idea: Rank the genes according to their p-value for being differentially expressed. If there is no correlation between gene expression and membership in A or B, then the rank-distributions for the two sets should also be approximately equal.

Enrichment Score

• Rank the genes by their p-values corresponding to the significance level of differential expression: R1, …, RN.

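• Define $X_i = -\sqrt{G/(N-G)}$ if $R_i$ is not in S, and $X_i = \sqrt{(N-G)/G}$ if $R_i$ is in S, where G is the number of genes in S.

• Then $ES = \max_{1 \le j \le N} \sum_{i=1}^{j} X_i$, that is, the maximum deviation from the expected running sum.

Why $X_i = -\sqrt{G/(N-G)}$?

• Unbiased: $\sum_{i=1}^{N} X_i = -(N-G)\sqrt{G/(N-G)} + G\sqrt{(N-G)/G} = 0$

• Normalized: $\sum_{i=1}^{N} X_i^2 = (N-G)\,\frac{G}{N-G} + G\,\frac{N-G}{G} = N$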
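The enrichment score is a running sum over the ranked gene list, which the following sketch makes concrete. It assumes a step of +sqrt((N−G)/G) for each gene in S and −sqrt(G/(N−G)) for each gene not in S, with N genes in total and G of them in S; the function name and the toy gene names are illustrative.

```python
import math

def enrichment_score(ranked_genes, gene_set):
    """Maximum of the running sum of X_i along the ranked list."""
    n = len(ranked_genes)
    members = set(gene_set) & set(ranked_genes)
    g = len(members)
    if g == 0 or g == n:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit = math.sqrt((n - g) / g)     # step for a gene in S
    miss = -math.sqrt(g / (n - g))   # step for a gene not in S
    running, best = 0.0, -math.inf
    for gene in ranked_genes:
        running += hit if gene in members else miss
        best = max(best, running)
    return best

# Toy ranking of four genes, most significant first.
ranking = ["a", "b", "c", "d"]
es_top = enrichment_score(ranking, {"a", "b"})     # set concentrated at the top
es_bottom = enrichment_score(ranking, {"c", "d"})  # set concentrated at the bottom
```

Because the steps sum to zero over the whole list, a set clustered at the top of the ranking produces a large positive excursion, while a set at the bottom never rises above zero.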

Permutation test of the significance of ES

• Randomly assign labels to samples, reorder genes, and recompute ES(S).

• Estimate the p-value by comparing the observed ES(S) with the ES values computed from randomly shuffled data.
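A minimal sketch of this label-permutation test is given below. Several choices are assumptions made for illustration: genes are ranked by the absolute Welch t-statistic as a stand-in for ranking by p-value, labels are coded 0/1, each group needs at least two samples, and the p-value uses an add-one estimate. The enrichment score is the running-sum maximum with steps +sqrt((N−G)/G) for genes in S and −sqrt(G/(N−G)) otherwise. All function names are hypothetical.

```python
import math
import random

def enrichment_score(ranked_genes, gene_set):
    """Maximum of the running sum along the ranked list (0 < G < N assumed)."""
    n, members = len(ranked_genes), set(gene_set)
    g = sum(1 for gene in ranked_genes if gene in members)
    hit, miss = math.sqrt((n - g) / g), -math.sqrt(g / (n - g))
    running, best = 0.0, -math.inf
    for gene in ranked_genes:
        running += hit if gene in members else miss
        best = max(best, running)
    return best

def abs_t_statistic(values, labels):
    """Absolute Welch t-statistic between samples labelled 0 and 1."""
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
    v0 = sum((v - m0) ** 2 for v in g0) / (len(g0) - 1)
    v1 = sum((v - m1) ** 2 for v in g1) / (len(g1) - 1)
    # Small constant guards against a zero denominator.
    return abs(m1 - m0) / math.sqrt(v0 / len(g0) + v1 / len(g1) + 1e-12)

def rank_genes(expr, labels):
    """Order genes from most to least differentially expressed."""
    return sorted(expr, key=lambda gene: abs_t_statistic(expr[gene], labels), reverse=True)

def es_permutation_p_value(expr, labels, gene_set, n_perm=500, seed=0):
    """Shuffle sample labels, re-rank the genes, and recompute ES each time."""
    rng = random.Random(seed)
    observed = enrichment_score(rank_genes(expr, labels), gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrichment_score(rank_genes(expr, shuffled), gene_set) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Here `expr` maps each gene name to a list of expression values, one per sample, in the same order as `labels`. Shuffling labels rather than genes keeps the correlation between genes intact, which is the point of the subject-sampling setup.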

Multiple hypothesis testing

• Determine ES(S) for each gene set in the collection.

• For each S and each of 1000 fixed permutations π of the array labels, reorder the genes and determine ES(S, π).

• Adjust for variation in gene set size.

• Compute FDR.

Applications of GSEA

Data

– 22,000 genes

– 43 subjects: 17 normal (NGT), 8 partially impaired, 18 diagnosed with disease (DM2)

– Gene sets independently curated from literature

No single gene is differentially expressed according to the stringent multiple hypothesis testing criteria.

Results from GSEA

• Select the gene set with maximum ES: (OXPHOS)

• Genes are consistently down-regulated, although the fold changes are moderate.

• Selected gene sets are biologically sensible, consistent with expectation.

Starting point for further analysis

• Apply clustering analysis to the selected gene set.

• Many genes in the gene set are co-regulated, suggesting they share similar functions.

A self-contained null hypothesis

• Null hypothesis:

– Competitive version: The genes in G are at most as often differentially expressed as the genes in Gc.

– Self-contained version: No genes in G are differentially expressed.

• “Self-contained” is more strict than “competitive”.

Drawback for comparing S against SC

• This has been compared to a “zero-sum game”: gene classes compete with each other. The stronger the evidence for differential expression in one class, the weaker the evidence for differential expression is judged to be in a second class.

Not significant?

Drawback for comparing S against SC

Hcomp vs Hself

• Advantage:

– Self-consistent

– When a large number of genes are differentially expressed, multiple pathways may be selected.

• Drawback:

– Too aggressive. A gene class containing very few differentially expressed genes may not be biologically meaningful.

Hybrid methods

• Several aspects of different methods can be mixed, e.g.

– Modify GSEA by using the self-contained version to evaluate the p-value.

– Similar treatment to ErmineJ.

(J.J.Goeman and P.Buhlmann 2006)

Multivariate analysis

• Let X1 and X2 be the expression levels for the subject groups 1 and 2.

• Given a gene set containing q genes.

• The self-contained null hypothesis can be rephrased as: the multi-dimensional mean expression vectors (within the given gene set) of the two groups are the same.

• Use multivariate hypothesis testing.

Hotelling’s T2

Under the null hypothesis, T2 follows a scaled F-distribution:

$T^2 \sim \frac{(n-2)\,q}{n-q-1}\, F_{q,\,n-q-1}$

Multiple hypothesis testing is addressed by FDR control.
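For reference, a minimal pure-Python sketch of the two-sample Hotelling statistic follows. The slides do not spell out the definition of T2, so the standard form is assumed here: T2 = (n1·n2/n)·d'S⁻¹d, where d is the difference of the group mean vectors, S is the pooled covariance matrix, and n = n1 + n2 is the total number of subjects. Rows are subjects and columns are the q genes of the set; all function names are illustrative.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(len(rows[0]))]

def pooled_covariance(x1, x2):
    """Pooled within-group covariance matrix with n1 + n2 - 2 degrees of freedom."""
    q = len(x1[0])
    s = [[0.0] * q for _ in range(q)]
    for rows in (x1, x2):
        m = mean_vector(rows)
        for r in rows:
            for j in range(q):
                for k in range(q):
                    s[j][k] += (r[j] - m[j]) * (r[k] - m[k])
    dof = len(x1) + len(x2) - 2
    return [[v / dof for v in row] for row in s]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= f * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][k] * x[k] for k in range(r + 1, n))) / aug[r][r]
    return x

def hotelling_t2(x1, x2):
    """Return (T2, F statistic) with F = (n - q - 1) / ((n - 2) q) * T2."""
    n1, n2, q = len(x1), len(x2), len(x1[0])
    n = n1 + n2
    d = [u - v for u, v in zip(mean_vector(x1), mean_vector(x2))]
    w = solve(pooled_covariance(x1, x2), d)   # w = S^{-1} d
    t2 = n1 * n2 / n * sum(di * wi for di, wi in zip(d, w))
    return t2, (n - q - 1) / ((n - 2) * q) * t2
```

The F statistic would then be referred to an F-distribution with (q, n − q − 1) degrees of freedom. The solver raises an error when the pooled covariance is singular, which happens whenever q exceeds n − 2, a common situation for large gene sets.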

Dimension reduction

Diagonalize the variance matrix S and then project to principal components:

$X' = D^{-1/2} U' X$, where $S = U D U'$.

Dimensions corresponding to very small eigenvalues are ignored.

Results

• Figure 1 in Sek Kwon’s paper.