
Page 1: Frédéric Schütz Frederic.Schutz@isb-sib.ch

Frédéric Schütz

Frederic.Schutz@isb-sib.ch

Statistics and bioinformatics applied to –omics technologies

Part II: Integrating biological knowledge

Center for Integrative Genomics, University of Lausanne, Switzerland

Bioinformatics Core Facility, Swiss Institute of Bioinformatics

Page 2

Contents

• Class prediction (slides 1-19)

• Gene Ontology analysis (slides 20-29)

• Gene set analysis (GSEA, etc.) (slides 30-39)

Page 3

Class discovery and class prediction

• Example: patients from whom we obtained measurements (e.g. gene expression)

Class discovery

Find natural groups in the data (e.g. sets of patients with similar gene expression)

[Figure: scatter plot of Gene 1 against Gene 2]

Class prediction

Given previous measurements for which the grouping is known (red and blue), can we predict the group to which a new observation belongs?

[Figure: scatter plot of Gene 1 against Gene 2, with a new observation marked “?”]

Page 4

Why do we want to do class prediction?

• Many questions in biology and medicine are “class prediction” questions:
– Does a patient have a predisposition for a given disease?
– What is the prognosis for this patient?
– What will be the response of this patient to a given drug?
– Is this tumour benign or malignant?
– What type is this tumour?
– Which treatment should be used?

Page 5

Class prediction: easy case

[Figure: scatter plot of Gene 1 against Gene 2 with a threshold separating the two groups]

• Classify everything on one side of the threshold as “red”

• Classify everything on the other side as “blue”

Page 6

Example

Pierre Farmer et al. Identification of molecular apocrine breast tumours by microarray analysis. Oncogene (2005) 24, 4660–4671

Blue points represent “oestrogen receptor (ER) status positive”, as determined by immunohistochemistry.

Page 7

Class prediction: in practice

[Figure: scatter plot of Gene 1 against Gene 2]

• The two groups are not perfectly separated (and this is still a pretty good case…)

• One variable (gene) is not sufficient to assign patients to groups

• Remember that with microarrays, we are not talking about just 2 measurements, but tens of thousands.

Page 8

Discrimination in general

• Goal: assign objects (e.g. patients) to classes based on some measurements (e.g. gene expression)

• Typically, in a microarray setting:
– tens or (at best) hundreds of patients
– tens of thousands of genes

• Unsupervised learning: nothing is known about the grouping of the data, and we try to find natural groups in the data

• Supervised learning: the classes are predefined; we use previously labelled objects to create a procedure for classifying future observations.

Page 9

Some supervised analysis methods

• K-nearest neighbours

• Linear Discriminant Analysis

• Classification trees

• Support Vector Machines (SVM)

• etc.

Page 10

Example: 3-nearest neighbours

[Figure: scatter plot of Gene 1 against Gene 2 with a new, unlabelled point]

Red or blue?

Page 11

Example: 3-nearest neighbours

[Figure: scatter plot of Gene 1 against Gene 2 showing the three nearest neighbours of the new point]

2 red vs 1 blue: the point is assigned to “red”

Page 12

K-nearest neighbours

• Choose a value for k (typical values: 3 or 5); in practice, it can be chosen using the learning data (the value that produces the best result)

• Find the k observations in the learning set that are closest to the new, unknown observation

• Predict the class by a majority vote, that is, choose the class that is most common among the neighbours.

• Very simple method, with surprisingly good performance
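The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name `knn_predict`, the use of Euclidean distance and the expression values are assumptions made for the example, not something taken from the slides.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)


def knn_predict(train_points, train_labels, new_point, k=3):
    """Predict the class of new_point by majority vote among its k nearest neighbours."""
    # Sort the learning set by distance to the new observation
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pair: dist(pair[0], new_point))
    # Keep the labels of the k closest observations and take the most common one
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical (Gene 1, Gene 2) expression values for patients with known class
points = [(1.0, 1.2), (1.5, 0.8), (2.0, 1.5), (4.0, 4.2), (4.5, 3.8), (5.0, 4.5)]
labels = ["blue", "blue", "blue", "red", "red", "red"]

# The three nearest neighbours of (3.0, 3.0) are 2 red and 1 blue
prediction = knn_predict(points, labels, (3.0, 3.0), k=3)
print(prediction)
```

With these made-up values the new point has two red neighbours and one blue neighbour, so it is assigned to “red”, as in the 3-nearest-neighbours example.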

Page 13

Linear Discriminant Analysis

• Suggested by R.A. Fisher in 1936

• Procedure to find a linear combination of the observed variables that best separates (discriminates) two classes of objects.

• Using the “new variable”, objects from the same class are close together, and objects from different classes are further apart.

• Straightforward to calculate

• Can easily be extended to more than two classes

• Similar idea to Principal Component Analysis (PCA)

• Often forgotten in favour of PCA

Page 14

Back to the easy case

[Figure: scatter plot of Gene 1 against Gene 2 with a threshold on the discriminant]

• Discriminant = Gene 1

• High value of the discriminant: classify everything on this side of the threshold as “red”

• Low value of the discriminant: classify everything on this side of the threshold as “blue”

Page 15

Linear Discriminant Analysis: Example

[Figure: scatter plot of Gene 1 against Gene 2]

• The two groups are well separated

• Neither Gene 1 nor Gene 2 alone is able to discriminate between the two categories

Page 16

Linear Discriminant Analysis: Example

[Figure: scatter plot of Gene 1 against Gene 2, with a new axis running from low values to high values of L]

• However, the linear combination L = Gene1 + Gene2 discriminates well between the two groups

• Blue points tend to have smaller L values

• Red points tend to have bigger L values

Page 17

Linear Discriminant Analysis: Example

[Figure: scatter plot of Gene 1 against Gene 2, with the axis of L values (low to high) and the threshold]

• A threshold is set between the means of the two groups

• Points with a value of L above the threshold are classified as red

• Points with a value of L below the threshold are classified as blue
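As a rough sketch of how such a discriminant can be computed for two genes, the Python code below uses the textbook two-class form of Fisher's method: the weights are w = Sw⁻¹(m_red − m_blue), where Sw is the pooled within-class scatter matrix, and the threshold is placed halfway between the projected class means. The function names and the expression values are invented for the illustration; with this data the two weights come out positive and of similar size, so the discriminant behaves much like L = Gene1 + Gene2.

```python
def mean(vectors):
    """Component-wise mean of a list of (Gene 1, Gene 2) pairs."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(2)]


def scatter(vectors, centre):
    """2x2 scatter matrix: sum of outer products of deviations from the centre."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vectors:
        d = [v[0] - centre[0], v[1] - centre[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_lda(blue, red):
    """Return the weights w and the threshold of a two-class linear discriminant."""
    m_blue, m_red = mean(blue), mean(red)
    s_blue, s_red = scatter(blue, m_blue), scatter(red, m_red)
    # Pooled within-class scatter matrix Sw
    sw = [[s_blue[i][j] + s_red[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [m_red[0] - m_blue[0], m_red[1] - m_blue[1]]
    # w = Sw^-1 (m_red - m_blue), using the explicit inverse of a 2x2 matrix
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    # Threshold halfway between the projected means of the two groups
    threshold = (discriminant(m_blue, w) + discriminant(m_red, w)) / 2
    return w, threshold


def discriminant(v, w):
    """Value of the linear combination L = w1 * Gene1 + w2 * Gene2."""
    return w[0] * v[0] + w[1] * v[1]


def classify(v, w, threshold):
    return "red" if discriminant(v, w) > threshold else "blue"


# Hypothetical data: neither gene separates the groups on its own
blue = [(1.0, 4.2), (2.0, 2.9), (3.0, 2.1), (4.0, 0.8)]
red = [(3.0, 6.1), (4.0, 4.9), (5.0, 4.2), (6.0, 2.8)]

w, threshold = fit_lda(blue, red)
print(w, threshold)
print(classify((2.0, 2.0), w, threshold), classify((5.0, 5.0), w, threshold))
```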

Page 18

Caveats: Overfitting

• It is easy to create classifiers which fit the training data perfectly

• It is harder to find classifiers which still work as well when validated on new data

• A classifier must ALWAYS be tested on data independent of the data used to train it.

• This is particularly important in microarray analysis:
– Few samples
– Many different measurements

• If you are not careful, it is always possible to find a classifier that works well on your training data!

Page 19

Caveats: Overfitting

[Figure: scatter plot of Gene 1 against Gene 2, with a region in which everything is classified as red]

• Perfect classifier for this data

• Probably not so good with any new data
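The effect can be illustrated with a small simulation. The sketch below is an assumption-laden toy example, not an analysis from the slides: it generates pure noise (20 training samples, 500 “genes”, labels assigned at random) and uses a 1-nearest-neighbour classifier. The classifier is perfect on its own training data, yet should do no better than chance on independent data.

```python
import random
from math import dist


def nearest_label(train_x, train_y, x):
    """1-nearest-neighbour prediction: copy the label of the closest training sample."""
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], x))
    return train_y[best]


def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(nearest_label(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)


rng = random.Random(42)
n_genes = 500  # many measurements, few samples


def random_samples(n):
    """Pure noise: expression values and labels are unrelated."""
    xs = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n)]
    ys = [rng.choice(["red", "blue"]) for _ in range(n)]
    return xs, ys


train_x, train_y = random_samples(20)
test_x, test_y = random_samples(200)

# Evaluated on the training data itself: every sample is its own nearest neighbour
train_accuracy = accuracy(train_x, train_y, train_x, train_y)
# Evaluated on independent data: the labels carry no signal, so expect roughly 50%
test_accuracy = accuracy(train_x, train_y, test_x, test_y)
print(train_accuracy, test_accuracy)
```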

Page 20

Gene Ontology analysis

• Many microarray experiments produce lists of genes that are significantly differentially expressed between two conditions (gene comparison).

• In some (rare) cases, only a few genes are of interest, and they can easily be examined and validated.

• In most cases, however, a long list of differentially expressed genes is returned, and these genes cannot be considered individually.

• It is harder to gain biological understanding from such a list.

• One strategy: consider the functional annotation of the differentially expressed genes.

• Question: what do these genes have in common that could be of interest?

Page 21

Reminder: Gene Ontology (GO) project

• Collaborative effort to address the need for consistent descriptions of gene products in different databases.

• Three structured, controlled vocabularies (ontologies) that describe gene products, in a species-independent manner, in terms of their associated
– biological processes
– cellular components
– molecular functions

(From http://www.geneontology.org/)

Page 22

Example

PPARA, NR1C1, PPAR: Peroxisome proliferator-activated receptor alpha

(TAS: Traceable Author Statement, IPI: Inferred from Physical Interaction)

(From http://www.geneontology.org/)

Page 23

Example of GO analysis

[Figure: 10,000 genes in total; 1000 genes (10%) differentially expressed, 90% not]

• Simple microarray experiment: WT vs KO

• The microarray has 10,000 genes, 100 of which have the GO annotation “fatty acid transport”

• I obtain 1000 differentially expressed genes (10% of all genes)

• If my experiment has nothing to do with “fatty acid transport”, I expect on average about 10% of the “fatty acid transport” genes (i.e. 10 of the 100) to be differentially expressed.

• If this proportion is higher, the list of differentially expressed genes is enriched in “fatty acid transport” genes

• If the difference is significant, it suggests a link between differential expression and this GO annotation: genes with this annotation are more likely to be differentially expressed than others

• This indicates that this biological process may be related to my KO experiment.

Page 24

[Figure: 10,000 genes in total; 1000 genes (10%) differentially expressed, 90% not. Number of “fatty acid transport” genes in each part, ranging from 10 (10%) differentially expressed and 90 (90%) not, which looks like a random distribution (no apparent association), up to 100 (100%) differentially expressed and 0 (0%) not (strong association?)]

Page 25

Statistical analysis

• Assume that I found 20 differentially expressed genes with the GO annotation of interest.

• Count the number of genes with and without the GO annotation, and compare with differential expression:

                           Differentially expressed   Not D.E.   Total
“Fatty acid transport”                           20         80     100
Others                                          980       8920    9900
Total                                          1000       9000   10000

• A statistical test such as Fisher’s exact test gives the probability of observing this result (or a more extreme one) if there is no association between the rows and columns

• In this case, this probability (p-value) is 0.002

• This indicates that this biological process may be important in the difference between WT and KO.
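The calculation can be reproduced with the Python standard library alone. The sketch below computes the expected count and the one-sided version of Fisher's exact test, i.e. the upper tail of the hypergeometric distribution; the function name `hypergeom_upper_tail` is chosen for this example, and whether the slide's figure comes from a one-sided or a two-sided test is not stated, so the result should only be expected to be of the same order as the 0.002 quoted above.

```python
from math import comb


def hypergeom_upper_tail(total, annotated, selected, observed):
    """P(X >= observed), where X is the number of annotated genes among
    `selected` genes drawn at random from `total` genes, `annotated` of which
    carry the annotation (one-sided Fisher's exact test)."""
    denominator = comb(total, selected)
    upper = min(annotated, selected)
    numerator = sum(comb(annotated, k) * comb(total - annotated, selected - k)
                    for k in range(observed, upper + 1))
    # Exact integer arithmetic; converted to a float only at the end
    return numerator / denominator


# Numbers from the table: 10,000 genes, 100 annotated, 1000 differentially
# expressed, 20 of which carry the annotation
expected = 100 * 1000 / 10000  # expected count if there is no association
p_value = hypergeom_upper_tail(total=10000, annotated=100,
                               selected=1000, observed=20)
print(expected, p_value)
```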

Page 26

In practice

• One can either suggest a GO annotation and see if it is enriched in the list of differentially expressed genes

• Or one can go “fishing” and try all potentially interesting GO annotations to see if any of them is enriched.

• Easy to do

• Multiple services available on the web
– The user provides the list of differentially expressed genes
– The service returns the most significant GO annotations
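What such a service does can be sketched as a loop over annotations, testing each one and sorting by p-value. Everything in the Python example below is hypothetical (gene identifiers, annotation names, function names); it is not the interface of any actual web service, and the test used is the one-sided hypergeometric test.

```python
from math import comb


def enrichment_p_value(total, annotated, selected, observed):
    """One-sided hypergeometric p-value: P(at least `observed` annotated genes
    among `selected` genes drawn at random from `total`)."""
    numerator = sum(comb(annotated, k) * comb(total - annotated, selected - k)
                    for k in range(observed, min(annotated, selected) + 1))
    return numerator / comb(total, selected)


def rank_annotations(all_genes, de_genes, annotations):
    """Return (annotation, observed, expected, p-value) tuples, most significant first."""
    results = []
    for name, members in annotations.items():
        members = members & all_genes          # ignore genes not on the array
        observed = len(members & de_genes)
        expected = len(members) * len(de_genes) / len(all_genes)
        p = enrichment_p_value(len(all_genes), len(members),
                               len(de_genes), observed)
        results.append((name, observed, expected, p))
    return sorted(results, key=lambda r: r[3])


# Hypothetical array of 1000 genes, 10% of which are differentially expressed
all_genes = {f"gene{i}" for i in range(1000)}
de_genes = {f"gene{i}" for i in range(100)}
annotations = {
    # 40 genes, 20 of them differentially expressed (4 expected)
    "hypothetical term A": {f"gene{i}" for i in range(80, 120)},
    # 40 genes, only 1 of them differentially expressed
    "hypothetical term B": {f"gene{i}" for i in range(95, 495, 10)},
}

ranked = rank_annotations(all_genes, de_genes, annotations)
for name, observed, expected, p in ranked:
    print(name, observed, expected, p)
```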

Page 27

Gene Ontology analysis: example I

Pierre Farmer et al. Identification of molecular apocrine breast tumours by microarray analysis. Oncogene (2005) 24, 4660–4671

• Microarray with about 22,000 genes

• We look at the 1% of the genes that are most different between subtypes of cancer.

• Which processes are likely to be different between these subtypes?
– Those for which more than 1% of the genes are differentially expressed are good candidates

[Table: column “Prop.” with values 5%, 19%, 10%, 3%, 5%, 5%, 4%]

Page 28

Gene Ontology analysis: example II

• To apply this GO analysis, we first need to define a list of differentially expressed genes.

• This usually means calculating a “score” (e.g. p-value), and selecting a cut-off point.

• While there are some traditional cut-off points (0.001, 0.01 or the “magical” 0.05), they remain fairly arbitrary
– Is there really a difference between a gene associated with a p-value of 0.049 and another one with a p-value of 0.051?

Page 29

Gene Ontology analysis: example III

• Some genes may be differentially expressed, but the change may be so small (lost in the noise) that they will not appear in the list.

• However, the difference in expression may appear at the level of a set of genes rather than individual genes

• Sets of genes may correspond e.g. to co-regulated genes, or genes belonging to the same pathway

• If the change of expression is consistent across genes in the set, it may indicate that the set is of interest, even if no individual gene shows a significant difference.

Page 30

Gene set enrichment analysis (GSEA)

Page 31

Gene set enrichment analysis (GSEA)

• Series of papers describing a method for analyzing the expression of sets of genes

• Software available, along with a database of biologically relevant gene sets

• Relatively hot topic in bioinformatics/statistics: many different papers and methods published on the topic, with small or large differences

• GSEA usually refers to this particular program, but is sometimes used for any such method which examines sets of genes.

Page 32

Principle of GSEA

• We have a list of genes sorted according to a given measure (score for differential expression, correlation to a phenotype, etc.)

• Among this list, we have a smaller set of genes of interest (e.g. all belonging to a given pathway)

• Is the smaller set distributed randomly in the sorted list of genes?
– If yes, the set is less likely to be of interest
– If no, it may indicate that the function represented by the set is linked with the measure.

Page 33

Principle of GSEA (most methods)

[Figure: all genes, sorted from high values (e.g. up-regulated) to low values (e.g. down-regulated), with the position in the list of the genes of our set of interest]

The location of the genes of our set of interest within the list seems random (uniform); the set does not appear to be linked with differential expression.

Page 34

Principle of GSEA (most methods)

[Figure: all genes, sorted from high values (e.g. up-regulated) to low values (e.g. down-regulated), with the position in the list of the genes of our set of interest. Two cases: genes of the set concentrated among the high values (link with up-regulation), and genes of the set concentrated among the low values (link with down-regulation)]

Page 35

Statistical analysis

• “Random walk”:
– The list of genes is walked down from left to right
– Every time a gene belongs to our set S, the score goes up
– Every time a gene does not belong to the set, the score goes down

• If the genes of the set are uniformly distributed, the score will never go very high (an “up” is soon followed by a “down”)

• If the genes of the set are clustered together, the score will go higher before getting back to 0.

• Using a permutation test, a p-value can be associated with the gene set.

From fig. 1 of Subramanian et al. PNAS 2005; 102; 15545-15550

Page 36

Statistical analysis

• How can we summarise and assess an apparent link between a set and differential expression?

• Each method uses different statistics

• Original GSEA method based on the Kolmogorov-Smirnov test (compare the distribution of genes with a uniform distribution)

• Later replaced by an “Enrichment Score” (similar but weighted)

Page 37

Example

• mRNA expression profiles from lymphoblastoid cell lines derived from 15 males and 17 females

• Identify gene sets correlated with the difference between males and females

From table 2 of Subramanian et al. PNAS 2005; 102; 15545-15550 (FDR: False Discovery Rate)

Page 38

Example

• Gene expression patterns from a collection of 50 cancer cell lines

• p53 regulates gene expression in response to various signals of cellular stress

• 33 cell lines carry a mutation in the p53 gene, and 17 are normal.

From table 2 of Subramanian et al. PNAS 2005; 102; 15545-15550

Page 39

Conclusions

• Gene set enrichment analysis methods have quickly become widespread in the microarray community.

• Intuitive method

• Can be used to confirm a known or suspected association… (use a given gene set)

• … or to go “fishing” for unknown associations (use a database of gene sets)

• More generally, microarray analysis relies more and more on this kind of external biological knowledge.