Genomics on Obesity June 7 th -8 th 2007, Toulouse Statistics and bioinformatics applied to –omics...

29
Genomics on Obesity June 7 th -8 th 2007, Toulouse Statistics and bioinformatics applied to –omics technologies: Part I Pascal GP MARTIN Laboratoire de Pharmacologie et Toxicologie INRA, Toulouse, France [email protected] r PM, june 07

Transcript of Genomics on Obesity June 7 th -8 th 2007, Toulouse Statistics and bioinformatics applied to –omics...

  • Slide 1

Genomics on Obesity June 7 th -8 th 2007, Toulouse Statistics and bioinformatics applied to omics technologies: Part I Pascal GP MARTIN Laboratoire de Pharmacologie et Toxicologie INRA, Toulouse, France [email protected] PM, june 07 Slide 2 I. Data pre-processing and normalization4-7 II. Different answers for different questions8-11 III. Class comparison12-17 IV. Class discovery18-23 V. Conclusion24-25 VI. References26-29 Contents PM, june 07 Slide Abstract3 Slide 3 Abstract High throughput technologies are becoming increasingly used in biology. They are applied, for example, to the genomic, transcriptomic, proteomic and metabolomic fields. They generate large datasets from which relevant biological knowledge needs to be extracted from relatively noisy data. The characteristics of these datasets such as the measure of thousands of variables on a limited set of samples, and the presence of missing values and their noisy nature, require the use of statistical methods which are able to handle such features. Data filtering, transformation and normalization are among the first non-trivial steps in the process of analyzing these datasets and require both statistical skills and biological considerations in order to yield informative data. Equally the purpose of the experiment and its statistical analysis must be clear, before the experiment is done: it generally falls into one of three categories : 1) class comparison which aims at identifying variables exhibiting a class effect (genes that are differentially expressed among the groups of samples in the case of transcriptomics), 2) class discovery which aims at identifying subgroups of samples or variables sharing similar profiles and 3) class prediction which aims at building a classifier able to correctly predict the class of new, unknown samples or variables. The most common statistical tools used to answer the first two categories of questions are presented along with the major drawbacks in their practical use Slide 4 I. The data p x n matrix with p=genes and n=samples (generally p>>n) Relatively noisy data (background, marks, image analysis) Most often requires data transformation and normalization Generally contains a lot of non-informative data (not all genes are expressed in a given sample) PM, june 07 Replicates, filters, methods handling missing values methods handling p>>n Filters correcting for asymmetric distributions, biases Slide 5 Sources of variations SystematicRandom Normalization (generates data that are comparable) Replicates BiologicalTechnical I. Sources of bias and noise PM, june 07 Slide 6 I. Normalization PM, june 07 A set of genes is used as a reference and is considered not to change on average across samples and conditions. All genes (global mean/median, genomic DNA) + small error of the mean / - not always applicable Lowess (normalization by intensity ranges) + small error of the mean / - may underestimate ratios if many genes are regulated Internal controls (housekeeping genes) + same systematic bias / - hard choice, small number External controls (spiked-in RNAs) + small variations / - highly sensitive to errors in RNA sample assay Stable genes (ranks,) + small error of the mean / - not always applicable !! WARNING !! : the error of the mean for the reference set adds up to the error of each gene Slide 7 I. Data transformation log log transformation Multiplicative effects become additive effects PM, june 07 z-scores {X 1, X 2, , X n } are the data. X is the mean and S x the standard deviation Scaled data : S z = 1 Centered data : Z = 0 Z i has no physical units ranks (robustness but loss of quantitative information) Slide 8 II. Different answers for different questions PM, june 07 Class comparison Question : Which genes are differentially expressed between the two (or more) experimental conditions Examples : treated vs untreated, transgenic vs wild-type, obese vs lean, Tools : threshold definition, statistical tests, analysis of variance, other modeling tools (regression, mixed effects models, mixture models) Slide 9 II. Different answers for different questions PM, june 07 Class discovery Question : Can I identify homogeneous subgroups of genes (or sample) which are characterized by similar expression profiles Examples : search for co-regulated genes, for tumor subtypes, for outliers, data exploration Tools : pseudo-profiles, unsupervised clustering, factorial methods (Principal Component Analysis, Multidimensional Scaling, ) Slide 10 II. Different answers for different questions PM, june 07 Class prediction Question : Can I find a rule to classify my samples (or genes) with a minimal error rate and apply this rule to predict the class of a new sample (or gene) Examples : search for genes function, predicting tumor type, prognosis (metastasis, transplant rejection, survival rate), biomarker of exposure Tools : classification (supervised) : discriminant analysis, classification tree and random forests, support vector machines, generalized linear model Slide 11 II. Different answers for different questions PM, june 07 Other bioinformatic tools / interpreting the results Promoter analysis to understand the basis of gene co-regulation (Genomatix tools, TransFac, ) Pathway analysis to understand the consequences of gene regulations at the pathway/network level (KEGG, cMAP, Ingenuity, ) Integrating the results with biological knowledge (bibliographic: PubMed, Ingenuity, bibliosphere,, Gene-centered: Gene Ontology, Panther,) Using the results to model biological systems (collaborations with physics, chemistry, informatics, mathematics, ) Data reconciliation to understand the links between different levels of observations (transcriptome/proteome, transcriptome/metabolome,) Slide 12 III. Class comparison: the t-test PM, june 07 If M 1, M 2, , M n are n log(ratios) from our microarray experiment (independent and following N( , 2 ) where and 2 are unknown). Null Hypothesis H 0 : = 0 log(ratio) = 0 ratio = 1 Alternative hypothesis H 1 : 0 (two-sided test) If H 0 is true, then: whereis the estimation of the mean andis the estimation of the variance Two main problems : instability of the estimations (small sample size + outliers) and dealing with the multiplicity of the test statistics. Slide 13 III. Class comparison: the t-test PM, june 07 Instability of the estimations Stabilizing by using robust statistics (ranks) or estimators (median) Stabilizing S 2 by making groups of genes or by modifying the test statistic ( e.g. Efron B et al., 2000 or Lnnstedt I & Speed TP, 2001 ) Efron B et al. Microarrays and their use in a comparative experiment, Technical Report, Dept of Statistics, Stanford Univ., Stanford, CA, 2000 Lnnstedt I & Speed TP. Replicated microarray data, Statistica Sinica, 12(1):31-46, 2001. The multiplicity problem Analogy (a bit sordid) : Consider a russian roulette game where 5 bullets are randomly placed in a gun which can contain 100 bullets. A player has a 5% chance to loose. If 100 players play at the same time, then 5 players will loose on average. For a given gene and a given type I error rate (=5%), we know that this gene has a 5% probability to be a false positive. Thus, when doing one test at =5% for each gene, we know that the number of false positives will be 5% times the number of tests (5 FP for 100 tests, 50 FP for 1000 tests,, 2200 FP for 44000 tests) Control the false discovery rate (proportion of FP among the rejected H 0 ) instead of the probability to get one FP (Family-wise error rate) Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A practical approach to multiple testing. J. Roy. Stat. Soc. 1995, 57(1):289-300 Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Statistical Science. 2003, 18(1):71-103 Tusher VG, Tibshirani R, Chu G. SAM applied to the ionizing radiation response. PNAS. 2001, 98:5116-21. Slide 14 III. Class comparison: analysis of variance PM, june 07 M -+ M SS M 1, M 2, , M n are n log(ratios) from our microarray experiment. Their variance is estimated by : Slide 15 III. Class comparison: analysis of variance PM, june 07 Treatment A Treatment B Treatment C Expression of gene X ---------- ---------- ---------- Residual SS Treatment SS + + Total SS = = y yAyA yByB yCyC Expr of X~Treatment+ Example : 1 factor (Treatment) with 3 levels (A, B, C). Expression of gene X. Slide 16 III. Class comparison: analysis of variance PM, june 07 N = 15 observations, I=3 levels for the treatment factor Sources of variation Degrees of freedom Sum of Squares (SS) Mean Square (=Variance) F Statistic (Fisher)p-value TreatmentI-1 = 2SS treatment MS treat = SS treat / (I-1) F = MS treat / MS resid 0.001 RsidualsN-I = 12SS resid MS resid = SS resid / (N-I) Total variation N-1 = 14SS total Here, the treatment factor has a significant effect on the expression of gene X which means that expression of gene X is different between at least one of the 3 treatments compared to the other ones (we do not not which treatment is different from the other at this stage). Slide 17 III. Class comparison: analysis of variance PM, june 07 When can I use the analysis of variance ? Homoscedasticity (Homogeneity of the variances) Can be assessed graphically 0 Fitted value (rsiduals) 0 1 2 1 2 Not OK. Try to use a transformation such as log, cube root, OK, good for anova Independance (the effect of a factor on an individual does not influence the other individuals) Generally postulated Normality of the residuals Can be tested (Shapiro-Wilk, KS) but generally assessed graphically through QQplot of residuals residuals Quantiles of N(0,1) Slide 18 IV. Class discovery PM, june 07 Gene 1 Gene 2 ns SEVERAL AIMS (object = sample or variable) : reducing the dimension, grouping the objects together, understanding why the objects are grouped together, examining the objects and their distances, taking into account the links between the objects Slide 19 1pj n X (n,p) = x ij 1 i Expression level of gene j for sample i i Row vector 1pj i pp n points in p jj Column vector nn 1 n i p points in n Configuration of a cloud of points in a multidimensional space Factorial methods (PCA, MDS) Clustering methods (hierarchical, k-means, SOM) L. Lebart, A. Morineau, M. Piron IV. Class discovery PM, june 07 Slide 20 IV. Ascendant hierarchical clustering PM, june 07 Procedure : A distance measure between objects (single or groups) is defined Beginning: each object is a class Algorithm: at each step; the 2 closest objects are grouped together End : one class groups all the objects RESULT : a tree (dendrogram) is built AIM : group together (or cluster) the objects in homogeneous groups Initially, 2 choices must be made : the distance metric and the agglomerative criterion Then, the number of clusters must be chosen Slide 21 PPAR -/- Wild-type PPAR target genes (HPNCL, CPT2, MCAD) PPAR target genes (AOX, Cyp4a, mHMGCoAs) Lipogenesis, cholesterogenesis, fatty acid transport Downregulated in PPAR -/- (catalase, L- FABP) IV. Hierarchical clustering and Heatmaps PGP Martin et al, unpublished data Slide 22 IV. Principal Component Analysis PM, june 07 1 st principal component 2 d principal component Centre of gravity of the fish PCA allows us to define spaces of lower dimension (compared to the initial space) on which the projection of the initial cloud of point presents a minimum of distortion. It means that these new spaces capture a maximum of information (or variability). Slide 23 PPAR -/- Wild-type PC1 PC2 PC1 (32%) PC2 (26%) Mice Fatty acids Principal Component Analysis on hepatic fatty acid composition (21 fatty acids x 60 samples): 1st Plane (PC1+PC2): 58% of the experimental variance is captured 3rd PC: 16% = genotype effects (not shown) Dietary FA modulate hepatic FA composition IV. Principal Component Analysis PM, june 07 Slide 24 Conclusion PM, june 07 The Statistician and the Biologist (reproduced from a presentation of R. Tibshirani) They are both being executed, and are each granted one last request The Statistician asks that he be allowed to give one last lecture on his Grand Theory of Statistics. The Biologist asks that he be executed first. and the last word is for R.A. Fisher To call in a statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of Slide 25 Laboratoire de Statistique et Probabilits PM, june 07 Thank you for your attention !! Slide 26 Some references: bioinformatics/databases Gene Ontology (GO) : http://www.geneontology.org/ Outils de recherche et danalyse GO : http://fatigo.bioinfo.cnio.es/ http://www.godatabase.org/cgi-bin/amigo/go.cgi http://david.niaid.nih.gov/david/ http://vortex.cs.wayne.edu:8080/index.jsp Rseaux : http://www.genmapp.org/introduction.asp http://www.genome.jp/kegg/ Analyses de promoteurs : http://www.genomatix.de/ http://www.cbrc.jp/research/db/TFSEARCH.html http://www.gene-regulation.com/ PM, june 07 Slide 27 Some references Books on microarray data analysis : Statistical Analysis of Gene Expression Microarray Data. Terry Speed (Chapman & HALL) Design and Analysis of DNA Microarray Investigations. Richard M. Simon et al. (Springer) The Analysis of Gene Expression Data. Giovanni Parmigiani et al. (Springer) Nature Genetics 2002 : The Chipping Forecast II. Vol 32 supplement 2. pp 461-552 General Statistics books (in french) : Statistique Thorique et Applique Tomes 1 et 2. Pierre Dagnelie (De Boeck Univ.) Analyse statistique plusieurs variables. Pierre Dagnelie (Tec&Doc) Initiation aux traitements statistiques. B. Escofier et J. Pags (Presses Univ. de Rennes) Statistique exploratoire multidimensionnelle. L. Lebart, A. Morineau, M. Piron (Dunod) Analyses factorielles simples et multiples. B. Escofier et J. Pags (Dunod) PM, june 07 Slide 28 Articles on data normalization : Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Yang YH et al. Nucleic Acids Res. 2002. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bolstad BM et al. Bioinformatics. 2003. Empirical evaluation of data transformations and ranking statistics for microarray analysis. Qin LX and Kerr KF. Nucleic Acids Res. 2004. Some references Articles on experimental design : Statistical design and the analysis of gene expression microarray data. Kerr MK and Churchill GA. Genet Res. 2001. Experimental design for gene expression microarrays. Kerr MK and Churchill GA. Biostatistics. 2001. Questions and answers on design of dual-label microarrays for identifying differentially expressed genes. Dobbin K, Shih JH and Simon R. J Natl Cancer Inst. 2003. Experimental design and low-level analysis of microarray data. Bolstad BM et al. Int Rev Neurobiol. 2004. PM, june 07 Slide 29 Articles on linear models for microarray data analysis : Assessing gene significance from cDNA microarray expression data via mixed models. Wolfinger RD et al. J Comput Biol. 2001. Linear models for microarray data analysis: hidden similarities and differences. Kerr MK. J Comput Biol. 2003. Analysis of oligonucleotide array experiments with repeated measures using mixed models. Li H et al. BMC Bioinformatics. 2004. Some references Articles on multidimentional exploratory analysis : Cluster analysis and display of genome-wide expression patterns. Eisen MB et al. Proc Natl Acad Sci USA. 1998. Large-scale clustering of cDNA-fingerprinting data. Herwig R et al. Genome Res. 1999. Using biplots to interpret gene expression patterns in plants. Chapman S et al. Bioinformatics. 2002. NOTE : These references are only provided to illustrate the intensity and diversity of the research efforts in the field of microarray data analysis and many more publications could have been worth mentioned. PM, june 07