EiB Seminar from Antoni Miñarro, Ph.D

Seminar from Antoni Miñarro, Ph.D., Statistics and Bioinformatics research group, Department of Statistics, University of Barcelona

RNA-seq

RNA-seq, also called "Whole Transcriptome Shotgun Sequencing" ("WTSS"), refers to the use of high-throughput sequencing technologies to sequence cDNA in order to get information about a sample's RNA content.

Analysis of RNA-seq data

• Single nucleotide variation discovery: currently being applied to cancer research and microbiology.

• Fusion gene detection: fusion genes have gained attention because of their relationship with cancer. The idea follows from the process of aligning the short transcriptomic reads to a reference genome. Most of the short reads will fall within one complete exon, and a smaller but still large set would be expected to map to known exon-exon junctions. The remaining unmapped short reads would then be further analyzed to determine whether they match an exon-exon junction where the exons come from different genes. This would be evidence of a possible fusion.

• Gene expression

Paired-end

Fusion gene detection

Definitions

Gene expression

Detect differences in gene-level expression between samples. This sort of analysis is particularly relevant for controlled experiments comparing expression in wild-type and mutant strains of the same tissue, comparing treated versus untreated cells, cancer versus normal, and so on.

Differential expression (2)

• RNA-seq gives a discrete measurement for each gene.

• Count data, even after transformation, are not well approximated by continuous distributions, especially in the lower count range and for small samples. Therefore, statistical models appropriate for count data are vital to extracting the most information from RNA-seq data.

• In general, the Poisson distribution forms the basis for modeling RNA-seq count data.
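The point about the lower count range can be illustrated numerically. The following is a minimal sketch in base R, not part of the original slides: it compares an exact Poisson upper-tail probability with a normal approximation of the same mean and variance, once for a low-count gene and once for a high-count gene. The function name `tail_rel_error` and the chosen means (2 and 200) are illustrative assumptions.

```r
# Relative error of the normal approximation to a Poisson upper tail.
# Hypothetical helper; the threshold is placed about z standard deviations
# above the mean (the Poisson variance equals its mean).
tail_rel_error <- function(lambda, z = 3) {
  k <- ceiling(lambda + z * sqrt(lambda))
  exact  <- ppois(k - 1, lambda, lower.tail = FALSE)   # exact P(X >= k)
  approx <- pnorm(k, mean = lambda, sd = sqrt(lambda), lower.tail = FALSE)
  abs(approx - exact) / exact
}

err_low  <- tail_rel_error(2)     # gene with a mean of 2 reads
err_high <- tail_rel_error(200)   # gene with a mean of 200 reads
print(c(low = err_low, high = err_high))
```

The relative error is much larger for the low-count gene, which is why tests built on continuous approximations can misjudge significance for weakly expressed genes.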

RNA-seq Pipeline

Mapping

• The first step in this procedure is the read mapping or alignment: to find the unique location where a short read is identical to the reference.

• However, in reality the reference is never a perfect representation of the actual biological source of RNA being sequenced: there are SNPs and indels, and the reads arise from a spliced transcriptome rather than a genome.

• Short reads can sometimes align perfectly to multiple locations and can contain sequencing errors that have to be accounted for.

• The real task is to find the location where each short read best matches the reference, while allowing for errors and structural variation.
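As a toy illustration of "best match while allowing for errors", the sketch below slides a read along a reference and counts mismatches at every offset. It is a brute-force base R example with invented sequences, not how production aligners such as Bowtie work (they use indexed search); `best_match` and `max_mismatch` are hypothetical names.

```r
# Brute-force best-match search: Hamming distance of the read against every
# window of the reference. Returns all offsets that tie for the minimum.
best_match <- function(read, reference, max_mismatch = 2) {
  r <- strsplit(read, "")[[1]]
  g <- strsplit(reference, "")[[1]]
  n_pos <- length(g) - length(r) + 1
  mism <- vapply(seq_len(n_pos), function(i) {
    sum(g[i:(i + length(r) - 1)] != r)       # mismatches at offset i
  }, integer(1))
  best <- min(mism)
  if (best > max_mismatch) {
    return(list(positions = integer(0), mismatches = NA))  # read left unmapped
  }
  list(positions = which(mism == best), mismatches = best)
}

reference <- "ACGTTGCAACGTAGCTAGGCTA"   # made-up reference
hit <- best_match("ACGTAGCA", reference)
print(hit)
```

In this example the read has no exact match and fits two locations equally well with one mismatch each, so it is both an error-containing read and a multimap.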

Aligners

• Aligners differ in how they handle ‘multimaps’ (reads that map equally well to several locations). Most aligners either discard multimaps, allocate them randomly or allocate them on the basis of an estimate of local coverage.

• Paired-end reads reduce the problem of multi-mapping, as both ends of the cDNA fragment from which the short reads were generated should map nearby on the transcriptome, allowing the ambiguity of multimaps to be resolved in most circumstances.

Reference genome

• The most commonly used approach is to use the genome itself as the reference. This has the benefit of being easy and not biased towards any known annotation. However, reads that span exon boundaries will not map to this reference. Thus, using the genome as a reference will give greater coverage (at the same true expression level) to transcripts with fewer exons, as they will contain fewer exon junctions.

• In order to account for junction reads, it is common practice to build exon junction libraries in which reference sequences are constructed using boundaries between annotated exons, a proxy genome generated with known exonic sequences.

• Another option is the de novo assembly of the transcriptome, for use as a reference, using genome assembly tools.

• A commonly used approach for transcriptome mapping is to progressively increase the complexity of the mapping strategy to handle the unaligned reads.
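The exon junction library idea can be sketched as follows. This is a simplified base R illustration with invented exon sequences: for each pair of consecutive exons it joins the last `read_len - 1` bases of the upstream exon to the first `read_len - 1` bases of the downstream exon, so that any read spanning the junction fits inside the junction sequence. The function name `junction_library` is hypothetical, and only adjacent exons are paired here; a real library might also include other annotated or candidate exon pairs.

```r
# Build junction reference sequences from the ordered exons of one transcript.
junction_library <- function(exons, read_len) {
  flank <- read_len - 1                      # bases kept on each side
  n <- length(exons)
  vapply(seq_len(n - 1), function(i) {
    up   <- exons[i]
    down <- exons[i + 1]
    left  <- substr(up, max(1, nchar(up) - flank + 1), nchar(up))
    right <- substr(down, 1, min(flank, nchar(down)))
    paste0(left, right)
  }, character(1))
}

exons <- c(e1 = "ATGGCCAAGT", e2 = "GGTACCTTGA", e3 = "CCGATTAA")  # made up
junctions <- junction_library(exons, read_len = 5)
print(junctions)
```

Reads that fail to align to the genome can then be aligned against these short junction sequences, which is one way of progressively increasing the complexity of the mapping strategy.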

Normalization

Normalization (2)

When testing individual genes for differential expression (DE) between samples, technical biases, such as gene length and nucleotide composition, will mainly cancel out because the underlying sequence used for summarization is the same between samples. However, between-sample normalization is still essential for comparing counts from different libraries relative to each other. The simplest and most commonly used normalization adjusts by the total number of reads in the library [34,51], accounting for the fact that more reads will be assigned to each gene if a sample is sequenced to a greater depth.

Within-library normalization allows quantification of expression levels of each gene relative to other genes in the sample. Because longer transcripts have higher read counts (at the same expression level), a common method for within-library normalization is to divide the summarized counts by the length of the gene [32,34]. The widely used RPKM (reads per kilobase of exon model per million mapped reads) accounts for both library size and gene length effects in within-sample comparisons.
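A minimal base R sketch of the RPKM calculation follows: counts are divided by the library size in millions and by the gene length in kilobases. The function name `rpkm`, the toy counts, gene lengths and library sizes are all invented for illustration.

```r
# RPKM = reads / (gene length in kb * mapped reads in millions).
# 'counts' is a genes x samples matrix; 'gene_length_bp' has one entry per gene.
rpkm <- function(counts, gene_length_bp, lib_size = colSums(counts)) {
  per_million <- sweep(counts, 2, lib_size / 1e6, "/")  # library-size scaling
  per_million / (gene_length_bp / 1e3)                  # gene-length scaling
}

counts <- matrix(c(100, 200, 400, 800), nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2")))
gene_length_bp <- c(geneA = 1000, geneB = 2000)
lib_size <- c(s1 = 1e6, s2 = 4e6)        # total mapped reads per library

res <- rpkm(counts, gene_length_bp, lib_size)
print(res)
```

In this toy data geneB has twice the reads of geneA but is twice as long, and sample s2 was sequenced four times deeper than s1, so all four RPKM values come out equal.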

Normalization: methods

Normalization (example)

NG-5045 (Diabetes)

• Pool 1: 2, 4, 12, 16

• Pool 2: 3, 9, 13, 14

• Pool 3: 1, 5, 6, 7

• Pool 4: 8, 10, 11, 15

• Morbidly obese persons without insulin resistance: 2, 3, 4, 9, 12, 13, 14, 16.

• Morbidly obese persons with high insulin resistance: 1, 5, 6, 7, 8, 10, 11, 15.

Differential expression

The goal of a DE analysis is to highlight genes that have changed significantly in abundance across experimental conditions. In general, this means taking a table of summarized count data for each library and performing statistical testing between samples of interest.

Count data, even after transformation, are not well approximated by continuous distributions, especially in the lower count range and for small samples. Therefore, statistical models appropriate for count data are vital to extracting the most information from RNA-seq data.

Poisson-based analysis

In an early RNA-seq study using a single source of RNA, goodness-of-fit statistics suggested that the distribution of counts across lanes for the majority of genes was indeed Poisson distributed. This has been independently confirmed using a technical experiment, and software tools are readily available to perform these analyses.
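One way such a goodness-of-fit check could look for a single gene is sketched below in base R. Under a Poisson model with a common expression rate, the expected count in each lane is proportional to that lane's total depth, and a chi-square statistic measures the departure. This is an illustrative statistic, not necessarily the exact procedure of the cited study; `poisson_gof`, the lane depths and the gene counts are invented.

```r
# Chi-square goodness-of-fit of one gene's counts across technical lanes.
poisson_gof <- function(gene_counts, lane_totals) {
  expected <- sum(gene_counts) * lane_totals / sum(lane_totals)
  stat <- sum((gene_counts - expected)^2 / expected)
  df <- length(gene_counts) - 1
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Seven lanes with made-up sequencing depths (reads per lane).
lane_totals <- c(13.1, 12.9, 14.7, 13.5, 14.0, 13.8, 14.2) * 1e6

consistent <- poisson_gof(c(95, 92, 108, 97, 101, 99, 104), lane_totals)
overdisp   <- poisson_gof(c(40, 160, 95, 20, 180, 60, 141), lane_totals)
print(rbind(consistent, overdisp))
```

A small p-value, as for the second gene, signals more lane-to-lane variation than the Poisson model allows.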

Each RNA sample was sequenced in seven lanes, producing 12.9–14.7 million reads per lane at the 3 pM concentration and 8.4–9.3 million reads at the 1.5 pM concentration. We aligned all reads against the whole genome. 40% of reads mapped uniquely to a genomic location, and of these, 65% mapped to autosomal or sex chromosomes (the remainder mapped almost exclusively to mitochondrial DNA).

Poisson based software

R packages in Bioconductor:

• DEGseq (Wang et al., 2010)

Alternative strategies

Biological variability is not captured well by the Poisson assumption. Hence, Poisson-based analyses for datasets with biological replicates will be prone to high false positive rates resulting from the underestimation of sampling error.

Goodness-of-fit tests indicate that a small proportion of genes show clear deviations from this model (extra-Poisson variation), and although we found that these deviations did not lead to false positive identification of differentially expressed genes at a stringent FDR, there is nevertheless room for improved models that account for the extra-Poisson variation. One natural strategy would be to replace the Poisson distribution with another distribution, such as the quasi-Poisson distribution (Venables and Ripley 2002) or the negative binomial distribution (Robinson and Smyth 2007), which have an additional parameter that estimates over- (or under-) dispersion relative to a Poisson model.

Poisson-Negative Binomial

• The negative binomial distribution can be used as an alternative to the Poisson distribution. It is especially useful for discrete data over an unbounded positive range whose sample variance exceeds the sample mean. In such cases, the observations are overdispersed with respect to a Poisson distribution, for which the mean is equal to the variance. Hence a Poisson distribution is not an appropriate model. Since the negative binomial distribution has one more parameter than the Poisson, the second parameter can be used to adjust the variance independently of the mean.

$$\mathrm{var} = \mathrm{mean} + \frac{\mathrm{mean}^2}{r}$$
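For the negative binomial parameterised by a mean and a size parameter r (the `mu` and `size` arguments of R's `dnbinom` and `rnbinom`), the variance is mean + mean^2 / r, so it always exceeds the Poisson variance and approaches it as r grows. The base R sketch below checks this numerically from the probability mass function; `nb_moments` and the chosen values (mean 20, r = 4) are illustrative assumptions, and the sum is truncated at a large count where the remaining mass is negligible.

```r
# Mean and variance of a negative binomial computed directly from its pmf.
nb_moments <- function(mu, r, max_count = 5000) {
  x <- 0:max_count
  p <- dnbinom(x, size = r, mu = mu)
  m <- sum(x * p)
  v <- sum((x - m)^2 * p)
  c(mean = m, var = v)
}

mom <- nb_moments(mu = 20, r = 4)
print(mom)                       # compare var with 20 + 20^2 / 4
```

With a mean of 20, a Poisson model would fix the variance at 20, whereas r = 4 gives a variance six times larger.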

Negative-Binomial based analysis

In order to account for biological variability, methods that have been developed for serial analysis of gene expression (SAGE) data have recently been applied to RNA-seq data. The major difference between SAGE and RNA-seq data is the scale of the datasets. To account for biological variability, the negative binomial distribution has been used as a natural extension of the Poisson distribution, requiring an additional dispersion parameter to be estimated.

Description of SAGE

• Serial analysis of gene expression (SAGE) is a method for comprehensive analysis of gene expression patterns.

Three principles underlie the SAGE methodology:

1. A short sequence tag (10-14 bp) contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript;

2. Sequence tags can be linked together to form long serial molecules that can be cloned and sequenced; and

3. Quantitation of the number of times a particular tag is observed provides the expression level of the corresponding transcript.

Robinson, McCarthy, Smyth (2010)

edgeR paper: Robinson, McCarthy, Smyth (2010) (2)

Robinson and Smyth 2008

Robinson and Smyth 2008 (2)

Negative-Binomial based software

R packages in Bioconductor:

• edgeR (Robinson et al., 2010): Exact test based on the negative binomial distribution.

• DESeq (Anders and Huber, 2010): Exact test based on the negative binomial distribution.

• baySeq (Hardcastle et al., 2010): Estimation of the posterior likelihood of differential expression (or more complex hypotheses) via empirical Bayesian methods using Poisson or NB distributions.

CLC Genomics Workbench approach

19.4.2.1 Kal et al.'s test (Z-test)

Kal et al.'s test [Kal et al., 1999] compares a single sample against another single sample, and thus requires that each group in your experiment has only one sample. The test relies on an approximation of the binomial distribution by the normal distribution [Kal et al., 1999]. Because it considers proportions rather than raw counts, the test is also suitable in situations where the sum of counts is different between the samples.

19.4.2.2 Baggerly et al.'s test (Beta-binomial)

Baggerly et al.'s test [Baggerly et al., 2003] compares the proportions of counts in a group of samples against those of another group of samples, and is suited to cases where replicates are available in the groups. The samples are given different weights depending on their sizes (total counts). The weights are obtained by assuming a Beta distribution on the proportions in a group, and estimating these, along with the proportion of a binomial distribution, by the method of moments. The result is a weighted t-type test statistic.
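A two-proportion Z-test of the kind described for the single-sample comparison can be sketched in base R as below. It uses the standard pooled-proportion form of the normal approximation to the binomial; whether CLC Genomics Workbench computes its statistic in exactly this way is an assumption. The function name `kal_z_test` is hypothetical, and the counts and library sizes in the example are only for illustration.

```r
# Two-sample Z-test on proportions: x reads for one gene out of n total reads.
kal_z_test <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p0 <- (x1 + x2) / (n1 + n2)                       # pooled proportion under H0
  z <- (p1 - p2) / sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
  c(z = z, p.value = 2 * pnorm(abs(z), lower.tail = FALSE))
}

# One gene, one sample per group, with different library sizes.
res <- kal_z_test(x1 = 15, n1 = 40000, x2 = 117, n2 = 38000)
print(res)

# Equal proportions in libraries of different size give z = 0.
same <- kal_z_test(x1 = 10, n1 = 1000, x2 = 20, n2 = 2000)
```

Working with proportions is what makes the comparison valid when the two libraries differ in total counts, as the second call shows.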

Baggerly, K., Deng, L., Morris, J., and Aldaz, C. (2003). Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics, 19(12):1477-1483.

Solution with edgeR

> library(edgeR)
> set.seed(101)
> n <- 200
> lib.sizes <- c(40000, 50000, 38000, 40000)
> p <- runif(n, min = 1e-04, 0.001)
> mu <- outer(p, lib.sizes)
> mu[1:5, 3:4] <- mu[1:5, 3:4] * 8
> y <- matrix(rnbinom(4 * n, size = 4, mu = mu), nrow = n)
> rownames(y) <- paste("tag", 1:nrow(y), sep = ".")
> y[1:10, ]
       [,1] [,2] [,3] [,4]
tag.1    15   13  117   77
tag.2     3    4   49   33
tag.3    25   56  302  332
tag.4    40   13  271   91
tag.5    13    3   51   56
tag.6    14    7   31   18
tag.7    16   39   19    9
tag.8     6   28    6    6
tag.9    10   42   80   14
tag.10   33   25    5   27
> d <- DGEList(counts = y, group = rep(1:2, each = 2), lib.size = lib.sizes)
> d <- estimateCommonDisp(d)
> de.common <- exactTest(d)
Comparison of groups: 2 - 1
> topTags(de.common)
Comparison of groups: 2 - 1
           logConc     logFC       PValue         FDR
tag.184 -13.636760 -5.236853 6.112570e-05 0.005195714
tag.2   -11.769438  3.766465 6.405229e-05 0.005195714
tag.3    -8.550981  3.214682 7.793571e-05 0.005195714
tag.4    -9.188394  2.911743 3.300004e-04 0.013214944
tag.1   -10.135230  2.984351 3.303736e-04 0.013214944
tag.5   -10.944756  2.868619 1.035516e-03 0.034517212
tag.105 -10.693557  2.618355 2.337750e-03 0.066792856
tag.164 -11.253348 -2.209660 1.090272e-02 0.233310771
tag.14  -11.258031  2.238669 1.090272e-02 0.233310771
tag.123 -13.277812 -2.756096 1.166554e-02 0.233310771

Suggested pipeline?

• Quality control: FastQC, DNAA

• Mapping the reads:
  • Obtaining the reference
  • Aligning reads to the reference: Bowtie

• Differential expression:
  • Summarization of reads
  • Differential expression testing: edgeR

• Gene set testing (GO): goseq

Experimental design?

Many of the current strategies for DE analysis of count data are limited to simple experimental designs, such as pairwise or multiple group comparisons. To the best of our knowledge, no general methods have been proposed for the analysis of more complex designs, such as paired samples or time course experiments, in the context of RNA-seq data. In the absence of such methods, researchers have transformed their count data and used tools appropriate for continuous data. Generalized linear models provide the logical extension to the count models presented above, and clever strategies to share information over all genes will need to be developed; software tools now provide these methods (such as edgeR).

Auer, P.L., and Doerge, R.W. (2010). Statistical Design and Analysis of RNA Sequencing Data. Genetics, 185, 405-416.
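To make the generalized linear model idea concrete for a paired design, here is a minimal base R sketch for a single gene. It fits a Poisson log-linear model with a patient effect, a treatment effect and the log library size as an offset. All data are invented, the Poisson family is used only to keep the example in base R, and a negative binomial GLM with dispersion shared across genes (the approach taken by tools such as edgeR) would be the more realistic choice when there is biological variability.

```r
# One gene measured in 4 patients, each untreated and treated (made-up data).
dat <- data.frame(
  patient   = factor(rep(1:4, each = 2)),
  treatment = factor(rep(c("untreated", "treated"), times = 4),
                     levels = c("untreated", "treated")),
  count     = c(12, 30, 45, 98, 8, 21, 30, 70),
  lib_size  = c(4.0, 5.0, 3.8, 4.0, 4.2, 4.6, 3.9, 4.1) * 1e4
)

# The patient term absorbs baseline differences between individuals;
# the offset puts counts from libraries of different depth on one scale.
fit <- glm(count ~ patient + treatment + offset(log(lib_size)),
           family = poisson, data = dat)

# Treatment effect on the log2 scale (log2 fold change, treated vs untreated).
log_fc <- coef(fit)[["treatmenttreated"]] / log(2)
print(log_fc)
```

Because each patient serves as their own control, the treatment coefficient reflects within-patient changes rather than differences between individuals.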

Integration with other data

There is wide scope for integrating the results of RNA-seq data with other sources of biological data to establish a more complete picture of gene regulation [69]. For example, RNA-seq has been used in conjunction with genotyping data to identify genetic loci responsible for variation in gene expression between individuals (expression quantitative trait loci or eQTLs) [35,70]. Furthermore, integration of expression data with transcription factor binding, RNA interference, histone modification and DNA methylation information has the potential for greater understanding of a variety of regulatory mechanisms. A few reports of these ‘integrative’ analyses have emerged recently [71-73].