RNA sequencing - University of Washington
Transcript of RNA sequencing - University of Washington
6/07/16
RNA sequencing
Integrative Genomics module
Michael Inouye
Centre for Systems Genomics
University of Melbourne, Australia
Summer Institute in Statistical Genetics 2016
Seattle, USA
@minouye271
inouyelab.org
This lecture
• Intro to high-throughput sequencing
• Basic sequencing informatics
• Technical variation vs biological variation
• Normalisation
• Methods to test for DE
• Example: edgeR
Sequencing experiments
DNA fragments -> Sequencer -> Sequence reads
AGCCATCAGCTA
AGCCATCAGCTA
CGACTCGACAGT
(Paired end sequencing)
High-throughput sequencing experiments
DNA samples -> Sequencer -> Sequence reads
Analysis:
• Align to a reference
• Assemble without a reference
• Annotate sequence function
• Test hypotheses with statistics
Applications:
• Genome sequencing
• RNA sequencing
• ChIP sequencing
• Metagenomic sequencing
High-throughput sequencing
DNA fragmentation
Adaptor ligation
Fix adaptors to surface & amplify
Add bases in cycles
Shendure, Nat Biotech, 2008
@lexnederbragt
~ONT~
Watch this space
• Many new technologies emerging all the time
• Single cell
• Some day: Long read (1 read -> 1 transcript)
• Review of the latest sequencing technologies
– Goodwin S et al, Nat Rev Genetics 2016. 17:333-351.
Sequencing read-out
@HWI-ST226_0154:5:1101:1452:2196#CTTGTA/1
GGCGGCGAGAAAGCGCGCCTGGTACTGGCGCTGATCGTCTGGCAGCGTCCAAATCTGCTGTTGCTCGATGAACCGACCAACCACCTGGATCTCGACATGC
+HWI-ST226_0154:5:1101:1452:2196#CTTGTA/1
gggggggggeggeefggggggggcgfefdfdggbegggggdae`^^db_ddcedebbZYb[c^[`XZY]]_d]c^bac^ccfbaf[_cTM_VR\]`^[^^
@HWI-ST226_0154:5:1101:1383:2197#CTTGTA/1
TACGATAACTCACTGGTTTCTAATGCGTTTGGTTTTTTACGTCTGCCAATGAACTTCCAGCCGTATGACAGCGATGCCGACTGGGTGATCACTGGCGTAC
+HWI-ST226_0154:5:1101:1383:2197#CTTGTA/1
ggggggggggggggggggggggggggggggggegggggfdgaggedgegaY[b``eceaUcec_cea_eeedcaXVacY``_`bbYdBBBBBBBBBBBBB
@HWI-ST226_0154:5:1101:1355:2220#CTTGTA/1
GACCGCTACCCACCAACACACCGATCCTTACGGTAACGTCATTGCCCAGGGCGGCAGTTTGTCGCTACAGGAGTACACCGGCGATCCGAAGAGCCCGCTG
+HWI-ST226_0154:5:1101:1355:2220#CTTGTA/1
gggggggggggggggggeggegfgegggggggfdggggeggggbggdbdeeedec[c_ddedeggbdbaecSYG\]^P\Wc]aO^_`]\]]JWF_^BBBB
@HWI-ST226_0154:5:1101:1262:2242#CTTGTA/1
ATGTTTTACGAAACATCTTCGGGTTGTGAGGTTAAGCGACTAAGCGTACACGGTGGATGCCCTGGCAGTCAGAGGCGATGAAGGACGTGCTAATCTGCGA
+HWI-ST226_0154:5:1101:1262:2242#CTTGTA/1
gggggggggggggggggggggggggggggggeggeggggggggggggegggggbggad^edebSfb^eb`bdccfca[\Y\`_b_]]\Y^T`]Ya^[c^B
fastq format
The four lines of each fastq record:
1. Read identifier
2. Read sequence – string of DNA bases
3. Read identifier again (after a "+")
4. Quality score for each DNA base
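The four-line record layout can be sketched as a small parser. This is a minimal illustration rather than a production reader: it assumes exactly four lines per record (no wrapped sequences), and the names `FastqRecord` and `parse_fastq` and the shortened example reads are made up for this sketch.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2, string of DNA bases
    quality: str     # line 4, one quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four input lines (assumes unwrapped records)."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError(f"truncated record: {header}")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record: {header}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch: {header}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical, shortened reads used only to exercise the parser.
EXAMPLE = """@read1/1
GGCGGCGAGAAA
+read1/1
gggggggggegg
@read2/1
TACGATAACTCA
+read2/1
gggggggggggg
"""

records = list(parse_fastq(EXAMPLE.splitlines()))
for record in records:
    print(record.identifier, record.sequence, len(record.quality))
```

The length check reflects the slide's point that there is one quality score for each DNA base.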
Phred score: $Q = -10 \log_{10} P$, where P = probability of an error

Quality score   Prob. error   Accuracy
10              1 in 10       90%
20              1 in 100      99%
30              1 in 1000     99.9%

Phred vs read base position
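The Phred relationship can be checked with a few lines of Python. This is a sketch under one loud assumption: the quality characters in the example reads (lowercase letters, runs of B at read ends) look like an older Illumina encoding with an ASCII offset of 64, but the slides do not state the offset, so `offset=64` is a guess. The function names are made up for this sketch.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 64) -> list:
    """Convert quality characters to Phred scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]


# Reproduce the rows of the quality score table.
for q in (10, 20, 30):
    p = phred_to_error_prob(q)
    print(f"Q={q}: error 1 in {round(1 / p)}, accuracy {100 * (1 - p):.1f}%")

# Decode a few characters taken from the example reads.
print(decode_quality("geZYB"))
```

Newer data typically use an offset of 33 instead, so the offset should be confirmed for any real file before decoding.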
Properties of sequence data to keep in mind
• Data = Strings of bases + quality scores
• Read length
– Fixed or variable?
– Short (e.g. 35 bp SOLiD) or long (e.g. 500+ bp 454)
• Errors
– Error rate: how frequent are errors? Phred score distribution?
– Error profile: what kind of errors are most common?
• Number of reads
– Millions? Hundreds of millions?
– How much total sequence? How does that compare to genome size?
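The last two questions are simple arithmetic, sketched below. The run size, read length and genome size are hypothetical round numbers chosen for illustration, not values from the lecture.

```python
def total_bases(n_reads: int, read_length: int) -> int:
    """Total sequence produced, assuming fixed-length reads."""
    return n_reads * read_length


def fold_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each genome position is sequenced."""
    return total_bases(n_reads, read_length) / genome_size


# Hypothetical run: 100 million reads of 100 bp against a ~3.1 Gb genome.
n_reads, read_length, genome_size = 100_000_000, 100, 3_100_000_000
print(f"total sequence: {total_bases(n_reads, read_length) / 1e9:.1f} Gb")
print(f"fold coverage: {fold_coverage(n_reads, read_length, genome_size):.1f}x")
```

For variable-length reads, the product would be replaced by the sum of the individual read lengths.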
Read alignment
Reference sequence, similar to our DNA sample
Outputs:
• what reference sequences are present (e.g. genome variation, RNA-seq, ChIP-seq)
• how many copies are there?
Read assembly
Reference-free, use the new reads alone (de novo) to reconstruct what original DNA sample looked like
[Figure: reads, contigs, gap, consensus]
• Genome sequencing: aim to assemble each chromosome
• Metagenomics: aim to assemble DNA fragments from each member of the community
• RNA-seq: aim to assemble each mRNA transcript
RNA sequencing (RNAseq)
Input: cDNA reverse transcribed from mRNA
Represents: all the messenger RNA transcripts present in a set of cells (i.e. what is being expressed)
Image: Rgocs (Wikimedia Commons)
Differential expression (DE)
• Are observed differences in read counts between groups due to chance or not?
• How is HTS different to arrays?
– Data is inherently counts
– Dynamic range is theoretically unbounded
– Splicing variation can be assessed
– Analyse at the gene, transcript, exon level?
– Different technology means different sources of confounding effects and bias
What are sources of technical variation between samples?
• Sequencing depth
• RNA composition (are some genes very highly expressed in one group and not another?)
• GC content (between genes)
• Gene length (between genes)
• Classic sources from microarrays
Do you have replicates or not?
• If no replicates, then…
– It may not be advisable to estimate significance of differences; instead, calculate a rank of fold changes
– Fisher's exact test or a chi-squared test for 2-by-2 contingency table
– Do some replicates?
• If there are replicates, then…
– Inter-library variation can be estimated
– There are relatively more sophisticated options
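For the no-replicate case, the 2-by-2 table for one gene can be laid out as (reads for this gene, all other reads) in each of the two libraries. The sketch below implements a two-sided Fisher's exact test with the standard library only; the function names and the read counts are hypothetical, and the test treats the counting noise as the only source of variation, which is why it cannot replace replicates.

```python
import math


def log_choose(n: int, k: int) -> float:
    """log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided P value for the table [[a, b], [c, d]].

    a = gene reads in library 1, b = other reads in library 1,
    c = gene reads in library 2, d = other reads in library 2.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def log_pmf(k: int) -> float:
        # Hypergeometric probability of k gene reads landing in library 1.
        return (log_choose(col1, k) + log_choose(n - col1, row1 - k)
                - log_choose(n, row1))

    log_observed = log_pmf(a)
    k_min, k_max = max(0, row1 - (n - col1)), min(row1, col1)
    p_value = 0.0
    for k in range(k_min, k_max + 1):
        lp = log_pmf(k)
        if lp <= log_observed + 1e-7:  # tables as or less likely than observed
            p_value += math.exp(lp)
    return min(1.0, p_value)


# Hypothetical gene: 30 of 1,000,000 reads vs 60 of 1,200,000 reads.
p = fisher_exact_two_sided(30, 1_000_000 - 30, 60, 1_200_000 - 60)
print(f"P = {p:.4g}")
```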
Different methods for DE
• Examples
– edgeR (Robinson and Smyth)
– Cufflinks (Trapnell et al)
– DESeq (Anders & Huber)
– SAMseq (Li & Tibshirani)
• Many others, more being published regularly
How does one choose a method?
Modified from Soneson & Delorenzi, BMC Bioinf 2013
[Figure panels: N=2, N=5, N=10. Scenarios: 1,250 (10%) up-reg; 625 up/down-reg; 625 up/down-reg, 1 outlier sample, 10% x random factor; 625 up/down-reg, 5% across all samples x random factor]
Example: edgeR
• What are the inputs?
– A table of counts (matrix)
  • Rows as 'genes'
  • Columns as samples (libraries)
– A list of group assignments for each sample (vector)
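The two inputs can be pictured as plain data structures. The sketch below is in Python rather than R and does not call edgeR; the gene names, counts and group labels are invented, and `library_sizes` and `columns_by_group` are helper names made up for this illustration.

```python
# Table of counts: rows are genes, columns are samples (libraries).
gene_names = ["geneA", "geneB", "geneC"]
counts = [
    [10, 12, 30, 28],      # geneA
    [500, 480, 510, 495],  # geneB
    [0, 1, 0, 2],          # geneC
]

# Group assignment for each sample, in the same order as the columns.
groups = ["control", "control", "treated", "treated"]


def library_sizes(count_table):
    """Column sums: total reads per library."""
    return [sum(column) for column in zip(*count_table)]


def columns_by_group(group_vector):
    """Map each group label to the column indices of its samples."""
    index = {}
    for column, label in enumerate(group_vector):
        index.setdefault(label, []).append(column)
    return index


# The group vector must have one entry per column of the count table.
assert all(len(row) == len(groups) for row in counts)

print(library_sizes(counts))
print(columns_by_group(groups))
```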
Normalisation
• Explicit scaling by library size
– TMM normalisation
• Other normalisation factors can be included in model
Normalisation: Trimmed Mean of M-values (TMM)
• Highly expressed genes can make other genes appear falsely down-regulated when comparing across libraries
Modified from Robinson & Oshlack, Genome Biology 2010
[Figure: M (log ratio) vs A (log abundance); labels: set of highly expressed genes, housekeeping]
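Normalisation: TMM
• How can we correct for this effect?
– Find set of scaling factors for libraries that minimize the log-fold changes between samples for most genes
– Estimate the ratio of RNA production of 2 samples (called 1 & 2)
Log expression ratio:
$$M_{\mathrm{gene}} = \log\left(\frac{\mathrm{count\_gene}_1 / \mathrm{total\_reads}_1}{\mathrm{count\_gene}_2 / \mathrm{total\_reads}_2}\right)$$
Log absolute expression:
$$A_{\mathrm{gene}} = \frac{1}{2}\log\left(\frac{\mathrm{count\_gene}_1}{\mathrm{total\_reads}_1} \times \frac{\mathrm{count\_gene}_2}{\mathrm{total\_reads}_2}\right)$$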
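Normalisation: TMM
• Trimmed Mean of the M values (TMM) is a weighted average after removing the upper/lower N% of the data (typically 25% for M, 5% for A)
• Weight of a gene is the inverse of its estimated variance
• After trimming, calculate the scaling factor for library 1 (compared to library 2) as
$$\log(\mathrm{TMM}) = \frac{\sum_{\mathrm{gene}_i \in G^*} \mathrm{weight}_{\mathrm{gene}_i}\, M_{\mathrm{gene}_i}}{\sum_{\mathrm{gene}_i \in G^*} \mathrm{weight}_{\mathrm{gene}_i}}$$
where G* is the set of genes remaining after trimming
If there's no RNA composition effect, then TMM = 1
The effective library size (TMM x library_size) is then used in all downstream analysis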
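The TMM recipe can be sketched end to end in Python. This is a simplified illustration, not edgeR's implementation. Assumptions made here and not stated in the slides: logs are base 2, genes with a zero count in either library are skipped, and the per-gene variance is approximated as (N1 - c1)/(N1 c1) + (N2 - c2)/(N2 c2), which is my reading of Robinson & Oshlack (2010) and should be checked against the paper. The trimming fractions default to the values quoted on the slide.

```python
import math


def tmm_factor(counts1, counts2, trim_m=0.25, trim_a=0.05):
    """Scaling factor for library 1 relative to library 2 (simplified TMM)."""
    n1, n2 = sum(counts1), sum(counts2)
    genes = []  # one (M, A, weight) tuple per usable gene
    for c1, c2 in zip(counts1, counts2):
        if c1 == 0 or c2 == 0:
            continue  # log ratio is undefined for zero counts
        p1, p2 = c1 / n1, c2 / n2
        m = math.log2(p1 / p2)        # log expression ratio
        a = 0.5 * math.log2(p1 * p2)  # log absolute expression
        variance = (n1 - c1) / (n1 * c1) + (n2 - c2) / (n2 * c2)
        if variance <= 0:
            continue
        genes.append((m, a, 1.0 / variance))
    n = len(genes)

    def kept(values, trim):
        """Indices left after dropping the lowest and highest `trim` fraction."""
        cut = int(n * trim)
        order = sorted(range(n), key=lambda i: values[i])
        return set(order[cut:n - cut])

    keep = kept([g[0] for g in genes], trim_m) & kept([g[1] for g in genes], trim_a)
    if not keep:
        raise ValueError("no genes left after trimming")
    numerator = sum(genes[i][2] * genes[i][0] for i in keep)
    denominator = sum(genes[i][2] for i in keep)
    return 2 ** (numerator / denominator)


# Toy data: 40 unchanged genes plus one gene that dominates library 1.
base = [10 * (i + 1) for i in range(40)]
library2 = base + [10]
library1 = base + [20000]

factor = tmm_factor(library1, library2)
print(f"TMM factor: {factor:.4f}")
print(f"effective library size 1: {factor * sum(library1):.0f}")
print(f"library size 2: {sum(library2)}")
```

In the toy data the 40 unchanged genes all look down-regulated in library 1 when scaled by raw library size; the TMM factor shrinks the effective library size of library 1 so that those genes line up again.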
edgeR model
• We're interested in read counts for a gene across replicates
• Variation in relative gene abundance is due to biological causes + technical causes
• Because the data is counts, we'll usually think it's Poisson distributed, and
Total CV^2 = Technical CV^2 + Biological CV^2
• What is a Poisson distribution?
[Figure: Poisson distribution, Wikipedia]
Expected value = mean (λ) = variance
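The mean = variance property can be verified numerically from the Poisson probability mass function. The sketch below sums over a truncated range of counts (`k_max` is an arbitrary cut-off chosen to be far into the tail) and also prints the squared coefficient of variation, which for a Poisson count equals 1/λ.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson variable with mean lam, computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_mean_and_variance(lam: float, k_max: int = 200):
    """Mean and variance from the pmf, truncated at k_max."""
    probabilities = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probabilities))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probabilities))
    return mean, variance


for lam in (2.0, 10.0, 50.0):
    mean, variance = poisson_mean_and_variance(lam)
    cv_squared = variance / mean ** 2
    print(f"lambda={lam}: mean={mean:.4f} variance={variance:.4f} CV^2={cv_squared:.4f}")
```

Because CV^2 = 1/λ falls as the expected count rises, purely Poisson (technical) noise becomes relatively small for highly expressed genes.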
edgeR model: Why not use a Poisson?
• Assumption that mean = variance is strong
• In RNAseq, observed variance is typically greater than the mean
– That is, the data is 'overdispersed'
• How can we handle overdispersion?
[Figure panels: 2 replicates; 42 replicates]
Alternative: Negative binomial (gamma-Poisson)
• Assume true expression level of a gene is a continuous variable with a gamma distribution across replicates
– Implies that the read counts follow a negative binomial distribution (a discrete analogue of gamma)
• NB is parameterised by mean and r (dispersion parameter)
– Note the extra parameter (compared to Poisson) which handles variance independent of the mean
– Biological CV is sqrt(r)
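The gamma-Poisson construction can be simulated directly: draw a true expression level from a gamma distribution, then draw a count from a Poisson with that mean. The sketch assumes the common parameterisation in which the variance is mean + r x mean^2, so that Total CV^2 = 1/mean + r; the mean, dispersion and seed below are arbitrary, and the Poisson sampler is a simple multiplication method suitable only for modest means.

```python
import math
import random
import statistics


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Poisson draw by multiplying uniforms until the product drops below e^-lam."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_gamma_poisson(mean: float, dispersion: float, rng: random.Random) -> int:
    """Count whose underlying expression level is gamma distributed."""
    shape = 1.0 / dispersion   # gamma CV^2 = 1/shape = dispersion
    scale = mean * dispersion  # shape * scale = mean
    true_level = rng.gammavariate(shape, scale)
    return sample_poisson(true_level, rng)


def simulate(mean: float, dispersion: float, n: int, seed: int = 1):
    rng = random.Random(seed)
    return [sample_gamma_poisson(mean, dispersion, rng) for _ in range(n)]


mean, dispersion = 20.0, 0.25
draws = simulate(mean, dispersion, 20000)
print("observed mean:", round(statistics.fmean(draws), 2))
print("observed variance:", round(statistics.pvariance(draws), 2))
print("Poisson variance would be:", mean)
print("mean + r * mean^2:", mean + dispersion * mean ** 2)
print("biological CV = sqrt(r):", math.sqrt(dispersion))
```

The observed variance should sit well above the mean, which is the overdispersion a plain Poisson cannot capture.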
edgeR model: Estimating the dispersion parameter
• Why is this important?
– Overestimation likely means a conservative DE test
– Underestimation likely means a liberal DE test
• Many methods
– Maximum-likelihood (ML)
– Pseudo-likelihood
– Quasi-likelihood
– Conditional ML (if libraries are equal size)
– Quantile adjusted conditional ML (qCML)
• Bottom line is a big simulation study was performed
– HTS data: many genes, means, variances, library sizes
– qCML was most accurate across all scenarios
– Robinson & Smyth Biostatistics 2008
edgeR model
• Genes have different mean-variance relationships, so dispersion isn't same across genes
• Initially edgeR estimates 'common' dispersion across all genes then applies an empirical Bayes approach to shrink gene-specific dispersions toward the 'common'
• Why do we care?
– Allows us to make weaker assumptions about mean-variance and thus makes model more robust to outlier genes
Subramaniam & Hsiao, Nat Imm 2012
[Figure panels: 2 replicates; 42 replicates]
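The idea of shrinking gene-specific dispersions toward a common value can be illustrated with a deliberately crude stand-in. This is not edgeR's method: the sketch swaps in a method-of-moments estimate, r = (variance - mean) / mean^2, for the likelihood-based estimators listed earlier, takes the common dispersion as the plain average of the gene-level estimates, assumes equal library sizes, and uses a made-up `prior_weight` to control the amount of shrinkage. The counts are invented.

```python
import statistics


def moment_dispersion(counts) -> float:
    """Method-of-moments dispersion for one gene (floored at zero)."""
    mean = statistics.fmean(counts)
    if mean == 0:
        return 0.0
    variance = statistics.variance(counts)  # sample variance
    return max(0.0, (variance - mean) / mean ** 2)


def shrink(gene_dispersion, common_dispersion, n_replicates, prior_weight):
    """Weighted average of the gene-level and common estimates."""
    w = n_replicates / (n_replicates + prior_weight)
    return w * gene_dispersion + (1 - w) * common_dispersion


# Hypothetical counts for three genes across four replicates.
replicate_counts = {
    "geneA": [95, 110, 102, 99],
    "geneB": [20, 75, 41, 130],
    "geneC": [500, 430, 610, 470],
}

gene_level = {g: moment_dispersion(c) for g, c in replicate_counts.items()}
common = statistics.fmean(gene_level.values())

for gene, estimate in gene_level.items():
    shrunk = shrink(estimate, common, n_replicates=4, prior_weight=4)
    print(f"{gene}: gene-level={estimate:.3f} shrunk={shrunk:.3f} common={common:.3f}")
```

With few replicates the gene-level estimates are noisy, so pulling them toward the common value trades a little bias for a large reduction in variance.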
Differential expression between 2 groups
• 'Exact' test
– NULL: mean_A = mean_B (post normalisation – pseudo exact)
– Adjust distributions of counts for different library sizes so they are identical
– Given the sum of iid NB random variables is NB, the probability of observing counts equal to or more extreme than that observed can be calculated (using NB)
• For experiments with >2 groups, a generalized linear model (GLM) is used and DE is tested using a GLM likelihood ratio test
– Bullard et al BMC Bioinformatics 2010
Multiple testing
• Each locus is tested independently
– If 20,000 tests are performed and alpha is set to 0.05, then we expect about 1,000 loci to appear DE by chance alone (0.05*20,000)
– Balance power and false positives
• Control FDR
– Benjamini-Hochberg algorithm
– Adjust P values accordingly
• Bonferroni correction
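Both corrections are short enough to write out. The sketch below implements the Bonferroni and Benjamini-Hochberg adjustments on a list of P values; the function names and the four example P values are made up for illustration.

```python
def bonferroni(pvalues):
    """Multiply each P value by the number of tests, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """FDR-adjusted P values, returned in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Expected number of loci called DE by chance under the null.
n_tests, alpha = 20_000, 0.05
print("expected false positives:", n_tests * alpha)

example = [0.01, 0.04, 0.03, 0.005]
print("Bonferroni:", bonferroni(example))
print("Benjamini-Hochberg:", benjamini_hochberg(example))
```

Bonferroni controls the chance of any false positive and is the more conservative of the two; Benjamini-Hochberg controls the expected fraction of false positives among the loci called DE.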
What output are we interested in?
CPM – Counts per million (not formally used in edgeR DE)
FPKM (cufflinks) – Fragments Per Kb of transcript per Million mapped reads
*inferred using a statistical model*
Smear plot
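The two units follow directly from their names. The sketch below uses the plain definitional arithmetic; it is not how Cufflinks arrives at FPKM, which, as noted above, is inferred using a statistical model. The function names and numbers are hypothetical.

```python
def cpm(count: int, library_size: int) -> float:
    """Counts per million reads in the library."""
    return count * 1e6 / library_size


def fpkm(fragments: int, transcript_length_bp: int, mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * mapped_fragments)


# Hypothetical gene: 100 fragments on a 2 kb transcript, 10 million mapped.
print("CPM:", cpm(100, 10_000_000))
print("FPKM:", fpkm(100, 2_000, 10_000_000))
```

CPM corrects only for sequencing depth, while FPKM also divides by transcript length, so FPKM values can be compared between genes of different lengths within a sample.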
Further reading
• For workflows and comparison of 2 of the most popular tools (DESeq and edgeR)
– Anders S et al, Nature Protocols 2013. 8(9):1765-86.
What haven't I covered?
• Splicing variation/diversity and how to test for differences
• Tools for alignment and assembly
• Novel designs for RNAseq experiments
• Data visualization
• Variant calling and genotyping from RNAseq
• Gene function/ontologies for RNAseq
• Computational limitations