Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

49
Genetics 211 - 2014 Lecture 3 Genome Resequencing and Functional Genomics Gavin Sherlock January 21 st 2014

Transcript of Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Page 1: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Genetics 211 - 2014 Lecture 3

Genome Resequencing and Functional Genomics Gavin Sherlock January 21st 2014

Page 2: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

SAM files •  Sequence Alignment/Map format •  Is a concise file format that contains information about how sequence

reads maps to a reference genome •  Can be further compressed in BAM format, which is a binary format

of SAM. •  Can also be sorted and indexed to provide fast random access, using

SAMtools (more on this in a minute). •  Requires ~1 byte per input base to store sequences, qualities and meta

information. •  Supports paired-end reads and color space. •  Is produced by bowtie, bwa •  SAM can be converted to pileup format.

Page 3: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

SAM format No. Name Description 1 QNAME Query NAME of the read or read pair 2 FLAG Bitwise FLAG 3 RNAME Reference Sequence Name 4 POS 1-Based leftmost POSitionof clipped alignment 5 MAPQ MAPping Quality (Phred-scaled) 6 CIGAR Extended CIGAR string (operations: MIDNSHP) 7 MRNM Mate Reference NaMe (‘=’ if same as RNAME) 8 MPOS 1-Based leftmost Mate POSition 9 ISIZE Inferred Insert SIZE 10 SEQ Query SEQuence on the same strand as the reference 11 QUAL Query QUALity (ASCII-33=Phred base quality)

Page 4: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

SAM Format coor 12345678901234 5678901234567890123456789012345!ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT!!r001+ TTAGATAAAGGATA*CTG!r004+ ATAGCT..............TCAGC!r001- CAGCGCCAT!

@SQ SN:ref LN:45!r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *!r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *!r001 83 ref 37 30 9M * 0 0 CAGCGCCAT *!

Page 5: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Alignment Visualization •  SAMtools has its own terminal-based text

visualization tool, called tview •  Very simple, but fast, and useful for looking at SNPs

and indels

Page 6: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Alignment Visualization •  For more complex viewing needs, use either GenomeView or

IGV •  Can display BAM files, plus annotation tracks •  Java based dynamic visualization software •  Shows snps, indels, spliced reads and a signal track

http://genomeview.sourceforge.net/ http://www.broadinstitute.org/igv/

•  Also check out JBrowse for online viewing of BAM files

NB: Browser-driven analyses are necessarily anecdotal and at best semi-quantitative.

Page 7: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Genome Resequencing

•  SNP/indel discovery – Can now multiplex 48 yeast strains on a single

lane of the HiSeq 2000 (~70x coverage each) –  1 lane of HiSeq = ~10x human genome

•  Structural variant discovery •  Copy Number changes •  Novel sequence discovery

Page 8: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

SNP and indel calling •  SNPs and indels are ‘called’ from SAM/BAM files •  Typically use GATK (Genome Analysis Toolkit), which will

identify nucleotides with support for variation –  Sounds trivial, but typically lots of false positives –  Harder in a diploid than a haploid –  Harder with lower coverage –  You *must* read the “Best Practice Variant Detection with the

GATK” page before using GATK; it changes frequently –  Make sure the version of GATK you are using (it is updated

frequently) corresponds to the latest “best practices” –  Always document what you did – this is best done in the form of a

Perl or Python script that glues GATK calls into a pipeline, and that documents its parameters.

Page 9: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

VCF files •  VCF stands for variant call format •  Produced by the GATK, and describes all

variant positions derived from bam files •  Meta information lines, preceded by ##,

indicate the source of data used, and how the file was generated.

•  Variants are not annotated – to do this, using a tool such as SNPeff or ANNOVAR –  Helps you zero in on which variants might be of

interest

Page 10: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Copy Number Variation

•  Sequence coverage of a region can allow detection of amplifications or deletions – i.e. duplicated regions will have higher coverage.

•  Power to detect such regions depends on their size, copy number and number of reads.

Page 11: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

CNV Detection Power

Chiang et al, 2009. Nature Methods 6, 99-103.

Page 12: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Coverage Shows Amplification HXT7                                        HXT6  

Page 13: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Structural Variation

•  Using paired-end read data, it’s possible to identify structural variants:

Adapted from Korbel at al, 2007.

Page 14: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Structural Variant Identification

•  There are several tools, e.g.: –  BreakDancer (used by WashU Sequencing Center) –  VariationHunter –  MoDIL –  inGAP-SV

•  has a nice visualization tool (http://ingap.sourceforge.net/)

–  PEMer (developed by the Gerstein lab) •  No real robust comparison between these exists on a

standard dataset •  Ask around, find out what others are using, get latest

versions and experiment

Page 15: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

•  interactions between nucleic acids and proteins"

•  transcript identity"•  transcript abundance"• RNA editing"• SNPs"• Allele specific expression"• Regulation"

• Nucleosome positioning"• 3D genome architecture"• Active promoters"•  interactions between

nucleic acids and proteins"•  chromatin modifications"

• genome variability"• metagenomics"• genome modifications"• detection of mutations"• association studies"• phylogeny"• evolution"

Applications of Next-Gen Sequencing

genome chromatin transcriptome"

de novo sequencing"

assembly"

annotation"

mapping"

resequencing"

detection of variants"

mapping"

Hi-C"

3D reconstruction"

mapping"

ChIP-Seq"

detection of binding sites"

mapping"

RNA-Seq"

transcript detection and quantification"

mapping"

ATAC-Seq"

Identify open

chromatin"

Page 16: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

RNA-seq

•  RNA-Seq is similar to performing expression profiling on microarrays

•  Instead of detecting transcript abundance by hybridization signal, we count fragments of transcripts

•  Several advantages over microarrays –  No prior knowledge needed of which parts of the genome are

expressed –  Allows splice site discovery –  5’ and 3’ UTR mapping –  Novel transcript discovery –  View RNA modifications (editing, other enzymatic changes) –  With longer reads, can “phase” splice sites –  Possibly discover many novel isoforms –  More sensitive

Page 17: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Dynamic Range

Page 18: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

How do we sequence mRNA?

Total  RNA  

DNAase  Treatment  

Oligo-­‐dT  beads  

PolyA  purified  RNA  

Page 19: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

First Strand cDNA synthesis

AAA(A)n" 3’ poly A tail"5’ cap structure"

mRNA"

AAA(A)n" 3’ "5’ "3’" 5’" oligo (dT)12-18 primer"

3’" 5’"AAA(A)n" 3’ "5’ "

dNTPs reverse transcriptase"

AAA(A)n" 3’ "5’ "3’" 5’"

cDNA:mRNA hybrid"

Page 20: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Second Strand Synthesis

AAA(A)n" 3’ "5’ "3’" 5’"

dNTPs RNAaseH"E. coli polymerase I!

5’ "3’" 5’"

remnants of mRNA serve as primers for"synthesis of second strand of cDNA"

5’ "3’" 5’"

bacteriophage T4"DNA ligase"

5’ "3’" 5’"

3’ "double stranded cDNA"

Library construction, similar to genomic DNA, using forked adapters"

Page 21: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Shatter RNA, Prime with Random Hexamers

AAA(A)n" 3’ poly A tail"5’ cap structure"

Fragment RNA"

5’" 3’"

Prime 1st strand synthesis with random hexamers"

5’ "3’" 5’"

3’ "double stranded cDNA"

Library construction, using forked adapters"

Page 22: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Random Hexamer Induced Sequence Errors

van Gurp TP, McIntyre LM, Verhoeven KJ. (2013). Consistent errors in first strand cDNA due to random hexamer mispriming. PLoS One 8(12):e85583.

Page 23: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

PolyA  purified  RNA  

Fragmented  RNA  

5’  Ambion  Fragmenta>on  Buffer  

5’   3’    

Polynucleo>de  kinase  in  T4  RNA  ligase  buffer  (which  has  ATP).    Should  be  fresh.  

RNA  Fragments  with  5’P  and  3’OH  

P-­‐5’   3’-­‐OH    

Remove  ATP  

Retaining Strand Specificity Through RNA ligation

Page 24: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Adapter Ligation

Ligate  adenylated  3’  adaptor  

T4  RNA  ligase  I  (no  ATP)  

Size  Select  Fragments  between  125  and  200bp  on  a  polyacrylamide  gel  

Remove  unligated  adaptor  

Ligate  5’  Adaptor  

T4  RNA  ligase  

Page 25: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Creation of Strand Specific Library

Reverse  Transcrip>on  (Superscript  II)  

RNA  hydrolysis  

PCR  

Sequence"

Page 26: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Using dUTP to create strand specificity

mRNA"

fragment"

1st strand synthesis with random hexamers

and normal dNTPs"5’" 3’"

forked adapter ligation"

5’" 3’"2nd strand synthesis with dTTP -> dUTP"

Page 27: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Creating Strand Specificity

UNG treatment"

Ad #1"Ad #2"

Pre-amplification and sequencing"

Page 28: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Mapping of Reads

•  Map reads to both the genome, and the predicted spliced genome.

•  Un-mappable reads may span unknown exon-exon junctions from novel transcripts or exons. –  With short reads, a challenge to map these –  TopHat (works in conjunction with Bowtie) can find

these quite efficiently. •  Need to be able to accommodate mismatches.

Page 29: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Exonic Read Density

•  To measure abundance when sequencing entire transcripts, you must normalize the data for the transcript length.

•  Exonic Read Density = Reads per kb gene exon per million mapped reads – Developed by the Wold lab, but makes intuitive

sense.

Page 30: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Why Exonic Read Density? What we observe in mapped reads"

What was sequenced"

What was present in RNA"

1 rpkm 3 rpkm 1 rpkm

Page 31: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Experimental Considerations

•  Paired-end vs. single end –  Transcriptome complexity will dictate your choice

•  Strand specific vs. strand agnostic –  There’s no real reason to not use strand specific

library protocols •  Total RNA vs. polyA+ purified vs. ribosomal

depleted vs. other RNA subpopulations.

Page 32: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Analysis Considerations •  Read Mapping

–  Unspliced Aligners –  Spliced Aligners

•  Transcriptome Reconstruction –  Genome guided –  Genome independent reconstruction

•  Expression quantification –  Gene quantification –  Isoform quantification

•  Differential Expression

Page 33: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Unspliced Aligners

•  Limited to identifying known exons and junctions

•  Requires a good reference transcriptome •  BWT (e.g. bowtie, bwa) based aligners are

much faster •  Seed and extend aligners (e.g. SHRiMP,

Stampy) are more sensitive

Page 34: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Spliced Aligners Align to whole genome, including intron-spanning reads that allow large gaps •  Exon first (MapSplice, SpliceMap, TopHat)

–  Two step process •  Use unspliced alignment •  Take unmapped reads, split, and look for possible spliced connections

–  Typically faster •  Seed-extend (GSNAP, QPALMA)

–  Break reads into short seeds and place on genome, then examine with more sensitive methods

–  Find more splice junctions, though not *yet* clear if they tend to be false positives

Page 35: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Garber et al, 2011, Nature Methods

Page 36: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine
Page 37: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Transcriptome Reconstruction

•  Challenging because – Transcript abundance spans several orders of

magnitude – Reads will originate from mature mRNA, as

well as incompletely spliced precursor RNA – Reads are short, and genes can have many

isoforms, making it challenging to determine which isoform produced which read

Page 38: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Two Approaches •  Genome Guided

–  Relies on reference genome –  Uses spliced reads to reconstruct the transcriptome –  E.g. cufflinks (identifies minimal set of isoforms),

scripture (identifies maximal set of isoforms) •  Genome Independent Approach

–  Tries to de novo assemble transcripts –  TransAbyss, Velvet –  Sensitive to sequencing errors –  Usually requires more computational resources

Page 39: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Two isoforms of the same gene:

Page 40: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine
Page 41: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Determining differential Expression

•  A number of packages available – Cuffdiff, DE-Seq, EdgeR etc.

•  Require replicates for each condition, so can compare within vs. between sample variance

•  More abundant transcripts are more able to be determined to have differential expression

Page 42: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

3’ SEQ

•  Sometimes you have samples with low quality, fragmented RNA

•  Alternatively, you may not want to sequence complete transcripts, but instead, simply quantitate them

•  3’ SEQ attempts to reduce read coverage across transcripts, while still quantifying them

Page 43: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

3’ SEQ

Page 44: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Better Assaying Isoforms •  To better understand a biological system, we really want to

understand all transcripts –  Alternative splicing first seen in viruses in the 1970s

•  Splicing generates complexity –  Humans have only ~2X more genes than Drosophila –  More than one gene one protein –  >38,000 Dscam isoforms! –  Alternative splicing can be altered in disease

•  With relatively short reads, even with paired end sequencing, it’s not clear which exons ends up with which other exons in mature isoforms

•  Long-Read RNA-Seq results in better isoform determination.

Page 45: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Long-Read RNA-Seq

Sharon D, Tilgner H, Grubert F, Snyder M. (2013). A single-molecule long-read survey of the human transcriptome. Nat Biotechnol 31(11):1009-14

Page 46: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

TIF-Seq

•  Transcript Isoform Sequencing •  Does not capture exonic structure •  Instead captures 5’ and 3’ ends of

transcripts •  From only ~6000 genes in yeast, almost 2

million unique transcript isoforms identified •  371,087 major TIFs identified genome-wide

Pelechano V, Wei W, Steinmetz LM. (2013). Extensive transcriptional heterogeneity revealed by isoform profiling. Nature 497(7447):127-31.

Page 47: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

TIF-Seq

Page 48: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

TIF-Seq

Page 49: Genetics 211 - 2014 Lecture 3 - Home | Stanford Medicine

Recommended Reading •  Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data

Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16):2078-9. •  Thorvaldsdóttir, H., Robinson, J.T. and Mesirov, J.P. (2013). Integrative Genomics Viewer (IGV): high-performance genomics data

visualization and exploration. Brief Bioinform. 14(2):178-92. •  Abeel, T., Van Parys, T., Saeys, Y., Galagan, J. and Van de Peer, Y. (2012). GenomeView: a next-generation genome browser. Nucleic

Acids Res. 40(2):e12. •  Westesson, O., Skinner, M., Holmes, I. (2013). Visualizing next-generation sequencing data with JBrowse. Brief Bioinform. 14(2):172-7.

•  Van der Auwera, G.A., Carneiro, M., Hartl, C., Poplin, R., del Angel, G., Levy-Moonshine, A., Jordan, T., Shakir, K., Roazen, D., Thibault, J., Banks, E., Garimella, K., Altshuler, D., Gabriel, S. and DePristo, M. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Current Protocols in Bioinformatics 43:11.10.1-11.10.33.

•  Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., McVean, G., Durbin, R. and the 1000 Genomes Project Analysis Group (2011). The variant call format and VCFtools. Bioinformatics 27(15):2156-8.

•  Parkhomchuk, D., Borodina, T., Amstislavskiy, V., Banaru, M., Hallen, L., Krobitsch, S., Lehrach, H., and Soldatov, A. (2009). Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Res. 37(18):e123.

•  Borodina, T., Adjaye, J., Sultan, M. (2011). A strand-specific library preparation protocol for RNA sequencing. Methods Enzymol. 500:79-98.

•  van Gurp, T.P., McIntyre, L.M., Verhoeven, K.J. (2013). Consistent errors in first strand cDNA due to random hexamer mispriming. PLoS One 8(12):e85583.

•  Beck, A.H., Weng, Z., Witten, D.M., Zhu, S., Foley, J.W., Lacroute, P., Smith, C.L., Tibshirani, R., van de Rijn, M., Sidow, A. and West, R.B. (2010). 3'-end sequencing for expression quantification (3SEQ) from archival tumor samples. PLoS One 5(1):e8768.

•  Sharon, D., Tilgner, H., Grubert, F. and Snyder, M. (2013). A single-molecule long-read survey of the human transcriptome. Nat Biotechnol. 31(11):1009-14.

•  Pelechano, V., Wei, W. and Steinmetz, L.M. (2013). Extensive transcriptional heterogeneity revealed by isoform profiling. Nature 497(7447):127-31.

•  Trapnell, C., Pachter, L. and Salzberg, S.L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105-11. •  Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J. and Pachter, L. (2010).

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 28(5):511-5. Cufflinks