Course outline -...
-
Course outline
• Goal: Learn basic programming and bioinformatics skills to complete a project using available NGS data
• Structure: Lectures (4), Journal club (4), Workshops (4)
• Grading: Problem sets (3), Class participation (journal club), Project report (oral and written)
-
Introduction to genome sequencing: Approaches and Platforms Bio472- Spring 2014 Amanda Larracuente
-
Outline
1. History
2. Basic assembly approaches
3. First generation technology
4. Second generation technology
5. Third generation technology
6. Challenges
-
Progress in genome sequencing
NHGRI at genome.gov
1. History
-
History: Sanger sequencing
• Introduced in 1975
• 1982: Bacteriophage lambda
• 1995: H. influenzae
• 1996: Yeast
• 1998: C. elegans
• 2000: Drosophila melanogaster
• 2000: Arabidopsis
• 2001: Human
-
Sequence reads
• Reads
  • Sequence output from a DNA fragment
  • Base qualities
• Paired-end reads
  • Reads from both ends of a DNA fragment
  • Similar to or same as mate pairs (depending on platform)
2. Basic Assembly Approaches
DNA fragment
Paired-end reads
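A minimal sketch of what "reads from both ends of a DNA fragment" means in practice, assuming the common convention that read 2 comes from the opposite strand. The function names, the fragment, and the read length are illustrative assumptions, and real platforms differ in read orientation (especially for mate pairs).

```python
# Illustrative only: helper names and values are assumptions, not a platform API.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int) -> tuple[str, str]:
    """Return (read 1, read 2) taken from the two ends of one fragment."""
    read1 = fragment[:read_length]                      # start of the forward strand
    read2 = reverse_complement(fragment)[:read_length]  # start of the opposite strand
    return read1, read2


fragment = "GATCGTGTCCCATTGTCAGATCGTG"
read1, read2 = paired_end_reads(fragment, 8)
print(read1, read2)
```

Because both reads come from one fragment of roughly known size, an assembler can use the expected distance between them as a constraint.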
-
Genome assemblies
Human male karyotype http://www.genome.gov
10^9 short sequencing reads; 3 Gb whole genome
-
Whole Genome Shotgun (WGS) approach
1. Shear genome into 3-5 kb fragments, clone into plasmids and sequence
2. Find overlapping reads
3. Assemble overlapping reads into contigs
4. Assemble contigs into scaffolds
5. Link scaffolds into “finished” sequence corresponding to chromosomes
[Figure: overlapping reads → contig → mate pairs → scaffold → chromosomes → finished assembly (GATCGTGTCCCATTGTCAGATCGTG)]
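Steps 2 and 3 (find overlapping reads, merge them into contigs) can be sketched with a toy greedy overlap merger. This is a simplified illustration, not the algorithm used by any particular assembler: the reads, the minimum overlap of 4 bases, and the function names are assumptions, and the sketch ignores sequencing errors, reverse-complement reads, and repeats.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 4) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up overlapping reads, given out of order
reads = ["CAAGGCTTAC", "TTACCGATAG", "ACGTTGCAAG"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these reads the merger returns a single contig. When a genome contains repeats, the longest overlap is not always the correct one, which is one reason mate pairs are needed for scaffolding.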
-
Hierarchical Approach
1. Shear genome into 150 kb fragments and put in BACs (100-150 kb inserts)
2. Create map of BACs to genome and create a tiling path
3. Shotgun sequence individual BACs from tiling path
4. Assemble BAC sequences
5. Use sequenced tiling path to reconstruct genome
[Figure: BACs → tiling path → mate pairs → scaffold → chromosomes → finished assembly (GATCGTGTCCCATTGTCAGATCGTG)]
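One way to picture step 2 is as an interval-cover problem: given BACs already placed on a physical map, pick a small set that covers the region end to end. The sketch below uses a greedy cover as an assumed stand-in for real map-based clone selection. The BAC names and kb coordinates are hypothetical, and real tiling paths also require a minimum overlap between neighboring clones.

```python
def tiling_path(bacs: dict[str, tuple[int, int]], region_end: int) -> list[str]:
    """Greedy cover: at each step take the BAC that starts at or before the
    covered position and extends farthest to the right."""
    path, covered = [], 0
    while covered < region_end:
        candidates = [(end, name) for name, (start, end) in bacs.items()
                      if start <= covered < end]
        if not candidates:
            raise ValueError(f"gap in BAC map at position {covered}")
        end, name = max(candidates)
        path.append(name)
        covered = end
    return path


# Hypothetical map positions in kb: name -> (start, end)
bac_map = {
    "BAC_A": (0, 150),
    "BAC_B": (40, 180),
    "BAC_C": (120, 270),
    "BAC_D": (160, 300),
    "BAC_E": (250, 400),
    "BAC_F": (290, 410),
}
path = tiling_path(bac_map, region_end=400)
print(path)
```

Only the BACs on the chosen path are shotgun sequenced, which keeps each assembly problem small at the cost of the mapping work up front.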
-
Comparing assembly approaches
• Whole Genome Shotgun
  • Faster
  • Assembly is a huge computational effort
  • Celera Genomics approach to human genome
• Hierarchical
  • Slower
  • Labor-intensive
  • Higher quality assembly in difficult-to-assemble regions
  • Publicly funded Human Genome Project (took >10 years and cost $3 billion)
-
First generation sequencing technology
Shear genomic DNA
Subclone into vectors
Bacterial replication
Isolate amplified clones
Capillary sequencing
3. First generation technology
-
[Figure: Sanger chain-termination ladder. Template DNA, sequence primer, and DNA polymerase yield nested fragments of TCTAGAGCCTGCAATACGATC, each one base shorter than the last (TCTAGAGCCTGCAATACGATC, TCTAGAGCCTGCAATACGAT, ..., TC, T), separated by fragment size on a capillary gel.]
Sanger sequencing
• Chain termination
• Fluorescently labeled, modified nucleotides
• Capillary gel electrophoresis
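The logic of chain termination can be sketched in a few lines: every possible termination point yields one fragment, the capillary gel sorts fragments by size, and the labeled terminal base of each fragment spells out the sequence. This is an idealized illustration with an example 21-base sequence and assumed function names. It ignores signal noise and the read-length limits of real capillary runs.

```python
def chain_termination_ladder(synthesized: str) -> list[str]:
    """All primer-extension products, one per possible termination point."""
    return [synthesized[:n] for n in range(1, len(synthesized) + 1)]


def read_ladder(fragments: list[str]) -> str:
    """Sort fragments by size (as electrophoresis does) and read each
    fragment's labeled terminal base in order."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))


sequence = "TCTAGAGCCTGCAATACGATC"
ladder = chain_termination_ladder(sequence)
# Fragment order in the tube is arbitrary; size separation restores it.
recovered = read_ladder(ladder[::-1])
print(recovered)
```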
-
Applications • Sequencing PCR fragments • Sequencing off plasmids
• Sequencing genomes
• Sequencing cDNA libraries
-
Second generation sequencing technology
Shear genomic DNA
Solid support fixation
Amplification
Base detection (wash and scan)
4. Second generation technology
-
454 pyrosequencing
Rothberg and Leamon 2008
a. Isolate gDNA, fragment and ligate adapters
b. Bind to beads and carry out emulsion PCR (emPCR; 1 fragment/bead)
c. Break emulsion and add beads to fiber-optic slide
d. Pyrosequencing reaction, 1 nt added at a time (peak height corresponds to the number of nucleotides incorporated)
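Step d can be illustrated with an idealized flowgram: nucleotides are flowed one at a time in a fixed cycle, and the signal for each flow equals the length of the homopolymer run incorporated. The flow order "TACG", the example sequence, and the function names are assumptions for this sketch. Real signals are noisy, and that noise is why long homopolymers are a common source of indel errors.

```python
def flowgram(read_sequence: str, flow_order: str = "TACG",
             max_flows: int = 400) -> list[tuple[str, int]]:
    """Idealized pyrosequencing signal: (flowed base, bases incorporated) per flow."""
    signals, pos = [], 0
    for flow in range(max_flows):
        if pos >= len(read_sequence):
            break
        base = flow_order[flow % len(flow_order)]
        run = 0
        # A whole homopolymer run is incorporated during a single flow.
        while pos < len(read_sequence) and read_sequence[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
    return signals


def call_bases(signals: list[tuple[str, int]]) -> str:
    """Convert peak heights back into a base sequence."""
    return "".join(base * height for base, height in signals)


signals = flowgram("TTACGGGA")
print(signals)
print(call_bases(signals))
```

A flow with height 0 means the flowed base did not match the next template position, so nothing was incorporated.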
-
Illumina
• Fragment gDNA
• Ligate adapters
• Fix fragments on solid surface
• Bridge amplification to generate clusters
• Sequence one end (using reversible terminators)
• If paired-end, regenerate cluster and sequence the other end
Figure from Mardis 2013
-
Ion Torrent
1. Shear DNA, ligate adapters
2. Attach fragments to beads and amplify with emPCR
3. Place beads in wells on a plate
4. Flow nucleotides over wells, one at a time
5. DNA polymerase incorporates bases and gives off H+
6. Miniature semiconductor sensor reads the pH change
http://www.lifetechnologies.com
*more like 2.5-generation technology
-
Applications • Genome re-sequencing (reference based assembly)
• Genome sequencing (de novo assembly)
• Sequencing transcriptome (RNA-seq)
• Sequencing DNA associated with proteins (ChIP-seq)
-
Third generation sequencing technology
Shear genomic DNA
Solid support fixation
No amplification
Base detection
5. Third generation technology
Single-molecule sequencing
-
Single molecule sequencing, e.g. Pacific Biosciences (PacBio)
• Single-molecule real-time (SMRT) sequencing
• Real time fluorescent nucleotides
• Some reads >10 kb
• High error rate
Eid et al. 2009
-
Applications
• Low-depth: Scaffolding contigs (de novo assembly)
• High-depth: Genome sequencing of repetitive regions or structural rearrangements
-
Comparison of NGS technologies (non-exhaustive)
Method           | Strategy                  | Read length            | Error type | Error rate  | Output per run
454              | Synthesis/pyrosequencing  | Up to 700 bp           | Indels     | 1%          | 400-600 Mbp
SOLiD            | DNA ligase                | 75 bp                  | AT bias    | >0.01-0.06% | 20-30 Gbp
Illumina (HiSeq) | Synthesis/DNA polymerase  | 150 bp                 | Subs.      | >0.1%       | 600 Gbp
Ion Torrent      | H+ detection              | 90 bp                  | Indels     | 1.5%        | 1 Gbp
PacBio           | Single molecule/synthesis | >2.5 kb (up to 10 kb)  | Insertions | 15%         | 75-100 Mbp (5-10 Mbp usable)
6. Challenges
-
The $1000 genome: Illumina!
Reported January 14, 2014 (http://investor.illumina.com/):
“The HiSeq X™ Ten, composed of 10 HiSeq X Systems, is the first sequencing platform that breaks the $1000 barrier for a 30x human genome. The HiSeq X Ten System is ideal for population-scale projects focused on the discovery of genotypic variation to understand and improve human health”
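To put "30x" in concrete terms, mean coverage is the total number of sequenced bases divided by the genome size. The sketch below applies that relationship with assumed inputs: a 3 Gb genome, 150 bp reads, and, for the second calculation, 10^9 reads of 100 bp. The function names are illustrative.

```python
def reads_needed(genome_size_bp: int, coverage: int, read_length_bp: int) -> int:
    """Number of reads required to reach a target mean coverage (rounded up)."""
    total_bases = genome_size_bp * coverage
    return -(-total_bases // read_length_bp)  # integer ceiling division


def mean_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Mean coverage = (number of reads x read length) / genome size."""
    return n_reads * read_length_bp / genome_size_bp


HUMAN_GENOME_BP = 3_000_000_000  # approximately 3 Gb

print(reads_needed(HUMAN_GENOME_BP, coverage=30, read_length_bp=150))
print(round(mean_coverage(10**9, 100, HUMAN_GENOME_BP), 1))
```

Mean coverage says nothing about how evenly reads are distributed, so some regions will be covered far less than 30 times.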
-
Summary of technology
• Point:
  • Sequencing is cheap and easy
  • Within reach of individual labs
• Current challenge
  • Computational
  • Data management
NHGRI at genome.gov
-
Repetitive DNA
• Interspersed repeats (e.g. transposable elements)
• Tandem repeats (e.g. satellites, CNVs)
-
Challenges for repetitive DNA
• Repeat unit longer than read length (e.g. transposable elements)
• Repeat unit longer than insert sizes (e.g. transposable elements)
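A small sketch of why read length matters for repeats: when reads are shorter than a tandem array, arrays with different copy numbers produce exactly the same set of reads, so the assembler cannot tell how many copies are present. The flanks and the AATATGG unit are modeled on the lecture's example locus. The copy numbers (8 and 5) and the read lengths are assumptions chosen for illustration.

```python
LEFT, UNIT, RIGHT = "CCTGCGAT", "AATATGG", "TGTACCC"  # flanks and repeat unit


def locus(copies: int) -> str:
    """Unique flanks around a tandem array with the given copy number."""
    return LEFT + UNIT * copies + RIGHT


def all_reads(seq: str, length: int) -> set[str]:
    """Every error-free read of a fixed length that could be sampled from seq."""
    return {seq[i:i + length] for i in range(len(seq) - length + 1)}


true_locus = locus(8)  # assumed true copy number
collapsed = locus(5)   # assumed collapsed assembly

# Short reads: both versions of the locus give the identical read set.
same_short = all_reads(true_locus, 14) == all_reads(collapsed, 14)

# A read spanning the whole array plus a little of each flank settles it.
spanning = LEFT[-2:] + UNIT * 8 + RIGHT[:2]
long_reads = all_reads(true_locus, len(spanning))

print(same_short)
print(spanning in long_reads, spanning in collapsed)
```

Paired reads help in the same way when the insert size exceeds the repeat: both ends land in unique flanking sequence, and the known insert size constrains the length of what lies between them.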
-
Challenges for repetitive DNA
True genomic sequence:
CCTGCGATAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGAATATGGTGTACCC
[Figure: short reads sampled from the sequence above; many fall entirely within the AATATGG tandem repeat]
-
Challenges for repetitive DNA: single end libraries
[Figure: true genomic sequence and the assembly built from single end reads]
-
Challenges for repetitive DNA: paired end libraries
[Figure: true genomic sequence and the assembly built from paired end reads]
-
Challenges for repetitive DNA: paired end + mate pair libraries
[Figure: true genomic sequence and the assembly built from paired end and mate pair reads]
-
Repeats cause
• Misassemblies
• Complex rearrangements
• Gaps
-
Next gen applications and repeats
• WGS with Sanger
  • Repetitive DNA unstable in cloning vectors
  • Paired end/mate pairs help with assembly
• 454 pyrosequencing
  • Problems with homopolymers
  • Paired end/mate pairs help with assembly
• Illumina
  • Repetitive elements longer than read length
  • Deep coverage and mate pairs help with assembly
• PacBio
  • Very high error rate: requires deep PacBio coverage or short reads for correction
  • Long read length plows through repeats
-
Further reading:
• Metzker. 2010. Sequencing technologies - the next generation. Nature Reviews Genetics 11:31-46.
• Mardis. 2013. Next-Generation Sequencing Platforms. Ann. Rev. Anal. Chem. 6:287-303.
• Treangen and Salzberg. 2012. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics 13:36-46.
-
Project background reading
• Brennecke, J, AA Aravin, A Stark, M Dus, M Kellis, R Sachidanandam, GJ Hannon. 2007. Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila. Cell 128:1089-1103.
• Lemos, B, LO Araripe, DL Hartl. 2008. Polymorphic Y chromosomes harbor cryptic variation with manifold functional consequences. Science 319:91-93.
• Nagao, A, T Mituyama, H Huang, D Chen, MC Siomi, H Siomi. 2010. Biogenesis pathways of piRNAs loaded onto AGO3 in the Drosophila testis. RNA 16:2503-2515.
• Filion, GJ, JG van Bemmel, U Braunschweig, et al. 2010. Systematic protein location mapping reveals five principal chromatin types in Drosophila cells. Cell 143:212-224.
-
Papers
• Akbari, OS, I Antoshechkin, BA Hay, PM Ferree. 2013. Transcriptome profiling of Nasonia vitripennis testis reveals novel transcripts expressed from the selfish B chromosome, paternal sex ratio. G3 (Bethesda) 3:1597-1605.
• Blumenstiel, JP, X Chen, M He, CM Bergman. 2014. An Age-of-Allele Test of Neutrality for Transposable Element Insertions. Genetics 196:523-538.
• Rogers, RL, JM Cridland, L Shao, TT Hu, P Andolfatto, KR Thornton. 2014. Landscape of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans. ArXiv preprint.
• Kelleher, ES, DA Barbash. 2013. Analysis of piRNA-mediated silencing of active TEs in Drosophila melanogaster suggests limits on the evolution of host genome defense. Molecular Biology and Evolution 30:1816-1819.
-
Getting set up to run graphical software on BlueHive
• Please go to: https://www.circ.rochester.edu/wiki/index.php/Getting_Started and https://www.circ.rochester.edu/wiki/index.php/NX_Cluster
• Install an X11 application if needed