lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA...
![Page 1: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/1.jpg)
The first generation DNA Sequencing
Some slides are modified from faperta.ugm.ac.id/newbie/download/pak_tar/.../Instrument20072.ppt and Chengxiang Zhai at UIUC.
![Page 2: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/2.jpg)
The strand direction
http://en.wikipedia.org/wiki/DNA
![Page 3: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/3.jpg)
DNA sequencing: determination of nucleotide sequence
The determination of the precise sequence of nucleotides in a sample of DNA.
Two similar methods:
1. Maxam and Gilbert method
2. Sanger method
Both depend on the production of a mixture of oligonucleotides, labeled either radioactively or with fluorescein, with one common end and differing in length by a single nucleotide at the other end. This mixture of oligonucleotides is separated by high-resolution electrophoresis on polyacrylamide gels, and the positions of the bands are determined.
![Page 4: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/4.jpg)
Maxam-Gilbert
• Walter Gilbert
– Harvard physicist
– Knew James Watson
– Became intrigued with the biological side
– Became a biophysicist
• Allan Maxam
![Page 5: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/5.jpg)
Maxam-Gilbert Technique
• Principle: chemical degradation of pyrimidines
– Pyrimidines (C, T) are damaged by hydrazine
– Piperidine cleaves the backbone
– 2 M NaCl inhibits the reaction with T
![Page 6: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/6.jpg)
Sanger Method
Fred Sanger, 1958 (Nobel Prize in Chemistry)
– Was originally a protein chemist
– Made his first mark in sequencing proteins
– Made his second mark in sequencing RNA
1980 (second Nobel Prize in Chemistry): dideoxy sequencing
![Page 7: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/7.jpg)
Comparison
• Sanger Method
– Enzymatic
– Requires DNA synthesis
– Termination of chain elongation
• Maxam-Gilbert Method
– Chemical
– Requires DNA
– Requires long stretches of DNA
– Breaks DNA at different nucleotides
![Page 8: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/8.jpg)
Dideoxynucleotide
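No hydroxyl group at the 3’ end, which prevents strand extension
[Figure: dideoxynucleotide structure, showing the triphosphate (PPP) at the 5’ position, the base, and the 3’ position]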
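As a rough illustration of how chain termination turns into a readable sequence, here is a minimal, idealized Python sketch. It assumes that every position of the newly synthesized strand yields one terminated fragment in the lane of the matching dideoxynucleotide; the function names `sanger_lanes` and `read_gel` are made up for this example and do not model real chemistry or signal noise.

```python
def sanger_lanes(new_strand):
    """Idealized chain termination: one terminated fragment per position.

    Returns, for each ddNTP lane, the lengths of the fragments that end
    in that base (assumption: termination happens once at every position).
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read bands from shortest to longest fragment to recover the sequence."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = sanger_lanes("GATTACA")
print(lanes["A"])       # fragment lengths that end in A
print(read_gel(lanes))  # sequence read off the idealized gel
```

Because the fragments differ in length by a single nucleotide, sorting them by length is enough to read the bases in order.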
![Page 9: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/9.jpg)
![Page 10: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/10.jpg)
Sample Output
1 lane
![Page 11: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/11.jpg)
Phred: http://www.phrap.org/phrap.docs/phred.html
![Page 12: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/12.jpg)
Sanger sequencing
• Laser excitation of fluorescent labels as fragments of discrete lengths exit the capillary, coupled to four-color detection of emission spectra, provides the readout that is represented in a Sanger sequencing ‘trace’. Software translates these traces into DNA sequence, while also generating error probabilities for each base-call.
• Simultaneous electrophoresis in 96 or 384 independent capillaries provides a limited level of parallelization.
• After three decades of gradual improvement, the Sanger biochemistry can be applied to achieve read-lengths of up to ~1,000 bp, and per-base ‘raw’ accuracies as high as 99.999%. In the context of high-throughput shotgun genomic sequencing, Sanger sequencing costs on the order of $0.50 per kilobase.
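The per-base error probabilities mentioned above are commonly reported on the Phred quality scale, Q = -10 log10(P). The sketch below only converts between the two; the function names are illustrative and are not part of the Phred program itself.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred quality score Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score for a base-call error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


print(phred_to_error_prob(20))       # Q20 is 1 error in 100 base-calls
print(error_prob_to_phred(0.00001))  # 99.999% raw accuracy on the Phred scale
```

On this scale, the 99.999% raw accuracy quoted above corresponds to a quality score of about 50.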
![Page 13: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/13.jpg)
How to obtain the human genome sequence
• Sanger sequencing can only generate reads of about 1 kb.
• How can we obtain the human genome, which is about 3 billion letters long?
![Page 14: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/14.jpg)
How to obtain the human genome sequence
• The answer is to sequence many DNA fragments and assemble them into the genome.
cut many times at random
![Page 15: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/15.jpg)
Challenges with Fragment Assembly
• Sequencing errors
~1-2% of bases are wrong (later improved to 0.001%)
• Repeats
False overlaps due to repeats
Repeat content: bacterial genomes 5%; mammals 50%
![Page 16: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/16.jpg)
Repeat Types
• Low-complexity DNA (e.g., ATATATATACATA…)
• Microsatellite repeats: $(a_1 \dots a_k)^N$ where k ~ 3-6 (e.g., CAGCAGTAGCAGCACCAG)
• Transposons/retrotransposons
– SINE: Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, $10^6$ copies)
– LINE: Long Interspersed Nuclear Elements (~500-5,000 bp long, 200,000 copies)
– LTR retrotransposons: Long Terminal Repeats (~700 bp) at each end
• Gene families: genes duplicate and then diverge
• Segmental duplications: very long, very similar copies
![Page 17: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/17.jpg)
Strategies for whole‐genome sequencing
1. Hierarchical (clone-by-clone): yeast, worm, human
   i. Break genome into many long fragments
   ii. Map each long fragment onto the genome
   iii. Sequence each fragment with shotgun
2. Online version of (1) (walking): rice genome
   i. Break genome into many long fragments
   ii. Start sequencing each fragment with shotgun
   iii. Construct map as you go
3. Whole Genome Shotgun: fly, human, mouse, rat, fugu
   One large shotgun pass on the whole genome
![Page 18: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/18.jpg)
Hierarchical Sequencing vs. Whole Genome Shotgun
• Hierarchical Sequencing
– Advantages: easy assembly
– Disadvantages:
  • Need to build a library and a physical map
  • Redundant sequencing
• Whole Genome Shotgun (WGS)
– Advantages: no mapping, no redundant sequencing
– Disadvantages: difficult to assemble and resolve repeats
Whole Genome Shotgun appears to be getting more popular…
![Page 19: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/19.jpg)
Whole Genome Shotgun Sequencing
[Figure: genome cut many times at random; forward-reverse paired reads, ~500 bp each, separated by a known distance]
![Page 20: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/20.jpg)
How many reads?
Cover the region with ~7-fold redundancy
Overlap reads and extend to reconstruct the original genomic region
[Figure: reads tiled across the genomic region]
![Page 21: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/21.jpg)
Read Coverage
Length of genomic segment: G
Number of reads: N
Length of each read: L
Definition: coverage $C = NL/G$
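A small worked example of the coverage definition, as a sketch. The figures plugged in (a 3-billion-letter genome, 1 kb reads, 7-fold coverage) are the round numbers used elsewhere in these slides; the function names are illustrative.

```python
import math


def coverage(num_reads, read_len, genome_len):
    """Coverage C = N * L / G."""
    return num_reads * read_len / genome_len


def reads_needed(target_coverage, read_len, genome_len):
    """Smallest N such that N * L / G reaches the target coverage."""
    return math.ceil(target_coverage * genome_len / read_len)


G = 3_000_000_000  # genome length (round figure for the human genome)
L = 1_000          # read length (~1 kb Sanger reads)

n = reads_needed(7, L, G)
print(n)                  # number of reads for 7-fold coverage
print(coverage(n, L, G))  # coverage actually reached
```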
![Page 22: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/22.jpg)
Enough Coverage
How much coverage is enough?
According to the Lander‐Waterman model:
Assuming a uniform distribution of reads, C = 7 leaves about 1 in every 1,000 nucleotides uncovered
![Page 23: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/23.jpg)
Lander-Waterman Model
• Major assumptions
– Reads are randomly distributed in the genome
– The number of times a base is sequenced follows a Poisson distribution:

$$p(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$

where $\lambda$ is the average number of times a base is sequenced

• Implications
– G = genome length, L = read length, N = number of reads
– Mean of Poisson: $\lambda = LN/G$ (coverage)
– Fraction of bases not sequenced: $p(X=0) = e^{-\lambda}$; for $\lambda = 7$, $e^{-7} \approx 0.0009 = 0.09\%$
– Total gap length: $p(X=0) \cdot G$
– Total number of gaps: $p(X=0) \cdot N$

This model was used to plan the Human Genome Project…
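The implications listed on this slide can be evaluated directly. The following is a minimal sketch that just plugs numbers into those formulas; the function name and the dictionary keys are chosen for this example, and the inputs (3-billion-letter genome, 1 kb reads, 21 million reads) are illustrative round figures.

```python
import math


def lander_waterman(genome_len, read_len, num_reads):
    """Evaluate the slide's Lander-Waterman quantities for given G, L, N."""
    lam = read_len * num_reads / genome_len  # mean of the Poisson (coverage)
    p0 = math.exp(-lam)                      # p(X = 0): base never sequenced
    return {
        "coverage": lam,
        "fraction_uncovered": p0,
        "total_gap_length": p0 * genome_len,
        "number_of_gaps": p0 * num_reads,
    }


stats = lander_waterman(genome_len=3_000_000_000, read_len=1_000,
                        num_reads=21_000_000)
for name, value in stats.items():
    print(f"{name}: {value:.6g}")
```

With these inputs the coverage is 7 and the uncovered fraction is about 0.09%, matching the figure on the slide.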
![Page 24: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/24.jpg)
Overlap‐Layout‐Consensus Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and contigs into supercontigs
Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT..
![Page 25: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/25.jpg)
Overlap
• Find the best match between the suffix of one read and the prefix of another
• Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment
• Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
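One way to picture the dynamic-programming step is the sketch below: a suffix of read `a` is aligned against a prefix of read `b`, skipping the start of `a` and the tail of `b` at no cost. The scoring values (match +1, mismatch -1, gap -2) and the function name `best_overlap` are assumptions made for this example, not values taken from any particular assembler.

```python
def best_overlap(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment of a suffix of `a` against a prefix of `b`.

    Returns (score, overlap_len), where overlap_len is the number of
    characters of `b` covered by the best-scoring overlap.
    """
    m = len(b)
    # Row 0: no characters of `a` used yet, so b[:j] is aligned to gaps.
    prev = [j * gap for j in range(m + 1)]
    for ch_a in a:
        # Column 0 costs nothing: the overlap may start anywhere in `a`.
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if ch_a == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    # The overlap must run to the end of `a`; the tail of `b` is free.
    score = max(prev)
    return score, prev.index(score)


print(best_overlap("TAGATTACACAGATTAC", "ACAGATTACTGA"))
```

The table has len(a) × len(b) cells per pair of reads, which is why the filtration step matters: it keeps this alignment from being run on every pair.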
![Page 26: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/26.jpg)
Overlapping Reads
[Figure: two reads aligned exactly over the shared stretch TAGATTACACAGATTAC]
• Sort all k-mers in reads (k ~ 24)
• Find pairs of reads sharing a k-mer
• Extend to full alignment; throw away if not >95% similar
[Figure: short example alignments containing mismatches and gaps]
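The first two steps on this slide (sort the k-mers, then pair up reads that share one) can be sketched as follows. This is a toy version with a small k and in-memory sorting; the function name `candidate_pairs` is made up for the example, and the final alignment and 95% similarity check are left out.

```python
from itertools import combinations, groupby


def candidate_pairs(reads, k):
    """Return pairs of read indices (i, j), i < j, that share a k-mer."""
    # One (k-mer, read index) entry per distinct k-mer of each read, sorted
    # so that identical k-mers end up next to each other.
    entries = sorted({
        (read[i:i + k], read_id)
        for read_id, read in enumerate(reads)
        for i in range(len(read) - k + 1)
    })
    pairs = set()
    for _, group in groupby(entries, key=lambda entry: entry[0]):
        read_ids = [read_id for _, read_id in group]
        pairs.update(combinations(read_ids, 2))
    return pairs


reads = ["TAGATTACACAG", "ACACAGATTACT", "GGGGCCCCGGGG"]
print(candidate_pairs(reads, k=6))
```

Only the pairs returned here would go on to the full alignment step.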
![Page 27: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/27.jpg)
Overlapping Reads and Repeats
• A k-mer that appears N times initiates $N^2$ comparisons
• For an Alu that appears $10^6$ times, that is $10^{12}$ comparisons: too many
• Solution: discard all k-mers that appear more than $t \times \text{Coverage}$ times ($t \sim 10$)
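A possible sketch of this filter: count every k-mer occurrence across the reads and flag those above the threshold. The function name `overrepresented_kmers` and the toy parameter values below are assumptions for illustration only.

```python
from collections import Counter


def overrepresented_kmers(reads, k, coverage, t=10):
    """k-mers occurring more than t * coverage times across all reads."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    limit = t * coverage
    return {kmer for kmer, count in counts.items() if count > limit}


reads = ["ACGTACGTACGT", "TTTTGGGG"]
# Toy threshold: t * coverage = 2, so only k-mers seen 3 or more times are dropped.
print(overrepresented_kmers(reads, k=4, coverage=1, t=2))
```

The flagged k-mers would be skipped when pairing reads, so a highly repeated element does not trigger a quadratic number of comparisons.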
![Page 28: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/28.jpg)
Finding Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
![Page 29: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/29.jpg)
Finding Overlapping Reads (cont’d)
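• Correct errors using multiple alignment

TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA

Mismatch column (base: quality score), before correction: C: 20, C: 35, T: 30, C: 35, C: 40
After correction: C: 20, C: 35, C: 0, C: 35, C: 40

Gap column (base: quality score), before correction: A: 15, A: 25, A: 40, A: 25, -
After correction: A: 15, A: 25, A: 40, A: 25, A: 0

• Score alignments
• Accept alignments with good scores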
Multiple alignments will be covered later in the course…
![Page 30: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/30.jpg)
Layout
• Repeats are a major challenge
• Do two aligned fragments really overlap, or are they from two copies of a repeat?
![Page 31: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/31.jpg)
Merge Reads into Contigs
Merge reads up to potential repeat boundaries
repeat region
![Page 32: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/32.jpg)
Merge Reads into Contigs (cont’d)
• Ignore non-maximal reads
• Merge only maximal reads into contigs
repeat region
![Page 33: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/33.jpg)
Merge Reads into Contigs (cont’d)
• Ignore “hanging” reads when detecting repeat boundaries
[Figure: reads a and b diverge; sequencing error or repeat boundary?]
![Page 34: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/34.jpg)
Merge Reads into Contigs (cont’d)
• Insert non-maximal reads whenever unambiguous
[Figure: read placements marked ambiguous (?????) or unambiguous]
![Page 35: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/35.jpg)
Link Contigs into Supercontigs
Too dense: Overcollapsed? (Myers et al. 2000)
Inconsistent links: Overcollapsed?
Normal density
![Page 36: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/36.jpg)
Link Contigs into Supercontigs (cont’d)
Find all links between unique contigs
Connect contigs incrementally, if ≥ 2 links
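A simplified sketch of this linking step, under loud assumptions: each link is just a pair of contig names joined by one forward-reverse read pair, contigs are merged when at least `min_links` such pairs connect them (reading the slide's "2 links" as a minimum), and orientation and distance constraints are ignored. The function name `link_contigs` and the data layout are hypothetical.

```python
from collections import Counter


def link_contigs(mate_links, min_links=2):
    """Group contigs into supercontigs using paired-read links.

    mate_links: iterable of (contig_a, contig_b), one entry per read pair
    whose two reads fall in different contigs (hypothetical input format).
    """
    mate_links = list(mate_links)
    counts = Counter(tuple(sorted(pair)) for pair in mate_links
                     if pair[0] != pair[1])

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in mate_links:
        find(a)
        find(b)
    for (a, b), n in counts.items():
        if n >= min_links:  # assumption: at least 2 supporting links
            parent[find(a)] = find(b)

    groups = {}
    for contig in list(parent):
        groups.setdefault(find(contig), set()).add(contig)
    return sorted(sorted(group) for group in groups.values())


links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c3", "c4"), ("c4", "c3")]
print(link_contigs(links))
```

Requiring more than one supporting link is a guard against a single chimeric or misplaced read pair joining two unrelated contigs.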
![Page 37: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/37.jpg)
Link Contigs into Supercontigs (cont’d)
Fill gaps in supercontigs with paths of overcollapsed contigs
![Page 38: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/38.jpg)
Consensus
• A consensus sequence is derived from a profile of the assembled fragments
• A sufficient number of reads is required to ensure a statistically significant consensus
• Reading errors are corrected
![Page 39: lecture 2 The first generation DNA Sequencingcs.ucf.edu/~xiaoman/fall/lecture 2 The first... · DNA sequencing Determination of nucleotide sequence the determination of the precise](https://reader033.fdocuments.in/reader033/viewer/2022051902/5ff13bfe834d953532017c1f/html5/thumbnails/39.jpg)
Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

Consensus:
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive each consensus base by weighted voting
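A minimal sketch of weighted voting, assuming each read (or each base-call) carries a numeric weight such as a quality score, and that a gap (written as a space in the alignment rows) votes like any other symbol. The function names are illustrative; with no weights given, every read counts equally.

```python
from collections import defaultdict


def consensus_base(column):
    """Pick the symbol with the largest total weight in one column.

    column: list of (symbol, weight) pairs, one per read.
    """
    totals = defaultdict(float)
    for symbol, weight in column:
        totals[symbol] += weight
    return max(totals, key=totals.get)


def consensus(rows, weights=None):
    """Column-by-column weighted vote over equal-length alignment rows."""
    if weights is None:
        weights = [1.0] * len(rows)  # assumption: equal weight per read
    return "".join(
        consensus_base(list(zip(column, weights))) for column in zip(*rows)
    )


rows = [
    "TAGATTACACAGATTACTGA TTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG TTACACAGATTATTGACTTCATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA CTA",
]
print(consensus(rows))
print(consensus_base([("C", 20), ("C", 35), ("T", 30), ("C", 35), ("C", 40)]))
```

With equal weights this reduces to a majority vote per column; per-base quality scores would let a few confident calls outweigh several doubtful ones.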