+ Energy By: Chelsea Bonollo, Meg Hemmerlein, Josh Salzberg.
Genome Assembly: a brief introduction Slides courtesy of Mihai Pop, Art Delcher, and Steven Salzberg...
-
Upload
maryann-bertha-shields -
Category
Documents
-
view
221 -
download
3
Transcript of Genome Assembly: a brief introduction Slides courtesy of Mihai Pop, Art Delcher, and Steven Salzberg...
Genome Assembly: a brief introduction
Slides courtesy of Mihai Pop, Art Delcher, and Steven Salzberg
1
2
SIZE SELECT
e.g., e.g., 10Kbp 10Kbp ± 8% ± 8% std.dev.std.dev.
SHEAR
Shotgun DNA Sequencing (Technology)
DNA target sampleDNA target sample
VectorVector
LIGATE & CLONE
PrimerPrimer
End Reads (Mates)End Reads (Mates)
SEQUENCE
550bp
Whole Genome Shotgun Sequencing
– Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.
BAC 5’BAC 5’ BAC 3’BAC 3’
– Collect another 20X in clone coverage of 50Kbp end sequence pairs:~ 1.2million pairs for Human. pairs for Human.
– Collect 10x sequence in a 1-to-1 ratio of two types of read pairs: ~ 35million reads reads for Human. for Human.
ShortShort LongLong
2Kbp2Kbp 10Kbp10Kbp
+ single highly automated process+ single highly automated process+ only three library constructions+ only three library constructions– – assembly is much more difficultassembly is much more difficult
Sequencing Factory
300 ABI 3700 DNA Sequencers 300 ABI 3700 DNA Sequencers
50 Production Staff50 Production Staff
20,000 sq. ft. of wet lab20,000 sq. ft. of wet lab
20,000 sq. ft. of sequencing space20,000 sq. ft. of sequencing space
800 tons of A/C (160,000 cfm)800 tons of A/C (160,000 cfm)
$1 million / year for electrical service$1 million / year for electrical service
$10 million / month for reagents$10 million / month for reagents
Celera’s Sequencing Factory(circa 2001)
Collected 27.27 Million reads = 5.11X coverageCollected 27.27 Million reads = 5.11X coverage
21.04 Million are paired (77%) = 10.52 Million pairs21.04 Million are paired (77%) = 10.52 Million pairs
2Kbp2Kbp 5.045M5.045M 98.6% true *98.6% true * <6% std.dev.<6% std.dev.
10Kbp10Kbp 4.401M4.401M 98.6% true *98.6% true * <8% std.dev.<8% std.dev.
50Kbp50Kbp 1.071M1.071M 90.0% true *90.0% true * <15% std.dev.<15% std.dev.
* validated against finished Chrom. 21 sequence* validated against finished Chrom. 21 sequence
The clones cover the genome 38.7X timesThe clones cover the genome 38.7X times
Data is from 5 individuals (roughly 3X, 4 others at .5X)Data is from 5 individuals (roughly 3X, 4 others at .5X)
Human Data (April 2000)
Consensus (15- 30Kbp)Consensus (15- 30Kbp)
ReadsReads
ContigContigAssembly without pairs results Assembly without pairs results in contigs whose order and in contigs whose order and orientation are not known.orientation are not known.
??
Pairs, especially groups of corroborating Pairs, especially groups of corroborating ones, link the contigs into scaffolds where ones, link the contigs into scaffolds where the size of gaps is well characterized.the size of gaps is well characterized.
2-pair2-pair
Mean & Std.Dev.Mean & Std.Dev.is knownis known
ScaffoldScaffold
Pairs Give Order & Orientation
ChromosomeChromosomeSTSSTS
STS-mapped ScaffoldsSTS-mapped Scaffolds
ContigContig
Gap (mean & std. dev. Known)Gap (mean & std. dev. Known)Read pair (mates)Read pair (mates)
ConsensusConsensus
Reads (of several haplotypes)Reads (of several haplotypes)
SNPsSNPsExternal “Reads”External “Reads”
Anatomy of a WGS Assembly
11
Assembly gaps
sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap
physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap
Sequencing gaps
Physical gaps
12
Shotgun sequencing statistics
13
Typical contig coverage
1 2 3 4 5 6 Coverage
Contig
Reads
Imagine raindrops on a sidewalk
14
Lander-Waterman statistics
L = read lengthT = minimum detectable overlapG = genome sizeN = number of readsc = coverage (NL / G)σ = 1 – T/L
E(#islands) = Ne-cσ E(island size) = L((ecσ – 1) / c + 1 – σ)contig = island with 2 or more reads
15
Example
c N #islands #contigs bases not in any read
bases not in contigs
1 1,667 655 614 698 367,806
3 5,000 304 250 121 49,787
5 8,334 78 57 20 6,735
8 13,334 7 5 1 335
Genome size: 1 Mbp Read Length: 600 Detectable overlap: 40
16
Experimental data
X coverage
# ctgs % > 2X avg ctg size (L-W) max ctg size # ORFs
1 284 54 1,234 (1,138) 3,337 526
3 597 67 1,794 (4,429) 9,589 1,092
5 548 79 2,495 (21,791) 17,977 1,398
8 495 85 3,294 (302,545) 64,307 1,762
complete 1 100 1.26 M 1.26 M 1,329
Caveat: numbers based on artificially chopping upthe genome of Wolbachia pipientis dMel
17
Assembly paradigms
• Overlap-layout-consensus– greedy (TIGR Assembler, phrap, CAP3...)– graph-based (Celera Assembler, Arachne)
• Eulerian path (especially useful for short read sequencing)
18
TIGR Assembler/phrap
Greedy
• Build a rough map of fragment overlaps
• Pick the largest scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done
19
Overlap-layout-consensusMain entity: readRelationship between reads: overlap
12
3
45
6
78
9
1 2 3 4 5 6 7 8 9
1 2 3
1 2 3
1 2 3 12
3
1 3
2
13
2
ACCTGAACCTGAAGCTGAACCAGA
20
Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
A
B D C
E
H G
I
F
A
B
C
D H
I
F
G
E
Genome
21
Implementation details
22
Overlap between two sequences
…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…
overlap (19 bases) overhang (6 bases)
overhangoverlap - region of similarity between regionsoverhang - un-aligned ends of the sequences
The assembler screens merges based on: • length of overlap• % identity in overlap region• maximum overhang size.
% identity = 18/19 % = 94.7%
23
All pairs alignment• Needed by the assembler• Try all pairs – must consider ~ n2 pairs• Smarter solution: only n x coverage (e.g. 8) pairs
are possible– Build a table of k-mers contained in sequences (single
pass through the genome)– Generate the pairs from k-mer table (single pass
through k-mer table)
k-mer
A
B
C
D H
I
F
G
E
24
Repeat Rez I, IIRepeat Rez I, II
Assembly Pipeline
OverlapperOverlapper
UnitigerUnitiger
ScaffolderScaffolder
AA
BB
impliesimplies
AA
BB
TRUE
OROR
AA BB
REPEAT-INDUCED
Find all overlaps Find all overlaps 40bp allowing 6% mismatch. 40bp allowing 6% mismatch.
Trim & ScreenTrim & Screen
Repeat Rez I, IIRepeat Rez I, II
Assembly Pipeline
Compute all overlap consistent sub-assemblies:Compute all overlap consistent sub-assemblies:Unitigs (Uniquely Assembled Contig)
OverlapperOverlapper
UnitigerUnitiger
ScaffolderScaffolder
Trim & ScreenTrim & Screen
OVERLAP GRAPH
Edge Types:
AA
BB
AA
BB
AA
BB
BB
BB
BB
AA
AA
AA
Regular DovetailRegular Dovetail
Prefix DovetailPrefix Dovetail
Suffix DovetailSuffix Dovetail
E.G.:E.G.: Edges are annotated Edges are annotated with deltas of overlapswith deltas of overlaps
The Unitig Reduction
1. Remove “Transitively Inferrable” Overlaps:1. Remove “Transitively Inferrable” Overlaps:
AA
BB
CC AABB
CC
The Unitig Reduction
2. Collapse “Unique Connector” Overlaps:2. Collapse “Unique Connector” Overlaps:
AA BBAA
BB
412412 352352
4545
Arrival IntervalsArrival Intervals
Discriminator Statistic is log-odds ratio of probability unitig is is log-odds ratio of probability unitig is unique DNA versus 2-copy DNA.unique DNA versus 2-copy DNA.
Definitely UniqueDefinitely Repetitive Don’t Know
-10-10 +10+1000
Dist. For UniqueDist. For Repetitive
Unique DNA unitig Repetitive DNA unitig
Identifying Unique DNA Stretches
Repeat Rez I, IIRepeat Rez I, II
Assembly Pipeline
OverlapperOverlapper
UnitigerUnitiger
ScaffolderScaffolder
Scaffold U-unitigs with confirmed pairsScaffold U-unitigs with confirmed pairs
Mated reads
Trim & ScreenTrim & Screen
Repeat Rez I, IIRepeat Rez I, II
Assembly Pipeline
OverlapperOverlapper
UnitigerUnitiger
ScaffolderScaffolder
Fill repeat gaps with doubly anchored positive unitigsFill repeat gaps with doubly anchored positive unitigs
Unitig>0Unitig>0
Trim & ScreenTrim & Screen
33
REPEATS
34
Handling repeats1. Repeat detection
– pre-assembly: find fragments that belong to repeats• statistically (most existing assemblers)• repeat database (RepeatMasker)
– during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
– post-assembly: find repetitive regions and potential mis-assemblies. • Reputer, RepeatMasker• "unhappy" mate-pairs (too close, too far, mis-oriented)
2. Repeat resolution– find DNA fragments belonging to the repeat– determine correct tiling across the repeat
35
Statistical repeat detectionSignificant deviations from average coverage flagged as repeats.
- frequent k-mers are ignored- “arrival” rate of reads in contigs compared with theoretical value
(e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)
Problem 1: assumption of uniform distribution of fragments - leads to false positives
non-random librariespoor clonability regions
Problem 2: repeats with low copy number are missed - leads to false negatives
36
Mis-assembled repeats
a b c
a c
b
a b c d
I II III
I
II
III
a
b c
d
b c
a b d c e f
I II III IV
I III II IV
a d b e c f
a
collapsed tandem excision
rearrangement