Genome Assembly: a brief introduction Slides courtesy of Mihai Pop, Art Delcher, and Steven Salzberg...

Genome Assembly: a brief introduction

Slides courtesy of Mihai Pop, Art Delcher, and Steven Salzberg


Shotgun DNA Sequencing (Technology)

DNA target sample

SHEAR

SIZE SELECT: e.g., 10 Kbp ± 8% std. dev.

LIGATE & CLONE into vector

SEQUENCE from the primer: end reads (mates), 550 bp

Whole Genome Shotgun Sequencing

– Collect 10x sequence in a 1-to-1 ratio of two types of read pairs (short, 2 Kbp; long, 10 Kbp): ~35 million reads for Human.

– Collect another 20X in clone coverage of 50 Kbp end sequence pairs (BAC 5’ and BAC 3’): ~1.2 million pairs for Human.

– Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.

+ single highly automated process
+ only three library constructions
– assembly is much more difficult

Celera’s Sequencing Factory (circa 2001)

300 ABI 3700 DNA Sequencers

50 Production Staff

20,000 sq. ft. of wet lab

20,000 sq. ft. of sequencing space

800 tons of A/C (160,000 cfm)

$1 million / year for electrical service

$10 million / month for reagents

Human Data (April 2000)

Collected 27.27 million reads = 5.11X coverage

21.04 million are paired (77%) = 10.52 million pairs

2 Kbp    5.045M    98.6% true *    <6% std. dev.

* validated against finished Chrom. 21 sequence

The clones cover the genome 38.7X

Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)

Pairs Give Order & Orientation

Assembly without pairs results in contigs whose order and orientation are not known.

[Figure: reads aligned into a contig with its consensus (15-30 Kbp); the relative placement of contigs is marked "?"]

Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized.

[Figure: a scaffold of contigs joined by a 2-pair; the mean & std. dev. of each gap is known]
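One way to see why corroborating pairs characterize a gap well is the sketch below. It assumes each mate pair comes from a library with known insert mean and standard deviation, and that the pairs are independent; the function name `estimate_gap` and its input layout are hypothetical, not taken from any assembler.

```python
import math

def estimate_gap(pairs, insert_mean, insert_std):
    """Estimate a scaffold gap from mate pairs that span it.

    pairs: list of (bases_in_left_contig, bases_in_right_contig), i.e. how much
           of each clone lies inside the contig on either side of the gap.
    Returns (gap_mean, gap_std) under the assumption of independent pairs
    drawn from one library.
    """
    estimates = [insert_mean - left - right for left, right in pairs]
    gap_mean = sum(estimates) / len(estimates)
    # Averaging n independent estimates shrinks the uncertainty by sqrt(n).
    gap_std = insert_std / math.sqrt(len(estimates))
    return gap_mean, gap_std

# Two 10 Kbp clones (std. dev. 8%, i.e. 800 bp) spanning the same gap.
print(estimate_gap([(4000, 3000), (4500, 2700)], insert_mean=10_000, insert_std=800))
```

A single pair only bounds the gap to within the library's standard deviation; each additional corroborating pair tightens the estimate.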

Anatomy of a WGS Assembly

[Figure: a chromosome with STS markers anchoring STS-mapped scaffolds. Each scaffold is a series of contigs joined by read pairs (mates) across gaps whose mean & std. dev. are known. Each contig is a consensus built from reads (of several haplotypes), with SNPs and external “reads” marked.]


Assembly gaps

sequencing gap - we know the order and orientation of the contigs and have at least one clone spanning the gap

physical gap - no information known about the adjacent contigs, nor about the DNA spanning the gap

Sequencing gaps

Physical gaps


Shotgun sequencing statistics


Typical contig coverage

[Figure: reads stacked under a contig; coverage along the contig ranges from 1 to 6]

Imagine raindrops on a sidewalk


Lander-Waterman statistics

L = read length
T = minimum detectable overlap
G = genome size
N = number of reads
c = coverage (NL / G)
σ = 1 – T/L

$E(\#\text{islands}) = N e^{-c\sigma}$

$E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$

contig = island with 2 or more reads
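The two formulas can be evaluated directly. The sketch below is a plain transcription of them into Python; the function name `lander_waterman` is an illustrative choice, and the parameter values in the loop (1 Mbp genome, 600 bp reads, 40 bp detectable overlap) are only sample inputs.

```python
import math

def lander_waterman(G, L, T, N):
    """Return (coverage, expected #islands, expected island size)."""
    c = N * L / G              # coverage
    sigma = 1 - T / L          # fraction of a read usable for detecting overlap
    islands = N * math.exp(-c * sigma)
    island_size = L * ((math.exp(c * sigma) - 1) / c + 1 - sigma)
    return c, islands, island_size

for N in (1667, 5000, 8334, 13334):
    c, islands, size = lander_waterman(G=1_000_000, L=600, T=40, N=N)
    print(f"c={c:.1f}  N={N:>6}  E(#islands)={islands:8.1f}  E(island size)={size:12.0f}")
```

The expected number of islands first rises and then falls as reads are added, because new reads increasingly join existing islands instead of starting new ones.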


Example

c    N         #islands    #contigs    bases not in any read    bases not in contigs
1    1,667     655         614         698                      367,806
3    5,000     304         250         121                      49,787
5    8,334     78          57          20                       6,735
8    13,334    7           5           1                        335

Genome size: 1 Mbp; read length: 600 bp; detectable overlap: 40 bp


Experimental data

X coverage    # ctgs    % > 2X    avg ctg size (L-W)    max ctg size    # ORFs
1             284       54        1,234 (1,138)         3,337           526
3             597       67        1,794 (4,429)         9,589           1,092
5             548       79        2,495 (21,791)        17,977          1,398
8             495       85        3,294 (302,545)       64,307          1,762
complete      1         100       1.26 M                1.26 M          1,329

Caveat: numbers based on artificially chopping up the genome of Wolbachia pipientis dMel


Assembly paradigms

• Overlap-layout-consensus
– greedy (TIGR Assembler, phrap, CAP3...)
– graph-based (Celera Assembler, Arachne)

• Eulerian path (especially useful for short read sequencing)


TIGR Assembler/phrap

Greedy

• Build a rough map of fragment overlaps

• Pick the highest-scoring overlap

• Merge the two fragments

• Repeat until no more merges can be done
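The four steps above can be sketched as follows. This is a toy version, not the TIGR Assembler or phrap algorithm: it scores an overlap simply by the length of an exact suffix-prefix match, ignores reverse complements and sequencing errors, and the names `overlap_len` and `greedy_assemble` are hypothetical.

```python
def overlap_len(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    frags = list(dict.fromkeys(reads))      # drop exact duplicates, keep order
    while True:
        # Build the map of overlaps and pick the best-scoring one.
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:                     # no more merges can be done
            return frags
        # Merge the two fragments and repeat.
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)

print(greedy_assemble(["AGCTTAGG", "AGGCATCC", "TCCGATTG"]))
```

Because each merge is chosen locally, a repeat longer than the overlap can lead this strategy to join fragments that are not adjacent in the genome.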


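Overlap-layout-consensus

Main entity: read
Relationship between reads: overlap

[Figure: reads 1-9 shown as nodes of an overlap graph and as the corresponding layout; reads 1, 2 and 3 shown in alternative arrangements]

Aligned reads (consensus step):
ACCTGA
ACCTGA
AGCTGA
ACCAGA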
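Assuming the base string on this slide is four aligned 6-base reads (ACCTGA, ACCTGA, AGCTGA, ACCAGA), the consensus step can be illustrated with a simple column-wise majority vote. Real assemblers weigh base quality values as well; the function name `column_consensus` is hypothetical.

```python
from collections import Counter

def column_consensus(aligned_reads):
    """Majority base in each column of equal-length, already-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_reads)
    )

reads = ["ACCTGA", "ACCTGA", "AGCTGA", "ACCAGA"]
print(column_consensus(reads))
```

Each of the two disagreeing bases is outvoted three to one, so isolated sequencing errors do not reach the consensus.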


Paths through graphs and assembly

• Hamiltonian circuit: visit each node (city) exactly once, returning to the start

[Figure: a graph on nodes A-I, and the same nodes placed in order along the genome]
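A brute-force search makes the definition concrete. The sketch below uses backtracking on a small hypothetical directed graph (the edges of the graph in the figure are not recoverable from the text); its running time grows exponentially in the worst case, which is why assemblers do not search for such paths directly.

```python
def hamiltonian_circuit(graph, start):
    """Return a circuit visiting every node exactly once, or None.

    graph: dict mapping each node to a list of its successors.
    """
    path, visited = [start], {start}

    def extend():
        if len(path) == len(graph):
            return start in graph[path[-1]]      # can we return to the start?
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                if extend():
                    return True
                visited.remove(nxt)              # backtrack
                path.pop()
        return False

    return path + [start] if extend() else None

toy = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D", "A"], "D": ["A"]}
print(hamiltonian_circuit(toy, "A"))
```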


Implementation details


Overlap between two sequences

…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
       CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…

overlap (19 bases): GGATGCGCGGACACGTAGC / GGATGCGCTGACACGTAGC
overhang (6 bases): CAGGAC

overlap - region of similarity between the sequences
overhang - unaligned ends of the sequences

The assembler screens merges based on:
• length of overlap
• % identity in overlap region
• maximum overhang size

% identity = 18/19 = 94.7%
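The identity figure and the three screening criteria can be checked with a few lines of code. This is a sketch for an ungapped overlap; the threshold defaults in `accept_overlap` are illustrative assumptions, not values used by any particular assembler.

```python
def percent_identity(a, b):
    """% of matching positions between two equal-length overlap regions."""
    if len(a) != len(b):
        raise ValueError("overlap regions must have equal length (ungapped)")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def accept_overlap(overlap_len, identity_pct, overhang,
                   min_overlap=40, min_identity=94.0, max_overhang=10):
    """Screen a candidate merge; all three thresholds are hypothetical."""
    return (overlap_len >= min_overlap
            and identity_pct >= min_identity
            and overhang <= max_overhang)

top = "GGATGCGCGGACACGTAGC"       # overlap region of the first sequence
bottom = "GGATGCGCTGACACGTAGC"    # overlap region of the second sequence
pid = percent_identity(top, bottom)
print(f"{pid:.1f}% identity over {len(top)} bases")
print(accept_overlap(len(top), pid, overhang=6))
```

With a 40-base minimum the 19-base overlap from the slide is rejected on length alone, even though its identity is acceptable.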


All pairs alignment
• Needed by the assembler
• Try all pairs – must consider ~ n^2 pairs
• Smarter solution: only n x coverage (e.g. 8) pairs are possible
– Build a table of k-mers contained in sequences (single pass through the sequences)
– Generate the pairs from k-mer table (single pass through k-mer table)

[Figure: a k-mer shared by two sequences]
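The two passes can be sketched with a dictionary keyed by k-mer. This minimal version ignores reverse-complement matches and does not filter out very frequent k-mers, both of which a real overlapper would need to handle; `candidate_pairs` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return the set of read-index pairs that share at least one k-mer."""
    # Pass 1: build the k-mer table (k-mer -> ids of reads containing it).
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].add(read_id)
    # Pass 2: every pair of reads listed under the same k-mer is a candidate.
    pairs = set()
    for ids in table.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["AGCTTAGG", "AGGCATCC", "TCCGATTG", "GGGGGGGG"]
print(sorted(candidate_pairs(reads, k=3)))
```

Only the candidate pairs are then passed to the expensive alignment step, instead of all ~n^2 combinations.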

Assembly Pipeline

Trim & Screen
Overlapper
Unitiger
Scaffolder
Repeat Rez I, II

Overlapper: find all overlaps of at least 40 bp allowing 6% mismatch.

[Figure: an observed overlap between fragments A and B implies either a TRUE overlap or a REPEAT-INDUCED overlap]


Assembly Pipeline

Unitiger: compute all overlap-consistent sub-assemblies, i.e. unitigs (uniquely assembled contigs).

OVERLAP GRAPH

Edge types (between fragments A and B):
Regular Dovetail
Prefix Dovetail
Suffix Dovetail

Edges are annotated with deltas of overlaps.

The Unitig Reduction

1. Remove “Transitively Inferrable” Overlaps:

[Figure: fragments A, B and C with overlaps A-B, B-C and A-C; the A-C overlap is inferrable from the other two and is removed]
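A minimal sketch of this reduction, treating the overlap graph as a set of directed edges: an edge A→C is dropped when some B provides both A→B and B→C. An actual unitigger would also verify that the overlap lengths are consistent before discarding an edge; that check and the name `remove_transitive_edges` are assumptions of this example.

```python
def remove_transitive_edges(edges):
    """Drop every edge (a, c) for which edges (a, b) and (b, c) both exist."""
    edge_set = set(edges)
    successors = {}
    for a, b in edge_set:
        successors.setdefault(a, set()).add(b)
    inferrable = {
        (a, c)
        for a, c in edge_set
        for b in successors.get(a, ())
        if b != a and b != c and (b, c) in edge_set
    }
    return edge_set - inferrable

print(sorted(remove_transitive_edges([("A", "B"), ("B", "C"), ("A", "C")])))
```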

The Unitig Reduction

2. Collapse “Unique Connector” Overlaps:

[Figure: fragments A and B joined by a unique connector overlap and collapsed into one node; edge labels 412, 352 and 45]
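One possible reading of this step is sketched below: an edge A→B is treated as a unique connector when it is A's only outgoing edge and B's only incoming edge, and maximal chains of such edges are collapsed into one unitig. That definition, the function name, and the assumption that no cycle consists solely of unique connectors are all choices made for this illustration.

```python
def collapse_unique_connectors(nodes, edges):
    """Group nodes into chains linked by unique-connector edges."""
    out_edges = {n: [] for n in nodes}
    in_edges = {n: [] for n in nodes}
    for a, b in edges:
        out_edges[a].append(b)
        in_edges[b].append(a)

    def unique_next(a):
        """Return b if a->b is a unique connector, else None."""
        if len(out_edges[a]) == 1:
            b = out_edges[a][0]
            if b != a and len(in_edges[b]) == 1:
                return b
        return None

    unitigs = []
    for n in nodes:
        # Skip nodes that are reached through a unique connector:
        # they belong to the chain started by their predecessor.
        if len(in_edges[n]) == 1 and unique_next(in_edges[n][0]) == n:
            continue
        chain = [n]
        nxt = unique_next(n)
        while nxt is not None:
            chain.append(nxt)
            nxt = unique_next(nxt)
        unitigs.append(chain)
    return unitigs

print(collapse_unique_connectors(
    ["A", "B", "C", "D", "E"],
    [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")],
))
```

The chain stops at C because C has two outgoing overlaps, which is the kind of branching a repeat boundary produces.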

Identifying Unique DNA Stretches

[Figure: arrival intervals of reads in a unique DNA unitig versus a repetitive DNA unitig]

The discriminator statistic is the log-odds ratio of the probability that a unitig is unique DNA versus 2-copy DNA.

[Figure: distributions of the statistic for repetitive and for unique unitigs. Below -10: definitely repetitive; between -10 and +10: don’t know; above +10: definitely unique]
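The slide does not give a formula for the statistic. A common way to derive such a log-odds ratio, assumed here, is to model read arrivals as a Poisson process: with an expected count λ = (unitig length × total reads / genome size) for unique DNA and 2λ for 2-copy DNA, the natural-log odds simplify to λ − k·ln 2 for a unitig with k reads. The function names and the use of natural logarithms are assumptions of this sketch.

```python
import math

def discriminator(unitig_len, n_reads, total_reads, genome_size):
    """Log-odds (natural log) of unique vs. 2-copy under a Poisson model."""
    lam = unitig_len * total_reads / genome_size   # expected reads if unique
    return lam - n_reads * math.log(2)

def classify(score, threshold=10.0):
    if score > threshold:
        return "definitely unique"
    if score < -threshold:
        return "definitely repetitive"
    return "don't know"

# 10,000 reads of 800 bp on a 1 Mbp genome: 8x coverage, one read per 100 bp.
for n_reads in (100, 200):
    score = discriminator(10_000, n_reads, total_reads=10_000, genome_size=1_000_000)
    print(n_reads, round(score, 1), classify(score))
```

Short unitigs contain too few reads to move the score far from zero, which is why a "don't know" band is needed.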


Assembly Pipeline

Scaffold U-unitigs with confirmed pairs

[Figure: U-unitigs linked by mated reads]



Assembly Pipeline

Fill repeat gaps with doubly anchored positive unitigs

[Figure: a unitig with discriminator > 0 placed in a scaffold gap]



REPEATS


Handling repeats

1. Repeat detection
– pre-assembly: find fragments that belong to repeats
• statistically (most existing assemblers)
• repeat database (RepeatMasker)
– during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
– post-assembly: find repetitive regions and potential mis-assemblies
• REPuter, RepeatMasker
• "unhappy" mate-pairs (too close, too far, mis-oriented)

2. Repeat resolution
– find DNA fragments belonging to the repeat
– determine correct tiling across the repeat
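The three kinds of "unhappy" mate-pairs can be told apart from the placed coordinates of the two reads. The sketch below assumes mates should face each other (left read on the + strand, right read on the − strand) and uses a cutoff of three standard deviations around the library mean; the cutoff, the `PlacedRead` type and the function name are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class PlacedRead:
    start: int     # leftmost coordinate of the read in the assembly
    end: int       # rightmost coordinate (exclusive)
    strand: str    # "+" or "-"

def classify_mate_pair(a, b, mean, std, n_std=3.0):
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Mates are expected to point toward each other.
    if not (left.strand == "+" and right.strand == "-"):
        return "mis-oriented"
    span = right.end - left.start          # implied clone length
    if span < mean - n_std * std:
        return "too close"
    if span > mean + n_std * std:
        return "too far"
    return "happy"

# 2 Kbp library with a 6% standard deviation (120 bp).
fwd = PlacedRead(0, 550, "+")
print(classify_mate_pair(fwd, PlacedRead(1450, 2000, "-"), mean=2000, std=120))
print(classify_mate_pair(fwd, PlacedRead(400, 950, "-"), mean=2000, std=120))
```

A cluster of "too close" pairs over the same region is the kind of signal that would point to a collapsed repeat.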


Statistical repeat detection

Significant deviations from average coverage are flagged as repeats.

- frequent k-mers are ignored
- “arrival” rate of reads in contigs compared with theoretical value (e.g., 800 bp reads & 8x coverage: reads "arrive" every 100 bp)

Problem 1: assumption of uniform distribution of fragments - leads to false positives
- non-random libraries
- poor clonability regions

Problem 2: repeats with low copy number are missed - leads to false negatives
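The arrival-rate example is simple arithmetic: reads of length L at coverage c start, on average, every L / c bases. The sketch below compares that theoretical spacing with the spacing observed in a contig; the ratio is only a rough indicator, and both function names are hypothetical.

```python
def expected_spacing(read_len, coverage):
    """Average distance between consecutive read starts in unique DNA."""
    return read_len / coverage

def coverage_ratio(contig_len, n_reads, read_len, coverage):
    """How many times denser reads are in this contig than expected."""
    observed_spacing = contig_len / n_reads
    return expected_spacing(read_len, coverage) / observed_spacing

print(expected_spacing(800, 8))               # spacing for 800 bp reads at 8x
print(coverage_ratio(10_000, 200, 800, 8))    # a contig with twice as many reads as expected
```

A ratio near 1 is consistent with unique sequence, while a ratio near 2 suggests two collapsed copies, subject to the two problems listed above.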


Mis-assembled repeats

[Figure: three kinds of repeat-induced mis-assembly, each shown as the true layout and the mis-assembled layout: collapsed tandem, excision, and rearrangement. Segments are labeled a-f and repeat copies I-IV.]
