Whole Genome Assembly with iPlant
![Page 1: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/1.jpg)
Michael Schatz & Shoshana Marcus
Dec 4, 2013
CSHL Plant Genomes and Biotechnology
![Page 2: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/2.jpg)
Outline
1. Assembly theory
   1. Assembly by analogy
   2. De Bruijn and Overlap graph
   3. Coverage, read length, errors, and repeats
2. Genome assemblers
   1. Assemblathon
   2. ALLPATHS-LG
   3. Celera Assembler
3. Assembly Tutorial with iPlant
![Page 3: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/3.jpg)
Shredded Book Reconstruction
• Dickens accidentally shreds the first printing of A Tale of Two Cities
  – Text printed on 5 long spools
• How can he reconstruct the text?
  – 5 copies x 138,656 words / 5 words per fragment = 138k fragments
  – The short fragments from every copy are mixed together
  – Some fragments are identical
[Figure: five copies of the opening text, each shredded into five-word fragments at different offsets and mixed together]
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
![Page 4: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/4.jpg)
Greedy Reconstruction
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
of times, it was the
times, it was the age
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
it was the worst of
was the worst of times,
worst of times, it was
of times, it was the
times, it was the age
it was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
wisdom, it was the age
it was the age of
was the age of foolishness,
the worst of times, it
The repeated sequence makes the correct reconstruction ambiguous
• It was the best of times, it was the [worst/age]
Model the assembly problem as a graph problem
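The ambiguity can be reproduced in a few lines. This is a minimal sketch, assuming fragments are joined when the last four words of one equal the first four words of another; the helper name `successors` is hypothetical and not part of the slides.

```python
def successors(fragment, fragments, overlap=4):
    """Return fragments whose first `overlap` words match the last `overlap` words of `fragment`."""
    tail = fragment.split()[-overlap:]
    return sorted(
        {f for f in fragments if f != fragment and f.split()[:overlap] == tail}
    )

# A subset of the five-word fragments from the slide.
fragments = [
    "It was the best of",
    "was the best of times,",
    "the best of times, it",
    "best of times, it was",
    "of times, it was the",
    "times, it was the worst",
    "times, it was the age",
]

# One possible extension: the greedy choice is safe here.
print(successors("It was the best of", fragments))
# Two possible extensions: the repeat makes the next step ambiguous.
print(successors("of times, it was the", fragments))
```

A greedy assembler has to pick one of the two candidates at the repeat, which is why the slides move on to a graph that keeps both.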
![Page 5: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/5.jpg)
de Bruijn Graph Construction
• Dk = (V, E)
• V = All length-k subfragments (k < l)
• E = Directed edges between consecutive subfragments
• Nodes overlap by k-1 words
• Locally constructed graph reveals the global sequence structure
• Overlaps between sequences implicitly computed
Original Fragment: It was the best of
Directed Edge: It was the best → was the best of
de Bruijn, 1946 Idury and Waterman, 1995 Pevzner, Tang, Waterman, 2001
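The construction can be sketched on the word fragments themselves. This is an illustrative sketch, assuming nodes are k-word subfragments and edges link consecutive subfragments within a fragment; the function name `de_bruijn` is an assumption made for this example.

```python
from collections import defaultdict

def de_bruijn(fragments, k):
    """Map each k-word node to the set of k-word nodes that follow it in some fragment."""
    graph = defaultdict(set)
    for fragment in fragments:
        words = fragment.split()
        # All length-k subfragments of this fragment, in order.
        nodes = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
        # Consecutive subfragments overlap by k-1 words and get a directed edge.
        for left, right in zip(nodes, nodes[1:]):
            graph[left].add(right)
    return graph

graph = de_bruijn(["It was the best of", "was the best of times,"], k=4)
for node, nexts in graph.items():
    print(node, "->", sorted(nexts))
```

No fragment is ever compared against another fragment: two fragments that share a k-word node are connected simply because they map to the same key.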
![Page 6: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/6.jpg)
de Bruijn Graph Assembly
the age of foolishness
It was the best
best of times, it
was the best of
the best of times,
of times, it was
times, it was the
it was the worst
was the worst of
worst of times, it
the worst of times,
it was the age
was the age of
the age of wisdom,
age of wisdom, it
of wisdom, it was
wisdom, it was the
After graph construction, try to simplify the graph as much as possible
![Page 7: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/7.jpg)
de Bruijn Graph Assembly
the age of foolishness
It was the best of times, it
of times, it was the
it was the worst of times, it
it was the age of
the age of wisdom, it was the
After graph construction, try to simplify the graph as much as possible
![Page 8: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/8.jpg)
The full tale
… it was the best of times it was the worst of times …
… it was the age of wisdom it was the age of foolishness …
… it was the epoch of belief it was the epoch of incredulity …
… it was the season of light it was the season of darkness …
… it was the spring of hope it was the winter of despair …
[Figure: simplified graph of the passage, with node labels: it was the winter of despair; worst; best; of times; epoch of belief; incredulity; spring of hope; foolishness; wisdom; light; darkness; age of; season of]
![Page 9: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/9.jpg)
N50 size
Def: 50% of the genome is in contigs at least as large as the N50 value
Example: 1 Mbp genome
N50 size = 30 kbp (300k + 100k + 45k + 45k + 30k = 520k >= 500 kbp)
Note: N50 values are only meaningful to compare when the base genome size is the same in all cases
[Figure: a 1,000 kbp genome bar with contigs of 300, 45, 30, 100, 20, 15, 15, 10, …, 45 kbp and the 50% mark]
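The definition translates directly into code. This is a minimal sketch; the function name `n50` and the optional `genome_size` argument are assumptions for illustration. With `genome_size` given, the threshold is half the genome, as in the slide's example; without it, the threshold falls back to half the total contig length.

```python
def n50(contig_lengths, genome_size=None):
    """Return the contig length at which the running total reaches half the target size."""
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    # Walk contigs from largest to smallest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # the contigs cover less than half of the stated genome size

# Contig sizes (kbp) listed on the slide, for a 1 Mbp (1,000 kbp) genome.
contigs_kbp = [300, 45, 30, 100, 20, 15, 15, 10, 45]
print(n50(contigs_kbp, genome_size=1000))
```

The two modes can give very different answers for the same contigs, which is the point of the slide's note about comparing values only against the same base genome size.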
![Page 10: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/10.jpg)
Assembly Applications • Novel genomes
• Metagenomes
• Sequencing assays – Structural variations – Transcript assembly – …
![Page 11: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/11.jpg)
Assembling a Genome
1. Shear & Sequence DNA
2. Construct assembly graph from overlapping reads
   …AGCCTAGGGATGCGCGACACGT
   GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC
   CAACCTCGGACGGACCTCAGCGAA…
3. Simplify assembly graph
4. Detangle graph with long reads, mates, and other links
![Page 12: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/12.jpg)
Ingredients for a good assembly
Current challenges in de novo plant genome sequencing and assembly. Schatz MC, Witkowski J, McCombie WR (2012) Genome Biology. 13:243
Coverage
High coverage is required
– Oversample the genome to ensure every base is sequenced with long overlaps between reads
– Biased coverage will also fragment assembly
[Figure: Lander Waterman Expected Contig Length vs Coverage. x-axis: Read Coverage (0 to 40); y-axis: Expected Contig Length (bp), 100 to 1M; one curve per read length (1000, 710, 250, 100, 52, 30 bp); points for dog and panda mean and N50]
Read Length
Reads & mates must be longer than the repeats
– Short reads will have false overlaps forming hairball assembly graphs
– With long enough reads, assemble entire chromosomes into contigs
Quality
Errors obscure overlaps
– Reads are assembled by finding kmers shared in a pair of reads
– High error rate requires very short seeds, increasing complexity and forming assembly hairballs
![Page 13: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/13.jpg)
Typical contig coverage
[Figure: reads stacked along a contig, with coverage depth (1 to 6) plotted by position]
Imagine raindrops on a sidewalk
![Page 14: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/14.jpg)
Balls in Bins 1x
![Page 15: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/15.jpg)
Balls in Bins 2x
![Page 16: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/16.jpg)
Balls in Bins 4x
![Page 17: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/17.jpg)
Balls in Bins 8x
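The balls-in-bins picture can be put into numbers. This is a rough sketch, assuming each ball lands in a uniformly random bin and that "Nx" means N balls per bin on average; the function name and the default of 1,000 bins are illustrative choices, not values from the slides.

```python
import math

def expected_empty_fraction(coverage, bins=1000):
    """Expected fraction of bins left empty after throwing coverage * bins balls at random."""
    balls = coverage * bins
    # Each ball misses a given bin with probability (1 - 1/bins).
    return (1 - 1 / bins) ** balls

for c in (1, 2, 4, 8):
    exact = expected_empty_fraction(c)
    poisson = math.exp(-c)  # large-bin approximation
    print(f"{c}x: {exact:.4f} of bins empty (Poisson approximation {poisson:.4f})")
```

Even at 1x roughly a third of the bins stay empty, which is why average coverage well above 1x is needed before every position is sampled.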
![Page 18: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/18.jpg)
Coverage and Read Length
Idealized Lander-Waterman model
• Reads start at perfectly random positions
• Contig length is a function of coverage and read length
  – Short reads require much higher coverage to reach same expected contig length
• Need even higher coverage for higher ploidy, sequencing errors, sequencing biases
  – Recommend 100x coverage
[Figure: Lander Waterman Expected Contig Length vs Coverage. x-axis: Read Coverage (0 to 40); y-axis: Expected Contig Length (bp), 100 to 1M; one curve per read length (1000, 710, 250, 100, 52, 30 bp); points for dog and panda mean and N50]
Assembly of Large Genomes using Second Generation Sequencing Schatz MC, Delcher AL, Salzberg SL (2010) Genome Research. 20:1165-1173.
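The idealized model can be evaluated numerically. This is a sketch under stated assumptions: it uses the standard Lander-Waterman expression for expected contig length, L[(e^(c·σ) − 1)/c + (1 − σ)] with σ = 1 − T/L, where L is the read length, c the coverage, and T the minimum detectable overlap. The function name and the 20 bp overlap in the usage loop are illustrative and do not come from the slides.

```python
import math

def expected_contig_length(read_length, coverage, min_overlap=0):
    """Idealized Lander-Waterman expected contig length in bp (coverage must be > 0)."""
    # sigma is the fraction of a read that may remain unshared with its neighbor.
    sigma = 1 - min_overlap / read_length
    return read_length * ((math.exp(coverage * sigma) - 1) / coverage + (1 - sigma))

# Same coverage, same minimum overlap, different read lengths.
for read_length in (30, 100, 1000):
    length = expected_contig_length(read_length, coverage=20, min_overlap=20)
    print(f"{read_length} bp reads at 20x: about {length:,.0f} bp expected contigs")
```

Because a fixed minimum overlap takes a larger share of a short read, short reads need much more coverage to reach the same expected contig length. Real data fall short of these values because of repeats, errors, and coverage bias.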
![Page 19: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/19.jpg)
Unitigging / Unipathing
• After simplification and correction, compress graph down to its non-branching initial contigs
  – Aka “unitigs”, “unipaths”
  – Unitigs end because of (1) lack of coverage, (2) errors, and (3) repeats
Errors
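Compressing a graph to its non-branching paths can be sketched as follows. This is a simplified illustration, assuming the graph is a dictionary from node to a set of successor nodes; the function name `unitigs` is an assumption, and isolated cycles (which have no natural start node) are deliberately skipped.

```python
from collections import defaultdict

def unitigs(graph):
    """Return maximal non-branching paths of a directed graph {node: set(successors)}."""
    preds = defaultdict(set)
    nodes = set(graph)
    for node, nexts in graph.items():
        for nxt in nexts:
            preds[nxt].add(node)
            nodes.add(nxt)

    def out(node):
        return graph.get(node, set())

    def extends_previous(node):
        # True when the node has exactly one predecessor whose only successor is this node.
        return len(preds[node]) == 1 and len(out(next(iter(preds[node])))) == 1

    paths = []
    for start in sorted(nodes):
        if extends_previous(start):
            continue  # this node belongs in the middle of another path
        path = [start]
        # Extend while the walk is unambiguous in both directions.
        while len(out(path[-1])) == 1:
            nxt = next(iter(out(path[-1])))
            if len(preds[nxt]) != 1:
                break
            path.append(nxt)
        paths.append(path)
    return paths

example = {"a": {"b"}, "b": {"c"}, "x": {"c"}, "c": {"d", "e"}}
print(unitigs(example))
```

In the example, `a` and `b` merge into one path, while `c` stays alone because two paths converge on it and two diverge from it, the same pattern a repeat produces in an assembly graph.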
![Page 20: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/20.jpg)
Errors in the graph
(Chaisson, 2009)
[Figure: two graph cleanups on fragments containing the misspelling "tymes,". Clip Tips: a dead-end branch through "the worst of tymes," is removed next to the path through "the worst of times,". Pop Bubbles: parallel paths through "times," and "tymes," between "was the worst of" and "it was the age" are collapsed into one]
![Page 21: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/21.jpg)
Repetitive regions
• Over 50% of mammalian genomes are repetitive
  – Large plant genomes tend to be even worse
  – Wheat: 16 Gbp; Pine: 24 Gbp
| Repeat Type | Definition / Example | Prevalence |
| --- | --- | --- |
| Low-complexity DNA / Microsatellites | (b1b2…bk)N where 1 < k < 6, e.g. CACACACACACACACACACA | 2% |
| SINEs (Short Interspersed Nuclear Elements) | Alu sequence (~280 bp), Mariner elements (~80 bp) | 13% |
| LINEs (Long Interspersed Nuclear Elements) | ~500 – 5,000 bp | 21% |
| LTR (long terminal repeat) retrotransposons | Ty1-copia, Ty3-gypsy, Pao-BEL (~100 – 5,000 bp) | 8% |
| Other DNA transposons | | 3% |
| Gene families & segmental duplications | | 4% |
![Page 22: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/22.jpg)
Scaffolding
• Initial contigs (aka unipaths, unitigs) terminate at
  – Coverage gaps: especially extreme GC
  – Conflicts: errors, repeat boundaries
• Use mate-pairs to resolve correct order through assembly graph
  – Place sequence to satisfy the mate constraints
  – Mates through repeat nodes are tangled
• Final scaffold may have internal gaps called sequencing gaps
  – We know the order, orientation, and spacing, but just not the bases. Fill with Ns instead
[Figure: assembly graph with contigs A, B, C, and D connected through the repeat R, laid out as a linear scaffold containing A, B, C, D, and three copies of R]
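Filling sequencing gaps with Ns is simple once order, orientation, and spacing are known. This is a minimal sketch, assuming the contigs are already ordered and oriented and that each gap estimate is a non-negative number of bases; the function name `build_scaffold` and the toy sequences are hypothetical.

```python
def build_scaffold(contigs, gaps):
    """Join ordered, oriented contigs, writing each estimated gap as a run of N characters."""
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap estimate between each pair of adjacent contigs")
    pieces = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        pieces.append("N" * gap)  # the spacing is known, the bases are not
        pieces.append(contig)
    return "".join(pieces)

print(build_scaffold(["ACGT", "GGCC", "TTAA"], gaps=[3, 5]))
```

Working out the order and the gap sizes from mate-pair constraints is the hard part of scaffolding and is not attempted here.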
![Page 23: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/23.jpg)
Post-assembly Analysis
After assembly:
• Validation
• CEGMA
• BLAST
• Gene Finding
• Repeat mask
• RNA-seq
• *-seq
• …
• Publish!
![Page 24: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/24.jpg)
Outline
1. Assembly theory
   1. Assembly by analogy
   2. De Bruijn and Overlap graph
   3. Coverage, read length, errors, and repeats
2. Genome assemblers
   1. Assemblathon
   2. ALLPATHS-LG
   3. Celera Assembler
3. Assembly Tutorial with iPlant
![Page 25: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/25.jpg)
• Attempt to answer the question: “What makes a good assembly?”
• Organizers provided sequence data to assembly experts around the world
  – Assemblathon 1: ~100 Mbp simulated genome
  – Assemblathon 2: 3 vertebrate genomes each ~1 Gbp
• Results demonstrate trade-offs assemblers must make
Assemblathon 1: A competitive assessment of de novo short read assembly methods. Earl DA, et al. (2011) Genome Research. doi:10.1101/gr.126599.111
Assemblathon 2: Evaluating de novo methods of genome assembly in three vertebrate species. Bradnam KR, et al. (2013) GigaScience. 2:10. doi:10.1186/2047-217X-2-10
![Page 26: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/26.jpg)
Assembly Results
![Page 27: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/27.jpg)
Final Rankings
• ALLPATHS and SOAPdenovo came out neck-and-neck, followed closely by Celera Assembler, SGA, and ABySS
• My recommendation for “typical” short read assembly is to use ALLPATHS
• Single molecule sequencing is becoming extremely attractive if you have access
![Page 28: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/28.jpg)
Genome assembly with ALLPATHS-LG Iain MacCallum
![Page 29: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/29.jpg)
How ALLPATHS-LG works
reads → corrected reads → doubled reads → unipaths → localized data → local graph assemblies → global graph assembly → assembly
![Page 30: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/30.jpg)
ALLPATHS-LG sequencing model
*See next slide. **For best results. Normally not used for small genomes. However essential to assemble long repeats or duplications. Cutting coverage in half still works, with some reduction in quality of results. All: protocols are either available, or in progress.
![Page 31: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/31.jpg)
Error correction
Given a crystal ball, we could stack reads on the chromosomes they came from (with homologous chromosomes separate), then let each column ‘vote’:
[Figure: reads stacked on the chromosome; one column holds a single A among eight Cs, and the A is marked "change to C"]
But we don’t have a crystal ball....
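The crystal-ball version of voting is easy to write down. This is a toy sketch, assuming the reads are already placed on the chromosome and trimmed to the same columns, which is exactly the information a real assembler lacks; the function name `column_vote` is an assumption, and this is not ALLPATHS-LG's method.

```python
from collections import Counter

def column_vote(aligned_reads):
    """Return the majority base for every column of equal-length, pre-aligned reads."""
    consensus = []
    for column in zip(*aligned_reads):
        # The most common base in the column wins the vote.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

stack = ["GATCACA", "GATCACA", "GATAACA", "GATCACA"]
print(column_vote(stack))
```

The third read's A in the fourth column is outvoted by three Cs, mirroring the "change to C" step in the figure.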
![Page 32: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/32.jpg)
Error correction
ALLPATHS-LG. For every K-mer, examine the stack of all reads containing the K-mer. Individual reads may be edited if they differ from the overwhelming consensus of the stack. If a given base on a read receives conflicting votes (arising from membership of the read in multiple stacks), it is not changed. (K=24)
[Figure: a stack of reads aligned on a shared K-mer (← K →). Columns inside the kmer are homogeneous (all T); columns outside the kmer may be mixed (a single A among Cs, marked "change to C")]
Two calls at Q20 or better are enough to protect a base
![Page 33: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/33.jpg)
Read doubling
+ 28 28
More than one closure allowed (but rare).
To close a read pair (red), we require the existence of another read pair (blue), overlapping perfectly like this:
![Page 34: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/34.jpg)
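Unipaths
Unipath: unbranched part of genome – squeeze together perfect repeats of size ≥ K
[Figure: three panels with labels A, B, C, D, and R. Parts of genome: two regions that share the repeat R. Unipaths from these parts: A, B, C, D, and R. Unipath graph: the same unipaths connected through R]
Adjacent unipaths overlap by K-1 bases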
![Page 35: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/35.jpg)
Localization
I. Find ‘seed’ unipaths, evenly spaced across genome (ideally long, of copy number CN = 1)
II. Form neighborhood around each seed
[Figure: a seed unipath reaches to other unipaths (CN = 1) directly and indirectly; read pairs reach into repeats and are extended by other unipaths]
![Page 36: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/36.jpg)
Create assembly from global assembly graph
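[Figure: a graph with branches A/T and G/GG is turned into an assembly in four steps (flatten, scaffold, patch, fix), ending with the ambiguities written as {A,T} and {G,GG}]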
![Page 37: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/37.jpg)
19+ vertebrates assembled with ALLPATHS-LG
[Figure: scatter plot with axes scaffold N50 (Mb) and contig N50 (kb) for B6, 129, bushbaby, tenrec, ground squirrel, N. brichardi, NA12878, coelacanth, stickleback, shrew, A. burtoni, P. nyererei, M. zebra, female ferret, tilapia, spotted gar, male ferret, squirrel monkey, and chinchilla. Labeled values: spotted gar 69 kb, male ferret 67 kb, squirrel monkey 19 Mb]
![Page 38: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/38.jpg)
Genome assembly with the Celera Assembler
![Page 39: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/39.jpg)
Celera Assembler
1. Pre-overlap
   – Consistency checks
2. Trimming
   – Quality trimming & partial overlaps
3. Compute Overlaps
   – Find high quality overlaps
4. Error Correction
   – Evaluate difference in context of overlapping reads
5. Unitigging
   – Merge consistent reads
6. Scaffolding
   – Bundle mates, Order & Orient
7. Finalize Data
   – Build final consensus sequences
http://wgs-assembler.sf.net
![Page 40: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/40.jpg)
Single Molecule Sequencing Technology
PacBio RS II, Moleculo, Oxford Nanopore
![Page 41: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/41.jpg)
Hybrid Sequencing
Illumina Sequencing by Synthesis
– High throughput (60 Gbp/day)
– High accuracy (~99%)
– Short reads (~100 bp)
Pacific Biosciences SMRT Sequencing
– Lower throughput (1 Gbp/day)
– Lower accuracy (~85%)
– Long reads (5 kbp+)
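The accuracy figures are easier to compare as errors per read and as the chance that a seed is error-free. This is a back-of-the-envelope sketch, assuming errors are independent and uniformly distributed along a read; the function names and the 20-base seed length are illustrative assumptions.

```python
def expected_errors(read_length, accuracy):
    """Expected number of erroneous bases in one read."""
    return read_length * (1 - accuracy)

def error_free_seed_probability(accuracy, seed_length):
    """Probability that a seed of the given length contains no errors."""
    return accuracy ** seed_length

for name, read_length, accuracy in (("Illumina", 100, 0.99), ("PacBio", 5000, 0.85)):
    errors = expected_errors(read_length, accuracy)
    clean = error_free_seed_probability(accuracy, 20)
    print(f"{name}: ~{errors:.0f} errors per read, {clean:.1%} of 20-base seeds error-free")
```

At ~85% accuracy only a few percent of 20-base seeds survive intact, which is the motivation for correcting long reads with accurate short reads before assembly.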
![Page 42: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/42.jpg)
Hybrid Error Correction: PacBioToCA
1. Correction Pipeline
   1. Map short reads to long reads
   2. Trim long reads at coverage gaps
   3. Compute consensus for each long read
2. Error corrected reads can be easily assembled, aligned
Hybrid error correction and de novo assembly of single-molecule sequencing reads. Koren S, Schatz MC, et al. (2012) Nature Biotechnology. doi:10.1038/nbt.2280
http://wgs-assembler.sf.net
![Page 43: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/43.jpg)
Preliminary Rice Assemblies

| Assembly | Input data | Contig NG50 |
| --- | --- | --- |
| HiSeq Fragments | 50x 2x100bp @ 180 | 3,925 |
| MiSeq Fragments | 23x 459bp; 8x 2x251bp @ 450 | 6,332 |
| “ALLPATHS-recipe” | 50x 2x100bp @ 180; 36x 2x50bp @ 2100; 51x 2x50bp @ 4800 | 18,248 |
| PBeCR Reads | 19x @ 3500 ** MiSeq for correction | 50,995 |
| Enhanced PBeCR | 19x @ 3500 ** MiSeq for correction | 155,695 |

In collaboration with McCombie & Ware labs @ CSHL
![Page 44: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/44.jpg)
Assembly Summary
Assembly quality depends on
1. Coverage: low coverage is mathematically hopeless
2. Repeat composition: high repeat content is challenging
3. Read length: longer reads help resolve repeats
4. Error rate: errors reduce coverage, obscure true overlaps
• Assembly is hierarchical
  – Reads -> unitigs -> mates -> scaffolds -> optical / physical / genetic maps -> chromosomes
• Recommendations:
  – ALLPATHS-LG for Illumina-only
  – HGAP for PacBio-only, CA for Hybrid assembly
  – See Assemblathon papers for a more extensive analysis
![Page 45: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/45.jpg)
Outline
1. Assembly theory
   1. Assembly by analogy
   2. De Bruijn and Overlap graph
   3. Coverage, read length, errors, and repeats
2. Genome assemblers
   1. Assemblathon
   2. ALLPATHS-LG
   3. Celera Assembler
3. Assembly Tutorial with iPlant
![Page 46: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/46.jpg)
Assembly with ALLPATHS-LG
0. Download and install ALLPATHS-LG source code
   % wget ftp://ftp.broadinstitute.org/pub/crd/ALLPATHS/Release-LG/
   % configure && make && make install
1. Collect the BAM or FASTQ files that you wish to assemble. Create an in_libs.csv metadata file to describe your libraries and an in_groups.csv metadata file to describe your data files.
2. Prepare input files
   % cd /tmp/cshl/asm
   % PrepareAllPathsInputs.pl \
       DATA_DIR=`pwd` PLOIDY=1 >& prepare.log
3. Assemble.
   % RunAllPathsLG \
       PRE=/tmp REFERENCE_NAME=cshl \
       DATA_SUBDIR=asm RUN=default >& run.log
4. Get the results (four files).
   % cd /tmp/cshl/asm/default/ASSEMBLIES/test/
   % less final.{assembly,contigs}.{fasta,efasta}
![Page 47: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/47.jpg)
Assembly with iPlant
![Page 48: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/48.jpg)
Assembly Workflow
Upload Reads Minutes to Months
Quality Assessment Minutes to Hours
De novo Assembly Hours to Days
Assembly Assessment Minutes to Hours
![Page 49: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/49.jpg)
Upload Reads
![Page 50: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/50.jpg)
QC: FastQC
![Page 51: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/51.jpg)
QC: Read Coverage
[Figure: reads aligned to a reference, with Errors, Coverage, and Repeats marked, beside the Lander Waterman Expected Contig Length vs Coverage plot (Read Coverage 0 to 40; Expected Contig Length 100 bp to 1M; read lengths 1000, 710, 250, 100, 52, 30 bp; dog and panda mean and N50)]
![Page 52: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/52.jpg)
Estimating coverage with Kmers
Reference:
Reads:
…GAT TACA GATTACAC
TACACGGT…
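Counting k-mers needs no reference at all. This is a minimal sketch, assuming error-free reads held in memory. The function names are illustrative, and the conversion from k-mer coverage to base coverage uses the common relation that a read of length L contributes L − k + 1 k-mers, which is an assumption not stated on the slide.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def base_coverage(kmer_coverage, read_length, k):
    """Convert a k-mer coverage estimate to base coverage for reads of a fixed length."""
    return kmer_coverage * read_length / (read_length - k + 1)

# Two overlapping reads taken from the slide.
counts = count_kmers(["GATTACAC", "TACACGGT"], k=4)
print(counts.most_common(3))
print(base_coverage(kmer_coverage=30, read_length=100, k=21))
```

K-mers from the overlap of the two reads are seen twice and the rest once. On real data, the peak of the histogram of these counts gives the k-mer coverage.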
![Page 53: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/53.jpg)
Estimating coverage with Kmers
Reference:
Reads:
NA12878
![Page 54: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/54.jpg)
Wheat Genome (A. tauschii / CSHL)
![Page 55: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/55.jpg)
Heterozygous Genome
Contact: @mike_schatz
![Page 56: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/56.jpg)
QC: Mer counts
Frag1.fq Frag2.fq
FASTX_fastq-to-fasta FASTX_fastq-to-fasta
Suffixerator
Tallymer-mkindex
A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. Kurtz S, Narechania A, Stein JC, Ware D. (2008) BMC Genomics. 9:517
![Page 57: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/57.jpg)
Running ALLPATHS-LG
![Page 58: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/58.jpg)
Post-QC: CEGMA
CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes Parra G, Bradnam K, Korf I. (2007) Bioinformatics. 23 (9): 1061-1067.
![Page 59: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/59.jpg)
Assembly Workflow
Upload Reads Minutes to Months
Quality Assessment Minutes to Hours
De novo Assembly Hours to Days
Assembly Assessment Minutes to Hours
![Page 60: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/60.jpg)
Resources
• iPlant
  – http://www.iplantcollaborative.org/
• Assembly Competitions
  – Assemblathon: http://assemblathon.org/
  – GAGE: http://gage.cbcb.umd.edu/
• Assembler Websites:
  – ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/
  – SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html
  – Celera Assembler: http://wgs-assembler.sf.net
• Tools:
  – FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  – Tallymer: http://www.zbh.uni-hamburg.de/?id=211
  – CEGMA: http://korflab.ucdavis.edu/datasets/cegma/
![Page 61: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/61.jpg)
Acknowledgements
Special Thanks
Shoshana Marcus, James Gurtowski, Roger Barthelson, Stephen Goff, Nicole Hopkins, Dan Stanzione, Joshua Stein, Matthew Vaughn, Doreen Ware, Jason Williams
![Page 62: Whole Genome Assembly with iPlant](https://reader033.fdocuments.in/reader033/viewer/2022051521/586a081c1a28ab51458b7a8f/html5/thumbnails/62.jpg)
Questions? http://schatzlab.cshl.edu/
@mike_schatz