Genome Assembly Final Results
JERI DILTS, SUZANNA KIM
HEMA NAGRAJAN, DEEPAK PURUSHOTHAM
AMBILY SIVADAS, AMIT RUPANI
LEO WU
02-22-2012
Outline
  Pipeline for evaluation
  Quantitative evaluation
  Qualitative evaluation
  Choosing the BEST assembly
  Final results
  Demo
Pipeline for evaluation
Strategy – Key alterations
Prinseq preprocessing
  Unnecessary: the assemblers have built-in capabilities
  Use Prinseq for data statistics
Error correction
  Does not fit our methods
  Coral is based on overlap-layout-consensus and works best with de Bruijn graph assemblers
  Echo has never been tested on 454 data
Final assemblers
  Newbler, Mira, Celera, AMOScmp
Discarded assemblers
  ABySS, Velvet, and Pcap454
MAIA hybrid assembly
  Needs a phylogenetically close reference genome
Quantitative Evaluation
Metrics
  No. of contigs -> the fewer the better
  N50 -> the higher the better
  Assembly size -> the closer to the estimated genome size, the better
Quantitative Assembly Score
$$\text{Score} = \frac{\text{N50} \times \text{Assembly size}}{\text{No. of contigs}}$$
The higher the score, the better!
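N50 is the least self-explanatory of the three metrics, so here is a minimal sketch of how it is usually computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name `n50` and the contig lengths are illustrative assumptions, not values from the slides.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least half of the assembly.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (not from the slides); total = 290.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))  # 80 + 70 = 150 >= 145, so N50 is 70
```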
M19107 - Evaluation
Runs                   # Contigs   N50      Total Size   Score
Newbler                199         16319    1753573      8.16
Mira                   201         19353    1790088      8.24
Celera                 146         20609    1747621      8.39
Newbler_Mira           129         25914    1774129      8.55
Newbler_Celera         104         27207    1719874      8.65
Newbler_Mira_Celera    96          27478    1701316      8.69
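As a check on the table, the sketch below recomputes the quantitative score from the three metrics. The raw ratio N50 * size / contigs is on the order of 10^8, while the tabulated scores are between 8 and 9; they appear to match the base-10 logarithm of the ratio. That log10 step is an inference from the numbers, not something the slides state, and the function name `assembly_score` is made up for illustration.

```python
import math


def assembly_score(n50, assembly_size, num_contigs):
    """Raw quantitative score: N50 * assembly size / number of contigs."""
    return n50 * assembly_size / num_contigs


# (contigs, N50, total size, tabulated score), copied from the M19107 table.
runs = {
    "Newbler": (199, 16319, 1753573, 8.16),
    "Mira": (201, 19353, 1790088, 8.24),
    "Celera": (146, 20609, 1747621, 8.39),
    "Newbler_Mira": (129, 25914, 1774129, 8.55),
    "Newbler_Celera": (104, 27207, 1719874, 8.65),
    "Newbler_Mira_Celera": (96, 27478, 1701316, 8.69),
}

for name, (contigs, n50, size, reported) in runs.items():
    raw = assembly_score(n50, size, contigs)
    # Assumption: the slide reports log10 of the raw ratio.
    print(f"{name:20s} raw={raw:.3e} log10={math.log10(raw):.2f} reported={reported}")
```

Under that assumption, a difference of 0.3 between two scores corresponds to roughly a twofold difference in the raw ratio.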
Qualitative Evaluation
Strategy
  Align the assembly contigs to the original reference genome and compute differences
Challenges
  No original reference genome for our data set
Approach
  Create simulated 454 read datasets from a completely sequenced genome
Tools used
  FlowSim, 454Sim, Art-454
FlowSim
A simulation pipeline based on real data
Lets you model each step of the pyrosequencing process
Utilities:
  Clonesim: To simulate the shearing step
    Usage: clonesim -c count -l dist input.fasta
  Gelfilter: To select a certain range of clone lengths
    Usage: gelfilter min max
  Kitsim: To attach A and B adaptors
    Usage: kitsim -k key -a adapter input.fasta -o output.fasta
  Mutator: To introduce random substitutions and indels in the sequences
    Usage: mutator -i indel_rate -s subst_rate input.fasta -o output.fasta
  Duplicator: To generate artificial duplicates of many clones
    Usage: duplicator dup_prob
  Flowsim: To simulate the actual pyrosequencing process
    Usage: flowsim -G generation input.fasta -o output.sff
Example: clonesim -c 400000 -l "Normal 350 95" input.fasta | gelfilter 25 600 | kitsim | mutator | duplicator 0.03 | flowsim -G Titanium -o output.sff
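When the simulation is rerun with different clone counts or generations, it can help to assemble the pipeline from a list of stages instead of editing one long command. The sketch below only builds the command string; every tool name and argument is copied from the example pipeline, `input.fasta` and `output.sff` are placeholder file names, and `build_pipeline` is a hypothetical helper, not part of FlowSim.

```python
import shlex

# Stages and arguments copied from the example pipeline.
# input.fasta and output.sff are placeholder file names.
stages = [
    ["clonesim", "-c", "400000", "-l", "Normal 350 95", "input.fasta"],
    ["gelfilter", "25", "600"],
    ["kitsim"],
    ["mutator"],
    ["duplicator", "0.03"],
    ["flowsim", "-G", "Titanium", "-o", "output.sff"],
]


def build_pipeline(stages):
    """Quote each argument, join each stage, and connect stages with pipes."""
    return " | ".join(
        " ".join(shlex.quote(arg) for arg in stage) for stage in stages
    )


command = build_pipeline(stages)
print(command)
# To run it (assumes the FlowSim utilities are installed and on PATH):
#   import subprocess; subprocess.run(command, shell=True, check=True)
```

Quoting with `shlex.quote` keeps the multi-word distribution argument ("Normal 350 95") as a single shell token.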
454Sim
454 simulation at higher speed and accuracy
Unique selling point: configurable statistical models
Supports GS FLX, Titanium and GS 20
Utilities:
  fragsim: To simulate shearing
    Usage: fragsim -c 1000000 -l 1000 genome.fasta > genome.fragments.fasta
  454sim: To simulate the sequencing step
    Usage: 454sim -o genome.sff genome.fragments.fasta
Example: fragsim -c 250000 -l 1000 genome.fasta | 454sim -g FLX -o genome.sff
ART-454
Supports Illumina, 454 and SOLiD read simulation
Used for the 1000 Genomes Project
Usage:
  art_454 Input.fasta Output_prefix Fold_coverage (single-end reads)
  art_454 Input.fasta Output_prefix Fold_coverage Mean_Frag_Len Std_Deviation (paired-end reads)
Running pipeline on Simulated reads
Reference – Haemophilus influenzae F3047 (NC_014922)
Ran 454Sim, FlowSim and Art-454 to generate reads
Ran de novo assemblers - Newbler, Mira3 and Celera (CABOG)
Merged assemblies using Minimus2
Evaluate Assembly Accuracy (How?)
Assembly Accuracy
Challenges
  Alignment of contigs to the reference genome
Approach
  Local alignment (BLAST, bwa, bowtie)
  Whole genome alignment (Mauve, MUMmer)
Align the assembly to the reference genome
Compute nucleotide differences, gaps and rearranged segments
Mauve
Uses positional homology genome alignment
  Each site in the assembly maps to at most one site on the reference
  Optimized contiguity
  E.g. progressiveMauve
Ordering of contigs: Mauve Contig Mover algorithm
Compare to identify differences
Mauve Genome Aligner
After Ordering of Contigs
Mauve Assembly Metrics
Basecalling accuracy
  Count and location of bases called wrongly
  Direction of miscalling, e.g. A->G
  Count and location of bases predicted to exist, but uncalled
Genome content accuracy
  Count and location of bases missing from the assembly
  Count and location of extra bases in the assembly
  Size distribution of the missing and extra fragments
Genome structure accuracy
  Estimate of misassembly count
Example
Reference genome: AGGCTAGCGCGCGATTAGGATC
Assembly:         AGTAGCGGGCCGATTAAGANC
Alignment:
  AGGCTAGCGCG-CGATTAGGATC
  AG--TAGCGGGCCGATTAAGANC
Miscalls: 2 (C->G and G->A)
Uncalled bases: 1 (N)
Extra bases: 1 (insertion of C)
Missing bases: 2 (deletion of GC)
Missing segments: 1
Extra segments: 1
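The counts in this example can be reproduced by walking the two aligned strings column by column. The sketch below is a simplified illustration of that bookkeeping, not Mauve's implementation; the function name `score_alignment` and the dictionary keys are made up, and it assumes uppercase sequences with `-` for gaps and `N` for uncalled bases.

```python
def score_alignment(ref_aln, asm_aln):
    """Count differences between two gapped, equal-length aligned strings."""
    if len(ref_aln) != len(asm_aln):
        raise ValueError("aligned sequences must have equal length")
    counts = {
        "miscalls": [],         # substitutions, recorded as "ref->asm"
        "uncalled": 0,          # N in the assembly
        "missing_bases": 0,     # reference base with no assembly base
        "extra_bases": 0,       # assembly base with no reference base
        "missing_segments": 0,  # runs of consecutive missing bases
        "extra_segments": 0,    # runs of consecutive extra bases
    }
    prev = None  # gap state of the previous column: "missing", "extra", or None
    for r, a in zip(ref_aln, asm_aln):
        state = None
        if a == "-" and r != "-":
            state = "missing"
            counts["missing_bases"] += 1
        elif r == "-" and a != "-":
            state = "extra"
            counts["extra_bases"] += 1
        elif a == "N":
            counts["uncalled"] += 1
        elif r != a:
            counts["miscalls"].append(f"{r}->{a}")
        # A new segment starts whenever the gap state changes to a gap.
        if state is not None and state != prev:
            counts[state + "_segments"] += 1
        prev = state
    return counts


result = score_alignment("AGGCTAGCGCG-CGATTAGGATC",
                         "AG--TAGCGGGCCGATTAAGANC")
print(result)
```

For the example alignment this yields two miscalls (C->G and G->A), one uncalled base, two missing bases in one segment, and one extra base in one segment, matching the slide.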
Scoring simulated reads with Mauve
Reference – Haemophilus influenzae F3047 (NC_014922)
Ran 454Sim, FlowSim and Art-454 to generate reads
Ran de novo assemblers - Newbler, Mira3 and Celera (CABOG)
Merged assemblies using Minimus2
Ran Mauve to align the assemblies back to the reference genome
Computed assembly metrics
Miscalled Bases (bar chart: number of miscalled bases)
  Newbler            16
  Mira               52
  CA                 8
  Newbler+Mira       18
  Newbler+CA         24
  Newbler+Mira+CA    36
Uncalled bases (bar chart: number of uncalled bases)
  Newbler            (no value shown)
  Mira               3
  CA                 (no value shown)
  Newbler+Mira       14
  Newbler+CA         7
  Newbler+Mira+CA    34
Total missing bases (bar chart: number of missed bases)
  Newbler            90,490
  Mira               76,648
  CA                 92,196
  Newbler+Mira       73,632
  Newbler+CA         82,121
  Newbler+Mira+CA    195,619
Total extra segments (bar chart: number of extra bases)
  Newbler            709
  Mira               4,387
  CA                 1,907
  Newbler+Mira       5,895
  Newbler+CA         4,590
  Newbler+Mira+CA    5,218
Choosing the BEST assembly
Quantitative metrics
  N50, contig count, assembly size
Qualitative metrics
  Miscalled bases, uncalled bases, missing bases, extra bases
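Assembly Scores
Quantitative Score
$$\text{Quantitative Score} = \frac{\text{N50} \times \text{Assembly size}}{\text{No. of contigs}}$$
Qualitative Score (% Accuracy)
$$\text{Qualitative Score} = 1 - \frac{\text{Miscalls} + \text{Uncalled} + \text{Missing} + \text{Extra} + \text{Gaps in Ref} + \text{Gaps in Assembly}}{\text{Reference Size}}$$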
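A minimal sketch of the qualitative score as a function, assuming every term is a base count and the result is reported as a fraction of the reference size. The function name `quality_score` is hypothetical. The inputs are toy values taken from the worked Mauve example (22-base reference, 2 miscalls, 1 uncalled, 2 missing, 1 extra), with both gap terms set to 0 as a placeholder because that example does not give them.

```python
def quality_score(reference_size, miscalls, uncalled, missing, extra,
                  gaps_in_ref=0, gaps_in_assembly=0):
    """Accuracy = 1 - (sum of all error counts) / reference size."""
    if reference_size <= 0:
        raise ValueError("reference_size must be positive")
    errors = (miscalls + uncalled + missing + extra
              + gaps_in_ref + gaps_in_assembly)
    return 1 - errors / reference_size


# Toy values from the worked Mauve example; gap terms are placeholders (0).
score = quality_score(reference_size=22, miscalls=2, uncalled=1,
                      missing=2, extra=1)
print(f"{100 * score:.1f}%")  # 1 - 6/22, printed as a percentage
```

On real assemblies the reference is millions of bases long, so the score sits much closer to 100% than in this toy case.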
Metrics Summary – Art-454
[Figure: assembly score and quality score per assembly; values not recoverable from the transcript.]
Assembly spec. vs Accuracy plot – 454Sim
[Plot. Axis labels: Assembly score, Quality of output. Tick ranges: 0.1 to 0.5 and 90.0% to 100.0%. Points: Mira, Newbler+CA+Mira, Newbler+Mira, Newbler+CA, CA, Newbler.]
Assembly spec. vs Accuracy plot – Art-454
[Plot. Axis labels: Assembly score, Quality of output. Tick ranges: 0 to 6 and 90.0% to 100.0%. Points: Mira, Newbler, Newbler+CA, CA, Newbler+Mira, Newbler+Mira+CA.]
Assembly spec. vs Accuracy plot – FlowSim
[Plot. Axis labels: Assembly score, Quality of output. Tick ranges: 0 to 8 and 90.0% to 100.0%. Points: Mira, Newbler+Mira, Newbler+Mira+CA, Newbler+CA, CA, Newbler.]
Assembly spec. vs Accuracy plot – M21709
[Plot. Axis labels: Assembly score, Quality of output. Tick ranges: 0 to 8 and 50.0% to 80.0%. Points: Newbler+Mira, AMOScmp, Celera, Mira, Newbler+Mira+CA, Newbler, Newbler+CA, AMOScmp+Newbler.]
Inference
Striking a balance is critical
We chose
  Newbler + MIRA for H. haemolyticus
  Newbler + AMOScmp for H. influenzae
Universally applicable pipeline
  Adopt the most consistent tool/pipeline (conservative approach): NEWBLER
Assembling specific genomes/strains
  Choose the one that strikes the best balance for your genome: NEWBLER + (CELERA/MIRA)
Final Results
Genome    Contig #   N50      Size      Method
M19107    129        25914    1774129   Newbler + Mira
M19501    19         284900   1809865   Newbler + Mira
M21127    32         122121   2029793   Newbler + Mira
M21621    27         139238   1959123   Newbler + Mira
M21639    56         87673    2397857   Newbler + Mira
M21709    28         140484   1808157   Newbler + AMOScmp
Key take-aways
Understand your data
  Platform, long/short reads, coverage, paired/non-paired, quality of basecalling, etc.
Evaluate the need for error correction
Choose a set of “best” assemblers
  De novo / reference assembly, DBG/OLC algorithm
Merge assemblies
Ordering and scaffolding
Finishing
Evaluate your assembly at every step to ensure that you are on the right track!
Coming next >>>Demo