Genome Assembly Forensics and Visualisation
Nathan S. Watson-Haigh
Fri 11th May 2012, ACPFG Journal Club
Schatz, M.C. et al., 2007. Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biology, 8(3), p.R34.
Phillippy, A.M., Schatz, M.C. & Pop, M., 2008. Genome assembly forensics: finding the elusive mis-assembly. Genome Biology, 9(3), p.R55.
Schatz, M.C. et al., 2011. Hawkeye and AMOS: Visualizing and Assessing the Quality of Genome Assemblies. Briefings in Bioinformatics. Available at: http://bib.oxfordjournals.org/content/early/2011/12/23/bib.bbr074.
Overview
• Genome Assembly
• N50/N90/N95
• Paired-end and Matepair Reads
• Mis-assembly Signatures
• Assembly Validation and Manual Editing
Genome Assembly – Shotgun Reads
aligned shotgun reads
DNA being sequenced
Genome Assembly – Repeats
reads from different repeats can’t be resolved
double coverage
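The "double coverage" signature can be sketched in code. This is a minimal illustration, not the method of any particular tool: it assumes reads are given as hypothetical half-open `(start, end)` intervals on a contig and flags stretches whose depth is well above the median. The function names and the `ratio=1.8` cut-off are assumptions chosen for the example.

```python
from statistics import median


def depth_profile(contig_length, reads):
    """Per-base read depth from half-open (start, end) read intervals."""
    delta = [0] * (contig_length + 1)
    for start, end in reads:
        delta[start] += 1
        delta[end] -= 1
    depth, running = [], 0
    for change in delta[:contig_length]:
        running += change
        depth.append(running)
    return depth


def high_coverage_regions(depth, ratio=1.8):
    """Half-open regions whose depth is at least ratio x the median depth."""
    threshold = ratio * median(depth)
    if threshold == 0:
        return []  # no usable baseline coverage
    regions, start = [], None
    for pos, d in enumerate(depth):
        if d >= threshold and start is None:
            start = pos
        elif d < threshold and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


# Hypothetical contig: depth 2 everywhere except depth 4 over positions 10-19,
# as might be seen where two repeat copies have been collapsed into one.
reads = [(0, 30), (0, 30), (10, 20), (10, 20)]
depth = depth_profile(30, reads)
print(high_coverage_regions(depth))
```

Real data is far noisier than this toy example, so a fixed ratio against the median is only a starting point for spotting candidate collapsed repeats.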
Genome Assembly – Diploid
Assembly Metrics – N50
• The N50 is the most widely reported metric for de novo assemblies
• It is a single measure of the contig length distribution of an assembly
– If contigs are sorted into descending length order, the N50 is the length of the contig at which the running total first reaches at least 50% of the total length of all the contigs
– Commonly reported with the N90 and N95
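The definition above can be written as a short function. This is a sketch only; the function name `nx` and the example contig lengths are made up for illustration.

```python
def nx(contig_lengths, percent):
    """Return the Nx contig length; percent=50 gives the N50."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    ordered = sorted(contig_lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return ordered[-1]


# Hypothetical assembly of 8 contigs, 320 bases in total.
contigs = [100, 80, 60, 40, 20, 10, 5, 5]
for p in (50, 90, 95):
    print(f"N{p} = {nx(contigs, p)}")
```

For these lengths the running totals are 100, 180, 240, 280, 300, 310, so the N50 is 80, the N90 is 20 and the N95 is 10.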
Assembly Metrics – N50
[Figure: contigs summed in descending length order to reach the N50, N90 and N95]
Assembly Metrics – N50
• These stats DO NOT imply anything about assembly quality
– Could simply concatenate contigs together to get a better N50!!
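A small sketch of why a better N50 does not mean a better assembly: joining every contig end to end, with no evidence for the joins, raises the N50 without adding any correct sequence. The contig lengths are hypothetical and `n50` is an illustrative helper.

```python
def n50(lengths):
    """Length of the contig at which the running total reaches half the assembly.

    Returns None for an empty list.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return None


contigs = [100, 80, 60, 40, 20, 10, 5, 5]  # hypothetical contig lengths
glued = [sum(contigs)]                     # all contigs arbitrarily concatenated

print(n50(contigs))  # N50 of the honest assembly
print(n50(glued))    # N50 after meaningless concatenation
```

The total assembly length is identical in both cases; only the reported N50 changes.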
Paired-end Reads
Matepair Reads
Paired-end and Matepair Reads
Paired-end Matepair
reverse complement
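For reference, reverse complementing a read means complementing each base and reversing the result, which gives the sequence as read from the opposite strand. A minimal sketch, assuming reads contain only A, C, G, T and N; the function name is illustrative.

```python
# Translation table mapping each base to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
print(reverse_complement("AAGCTT"))  # a palindromic site is its own reverse complement
```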
So, Why are Pairs so Useful?
Pairs are Useful – Orientation and Separation
Incorrect orientation
Incorrect distance
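The two checks can be sketched as a simple classifier for one mapped read pair. Everything here is an assumption made for illustration: the `MappedRead` record, the expectation that a good pair faces inwards (left read on the forward strand, right read on the reverse strand), and the tolerance of 3 standard deviations around the library's mean insert size. Libraries with the opposite expected orientation would need the orientation test flipped.

```python
from dataclasses import dataclass


@dataclass
class MappedRead:
    contig: str
    start: int      # leftmost aligned base, 0-based
    end: int        # one past the rightmost aligned base
    forward: bool   # True if aligned to the forward strand


def classify_pair(a, b, mean_insert, sd_insert, max_sd=3.0):
    """Label a read pair as ok, or name the first inconsistency found."""
    if a.contig != b.contig:
        return "different contigs"
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Assumed library layout: reads point towards each other.
    if not (left.forward and not right.forward):
        return "incorrect orientation"
    span = right.end - left.start
    if abs(span - mean_insert) > max_sd * sd_insert:
        return "incorrect distance"
    return "ok"


# Hypothetical 3kb library with a standard deviation of 300bp.
good = classify_pair(MappedRead("c1", 100, 200, True),
                     MappedRead("c1", 3000, 3100, False), 3000, 300)
flipped = classify_pair(MappedRead("c1", 100, 200, True),
                        MappedRead("c1", 3000, 3100, True), 3000, 300)
stretched = classify_pair(MappedRead("c1", 100, 200, True),
                          MappedRead("c1", 10000, 10100, False), 3000, 300)
print(good, flipped, stretched, sep="\n")
```

A single bad pair means little; it is clusters of pairs with the same inconsistency at one location that point to a mis-assembly.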
Mis-assembly Signatures – Collapsed Tandem Repeat
Correct alignment
Incorrect alignment
Mis-assembly Signatures – Collapsed Tandem Repeat
Mis-assembly
Correct assembly
Mis-assembly Signatures – Collapsed (small) Tandem Repeat
Mis-assembly
Correct assembly
Mis-assembly Signatures – Collapsed Repeat
Mis-assembly
Correct assembly
Mis-assembly Signatures – Rearrangement
Mis-assembly
Correct assembly
Automated Assemblies Are One Thing, Good Assemblies Are Another
• Given the computer resources you can generate an automated assembly in a few weeks
– Not necessarily good
– Need to optimise assembly parameters
• For small genomes (< ~15Mbases)
– Commodity hardware
– OLC assemblers
• For larger genomes
– More RAM (10s-100s of Gbytes) for OLC assemblers
– De Bruijn graph assemblers
– Read mapping step to generate contig read alignments
Automated Assemblies Are One Thing, Good Assemblies Are Another
• Automated assemblies need to be checked for mis-assemblies
– Need paired-end/matepair reads
– Need viewers to visualise paired-end data
– Need editors to break/join/reassemble parts of the assembly deemed to be inconsistent with read pair info
– Need enough computer hardware to allow all this data to be loaded – especially with large volumes of Illumina paired-end data
Automated Assemblies Are One Thing, Good Assemblies Are Another
• Very time consuming and laborious to check/edit
– Small assemblies (< ~15Mbases)
• Several weeks/few months to move 1 scaffold/contig at a time
– Large assemblies need a team to do the same thing
• Need enough RAM to load all the paired-end data
• Need ways to identify regions requiring closer inspection
• Identify possible mis-assemblies
• Major hurdles
– Software inadequacies
– Time
– File formats! Grrrr!
Software Inadequacies
Software   | Contig View | Scaffold View | Editing | Reassemble | Clipping Info | Other
SeqMan Pro | 9           | 9             | 6       | 6          | 6             | $$, buggy, not for large assemblies (32bit), 1 template size
Gap5       | 6           | NA            | 9       | NA         | 8             | Free, join editor, contig comparator, poor visual support for many contigs, shuffle pads, ACE, multiple template sizes
Consed     | 6           | 6             | 5       | 9          | 6             | Free/US$2500/US$10k, poor visual support for many contigs, multiple template sizes
Hawkeye    | 9           | 9             | NA      | NA         | 7             | Leverages AMOS, automated detection of mis-assemblies, large assemblies, modular
SeqMan Pro – Strategy View
SeqMan Pro
Gap5 – Template View
Gap5 – Contig Comparator
Gap5 – Join Editor
Gap5 – Contig Editor
Consed – Assembly View
Consed – Contig Viewer/Editor
Scaffold/Contig Length Distribution
Library Stats
Compression-Expansion (CE) Statistic
• A measure of the deviation of the local distribution of insert sizes from the global distribution of insert sizes
– 0 indicates no deviation
– ≤ -3 indicates much compression
– ≥ +3 indicates much expansion
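The CE statistic is usually defined as a z-score: the mean insert size of the pairs spanning a position, minus the library-wide mean, divided by the standard error (library standard deviation over the square root of the number of local inserts). Negative values indicate compression and positive values expansion. A minimal sketch follows; the function name and the example library (mean 3000bp, standard deviation 300bp) are assumptions for illustration.

```python
import math


def ce_statistic(local_inserts, global_mean, global_sd):
    """Z-score of the local mean insert size against the library-wide distribution."""
    n = len(local_inserts)
    if n == 0:
        raise ValueError("no inserts span this position")
    if global_sd <= 0:
        raise ValueError("global standard deviation must be positive")
    local_mean = sum(local_inserts) / n
    return (local_mean - global_mean) / (global_sd / math.sqrt(n))


# Hypothetical 3kb library: nine pairs spanning one position.
compressed = ce_statistic([2700] * 9, 3000, 300)
expanded = ce_statistic([3300] * 9, 3000, 300)
typical = ce_statistic([3000] * 9, 3000, 300)
print(compressed, typical, expanded)
```

Because the standard error shrinks as more pairs span a position, a modest but consistent shift in insert size across many pairs gives a large CE value, which is what makes the statistic useful for finding collapsed or expanded regions.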
[Figure: insert coverage and read coverage for libraries with 500bp, 3kb and 20kb inserts]
AMOSvalidate
• An assembly analysis pipeline to identify possible mis-assemblies
– Paired-end data
• CE stats
• Incorrect orientation
• Missing mate
– Coverage
– SNP density
– Singletons
Hawkeye Cons
• Poor support for correcting mis-assemblies once detected
Closing Remarks
• Software exists to allow manual editing of assemblies
– Time consuming
– Different tools have different features
– Most fall over with assemblies > ~15Mbases or with many contigs/scaffolds (10k-100k)
Closing Remarks
• Ideal Tool
– Contig/scaffold viewer capable of displaying compressed/expanded mates, and which contigs mates map to when they fall off the contig/scaffold (like SeqMan Pro and Hawkeye)
– Contig join editor for manual alignment and editing of contigs (like Gap5)
– Visualise clipped regions with consensus mismatches (like Gap5)
– Automated analysis of assembly to identify regions requiring attention (like AMOSvalidate) and a way to navigate to those regions for editing
– Minimise mouse-clicks and keyboard presses!!
Newbler Plant Genome Assemblies
• Pretty conservative in contig construction
• Seems to split out repetitive regions into their own contigs pretty well
• Heterozygosity issues
– SNP alignment issues
– Indels break contigs
– Hidden in clipped regions
– Manual joining of neighbouring contigs can reduce scaffolded contig numbers by 60-70%
– Many unscaffolded contigs have high sequence similarity to scaffolded contigs – could collapse these and reduce the number of unscaffolded contigs by 50%
Gap5 – Contig Editor