Assembly algorithms using compressed data structures - B·Debate
dnGASP Workshop Barcelona 2011 Jared Simpson
Jared Simpson Wellcome Trust Sanger Institute
Assembly algorithms using
compressed data structures
Overview
Introduction to SGA
– Overlap-based string graph assembler
– Based on compressed data structures
– Open source, modular and extensible
Results
– Human assembly in under 60GB of memory
– dnGASP entry summary
SGA Overview
Pipeline stages: correction, assembly, scaffolding
Commands, in order: sga index, sga correct, sga index, sga filter, sga assemble, bwa align, sga scaffold
File formats shown in the diagram: FASTQ, FASTQ, FASTA
FM-Index
Data structure that indexes a compressed representation of text
Built on top of the Burrows-Wheeler transform
Very efficient to:
– Count and locate occurrences of a substring
– Extend pattern matches
Compressed
– Efficiency grows with redundancy of data
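The counting operation listed above can be sketched with a toy FM-index. This is a minimal, uncompressed illustration and not SGA's implementation: the names `bwt`, `FMIndex`, `first` and `occ` are made up for the example, the transform is built by sorting rotations (only workable for tiny inputs), and the occurrence table is stored in full rather than sampled. Each loop iteration in `count` extends the match by one character, which is why the same index answers queries for any pattern length.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # unique terminator, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


class FMIndex:
    """Uncompressed teaching version: full occurrence table, no sampling."""

    def __init__(self, text):
        self.bwt = bwt(text)
        alphabet = sorted(set(self.bwt))
        # first[c]: number of characters in the text that sort before c
        self.first, total = {}, 0
        for c in alphabet:
            self.first[c] = total
            total += self.bwt.count(c)
        # occ[c][i]: number of occurrences of c in bwt[:i]
        self.occ = {c: [0] for c in alphabet}
        for ch in self.bwt:
            for c in alphabet:
                self.occ[c].append(self.occ[c][-1] + (ch == c))

    def count(self, pattern):
        """Count occurrences of pattern by backward search."""
        lo, hi = 0, len(self.bwt)  # half-open interval of matching rows
        for c in reversed(pattern):
            if c not in self.first:
                return 0
            lo = self.first[c] + self.occ[c][lo]
            hi = self.first[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


index = FMIndex("GATACCGTAGATGCAGTA")
print(index.count("GTA"), index.count("GAT"), index.count("CCC"))  # 2 2 0
```

Locating the positions of the matches would additionally need a (sampled) suffix array, which this sketch leaves out.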
FM-Index - Compression
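The slide's figure did not survive in the transcript. As a rough illustration of why redundancy helps, the sketch below assumes run-length encoding as the compression scheme (one common choice, assumed here rather than taken from the slide) and compares the number of runs in a repetitive toy string with the number of runs in its Burrows-Wheeler transform. The helper names `bwt` and `run_lengths` are hypothetical.

```python
from itertools import groupby


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    text += "$"  # unique terminator, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def run_lengths(s):
    """Run-length encode s as a list of (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


text = "ACGACGACGACG"  # made-up, highly repetitive input
transformed = bwt(text)
print(transformed)  # GGGG$AAAACCCC
print(len(run_lengths(text)), len(run_lengths(transformed)))  # 12 4
```

The transform groups characters that precede the same context, so repeated sequence collapses into long runs that a run-length encoding stores compactly.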
SGA Error Correction
Efficient k-mer based error corrector
– Scan reads left-to-right looking for low-frequency k-mers
• Search for change to make all k-mers high frequency
– Unlike hash tables, FM-index is not parameterized by k-mer length
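The scan-and-substitute idea above can be sketched as follows. This is a simplified stand-in, not SGA's corrector: k-mer counts come from a plain `collections.Counter` instead of FM-index queries, only single-base substitutions inside the leftmost low-frequency k-mer are tried, and the names `kmer_counts`, `is_solid`, `correct_read` and the `min_count` threshold are assumptions made for the example. The reads are toy data.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across all reads (stand-in for FM-index counting)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def is_solid(seq, counts, k, min_count):
    """True if every k-mer of seq occurs at least min_count times."""
    return all(counts[seq[i:i + k]] >= min_count
               for i in range(len(seq) - k + 1))


def correct_read(read, counts, k, min_count):
    """Fix one substitution error, or return the read unchanged."""
    weak = [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]
    if not weak:
        return read
    first = weak[0]  # leftmost low-frequency k-mer
    for pos in range(first, first + k):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if is_solid(candidate, counts, k, min_count):
                return candidate
    return read  # no single substitution makes all k-mers high frequency


good = "GATACCGTAGATGCAGTA"
bad = "GATACCGTTGATGCAGTA"  # same read with one substitution (A -> T)
counts = kmer_counts([good, good, good, bad], k=5)
print(correct_read(bad, counts, k=5, min_count=2))  # GATACCGTAGATGCAGTA
```

Swapping the `Counter` lookups for FM-index count queries would let the same loop run with any value of `k` without rebuilding the index.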
Example read – k-mer frequencies
SGA Error Correction – After Correction
SGA Error Correction
SGA also has a multiple-alignment-based corrector
– Call consensus from inexact overlaps found with FM-index
– Less efficient than k-mer corrector
String graphs
Graph based assembly model
Reads are vertices
– Vertices connected by an edge if they have a non-transitive overlap
Can be directly constructed using the FM-index
“Efficient construction of an assembly string graph using the FM-index” ISMB 2010
String Graphs
[Figure: string graph over reads R1 to R4; edge labels shown: AGT, TACC]
R1 GATACCGTAGA
R2 TACCGTAGATGC
R3 GTAGATGCAGT
R4 AGATGCAGTA
G = GATACCGTAGATGCAGTA
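The four reads on this slide are enough to work through the model by hand. The sketch below is a naive illustration under several assumptions: overlaps are found by direct string comparison rather than with the FM-index, only the forward strand is considered, an edge is treated as transitive whenever a two-step path connects the same pair of reads, and the minimum overlap of 4 is chosen for this example. The function names are hypothetical.

```python
def overlap(a, b, min_len):
    """Longest proper suffix of a that is a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_graph(reads, min_len):
    """Map (source, target) read names to their overlap length."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(reads[a], reads[b], min_len)
                if n:
                    edges[(a, b)] = n
    return edges


def reduce_transitive(edges):
    """Drop edge a->c when some b gives a->b and b->c (simplified rule)."""
    nodes = {name for edge in edges for name in edge}
    return {(a, c): n for (a, c), n in edges.items()
            if not any((a, b) in edges and (b, c) in edges for b in nodes)}


def spell(reads, edges, start):
    """Walk unique out-edges from start, appending each edge label.

    Assumes the reduced graph is a simple path: no branches, no cycles.
    """
    successor = {a: b for (a, b) in edges}
    seq, current = reads[start], start
    while current in successor:
        nxt = successor[current]
        seq += reads[nxt][edges[(current, nxt)]:]  # unmatched suffix = label
        current = nxt
    return seq


reads = {"R1": "GATACCGTAGA", "R2": "TACCGTAGATGC",
         "R3": "GTAGATGCAGT", "R4": "AGATGCAGTA"}
graph = reduce_transitive(build_graph(reads, min_len=4))
print(sorted(graph))               # [('R1', 'R2'), ('R2', 'R3'), ('R3', 'R4')]
print(spell(reads, graph, "R1"))   # GATACCGTAGATGCAGTA
```

The label on the R2 to R3 edge comes out as AGT, matching one of the labels in the figure; the other label, TACC, is the part of R2 that precedes its overlap with R3, read in the opposite direction.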
SGA Merging
• Can merge reads that have one incoming/outgoing edge
• We can compute the predecessor and successor vertices for a
read and merge without explicitly building the entire graph
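The merge rule in the first bullet can be sketched on an explicit graph. This is only an illustration of the rule: it takes a ready-made successor map as input, whereas the slide describes computing predecessors and successors per read without building the whole graph. The function name `merge_simple_paths` and the toy vertex names are made up.

```python
def merge_simple_paths(succ):
    """Group vertices into chains joined by one-out/one-in edges.

    succ maps each vertex to a list of its successors. Chains that form
    a closed cycle with no entry point are not reported by this sketch.
    """
    pred = {v: [] for v in succ}
    for v, outs in succ.items():
        for w in outs:
            pred.setdefault(w, []).append(v)

    def mergeable(v, w):
        # v's only out-edge goes to w, and w's only in-edge comes from v
        return len(succ.get(v, [])) == 1 and len(pred[w]) == 1

    chains = []
    for v in sorted(pred):
        # skip vertices that would be absorbed into their predecessor's chain
        if len(pred[v]) == 1 and mergeable(pred[v][0], v):
            continue
        chain = [v]
        while True:
            outs = succ.get(chain[-1], [])
            if len(outs) != 1 or not mergeable(chain[-1], outs[0]):
                break
            chain.append(outs[0])
        chains.append(chain)
    return chains


toy = {"A": ["B"], "B": ["C"], "C": ["D", "E"],
       "D": ["F"], "E": ["F"], "F": []}
print(merge_simple_paths(toy))  # [['A', 'B', 'C'], ['D'], ['E'], ['F']]
```

Each returned chain corresponds to one merged sequence; the branch at C and the join at F stay as separate vertices.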
String Graph Assembly
After correction, merging process reduces amount of
data by ~20-30x
After merging, the string graph easily fits in memory
The in-memory graph is processed to detect/remove
polymorphism and output a final set of contigs
SGA Scaffolding
Input: set of contigs, paired reads
Reads aligned to contigs with BWA
Alignments processed to discover contig-contig links
and construct a scaffold graph
[Figure: scaffold graph linking contig-1948, contig-346 and contig-221; numeric labels shown: 250, 75, 2000]
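How paired alignments turn into contig-contig links can be sketched as below. Everything here is an assumption made for illustration, not the logic of sga scaffold: one mate aligns forward on the left contig at leftmost position `pos_a`, the other aligns on the right contig starting at `pos_b`, the library has a single known insert size, the gap is the insert size minus the fragment bases lying on each contig, and a link is kept when at least `min_pairs` pairs support it. The contig names and numbers are toy data.

```python
from collections import defaultdict
from statistics import median


def estimate_links(pairs, contig_len, insert_size, read_len, min_pairs=2):
    """Return {(left contig, right contig): estimated gap in bp}.

    pairs holds tuples (contig_a, pos_a, contig_b, pos_b), one per read pair.
    """
    gaps = defaultdict(list)
    for a, pos_a, b, pos_b in pairs:
        if a == b:
            continue  # both mates on one contig: no link information
        on_left = contig_len[a] - pos_a   # fragment bases on the left contig
        on_right = pos_b + read_len       # fragment bases on the right contig
        gaps[(a, b)].append(insert_size - on_left - on_right)
    return {link: median(values) for link, values in gaps.items()
            if len(values) >= min_pairs}


contig_len = {"ctgA": 1000, "ctgB": 800}
pairs = [
    ("ctgA", 800, "ctgB", 50),
    ("ctgA", 850, "ctgB", 100),
    ("ctgA", 900, "ctgB", 140),
    ("ctgA", 700, "ctgC", 10),   # single supporting pair, filtered out
    ("ctgA", 100, "ctgA", 400),  # same contig, ignored
]
print(estimate_links(pairs, contig_len, insert_size=400, read_len=100))
# {('ctgA', 'ctgB'): 50}
```

The resulting links and distances are the kind of edge data a scaffold graph carries; repeat culling and path finding would operate on that graph afterwards.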
SGA scaffold
Step one:
• Cull repeats from graph based on read depth
and connectivity.
• Ensure remaining edges are consistent
SGA scaffold
Step two:
• Find paths through the connected
components
SGA scaffold
After building scaffolds, gaps can be resolved by
finding walks through the string graph
Planning additional module to fill remaining gaps by
local assembly of pairs falling into the gap
Human Genome Assembly
CEU trio member NA12878
– Sequenced by the Broad Institute
– Illumina HiSeq: 40X 100bp reads
– Single library 400bp insert
– Low error rate (<1%)
Human Genome Results
Stage              Wall Time (Jobs)   Cumulative Time   Max Memory
Index raw reads    94 hr (127)        738 CPU-hr        34 GB
Correct reads      30 hr (63)         1275 CPU-hr       53 GB
Index corrected    44 hr (123)        513 CPU-hr        28 GB
Filter reads       62 hr (5)          358 CPU-hr        45 GB
Merge reads        48 hr (1)          162 CPU-hr        47 GB
Assemble reads     102 hr (10)        161 CPU-hr        26 GB
Align to contigs   46 hr (67)         63 CPU-hr         10 GB
Build scaffolds    3 hr (8)           5 CPU-hr          7 GB
All stages         429 hr             3275 CPU-hr       53 GB
dnGASP Challenge
Main challenge: Very high error rate at the ends of
reads
dnGASP Summary
Total running time: 322 wall clock hours
– Many parallel processes; total CPU time 2300 hours
Max memory: 46GB
Contig N50: 16.2kb
Scaffold N50: 275kb
95.2% of genome covered by contigs >200bp
99.4% of contigs align full-length to reference
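For readers unfamiliar with the N50 figures above, here is a minimal sketch of the usual definition: the largest length L such that sequences of length at least L account for half of the total. It assumes the total is the summed length of the assembly itself; whether the dnGASP evaluation used that or the reference genome size is not stated on the slide. The lengths in the example are made up.

```python
def n50(lengths):
    """N50 of a list of contig or scaffold lengths."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
print(n50([10, 20, 30, 40]))          # 30
```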
Summary
Modular approach to assembly
– Focus on reusable, memory-efficient components
Planned improvements:
– Better gap filling after scaffolding
– Correcting/assembling mixed read sets
github.com/jts/sga
Acknowledgements
Richard Durbin
Matthias Haimel, Albert Vilella (EBI)
Mark DePristo (Broad)