Assembly algorithms using compressed data structures - B·Debate

Page 1: Assembly algorithms using compressed data structures

Jared Simpson, Wellcome Trust Sanger Institute
dnGASP Workshop Barcelona 2011

Page 2: Overview

Introduction to SGA
– Overlap-based string graph assembler
– Based on compressed data structures
– Open source, modular and extensible

Results
– Human assembly in under 60 GB of memory
– dnGASP entry summary

Page 3: SGA Overview

[Pipeline diagram. Stage labels: correction, assembly, scaffolding. Commands: sga index, sga correct, sga index, sga filter, sga assemble, bwa align, sga scaffold. Data labels: FASTQ, FASTQ, FASTA.]

Page 4: FM-Index

Data structure designed to index a compressed representation of text
Built on top of the Burrows-Wheeler transform
Very efficient to:
– Count and locate occurrences of a substring
– Extend pattern matches
Compressed
– Efficiency grows with redundancy of data
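To make the substring-counting claim concrete, here is a minimal sketch of an FM-index with backward search. It is a toy illustration, not SGA's implementation: the function names `build_fm_index` and `count` are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed (a real index would sample and compress it).

```python
from collections import Counter

def build_fm_index(text):
    """Build a toy FM-index (C table and occurrence table over the BWT)."""
    text += "$"  # sentinel: assumed absent from text and smaller than any symbol
    # Naive suffix array; fine for a demo, far too slow for real read sets.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to the sentinel
    # C[c] = number of symbols in text that sort before c
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, b in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (b == c)
    return C, occ, len(bwt)

def count(pattern, index):
    """Count occurrences of a non-empty pattern by backward search."""
    C, occ, n = index
    lo, hi = 0, n  # half-open suffix array interval [lo, hi)
    for c in reversed(pattern):  # each step extends the match one symbol to the left
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

index = build_fm_index("GATACCGTAGATGCAGTA")
print(count("TAGA", index))  # 1
print(count("GA", index))    # 2
```

The same interval update serves both operations on the slide: counting is the size of the final interval, and extending a match is one more update step, so no particular substring length is baked into the index.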

Page 5: FM-Index - Compression

Page 6: SGA Error Correction

Efficient k-mer based error corrector
– Scan reads left-to-right looking for low-frequency k-mers
  • Search for a change that makes all k-mers high frequency
– Unlike hash tables, the FM-index is not parameterized by k-mer length
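The k-mer correction idea can be sketched as follows. This is a simplified illustration under stated assumptions, not SGA's corrector: a Python `Counter` stands in for FM-index count queries, only single-base substitutions are tried, and the names `count_kmers`, `weak_kmers`, `correct_read` and the `min_count` threshold are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer in the read set (stand-in for FM-index count queries)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_kmers(read, counts, k, min_count):
    """Start positions, left to right, of k-mers seen fewer than min_count times."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]

def correct_read(read, counts, k, min_count):
    """Try single-base substitutions until every k-mer in the read is high frequency."""
    weak = weak_kmers(read, counts, k, min_count)
    if not weak:
        return read
    first = weak[0]
    # Assuming a single substitution, the error lies inside the first weak k-mer.
    for pos in range(first, first + k):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if not weak_kmers(candidate, counts, k, min_count):
                return candidate
    return read  # no single change fixes the read; leave it untouched

genome = "GATACCGTAGATGCAGTA"
reads = [genome] * 4 + ["GATACCGTCGATGCAGTA"]  # last read has A->C at position 8
counts = count_kmers(reads, k=5)
print(correct_read(reads[-1], counts, k=5, min_count=3))  # GATACCGTAGATGCAGTA
```

With a hash table the counts are tied to one value of k; an FM-index answers count queries for any substring length, which is the point of the last bullet above.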

Page 7: Example read – k-mer frequencies

Page 8: SGA Error Correction – After Correction

Page 9: SGA Error Correction

Also have a multi-alignment based corrector
– Call consensus from inexact overlaps found with the FM-index
– Less efficient than the k-mer corrector

Page 10: String graphs

Graph-based assembly model
Reads are vertices
– Vertices are connected by an edge if they have a non-transitive overlap
Can be constructed directly using the FM-index

“Efficient construction of an assembly string graph using the FM-index” ISMB 2010

Page 11: String Graphs

R1 GATACCGTAGA
R2 TACCGTAGATGC
R3 GTAGATGCAGT
R4 AGATGCAGTA

Edge labels shown in the figure: AGT, TACC

G = GATACCGTAGATGCAGTA
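The four reads above can be turned into a string graph with a short sketch. Assumptions: overlaps are found by naive suffix/prefix comparison rather than with the FM-index, a hypothetical minimum overlap of 5 is used, reverse complements are ignored, and an edge is treated as transitive simply when a two-step path connects the same pair of reads. The helper names are illustrative only.

```python
from itertools import permutations

READS = {
    "R1": "GATACCGTAGA",
    "R2": "TACCGTAGATGC",
    "R3": "GTAGATGCAGT",
    "R4": "AGATGCAGTA",
}

def overlap(a, b, min_len):
    """Length of the longest proper suffix of a equal to a prefix of b, or 0."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len):
    """Directed edges (x, y) -> overlap length for every overlapping read pair."""
    edges = {}
    for x, y in permutations(reads, 2):
        n = overlap(reads[x], reads[y], min_len)
        if n:
            edges[(x, y)] = n
    return edges

def remove_transitive(edges):
    """Drop edge x->z whenever some y gives a path x->y->z."""
    succ = {}
    for x, y in edges:
        succ.setdefault(x, set()).add(y)
    return {(x, z): n for (x, z), n in edges.items()
            if not any(z in succ.get(y, ()) for y in succ[x] if y != z)}

def spell(reads, edges, start):
    """Walk the unique path from start, appending each read's unmatched suffix."""
    nxt = {x: (y, n) for (x, y), n in edges.items()}  # assumes one out-edge per read
    seq, cur = reads[start], start
    while cur in nxt:
        cur, n = nxt[cur]
        seq += reads[cur][n:]
    return seq

graph = remove_transitive(overlap_graph(READS, min_len=5))
print(sorted(graph))               # [('R1', 'R2'), ('R2', 'R3'), ('R3', 'R4')]
print(spell(READS, graph, "R1"))   # GATACCGTAGATGCAGTA
```

In this sketch the suffix appended when stepping from R2 to R3 is AGT, which matches one of the edge labels in the figure.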

Page 12: SGA Merging

• Can merge reads that have one incoming/outgoing edge
• We can compute the predecessor and successor vertices for a read and merge without explicitly building the entire graph
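A rough sketch of the merging rule: two reads are joined only when the first has exactly one outgoing edge and the second has exactly one incoming edge. Unlike SGA, which looks up each read's neighbours on demand, this toy version assumes the edges are already given as a dictionary and that the graph has no cycles. The function name and the example reads A to D are hypothetical.

```python
def merge_unambiguous(seqs, edges):
    """seqs: name -> sequence; edges: (x, y) -> overlap length.

    Returns the merged sequences obtained by collapsing unbranched chains.
    """
    out_edges, in_edges = {}, {}
    for (x, y), n in edges.items():
        out_edges.setdefault(x, []).append((y, n))
        in_edges.setdefault(y, []).append(x)

    def sole_successor(x):
        """Return (y, overlap) if x -> y is the only edge out of x and into y."""
        outs = out_edges.get(x, [])
        if len(outs) == 1 and len(in_edges.get(outs[0][0], [])) == 1:
            return outs[0]
        return None

    def has_sole_predecessor(y):
        ins = in_edges.get(y, [])
        return len(ins) == 1 and len(out_edges.get(ins[0], [])) == 1

    merged = []
    for start in seqs:
        if has_sole_predecessor(start):
            continue  # interior of a chain; reached from the chain's first read
        seq, cur = seqs[start], start
        step = sole_successor(cur)
        while step is not None:
            cur, n = step
            seq += seqs[cur][n:]  # append only the part not covered by the overlap
            step = sole_successor(cur)
        merged.append(seq)
    return merged

# A -> B is unambiguous; B then branches to C and D, so merging stops at B.
seqs = {"A": "AAAC", "B": "ACGG", "C": "GGTT", "D": "GGCA"}
edges = {("A", "B"): 2, ("B", "C"): 2, ("B", "D"): 2}
print(merge_unambiguous(seqs, edges))  # ['AAACGG', 'GGTT', 'GGCA']
```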

Page 13: String Graph Assembly

After correction, the merging process reduces the amount of data by ~20-30x
After merging, the string graph easily fits in memory
The in-memory graph is processed to detect/remove polymorphism and output a final set of contigs

Page 14: SGA Scaffolding

Input: set of contigs, paired reads
Reads aligned to contigs with BWA
Alignments processed to discover contig-contig links and construct a scaffold graph

[Figure: scaffold graph with vertices contig-1948, contig-346 and contig-221; edge labels 250, 75 and 2000]

Page 15: SGA scaffold

Step one:
• Cull repeats from graph based on read depth and connectivity
• Ensure remaining edges are consistent

Page 16: SGA scaffold

Step two:
• Find paths through the connected components
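The two scaffolding steps can be illustrated with a toy sketch. It is heavily simplified and not the algorithm in `sga scaffold`: repeats are culled by a depth threshold alone (the slides also mention connectivity), contig orientation is ignored, and the contig names, depths, gap values and the `max_depth` parameter are all made up for the example.

```python
def cull_repeats(links, depth, max_depth):
    """Step one: drop every link touching a contig whose read depth looks repetitive."""
    return [(a, b, gap) for a, b, gap in links
            if depth[a] <= max_depth and depth[b] <= max_depth]

def scaffold_paths(links):
    """Step two: order contigs by walking each chain of links from its left end."""
    succ, pred = {}, {}
    for a, b, gap in links:
        # Consistency check: at most one link out of and into each contig.
        if a in succ or b in pred:
            raise ValueError(f"inconsistent links around {a} -> {b}")
        succ[a], pred[b] = b, a
    paths = []
    for start in succ:
        if start in pred:
            continue  # not the left end of a chain
        path = [start]
        while path[-1] in succ:
            path.append(succ[path[-1]])
        paths.append(path)
    return paths

# Hypothetical links: (left contig, right contig, estimated gap in bp)
links = [("c1", "c2", 250), ("c2", "c3", 75), ("c2", "rep", 10), ("rep", "c4", 20)]
depth = {"c1": 30, "c2": 28, "c3": 31, "c4": 29, "rep": 90}

kept = cull_repeats(links, depth, max_depth=60)
print(scaffold_paths(kept))  # [['c1', 'c2', 'c3']]
```

Without the culling step the links around the high-depth contig would give c2 two successors, and the consistency check in `scaffold_paths` would reject the graph.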

Page 17: SGA scaffold

After building scaffolds, gaps can be resolved by finding walks through the string graph
Planning an additional module to fill remaining gaps by local assembly of pairs falling into the gap

Page 18: Human Genome Assembly

CEU trio member NA12878
– Sequenced by the Broad Institute
– Illumina HiSeq: 40X coverage, 100bp reads
– Single library, 400bp insert
– Low error rate (<1%)

Page 19: Human Genome Results

Stage              Wall Time (Jobs)   Cumulative Time   Max Memory
Index raw reads    94 hr (127)        738 CPU-hr        34 GB
Correct reads      30 hr (63)         1275 CPU-hr       53 GB
Index corrected    44 hr (123)        513 CPU-hr        28 GB
Filter reads       62 hr (5)          358 CPU-hr        45 GB
Merge reads        48 hr (1)          162 CPU-hr        47 GB
Assemble reads     102 hr (10)        161 CPU-hr        26 GB
Align to contigs   46 hr (67)         63 CPU-hr         10 GB
Build scaffolds    3 hr (8)           5 CPU-hr          7 GB
All stages         429 hr             3275 CPU-hr       53 GB

Page 20: Human Genome Results

Page 21: dnGASP Challenge

Main challenge: very high error rate at the ends of reads


Page 23: dnGASP Summary

Total running time: 322 wall clock hours
– Many parallel processes, total CPU time 2300 hours
Max memory: 46 GB
Contig N50: 16.2kb
Scaffold N50: 275kb
95.2% of genome covered by contigs >200bp
99.4% of contigs align full-length to reference
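For readers unfamiliar with the N50 figures above, this short sketch shows how the statistic is commonly computed: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of all bases. The function name `n50` and the example lengths are illustrative and are not taken from the dnGASP evaluation.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 of an empty collection is undefined")

# Made-up contig lengths: total 290, so half is 145; 80 + 70 = 150 reaches it.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```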

Page 24: Summary

Modular approach to assembly
– Focus on reusable, memory-efficient components
Planned improvements:
– Better gap filling after scaffolding
– Correcting/assembling mixed read sets

github.com/jts/sga

Page 25: Acknowledgements

Richard Durbin
Matthias Haimel, Albert Vilella (EBI)
Mark DePristo (Broad)