de Bruijn Graph Construction from Combination of Short and Long Reads

de Bruijn Graph Construction from Combination of Short and Long Reads CSE 6406 : Bioinformatics Algorithms Course Faculty: Dr. Atif Hasan Rahman

Transcript of de Bruijn Graph Construction from Combination of Short and Long Reads

Page 1: de Bruijn Graph Construction from Combination of Short and Long Reads

de Bruijn Graph Construction from Combination of Short and Long Reads

CSE 6406: Bioinformatics Algorithms

Course Faculty: Dr. Atif Hasan Rahman

Page 2: de Bruijn Graph Construction from Combination of Short and Long Reads

Group Members

KAZI LUTFUL KABIR (1015052067)

SIKDER TAHSIN AL-AMIN (1015052076)

MD MAHABUR RAHMAN (1015052016)

Page 3: de Bruijn Graph Construction from Combination of Short and Long Reads

Outline

- Common Terminology
- Motivation
- de Bruijn Graph
- A-Bruijn Graph
- Finding Genomic Path
- Error Correction in Draft Genome
- Potential Scopes of Development

Page 4: de Bruijn Graph Construction from Combination of Short and Long Reads

Common Terminology

Read: The sequence of a cluster obtained at the end of the sequencing process; ultimately, the sequence of a section of a unique fragment.

Contig: A set of reads related to each other by overlaps of their sequences.

Genomic Path: A path in the assembly graph that corresponds to traversing the genome.

Draft genome: A genomic DNA sequence with lower accuracy than a finished sequence; some segments are missing or in the wrong order or orientation.

Tip: A short dead-end branch of the graph, caused by a sequencing error that makes a path end prematurely; it contains both correct and incorrect k-mers.

Bubble: A pair of alternative paths caused by a sequencing error, where the erroneous k-mers branch off and then reconnect with the main graph.

Page 5: de Bruijn Graph Construction from Combination of Short and Long Reads

Limitations of Classical de Bruijn Graph

Imperfect coverage of the genome by reads (the classical approach assumes that every k-mer from the genome is represented by a read)

Reads are error-prone

Multiplicities of k-mers are unknown

Distances between reads within the read-pairs are inexact

Page 6: de Bruijn Graph Construction from Combination of Short and Long Reads

Motivation

Implicit assumption: the de Bruijn approach is inapplicable to long-read assembly.

Misunderstanding: the de Bruijn graph can only assemble highly accurate reads and fails on error-prone SMRT reads.

Assumption: the de Bruijn approach is limited to short, accurate reads, and OLC (overlap-layout-consensus) is the only way to assemble long, error-prone reads.

The original version of the de Bruijn approach is far from optimal for the genome assembly problem.

Page 7: de Bruijn Graph Construction from Combination of Short and Long Reads

de Bruijn Graph Demonstration

de Bruijn graph DB(Str, k) of a string Str:

- Path(Str, k): a path of |Str| - k + 1 edges, where the i-th edge is the i-th k-mer in Str and the i-th vertex is the i-th (k-1)-mer in Str
- Glue identical vertices in Path(Str, k)

A circular string, Str = CATCAGATAGGA

3-mers: CAT, ATC, TCA, CAG, ...

For the edge CAT, CA and AT are the constituent vertices.
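The construction can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the function name `de_bruijn_circular` is hypothetical, and the string is treated as circular, so k-mers wrap around the end and there are |Str| edges rather than |Str| - k + 1. Gluing identical vertices happens implicitly because each (k-1)-mer is used as a dictionary key.

```python
from collections import defaultdict

def de_bruijn_circular(s, k):
    """Adjacency lists of DB(s, k) for a circular string s (one edge per k-mer)."""
    doubled = s + s[:k - 1]          # lets k-mers wrap around the end of s
    graph = defaultdict(list)
    for i in range(len(s)):
        kmer = doubled[i:i + k]
        # edge kmer goes from its (k-1)-prefix to its (k-1)-suffix;
        # identical (k-1)-mers share one dictionary key, i.e. they are glued
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

graph = de_bruijn_circular("CATCAGATAGGA", 3)
```

Here `graph["CA"]` lists the two vertices reached from CA, one for each occurrence of CA in the string.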

Page 8: de Bruijn Graph Construction from Combination of Short and Long Reads

de Bruijn Graph Construction

Page 9: de Bruijn Graph Construction from Combination of Short and Long Reads

A-Bruijn Graph

A variation of the de Bruijn graph approach

A more general approach than de Bruijn

Includes breakpoint graphs, a major area of genome rearrangement studies

Page 10: de Bruijn Graph Construction from Combination of Short and Long Reads

A-Bruijn Graph Demonstration

An arbitrary substring-free set of strings, V (a set of solid strings)

V consists of words (of any length):

- Path(Str, V): a path through all words from V appearing in Str (in order)
- Assign the integer shift(v, w) to the edge (v, w) in this path to denote the difference between the positions of v and w in Str

Glue identically labeled vertices to construct the A-Bruijn graph AB(Str, V)

AB(Str, V) is generalized to AB(Reads, V):

- A path for each read
- Glue all identical vertices in all paths
- An Eulerian path in AB(Reads, V) spells out the genome

Selecting an appropriate set of solid strings is a crucial factor

Page 11: de Bruijn Graph Construction from Combination of Short and Long Reads

A-Bruijn Graph Demonstration

A circular string, Str = CATCAGATAGGA

Set of solid strings, V= { CA, AT, TC, AGA, TA, AGG, AC }

Integer shift AGA → AT: 2

CATCAGATAGGACATCAGATAGGA
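The example on this slide can be reproduced with a short sketch. The function name `a_bruijn_circular` and the edge representation (a list of `(v, w, shift)` triples) are assumptions made for illustration; vertices with the same label are glued simply because edges refer to words by their label.

```python
def a_bruijn_circular(s, words):
    """Edges (v, w, shift) of AB(s, words) for a circular string s."""
    n = len(s)
    doubled = s + s                  # lets words wrap around the end of s
    # every occurrence of a solid string, ordered by its position in s
    occurrences = sorted(
        (i, w) for w in words for i in range(n) if doubled.startswith(w, i)
    )
    edges = []
    for idx, (pos, word) in enumerate(occurrences):
        next_pos, next_word = occurrences[(idx + 1) % len(occurrences)]
        # shift = difference between the positions of consecutive words
        edges.append((word, next_word, (next_pos - pos) % n))
    return edges

V = {"CA", "AT", "TC", "AGA", "TA", "AGG", "AC"}
edges = a_bruijn_circular("CATCAGATAGGA", V)
```

The resulting edge list contains `("AGA", "AT", 2)`, matching the integer shift quoted on the slide.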

Page 12: de Bruijn Graph Construction from Combination of Short and Long Reads

A-Bruijn Graph Construction

Page 13: de Bruijn Graph Construction from Combination of Short and Long Reads

Solid String Selection

Short Illumina reads and long SMRT reads differ in terms of their resultant A-Bruijn graph.

Short Illumina reads: the resultant graph can be analyzed further after applying graph simplification procedures (bubble and tip removal). These procedures are not applicable to long SMRT reads (with error rate > 10%).

Good candidates for solid strings: k-mers that appear frequently in reads

- (k,t)-mer: a k-mer that appears at least t times
- For a typical bacterial SMRT assembly, k = 15 and t = 8 (default choice)
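Selecting (k,t)-mers amounts to counting k-mers across all reads and keeping the frequent ones. The sketch below assumes plain string reads and ignores reverse complements for brevity; the function name `kt_mers` and the toy reads are hypothetical.

```python
from collections import Counter

def kt_mers(reads, k, t):
    """Return the set of k-mers that occur at least t times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= t}

reads = ["ACGTT", "ACGTA", "TCGTA"]
solid_t3 = kt_mers(reads, k=3, t=3)
solid_t2 = kt_mers(reads, k=3, t=2)
```

Raising t shrinks the set of solid strings, which is why the choice of k and t matters for the resulting A-Bruijn graph.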

Page 14: de Bruijn Graph Construction from Combination of Short and Long Reads

Finding Genomic Path in A-Bruijn Graph

hybridSPAdes algorithm (for co-assembling short and long reads):

1. Constructing the assembly graph from short reads using SPAdes
2. Mapping long reads to the assembly graph and generating read-paths
3. Closing gaps in the assembly graph using the consensus of long reads that span the gaps
4. Resolving repeats in the assembly graph by incorporating long read-paths into the decision rule of EXSPANDER (a repeat resolution framework)

Page 15: de Bruijn Graph Construction from Combination of Short and Long Reads

Finding Genomic Path in A-Bruijn Graph

SPAdes algorithm:

1. Assembly graph construction: de Bruijn graph simplification
2. k-bimer adjustment: accurate distance estimation between k-mers in the genome
3. Construction of the paired assembly graph: PDBG (paired de Bruijn graph) approach
4. Contig construction: backtracking graph simplification

hybridSPAdes vs longSPAdes:

- hybridSPAdes: de Bruijn graph on k-mers from short reads
- longSPAdes: A-Bruijn graph on (k,t)-mers from long reads

Page 16: de Bruijn Graph Construction from Combination of Short and Long Reads

ABruijn Assembler

Attempts to find a genomic path in the original A-Bruijn graph (instead of a simplified one)

In the context of A-Bruijn graph, it is difficult to decide whether two reads overlap or not

Parameters of longSPAdes in new contexts

Some additional parameters along with those of longSPAdes

Page 17: de Bruijn Graph Construction from Combination of Short and Long Reads

Matching reads against draft genome

ABruijn uses BLASR to align all reads against the draft genome.

It then combines the pairwise alignments of all reads into a multiple alignment, Alignment.

Since this multiple alignment is inaccurate for an error-prone draft genome, it needs to be modified.

Page 18: de Bruijn Graph Construction from Combination of Short and Long Reads

Matching reads against draft genome

The goal is to partition the multiple alignment of reads into thousands of short segments, called mini-alignments, and to error-correct each segment, since error correction methods are fast for short segments.

However, constructing mini-alignments is not simple.

Page 19: de Bruijn Graph Construction from Combination of Short and Long Reads

Defining solid regions in draft genome

(Figure labels: reference position, non-reference position)

Page 20: de Bruijn Graph Construction from Combination of Short and Long Reads

Defining solid regions in draft genome

Cov(i) = total number of reads covering position i

Page 21: de Bruijn Graph Construction from Combination of Short and Long Reads

Defining solid regions in draft genome

Match(i) = number of reads whose symbol matches the reference in column i

Page 22: de Bruijn Graph Construction from Combination of Short and Long Reads

Defining solid regions in draft genome

Del(i) = number of space symbols in column i

Page 23: de Bruijn Graph Construction from Combination of Short and Long Reads

Defining solid regions in draft genome

Sub(i) = number of substituted symbols in column i

Page 24: de Bruijn Graph Construction from Combination of Short and Long Reads

Defining solid regions in draft genome

Ins(i) = number of non-space symbols in the non-reference columns that follow reference position i

Page 25: de Bruijn Graph Construction from Combination of Short and Long Reads

Defining solid regions in draft genome

Cov(i) = Match(i) + Del(i) + Sub(i)

Match rate = Match(i) / Cov(i)

Deletion rate = Del(i) / Cov(i)

Substitution rate = Sub(i) / Cov(i)
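As a rough sketch of how these per-column counts could be tallied, the code below works on a toy multiple alignment restricted to reference columns. The encoding is an assumption made for illustration only: each read is a string as long as the reference, with `-` for a space (deletion) and `.` where the read does not cover the column. Insertions live in non-reference columns and are left out here. The function name `column_stats` is hypothetical.

```python
def column_stats(reference, rows):
    """Per reference column: Cov, Match, Del and Sub counts."""
    stats = []
    for i, ref_symbol in enumerate(reference):
        match = deletion = substitution = 0
        for row in rows:
            symbol = row[i]
            if symbol == ".":            # read does not cover this column
                continue
            if symbol == "-":            # space symbol: a deletion in the read
                deletion += 1
            elif symbol == ref_symbol:
                match += 1
            else:
                substitution += 1
        cov = match + deletion + substitution   # Cov(i) = Match + Del + Sub
        stats.append({"cov": cov, "match": match, "del": deletion, "sub": substitution})
    return stats

stats = column_stats("ACGT", ["ACGT", "A-GT", "ATG.", ".CGT"])
match_rates = [col["match"] / col["cov"] for col in stats]
```

In this toy alignment the second column has one deletion and one substitution among four covering reads, so its match rate is 0.5.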

Page 26: de Bruijn Graph Construction from Combination of Short and Long Reads

Defining solid regions in draft genome

For a given l-mer:

- Local match rate = minimum match rate over its positions
- Local insertion rate = maximum insertion rate over its positions

An l-mer is called (α, β)-solid if its local match rate exceeds α and its local insertion rate does not exceed β.
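The solidity test can be written directly from this definition. The sketch assumes the per-position match and insertion rates are already available as lists, with the insertion rate taken to be Ins(i) / Cov(i) (an assumption, since the slides define only the match, deletion, and substitution rates). The function name `solid_lmer_starts` and the sample rates are hypothetical; the defaults use the (0.8, 0.2) thresholds from the slides.

```python
def solid_lmer_starts(match_rates, insertion_rates, l, alpha=0.8, beta=0.2):
    """Start positions of (alpha, beta)-solid l-mers in the draft genome."""
    starts = []
    for start in range(len(match_rates) - l + 1):
        local_match = min(match_rates[start:start + l])        # worst match rate
        local_insertion = max(insertion_rates[start:start + l])  # worst insertion rate
        if local_match > alpha and local_insertion <= beta:
            starts.append(start)
    return starts

match_rates = [0.9, 0.95, 0.85, 0.7, 0.9, 0.9, 0.9]
insertion_rates = [0.1, 0.1, 0.1, 0.1, 0.3, 0.1, 0.1]
starts = solid_lmer_starts(match_rates, insertion_rates, l=2)
```

A single poorly matched or insertion-heavy position disqualifies every l-mer that contains it, which is what makes the min/max definition strict.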

Page 27: de Bruijn Graph Construction from Combination of Short and Long Reads

Defining solid regions in draft genome

Taking (α, β) = (0.8,0.2)

Page 28: de Bruijn Graph Construction from Combination of Short and Long Reads

Defining solid regions in draft genome

A contiguous sequence of (α, β)-solid l-mers forms a solid region.

The goal now is to select a position (landmark) within each solid region and to form mini-alignments from the segments of reads.
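Grouping contiguous solid l-mers into regions can be sketched as merging runs of consecutive start positions. The function name `solid_regions` and the inclusive `(first, last)` interval convention are assumptions for illustration.

```python
def solid_regions(solid_starts, l):
    """Merge consecutive solid l-mer starts into inclusive (first, last) regions."""
    runs = []
    for start in sorted(solid_starts):
        if runs and start == runs[-1][1] + 1:
            runs[-1][1] = start              # extend the current run of l-mers
        else:
            runs.append([start, start])      # begin a new run
    # a run of l-mer starts first..last covers positions first..last + l - 1
    return [(first, last + l - 1) for first, last in runs]

regions = solid_regions([0, 1, 5], l=2)
```

A landmark would then be chosen inside each returned interval.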

Page 29: de Bruijn Graph Construction from Combination of Short and Long Reads

Breaking multiple alignment into mini-alignments

Another A-Bruijn graph with much simpler bubbles is constructed using (α, β)-solid l-mers.

First, landmarks are selected outside homonucleotide runs.

Page 30: de Bruijn Graph Construction from Combination of Short and Long Reads

Selecting landmarks

4-mers:

- CAGT: gold (all its nucleotides are different)
- ATGA: simple (consecutive nucleotides are different)

Landmarks: middle points (2nd and 3rd nucleotides)

ABruijn analyzes each mini-alignment and error-corrects each segment between consecutive landmarks.
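The gold/simple distinction is easy to check mechanically. The sketch below follows the two definitions on this slide; the function name `classify_4mer` and the convention of returning `None` for other 4-mers are assumptions.

```python
def classify_4mer(kmer):
    """Classify a 4-mer as 'gold', 'simple', or None (neither)."""
    if len(kmer) != 4:
        raise ValueError("expected a 4-mer")
    if len(set(kmer)) == 4:
        return "gold"                         # all four nucleotides differ
    if all(a != b for a, b in zip(kmer, kmer[1:])):
        return "simple"                       # only consecutive nucleotides differ
    return None                               # contains a homonucleotide pair

labels = {kmer: classify_4mer(kmer) for kmer in ("CAGT", "ATGA", "AATG")}
```

Every gold 4-mer also satisfies the simple condition, so the gold test is applied first.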

Page 31: de Bruijn Graph Construction from Combination of Short and Long Reads

Constructing the A-Bruijn graph on solid regions in the draft genome

Each solid region containing a landmark is labeled by its landmark position, and each read is broken into a sequence of segments at these landmarks.

Each read is represented as a directed path through the vertices.

Page 32: de Bruijn Graph Construction from Combination of Short and Long Reads

Constructing the A-Bruijn graph on solid regions in the draft genome

To construct the A-Bruijn graph AB(Alignment), all identically labeled vertices are glued together.

Page 33: de Bruijn Graph Construction from Combination of Short and Long Reads

Constructing the A-Bruijn graph on solid regions in the draft genome

The edges between two consecutive landmarks form a necklace.

If a necklace is long (exceeds 100 bp), ABruijn reduces its length by increasing the number of necklaces.

Page 34: de Bruijn Graph Construction from Combination of Short and Long Reads

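Probabilistic model for necklace polishing

A necklace contains read segments:

- $\mathrm{Segments} = \{seg_1, \ldots, seg_m\}$

Find a consensus sequence $\mathrm{Consensus}$ that maximizes

$$\Pr(\mathrm{Segments} \mid \mathrm{Consensus}) = \prod_{i=1}^{m} \Pr(seg_i \mid \mathrm{Consensus})$$

where $\Pr(seg_i \mid \mathrm{Consensus})$ is the product of the match, mismatch, insertion, and deletion rates over all positions.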

Page 35: de Bruijn Graph Construction from Combination of Short and Long Reads

Probabilistic model for necklace polishing

Start from an initial necklace sequence.

Iteratively check whether a mutation exists that increases the likelihood Pr(Segments | Consensus), and select the mutation that yields the maximum increase.

Iterate until convergence.
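The greedy search itself can be sketched independently of the exact likelihood. In the code below the score is a stand-in, not the model from the slides: it uses the negated sum of edit distances between the candidate consensus and the segments in place of log Pr(Segments | Consensus). All function names are hypothetical.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def score(consensus, segments):
    # Stand-in for log Pr(Segments | Consensus): fewer edits means more likely.
    return -sum(edit_distance(consensus, seg) for seg in segments)

def single_mutations(seq, alphabet="ACGT"):
    """All sequences one insertion, deletion, or substitution away from seq."""
    for i in range(len(seq) + 1):
        for c in alphabet:
            yield seq[:i] + c + seq[i:]              # insertion
    for i in range(len(seq)):
        yield seq[:i] + seq[i + 1:]                  # deletion
        for c in alphabet:
            if c != seq[i]:
                yield seq[:i] + c + seq[i + 1:]      # substitution

def polish(initial, segments):
    """Greedy hill climbing: apply the best single mutation until none helps."""
    consensus, best = initial, score(initial, segments)
    while True:
        candidate = max(single_mutations(consensus), key=lambda s: score(s, segments))
        candidate_score = score(candidate, segments)
        if candidate_score <= best:                  # converged
            return consensus
        consensus, best = candidate, candidate_score

polished = polish("AGGT", ["ACGT", "ACGT", "AGGT", "ACGT"])
```

Swapping `score` for a proper alignment-based likelihood would leave the search loop unchanged.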

Page 36: de Bruijn Graph Construction from Combination of Short and Long Reads

Error-correcting Homonucleotide runs

The performance of the probabilistic approach deteriorates when it estimates the lengths of homonucleotide runs.

Thus a homonucleotide likelihood function is introduced based on the statistics of homonucleotide runs.

Page 37: de Bruijn Graph Construction from Combination of Short and Long Reads

Error-correcting Homonucleotide runs

To generate the statistics, an arbitrary set of reads is needed.

Each aligned segment is represented simply as the set of its nucleotide counts.

- For example, AATTACA = 4A1C2T.

After processing all homonucleotide runs in the reference genome, the statistics for all read segments are obtained.
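The count representation is a one-liner with a counter. The function name `count_representation` is hypothetical, and listing nucleotides in alphabetical order is an assumption inferred from the 4A1C2T example.

```python
from collections import Counter

def count_representation(segment):
    """Represent a segment by its nucleotide counts, e.g. AATTACA -> 4A1C2T."""
    counts = Counter(segment)
    return "".join(f"{counts[base]}{base}" for base in sorted(counts))

example = count_representation("AATTACA")
```

Because the order of nucleotides is discarded, segments such as AAAACAA and AAACAAA map to the same representation.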

Page 38: de Bruijn Graph Construction from Combination of Short and Long Reads
Page 39: de Bruijn Graph Construction from Combination of Short and Long Reads

Error-correcting Homonucleotide runs

The frequencies are used for computing the likelihood function as the product of these frequencies for all reads.

To decide on the length of a homonucleotide run, the length of the run that maximizes the likelihood function is selected.

Page 40: de Bruijn Graph Construction from Combination of Short and Long Reads

Error-correcting Homonucleotide runs

For example, Segments = {5A, 6A, 6A, 7A, 6A1C}

- Pr(Segments|6A) = 0.155 × 0.473^2 × 0.1 × 0.02 = 0.00007
- Pr(Segments|7A) = 0.049 × 0.154^2 × 0.418 × 0.022 = 0.00001

So, select AAAAAA over AAAAAAA as the necklace consensus.
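The comparison can be reproduced by multiplying the per-segment frequencies. In this sketch the table `FREQ` holds only the eight frequencies quoted in the example; its layout and the function name `run_likelihood` are assumptions for illustration.

```python
from math import prod

# Frequency of observing each read segment given the true run (from the example).
FREQ = {
    "6A": {"5A": 0.155, "6A": 0.473, "7A": 0.1, "6A1C": 0.02},
    "7A": {"5A": 0.049, "6A": 0.154, "7A": 0.418, "6A1C": 0.022},
}

def run_likelihood(segments, true_run):
    """Pr(Segments | true_run) as the product of per-segment frequencies."""
    return prod(FREQ[true_run][seg] for seg in segments)

segments = ["5A", "6A", "6A", "7A", "6A1C"]
likelihoods = {run: run_likelihood(segments, run) for run in FREQ}
best_run = max(likelihoods, key=likelihoods.get)
```

`best_run` is the run length with the larger likelihood, which is the one chosen as the necklace consensus.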

Page 41: de Bruijn Graph Construction from Combination of Short and Long Reads

Benchmarking

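ABruijn and PBcR were benchmarked against the reference E. coli K12 genome.

ABruijn and PBcR differ from the E. coli K12 reference genome at 2906 and 2925 positions, respectively.

The two assemblies agree with each other at 2871 of these positions, suggesting that the differences there are errors in the reference (or true differences in the sequenced strain) rather than assembly errors.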

Page 42: de Bruijn Graph Construction from Combination of Short and Long Reads

Benchmarking

The comparison focuses on the remaining positions.

Page 43: de Bruijn Graph Construction from Combination of Short and Long Reads

Benchmarking

ABruijn was also used to assemble the ECOLInano dataset.

The assembler described in Loman et al. and ABruijn each assembled the ECOLInano dataset into a single circular contig, with error rates of 1.5% and 1.1%, respectively.

Page 44: de Bruijn Graph Construction from Combination of Short and Long Reads

Potential Scope of Development

Calculate Likelihood Ratio of multiple solid string sets

Page 45: de Bruijn Graph Construction from Combination of Short and Long Reads

Calculate likelihood ratio of multiple solid string sets

Building a probability model:

- Derive solid string sets for similar genomes with known sequences
- Apply the A-Bruijn approach to find the solution
- Find the set that leads to the approximate best solution

Page 46: de Bruijn Graph Construction from Combination of Short and Long Reads

Calculate likelihood ratio of multiple solid string sets

Building a probability model:

- Derive a relation between the optimal set and the long read sequence
- Apply this relation to an unknown genome sequence of a similar type to assign the probabilistic value

Page 47: de Bruijn Graph Construction from Combination of Short and Long Reads

Potential Scope of Development

Applying Bridging Effect

Page 48: de Bruijn Graph Construction from Combination of Short and Long Reads

Applying Bridging Effect

With long reads, the k-mer length is larger.

It is difficult to detect the correct branch.

Page 49: de Bruijn Graph Construction from Combination of Short and Long Reads

Applying Bridging Effect

Apply the short-read process before branching.

Integrate the result with the long read sequence to detect the correct branch.

Page 50: de Bruijn Graph Construction from Combination of Short and Long Reads

Potential Scope of Development

Walk on the Combined Sequence

Page 51: de Bruijn Graph Construction from Combination of Short and Long Reads

Merge Walking

Apply both the short-read and long-read approaches to the read sequences of a known genome.

(Figure labels: result from short-read process, result from long-read process)

Page 52: de Bruijn Graph Construction from Combination of Short and Long Reads

Merge Walking

Find the potentially overlapping sequence.

(Figure labels: sequence from long-read process, sequence from short-read process, overlapping area)

Page 53: de Bruijn Graph Construction from Combination of Short and Long Reads

Merge Walking

Build a set of multiple solutions combining both results.

Each solution in the set must contain the overlapped portion.

(Figure labels: result from short-read process, result from long-read process)

Page 54: de Bruijn Graph Construction from Combination of Short and Long Reads

Merge Walking

Compare each solution with the known genome sequence.

Form a secondary solution set that contains the similar optimal solutions.

Page 55: de Bruijn Graph Construction from Combination of Short and Long Reads

Merge Walking

Align these solutions to the results of both the short-read and long-read approaches.

Detect the overlapped sequence.

Find the characteristics of the related overlapped sequence.

Page 56: de Bruijn Graph Construction from Combination of Short and Long Reads

Merge Walking

For an unknown similar genome sequence, apply the obtained characteristics to form a solution combining both results.

Page 57: de Bruijn Graph Construction from Combination of Short and Long Reads

Thank you