Paul Medvedev Michael Brudno

Maximum Likelihood Genome Assembly. Paul Medvedev, Michael Brudno. Bioinformatics Algorithms. Presented by Md. Tanvir Al Amin, Md. Shaifur Rahman, Khalid Mahmood. Department of Computer Science and Engineering, BUET. *Some of the slides are taken from other sources.

Transcript of Paul Medvedev Michael Brudno

Page 1: Paul  Medvedev Michael  Brudno

Bioinformatics Algorithms

Department of Computer Science and Engineering

BUET

Maximum Likelihood Genome Assembly

Paul Medvedev, Michael Brudno

Presented by Md. Tanvir Al Amin, Md. Shaifur Rahman, Khalid Mahmood

*Some of the slides are taken from other sources

Page 2: Paul  Medvedev Michael  Brudno

04/21/23

Maximum Likelihood Genome Assembly {Medvedev, Brudno}

Computational Genomics

Our genome encodes an enormous amount of information about us: our looks, our size, how our bodies work, our health, our behaviors ... who we are!

gcgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcgaggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggtaggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtgactgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatgctagatgatagctagatgtgatcgatggtaggtaggatggtaggtaaattgatagatgctagatcgtaggtagtagctagatgcagggataaacacacggaggcgagtgatcggtaccgggctgaggtgttagctaatgatgagtacgtatgaggcaggatgagtgacccgatgaggctagatgcgatggatggatcgatgatcgatgcatggtgatgcgatgctagatgatgtgtgtcagtaagtaagcgatgcggctgctgagagcgtaggcccg…….

Page 3: Paul  Medvedev Michael  Brudno

Contributions of the paper

Two-fold. The first: the first exact polynomial-time algorithm for finding the shortest double-stranded genome, given its k-molecule spectrum

This problem was solved for strings, but remained open for molecules

Page 4: Paul  Medvedev Michael  Brudno

Contributions of the paper

The second: oppose the idea of the shortest genome, because it over-collapses repeats

Instead, propose a new objective:

A maximum likelihood framework for assembling the genome that is most likely the source of the reads.

Page 5: Paul  Medvedev Michael  Brudno

Contributions of the paper

Maximum likelihood framework

Assumes perfect reads

Uniform distribution

Advantage of high coverage (NGS)

Estimate copy counts of repeats

Combine with matepair data

Reads => Contigs

Page 6: Paul  Medvedev Michael  Brudno

Outline

Whole Genome Shotgun Assembly

Review of Related Work

The Medvedev-Brudno Method

Bidirected Overlap Graph

Adjustments to the Standard Min-cost Biflow Problem

Maximizing the Global Read-Count Likelihood

Efficiently Solving a Min-cost Biflow

Flow to Contigs

Conflict node resolution

Results

Discussion

Page 8: Paul  Medvedev Michael  Brudno

Whole Genome Shotgun Sequencing

[Figure: pipeline DNA → SEQUENCER → reads → ASSEMBLER → contigs → FINISHING → sequence]

Sanger vs. NGS

Problems in Assembly: Sequencing Errors, Unknown Orientation, Incomplete Coverage, Repeats

Page 9: Paul  Medvedev Michael  Brudno

Whole Genome Shotgun Sequencing

Break genome into shotgun-sized fragments and sequence

Match the overlapping regions of contiguous sequences

Demonstrated by Celera Genomics to be feasible for whole genome assembly

Sequenced the human genome at 1/10th the cost of the public Human Genome Project

Page 10: Paul  Medvedev Michael  Brudno

Whole Genome Assembly

Next Generation Sequencing (NGS)

Improved speed and cost-effectiveness relative to the other methods ...

... but much shorter read length (25-200 bp)

Only proven on resequencing projects, i.e. where a reference genome is already available

Poses significant challenges to the problem of de novo genome assembly: determination of a completely unknown genome.

Page 11: Paul  Medvedev Michael  Brudno

Assemblers

Previous (Sanger) Assemblers

NGS Assemblers

SSAKE (Warren et al. 2007)

VCAKE (Jeck et al. 2007)

SHARCGS (Dohm et al. 2007)

Shorty (Chen and Skiena 2007)

ALLPATHS (Butler et al. 2008)

Edena (Hernandez et al. 2008)

Euler-(U)SR (Chaisson and Pevzner 2008, 2009)

Velvet (Zerbino and Birney 2008)

Page 13: Paul  Medvedev Michael  Brudno

Theoretical view

Input: a set of strings over {A,C,G,T}, called reads

Output: a common superstring of the reads.

{TACAT, CATAC, ACGTAC} → TACATACGTAC

Initially: Shortest Common Superstring (SCS)

NP-hard [Gallant et al 1980]

Over-collapsing of repeats

Can be found using a TSP solver

de Bruijn graphs [Pevzner, Tang, Waterman 01]

string graphs [Myers 05]

Both formulations are NP-hard.

Page 14: Paul  Medvedev Michael  Brudno

String graph (Myers)

Represent reads as vertices, and read overlaps as edges

Remove redundant edges

Establish edge constraints:

Unique? (flow is exactly one)

Required? (min. flow is 1)

Optional? (min. flow is 0)

Find shortest walk

Page 15: Paul  Medvedev Michael  Brudno

EULER assembler (Pevzner, Tang and Waterman)

Represent reads as edges and overlaps as vertices in a de Bruijn graph

Assembly can be efficiently solved as an Eulerian Path Problem: each edge must be visited exactly once

Repeats dealt with by using multiple edges for a single repeat read

Page 16: Paul  Medvedev Michael  Brudno

Overlap Graph

Nodes are reads

Edges are overlaps

Weights are lengths of the non-overlapping prefix

TSP tour is SCS

Example: {TACAT, CATAC, ACGTAC} → TACATACGTAC

[Figure: overlap graph on the nodes ACGTAC, CATAC, TACAT with edge weights 2, 2, 3, 3, 5, 5]
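The edge weights of this overlap graph can be recomputed with a few lines of Python. The sketch below is only an illustration: the helper names (`overlap`, `merge`, `prefix_weights`, `shortest_superstring`) are hypothetical, and the exhaustive search over read orders is an assumption that is feasible only for a handful of reads, not a TSP solver. Under these assumptions the search returns a 10-character superstring, one character shorter than TACATACGTAC, which is itself a valid common superstring of the three reads.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def merge(s, r):
    """Append read r to string s, reusing the longest suffix/prefix overlap."""
    if r in s:
        return s
    return s + r[overlap(s, r):]

def prefix_weights(reads):
    """Edge weight a -> b is the length of the non-overlapping prefix of a."""
    return {(a, b): len(a) - overlap(a, b)
            for a in reads for b in reads if a != b}

def shortest_superstring(reads):
    """Brute force over all read orders (illustration only, factorial time)."""
    best = None
    for order in permutations(reads):
        s = order[0]
        for r in order[1:]:
            s = merge(s, r)
        if best is None or len(best) > len(s):
            best = s
    return best

reads = ["TACAT", "CATAC", "ACGTAC"]
weights = prefix_weights(reads)
scs = shortest_superstring(reads)
```

The six weights come out as 2, 2, 3, 3, 5, 5, matching the numbers on the figure.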

Page 17: Paul  Medvedev Michael  Brudno

Why Shortest CS?

DNA is full of repeats: identical and nearly identical copies that appear multiple times

The Alu repeat is about 300 bases long and is present about 1,000,000 times in the human genome

The SCS approach “over-collapses” the repeats: they are present only once in the answer

Solution: model repeats explicitly through either de Bruijn graphs or string graphs

Maybe this will also become tractable?

Page 18: Paul  Medvedev Michael  Brudno

De Bruijn Graphs

Nodes are (k-1)-mers

Edges are k-mers

The set of k-mers is called a k-spectrum

Finding the shortest string with a given k-spectrum is equivalent to the Chinese Postman problem

{AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC}

[Figure: de Bruijn graph on the nodes CA, GC, AG, TC, AT, TT]

Pevzner 1989
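As a minimal sketch of the construction, the Python below turns the 3-mer spectrum {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} into de Bruijn edges and reports which nodes are unbalanced. The function name `de_bruijn` and the "out-degree minus in-degree" balance convention are assumptions made for illustration.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    edges = [(k[:-1], k[1:]) for k in kmers]
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for u, v in edges:
        balance[u] += 1
        balance[v] -= 1
    return edges, dict(balance)

spectrum = ["AGC", "ATC", "ATT", "CAG", "CAT", "GCA", "TCA", "TTC"]
edges, balance = de_bruijn(spectrum)
unbalanced = {node: b for node, b in balance.items() if b != 0}
```

Only AT (one extra outgoing edge) and TC (one extra incoming edge) are unbalanced, so the graph has an Eulerian path but no Eulerian circuit until some edges are duplicated.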

Page 19: Paul  Medvedev Michael  Brudno

De Bruijn Graphs with Walks

Nodes are (k-1)-mers

Edges are k-mers

Reads are walks

Goal: find a superwalk (one that includes all the walks)

No polynomial-time algorithm is known: De Bruijn Superwalk is NP-hard

{AGC, ATTCA, CATT, GCAG, ATG}

[Figure: de Bruijn graph on the nodes CA, GC, AG, TC, AT, TT]

Pevzner et al 2001

Page 20: Paul  Medvedev Michael  Brudno

Chinese Postman Tours

Solving Chinese Postman: an Eulerian tour is a solution

Eulerization: make a graph Eulerian

Can be done with min cost flow:

Unbalanced nodes are sources/sinks

Duplicate all edges used in the flow

Pevzner 1989

{AGC, ATTCA, CATT, GCAG, ATG}

[Figure: de Bruijn graph on the nodes CA, GC, AG, TC, AT, TT]
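The Eulerization step can be sketched for the simplest possible case. The code below uses the 3-mer spectrum {AGC, ATC, ATT, CAG, CAT, GCA, TCA, TTC} from the De Bruijn Graphs slide and assumes exactly one node with surplus incoming edges and one with surplus outgoing edges, so a single breadth-first shortest path replaces the general min cost flow; every edge adds one character, so path length equals cost. The function name `eulerize_and_tour` is hypothetical, and Hierholzer's algorithm is used here as one standard way to extract the tour.

```python
from collections import defaultdict, deque

def eulerize_and_tour(kmers):
    edges = [(k[:-1], k[1:]) for k in kmers]
    out_adj = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_adj[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    sources = [n for n, b in balance.items() if b == -1]  # surplus incoming
    sinks = [n for n, b in balance.items() if b == 1]     # surplus outgoing
    if sources:
        # Assumption: a single unbalanced pair; the general case needs min cost flow.
        assert len(sources) == 1 and len(sinks) == 1
        src, dst = sources[0], sinks[0]
        parent = {src: None}
        queue = deque([src])
        while queue:  # breadth-first search for a shortest src -> dst path
            u = queue.popleft()
            for v in out_adj[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        node = dst
        while parent[node] is not None:  # duplicate every edge on that path
            edges.append((parent[node], node))
            node = parent[node]
    # Hierholzer's algorithm on the now balanced multigraph.
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    stack, circuit = [edges[0][0]], []
    while stack:
        u = stack[-1]
        if adj[u]:
            stack.append(adj[u].pop())
        else:
            circuit.append(stack.pop())
    circuit.reverse()
    return circuit

spectrum = ["AGC", "ATC", "ATT", "CAG", "CAT", "GCA", "TCA", "TTC"]
tour = eulerize_and_tour(spectrum)
spelled = [u + v[-1] for u, v in zip(tour, tour[1:])]
```

The tour uses 10 edges: the 8 original k-mers plus the duplicated TCA and CAT edges on the path TC → CA → AT.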

Page 21: Paul  Medvedev Michael  Brudno

DNA is not a String

[Figure: de Bruijn graphs for the two strands, on the nodes CC, CA, TG, GC, AA, TT, AT, AC, GG, GT]

Forward strand, 5’ ATTGCCAAC 3’: {AAC, ATT, CAA, CCA, GCC, TGC, TTG}

Reverse strand, 5’ GTTGGCAAT 3’: {GTT, AAT, TTG, TGG, GGC, GCA, CAA}

• The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest string with the given k-spectrum [Pevzner 1989]
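The relationship between the two strands can be checked directly. This is a small sketch, not the authors' code: `reverse_complement`, `kmers` and `k_molecules` are hypothetical helpers, and representing a k-molecule by the lexicographically smaller of a k-mer and its reverse complement is an assumed convention.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(s):
    """Complement every base, then reverse, to read the opposite strand 5' to 3'."""
    return s.translate(COMPLEMENT)[::-1]

def kmers(s, k):
    """The k-spectrum of a (linear) string."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def k_molecules(s, k):
    """Canonical form (assumption): the smaller of a k-mer and its reverse complement."""
    return {min(m, reverse_complement(m)) for m in kmers(s, k)}

forward = "ATTGCCAAC"
reverse = reverse_complement(forward)
```

Both strands yield the same set of canonical 3-molecules, which is why a single bidirected graph can stand in for the two directed graphs.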

Page 22: Paul  Medvedev Michael  Brudno

Complexity of CPT

Undirected graph: polynomial time (equivalent to matching)

Directed graph: polynomial time (equivalent to network flow)

Mixed graph: NP-hard

Bidirected graph: polynomial time (equivalent to bidirected flow)

Page 23: Paul  Medvedev Michael  Brudno

Modeling Double Strandedness

Kececioglu 91, Kececioglu-Myers 95

Page 24: Paul  Medvedev Michael  Brudno

Modeling Double-Strandedness

How can two DNA molecules overlap?

[Figure: the three ways two double-stranded molecules can overlap, shown for -GTT+AAC with -AAG+CTT, with -CGA+TCG, and with -CCA+TGG; double-stranded example 5’ ATTGCCAAC 3’ / 5’ GTTGGCAAT 3’]

Kececioglu 1992

Page 25: Paul  Medvedev Michael  Brudno

Walks in bidirected graphs

A walk has to “match” directions at each node.

Take the node +AA/TT- as an example.

Edge orientations correspond to strands

A path can use a node in both orientations

[Figure: bidirected graph on the nodes -AT+AT, -TT+AA, -GT+AC, -GC+GC, -TG+CA, -GG+CC]

Page 26: Paul  Medvedev Michael  Brudno

Rules for Matching Directions

When we walk through a node, we can:

Come in using an in-arrow, then leave using an out-arrow. This is forward, so read the “+” strand, i.e. AA here.

Come in using an out-arrow, then leave using an in-arrow. This is backward, so read the “-” strand, i.e. TT here.

[Figure: bidirected graph on the nodes -AT+AT, -TT+AA, -GT+AC, -GC+GC, -TG+CA, -GG+CC]

Page 27: Paul  Medvedev Michael  Brudno

Bidirected Graphs

So what does this walk correspond to?

• GGCAAT

• ATTGCC

[Figure: a walk highlighted in the bidirected graph on the nodes -AT+AT, -TT+AA, -GT+AC, -GC+GC, -TG+CA, -GG+CC]

Page 28: Paul  Medvedev Michael  Brudno

Bidirected de Bruijn Graphs

The shortest walk that visits every edge at least once (a Chinese postman tour) is the shortest DNA molecule with the given k-spectrum

[Figure: bidirected de Bruijn graph on the nodes -AT+AT, -TT+AA, -GT+AC, -GC+GC, -TG+CA, -GG+CC, shown twice]

Forward strand, 5’ ATTGCCAAC 3’: {AAC, ATT, CAA, CCA, GCC, TGC, TTG}

Reverse strand, 5’ GTTGGCAAT 3’: {GTT, AAT, TTG, TGG, GGC, GCA, CAA}

Page 29: Paul  Medvedev Michael  Brudno

Representing Bidirected graphs

Page 30: Paul  Medvedev Michael  Brudno

Motivation: Overlap Graphs

Several downsides of the de Bruijn approach:

Division into k-mers is arbitrary

Very sensitive to sequencing errors

Not memory efficient (one node per k-mer)

Goal:

One node per read (or better)

No division into k-mers

Flexibility in the presence of sequencing errors

Myers 2005

Page 31: Paul  Medvedev Michael  Brudno

How To Build An Overlap Graph (1)

{ACGTAC, CATAC, TACAT}

Nodes are reads

Edges are overlaps

Weights are lengths of the non-overlapping prefix

Transitively inferable overlaps

[Figure: overlap graph on the nodes ACGTAC, TACAT, CATAC with edge weights 3, 5, 3, 2, 2; resulting string TACATACGTAC]

Page 32: Paul  Medvedev Michael  Brudno

Bidirected Overlap Graph

In this work, the authors use a bidirected overlap graph.

In a bidirected overlap graph, each vertex is a double-stranded read

Edges represent read overlaps

Page 33: Paul  Medvedev Michael  Brudno

Bidirected Overlap Graph

There are three possible ways that two double-stranded reads can overlap (corresponding to the three types of edges)

Suppose we have two reads r1 and r2

Each read can be oriented to the left or to the right

The three possible overlaps are:

i) Both reads point in the same direction (both can point left, or both can point right; it is the same overlap either way)

ii) r1 points left and r2 points right

iii) r1 points right and r2 points left

Page 34: Paul  Medvedev Michael  Brudno

Bidirected Overlap Graph

The overlap graph is constructed by placing an edge between two reads if they overlap by at least a minimum number of characters, o_min

Question: How is o_min determined?

Then perform transitive edge reduction: remove overlaps covered by two shorter overlaps
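The two construction steps can be sketched on ordinary single-stranded strings. This is a simplification made for illustration: it ignores the bidirected edge types, the reads and the threshold `o_min = 3` are invented, and the function names are hypothetical. An edge a → c is treated as transitively inferable when some b gives edges a → b and b → c whose prefix weights add up to the weight of a → c.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def build_overlap_graph(reads, o_min):
    """Edge a -> b (weight = non-overlapping prefix of a) if overlap is at least o_min."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b and overlap(a, b) >= o_min:
                graph[(a, b)] = len(a) - overlap(a, b)
    return graph

def transitive_reduction(graph):
    """Drop a -> c when a -> b -> c spells the same string as a -> c."""
    redundant = set()
    for (a, b), w_ab in graph.items():
        for (b2, c), w_bc in graph.items():
            if b2 == b and (a, c) in graph and w_ab + w_bc == graph[(a, c)]:
                redundant.add((a, c))
    return {edge: w for edge, w in graph.items() if edge not in redundant}

reads = ["ACGTA", "CGTAC", "GTACG"]  # hypothetical error-free reads
graph = build_overlap_graph(reads, o_min=3)
reduced = transitive_reduction(graph)
```

Here the overlap ACGTA → GTACG is covered by ACGTA → CGTAC → GTACG and is removed, while the other three edges survive.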

Page 35: Paul  Medvedev Michael  Brudno

Observation

A bidirected graph contains an Eulerian circuit if and only if it is connected and balanced.

Page 36: Paul  Medvedev Michael  Brudno

Chinese postman Problem on Bidirected Graphs

Page 37: Paul  Medvedev Michael  Brudno

Chinese postman Problem on Bidirected Graphs

Let G be a weighted bidirected graph.

There exists a circuit of weight i if and only if there exists an Eulerian extension of weight i.

G has a circuit if and only if it is strongly connected.

The minimum weight Eulerian extension of G has at most 2|E||V| edges.

Page 38: Paul  Medvedev Michael  Brudno

Chinese postman Problem on Bidirected Graphs

The running time of Algorithm 1 is $O(|E|^2 \log(|V|) \log(|E|))$.

Gabow’s algorithm runs in $O(|E|^2 \log(|V|) \log(\max_e u(e)))$, where $u$ is the flow upper bound function

$f(e) \le 2|E||V|$ for every edge $e$, so we can safely let $u(e) = 2|E||V|$

Page 39: Paul  Medvedev Michael  Brudno

Chinese postman Problem on Bidirected Graphs

Hence the theorem is proved:

Given a set of k-molecules S, we can find the shortest (k-1)-circular DNA molecule whose k-molecule spectrum is S in time $O(|S|^2 \log^2(|S|))$.

This is a polynomial time algorithm that explicitly handles double-strandedness

The first main result of this paper.

Page 41: Paul  Medvedev Michael  Brudno

Sequence assembly using NGS

Several methods are available now (e.g. SSAKE, VCAKE, SHARCGS, etc.)

All of these assume that the length of the assembled genome must be minimized

This results in over-collapsing of repeats

Given the ubiquity of repeats in eukaryotic genomes, the authors considered this a poor assumption

Page 42: Paul  Medvedev Michael  Brudno

Goal of an Assembler

What should the goal of an assembler be? The shortest string?

Problem of over-collapse

Page 43: Paul  Medvedev Michael  Brudno

Maximum Likelihood Genome Assembly

Change the goal of sequence assembly

Maximize the likelihood that the resultant genome was the source of the given reads

Take advantage of the high coverage of NGS to statistically estimate the copy-count of each read: identify and quantify repeats

Maximizing the likelihood of observed read frequencies can be cast as a minimum cost bidirected flow (biflow) problem

This allows the solution to be obtained with an off-the-shelf network flow solver

The authors claim 99.99% accuracy

Page 44: Paul  Medvedev Michael  Brudno

Maximum Likelihood Genome Assembly

The second important aspect is the use of matepair information for joining contigs

Other systems look for all paths between mated reads

The proposed method looks only for short paths between some pairs of reads

Question: How to decide the upper bound for these “short paths”? And how to decide which pairs of reads to examine?

Page 46: Paul  Medvedev Michael  Brudno

Adjustments to the Standard Min-cost Biflow Problem

Standard Min-cost Biflow Problem

Set upper and lower flow bounds on each edge

The flow function $f : E \to \mathbb{N}$ must obey the constraint $l(e) \le f(e) \le u(e)$ for each edge $e$

For each vertex, the incoming flow is balanced with the outgoing flow

Objective: find the flow that minimizes $\sum_{e \in E} c_e f(e)$

Page 47: Paul  Medvedev Michael  Brudno

Adjustments to the Standard Min-cost Biflow Problem

Medvedev-Brudno Min-cost Biflow Problem

Upper and lower flow bounds on vertices as well

Accomplished by splitting every vertex v into two: v+ and v-

Page 48: Paul  Medvedev Michael  Brudno

Adjustments to the Standard Min-cost Biflow Problem

v- serves as the “incoming” vertex, and inherits v’s incoming edges

v+ serves as the “outgoing” vertex, and inherits v’s outgoing edges

Finally, add one edge between v- and v+ and assign it the upper and lower flow bounds for v
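The vertex-splitting trick is easiest to see on a plain directed graph. The sketch below is an illustration under that simplifying assumption (it ignores bidirected edge types); the function name `split_vertices` and the tuple encoding of v- and v+ are hypothetical.

```python
INF = float("inf")

def split_vertices(edges, vertex_bounds):
    """Turn vertex bounds into edge bounds.

    edges: list of directed (u, v) pairs.
    vertex_bounds: {v: (lower, upper)}.
    Returns a list of (tail, head, lower, upper) edges over split vertices.
    """
    result = []
    for v, (lower, upper) in vertex_bounds.items():
        # The internal edge v- -> v+ carries the bounds that belonged to v.
        result.append(((v, "-"), (v, "+"), lower, upper))
    for u, v in edges:
        # Original edges leave u+ and enter v-, with bounds 0 and infinity.
        result.append(((u, "+"), (v, "-"), 0, INF))
    return result

split = split_vertices([("a", "b")], {"a": (1, INF), "b": (1, INF)})
```

Any flow through the original vertex must now cross its internal edge, so an ordinary edge-bounded solver enforces the vertex bounds.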

Page 49: Paul  Medvedev Michael  Brudno

Adjustments to the Standard Min-cost Biflow Problem

Second variation: represent the cost c_e as a convex function

A function is convex if the set of points on or above its graph is a convex set

A convex set is an area where, for every pair of points within that area, every point on the straight line segment connecting those points also lies within that area

Page 50: Paul  Medvedev Michael  Brudno

Convex Function

Page 51: Paul  Medvedev Michael  Brudno

Adjustments to the Standard Min-cost Biflow Problem

An area that is not convex would have some sort of concave portion that would contradict the above property of convex sets

In the overlap graph, convex functions are modelled with piecewise-linear approximations, allowing the flow to be solved as a linear min-cost flow problem
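One common way to linearize a convex cost on integer flows is to replace an edge by parallel unit-capacity arcs whose costs are the successive marginal costs. The sketch below illustrates that idea under stated assumptions: the cost `(d - 3) ** 2` is an invented example, `unit_arcs` is a hypothetical helper, and the slides do not say that this is exactly how the authors build their approximation.

```python
def unit_arcs(cost, lower, upper):
    """Marginal cost of each extra unit of flow between lower and upper."""
    return [cost(d + 1) - cost(d) for d in range(lower, upper)]

def cost(d):
    """Invented convex cost with its minimum at d = 3."""
    return (d - 3) ** 2

arcs = unit_arcs(cost, 0, 6)
# Sending m units over the cheapest m arcs reproduces cost(m) - cost(0).
cost_of_four_units = cost(0) + sum(arcs[:4])
```

Because the cost is convex, the marginal costs are nondecreasing, so a linear min-cost flow solver fills the arcs in order and the total matches the original function.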

Page 52: Paul  Medvedev Michael  Brudno

Adjustments to the Standard Min-cost Biflow Problem

Supersource and supersink added to convert flow problem into circulation problem

Each vertex has a lower bound of 1, since each read must appear in the finished genome at least once

Edge bounds are set to 0 (lower bound) and infinity (upper bound)

Page 53: Paul  Medvedev Michael  Brudno

Adjustments to the Standard Min-cost Biflow Problem

Prohibitively large cost on the edge leading from the supersource and the edge leading to the supersink to ensure that the assembly uses the smallest number of contigs possible

Flow through each vertex represents number of times it appears in the assembled genome

Page 54: Paul  Medvedev Michael  Brudno

Supersource and Supersink

Page 55: Paul  Medvedev Michael  Brudno

Maximum Likelihood Framework

Let $D$ be a circular genome of length $N(D)$

$d_i$ = number of times the k-molecule $i$ appears in $D$

Suppose $i$ = ACGT

For simplicity, they are drawn as strings instead of molecules

[Figure: the k-molecule A C G T drawn as a string]

Page 56: Paul  Medvedev Michael  Brudno

Maximum Likelihood Framework

Random trial: sample a position and take a k-molecule

What is the probability that the k-molecule is $i$?

$\frac{d_i}{N(D)}$

For simplicity, they are drawn as strings instead of molecules

Page 57: Paul  Medvedev Michael  Brudno

Maximum Likelihood Framework

Sample uniformly

We call it a success if we get $i$

So, $p$ = success probability = $\frac{d_i}{N(D)}$

We do the experiment $n$ times

Let $X_i$ be the random variable indicating the number of times we get $i$

What is the distribution of $X_i$?

Binomial Distribution

Page 58: Paul  Medvedev Michael  Brudno

Maximum Likelihood Framework

How many options are there for $i$?

There are $4^k$ possibilities

Hence $4^k$ random variables

Suppose k = 3: $X_1, X_2, X_3, X_4, X_5, X_6, \ldots, X_{64}$

They are $X_{AAA}, X_{AAC}, X_{AAG}, X_{AAT}, \ldots, X_{TTT}$

Page 59: Paul  Medvedev Michael  Brudno

Maximizing the Global Read-Count Likelihood

Taking all random variables over n experiments: what is the probability that AAA comes $x_{AAA}$ times, and AAC comes $x_{AAC}$ times, ..., and CGT comes $x_{CGT}$ times, ..., and TTT comes $x_{TTT}$ times?

Each random variable for every possible k-mer has a binomial distribution. Their joint distribution is the following multinomial distribution:

$$P(X_1 = x_1, X_2 = x_2, \ldots, X_{4^k} = x_{4^k}) = \frac{n!}{\prod_i x_i!} \prod_i \left(\frac{d_i}{N(D)}\right)^{x_i}$$
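To make the multinomial expression concrete, it can be evaluated in log space for a toy case. This is a minimal sketch: the function name `log_multinomial` is hypothetical, the counts are invented, and the genome length is taken as the sum of the copy counts, which assumes every sampled position yields one of the listed k-molecules.

```python
from math import exp, lgamma, log

def log_multinomial(x, d):
    """log P(X_1 = x_1, ..., X_m = x_m) for copy counts d and observed counts x."""
    n = sum(x)          # number of trials (reads)
    genome_len = sum(d)  # N(D), assuming the d_i account for every position
    log_coeff = lgamma(n + 1) - sum(lgamma(xi + 1) for xi in x)
    log_probs = sum(xi * log(di / genome_len)
                    for xi, di in zip(x, d) if xi > 0)
    return log_coeff + log_probs

# Two equally frequent k-molecules, two reads, one of each: probability 1/2.
p_toy = exp(log_multinomial([1, 1], [1, 1]))
# Copy counts that match the observed 3:1 ratio are more likely than the reverse.
better = log_multinomial([3, 1], [3, 1]) > log_multinomial([3, 1], [1, 3])
```

Working with logarithms avoids overflow in the factorials once n reaches realistic read counts.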

Page 60: Paul  Medvedev Michael  Brudno

Maximum Likelihood Framework

D is not known, but the results of the n trials are known!

The probability can be considered as the likelihood of the parameters of the distribution, $d_i$, given the outcome of the trials, $x_i$. This is called the Global Read-count Likelihood:

$$L(d_1, \ldots, d_{4^k} \mid x_1, \ldots, x_{4^k})$$

Page 61: Paul  Medvedev Michael  Brudno

Maximizing the Global Read-Count Likelihood

The goal is to maximize L, or, equivalently, minimize the negative log of L

$$L(d_1, \ldots, d_{4^k} \mid x_1, \ldots, x_{4^k}) = \frac{n!}{\prod_i x_i!} \prod_i \left(\frac{d_i}{N(D)}\right)^{x_i}$$

Page 62: Paul  Medvedev Michael  Brudno

Maximizing the Global Read-Count Likelihood

To translate this problem into a convex min-cost biflow problem, we need convex functions $c_i$ for each k-mer

Problem: the $X_i$ random variables are not independent, because we have the constraint $\sum_i d_i = N(D)$

We need something like: $-\log L = \sum_i c_i(d_i)$

Page 63: Paul  Medvedev Michael  Brudno

Maximizing the Global Read-Count Likelihood

But, as the number of trials goes to infinity, the Xi random variables become independent.

In NGS techniques, the number of trials is usually large enough to warrant the approximation of the multinomial distribution as the product of the binomial distributions for each Xi

Page 64: Paul  Medvedev Michael  Brudno

Maximizing the Global Read-Count Likelihood

In this binomial approximation, the genome length N(G) is constant, and independent of the sampling frequencies

Therefore, use N instead, which is the actual length of the genome G

Page 65: Paul  Medvedev Michael  Brudno

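Maximizing the Global Read-Count Likelihood

New approximation of L:

$$L(d_1, \ldots, d_{4^k} \mid x_1, \ldots, x_{4^k}) \approx \prod_i P(X_i = x_i) = \prod_i \binom{n}{x_i} \left(\frac{d_i}{N}\right)^{x_i} \left(1 - \frac{d_i}{N}\right)^{n - x_i}$$

Now

$$-\log L = K + \sum_i c_i(d_i)$$

where K does not depend on the $d_i$, and

$$c_i(d_i) = -x_i \log d_i - (n - x_i) \log(N - d_i)$$

The $c_i$ are used as the convex functions for the vertices of the min-cost biflow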
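A quick numerical check shows how the per-k-mer cost $c_i(d_i) = -x_i \log d_i - (n - x_i) \log(N - d_i)$ behaves. The values of n, N and x below are invented for illustration, and the function name `c` simply mirrors the symbol in the formula.

```python
from math import log

def c(d, x, n, N):
    """Negative log of the binomial term, dropping parts that do not depend on d."""
    return -x * log(d) - (n - x) * log(N - d)

n, N, x = 1000, 100, 30  # hypothetical: 1000 reads, genome length 100, 30 hits
costs = {d: c(d, x, n, N) for d in range(1, N)}
best = min(costs, key=costs.get)
# Discrete second differences: positive everywhere for a strictly convex function.
second_diffs = [costs[d - 1] + costs[d + 1] - 2 * costs[d]
                for d in range(2, N - 1)]
```

With these numbers the cost is minimized at d = 3, which equals x·N/n, the copy count that makes the expected number of hits match the observed one.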

Page 66: Paul  Medvedev Michael  Brudno

Outline

Whole Genome Assembly

Review of Related Work

The Medvedev-Brudno Method

Methods: Bidirected Overlap Graph

Methods: Adjustments to the Standard Min-cost Biflow Problem

Methods: Maximizing the Global Read-Count Likelihood

Methods: Efficiently Solving a Min-cost Biflow

Methods: Show Me the Contigs

Results

Discussion

Page 67: Paul  Medvedev Michael  Brudno

Efficiently Solving a Min-Cost Biflow

Problem: there is no existing efficient implementation of a min-cost biflow algorithm

Gabow (1983) presented a polynomial time algorithm for min cost biflow, but it is difficult to implement.

The authors did not find any existing implementation either.

The authors solve the problem by converting the bidirected flow problem into a directed flow problem.

Page 68: Paul  Medvedev Michael  Brudno

Efficiently Solving a Min-Cost Biflow

Directed network flow is solved by reducing the problem to a linear program (LP)

Use an edge incidence matrix $I$ of size $|V| \times |E|$ derived from the overlap graph

If cell $I_{m,n}$ has a value of 1, then edge n is an in-edge for vertex m

If the value is -1, n is an out-edge

0 means edge n is not incident to vertex m

Use the incidence matrix as the constraint matrix for the LP: an optimal LP solution corresponds to a minimum cost flow
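For a small directed graph the incidence matrix can be written out directly. The sketch below follows the sign convention stated on the slide (1 for an in-edge, -1 for an out-edge); the function name `incidence_matrix` and the three-vertex example are assumptions made for illustration.

```python
def incidence_matrix(vertices, edges):
    """Rows are vertices, columns are edges; 1 = in-edge, -1 = out-edge, 0 = not incident."""
    row = {v: m for m, v in enumerate(vertices)}
    matrix = [[0] * len(edges) for _ in vertices]
    for n, (tail, head) in enumerate(edges):
        matrix[row[tail]][n] -= 1  # edge n leaves its tail
        matrix[row[head]][n] += 1  # edge n enters its head
    return matrix

vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a")]
I = incidence_matrix(vertices, edges)
column_sums = [sum(I[m][n] for m in range(len(vertices)))
               for n in range(len(edges))]
```

Multiplying this matrix by a flow vector gives the net inflow at every vertex, so the flow conservation constraint of the LP is simply "matrix times flow equals zero".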

Page 69: Paul  Medvedev Michael  Brudno

Efficiently Solving a Min-Cost Biflow

For a directed graph, the incidence matrix is Totally Unimodular (TU)

This leads to linear programs that always have integer solutions.

It makes it possible to produce an integral solution with LP, rather than resorting to Integer Programming, which is NP-hard

Page 70: Paul  Medvedev Michael  Brudno

Efficiently Solving a Min-Cost Biflow

In a bidirected graph it is possible for +2 or -2 to appear in the incidence matrix, since two in-edges/out-edges can enter a single vertex

The incidence matrix is actually a binet matrix

The optimal LP solution for binet matrices is guaranteed to be half-integral (i.e. the solution values are multiples of 0.5)

Hochbaum 2004

Page 71: Paul  Medvedev Michael  Brudno

Efficiently Solving a Min-Cost Biflow

Page 72: Paul  Medvedev Michael  Brudno

Efficiently Solving a Min-Cost Biflow

Monotonization Procedure

For every vertex v in the bidirected graph, replace it with two vertices v1 and v2 in the new graph

Each of v’s in-edges is replaced with two edges, one of which points into v1, while the other points out of v2

Likewise, each of v’s out-edges is replaced with two edges, one of which points out of v1, while the other points into v2

Bounds and costs from the original graph are transferred to the new graph, and the solution on the new graph is transferred back to the original graph

Hochbaum 2004

Page 73: Paul  Medvedev Michael  Brudno

Efficiently Solving a Min Cost Flow

The problem can now be solved with off-the-shelf software

After finding the min cost flow in the directed graph, transfer the results to the original bidirected graph by adding the flows through the pairs of twin edges and dividing by two.

Hence, the optimal result is half-integral and the monotonized flow is at worst a 2-approximation to the optimal integral flow.

Hochbaum 2004
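The monotonization rules and the transfer back to the bidirected graph can be sketched as follows. The encoding is an assumption made for illustration: a bidirected edge is written as `(u, u_side, v, v_side)`, where each side is `"in"` or `"out"` depending on whether the edge is an in-edge or an out-edge at that endpoint, and the copies v1 and v2 are written `(v, 1)` and `(v, 2)`. The function names are hypothetical.

```python
def monotonize(bidirected_edges):
    """Replace every bidirected edge by two directed twin edges."""
    directed = []
    for idx, (u, u_side, v, v_side) in enumerate(bidirected_edges):
        # First twin runs from a copy of u to a copy of v:
        # out-edges leave copy 1, in-edges enter copy 1.
        tail = (u, 1) if u_side == "out" else (u, 2)
        head = (v, 1) if v_side == "in" else (v, 2)
        directed.append((tail, head, idx))
        # Second twin is the mirror image, from a copy of v to a copy of u.
        tail = (v, 1) if v_side == "out" else (v, 2)
        head = (u, 1) if u_side == "in" else (u, 2)
        directed.append((tail, head, idx))
    return directed

def fold_back(directed, flows):
    """Add the flows on each pair of twin edges and divide by two."""
    totals = {}
    for (_, _, idx), flow in zip(directed, flows):
        totals[idx] = totals.get(idx, 0) + flow
    return {idx: total / 2 for idx, total in totals.items()}

edges = [("a", "out", "b", "in"),  # an ordinary directed edge a -> b
         ("a", "in", "b", "in")]   # an edge that is an in-edge at both ends
twins = monotonize(edges)
flow = fold_back(twins, [1, 2, 1, 1])  # hypothetical directed flow values
```

The first edge of the example ends up with flow 1.5, which shows how a half-integral value can arise when the two twins carry different amounts.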

Page 75: Paul  Medvedev Michael  Brudno

Flow to Contigs

The flow has been solved

Now decompose it into a collection of walks, which translate into assembled contigs

The graph is first simplified by removing all edges with a flow of zero

Additional simplifications are possible ...

Page 76: Paul  Medvedev Michael  Brudno

Flow to Contigs

... by removing vertices v where:

There is exactly one edge going into v and one edge leading out of v, and the flow on both edges is the same

There is also a loop with the same flow as the other two edges

v is a split or join vertex, where the flow on the in-edges is the same as that on the out-edges

Page 77: Paul  Medvedev Michael  Brudno

Flow to Contigs

After at most 2|V| of these simplifications, the remaining vertices are conflict vertices: those that did not match the previous criteria
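The first two simplifications (dropping zero-flow edges and contracting vertices with one in-edge and one out-edge of equal flow) can be sketched on a directed graph. This is a simplified illustration, not the authors' procedure: it ignores bidirected orientations, does not implement the loop or split/join rules, and the read names and flows are invented.

```python
def simplify(edges):
    """edges: list of (tail, head, flow). Returns list of (tail, head, flow, path)."""
    work = [(u, v, f, [u, v]) for u, v, f in edges if f > 0]  # drop zero-flow edges
    changed = True
    while changed:
        changed = False
        nodes = {n for u, v, _, _ in work for n in (u, v)}
        for x in nodes:
            ins = [e for e in work if e[1] == x]
            outs = [e for e in work if e[0] == x]
            if (len(ins) == 1 and len(outs) == 1
                    and ins[0] is not outs[0]       # skip self-loops
                    and ins[0][2] == outs[0][2]):   # same flow in and out
                a, b = ins[0], outs[0]
                work.remove(a)
                work.remove(b)
                # Contract x: one edge spelling the concatenated path.
                work.append((a[0], b[1], a[2], a[3] + b[3][1:]))
                changed = True
                break
    return work

edges = [("r1", "r2", 1), ("r2", "r3", 1), ("r3", "r4", 0),
         ("r3", "r5", 1), ("r6", "r5", 1), ("r5", "r7", 2)]
paths = sorted(e[3] for e in simplify(edges))
```

In this example r2 and r3 are contracted into a single path from r1 to r5, while r5 keeps two in-edges and is left for the later rules.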

Page 78: Paul  Medvedev Michael  Brudno

Conflict Node Resolution

Using matepair information

Look for edges at these vertices with opposite orientations supported by matepairs

Use BFS to find all reads within a certain distance from the vertex (in both directions)

We have two sets of vertices, L and R, corresponding to reads that were observed on the incoming side of the vertex and on the outgoing side.

Match those reads that are matepairs.

For those matepairs where one read is on the incoming side and the other is on the outgoing side, find the shortest path between them using Dijkstra’s algorithm

Page 79: Paul  Medvedev Michael  Brudno

Resolving Conflict Nodes with Mate Pairs

[Figure: a conflict node with edges A and B on opposite sides, and mated reads A’ and B’ lying on them]

Does there exist a short path between A’ and B’?

• Dijkstra’s shortest path algorithm (bounded)

• Greedily join edges if they have enough supporting reads.
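A bounded shortest-path search can be sketched with Python's `heapq`. The graph encoding, the function name `bounded_shortest_path`, and the toy distances are assumptions for illustration; the slides do not specify how the bound is chosen.

```python
import heapq

INF = float("inf")

def bounded_shortest_path(graph, source, target, max_dist):
    """Dijkstra's algorithm that never explores beyond max_dist.

    graph: {node: [(neighbour, weight), ...]} with non-negative weights.
    Returns the distance, or None if target is not within max_dist.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue  # stale heap entry
        if u == target:
            return d
        for v, w in graph.get(u, []):
            nd = d + w
            if max_dist >= nd and dist.get(v, INF) > nd:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None

graph = {"A'": [("x", 3), ("y", 10)], "x": [("B'", 4)], "y": [("B'", 1)]}
within = bounded_shortest_path(graph, "A'", "B'", max_dist=10)
too_far = bounded_shortest_path(graph, "A'", "B'", max_dist=5)
```

Pruning at `max_dist` keeps each search local to the neighbourhood of the conflict node, which is what makes it cheap to run for many matepairs.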

Page 80: Paul  Medvedev Michael  Brudno

Greedy Matching

Make note of the number of mates that fall within the expected insert distance

Pairs of in/out edges that have a significant number of matepairs that fall within the insert distance are joined into a common edge

The previous step is repeated until no more edges can be joined in this manner

Graph simplification continues in iterative phases until convergence

Page 82: Paul  Medvedev Michael  Brudno

Results

Generated synthetic reads from the E. coli genome, which has a total length of 4.6 Mbp (megabase pairs).

Simulated matepairs’ distances were uniformly distributed within 10% of the expected insert size

Reads were 25 bp long, and error-free
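The simulation setup above can be sketched roughly as follows. This is not the authors' simulator: the slides do not give the insert size, so the default of 1000 bp is a placeholder, the genome is treated as linear, and the second read is taken from the forward strand rather than reverse-complemented. The function name simulate_matepairs and all parameter names are hypothetical.

```python
import random


def simulate_matepairs(genome, n_pairs, read_len=25, insert=1000, tol=0.10, seed=0):
    """Sample error-free matepairs from a genome string.

    The distance spanned by each pair is drawn uniformly from
    [insert * (1 - tol), insert * (1 + tol)].
    Requires len(genome) to be at least insert * (1 + tol).
    """
    rng = random.Random(seed)
    lo, hi = round(insert * (1 - tol)), round(insert * (1 + tol))
    pairs = []
    for _ in range(n_pairs):
        span = rng.randint(lo, hi)                  # inclusive on both ends
        start = rng.randint(0, len(genome) - span)
        left = genome[start:start + read_len]
        right = genome[start + span - read_len:start + span]
        pairs.append((left, right, span))
    return pairs


# Toy genome; a real run would load the E. coli reference instead.
toy_genome = "".join(random.Random(1).choice("ACGT") for _ in range(5000))
for left, right, span in simulate_matepairs(toy_genome, 3):
    print(left, right, span)
```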

Page 83: Paul  Medvedev Michael  Brudno

Results

Coverage levels tested: 50x, 75x, 100x, and 200x

Minimum overlap length varied between 17 and 21 bp

The authors report that the overall running time of the algorithm is approximately 1 hour on one machine

Question: what kind of machine?
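For a sense of scale, the coverage levels listed above can be turned into approximate read counts. The sketch assumes the usual definition of coverage (number of reads times read length, divided by genome length) and the rounded genome length of 4.6 Mbp from the previous slide; the slides themselves do not state the read counts, and the name reads_needed is hypothetical.

```python
GENOME_LEN = 4_600_000  # bp, rounded E. coli genome length from the slides
READ_LEN = 25           # bp


def reads_needed(coverage, genome_len=GENOME_LEN, read_len=READ_LEN):
    """Number of reads N such that N * read_len / genome_len equals coverage."""
    return coverage * genome_len // read_len


for c in (50, 75, 100, 200):
    print(f"{c}x coverage: about {reads_needed(c):,} reads")
```

Under these assumptions, 50x coverage corresponds to about 9.2 million reads and 200x to about 36.8 million.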

Page 84: Paul  Medvedev Michael  Brudno

Copy Count Results

The authors compared the flow going through every vertex in the overlap graph to the number of times that the corresponding read appears in the original genome.

Page 85: Paul  Medvedev Michael  Brudno

Read Count Results

Compared vertex flow with read frequency in the original genome

High degree of accuracy

Error rate between 10^-4 and 10^-6

Generally a greater tendency to overestimate read frequency

The authors claim only slight improvements beyond 75x coverage, but 200x coverage is fantastically good
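The comparison described above can be sketched as follows. The slides do not define the error rate precisely, so this illustration assumes it is the fraction of reads whose vertex flow differs from the true number of occurrences in the genome; it also counts occurrences on the forward strand only. The names count_occurrences and copy_count_errors are hypothetical.

```python
def count_occurrences(genome, read):
    """Count (possibly overlapping) occurrences of read in genome."""
    count, pos = 0, genome.find(read)
    while pos != -1:
        count += 1
        pos = genome.find(read, pos + 1)
    return count


def copy_count_errors(flow, genome):
    """Compare the flow through each read's vertex with its true copy count."""
    over = under = 0
    for read, f in flow.items():
        truth = count_occurrences(genome, read)
        if f > truth:
            over += 1
        elif f < truth:
            under += 1
    return {"error_rate": (over + under) / len(flow), "over": over, "under": under}


# Toy example: the flow for "CGT" overestimates its single true occurrence.
toy_genome = "ACGACGTACG"
toy_flow = {"ACG": 3, "CGT": 2, "GTA": 1}
print(copy_count_errors(toy_flow, toy_genome))
```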

Page 86: Paul  Medvedev Michael  Brudno

Assembly Results

Take the edges of the graph produced after the conflict node resolution and generate the sequences they spell out

Compute N50: sort the contigs from longest to shortest and add up their lengths; N50 is the length of the contig at which the running total first reaches 50% of the genome

Also compute N90: the same as N50, but with a cutoff of 90%

Finally, compute errors by aligning each contig to the reference genome and counting how many local alignments it takes to completely tile the contig, minus one (because it always takes at least one alignment)
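The N50 and N90 statistics and the error count can be computed as in the sketch below. It takes the percentage relative to the genome length, as the slide does (some tools use the total assembly length instead), and assumes the number of local alignments per contig comes from an external aligner. The function names nxx and contig_errors are hypothetical.

```python
def nxx(contig_lengths, genome_length, percent):
    """Length of the contig at which the running total of contig lengths,
    taken from longest to shortest, first reaches percent% of the genome."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        # Integer comparison avoids floating-point rounding at the cutoff.
        if total * 100 >= percent * genome_length:
            return length
    return None  # the contigs do not cover that much of the genome


def contig_errors(num_local_alignments):
    """Errors in a contig: alignments needed to tile it, minus one."""
    return max(num_local_alignments - 1, 0)


contigs = [80, 70, 50, 40, 30, 20, 10]  # toy contig lengths, genome of 300
print(nxx(contigs, 300, 50), nxx(contigs, 300, 90), contig_errors(3))
```

In the toy example the two longest contigs (80 + 70) reach half of the genome, so N50 is 70; reaching 90% requires the five longest contigs, so N90 is 30.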

Page 87: Paul  Medvedev Michael  Brudno

Assembly Results

N50 Results

N90 Results

Page 88: Paul  Medvedev Michael  Brudno

Assembly Results (cont’d)

Length of contigs that contain 50% of the genome (N50) varied between 23 and 28 kb

Length of contigs that contain 90% of the genome (N90) varied between 7 and 8 kb

N50 error rate: ~1 per 100-180 kb

N90 error rate: ~1 per 100-160 kb

Greedy algorithm can be fooled by several strong edge matches

Contig size is good relative to other whole genome assemblies involving small read sizes

Page 89: Paul  Medvedev Michael  Brudno

Outline

Whole Genome Assembly

Review of Related Work

The Medvedev-Brudno Method

Methods: Bidirected Overlap Graph

Methods: Adjustments to the Standard Min-cost Biflow Problem

Methods: Maximizing the Global Read-Count Likelihood

Methods: Efficiently Solving a Min-cost Biflow (Linear)

Methods: Show Me the Contigs

Results

Discussion

Page 90: Paul  Medvedev Michael  Brudno

Discussion

Demonstrated that bidirected flow is a powerful method for genome assembly.

Introduced a maximum likelihood framework for sequence assembly

By unifying Pevzner’s work on de Bruijn graphs, Kececioglu and Myers’s work on bidirected graphs in assembly, and Edmonds and Gabow’s work on bidirected flow, the paper gives an exact polynomial-time assembly algorithm in the parsimony setting that explicitly deals with double-strandedness.

Page 91: Paul  Medvedev Michael  Brudno

Discussion

First major assumption: Reads are error-free

Can be overcome with higher coverage

Second major assumption: Uniform sampling of all genomic regions

Reality: certain portions of the genome are easier to sample than others

More difficult to overcome

Could be overcome by establishing the biases of the sequencing apparatus used

Page 92: Paul  Medvedev Michael  Brudno

Future Research

Exploration of the exact biases of the NGS platforms

Correction for these biases

Is there a better heuristic for the greedy resolution?

Page 93: Paul  Medvedev Michael  Brudno

Questions?

Thank you