On Genome Assembly

On Genome Assembly Abhiram Ranade IIT Bombay


Page 1: On Genome Assembly

On Genome Assembly

Abhiram Ranade, IIT Bombay

Page 2: On Genome Assembly

Genome

• Constituent of living cells that determines hereditary characteristics a.k.a. DNA

• Sequence of nucleobases: Adenine, Guanine, Thymine, Cytosine

e.g. ACCTGGA…

• Human genome: 3 billion nucleobases

• Knowledge of the sequence is very useful

Page 3: On Genome Assembly

Genome Sequencing

Biochemical techniques can only “read” DNA fragments of length ~700 nucleobases

• Make many copies of the genome.

• Break the copies randomly into pieces of length ~700.

• Read the pieces.

• Try to infer what the original genome must have been: “Genome Assembly”
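The copy, break and read steps above can be imitated in a few lines. This is a minimal sketch, not the laboratory process: the function name `sample_reads`, the `jitter` parameter and the toy genome are assumptions made for illustration.

```python
import random

def sample_reads(genome, n_reads, mean_len=700, jitter=50, seed=0):
    """Cut n_reads random pieces of roughly mean_len characters out of genome."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        # Hypothetical length model: mean_len plus or minus a small jitter.
        length = min(len(genome), max(1, mean_len + rng.randint(-jitter, jitter)))
        # Pick a start so that the piece fits inside the genome.
        start = rng.randrange(len(genome) - length + 1)
        reads.append(genome[start:start + length])
    return reads

# Toy genome of 10,000 random nucleobases (an assumption, not real data).
toy_rng = random.Random(1)
genome = "".join(toy_rng.choice("ACGT") for _ in range(10_000))
reads = sample_reads(genome, n_reads=100)
```

The assembler only ever sees `reads`; the positions used to cut them are thrown away, which is what makes the inference step hard.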

Page 4: On Genome Assembly

Assembly Example

• Input pieces (“reads”): abcd, cdefghi, hijkl

• Assembly: abcdefghijkl

• Input pieces (“reads”): abcd, cdefghi, hijkl, hicd

• Assembly? abcdefghijklhicd, abcdefghicdhijkl, or abcdefghicdefghijkl

Page 5: On Genome Assembly

Strategy

• Characterize all possible assemblies of the given read set

• Assign a probability to each assembly

• Pick the assembly with the highest probability

Page 6: On Genome Assembly

All possible assemblies

• A is a valid assembly of the reads if
– each read appears in A at least once;
– nothing else appears, i.e. A is made by pasting together the reads in a possibly overlapping fashion.

• Can we compute/represent the set {A | A is a valid assembly}? Overlap/String Graph!
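The two validity conditions can be checked directly on the toy example. The sketch below reads "nothing else appears" as "every position of A is covered by some occurrence of some read"; that reading, and the helper names `occurrences` and `is_valid_assembly`, are assumptions for illustration.

```python
def occurrences(read, assembly):
    """Return every start index at which read occurs in assembly (overlaps allowed)."""
    starts, pos = [], assembly.find(read)
    while pos != -1:
        starts.append(pos)
        pos = assembly.find(read, pos + 1)
    return starts

def is_valid_assembly(assembly, reads):
    """True if every read occurs in assembly and every position is covered by a read."""
    covered = [False] * len(assembly)
    for read in reads:
        starts = occurrences(read, assembly)
        if not starts:
            return False  # condition 1 fails: this read never appears
        for s in starts:
            for k in range(s, s + len(read)):
                covered[k] = True
    return all(covered)  # condition 2: no character outside the reads

reads = ["abcd", "cdefghi", "hijkl", "hicd"]
candidates = ["abcdefghijklhicd", "abcdefghicdhijkl", "abcdefghicdefghijkl"]
results = {a: is_valid_assembly(a, reads) for a in candidates}
```

All three candidate assemblies pass, which is why validity alone cannot pick a winner and a probability is needed.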

Page 7: On Genome Assembly

Overlap Graph: Intuition

• Vertices = reads.

• Edge from read u to read v if u and v are likely to overlap in the assembly, e.g. abcd, cdef => abcdef

• Assembly = walk in the graph; walking along edges encourages overlaps.

Page 8: On Genome Assembly

Overlap graph

• Vertices: reads + empty read ϕ

• Edges: (ri, rj) if a long suffix of ri = a prefix of rj, e.g. abcd → cdefghi

• Long = ? Real genomes: 50? 100? Any value that indicates the overlap is not coincidental

• Edge label: portion of rj not belonging to the overlap, e.g. the edge abcd → cdefghi has label efghi (Long = 2)
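The edge rule can be sketched as follows. The names `overlap` and `build_overlap_graph` are illustrative. Two choices here are assumptions that the slides do not spell out: only the longest overlap per ordered pair is kept, and ϕ (represented by the empty string) is connected to and from every read.

```python
PHI = ""  # stands in for the empty read ϕ

def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b, or 0 if below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if k > 0 and a.endswith(b[:k]):
            return k
    return 0

def build_overlap_graph(reads, min_len):
    """Map each edge (u, v) to its label: the part of v outside the overlap."""
    edges = {}
    for u in reads:
        for v in reads:
            if u != v:
                k = overlap(u, v, min_len)
                if k:
                    edges[(u, v)] = v[k:]
    for r in reads:
        edges[(PHI, r)] = r   # assumption: entering r from ϕ spells all of r
        edges[(r, PHI)] = ""  # assumption: returning to ϕ adds nothing
    return edges

reads = ["abcd", "cdefghi", "hijkl", "hicd"]
graph = build_overlap_graph(reads, min_len=2)
read_edges = {e: lab for e, lab in graph.items() if PHI not in e}
```

With `min_len=2` the read-to-read edges carry the labels efghi, jkl, cd and efghi, matching the example with Long = 2.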

Page 9: On Genome Assembly

Overlap graph, long=2

[Figure: overlap graph on the vertices abcd, cdefghi, hijkl, hicd and ϕ. Edge labels shown: abcd, efghi, jkl, cd, efghi, hicd, and ϕ on several edges.]

Page 11: On Genome Assembly

Walk => Assembly

• Assembly = walk in the overlap graph which
– starts at ϕ and ends at ϕ;
– passes through every vertex at least once;
– passes through every edge at least once?

• Assembled sequence = concatenation of labels along the walk.
– Every read appears in the sequence.

• Walk revisits ϕ: reconstruction is incomplete, in several pieces.
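Concatenating labels along a walk can be sketched as below, using the toy graph with long = 2. The edge table is typed in by hand; which edges touch ϕ is an assumption, since the figure is not fully recoverable from the transcript. The function name `spell` is illustrative.

```python
PHI = ""  # stands in for the empty read ϕ

# Edge labels of the toy overlap graph (ϕ edges are assumed).
labels = {
    (PHI, "abcd"): "abcd",
    (PHI, "hicd"): "hicd",
    ("abcd", "cdefghi"): "efghi",
    ("cdefghi", "hijkl"): "jkl",
    ("cdefghi", "hicd"): "cd",
    ("hicd", "cdefghi"): "efghi",
    ("hijkl", PHI): "",
    ("hicd", PHI): "",
}

def spell(walk, labels):
    """Return the contiguous pieces spelled by a walk that starts and ends at PHI."""
    if walk[0] != PHI or walk[-1] != PHI:
        raise ValueError("walk must start and end at PHI")
    pieces, current = [], ""
    for u, v in zip(walk, walk[1:]):
        current += labels[(u, v)]
        if v == PHI:  # each return to ϕ closes one piece
            pieces.append(current)
            current = ""
    return pieces

one_piece = spell([PHI, "abcd", "cdefghi", "hicd", "cdefghi", "hijkl", PHI], labels)
two_pieces = spell([PHI, "abcd", "cdefghi", "hijkl", PHI, "hicd", PHI], labels)
```

The first walk spells abcdefghicdefghijkl in one piece. The second returns to ϕ midway and yields the two pieces abcdefghijkl and hicd, i.e. an incomplete reconstruction.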

Page 12: On Genome Assembly

Assembly => Walk

• Input: assembly A

• Output: walk which will generate A

– Visit vertices in the order of appearance in A

Overlap graph characterizes assemblies. Variations on graph also studied.

Page 13: On Genome Assembly

Approaches to assembly

• Occam’s Razor: most likely = “shortest”
– Shortest walk that visits every vertex at least once: NP-hard
– Shortest walk that visits every edge at least once: Chinese Postman problem. Polytime.
– Pragmatic: use some greedy approach to find the above.

• Model probability more accurately
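The slides do not say which greedy approach is meant. As one illustration, here is a generic greedy heuristic for short superstrings: repeatedly merge the two strings with the largest overlap. It is a swapped-in textbook technique, not the author's method, and the names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Merge the best-overlapping pair until a single string remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in contigs if not any(r != s and r in s for s in contigs)]
    while len(contigs) > 1:
        # Ties on overlap length are broken by the larger indices (arbitrary).
        k, i, j = max(
            (overlap(a, b), i, j)
            for i, a in enumerate(contigs)
            for j, b in enumerate(contigs)
            if i != j
        )
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return contigs[0]

three = greedy_assemble(["abcd", "cdefghi", "hijkl"])
four = greedy_assemble(["abcd", "cdefghi", "hijkl", "hicd"])
```

On the three-read input this recovers abcdefghijkl. On the four-read input it returns some short string containing every read, but nothing guarantees that it is the most likely assembly, which motivates modelling the probability more accurately.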

Page 14: On Genome Assembly

A Twist: pair constraints

• Sequencing process may give additional constraints: distance from ri to rj in the assembly is about D

• Example: ri = abcd, rj = hijkl, D = 10. Which of the following assemblies is more likely?
abcdefghijklhicd
abcdefghicdhijkl
abcdefghicdefghijkl
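To compare the candidates against D, one can measure the distance in each. The slides do not define how distance is measured; the sketch below assumes start-to-start distance, and `start_offsets` is a hypothetical helper.

```python
def start_offsets(assembly, ri, rj):
    """Start-to-start distances from each occurrence of ri to each later occurrence of rj."""
    def starts(read):
        out, pos = [], assembly.find(read)
        while pos != -1:
            out.append(pos)
            pos = assembly.find(read, pos + 1)
        return out
    return [b - a for a in starts(ri) for b in starts(rj) if b > a]

candidates = ["abcdefghijklhicd", "abcdefghicdhijkl", "abcdefghicdefghijkl"]
# Assumed convention: distance = start of rj minus start of ri.
distances = {a: start_offsets(a, "abcd", "hijkl") for a in candidates}
```

Under this convention the three candidates give 7, 11 and 14. Measuring from the end of ri instead subtracts len(ri) = 4 from each value, so which assembly best matches D = 10 depends on the convention the sequencing process actually uses.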

Page 15: On Genome Assembly

Systematic estimation of probability of a given assembly

Page 16: On Genome Assembly

Algebraic representation of walks

• Walk is cyclic: number of times a vertex is entered = number of times it is exited.

• Walk = fluid flow
– Total fluid coming in = total fluid going out
– Xij = fluid going from i to j = number of times the walk goes from i to j

Formulate conditions on Xij and solve

Page 17: On Genome Assembly

Algebraic representation: Xij,δj

Xij = number of times the walk goes from i to j.

δj = number of times the walk goes over j.

Lij = length of the label of edge (i,j).

Conservation at every vertex j: $\sum_i X_{ij} = \sum_k X_{jk} = \delta_j > 0$

Length of genome: $L = \sum_{i,j} X_{ij} L_{ij}$. L may be known.
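These quantities can be read off a concrete walk. The sketch below counts Xij and δj for one closed walk on the toy graph and evaluates L. The label lengths are typed in by hand, the ϕ edges are assumed to exist, and `flow_from_walk` is a hypothetical helper.

```python
from collections import Counter

PHI = ""  # stands in for the empty read ϕ

# Label lengths L_ij for the edges used below (toy graph, long = 2).
label_len = {
    (PHI, "abcd"): 4,
    ("abcd", "cdefghi"): 5,
    ("cdefghi", "hicd"): 2,
    ("hicd", "cdefghi"): 5,
    ("cdefghi", "hijkl"): 3,
    ("hijkl", PHI): 0,
}

def flow_from_walk(walk):
    """Return edge counts X plus per-vertex inflow and outflow for a walk."""
    X = Counter(zip(walk, walk[1:]))
    inflow, outflow = Counter(), Counter()
    for (i, j), x in X.items():
        outflow[i] += x
        inflow[j] += x
    return X, inflow, outflow

walk = [PHI, "abcd", "cdefghi", "hicd", "cdefghi", "hijkl", PHI]
X, inflow, outflow = flow_from_walk(walk)
delta = inflow                                    # δj = number of visits to j
L = sum(x * label_len[e] for e, x in X.items())   # L = Σ Xij * Lij
```

Here inflow equals outflow at every vertex, δ for cdefghi is 2, and L is 19, the length of abcdefghicdefghijkl.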

Page 18: On Genome Assembly

Maximum likelihood reconstruction (Medvedev-Brudno 08)

Goal: find the assembly A most likely given the observations, i.e. maximize

Pr(A | r1,r2,…,rn)
= Pr(A, r1,r2,…,rn) / Pr(r1,r2,…,rn)
= Pr(r1,r2,…,rn | A) * Pr(A) / Pr(r1,r2,…,rn)

The denominator does not depend on A. Standard assumption: unconditional probability Pr(A) is the same for all A.

So: maximize Pr(r1,r2,…,rn | A) and output the A that maximizes it.

Page 19: On Genome Assembly

Computing Pr(r1,r2,…,rn|A)

• A = abcdefghicdefghijkl; r1 = abcd, r2 = cdefghi, r3 = hijkl, r4 = hicd

• Process of generating reads:
– Pick a random starting point.
– Pick a length at random.

• Pr(r2) = ? r2 appears twice in A, so Pr(r2) = 2/Length * Pr(read length = 7) = δ2/Length * Pr(read length = 7)

Page 20: On Genome Assembly

Computing Pr(r1,r2,…,rn|A)

Generative model:

• For i = 1 to n
– Pick starting point for ri
– Pick length Li

• Probability of generating ri
= (number of times ri appears in A / length of A) * probability of getting the correct length
= δi/L * Pr(Li)

Page 21: On Genome Assembly

Computing Pr(r1,r2,…,rn|A)

• Pr = Πi δi/L * Pr(Li)

• We want to pick the A for which this probability is maximum

• Score(A) = Πi δi/L * Pr(Li)

• The factors Pr(Li) depend only on the reads, not on A, so the best A will have the max value of Πi δi/L

• So now we have an optimization program
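The score Πi δi/L can be evaluated on the toy candidates. This is a minimal sketch in log space that follows the slide's model literally, using L rather than the number of valid start positions; `count` and `log_score` are hypothetical names.

```python
import math

def count(read, assembly):
    """Number of (possibly overlapping) occurrences of read in assembly, i.e. δ."""
    n, pos = 0, assembly.find(read)
    while pos != -1:
        n += 1
        pos = assembly.find(read, pos + 1)
    return n

def log_score(assembly, reads):
    """log of Πi δi/L, dropping the Pr(Li) factors that do not depend on the assembly."""
    L = len(assembly)
    total = 0.0
    for r in reads:
        d = count(r, assembly)
        if d == 0:
            return float("-inf")  # a missing read makes the assembly impossible
        total += math.log(d) - math.log(L)
    return total

reads = ["abcd", "cdefghi", "hijkl", "hicd"]
candidates = ["abcdefghijklhicd", "abcdefghicdhijkl", "abcdefghicdefghijkl"]
scores = {a: log_score(a, reads) for a in candidates}
best = max(scores, key=scores.get)
```

On this example the two length-16 candidates tie at 1/16^4, while abcdefghicdefghijkl scores 2/19^4, which is marginally higher because cdefghi appears twice. Without further constraints the model slightly prefers the longer assembly here.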

Page 22: On Genome Assembly

Finding the best assembly

• Maximize Πi δi/L

• s.t. $\sum_i X_{ij} = \sum_k X_{jk} = \delta_j > 0$ and $L = \sum_{i,j} X_{ij} L_{ij}$

• L known approximately: $L = L_g e^{\varepsilon} \approx L_g(1+\varepsilon)$

• Lg, Lij: constants. Solve for Xij ≥ 0, δj ≥ 0, ε

• Convex optimization
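One way to see why this program is convex is to take logarithms of the objective. The derivation below is a sketch under the slide's own approximation, with n denoting the number of reads as in the generative model; it is a reading of the slide, not a statement of the authors' exact formulation.

```latex
\begin{align*}
\log \prod_{i=1}^{n} \frac{\delta_i}{L}
  &= \sum_{i=1}^{n} \log \delta_i \;-\; n \log L \\
  &= \sum_{i=1}^{n} \log \delta_i \;-\; n \log L_g \;-\; n\,\varepsilon
  \qquad \text{using } L = L_g e^{\varepsilon}.
\end{align*}
```

Each log δi is concave and the remaining terms are linear in ε, so the objective is concave in (δ, ε). The flow constraints are linear in Xij and δj, and with L ≈ Lg(1+ε) the length constraint becomes the linear equation Σ Xij Lij = Lg(1+ε). Maximizing a concave function under linear constraints is a convex optimization problem.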

Page 23: On Genome Assembly

Concluding Remarks

• Experiments seem to indicate our approach works well. www.cse.iitb.ac.in/~ranade/GraphAssembly.pdf

• Computationally intensive, but well founded.

• May not be useful for large genomes, which can afford linear time algorithms only!

• How to handle pair constraints: important open problem.

• Graphs are everywhere!