Master Course

Post on 26-Jan-2016

Master Course

MSc Bioinformatics for Health Sciences

H15: Algorithms on strings and sequences

Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Dep. de Llenguatges i Sistemes Informàtics

CEPBA-IBM Research Institute

Universitat Politècnica de Catalunya

Fourth lecture:

Sequence assembly

Sequence assembly

It is applied to the following topics:

• EST assembly

• DNA sequencing

• Hybridization: provides information about the l-tuples present in DNA.

DNA sequencing

There are two techniques:

• Shotgun: DNA sequences are broken into 100Kb-500Kb random fragments.

• Hybridization: provides information about the l-mers present in DNA.

Hybridization

Let xxxxxxxxxxxxx be the sequence we want to know,

and the hybridization technique gives us the set of 3-mers that belong to it:

AAC GAT TGC ACG CGG GCC TTG GGA ATT

How can the sequence be reconstructed?

Hybridization

Given the 3-mers of the sequence:

AAC GAT TGC ACG CGG GCC TTG GGA ATT

As AAC and ACG belong to the sequence, AACG also belongs to the sequence, because the longest (proper) suffix of AAC matches the longest (proper) prefix of ACG.

This relation can be represented with a directed graph: AAC → ACG

Hybridization

Construction of the complete suffix-prefix graph:

AAC → ACG → CGG → GGA → GAT → ATT → TTG → TGC → GCC

that gives us the unknown sequence:

AACGGATTGCC
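The construction can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names `suffix_prefix_graph` and `spell_simple_path` are hypothetical, and the second function only handles the easy case where every node has at most one successor, so the graph is a single chain.

```python
from collections import defaultdict

def suffix_prefix_graph(kmers):
    """Directed graph with an edge u -> v when the longest proper
    suffix of u equals the longest proper prefix of v."""
    by_prefix = defaultdict(list)
    for v in kmers:
        by_prefix[v[:-1]].append(v)
    return {u: [v for v in by_prefix[u[1:]] if v != u] for u in kmers}

def spell_simple_path(kmers):
    """Spell the sequence when the graph is one unbranched chain."""
    graph = suffix_prefix_graph(kmers)
    has_predecessor = {v for succs in graph.values() for v in succs}
    starts = [u for u in kmers if u not in has_predecessor]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting node")
    node = starts[0]
    sequence, seen = node, {node}
    while graph[node]:
        if len(graph[node]) != 1:
            raise ValueError("branching node: the path is ambiguous")
        node = graph[node][0]
        if node in seen:
            raise ValueError("cycle found: the path is ambiguous")
        seen.add(node)
        sequence += node[-1]  # each step adds one new base
    if len(seen) != len(kmers):
        raise ValueError("some l-mers are not on the path")
    return sequence

kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(spell_simple_path(kmers))
```

With the nine 3-mers of the slide, the chain starts at AAC, the only node without a predecessor.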

But, is this a realistic case?

Hybridization

Let us introduce a more realistic case:

AAC CAA GAT TGC

ACG CGG GCC TTG

GGC GGA CCG ATT

where the sequence is given by the Hamiltonian path, that is, the path that traverses all nodes exactly once, and finding it is an NP-complete problem!

What is the cost of the hybridization method?

Hybridization: cost

Cost:

1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ l-mers of length L that should be generated.

2. Searching for the suffix-prefix matches: if there are m l-mers, then there are $O(m^2 L^2)$ comparisons.

3. Searching for the Hamiltonian path: NP-complete.

Digression: cost

Let t = 1 ms be the running time for an input of size m.

Linear cost: $O(m)$

10m: 10t = 10 ms; 1000m: 1000t = 1 s

Quadratic cost: $O(m^2)$

10m: 100t = 100 ms; 1000m: 1000000t ≈ 16 min

Exponential cost: $O(2^m)$

10m: $2^{10}$t ≈ 1 s; 1000m: $2^{1000}$t ≈ $10^{301}$t ≈ $10^{290}$ years
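The growth of the three cost classes can be checked numerically. The sketch below assumes one elementary step takes 1 ms and works with base-10 logarithms, because $2^{1000}$ (about $10^{301}$) is too large to be handled comfortably as a running time. The names `log10_steps` and `log10_years` are hypothetical helpers, not part of the course material.

```python
import math

MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365  # milliseconds in a 365-day year

def log10_steps(kind, n):
    """Return log10 of the number of steps for input size n."""
    if kind == "linear":
        return math.log10(n)
    if kind == "quadratic":
        return 2 * math.log10(n)
    if kind == "exponential":
        return n * math.log10(2)
    raise ValueError(f"unknown cost class: {kind}")

def log10_years(kind, n):
    """log10 of the running time in years, assuming 1 ms per step."""
    return log10_steps(kind, n) - math.log10(MS_PER_YEAR)

for kind in ("linear", "quadratic", "exponential"):
    for n in (10, 1000):
        print(f"{kind:12s} n={n:5d}  about 10^{log10_steps(kind, n):.1f} ms")
```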

How can NP-completeness be avoided?

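Hybridization:

Search for the Hamiltonian path (NP-complete) in the graph whose nodes are the 3-mers:

AAC GAT TGC

ACG CGG GCC TTG

GGC GGA CCG ATT

or search for the Eulerian path (linear) in the graph whose nodes are the 2-mers and whose edges are the 3-mers:

AA AC GG CG GA CC GC TG TT AT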

Hybridization: Eulerian path

Search for the Eulerian path of the graph:

Balanced nodes: indegree = outdegree (traversed nodes)

Unbalanced nodes: indegree ≠ outdegree (starting or ending nodes)

Hybridization: Eulerian path

Algorithm:

1. Construct a random path between the starting and ending nodes.

2. Add cycles from balanced nodes while possible.
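One common way to implement this idea is the stack-based form of Hierholzer's algorithm, which walks until it gets stuck and splices the remaining cycles in while backtracking. The sketch below is an illustration, not necessarily the exact variant used in the course. It assumes the nodes are the (l-1)-mers and each l-mer is one edge, and the names `eulerian_path` and `assemble` are hypothetical.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Return the node list of an Eulerian path over directed edges (u, v)."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # outdegree minus indegree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    if (any(abs(b) > 1 for b in balance.values())
            or len(starts) > 1 or len(starts) != len(ends)):
        raise ValueError("no Eulerian path: too many unbalanced nodes")
    start = starts[0] if starts else edges[0][0]
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: emit the node
    path.reverse()
    if len(path) != len(edges) + 1:
        raise ValueError("no Eulerian path: graph is not connected")
    return path

def assemble(kmers):
    """Spell the sequence from l-mers, each used as one edge."""
    path = eulerian_path([(k[:-1], k[1:]) for k in kmers])
    return path[0] + "".join(node[-1] for node in path[1:])

print(assemble(["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]))
```

Every edge is pushed and popped once, which is where the linear cost comes from.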

Hybridization: cost

Cost:

1. Finding the l-mers AAC, CAA, ACG, ...: there are $4^L$ l-mers of length L that should be generated.

2. Searching for the suffix-prefix matches: if there are m l-mers, then there are $O(m^2 L^2)$ comparisons.

3. Searching for the Eulerian path: linear cost.

Now, what is the limiting factor?

Hybridization: limiting factor

Repeated l-mers:

Given the graph:

AAC CAA GAT TGC

ACG CGG GCC TTG

GGA ATT

GAC

How many sequences can be assembled?

CAACGGATTGCC

CAACGGACGGATTGCC (here ACG, CGG and GGA occur twice)

What is the probability of a repeat?

Hybridization: statistical model

Model: random sequence of length N with identically distributed bases (probability 1/4 each).

How can the probability of a repeat be computed?

Given 2 l-mers, the probability that they match is: $4^{-L}$

Given 3 l-mers, the expected number of matching pairs is: $\binom{3}{2} 4^{-L}$

Given m l-mers, the expected number of matching pairs is: $\binom{m}{2} 4^{-L}$

If $\binom{m}{2} 4^{-L} < 1$ then, approximating $\binom{m}{2} \approx m^2/2$, $m < \sqrt{2 \cdot 4^L}$; for L = 8, $m < \sqrt{2 \cdot 4^8} \approx 362$.

Conclusion: this technique can be applied only to short sequences.
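The bound can be checked with a few lines of Python. This is a small sketch of the model above (uniform, independent bases); the function names are hypothetical, and `math.comb` needs Python 3.8 or later.

```python
import math

def expected_matching_pairs(m, L):
    """Expected number of identical pairs among m random l-mers of length L."""
    return math.comb(m, 2) * 4.0 ** (-L)

def max_lmers_without_expected_repeat(L):
    """Largest m for which the expected number of matching pairs is below 1."""
    m = 2
    while expected_matching_pairs(m + 1, L) < 1:
        m += 1
    return m

print(max_lmers_without_expected_repeat(8))
```

For L = 8 the exact count agrees with the approximation $\sqrt{2 \cdot 4^8} \approx 362$.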

Hybridization:

Are genome sequences close to random sequences?

Connect to

http://alggen.lsi.upc.edu

and follow the links RESEARCH, SEARCH, MREPATT.

Shotgun

With the unknown sequence xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

it is possible:

• to make some copies

• to break it into random and unsorted short segments

What can we do?

Shotgun: algorithm

Assume the copies are broken into segments:

xxxxx|xxxxxxx|xxxxxxx|xxxx

xxxxxxxx|xxxxxx|xxxxxx|xxxxxxx|xxxxxx|xxxxxx|xxxxxxx

The algorithm is:

1st. Compare all pairs, searching for approximate suffix-prefix matches.

2nd. Construct the suffix-prefix graph.

3rd. Find the path.

Shotgun

Given the three copies

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

shotgun breaks them into the following segments:

accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt

Shotgun

The pairwise comparison that searches for approximate suffix-prefix matches can be done with:

• Dynamic programming (quadratic cost)

• Two steps:

  • Find the pairs suspected to overlap (linear cost with the hash algorithm)

  • Assemble them with dynamic programming.
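The two-step scheme can be sketched as follows. Both functions are illustrative assumptions rather than the course's implementation. `candidate_pairs` indexes every k-mer in a hash table, which is linear in the total segment length for a fixed k, although listing the pairs can still be quadratic when many segments share a k-mer. `overlap_score` is a standard overlap alignment by dynamic programming, with arbitrary scoring values.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(fragments, k=4):
    """Step 1: index pairs of fragments that share at least one k-mer."""
    index = defaultdict(set)
    for i, frag in enumerate(fragments):
        for j in range(len(frag) - k + 1):
            index[frag[j:j + k]].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def overlap_score(a, b, match=1, mismatch=-1, gap=-1):
    """Step 2: best score aligning a suffix of a with a prefix of b."""
    # Row 0: a prefix of b aligned against nothing costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0 stays 0: the suffix of a may start anywhere for free.
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    # The prefix of b may end anywhere, so take the best cell of the last row.
    return max(prev)

fragments = ["accgtacc", "tacctt", "ggggg"]
for i, j in sorted(candidate_pairs(fragments)):
    print(fragments[i], fragments[j], overlap_score(fragments[i], fragments[j]))
```

Only the pairs returned by the first step are passed to the quadratic alignment, which is the point of the two-step design.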

Shotgun

Given the graph

[Figure: suffix-prefix graph with nodes accgtacc, accttta, tacctt, tttaac, taacgat, acgatac, accg, accgt, acaggt, gataca]

the path gives the sequence

accgtacctttaacgatacaggt

but searching for the Hamiltonian path has exponential cost!
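Once a path through the graph is known, spelling the contig is straightforward. The sketch below uses exact suffix-prefix overlaps for simplicity, whereas the slides call for approximate matching. The order of the segments in `path` is an assumption, since the original figure is not available, and `overlap` and `merge_path` are hypothetical names.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def merge_path(fragments):
    """Merge consecutive fragments of a path into one contig."""
    contig = fragments[0]
    for prev, nxt in zip(fragments, fragments[1:]):
        n = overlap(prev, nxt)
        if n == 0:
            raise ValueError(f"{prev} and {nxt} do not overlap")
        contig += nxt[n:]  # append only the part not already covered
    return contig

# Assumed order of the segments along the path.
path = ["accgtacc", "tacctt", "accttta", "tttaac",
        "taacgat", "acgatac", "gataca", "acaggt"]
print(merge_path(path))
```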

Shotgun:

New problems arise:

[Figure: segments laid out along the sequence; one segment reads accgaccgt]

• Consecutive repeats

• Lack of coverage

• …

Shotgun: properties of the coverage

Given the coverage:

Some questions arise:

• What is the mean length of contigs?

• How many contigs should we expect?

• What is the percentage of coverage?

Shotgun: percentage of coverage

Given the model: a sequence of length L covered by N segments of length d.

Degree of coverage: $N d / L$

We assume that segments are randomly distributed.

The probability that a base is covered by k segments is given by the binomial distribution with parameters $(N, d/L)$:

$$\mathrm{Prob}\{X = k\} = \binom{N}{k} (d/L)^k (1 - d/L)^{N-k}$$

What is the limit of the binomial distribution when $n \to \infty$ and $p \to 0$, having $np = \lambda$? The Poisson distribution $P(\lambda)$:

$$\mathrm{Prob}\{X = k\} = e^{-\lambda} \frac{\lambda^k}{k!}$$

Then, with $\lambda = N d / L$, the probability that at least one segment covers a base is

$$\mathrm{Prob}\{X > 0\} = 1 - \mathrm{Prob}\{X = 0\} = 1 - e^{-\lambda} = 1 - e^{-N d / L}$$

Then, with $N d / L = 4.6$ we obtain 99% coverage,

and with $N d / L = 6.9$ we obtain 99.9% coverage.
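These figures are easy to reproduce. The following sketch, with hypothetical function names, computes the covered fraction under the Poisson approximation and under the exact binomial model, and inverts the Poisson formula to get the degree of coverage needed for a target percentage.

```python
import math

def covered_fraction_poisson(n_segments, seg_len, seq_len):
    """Probability that a base is covered, Poisson approximation."""
    return 1.0 - math.exp(-n_segments * seg_len / seq_len)

def covered_fraction_binomial(n_segments, seg_len, seq_len):
    """Probability that a base is covered, binomial model with p = d/L."""
    return 1.0 - (1.0 - seg_len / seq_len) ** n_segments

def coverage_needed(target):
    """Degree of coverage N d / L needed to cover a fraction `target`."""
    return -math.log(1.0 - target)

for target in (0.99, 0.999):
    print(f"{target:.1%} coverage needs N d / L = {coverage_needed(target):.2f}")
```

The required degree of coverage is $\ln 100 \approx 4.6$ for 99% and $\ln 1000 \approx 6.9$ for 99.9%, which matches the values above.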

Assembly of ESTs

It is the same procedure as shotgun sequencing…

…but with one great advantage:

there are many graphs with a small number of nodes!

Connect to

http://alggen.lsi.upc.es

and follow the links RESEARCH, ESSEM.