Master Course
![Page 1: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/1.jpg)
Master Course
MSc Bioinformatics for Health Sciences
H15: Algorithms on strings and sequences
Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)
Dep. de Llenguatges i Sistemes Informàtics
CEPBA-IBM Research Institute
Universitat Politècnica de Catalunya
![Page 2: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/2.jpg)
Fourth lecture:
Sequence assembly
![Page 3: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/3.jpg)
Sequence assembly
It is applied to the following topics:
• EST assembly
• DNA sequencing
![Page 4: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/4.jpg)
DNA sequencing
There are two techniques:
• Shotgun: DNA sequences are broken into random fragments of 100 kb to 500 kb.
• Hybridization: provides information about the l-tuples (l-mers) present in the DNA.
![Page 6: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/6.jpg)
Hybridization
Let xxxxxxxxxxxxx be the sequence we want to know,
and suppose the hybridization technique gives us the set of 3-mers that belong to it:
AAC GAT TGC ACG CGG GCC TTG GGA ATT
How can the sequence be reconstructed?
![Page 7: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/7.jpg)
Hybridization
Given the 3-mers of the sequence:
AAC GAT TGC ACG CGG GCC TTG GGA ATT
As AAC and ACG belong to the sequence, AACG also belongs to the sequence,
because the longest (proper) suffix of AAC, which is AC, matches the longest (proper) prefix of ACG.
This relation can be represented as an edge of a directed graph: AAC → ACG
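The suffix-prefix relation can be sketched in a few lines of Python. This is only an illustration: the function name `suffix_prefix_graph` is hypothetical, and it assumes that all l-mers have the same length and that self-loops (such as AAA → AAA) can be ignored.

```python
from collections import defaultdict

def suffix_prefix_graph(kmers):
    """Map each l-mer to the l-mers whose (l-1)-prefix equals its (l-1)-suffix."""
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)
    # Self-loops are dropped in this sketch (an assumption, not part of the slides).
    return {kmer: [n for n in by_prefix[kmer[1:]] if n != kmer] for kmer in kmers}

three_mers = "AAC GAT TGC ACG CGG GCC TTG GGA ATT".split()
graph = suffix_prefix_graph(three_mers)
print(graph["AAC"])  # ['ACG']
```

Indexing the l-mers by prefix avoids comparing every pair explicitly.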
![Page 8: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/8.jpg)
Hybridization
Construction of the complete suffix-prefix graph:
AAC → ACG → CGG → GGA → GAT → ATT → TTG → TGC → GCC
that gives us the unknown sequence:
AACGGATTGCC
But is this a realistic case?
![Page 9: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/9.jpg)
Hybridization
Let us introduce a more realistic case:
AAC CAA GAT TGC
ACG CGG GCC TTG
GGC GGA CCG ATT
The sequence is given by a Hamiltonian path,
that is, a path that traverses all nodes exactly once,
and finding such a path is an NP-complete problem!
What is the cost of the hybridization method?
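A Hamiltonian path search can be sketched with plain backtracking. The function names are hypothetical, the l-mers are assumed to be distinct, and the input is the set of nine 3-mers AAC, GAT, TGC, ACG, CGG, GCC, TTG, GGA, ATT. In the worst case this search explores exponentially many partial paths.

```python
def hamiltonian_path(kmers):
    """Return an ordering of the l-mers that visits each exactly once, or None."""
    kmers = list(kmers)

    def extend(path, remaining):
        if not remaining:
            return path
        for nxt in sorted(remaining):
            if path[-1][1:] == nxt[:-1]:  # suffix-prefix match
                found = extend(path + [nxt], remaining - {nxt})
                if found:
                    return found
        return None

    for start in kmers:
        found = extend([start], set(kmers) - {start})
        if found:
            return found
    return None

def spell(path):
    """First l-mer plus the last letter of every following l-mer."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])

three_mers = "AAC GAT TGC ACG CGG GCC TTG GGA ATT".split()
print(spell(hamiltonian_path(three_mers)))  # AACGGATTGCC
```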
![Page 10: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/10.jpg)
Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...:
There are $4^L$ l-mers of length L that should be generated.
2. Searching for the suffix-prefix matches:
If there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Hamiltonian path:
NP-complete
![Page 11: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/11.jpg)
Excursus: cost
Linear cost, $O(m)$: m → t = 1 ms; 10m → 10t = 10 ms; 1000m → 1000t = 1 s
Quadratic cost, $O(m^2)$: m → t = 1 ms; 10m → 100t = 100 ms; 1000m → 1000000t ≈ 16 min
Exponential cost, $O(2^m)$: m → t = 1 ms; 10m → $2^{10}$ t ≈ 1 s; 1000m → $2^{1000}$ t ≈ $10^{301}$ t ≈ $10^{290}$ years
![Page 12: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/12.jpg)
Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...:
There are $4^L$ l-mers of length L that should be generated.
2. Searching for the suffix-prefix matches:
If there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Hamiltonian path:
NP-complete
How can the NP-completeness be avoided?
![Page 13: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/13.jpg)
Hybridization:
Search for the Hamiltonian path (NP-complete), where the nodes are the 3-mers:
AAC GAT TGC
ACG CGG GCC TTG
GGC GGA CCG ATT
or search for the Eulerian path (linear), where the nodes are the 2-mers and each 3-mer is an edge:
AA AC GG CG GA CC GC TG TT AT
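The graph used for the Eulerian path can be built by turning every l-mer into an edge from its (l-1)-prefix to its (l-1)-suffix. A minimal sketch, with the hypothetical name `de_bruijn_graph`:

```python
from collections import defaultdict

def de_bruijn_graph(kmers):
    """Each l-mer becomes an edge from its (l-1)-prefix to its (l-1)-suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

three_mers = "AAC GAT TGC ACG CGG GCC TTG GGC GGA CCG ATT".split()
graph = de_bruijn_graph(three_mers)
nodes = set(graph) | {v for targets in graph.values() for v in targets}
print(sorted(nodes))
```

For the eleven 3-mers listed here, the node set is exactly the ten 2-mers AA, AC, GG, CG, GA, CC, GC, TG, TT, AT.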
![Page 14: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/14.jpg)
Hybridization: Eulerian path
Search for the Eulerian path of the graph:
Balanced nodes: indegree = outdegree (traversed nodes)
Unbalanced nodes: indegree ≠ outdegree (starting or ending nodes)
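The starting and ending nodes can be found by counting degrees. The sketch below uses the hypothetical name `unbalanced_nodes` and treats every l-mer in the input list as one edge; it uses the 3-mers of AACGGATTGCC as input.

```python
from collections import Counter

def unbalanced_nodes(kmers):
    """Return (starts, ends): nodes with one extra outgoing or incoming edge."""
    balance = Counter()  # outdegree minus indegree of each (l-1)-mer
    for kmer in kmers:
        balance[kmer[:-1]] += 1
        balance[kmer[1:]] -= 1
    starts = [node for node, b in balance.items() if b == 1]
    ends = [node for node, b in balance.items() if b == -1]
    return starts, ends

three_mers = "AAC GAT TGC ACG CGG GCC TTG GGA ATT".split()
print(unbalanced_nodes(three_mers))  # (['AA'], ['CC'])
```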
![Page 15: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/15.jpg)
Hybridization: Eulerian path
Algorithm:
1. Construct a random path between the starting and ending nodes.
2. Add cycles from balanced nodes while possible.
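One common way to organize these two steps is Hierholzer's algorithm, which walks along unused edges and splices cycles in as it backtracks. The sketch below is an illustration with hypothetical names. It assumes that an Eulerian path exists and does not check for it.

```python
from collections import defaultdict

def eulerian_path(kmers):
    """Hierholzer's algorithm on the graph whose edges are the given l-mers."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # outdegree minus indegree
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        balance[kmer[:-1]] += 1
        balance[kmer[1:]] -= 1
    # Start at the node with one extra outgoing edge, or anywhere if all are balanced.
    start = next((n for n, b in balance.items() if b == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # no unused edges left: node is finished
    return path[::-1]

def spell(path):
    """First node plus the last letter of every following node."""
    return path[0] + "".join(node[-1] for node in path[1:])

three_mers = "AAC GAT TGC ACG CGG GCC TTG GGA ATT".split()
print(spell(eulerian_path(three_mers)))  # AACGGATTGCC
```

Every edge is pushed and popped once, so the cost is linear in the number of l-mers.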
![Page 17: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/17.jpg)
Hybridization: cost
Cost:
1. Finding the l-mers AAC, CAA, ACG, ...:
There are $4^L$ l-mers of length L that should be generated.
2. Searching for the suffix-prefix matches:
If there are m l-mers, then there are $O(m^2 L^2)$ comparisons.
3. Searching for the Eulerian path:
Linear cost
Now, what is the limiting factor?
![Page 18: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/18.jpg)
Hybridization: limiting factor
Repeated l-mers:
Given the graph:
AAC CAA GAT TGC
ACG CGG GCC TTG
GGA ATT
GAC
How many sequences can be assembled?
CAACGGATTGCC
CAACGGACGGATTGCC
What is the probability of a repeat?
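The ambiguity can be checked directly. The set of 3-mers does not record how many times the cycle ACG → CGG → GGA → GAC is traversed. In the sketch below, `spectrum` is a hypothetical helper and the sequence `twice` is an extra example that is not on the slide.

```python
def spectrum(sequence, l=3):
    """Set of l-mers occurring in the sequence (multiplicities are lost)."""
    return {sequence[i:i + l] for i in range(len(sequence) - l + 1)}

once = "CAACGGACGGATTGCC"       # cycle traversed once
twice = "CAACGGACGGACGGATTGCC"  # hypothetical: same cycle traversed twice
print(spectrum(once) == spectrum(twice))                  # True
print(sorted(spectrum(once) - spectrum("CAACGGATTGCC")))  # ['GAC']
```

Since both sequences have the same spectrum, the hybridization data alone cannot distinguish them.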
![Page 19: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/19.jpg)
Hybridization: statistical model
Model: a random sequence of length N whose bases are independent and identically distributed (probability 1/4 each).
How can the probability of a repeat be computed?
Given 2 l-mers, the probability that they match is $4^{-L}$.
Given 3 l-mers, the expected number of matching pairs is $\binom{3}{2} 4^{-L}$.
Given m l-mers, the expected number of matching pairs is $\binom{m}{2} 4^{-L}$.
If $\binom{m}{2} 4^{-L} < 1$, then approximately $m < \sqrt{2 \cdot 4^L}$; for L = 8 this gives m ≈ 362.
Conclusion: this technique can be applied only to short sequences.
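The bound can be checked numerically under the same random model. The function names are hypothetical; the sketch looks for the largest m whose expected number of matching pairs, $\binom{m}{2} 4^{-L}$, stays below 1.

```python
from math import comb

def expected_matching_pairs(m, L):
    """Expected number of equal pairs among m random l-mers of length L."""
    return comb(m, 2) * 4.0 ** -L

def max_lmers(L):
    """Largest m whose expected number of matching pairs stays below 1."""
    m = 2
    while expected_matching_pairs(m + 1, L) < 1:
        m += 1
    return m

print(max_lmers(8))  # 362
```

For L = 8 this agrees with $\sqrt{2 \cdot 4^8} \approx 362$.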
![Page 20: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/20.jpg)
Hybridization:
Connect to
http://alggen.lsi.upc.edu
and follow the links RESEARCH → SEARCH → MREPATT
Are genome sequences close to random sequences?
![Page 21: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/21.jpg)
![Page 22: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/22.jpg)
DNA sequencing
There are two techniques:
• Hybridization: provides information about the l-mers present in the DNA.
• Shotgun: DNA sequences are broken into random fragments of 100 kb to 500 kb.
![Page 23: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/23.jpg)
Shotgun
Given the unknown sequence xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
it is possible:
• to make some copies
• to break it into random and unsorted short segments
What can we do?
![Page 24: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/24.jpg)
Shotgun: algorithm
Assume xxxxx|xxxxxxx|xxxxxxx|xxxx xxxxxxxx|xxxxxx|xxxxxx|xxxxxxx|xxxxxx|xxxxxx|xxxxxxx
The algorithm is:
1st. Compare all pairs of segments, searching for approximate suffix-prefix matches.
2nd. Construct the suffix-prefix graph.
3rd. Find the path.
![Page 25: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/25.jpg)
Shotgun
Given the three copies xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The shotgun technique breaks them into the following segments:
accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt
![Page 26: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/26.jpg)
Shotgun
The pairwise comparison that searches for approximate suffix-prefix matches can be done with:
• Dynamic programming (quadratic cost)
• Two steps:
  • Find the pairs that are candidates to be assembled (linear cost with a hashing algorithm).
  • Assemble them with dynamic programming.
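The two-step approach can be sketched as follows. The names `candidate_pairs` and `exact_overlap` are hypothetical, and the choice k = 3 is arbitrary. As a simplification, the second step here checks exact overlaps only; handling approximate matches would need the dynamic programming alignment mentioned above.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(fragments, k=3):
    """Pairs of fragments sharing at least one k-mer, found with a hash table."""
    index = defaultdict(set)
    for frag in fragments:
        for i in range(len(frag) - k + 1):
            index[frag[i:i + k]].add(frag)
    pairs = set()
    for group in index.values():
        pairs.update(combinations(sorted(group), 2))
    return pairs

def exact_overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

fragments = ["accgt", "aggt", "acgatac", "accttta", "tttaac", "gataca",
             "accgtacc", "ggt", "acaggt", "taacgat", "accg", "tacctt"]
pairs = candidate_pairs(fragments)
print(("accttta", "tttaac") in pairs)      # True
print(exact_overlap("accttta", "tttaac"))  # 4
```

Building the index is linear in the total length of the fragments; the number of candidate pairs depends on how often k-mers are shared.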
![Page 27: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/27.jpg)
Shotgun
Given the graph with nodes
accgtacc, accttta, tacctt, tttaac, taacga, acgatac, accg, accgt, tacaggt, gataca
the assembled sequence is
accgtacctttaacgatacaggt
but finding the Hamiltonian path has exponential cost!
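In practice the exponential search is often replaced by a heuristic. The sketch below is not the method of these slides. It is a simple greedy heuristic with hypothetical names that repeatedly merges the two fragments with the largest exact overlap, and it carries no guarantee of finding the shortest assembly in general.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest exact overlap."""
    unique = set(fragments)
    # Fragments contained in another fragment add no information.
    frags = sorted(f for f in unique if not any(f != g and f in g for g in unique))
    while len(frags) > 1:
        n, a, b = max((overlap(a, b), a, b) for a in frags for b in frags if a != b)
        merged = a + b[n:]
        # Drop a, b and anything else now contained in the merged string.
        frags = [f for f in frags if f not in merged] + [merged]
    return frags[0]

fragments = ["accgt", "aggt", "acgatac", "accttta", "tttaac", "gataca",
             "accgtacc", "ggt", "acaggt", "taacgat", "accg", "tacctt"]
print(greedy_assemble(fragments))  # accgtacctttaacgatacaggt
```

On this small example the greedy merges recover the sequence, but repeats can mislead the heuristic on real data.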
![Page 28: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/28.jpg)
Shotgun:
New problems arise
xxxxxxxxxxxxx
xxxxx
xxxxxx xxxxxx xxxxxxxx
accgaccgt
xxxxxxxxxxxxxx
• Consecutive repeats
• Lack of coverage
• …
![Page 29: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/29.jpg)
Shotgun: properties of the coverage
Given the coverage:
Some questions arise:
• What is the mean length of the contigs?
• How many contigs should we expect?
• What is the percentage of coverage?
![Page 30: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/30.jpg)
Shotgun: percentage of coverage
Given the model: a sequence of length L covered by N segments of length d.
Degree of coverage: $N d / L$
We assume that the segments are randomly distributed.
The probability that a base is covered by k segments is given by the binomial distribution $B(N, d/L)$:
$\mathrm{Prob}\{X = k\} = \binom{N}{k} (d/L)^k (1 - d/L)^{N-k}$
![Page 31: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/31.jpg)
Shotgun: percentage of coverage
What is the limit of the binomial distribution when $n \to \infty$ and $p \to 0$, keeping $np = \lambda$ (here $n = N$, $p = d/L$)?
The Poisson distribution $P(\lambda)$:
$\mathrm{Prob}\{X = k\} = e^{-\lambda} \lambda^k / k!$
Then the probability that at least one segment covers a base is
$\mathrm{Prob}\{X > 0\} = 1 - \mathrm{Prob}\{X = 0\} = 1 - e^{-\lambda} = 1 - e^{-N d / L}$
Then, with $N d / L = 4.6$ we obtain 99% coverage,
and with $N d / L = 6.9$ we obtain 99.9% coverage.
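The coverage figures can be reproduced with a short calculation. The function names and the concrete values of L, d and N below are hypothetical, chosen only so that N d / L equals 4.6 and 6.9.

```python
from math import exp

def coverage_poisson(N, d, L):
    """Fraction of bases covered by at least one segment (Poisson approximation)."""
    return 1.0 - exp(-N * d / L)

def coverage_binomial(N, d, L):
    """Same fraction under the binomial model B(N, d/L)."""
    return 1.0 - (1.0 - d / L) ** N

# Hypothetical numbers: 500-base segments on a 100000-base sequence.
L, d = 100_000, 500
print(round(coverage_poisson(920, d, L), 4))   # N d / L = 4.6 -> 0.9899
print(round(coverage_poisson(1380, d, L), 4))  # N d / L = 6.9 -> 0.999
```

With these values the binomial and Poisson results differ by about 0.0001, which illustrates why the approximation is acceptable when d / L is small.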
![Page 32: Master Course](https://reader035.fdocuments.in/reader035/viewer/2022062500/5681540f550346895dc20e73/html5/thumbnails/32.jpg)
Assembly of ESTs
It is the same procedure as shotgun sequencing…
…but with one great advantage:
there are many graphs with a small number of nodes!
Connect to
http://alggen.lsi.upc.es
and follow the links RESEARCH → ESSEM