Contents
•First week: algorithms for exact string matching:
One pattern: The algorithm depends on |p| and |
k patterns: The algorithm depends on k, |p| and ||
•Second week: Alignment of sequences.
–Edit distance between two strings: dynamic programming
–Alignment of sequences:
– 2 sequences
– 3 or more sequences
•Third week: dealing with long sequences.
Distance between words
What is the distance between the words:
– table, maple
– able, table
– announce, pronounce
– ACCTG, ACTT
… and between
– ACGG, ACTGTGG
– AATCTACTAGCGTACTACTC, ACTACTACGTACTACG
Edit distance
We accept three types of errors:
1. Mismatch: ACCGTGAT -> ACCGAGAT
2. Insertion: ACCGTGAT -> ACCGATGAT
3. Deletion: ACCGTGAT -> ACCGGAT
Indel: an insertion or a deletion (types 2 and 3).
The edit distance d between two strings is the minimum number of substitutions, insertions and deletions needed to transform the first string into the second one.
d(ACT,ACT)=0
d(ACT,AC)=1
d(ACT,C)=2
d(ACT,)=3
d(AC,ATC)=1
d(ACTTG,ATCTG)=2
Edit distance and alignments
The alignment that gives the distance can be represented:

ACCGTGAT
ACCG-GAT
**** ***

ACCG-TGAT
ACCGATGAT
**** ****

ACCGTGAT
ACCGAGAT
**** ***

ACCGTGTTATGTGTATG--TGA--AT
ACCG-GAT--GTGT-TGTTTGAGTAT
**** * *  **** **  ***  **

And the score of the alignment is the sum of the scores of the columns:
– 0 if both chars are the same
– 1 otherwise
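The column-by-column scoring above can be sketched in a few lines of Python. This is only an illustration: the function name `alignment_score` is an assumption of this sketch, and the two rows are expected to be already aligned (same length, gaps written as `-`).

```python
def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores: 0 if both chars are the same, 1 otherwise."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(0 if x == y else 1 for x, y in zip(row_a, row_b))


# The three single-error examples from the slide.
print(alignment_score("ACCGTGAT", "ACCG-GAT"))    # deletion
print(alignment_score("ACCG-TGAT", "ACCGATGAT"))  # insertion
print(alignment_score("ACCGTGAT", "ACCGAGAT"))    # mismatch
```

Each starred column contributes 0 and every other column contributes 1, so the score is the number of columns without a star.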
Edit distance and alignments
But there are many alignments between two sequences. Given ACCG and ACT:

ACCG-
-AC-T
  *

ACCG
AC-T
**

ACCG
ACT-
**

ACCG---
----ACT

Then the edit distance is the score of the best alignment, so we can find the distance by generating all alignments and picking the one with the smallest score.
Edit distance and Pairwise alignment
Given two DNA sequences A = a1 a2 ... an and B = b1 b2 ... bm from the alphabet {a,c,t,g},
we say that A* and B* from {a,c,t,g,-} are aligned iff
i) A* and B* become A and B if gaps (-) are removed.
ii) |A*| = |B*|
iii) For all i, it is not possible that a*i = b*i = - (no column consists of two gaps).
Write all alignments between AA and AC ...
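As a way to check the blackboard exercise, here is a minimal sketch that enumerates every pair (A*, B*) satisfying conditions i) to iii). The names `all_alignments` and `column_score` are assumptions of this sketch. The enumeration is exponential, so it is only meant for very short strings.

```python
from typing import Iterator, Tuple


def all_alignments(a: str, b: str) -> Iterator[Tuple[str, str]]:
    """Yield every aligned pair (A*, B*) of a and b.

    Each column is a pair of characters, or a character over a gap,
    or a gap over a character; a column with two gaps is never produced.
    """
    if not a and not b:
        yield "", ""
        return
    if a and b:  # column (a[0], b[0])
        for rest_a, rest_b in all_alignments(a[1:], b[1:]):
            yield a[0] + rest_a, b[0] + rest_b
    if a:        # column (a[0], -)
        for rest_a, rest_b in all_alignments(a[1:], b):
            yield a[0] + rest_a, "-" + rest_b
    if b:        # column (-, b[0])
        for rest_a, rest_b in all_alignments(a, b[1:]):
            yield "-" + rest_a, b[0] + rest_b


def column_score(row_a: str, row_b: str) -> int:
    """Edit-distance score: 0 per identical column, 1 otherwise."""
    return sum(0 if x == y else 1 for x, y in zip(row_a, row_b))


for row_a, row_b in all_alignments("AA", "AC"):
    print(row_a, row_b, column_score(row_a, row_b))
```

Taking the minimum of `column_score` over all generated pairs gives the edit distance, which is the brute-force approach described earlier.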
Edit distance and Pairwise alignment
To blackboard
Edit distance and alignment of strings
The distances are stored in a matrix with one column for each character of CTACTACTACGT and one row for each character of ACTGA, preceded by a row and a column for the empty string:

      C  T  A  C  T  A  C  T  A  C  G  T
   0  1  2  3  4  5  6  7  8  …
A  1
C  2
T  3
G  …
A

Each cell contains the distance between two prefixes. For instance, the cell in row C and the fifth column contains the distance between AC and CTACT.

First row: the empty string is aligned with a prefix of CTACTACTACGT, so every column of the alignment is a gap. The values are 1 for C, 2 for CT, and so on up to 6 for CTACTA:

-    --    ------
C    CT    CTACTA

First column: a prefix of ACTGA is aligned with the empty string. The values are 1 for A, 2 for AC and 3 for ACT:

ACT
---
Edit distance and alignment of strings
BA(AC,CTAC) = best of
   BA(AC,CTA) followed by the column (-, C)
   BA(A,CTA) followed by the column (C, C)
   BA(A,CTAC) followed by the column (C, -)
d(AC,CTAC) = min of
   d(AC,CTA) + 1
   d(A,CTA)
   d(A,CTAC) + 1
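The recurrence for d(AC,CTAC) generalizes to every cell of the matrix. A minimal dynamic-programming sketch follows; the function name `edit_distance` and the row and column layout are choices of this sketch, not something fixed by the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance with match=0, mismatch=1, indel=1."""
    n, m = len(a), len(b)
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i          # first column: a[:i] against the empty string
    for j in range(1, m + 1):
        d[0][j] = j          # first row: the empty string against b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            d[i][j] = min(d[i][j - 1] + 1,   # column (-, b[j-1])
                          diagonal,          # column (a[i-1], b[j-1])
                          d[i - 1][j] + 1)   # column (a[i-1], -)
    return d[n][m]


print(edit_distance("AC", "CTAC"))
print(edit_distance("ACTGA", "CTACTACTACGT"))
```

The table has (n+1)(m+1) cells and each one is filled in constant time, which avoids generating all alignments.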
Bioinformatics
Pairwise alignment
Best alignment
How can an alignment be scored?
Catcactactgacgactatcgtagcgcggctat acatctacgccaa- ctac-t-gtgtagatcgccgg
c-tgactgc-- acgactatcgt- attgcggctacacactacgcacaactactgtatgtcgc-cgg----
* * *** * ************* ********* **** ******* * **** ** * ***
• Match: favorable
• Mismatch: unfavorable
• Gap: worst case
Then we assign a score for each case, for example 1, -1 and -2 respectively.
Pairwise alignment
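Edit distance:
match=0 mismatch=1 indel=1
d(AC,CTAC) = minimum of
   d(A,CTAC) + 1
   d(A,CTA) + 0 for a match or + 1 for a mismatch (here C = C, so + 0)
   d(AC,CTA) + 1
Similarity:
match=1 mismatch=-1 indel=-2
s(AC,CTAC) = maximum of
   s(A,CTAC) - 2
   s(A,CTA) + 1 for a match or - 1 for a mismatch (here C = C, so + 1)
   s(AC,CTA) - 2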
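The similarity recurrence has the same shape as the edit-distance one, with a maximum instead of a minimum. A minimal sketch, using the scores from the slide as default values; the function name `global_similarity` and its parameter names are assumptions of this sketch.

```python
def global_similarity(a: str, b: str,
                      match: int = 1, mismatch: int = -1, indel: int = -2) -> int:
    """Best global alignment score under the given scoring scheme."""
    n, m = len(a), len(b)
    # s[i][j] = best score of aligning the prefixes a[:i] and b[:j]
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * indel
    for j in range(1, m + 1):
        s[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + pair,
                          s[i - 1][j] + indel,
                          s[i][j - 1] + indel)
    return s[n][m]


print(global_similarity("AC", "CTAC"))
```

Calling it with `match=0, mismatch=-1, indel=-1` gives the negated edit distance, which shows that both formulations are the same computation with different scores.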
Pairwise alignment
Connect to alggen tool
Best alignment
accaccacaccacaacgagcata … acctgagcgatat
a
c
c
.
.
t
Given the maximum score, how can the best alignment be found?
• Quadratic cost in space and time
• Sequences of up to 10,000 bp in length
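One common way to recover the best alignment is to walk back from the last cell of the score matrix and, at each step, move to a neighbouring cell that explains the current score. The sketch below assumes the linear scores match=1, mismatch=-1, indel=-2; the function name `best_alignment` is an assumption of this sketch, and ties are broken in favour of the diagonal, so only one of the possibly many best alignments is returned.

```python
from typing import Tuple


def best_alignment(a: str, b: str, match: int = 1, mismatch: int = -1,
                   indel: int = -2) -> Tuple[int, str, str]:
    """Return (score, A*, B*) for one best global alignment."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * indel
    for j in range(1, m + 1):
        s[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + pair,
                          s[i - 1][j] + indel,
                          s[i][j - 1] + indel)

    # Traceback: rebuild the columns from the last cell to the first one.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if s[i][j] == s[i - 1][j - 1] + pair:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and s[i][j] == s[i - 1][j] + indel:
            row_a.append(a[i - 1])   # character of a over a gap
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")        # gap over a character of b
            row_b.append(b[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


score, row_a, row_b = best_alignment("ACCGTGAT", "ACCGGAT")
print(score)
print(row_a)
print(row_b)
```

The whole matrix is kept in memory for the traceback, which is where the quadratic space cost comes from.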
Download alggen tool
Some preconceived ideas
We have developed the theory according to the following principles:
1) Both sequences have a similar length (global).
2) The gap model is linear: if there are k consecutive gaps, the penalty is k(-2).
Assume that we have sequences of different lengths, S1 and S2.
Semiglobal pairwise alignment
It is meaningless to introduce gaps until both sequences have a similar length.
The most probable alignment should be
[Figure: an alignment with initial gaps and final gaps]
How can these alignments be found?
Semiglobal pairwise alignment
      C  T  A  C  T  A  C  T  A  C  G  T
   0  0  0  0  0  0  0  0  0  0  0  0  0
A  1
C  2
T  3
Initial gaps: a cell of the first row contains the score of the best alignment of a prefix of the long sequence, for instance CTA, with the empty sequence. The contribution of the initial gaps is disregarded, so the whole first row is 0.
But what happens with the final gaps? They are disregarded by checking the last row for the best score.
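The two changes (a first row of zeros and a search over the last row) can be sketched as follows. This sketch assumes edit-distance costs (match=0, mismatch=1, indel=1), which is consistent with the 1, 2, 3 shown in the first column, so the best score is the minimum of the last row; with similarity scores it would be the maximum. The function name `semiglobal_distance` is an assumption of this sketch.

```python
def semiglobal_distance(short: str, long_seq: str) -> int:
    """Smallest distance between `short` and any substring of `long_seq`.

    Initial and final gaps in front of and after `short` are free.
    """
    n, m = len(short), len(long_seq)
    previous = [0] * (m + 1)          # first row: initial gaps cost nothing
    for i in range(1, n + 1):
        current = [i] + [0] * m       # first column: short[:i] against nothing
        for j in range(1, m + 1):
            diagonal = previous[j - 1] + (0 if short[i - 1] == long_seq[j - 1] else 1)
            current[j] = min(diagonal, previous[j] + 1, current[j - 1] + 1)
        previous = current
    return min(previous)              # last row: final gaps cost nothing


print(semiglobal_distance("ACT", "CTACTACTACGT"))
```

Only two rows are kept because the sketch returns the score alone; recovering the alignment itself would need the full matrix and a traceback that starts at the best cell of the last row.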
How does the algorithm search for the best alignment?
Affine-gap model score
Given the following alignments
that have the same score …
a g t a c c c c g t a g
a g t - c c - - g t a -
a g t a c c c c g t a g
a g t - c - c - g t a -
a g t a c c c c g t a g
a g t - c - - c g t a -
a g t a c c c c g t a g
a g t - - c c - g t a -
a g t a c c c c g t a g
a g t - - c - c g t a -
a g t a c c c c g t a g
a g t - - - c c g t a -
Which is the most reliable case
from a biological point of view?
Affine-gap model score
Then, how can we distinguish between
consecutive gaps and separated gaps?
a g t a c c c c g t a g
a g t - - c - c g t a -
a g t a c c c c g t a g
a g t - - - c c g t a -
By penalizing the opening of a gap more heavily than its extension,
for instance, -10 and -0.5.
Then, the penalty of k consecutive gaps becomes
OG + (k-1) EG
which is an affine-gap function.
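To see how OG + (k-1) EG separates the two alignments above, the sketch below scores an already aligned pair of rows. The match and mismatch values (1 and -1) are carried over from the earlier similarity example and are an assumption here, as are the function and parameter names.

```python
def affine_alignment_score(row_a: str, row_b: str,
                           match: float = 1.0, mismatch: float = -1.0,
                           gap_open: float = -10.0,
                           gap_extend: float = -0.5) -> float:
    """Score two aligned rows; a run of k gaps costs gap_open + (k-1)*gap_extend."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0.0
    previous_gap_row = None  # row that held a gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            # Same row gapped again: extension. Otherwise a new gap opens.
            score += gap_extend if previous_gap_row == gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if x == y else mismatch
            previous_gap_row = None
    return score


top = "agtaccccgtag"
print(affine_alignment_score(top, "agt--c-cgta-"))  # separated gaps
print(affine_alignment_score(top, "agt---ccgta-"))  # consecutive gaps
```

With `gap_open=-2, gap_extend=-2` the function reduces to the linear model, and both alignments get the same score again.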
How is the best alignment found?
C T A C T A C T A C G T
A
C
T
G
A
Affine-gap model score
Smallest arrows: refer to the introduction of an opening gap.
Largest arrows: refer to the introduction of an extension gap.
But from which cell do the largest arrows originate?
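One standard answer, which the slides do not name, is to keep three matrices instead of one, so that each cell remembers whether the alignment ends in a pair of characters or in a gap (an approach usually attributed to Gotoh). In the sketch below an extension comes from the same gap matrix in the neighbouring cell, and an opening comes from one of the other two matrices. All names and the default scores are assumptions of this sketch.

```python
NEG_INF = float("-inf")


def affine_global_score(a: str, b: str,
                        match: float = 1.0, mismatch: float = -1.0,
                        gap_open: float = -10.0,
                        gap_extend: float = -0.5) -> float:
    """Best global alignment score with affine gap penalties."""
    n, m = len(a), len(b)
    # M: last column pairs two characters.
    # X: last column is a character of a over a gap.
    # Y: last column is a gap over a character of b.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = pair + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] + gap_open,    # open a gap after a pair
                          Y[i - 1][j] + gap_open,    # open a gap in the other row
                          X[i - 1][j] + gap_extend)  # extend the current gap
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])


print(affine_global_score("agtaccccgtag", "agtccgta"))
```

The cost stays quadratic: there are three tables of the same size, each filled in constant time per cell.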
Local alignment
Given two sequences, we can consider the alignments of all
their substrings…
…how can the best of them be found?
Two questions arise:
- how can the alignments be compared?
- how can the best one be selected?
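One standard way to address both questions, not named in the slides, is the Smith-Waterman recurrence: alignments are compared by their similarity score, every cell is allowed to restart at 0 (which discards a bad prefix), and the best local alignment ends at the cell with the highest value. The sketch below assumes the scores match=1, mismatch=-1, indel=-2 used earlier; the function name `local_alignment_score` is an assumption of this sketch.

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, indel: int = -2) -> int:
    """Best similarity score over all pairs of substrings of a and b."""
    m = len(b)
    best = 0
    previous = [0] * (m + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (m + 1)
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(0,                      # start a new local alignment
                             previous[j - 1] + pair,
                             previous[j] + indel,
                             current[j - 1] + indel)
            best = max(best, current[j])             # it may end in any cell
        previous = current
    return best


print(local_alignment_score("TTACGTT", "GGACGGG"))
```

This only works with a scheme that mixes positive and negative scores. With pure edit-distance costs the empty alignment would always win.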
Bioinformatics
Multiple alignment
Pairwise to multiple alignment
What happens with three strings?
Let n be their length, then the cost becomes
O(n^3) · “O(2^3)” · “O(3^2)”
[Figure: axes S1, S2, S3]
And with k strings? O(n^k · 2^k · k^2)
Multiple alignment
Multiple-alignment programs use different heuristics:
Clustal (Progressive alignment)
http://www.ebi.ac.uk/clustalw
TCoffee (Progressive alignment + data bases)
http://igs-server.cnrs-mrs.fr/Tcoffee_cgi/index.cgi
HMM (Hidden Markov Models)
Multiple alignment
Connect to alggen tool
Advanced Data Structure: Bioinformatics
•First week: Algorithms for exact string matching.
•Second week: Alignment of sequences.
•Third week: Dealing with long sequences.