Multiple Sequence Alignment: Motivation, Definition, Scoring (Sum of Pairs scoring), Algorithms, Family...
Date posted: 22-Dec-2015
Multiple Sequence Alignment
• Motivation
• Definition
• Scoring (Sum of Pairs scoring)
• Algorithms
• Family Representations
Multiple Sequence Alignment
• Motivation
– What are we trying to accomplish?
• Definition
• Scoring (Sum of Pairs scoring)
• Algorithms
• Family Representations
Multiple Sequence Alignment
• Motivation
– Representation of protein families
– Identification and representation of conserved features of DNA/protein sequences that correlate with structure or function
– Deduction of evolutionary history from DNA/protein sequences
– Read pages 333-342
• A lot of this is done by “heuristic” or “intuition” and is difficult to automate
Biological Motivation
• Previous “First Fact of Biological Sequence Comparison”
– In biomolecular sequences (DNA, RNA, amino acid sequences), high sequence similarity usually implies significant functional or structural similarity
• Second Fact of Biological Sequence Comparison
– Evolutionarily and functionally related molecular strings can differ significantly throughout the string yet preserve the same 3D structure(s), 2D substructure(s), active sites, or dispersed residues
2 strings versus multiple strings
• 2 strings
– Based on first fact
• Find unknown biological relationships using string similarity
• Method: database searching
• Multiple strings
– Based loosely on second fact
• Given known biological relationships (function, structure, etc.), identify unknown conserved subpatterns in a set of strings
• These subpatterns can then be used as a known pattern for other database searches
Multiple Sequence Alignment
• Motivation
• Definition
• Scoring (Sum of Pairs scoring)
• Algorithms
• Family Representations
Definition
• A global alignment of a set of k>2 strings {Si} is obtained
– by inserting spaces (dashes) into each Si so that all strings end up with the same length
– and placing the strings in rows, one character (or dash) per column
– Note: ALL positions of every string Si are involved
• A local alignment of a set of k>2 strings {Si} is obtained
– by selecting one substring Si’ from each string Si
– globally aligning those substrings
Example
• Strings {abca, ababa, accb, cbbc}
a b c - a
a b a b a
a c c b -
c b - b c
Multiple Sequence Alignment
• Motivation
• Definition
• Scoring (Sum of Pairs scoring)
– Induced pairwise alignments
– Definition of sum of pairs (SP) scoring
– Justification (or lack thereof)
• Algorithms
• Family Representations
Scoring MSAs
• Key fact: there is no universally accepted score function
– My impression is that people evaluate MSAs by feel (they know a good one when they see it)
• Definitions
– Given an MSA M, the induced pairwise alignment of Si and Sj is obtained from M by removing all rows except the two rows for Si and Sj. Opposing spaces can be removed if desired.
Definitions
• Definitions (continued)
– The score of an induced pairwise alignment is determined using any chosen scoring scheme for two-string alignment in the standard manner.
Example
• Example
a b c - a
a b a b a
a c c b -
c b - b c
• Induced alignment (first and third strings)
a b c - a
a c c b -
• Score (match 0, mismatch or space 1)
0 + 1 + 0 + 1 + 1 = 3
Sum of Pairs (SP)
• Definition: The SP score of an MSA M is the sum of the scores of all pairwise global alignments induced by M
• Example
a b c - a
a b a b a
a c c b -
c b - b c
• SP score: 2 + 3 + 4 + 3 + 3 + 4 = 19
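The SP total above can be checked mechanically. The following is a minimal sketch, assuming the unit-cost scheme implied by the induced-alignment example (match 0, mismatch 1, character against space 1, opposing spaces 0). The function names `pair_score` and `sp_score` are illustrative and do not come from the slides.

```python
from itertools import combinations

def pair_score(row_a, row_b, mismatch=1, space=1):
    """Score of the pairwise alignment induced by two rows of an MSA."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue              # opposing spaces cost nothing
        if x == "-" or y == "-":
            score += space        # character aligned with a space
        elif x != y:
            score += mismatch     # two different characters
    return score

def sp_score(msa):
    """Sum of the induced pairwise scores over all unordered pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(msa, 2))

msa = ["abc-a", "ababa", "accb-", "cb-bc"]
print([pair_score(a, b) for a, b in combinations(msa, 2)])  # [2, 3, 4, 3, 3, 4]
print(sp_score(msa))                                        # 19
```

Any other two-string scoring scheme could be substituted for `pair_score`; the SP definition only requires that the same scheme is used for every pair.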
Justification
• Difficult to give a sound biological justification for SP or any other scoring scheme
• Main reasons for studying it
– It is easy to work with
– It has been used by many people in studying MSA
– It is used in several packages
Multiple Sequence Alignment
• Motivation
• Definition
• Scoring (Sum of Pairs scoring)
• Algorithms
– Exact, NP-hard problem
– Approximation Algorithm (Center Star)
– Heuristic Methods
• Family Representations
Formal Problem
• Input
– k strings {Si}
– Scoring function
• Output
– MSA of {Si} with minimum (maximum) SP score
• Observation
– Exact solution is NP-hard
– Dynamic programming fills a table with O(n^k) cells for k strings of length n, so solving exactly for more than even 6 strings of typical length is often not feasible
Heuristic Speedup
• View problem as a shortest path problem with O(n^k) nodes
• Given an upper bound on the actual value, we can eliminate exploration of many nodes using branch and bound ideas
• Key is to send values forward rather than backwards
– Backwards: All nodes will eventually be evaluated
– Forwards: Limit to those which can possibly be less than the current estimate of the optimal value
Backwards
D(i,j)       w  r  i  t  e  r  s
          0  1  2  3  4  5  6  7
      0   0  1  2  3  4  5  6  7
v     1   1  1  2  3  4  5  6  7
i     2   2  2  2
n     3   3
t     4   4
n     5   5
e     6   6
r     7   7
Forwards
D(i,j)       w  r  i  t  e  r  s
          0  1  2  3  4  5  6  7
      0   0  1  2  3  4  5  6  7
v     1   1  1  2  3  4  5  6  7
i     2   2  2  2
n     3   3
t     4   4
n     5   5
e     6   6
r     7   7
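The forward idea can be illustrated on the two-string case shown in the tables. This is only a sketch under stated assumptions: unit costs (match 0, mismatch or space 1), a caller-supplied `upper_bound` that is at least the true distance, and a simple lower bound on the remaining cost (the difference in remaining lengths). The function name and the choice of lower bound are illustrative and are not taken from the slides.

```python
def forward_edit_distance(s, t, upper_bound):
    """Forward DP: each cell pushes its value to its successors.

    A cell is skipped when its value plus a lower bound on the remaining
    cost already exceeds upper_bound, so it never sends values onward.
    Returns (distance, number of cells that were expanded).
    """
    INF = float("inf")
    n, m = len(s), len(t)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    expanded = 0
    for i in range(n + 1):            # row-major order: every predecessor
        for j in range(m + 1):        # of (i, j) is handled before (i, j)
            d = D[i][j]
            if d == INF:
                continue              # never reached by a surviving cell
            remaining = abs((n - i) - (m - j))   # spaces still unavoidable
            if d + remaining > upper_bound:
                continue              # cannot lead to a better solution
            expanded += 1
            if i < n:
                D[i + 1][j] = min(D[i + 1][j], d + 1)
            if j < m:
                D[i][j + 1] = min(D[i][j + 1], d + 1)
            if i < n and j < m:
                D[i + 1][j + 1] = min(D[i + 1][j + 1], d + (s[i] != t[j]))
    return D[n][m], expanded

dist, cells = forward_edit_distance("vintner", "writers", upper_bound=7)
print(dist, cells, "of", 8 * 8, "cells expanded")
```

A tighter `upper_bound` (for example, the cost of any alignment found by a quick heuristic) prunes more cells. The same scheme extends to k strings, where the table has O(n^k) cells and pruning matters far more.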
Approximation Algorithms
• Given the hardness of computing the exact solution, how about developing algorithms that compute a solution that is guaranteed to be close to optimal?
• Goal: Find a polynomial-time algorithm A that minimizes
– sup_I A(I)/OPT(I), the worst-case ratio over all inputs I
• Only computer scientists seem interested in this
• Biologists seem to do things more heuristically
Alignments consistent with a tree
• D(Si,Sj) is the optimal weighted edit distance between Si and Sj
• Definition: Let T be a tree where each node is labeled with a string from {Si}. Then a multiple alignment of {Si} is consistent with T if the induced pairwise alignment of Si and Sj has score D(Si,Sj) for each pair of strings (Si, Sj) that are connected by an edge in T.
Example
-AX-Z
-A-YZ
-AXYZ
--XYZ
AYXYZ
• All edge alignment scores are optimal
• Others are not, such as AYXYZ with -AXYZ
[Figure: tree T with nodes AXZ, AYZ, AXYZ, XYZ, AYXYZ]
Theorem
• For any {Si} and any tree T whose nodes are labeled with distinct strings of {Si}, we can efficiently find an MSA M(T) of {Si} that is consistent with T.
• Proof
– Incrementally align any two adjacent nodes
• Two aligned gaps have zero cost
– Add gaps as necessary to other already aligned sequences
Example
• Align AXYZ and XYZ
AXYZ
-XYZ
• Align AYXYZ and -XYZ
A-XYZ  or  -AXYZ
--XYZ      --XYZ
AYXYZ      AYXYZ
• …
[Figure: tree T with nodes AXZ, AYZ, AXYZ, XYZ, AYXYZ]
Triangle Inequality
• Assume an alphabet-weighted scoring scheme s(x,y)
– x and y could be any character (or a space)
• A scoring scheme satisfies the triangle inequality if for any three characters (including a space) x, y, and z,
– s(x,z) <= s(x,y) + s(y,z)
• Note: not all scoring schemes used in biology satisfy this triangle inequality property
Center Star Method
• For {Si}, define Sc to be the string that minimizes the sum over all strings Sj of D(Sc, Sj)
• Define the center star to be the star where the center node is labeled with Sc
• Define Mc to be an MSA of {Si} that is consistent with the center star
• Define d(Si, Sj) to be the score of the pairwise alignment of Si and Sj induced by Mc.
• Denote the score of an alignment M as d(M).
Example
• Sum over all strings Sj of D(AXYZ, Sj) = 4
• Mc before AYXYZ added
AXYZ
AX-Z
A-YZ
-XYZ
• Mc after AYXYZ added
A-XYZ
A-X-Z
A--YZ
--XYZ
AYXYZ
[Figure: center star with center AXYZ and leaves AXZ, AYZ, XYZ, AYXYZ]
Example continued
• Mc after AYXYZ added
A-XYZ
A-X-Z
A--YZ
--XYZ
AYXYZ
• d(AYZ,AYXYZ) = 2
• d(Mc) = 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + 2 + 2 = 16
[Figure: center star with center AXYZ and leaves AXZ, AYZ, XYZ, AYXYZ]
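The center star construction can be sketched end to end for this example. The sketch assumes unit costs (match 0, mismatch or space 1, opposing spaces 0). The helper names `align_pair`, `center_star`, and `sp` are illustrative; the merge step is one straightforward way to "add gaps as necessary", not necessarily the way the original slides intended.

```python
from itertools import combinations

def align_pair(s, t):
    """Optimal global alignment under unit costs; returns (D, s_row, t_row)."""
    n, m = len(s), len(t)
    D = [[i + j if i == 0 or j == 0 else 0 for j in range(m + 1)]
         for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i-1][j-1] + (s[i-1] != t[j-1]),
                          D[i-1][j] + 1, D[i][j-1] + 1)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:                      # trace back one optimal path
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (s[i-1] != t[j-1]):
            a.append(s[i-1]); b.append(t[j-1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i-1][j] + 1:
            a.append(s[i-1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j-1]); j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))

def center_star(strings):
    """MSA consistent with the center star; row 0 is the center string."""
    dist = [[align_pair(a, b)[0] for b in strings] for a in strings]
    c = min(range(len(strings)), key=lambda i: sum(dist[i]))
    rows = [strings[c]]
    for idx, s in enumerate(strings):
        if idx == c:
            continue
        _, c_al, s_al = align_pair(strings[c], s)
        cur, merged, new_row = rows[0], [[] for _ in rows], []
        p = q = 0
        while p < len(cur) or q < len(c_al):
            gap_cur = p < len(cur) and cur[p] == "-"
            gap_new = q < len(c_al) and c_al[q] == "-"
            if gap_cur and not gap_new:        # old gap column: s gets a space
                for r, row in zip(merged, rows):
                    r.append(row[p])
                new_row.append("-"); p += 1
            elif gap_new and not gap_cur:      # new gap column: old rows get spaces
                for r in merged:
                    r.append("-")
                new_row.append(s_al[q]); q += 1
            else:                              # same center position (or shared gap)
                for r, row in zip(merged, rows):
                    r.append(row[p])
                new_row.append(s_al[q]); p += 1; q += 1
        rows = ["".join(r) for r in merged] + ["".join(new_row)]
    return rows

def sp(rows):
    """SP score under the same unit costs."""
    return sum(sum(x != y for x, y in zip(a, b)) for a, b in combinations(rows, 2))

mc = center_star(["AXZ", "AYZ", "AXYZ", "XYZ", "AYXYZ"])
print(mc)      # ['A-XYZ', 'A-X-Z', 'A--YZ', '--XYZ', 'AYXYZ']
print(sp(mc))  # 16
```

For this input every pairwise alignment with the center is unique, so the result matches the slide's Mc regardless of tie-breaking. In general, ties in the traceback can produce different but equally valid center star alignments.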
Results
• Lemma: Assuming the triangle inequality,
– d(Si, Sj) <= d(Si, Sc) + d(Sc, Sj) = D(Si, Sc) + D(Sc, Sj)
• Definition: Let M* be the optimal alignment of {Si} and d*(Si, Sj) be the score of the induced pairwise alignment.
• Theorem: d(Mc) / d(M*) <= 2(k-1)/k < 2
Proof
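The body of this slide did not survive in the transcript. A sketch of the standard argument, in the notation of the Results slide, is given here as a reconstruction rather than as the original slide content. It assumes non-negative costs and zero cost for opposing spaces; all sums run over ordered pairs of strings.

```latex
\begin{align*}
2\,d(M_c) &= \sum_{i \ne j} d(S_i, S_j)
  \;\le\; \sum_{i \ne j} \bigl[ D(S_i, S_c) + D(S_c, S_j) \bigr]
  \;=\; 2(k-1) \sum_{j} D(S_c, S_j), \\
2\,d(M^*) &= \sum_{i \ne j} d^*(S_i, S_j)
  \;\ge\; \sum_{i \ne j} D(S_i, S_j)
  \;=\; \sum_{i} \sum_{j} D(S_i, S_j)
  \;\ge\; k \sum_{j} D(S_c, S_j).
\end{align*}
```

The first line applies the lemma to every pair; each term D(Sc, Sj) is counted 2(k-1) times. The second line uses that an induced pairwise alignment can never beat the optimal pairwise distance, and that Sc minimizes the row sum. Dividing the first line by the second gives d(Mc)/d(M*) <= 2(k-1)/k. In the example, the sum of D(Sc, Sj) is 4 and k = 5, so the first line gives d(Mc) <= 16, which the alignment meets exactly.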
Weighted SP
• Each induced pairwise score is multiplied by a weight w(i,j).
• Optimal weighted SP can be computed in exponential time (in k) using dynamic programming
• Little is known about approximation of weighted SP
– Why doesn’t center star give a guaranteed bound here?
Heuristic Techniques
• In practice, people tend to use more heuristic methods with no proven performance guarantees
• Basic idea
– Do some form of iterative or progressive alignment
– For example, do an alignment based on a minimum spanning tree of some sort
• Find two closest nodes and join them
– How should we define closeness?
• Then iteratively add the closest non-aligned node to the alignment
One method of defining closeness
• sd(i,j) scores
– Given a scoring scheme
– Compute D(Si, Sj)
– 100 times do
• “Jumble” Si and Sj and compute D(jum(Si), jum(Sj))
– Compute mean and standard deviation of these 100 jumbled comparisons
– Define sd(i,j) = D(Si, Sj)/standard deviation (no mean?)
• Intuition
– Strings Si and Sj contain non-random structures (hopefully secondary structure) in common if sd(i,j) is high
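The jumbling procedure can be sketched as follows. The slide does not say which scoring scheme D uses, so this sketch assumes unit-cost edit distance purely for illustration. The names `edit_distance`, `jumble`, and `sd_score` are hypothetical, and the fixed seed is only there to make the run repeatable.

```python
import random
import statistics

def edit_distance(s, t):
    """Unit-cost edit distance (assumed stand-in for the slide's D)."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        cur = [i]
        for j, y in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def jumble(s, rng):
    """Random permutation of the characters of s."""
    chars = list(s)
    rng.shuffle(chars)
    return "".join(chars)

def sd_score(s, t, trials=100, seed=0):
    """Compare D(s, t) with the spread of D over jumbled copies of s and t."""
    rng = random.Random(seed)
    observed = edit_distance(s, t)
    jumbled = [edit_distance(jumble(s, rng), jumble(t, rng))
               for _ in range(trials)]
    mean = statistics.mean(jumbled)
    sd = statistics.stdev(jumbled)
    ratio = observed / sd if sd > 0 else float("inf")   # the slide's sd(i,j)
    return {"D": observed, "mean": mean, "sd": sd, "sd_score": ratio}

print(sd_score("vintner", "writers"))
```

The returned `mean` makes it easy to try the mean-centered variant (D minus mean, divided by the standard deviation) that the slide's "(no mean?)" remark hints at. Whether a high or a low value indicates relatedness depends on whether D is a similarity or a distance.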
Multiple Sequence Alignment
• Motivation
• Definition
• Scoring (Sum of Pairs scoring)
• Algorithms
• Family Representations
– Profiles
– Regular expressions/motifs
Representation Problem
• Input
– Family of sequences that typically have a known biological similarity
• Desired output
– Representation of this family of sequences that reveals any string/sequence similarities that hopefully are related to their biological similarity
Profiles
• Strings {abca, ababa, accb, cbbc}
a b c - a
a b a b a
a c c b -
c b - b c
• Profile (percentage of each character per column)
     1   2   3   4   5
a   75   0  25   0  50
b    0  75   0  75   0
c   25  25  50   0  25
-    0   0  25  25  25
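The profile above is just a table of per-column character frequencies. A minimal sketch of computing it from the alignment follows; the function name `profile` is illustrative, and characters that do not occur in a column are simply absent from that column's dictionary.

```python
from collections import Counter

def profile(msa):
    """Per-column character frequencies of an MSA, as percentages."""
    rows = len(msa)
    return [{ch: 100 * n / rows for ch, n in Counter(col).items()}
            for col in zip(*msa)]      # zip(*msa) iterates over columns

msa = ["abc-a", "ababa", "accb-", "cb-bc"]
for number, column in enumerate(profile(msa), 1):
    print(number, column)
```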
Log odds ratio
• Strings {abca, ababa, accb, cbbc}
a b c - a
a b a b a
a c c b -
c b - b c
• Profile (percentage of each character per column)
     1   2   3   4   5
a   75   0  25   0  50
b    0  75   0  75   0
c   25  25  50   0  25
-    0   0  25  25  25
• p(a) = 6/20 = 30%
• p(a,1) = 3/4 = 75%
• log (p(x,j)/p(x)) is entry
• Example (without logs)
     1    2    3    4    5
a   2.5   0   .83   0   1.7
b    0   2.5   0   2.5   0
c    1    1    2    0    1
-    0    0   1.7  1.7  1.7
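The ratio table can be reproduced directly from the alignment. In this sketch, p(x) is taken to be the frequency of x over all cells of the alignment, spaces included, which is what the 6/20 figure implies. The names `odds_ratios` and `log_odds` are illustrative, and mapping a zero ratio to negative infinity is an assumption of this sketch rather than something stated in the slides.

```python
from collections import Counter
import math

def odds_ratios(msa):
    """p(x, j) / p(x) for every character x and every column j."""
    rows, cols = len(msa), len(msa[0])
    background = Counter("".join(msa))           # counts over all cells
    table = []
    for col in zip(*msa):
        counts = Counter(col)                    # missing characters count as 0
        table.append({ch: (counts[ch] / rows) / (background[ch] / (rows * cols))
                      for ch in background})
    return table

def log_odds(msa):
    """Log of each ratio; a ratio of zero is mapped to -infinity."""
    return [{ch: math.log(r) if r > 0 else float("-inf") for ch, r in col.items()}
            for col in odds_ratios(msa)]

msa = ["abc-a", "ababa", "accb-", "cb-bc"]
for number, column in enumerate(odds_ratios(msa), 1):
    print(number, {ch: round(r, 2) for ch, r in sorted(column.items())})
```

In practice a small pseudocount is often added before taking logs so that characters unseen in a column do not receive an infinitely bad score.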
Nice feature of profiles
• Natural extension of alignment and scoring of strings to profiles
• Aligning a string to a profile
– We can generalize notions of pairwise string alignment
• Scoring
– Compute a weighted sum based on frequency of characters in the column
• Can generalize to profile-to-profile alignments
• Optimal alignment
– Dynamic programming can solve
Signature representations
• Signature or motif signature
– Pattern contained as a substring in most members of a family
– Typically represented as a regular expression
– Such a regular expression might be derived given a multiple sequence alignment