Multiple Sequence Alignment Motivation Definition Scoring (Sum of Pairs scoring) Algorithms Family...

Multiple Sequence Alignment

• Motivation

• Definition

• Scoring (Sum of Pairs scoring)

• Algorithms

• Family Representations


• Motivation
– What are we trying to accomplish?

• Definition


• Algorithms



• Motivation
– Representation of protein families
– Identification and representation of conserved features of DNA/protein sequences that correlate with structure or function
– Deduction of evolutionary history from DNA/protein sequences
– Read pages 333-342

• A lot of this is done by “heuristic” or “intuition” and is difficult to automate

Biological Motivation

• Previous “First Fact of Biological Sequence Comparison”
– In biomolecular sequences (DNA, RNA, amino acid sequences), high sequence similarity usually implies significant functional or structural similarity

• Second Fact of Biological Sequence Comparison
– Evolutionarily and functionally related molecular strings can differ significantly throughout the string yet preserve the same 3D structure(s), 2D substructure(s), active sites, or dispersed residues

2 strings versus multiple strings

• 2 strings
– Based on first fact
  • Find unknown biological relationships using string similarity
  • Method: database searching

• Multiple strings
– Based loosely on second fact
  • Given known biological relationships (function, structure, etc.), identify unknown conserved subpatterns in a set of strings
  • These subpatterns can then be used as a known pattern for other database searches


• Motivation

• Definition


• Algorithms


Definition

• A global alignment of a set of k>2 strings {Si} is obtained

– by inserting spaces (dashes) into each Si so that each string has the same length at the end.

– Placing each string into columns, one character (or dash) per column.

– Note: ALL positions of every string Si are involved

• A local alignment of a set of k>2 strings {Si} is obtained

– by selecting one substring Si’ from each string Si

– globally aligning those substrings

Example

• Strings {abca, ababa, accb, cbbc}

a b c - a
a b a b a
a c c b -
c b - b c
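The definition of a global alignment can be checked mechanically. The sketch below is only an illustration; the function name `is_global_alignment` is made up for this example, and rows are written without the spaces used on the slide.

```python
def is_global_alignment(rows, strings, gap="-"):
    """Return True if `rows` is a global multiple alignment of `strings`."""
    if len(rows) != len(strings):
        return False
    # Every row must have the same length (one character or dash per column).
    if len({len(r) for r in rows}) != 1:
        return False
    # Deleting the dashes from row i must give back string i exactly,
    # so all positions of every string are involved.
    return all(r.replace(gap, "") == s for r, s in zip(rows, strings))


strings = ["abca", "ababa", "accb", "cbbc"]
rows = ["abc-a", "ababa", "accb-", "cb-bc"]
print(is_global_alignment(rows, strings))
```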


• Motivation

• Definition

• Scoring (Sum of Pairs scoring)
– Induced pairwise alignments
– Definition of sum of pairs (SP) scoring
– Justification (or lack thereof)

• Algorithms

• Family Representations

Scoring MSAs

• Key fact: there is no universally accepted score function
– My impression is that people evaluate MSAs by feel (they know a good one when they see it)

• Definitions
– Given an MSA M, the induced pairwise alignment of Si and Sj is obtained from M by removing all rows except the two rows for Si and Sj. Opposing spaces can be removed if desired.


– The score of an induced pairwise alignment is determined using any chosen scoring scheme for two-string alignment in the standard manner.

Example

• Example

a b c - a
a b a b a
a c c b -
c b - b c

• Induced alignment (rows 1 and 3)

a b c - a
a c c b -

• Score (per column: match 0, mismatch or space 1)

0 + 1 + 0 + 1 + 1 = 3
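The induced pairwise alignment and its score can be reproduced with a few lines of code. This is a sketch that assumes the unit-cost scheme the example appears to use (match 0, mismatch or character against a space 1); the helper names `induced_pair` and `pair_score` are hypothetical.

```python
def induced_pair(rows, i, j, gap="-"):
    """Rows i and j of the MSA, with columns of opposing spaces removed."""
    cols = [(x, y) for x, y in zip(rows[i], rows[j])
            if not (x == gap and y == gap)]
    return "".join(x for x, _ in cols), "".join(y for _, y in cols)


def pair_score(a, b):
    """Unit cost (an assumption): 0 per match, 1 per mismatch or space."""
    return sum(x != y for x, y in zip(a, b))


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
a, b = induced_pair(msa, 0, 2)
print(a, b, pair_score(a, b))
```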

Sum of Pairs (SP)

• Definition: The SP score of an MSA M is the sum of the scores of all pairwise global alignments induced by M

• Example

a b c - a
a b a b a
a c c b -
c b - b c

• SP score: 2 + 3 + 4 + 3 + 3 + 4 = 19
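The SP score of the example can be recomputed pair by pair. The sketch below assumes the same unit-cost scheme as the induced-alignment example (match 0, mismatch or space 1, two opposing spaces 0); `sp_score` and its `cost` parameter are names chosen for illustration.

```python
from itertools import combinations


def sp_score(rows, cost=lambda x, y: int(x != y)):
    """Sum, over all pairs of rows, of the induced pairwise alignment score."""
    total = 0
    for r, s in combinations(rows, 2):
        # Two opposing spaces compare equal, so they contribute 0.
        total += sum(cost(x, y) for x, y in zip(r, s))
    return total


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
pair_scores = [sp_score([r, s]) for r, s in combinations(msa, 2)]
print(pair_scores, sp_score(msa))
```

The pairs are visited in the order (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), which matches the order of the terms in the sum above.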

Justification

• Difficult to give a sound biological justification for SP or any other scoring scheme

• Main reasons for studying it
– It is easy to work with
– It has been used by many people in studying MSA
– It is used in several packages


• Motivation

• Definition

• Scoring (Sum of Pairs scoring)

• Algorithms
– Exact, NP-hard problem
– Approximation Algorithm (Center Star)
– Heuristic Methods


Formal Problem

• Input
– k strings {Si}
– Scoring function

• Output
– MSA of {Si} with minimum SP score (or maximum, if the scoring function measures similarity)

• Observation
– Exact solution is NP-hard
– Dynamic programming takes O(n^k) time, so solving exactly for more than even 6 strings of typical length is often not feasible

Heuristic Speedup

• View problem as a shortest path problem with O(n^k) nodes

• Given an upper bound on the actual value, we can eliminate exploration of many nodes using branch and bound ideas

• Key is to send values forward rather than backwards
– Backwards: All nodes will eventually be evaluated
– Forwards: Limit to those which can possibly be less than current estimate on optimal

Backwards

D(i,j)       w  r  i  t  e  r  s
          0  1  2  3  4  5  6  7
     0    0  1  2  3  4  5  6  7
v    1    1  1  2  3  4  5  6  7
i    2    2  2  2
n    3    3
t    4    4
n    5    5
e    6    6
r    7    7

Forwards

D(i,j)       w  r  i  t  e  r  s
          0  1  2  3  4  5  6  7
     0    0  1  2  3  4  5  6  7
v    1    1  1  2  3  4  5  6  7
i    2    2  2  2
n    3    3
t    4    4
n    5    5
e    6    6
r    7    7
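The forward idea can be illustrated on the two-string table above. The sketch below is not the k-string algorithm from the slides; it is a pairwise, unit-cost version in which each cell pushes its value forward to its three successors, and a cell is skipped when its value plus a simple lower bound on the remaining cost already exceeds a given upper bound. The lower bound used here (the difference in remaining lengths) and the names `forward_edit_distance` and `upper_bound` are assumptions made for the example.

```python
import math


def forward_edit_distance(s, t, upper_bound=math.inf):
    """Return (edit distance, number of cells expanded) using forward updates."""
    n, m = len(s), len(t)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    expanded = 0
    # Row-major order: every predecessor of (i, j) is handled before (i, j).
    for i in range(n + 1):
        for j in range(m + 1):
            d = D[i][j]
            # At least |remaining lengths differ| more spaces are needed.
            if d + abs((n - i) - (m - j)) > upper_bound:
                continue  # cannot lie on a path that beats the bound
            expanded += 1
            if i < n:
                D[i + 1][j] = min(D[i + 1][j], d + 1)
            if j < m:
                D[i][j + 1] = min(D[i][j + 1], d + 1)
            if i < n and j < m:
                D[i + 1][j + 1] = min(D[i + 1][j + 1], d + (s[i] != t[j]))
    return D[n][m], expanded


full = forward_edit_distance("vintner", "writers")
pruned = forward_edit_distance("vintner", "writers", upper_bound=5)
print(full, pruned)
```

As long as the upper bound is at least the true optimum, every cell on an optimal path survives the test, so the returned distance is unchanged while fewer cells are expanded.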

Approximation Algorithms

• Given the hardness of computing the exact solution, how about developing algorithms that compute a solution that is guaranteed to be close to optimal?

• Goal: Find a polynomial-time algorithm A that minimizes
– sup over all instances I of A(I)/OPT(I)

• Only computer scientists seem interested in this

• Biologists seem to do things more heuristically

Alignments consistent with a tree

• D(Si,Sj) is the optimal weighted edit distance between Si and Sj

• Definition: Let T be a tree where each node is labeled with a string from {Si}. Then a multiple alignment of {Si} is consistent with T if the induced pairwise alignment of Si and Sj has score D(Si,Sj) for each pair of strings (Si, Sj) that are connected by an edge in T.

Example

-AX-Z
-A-YZ
-AXYZ
--XYZ
AYXYZ

• All edge alignment scores are optimal

• Others are not, such as AYXYZ with -AXYZ

[Figure: tree whose nodes are labeled AXZ, AYZ, AXYZ, XYZ, AYXYZ]

Theorem

• For any {Si} and any tree T whose nodes are labeled with distinct strings of {Si}, we can efficiently find an MSA M(T) of {Si} that is consistent with T.

• Proof
– Incrementally align any two adjacent nodes
  • Two aligned gaps have zero cost
– Add gaps as necessary to other already aligned sequences

Example

• Align AXYZ and XYZ

AXYZ
-XYZ

• Align AYXYZ and -XYZ

A-XYZ      -AXYZ
--XYZ  or  --XYZ
AYXYZ      AYXYZ

• …

[Figure: tree whose nodes are labeled AXZ, AYZ, AXYZ, XYZ, AYXYZ]

Triangle Inequality

• Assume an alphabet-weighted scoring scheme s(x,y)
– x and y could be any character (or a space)

• A scoring scheme satisfies the triangle inequality if for any three characters (including a space) x, y, and z,
– s(x,z) <= s(x,y) + s(y,z)

• Note: not all scoring schemes used in biology satisfy this triangle inequality property
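For a small alphabet the triangle inequality can be checked by brute force over all triples. In the sketch below both scoring schemes are illustrative: `unit` is plain unit cost, and `cheap_gap` is a made-up scheme (not taken from any real substitution matrix) in which spaces are cheap and substitutions are expensive.

```python
from itertools import product


def satisfies_triangle_inequality(s, symbols):
    """True if s(x,z) <= s(x,y) + s(y,z) for every triple of symbols."""
    return all(s(x, z) <= s(x, y) + s(y, z)
               for x, y, z in product(symbols, repeat=3))


def unit(x, y):
    return 0 if x == y else 1


def cheap_gap(x, y):
    # Hypothetical scheme: a space costs 1, a substitution costs 5.
    if x == y:
        return 0
    return 1 if "-" in (x, y) else 5


symbols = "abc-"  # the alphabet plus the space character
print(satisfies_triangle_inequality(unit, symbols))
print(satisfies_triangle_inequality(cheap_gap, symbols))
```

The second scheme fails because replacing a by b directly costs 5, while going through a space costs only 1 + 1.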

Center Star Method

• For {Si}, define Sc to be the string that minimizes the sum over all strings Sj of D(Sc, Sj)

• Define the center star to be the star where the center node is labeled with Sc

• Define Mc to be an MSA of {Si} that is consistent with the center star

• Define d(Si, Sj) to be the score of the pairwise alignment of Si and Sj induced by Mc.

• Denote the score of an alignment M as d(M).

Example

• Sum over all strings Sj of D(AXYZ, Sj) = 4

• Mc before AYXYZ added

AXYZ
AX-Z
A-YZ
-XYZ

• Mc after AYXYZ added

A-XYZ
A-X-Z
A--YZ
--XYZ
AYXYZ

[Figure: center star with AXYZ at the center and AXZ, AYZ, XYZ, AYXYZ as leaves]

Example continued

• Mc after AYXYZ added

A-XYZ
A-X-Z
A--YZ
--XYZ
AYXYZ

• d(AYZ, AYXYZ) = 2

• d(Mc) = 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + 2 + 2 = 16

[Figure: center star with AXYZ at the center and AXZ, AYZ, XYZ, AYXYZ as leaves]
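The center star construction in this example can be sketched in code. The version below assumes unit-cost edit distance for D, breaks ties in the traceback in one fixed way, and lets a newly added string share an existing gap column when both alignments have a space in the center at that point. The function names are hypothetical, and this is one possible way to do the merge rather than the only one.

```python
from itertools import combinations


def align(a, b):
    """Optimal unit-cost global alignment: (distance, gapped a, gapped b)."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    ga, gb, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to (0, 0)
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ga.append(a[i - 1]); gb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ga.append(a[i - 1]); gb.append("-"); i -= 1
        else:
            ga.append("-"); gb.append(b[j - 1]); j -= 1
    return D[n][m], "".join(reversed(ga)), "".join(reversed(gb))


def center_star(strings):
    """Return (center index, rows); rows[0] is the center, then input order."""
    totals = [sum(align(s, t)[0] for t in strings) for s in strings]
    c = min(range(len(strings)), key=totals.__getitem__)
    rows = [strings[c]]
    for idx, s in enumerate(strings):
        if idx == c:
            continue
        _, gc, gs = align(strings[c], s)  # gc: center with spaces, gs: s with spaces
        old, merged, new_row, p, q = rows[0], [[] for _ in rows], [], 0, 0
        while p < len(old) or q < len(gc):
            old_gap = p < len(old) and old[p] == "-"
            new_gap = q < len(gc) and gc[q] == "-"
            if p < len(old) and q < len(gc) and old_gap == new_gap:
                # Same center character, or a space in both: share the column.
                for col, row in zip(merged, rows):
                    col.append(row[p])
                new_row.append(gs[q]); p += 1; q += 1
            elif old_gap:
                # Column exists only in the current MSA: new string gets a space.
                for col, row in zip(merged, rows):
                    col.append(row[p])
                new_row.append("-"); p += 1
            else:
                # Column exists only in the new pair: add spaces to all old rows.
                for col in merged:
                    col.append("-")
                new_row.append(gs[q]); q += 1
        rows = ["".join(col) for col in merged] + ["".join(new_row)]
    return c, rows


def sp_score(rows):
    return sum(x != y for r, s in combinations(rows, 2) for x, y in zip(r, s))


strings = ["AXZ", "AYZ", "AXYZ", "XYZ", "AYXYZ"]
center, mc = center_star(strings)
print(strings[center], mc, sp_score(mc))
```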

Results

• Lemma: Assuming the triangle inequality,
– d(Si, Sj) <= d(Si, Sc) + d(Sc, Sj) = D(Si, Sc) + D(Sc, Sj)

• Definition: Let M* be the optimal alignment of {Si} and d*(Si, Sj) be the score of the pairwise alignment of Si and Sj induced by M*.

• Theorem: d(Mc) / d(M*) <= 2(k-1)/k < 2
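One standard way to see where the factor 2(k-1)/k comes from is to sum over ordered pairs, which counts every pair twice on both sides and so leaves the ratio unchanged. The derivation below is a sketch that uses only the lemma, the definition of Sc as the minimizer of the sum of D(Sc, Sj), and the fact that d*(Si, Sj) >= D(Si, Sj).

```latex
\begin{align*}
\sum_{i}\sum_{j \ne i} d(S_i,S_j)
  &\le \sum_{i}\sum_{j \ne i} \bigl[ D(S_i,S_c) + D(S_c,S_j) \bigr]
   = 2(k-1)\sum_{j} D(S_c,S_j), \\
\sum_{i}\sum_{j \ne i} d^{*}(S_i,S_j)
  &\ge \sum_{i}\sum_{j \ne i} D(S_i,S_j)
   \ge k \sum_{j} D(S_c,S_j), \\
\frac{d(M_c)}{d(M^{*})}
  &\le \frac{2(k-1)}{k} < 2 .
\end{align*}
```

In the first line each term D(Si, Sc) appears once for each of the k-1 choices of j, and likewise for D(Sc, Sj). In the second line the inner sum for each fixed i is at least the corresponding sum for the center Sc.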

Weighted SP

• Each induced pairwise score is multiplied by a weight w(i,j).

• Optimal weighted SP can be computed in exponential time (in k) using dynamic programming

• Little is known about approximation of weighted SP
– Why doesn’t center star give a guaranteed bound here?

Heuristic Techniques

• In practice, people tend to use more heuristic methods with no proven performance guarantees

• Basic idea
– Do some form of iterative or progressive alignment
– For example, do an alignment based on a minimum spanning tree of some sort
  • Find two closest nodes and join them
    – how should we define closeness?
  • then iteratively add closest non-aligned node to the alignment

One method of defining closeness

• sd(i,j) scores
– given a scoring scheme
– Compute D(Si, Sj)
– 100 times do
  • “Jumble” Si and Sj and compute D(jum(Si), jum(Sj))
– Compute mean and standard deviation of these 100 jumbled comparisons
– Define sd(i,j) = D(Si, Sj)/standard deviation (no mean?)

• Intuition
– Strings Si and Sj contain non-random structures (hopefully secondary structure) in common if sd(i,j) is high
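The jumbling procedure can be sketched as follows. This is only an illustration under stated assumptions: D is taken to be unit-cost edit distance, "jumble" is a uniform random shuffle of the characters, and the function names and the `seed` parameter are invented here. The code computes the ratio exactly as written above and also returns the mean, so a mean-centered variant can be tried if that is what was intended.

```python
import random
import statistics


def edit_distance(a, b):
    """Unit-cost edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def jumble(s, rng):
    """Random permutation of the characters of s (composition is preserved)."""
    chars = list(s)
    rng.shuffle(chars)
    return "".join(chars)


def sd_score(si, sj, trials=100, seed=0):
    rng = random.Random(seed)  # seeded so the result is reproducible
    observed = edit_distance(si, sj)
    jumbled = [edit_distance(jumble(si, rng), jumble(sj, rng))
               for _ in range(trials)]
    mean = statistics.mean(jumbled)
    sd = statistics.stdev(jumbled)
    # The ratio is undefined when every jumbled comparison scores the same.
    ratio = observed / sd if sd > 0 else None
    return {"D": observed, "mean": mean, "sd": sd, "sd_score": ratio}


result = sd_score("vintner", "writers")
print(result)
```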


• Motivation

• Definition


• Algorithms

• Family Representations
– Profiles
– Regular expressions/motifs

Representation Problem

• Input
– family of sequences that typically have a known biological similarity

• Desired output
– Representation of this family of sequences that reveals any string/sequence similarities that hopefully are related to their biological similarity

Profiles

• Strings {abca, ababa, accb, cbbc}

a b c - a
a b a b a
a c c b -
c b - b c

• Profile (percent of rows showing each character, per column)

      1    2    3    4    5
a    75    0   25    0   50
b     0   75    0   75    0
c    25   25   50    0   25
-     0    0   25   25   25
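The profile is just a per-column frequency count. A minimal sketch, with the function name `profile` and the fixed alphabet string chosen for this example:

```python
from collections import Counter


def profile(rows, alphabet="abc-"):
    """Percent of rows showing each character, for every column of the MSA."""
    k = len(rows)
    table = {ch: [] for ch in alphabet}
    for column in zip(*rows):
        counts = Counter(column)  # a missing character counts as 0
        for ch in alphabet:
            table[ch].append(100 * counts[ch] / k)
    return table


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
for ch, percents in profile(msa).items():
    print(ch, percents)
```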

Log odds ratio

• Strings {abca, ababa, accb, cbbc}

a b c - a
a b a b a
a c c b -
c b - b c

• Profile

      1    2    3    4    5
a    75    0   25    0   50
b     0   75    0   75    0
c    25   25   50    0   25
-     0    0   25   25   25

• p(a) = 6/20 = 30%

• p(a,1) = 3/4 = 75%

• log (p(x,j)/p(x)) is entry

• Example (without logs)

       1     2     3     4     5
a    2.5     0   .83     0   1.7
b      0   2.5     0   2.5     0
c      1     1     2     0     1
-      0     0   1.7   1.7   1.7
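The ratio table can be recomputed from the alignment: p(x) is the frequency of x over all 20 cells and p(x,j) is its frequency in column j. The sketch below uses hypothetical function names; the base-2 logarithm in `log_odds` is an assumption, since the base is not specified above.

```python
import math
from collections import Counter


def odds_ratios(rows):
    """p(x,j) / p(x) for every character x and column j (no logarithm)."""
    k = len(rows)
    total = sum(len(r) for r in rows)
    background = Counter("".join(rows))  # counts over the whole alignment
    table = {}
    for ch, count in background.items():
        p_x = count / total
        table[ch] = [sum(c == ch for c in col) / k / p_x for col in zip(*rows)]
    return table


def log_odds(rows):
    """Base-2 log of the ratios (base is an assumption); 0 maps to -infinity."""
    return {ch: [math.log2(r) if r > 0 else -math.inf for r in ratios]
            for ch, ratios in odds_ratios(rows).items()}


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
for ch, ratios in sorted(odds_ratios(msa).items()):
    print(ch, [round(r, 2) for r in ratios])
```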

Nice feature of profiles

• Natural extension of alignment and scoring of strings to profiles

• Aligning a string to a profile
– We can generalize notions of pairwise string alignment

• Scoring
– Compute a weighted sum based on frequency of characters in the column

• Can generalize to profile to profile alignments

• Optimal alignment
– Dynamic programming can solve

Signature representations

• Signature or motif signature
– pattern contained as a substring in most members of a family
– typically represented as a regular expression
– Such a regular expression might be derived given a multiple sequence alignment
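One very simple way to derive a regular expression from an alignment is to turn each column into a character class and mark it optional when the column contains a space. This is a toy rule chosen for illustration (the name `alignment_to_regex` is invented), not an established motif-finding method; real signatures are usually built from conserved regions only.

```python
import re


def alignment_to_regex(rows, gap="-"):
    """One token per column: a character class, optional if the column has a gap."""
    parts = []
    for column in zip(*rows):
        chars = sorted(set(column) - {gap})
        if not chars:
            continue  # column made only of gaps carries no information
        if len(chars) == 1:
            token = re.escape(chars[0])
        else:
            token = "[" + "".join(re.escape(c) for c in chars) + "]"
        if gap in column:
            token += "?"
        parts.append(token)
    return "".join(parts)


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
pattern = alignment_to_regex(msa)
print(pattern)
print([bool(re.fullmatch(pattern, s)) for s in ["abca", "ababa", "accb", "cbbc"]])
```

By construction every string in the alignment matches its own pattern, but the pattern also accepts strings outside the family, which is the usual trade-off with signatures of this kind.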
