Multiple Sequence Alignment • Motivation • Definition • Scoring (Sum of Pairs scoring) • Algorithms • Family Representations
Date posted: 22-Dec-2015

Page 1: Multiple Sequence Alignment Motivation Definition Scoring (Sum of Pairs scoring) Algorithms Family Representations.

Multiple Sequence Alignment

• Motivation

• Definition

• Scoring (Sum of Pairs scoring)

• Algorithms

• Family Representations

Page 2:

Multiple Sequence Alignment

• Motivation

– What are we trying to accomplish?

• Definition

• Scoring (Sum of Pairs scoring)

• Algorithms

• Family Representations

Page 3:

Multiple Sequence Alignment

• Motivation

– Representation of protein families

– Identification and representation of conserved features of DNA/protein sequences that correlate with structure or function

– Deduction of evolutionary history from DNA/protein sequences

– Read pages 333-342

• A lot of this is done by “heuristic” or “intuition” and is difficult to automate

Page 4:

Biological Motivation

• Previous “First Fact of Biological Sequence Comparison”

– In biomolecular sequences (DNA, RNA, amino acid sequences), high sequence similarity usually implies significant functional or structural similarity

• Second Fact of Biological Sequence Comparison

– Evolutionarily and functionally related molecular strings can differ significantly throughout the string yet preserve the same 3D structure(s), 2D substructure(s), active sites, or dispersed residues

Page 5:

2 strings versus multiple strings

• 2 strings

– Based on first fact

• Find unknown biological relationships using string similarity

• Method: database searching

• Multiple strings

– Based loosely on second fact

• Given known biological relationships (function, structure, etc.), identify unknown conserved subpatterns in a set of strings

• These subpatterns can then be used as known patterns for other database searches

Page 6:

Multiple Sequence Alignment

• Motivation

• Definition

• Scoring (Sum of Pairs scoring)

• Algorithms

• Family Representations

Page 7:

Definition

• A global alignment of a set of k>2 strings {Si} is obtained

– by inserting spaces (dashes) into each Si so that all of the resulting strings have the same length

– and placing the strings in rows, one character (or dash) per column.

– Note that ALL positions of every Si are involved

• A local alignment of a set of k>2 strings {Si} is obtained

– by selecting one substring Si’ from each string Si

– globally aligning those substrings

Page 8:

Example

• Strings {abca, ababa, accb, cbbc}

a b c - a
a b a b a
a c c b -
c b - b c

Page 9:

Multiple Sequence Alignment

• Motivation

• Definition

• Scoring (Sum of Pairs scoring)

– Induced pairwise alignments

– Definition of sum of pairs (SP) scoring

– Justification (or lack thereof)

• Algorithms

• Family Representations

Page 10:

Scoring MSAs

• Key fact: there is no universally accepted score function

– My impression is that people evaluate MSAs by feel (they know a good one when they see it)

• Definitions

– Given an MSA M, the induced pairwise alignment of Si and Sj is obtained from M by removing all rows except the two rows for Si and Sj. Opposing spaces can be removed if desired.

Page 11:

Definitions

• Definitions

– Given an MSA M, the induced pairwise alignment of Si and Sj is obtained from M by removing all rows except the two rows for Si and Sj. Opposing spaces can be removed if desired.

– The score of an induced pairwise alignment is determined using any chosen scoring scheme for two-string alignment in the standard manner.

Page 12:

Example

• Example

a b c - a
a b a b a
a c c b -
c b - b c

• Induced alignment (rows 1 and 3)

a b c - a
a c c b -

• Score (match = 0, mismatch or space = 1)

0 + 1 + 0 + 1 + 1 = 3
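
The induced alignment and its score can be reproduced with a few lines of Python. This is a minimal sketch, assuming the unit-cost scheme that the example appears to use (0 for a match, 1 for a mismatch or a space); the function names `induced_pairwise` and `pair_score` are illustrative, not from the slides.

```python
def induced_pairwise(msa, i, j, drop_opposing_spaces=True):
    """Return rows i and j of the MSA as a two-row alignment."""
    columns = list(zip(msa[i], msa[j]))
    if drop_opposing_spaces:
        # A column holding two spaces carries no information for this pair.
        columns = [(x, y) for x, y in columns if not (x == "-" and y == "-")]
    row_i = "".join(x for x, _ in columns)
    row_j = "".join(y for _, y in columns)
    return row_i, row_j


def pair_score(row_i, row_j):
    """Assumed unit-cost score: 0 for a match, 1 for a mismatch or a space."""
    return sum(0 if x == y else 1 for x, y in zip(row_i, row_j))


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
rows = induced_pairwise(msa, 0, 2)   # rows for abca and accb
print(rows, pair_score(*rows))       # ('abc-a', 'accb-') 3
```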

Page 13:

Sum of Pairs (SP)

• Definition: The SP score of an MSA M is the sum of the scores of the pairwise global alignments induced by M

• Example

a b c - a
a b a b a
a c c b -
c b - b c

• SP score: 2 + 3 + 4 + 3 + 3 + 4 = 19
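
The six terms of this sum are the induced scores for the row pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4). A minimal sketch that recomputes them, again assuming unit costs (match 0, mismatch or space 1, two opposing spaces 0); `pair_score` and `sp_score` are illustrative names.

```python
from itertools import combinations


def pair_score(row_i, row_j):
    """Assumed unit-cost score of an induced pairwise alignment."""
    return sum(0 if x == y else 1 for x, y in zip(row_i, row_j))


def sp_score(msa):
    """Sum of the induced pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(msa, 2))


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
pair_scores = [pair_score(a, b) for a, b in combinations(msa, 2)]
print(pair_scores, sp_score(msa))  # [2, 3, 4, 3, 3, 4] 19
```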

Page 14:

Justification

• Difficult to give a sound biological justification for SP or any other scoring scheme

• Main reasons for studying it

– It is easy to work with

– It has been used by many people in studying MSA

– It is used in several packages

Page 15:

Multiple Sequence Alignment

• Motivation

• Definition

• Scoring (Sum of Pairs scoring)

• Algorithms

– Exact, NP-hard problem

– Approximation Algorithm (Center Star)

– Heuristic Methods

• Family Representations

Page 16:

Formal Problem

• Input

– k strings {Si}

– Scoring function

• Output

– MSA of {Si} with minimum (maximum) SP score

• Observation

– Exact solution is NP-hard

– Dynamic programming takes O(n^k) time, so solving exactly for more than about 6 strings of typical length is often not feasible

Page 17:

Heuristic Speedup

• View problem as a shortest path problem with O(n^k) nodes

• Given an upper bound on the actual value, we can eliminate exploration of many nodes using branch and bound ideas

• Key is to send values forward rather than backwards

– Backwards: All nodes will eventually be evaluated

– Forwards: Limit exploration to those nodes whose value can still be less than the current estimate of the optimum
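
The forward idea can be illustrated on the two-string case (k = 2) with the strings vintner and writers. This is a hedged sketch, not the slides' algorithm: it assumes unit edit costs, takes `upper_bound` as a caller-supplied bound (defaulting to the trivial bound max(n, m)), and prunes with a simple lower bound on the remaining cost, namely the difference of the remaining lengths. All names are illustrative.

```python
import math


def forward_edit_distance(s, t, upper_bound=None):
    """Unit-cost edit distance computed by pushing values forward.

    Returns (distance, number_of_expanded_nodes). The result is only
    guaranteed correct if upper_bound is at least the true distance.
    """
    n, m = len(s), len(t)
    if upper_bound is None:
        upper_bound = max(n, m)  # trivial bound: substitute, then pad
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    expanded = 0
    # Row-major order visits every predecessor of a node before the node.
    for i in range(n + 1):
        for j in range(m + 1):
            d = D[i][j]
            if d == math.inf:
                continue  # never reached from an unpruned node
            # At least this many further edits are needed to reach (n, m).
            remaining = abs((n - i) - (m - j))
            if d + remaining > upper_bound:
                continue  # cannot lie on a path within the bound
            expanded += 1
            if i < n:
                D[i + 1][j] = min(D[i + 1][j], d + 1)      # delete s[i]
            if j < m:
                D[i][j + 1] = min(D[i][j + 1], d + 1)      # insert t[j]
            if i < n and j < m:
                cost = 0 if s[i] == t[j] else 1            # match / substitute
                D[i + 1][j + 1] = min(D[i + 1][j + 1], d + cost)
    return D[n][m], expanded


loose = forward_edit_distance("vintner", "writers")
tight = forward_edit_distance("vintner", "writers", upper_bound=5)
print(loose, tight)
```

A tighter bound never expands more nodes than a looser one, while the returned distance stays the same as long as the bound is valid.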

Page 18:

Backwards

D(i,j)      w  r  i  t  e  r  s
         0  1  2  3  4  5  6  7
   0     0  1  2  3  4  5  6  7
v  1     1  1  2  3  4  5  6  7
i  2     2  2  2
n  3     3
t  4     4
n  5     5
e  6     6
r  7     7

Page 19:

Forwards

D(i,j)      w  r  i  t  e  r  s
         0  1  2  3  4  5  6  7
   0     0  1  2  3  4  5  6  7
v  1     1  1  2  3  4  5  6  7
i  2     2  2  2
n  3     3
t  4     4
n  5     5
e  6     6
r  7     7

Page 20:

Approximation Algorithms

• Given the hardness of computing the exact solution, how about developing algorithms that compute a solution that is guaranteed to be close to optimal?

• Goal: Find a polynomial-time algorithm A that minimizes

– sup_I A(I)/OPT(I), the worst-case ratio over all inputs I

• Only computer scientists seem interested in this

• Biologists seem to do things more heuristically

Page 21:

Alignments consistent with a tree

• D(Si,Sj) is the optimal weighted edit distance between Si and Sj

• Definition: Let T be a tree where each node is labeled with a string from {Si}. Then a multiple alignment of {Si} is consistent with T if the induced pairwise alignment of Si and Sj has score D(Si,Sj) for each pair of strings (Si, Sj) that are connected by an edge in T.

Page 22:

Example

-AX-Z
-A-YZ
-AXYZ
--XYZ
AYXYZ

• All edge alignment scores are optimal

• Others are not, such as AYXYZ with -AXYZ

[Figure: tree T with nodes AXZ, AYZ, AXYZ, XYZ, AYXYZ]

Page 23:

Theorem

• For any {Si} and any tree T whose nodes are labeled with distinct strings of {Si}, we can efficiently find an MSA M(T) of {Si} that is consistent with T.

• Proof

– Incrementally align any two adjacent nodes

• Two aligned gaps have zero cost

– Add gaps as necessary to other already aligned sequences

Page 24:

Example

• Align AXYZ and XYZ

AXYZ
-XYZ

• Align AYXYZ and -XYZ

A-XYZ  or  -AXYZ
--XYZ      --XYZ
AYXYZ      AYXYZ

• …

[Figure: tree T with nodes AXZ, AYZ, AXYZ, XYZ, AYXYZ]

Page 25:

Triangle Inequality

• Assume an alphabet-weighted scoring scheme s(x,y)

– x and y could be any character (or a space)

• A scoring scheme satisfies the triangle inequality if for any three characters (including a space) x, y, and z,

– s(x,z) <= s(x,y) + s(y,z)

• Note, not all scoring schemes used in biology satisfy this triangle inequality property

Page 26:

Center Star Method

• For {Si}, define Sc to be the string that minimizes Σj D(Sc, Sj), the sum of its distances to all strings Sj

• Define the center star to be the star where the center node is labeled with Sc

• Define Mc to be an MSA of {Si} that is consistent with the center star

• Define d(Si, Sj) to be the score of the pairwise alignment of Si and Sj induced by Mc.

• Denote the score of an alignment M as d(M).
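
The definitions above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not code from the slides: D is taken to be unit-cost edit distance, ties between optimal pairwise alignments are broken arbitrarily by the traceback, and the helper names (`align_pair`, `merge`, `center_star`) are made up for the example. The test strings are the five strings used in the example on the following pages.

```python
def align_pair(s, t):
    """Optimal unit-cost global alignment; returns (distance, s_row, t_row)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (s[i - 1] != t[j - 1]),
                          D[i - 1][j] + 1, D[i][j - 1] + 1)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to (0, 0)
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            a.append(s[i - 1])
            b.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            a.append(s[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(t[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))


def merge(rows, new_center, new_row):
    """Add new_row to the MSA `rows`, whose row 0 is the (gapped) center."""
    old_center = rows[0]
    out = [[] for _ in range(len(rows) + 1)]
    p = q = 0
    while p < len(old_center) or q < len(new_center):
        if p < len(old_center) and old_center[p] == "-":
            col = [row[p] for row in rows] + ["-"]   # keep old gap column
            p += 1
        elif q < len(new_center) and new_center[q] == "-":
            col = ["-"] * len(rows) + [new_row[q]]   # open a new gap column
            q += 1
        else:
            col = [row[p] for row in rows] + [new_row[q]]
            p, q = p + 1, q + 1
        for r, ch in enumerate(col):
            out[r].append(ch)
    return ["".join(r) for r in out]


def center_star(strings):
    """Return (index of Sc, MSA rows in input order) consistent with the star."""
    k = len(strings)
    c = min(range(k),
            key=lambda i: sum(align_pair(strings[i], s)[0] for s in strings))
    order, rows = [c], [strings[c]]
    for idx in range(k):
        if idx != c:
            _, new_center, new_row = align_pair(strings[c], strings[idx])
            rows = merge(rows, new_center, new_row)
            order.append(idx)
    msa = [None] * k
    for r, idx in enumerate(order):
        msa[idx] = rows[r]
    return c, msa


strings = ["AXZ", "AYZ", "AXYZ", "XYZ", "AYXYZ"]
c, msa = center_star(strings)
print(strings[c])
print("\n".join(msa))
```

Because several optimal pairwise alignments can exist (for example A-XYZ and -AXYZ against AYXYZ), the rows printed may differ from the slides' Mc while still being consistent with the center star.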

Page 27:

Example

• Σj D(AXYZ, Sj) = 4

• Mc before AYXYZ added

AXYZ
AX-Z
A-YZ
-XYZ

• Mc after AYXYZ added

A-XYZ
A-X-Z
A--YZ
--XYZ
AYXYZ

[Figure: center star with center AXYZ and leaves AXZ, AYZ, XYZ, AYXYZ]

Page 28:

Example continued

• Mc after AYXYZ added

A-XYZ
A-X-Z
A--YZ
--XYZ
AYXYZ

• d(AYZ,AYXYZ) = 2

• d(Mc) = 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + 2 + 2 = 16

[Figure: center star with center AXYZ and leaves AXZ, AYZ, XYZ, AYXYZ]

Page 29:

Results

• Lemma: Assuming the triangle inequality,

– d(Si, Sj) <= d(Si, Sc) + d(Sc, Sj)

– = D(Si, Sc) + D(Sc, Sj)

• Definition: Let M* be the optimal alignment of {Si} and d*(Si, Sj) be the score of the pairwise alignment of Si and Sj induced by M*.

• Theorem: d(Mc) / d(M*) <= 2(k-1)/k < 2

Page 30:

Proof
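
A sketch of the standard argument, built only from the lemma and the choice of Sc on the preceding pages. The shorthand \sigma_c for the center's distance sum is introduced here for brevity and is not from the slides; sums run over ordered pairs, which is why each SP score appears doubled.

```latex
\begin{align*}
\sigma_c &:= \sum_{j} D(S_c, S_j) \\
2\,d(M_c) &= \sum_{i}\sum_{j \ne i} d(S_i, S_j)
  \;\le\; \sum_{i}\sum_{j \ne i} \bigl[ D(S_i, S_c) + D(S_c, S_j) \bigr]
  \;=\; 2(k-1)\,\sigma_c \\
2\,d(M^*) &= \sum_{i}\sum_{j \ne i} d^*(S_i, S_j)
  \;\ge\; \sum_{i}\sum_{j \ne i} D(S_i, S_j)
  \;\ge\; \sum_{i} \sigma_c \;=\; k\,\sigma_c \\
\frac{d(M_c)}{d(M^*)} &\le \frac{2(k-1)\,\sigma_c}{k\,\sigma_c}
  \;=\; \frac{2(k-1)}{k} \;<\; 2
\end{align*}
```

The first inequality is the lemma; each D(S_l, S_c) is counted k-1 times as a first term and k-1 times as a second term. The second line uses that an induced alignment can never beat the optimal pairwise distance, and that Sc minimizes the distance sum over all candidate strings.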

Page 31:

Weighted SP

• Each induced pairwise score is multiplied by a weight w(i,j).

• Optimal weighted SP can be computed in exponential time (in k) using dynamic programming

• Little is known about approximation of weighted SP

– Why doesn’t center star give a guaranteed bound here?

Page 32:

Heuristic Techniques

• In practice, people tend to use more heuristic methods with no proven performance guarantees

• Basic idea

– Do some form of iterative or progressive alignment

– For example, do an alignment based on a minimum spanning tree of some sort

• Find two closest nodes and join them

– how should we define closeness?

• then iteratively add closest non-aligned node to the alignment

Page 34:

One method of defining closeness

• sd(i,j) scores

– given a scoring scheme

– Compute D(Si, Sj)

– 100 times do

• “Jumble” Si and Sj and compute D(jum(Si), jum(Sj))

– Compute mean and standard deviation of these 100 jumbled comparisons

– Define sd(i,j) = D(Si, Sj)/standard deviation (no mean?)

• Intuition

– Strings Si and Sj contain non-random structures (hopefully secondary structure) in common if sd(i,j) is high
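
The jumbling procedure can be sketched as below. This is an illustration under assumptions: unit-cost edit distance stands in for D (so here a small D means similar strings, whereas the "high is good" reading on the slide presumes a similarity score), jumbling is a uniform random shuffle, and the seed and function names are hypothetical. The function returns the ratio as defined on the slide together with the mean, which the slide's formula leaves unused.

```python
import random
import statistics


def edit_distance(s, t):
    """Unit-cost edit distance, used here as a stand-in for D."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        cur = [i]
        for j, y in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def jumble(s, rng):
    """Return a random permutation of the characters of s."""
    chars = list(s)
    rng.shuffle(chars)
    return "".join(chars)


def sd_score(si, sj, trials=100, seed=0):
    """Return (D / standard deviation, mean, standard deviation)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    observed = edit_distance(si, sj)
    jumbled = [edit_distance(jumble(si, rng), jumble(sj, rng))
               for _ in range(trials)]
    mean = statistics.mean(jumbled)
    sd = statistics.stdev(jumbled)
    ratio = observed / sd if sd > 0 else float("inf")
    return ratio, mean, sd


print(sd_score("vintner", "writers"))
```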

Page 35:

Multiple Sequence Alignment

• Motivation

• Definition

• Scoring (Sum of Pairs scoring)

• Algorithms

• Family Representations

– Profiles

– Regular expressions/motifs

Page 36:

Representation Problem

• Input

– family of sequences that typically have a known biological similarity

• Desired output

– Representation of this family of sequences that reveals any string/sequence similarities that hopefully are related to their biological similarity

Page 37:

Profiles

• Strings {abca, ababa, accb, cbbc}

a b c - a
a b a b a
a c c b -
c b - b c

• Profile (column frequencies, %)

     1   2   3   4   5
a   75   0  25   0  50
b    0  75   0  75   0
c   25  25  50   0  25
-    0   0  25  25  25
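
The profile is just a per-column character count divided by the number of rows. A minimal sketch (the function name `build_profile` and the dictionary layout are choices made for this example):

```python
from collections import Counter


def build_profile(msa, alphabet="abc-"):
    """Return {character: [percentage in column 1, column 2, ...]}."""
    k = len(msa)
    profile = {ch: [] for ch in alphabet}
    for column in zip(*msa):          # iterate over alignment columns
        counts = Counter(column)
        for ch in alphabet:
            profile[ch].append(100 * counts[ch] / k)
    return profile


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
for ch, row in build_profile(msa).items():
    print(ch, row)
```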

Page 38:

Log odds ratio

• Strings {abca, ababa, accb, cbbc}

a b c - a
a b a b a
a c c b -
c b - b c

• Profile (column frequencies, %)

     1   2   3   4   5
a   75   0  25   0  50
b    0  75   0  75   0
c   25  25  50   0  25
-    0   0  25  25  25

• p(a) = 6/20 = 30%

• p(a,1) = 3/4 = 75%

• The entry for character x in column j is log (p(x,j)/p(x))

• Example (without logs)

     1    2    3    4    5
a   2.5   0   .83   0   1.7
b    0   2.5   0   2.5   0
c    1    1    2    0    1
-    0    0   1.7  1.7  1.7
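
The ratio table can be recomputed directly from the alignment. A minimal sketch, assuming p(x) is the frequency of x over all 20 cells of the alignment (as in p(a) = 6/20) and p(x,j) is its frequency in column j; `odds_ratios` is an illustrative name. Taking logarithms of the nonzero entries gives the log odds values.

```python
from collections import Counter


def odds_ratios(msa, alphabet="abc-"):
    """Return {character: [p(x, j) / p(x) for each column j]}."""
    k, n_cols = len(msa), len(msa[0])
    background = Counter("".join(msa))      # counts over all k * n_cols cells
    table = {}
    for ch in alphabet:
        p_x = background[ch] / (k * n_cols)             # e.g. p(a) = 6/20
        table[ch] = [col.count(ch) / k / p_x for col in zip(*msa)]
    return table


msa = ["abc-a", "ababa", "accb-", "cb-bc"]
for ch, row in odds_ratios(msa).items():
    print(ch, [round(v, 2) for v in row])
```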

Page 39:

Nice feature of profiles

• Natural extension of alignment and scoring of strings to profiles

• Aligning a string to a profile

– We can generalize notions of pairwise string alignment

• Scoring

– Compute a weighted sum based on frequency of characters in the column

• Can generalize to profile to profile alignments

• Optimal alignment

– Can be found by dynamic programming
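
One way the string-to-profile generalization might look is sketched below. The assumptions are loud ones: costs are unit costs (0 for equal symbols, 1 otherwise), a profile is a list of columns each mapping a character to its frequency in [0, 1], and the score of a character against a column is the frequency-weighted sum of pairwise costs. The names `column_cost` and `align_to_profile` are hypothetical.

```python
def unit_cost(x, y):
    """Assumed pairwise cost: 0 for equal symbols, 1 otherwise."""
    return 0 if x == y else 1


def column_cost(ch, column, cost=unit_cost):
    """Weighted sum of the cost of ch against every character in a column."""
    return sum(freq * cost(ch, y) for y, freq in column.items())


def align_to_profile(s, profile, cost=unit_cost):
    """Minimum cost of globally aligning string s to the profile."""
    n, m = len(s), len(profile)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        # s[i-1] placed against a new column made only of spaces
        D[i][0] = D[i - 1][0] + cost(s[i - 1], "-")
    for j in range(1, m + 1):
        # a space in s placed against profile column j
        D[0][j] = D[0][j - 1] + column_cost("-", profile[j - 1], cost)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + column_cost(s[i - 1], profile[j - 1], cost),
                D[i - 1][j] + cost(s[i - 1], "-"),
                D[i][j - 1] + column_cost("-", profile[j - 1], cost),
            )
    return D[n][m]


# A profile built from the single string "abc" behaves like plain edit distance.
single = [{"a": 1.0}, {"b": 1.0}, {"c": 1.0}]
print(align_to_profile("abd", single))               # 1.0
print(column_cost("a", {"a": 0.75, "c": 0.25}))      # 0.25
```

The recurrence has the same shape as pairwise alignment; only the per-column cost changes, which is why the same dynamic programming applies.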

Page 40:

Signature representations

• Signature or motif signature

– pattern contained as a substring in most members of a family

– typically represented as a regular expression

– Such a regular expression might be derived given a multiple sequence alignment
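
As a small illustration, a signature can be tested with Python's `re` module. The pattern below is hypothetical: it is read off columns 1 and 2 of the four-string alignment used earlier (column 1 holds a or c, column 2 holds b or c) and is not a signature given in the slides.

```python
import re

# Hypothetical signature derived from alignment columns 1-2: {a,c} then {b,c}.
SIGNATURE = re.compile(r"[ac][bc]")


def has_signature(sequence, pattern=SIGNATURE):
    """True if the pattern occurs as a substring anywhere in the sequence."""
    return pattern.search(sequence) is not None


family = ["abca", "ababa", "accb", "cbbc"]
print([has_signature(s) for s in family])   # [True, True, True, True]
print(has_signature("bbaa"))                # False
```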