Multiple Sequence Alignments


Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and Computer Science
Marquette University

MSCS 230: Bioinformatics I - Multiple Sequence Alignment


Overview



Example

Multiple sequence alignment of 7 neuroglobins using ClustalX



Example

• Searching for domains with RPS-BLAST



Applications of Multiple Sequence Alignment

• Identify conserved domains/elements in sequences
• Compare regions of similarity among multiple organisms
• Identify probes for similar sequences in other organisms
• Develop PCR primers
• Phylogenetic analysis



Definition

A multiple alignment of strings S1, …, Sk is a series of strings with spaces S1’, …, Sk’ such that
• |S1’| = … = |Sk’|
• Sj’ is an extension of Sj by insertion of spaces

Goal: Find an optimal multiple alignment.
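
The two conditions in the definition can be checked mechanically. The sketch below is only an illustration: the function name `is_multiple_alignment` and the use of `-` as the space character are assumptions, not part of the slides.

```python
def is_multiple_alignment(originals, aligned, gap="-"):
    """Check the two conditions of the definition (hypothetical helper).

    originals: the strings S1..Sk
    aligned:   the candidate strings S1'..Sk'
    """
    if len(originals) != len(aligned):
        return False
    # Condition 1: all aligned strings have the same length.
    same_length = len({len(row) for row in aligned}) <= 1
    # Condition 2: removing the spaces from Sj' gives back Sj.
    extends = all(row.replace(gap, "") == s
                  for s, row in zip(originals, aligned))
    return same_length and extends


seqs = ["MPE", "MKE", "MSKE", "SKE"]
print(is_multiple_alignment(seqs, ["-MPE", "-MKE", "MSKE", "-SKE"]))
print(is_multiple_alignment(seqs, ["MPE", "MKE", "MSKE", "SKE"]))
```

The first call passes both conditions; the second fails because the rows have different lengths.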



Scoring Alignments

• In order to find an optimal alignment, we need to be able to measure how good an alignment is
• Sum of pairs (SP) method: in a column, score each pair of letters and total the scores. Pairs of gaps score 0.
• Total up scores for each column
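
Written as a formula, the SP score of an alignment might look as follows. The symbols are assumptions introduced here for illustration: L is the common length of the aligned strings, and σ is the pairwise score (substitution matrix value for two letters, gap penalty for a letter against a space, and 0 for two spaces).

```latex
\[
  SP(S_1', \dots, S_k')
    = \sum_{c=1}^{L} \; \sum_{1 \le p < q \le k}
      \sigma\bigl(S_p'[c],\, S_q'[c]\bigr),
  \qquad \sigma(-,-) = 0 .
\]
```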



SP Method Example

• Using BLOSUM62 matrix, gap penalty -8
• In column 1, we have pairs -,S  -,S  S,S
• k(k-1)/2 pairs per column

- I K
S I K
S S E

Column 1: -8 - 8 + 4 = -12
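
The column arithmetic can be reproduced with a short script. This is a sketch under stated assumptions: `BLOSUM62_SUBSET` holds only the handful of BLOSUM62 entries this example needs (it is not the full matrix), and the names `pair_score` and `column_scores` are hypothetical.

```python
from itertools import combinations

GAP = "-"
# Only the BLOSUM62 entries needed for this example; keys are sorted pairs.
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("I", "I"): 4, ("I", "S"): -2,
    ("K", "K"): 5, ("E", "K"): 1,
}


def pair_score(a, b, gap_penalty=-8):
    """Score one pair of symbols taken from the same column."""
    if a == GAP and b == GAP:
        return 0                      # pairs of gaps score 0
    if a == GAP or b == GAP:
        return gap_penalty            # letter against a gap
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def column_scores(rows, gap_penalty=-8):
    """SP score of every column: sum over all k(k-1)/2 pairs."""
    return [sum(pair_score(a, b, gap_penalty)
                for a, b in combinations(col, 2))
            for col in zip(*rows)]


rows = ["-IK", "SIK", "SSE"]
scores = column_scores(rows)
print(scores, sum(scores))
```

The first entry of `scores` is the -12 computed on the slide for column 1; the alignment score is the sum over all columns.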



Dynamic Programming

• The dynamic programming approach can be adapted to MSA
• For simplicity, assume k sequences of length n
• The dynamic programming array F is k-dimensional, of length n+1 in each dimension (including initial gaps)
• The entry F(i1, …, ik) represents the score of an optimal alignment for s1[1..i1], …, sk[1..ik]



Dynamic Programming

Letting i represent the vector (i1,…,ik) and b represent a nonzero binary vector of length k, we fill in the array with the formula

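$$F(i) = \max_{b} \left[ F(i - b) + SP(\mathrm{Column}(s, i, b)) \right]$$

where (selecting a column to score)

$$\mathrm{Column}(s, i, b) = (c_j)_{1 \le j \le k}, \qquad c_j = \begin{cases} s_j[i_j] & \text{if } b_j = 1 \\ \text{-} & \text{if } b_j = 0 \end{cases}$$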



Example

s1: MPE
s2: MKE
s3: MSKE
s4: SKE

• Let i=(1,1,1,1), b=(1,0,0,0)
• Checking F(0,1,1,1) (i-b)
• Column(s,i,b) is M---
• SP-score is -24 (assuming gap penalty of -8)
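
The recurrence and the example column can be sketched in code. This is a minimal illustration, not a practical tool: the letter scores (match +1, mismatch -1) are assumptions chosen for brevity rather than BLOSUM62, and the names `column`, `sp_column`, and `msa_score` are hypothetical. Indices are 1-based as on the slides.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"


def pair_score(a, b, gap=-8, match=1, mismatch=-1):
    """Assumed toy scoring; gap/gap pairs score 0 as in the SP method."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch


def sp_column(col):
    """Sum-of-pairs score of one column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


def column(seqs, i, b):
    """Column(s, i, b): s_j[i_j] where b_j = 1, a gap where b_j = 0."""
    return tuple(s[ij - 1] if bj else GAP for s, ij, bj in zip(seqs, i, b))


def msa_score(seqs):
    """Optimal SP score via F(i) = max_b [F(i - b) + SP(Column(s, i, b))]."""
    k = len(seqs)
    moves = [b for b in product((0, 1), repeat=k) if any(b)]  # nonzero b

    @lru_cache(maxsize=None)
    def F(i):
        if not any(i):
            return 0                  # F(0, ..., 0): nothing aligned yet
        best = None
        for b in moves:
            prev = tuple(ij - bj for ij, bj in zip(i, b))
            if min(prev) < 0:
                continue              # b would step outside the array
            cand = F(prev) + sp_column(column(seqs, i, b))
            if best is None or cand > best:
                best = cand
        return best

    return F(tuple(len(s) for s in seqs))


seqs = ["MPE", "MKE", "MSKE", "SKE"]
col = column(seqs, (1, 1, 1, 1), (1, 0, 0, 0))
print(col, sp_column(col))
print(msa_score(seqs))
```

The first `print` shows the column M--- with SP score -24, as in the example. The table has one entry per index vector, so this only works for a few short sequences.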



Analysis

• O(n^k) entries to fill
• Each entry combines O(2^k) other entries
• Costs O(k^2) to calculate each SP score
• Overall cost is O(k^2 2^k n^k), or exponential in the number of sequences!
• MSA with SP-score has been shown to be NP-complete
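
To get a feel for how quickly this bound grows, the expression k^2 2^k n^k can simply be evaluated. The function name `dp_cost` and the choice n = 100 are assumptions for illustration; the result is an operation count up to constant factors, not a running time.

```python
def dp_cost(k, n):
    """Evaluate k^2 * 2^k * n^k, the bound from the analysis."""
    return k**2 * 2**k * n**k


for k in (2, 3, 4, 5):
    print(k, dp_cost(k, 100))
```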



Star Alignments

• Heuristic method for multiple sequence alignments
• Select a sequence sc as the center of the star
• For each sequence si in s1, …, sk such that index i ≠ c, perform a Needleman-Wunsch global alignment of si with sc
• Aggregate alignments with the principle “once a gap, always a gap.”



Star Alignments Example

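[Figure: star graph over the sequences s1, s2, s3, s4]

s1: MPE
s2: MKE
s3: MSKE
s4: SKE

Pairwise alignments with MKE (s2):

MPE
| |
MKE

MSKE
 ||
-MKE

SKE
 ||
MKE

Merging the alignments one at a time:

MPE
MKE

-MPE
-MKE
MSKE

-MPE
-MKE
MSKE
-SKE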
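
The procedure can be sketched end to end as follows. This is an illustration under assumptions, not the slides' exact method: the scores (match +1, mismatch -1, gap -2) and the names `needleman_wunsch`, `merge`, and `star_alignment` are chosen here, and the center index is passed in by hand.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trace back from the bottom-right corner.
    ra, rb = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ra)), "".join(reversed(rb))


def merge(rows, c_al, s_al):
    """Add one pairwise alignment (center c_al vs s_al) to the rows.

    rows[0] is the center as aligned so far. Gaps already present in
    the center are never removed ("once a gap, always a gap").
    """
    master = rows[0]
    new_rows = [[] for _ in rows]
    new_seq = []
    p = q = 0
    while p < len(master) or q < len(c_al):
        if p < len(master) and q < len(c_al) and master[p] == c_al[q]:
            # Same center symbol on both sides: reuse the column.
            for out, row in zip(new_rows, rows):
                out.append(row[p])
            new_seq.append(s_al[q]); p += 1; q += 1
        elif p < len(master) and master[p] == "-":
            # Gap only in the existing center: pad the new sequence.
            for out, row in zip(new_rows, rows):
                out.append(row[p])
            new_seq.append("-"); p += 1
        else:
            # Gap only in the new pairwise alignment: pad existing rows.
            for out in new_rows:
                out.append("-")
            new_seq.append(s_al[q]); q += 1
    return ["".join(r) for r in new_rows] + ["".join(new_seq)]


def star_alignment(seqs, c):
    """Align every sequence to seqs[c] and merge; rows keep input order."""
    rows, order = [seqs[c]], [c]
    for i, s in enumerate(seqs):
        if i == c:
            continue
        _, c_al, s_al = needleman_wunsch(seqs[c], s)
        rows = merge(rows, c_al, s_al)
        order.append(i)
    result = [None] * len(seqs)
    for idx, row in zip(order, rows):
        result[idx] = row
    return result


for row in star_alignment(["MPE", "MKE", "MSKE", "SKE"], 1):
    print(row)
```

With these assumed scores the gap may land in a different column than in the slide example, since the pairwise alignments chosen depend on the scoring scheme.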



Choosing a center

• Try them all and pick the one with the best score
• Calculate all O(k^2) alignments, and pick the sequence sc that maximizes

$$\sum_{i \neq c} \mathrm{score}(s_c, s_i)$$
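
The second option can be sketched as below. The scoring values (match +1, mismatch -1, gap -2) and the names `global_score` and `choose_center` are assumptions for illustration; only the global alignment score is computed, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score only, keeping two rows of the table."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def choose_center(seqs):
    """Index c maximizing the sum over i != c of score(s_c, s_i)."""
    def total(c):
        return sum(global_score(seqs[c], s)
                   for i, s in enumerate(seqs) if i != c)
    return max(range(len(seqs)), key=total)


seqs = ["MPE", "MKE", "MSKE", "SKE"]
print(choose_center(seqs))  # index into seqs (0-based)
```

For the four example sequences and these assumed scores, the function selects index 1, that is MKE.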



Analysis

• Assuming all sequences have length n
• O(n^2) to calculate a global alignment
• O(k) global alignments to calculate
• Using a reasonable data structure for joining alignments, no worse than O(kl), where l is an upper bound on alignment lengths
• O(kn^2 + k^2 l) overall cost



Tree Alignments

• Model the k sequences with a tree having k leaves (1-to-1 correspondence)
• Compute a weight for each edge, which is the similarity score
• Sum of all the weights is the score of the tree
• Find tree with maximum score



Tree alignment example

Match +1, gap -1, mismatch 0

If x=CT and y=CG, score of 6

[Figure: tree with leaves CAT, GT, CTG, and CG and internal nodes x and y]
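
The score of 6 can be checked with a few lines of code. The tree topology used here is an assumption, since the figure itself is not recoverable: x is joined to CAT and GT, y is joined to CTG and CG, and x is joined to y. The names `global_score` and `tree_score` are hypothetical.

```python
def global_score(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment score with the example's scoring scheme."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def tree_score(edges, labels):
    """Sum of edge weights; leaves are named by their own sequence."""
    def seq(node):
        return labels.get(node, node)
    return sum(global_score(seq(u), seq(v)) for u, v in edges)


# Assumed topology: (CAT, GT) under x, (CTG, CG) under y, x joined to y.
edges = [("CAT", "x"), ("GT", "x"), ("x", "y"), ("CTG", "y"), ("CG", "y")]
print(tree_score(edges, {"x": "CT", "y": "CG"}))
```

Under this topology the edge weights are 1, 1, 1, 1, and 2, which sum to the 6 stated on the slide.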



Analysis

• The tree alignment problem is NP-complete
• Hence, phylogenetic tree generation is NP-complete
• Again, likely only exponential-time solutions are available (for optimal answers)



Progressive Approaches

CLUSTALW
• Perform pairwise alignments
• Construct a tree, joining most similar sequences first (guide tree)
• Align sequences sequentially, using the phylogenetic tree

PILEUP
• Similar to CLUSTALW
• Uses UPGMA to produce tree (chapter 6)
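
The "join the most similar sequences first" step can be sketched with a small UPGMA clustering. This is a simplified illustration, not the code of CLUSTALW or PILEUP: the function name `upgma`, the nested-tuple tree representation, and the toy distance matrix are all assumptions, and branch lengths are ignored.

```python
def upgma(names, dist):
    """Build a guide tree (nested tuples) from a symmetric distance matrix."""
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (tree, size)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)          # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new = next_id
        next_id += 1
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # Size-weighted average distance to the merged cluster.
            d[(c, new)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del d[(a, b)]
        clusters[new] = ((tree_a, tree_b), size_a + size_b)
    return next(iter(clusters.values()))[0]


# Toy distances (assumed): A and B are close, C and D are close.
names = ["A", "B", "C", "D"]
dist = [[0, 2, 10, 10],
        [2, 0, 10, 10],
        [10, 10, 0, 4],
        [10, 10, 4, 0]]
print(upgma(names, dist))
```

A progressive aligner would then walk such a tree from the leaves upward, aligning the closest sequences first and adding the remaining sequences or groups in tree order.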



Progressive Approaches



Problems with Progressive Alignments

• MSA depends on pairwise alignments
• If sequences are very distantly related, there is a much higher likelihood of errors
• Care must be taken in choosing scoring matrices and penalties
• Other approaches use Bayesian methods such as hidden Markov models



When Craig Talks Next

• Introduction to Bayesian Statistics
• Profile and Block analysis
• Expectation Maximization (MEME)
• Introduction to HMMs
• Multiple sequence alignments using HMMs
