Multiple Sequence Alignments
Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and Computer Science
Marquette University
MSCS 230: Bioinformatics I - Multiple Sequence Alignment
Overview
Example
Multiple sequence alignment of 7 neuroglobins using ClustalX
Example
• Searching for domains with RPS-BLAST
Applications of Multiple Sequence Alignment
• Identify conserved domains/elements in sequences
• Compare regions of similarity among multiple organisms
• Identify probes for similar sequences in other organisms
• Develop PCR primers
• Phylogenetic analysis
Definition
A multiple alignment of strings S1, …, Sk is a series of strings with spaces S1’, …, Sk’ such that
• |S1’| = … = |Sk’|
• Sj’ is an extension of Sj by insertion of spaces
Goal: Find an optimal multiple alignment.
Scoring Alignments
• In order to find an optimal alignment, we need to be able to measure how good an alignment is
• Sum of pairs (SP) method: in a column, score each pair of letters and total the scores. Pairs of gaps score 0.
• Total up the column scores to get the score of the alignment
SP Method Example
• Using the BLOSUM62 matrix, gap penalty -8
• In column 1, we have pairs (-,S), (-,S), (S,S)
• k(k-1)/2 pairs per column

- I K
S I K
S S E

Column 1 score: -8 - 8 + 4 = -12
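The SP calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `pair_score`, `sp_column`, and `sp_score` are hypothetical, and the dictionary holds only the handful of BLOSUM62 entries this three-sequence example needs, copied in by hand.

```python
from itertools import combinations

GAP = "-"
GAP_PENALTY = -8  # gap penalty used on the slide

# Assumption: only the BLOSUM62 entries needed for this example,
# keyed by the alphabetically sorted residue pair.
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("I", "I"): 4, ("I", "S"): -2,
    ("K", "K"): 5, ("E", "K"): 1, ("E", "E"): 5,
}

def pair_score(a, b):
    """Score one pair of symbols; a pair of gaps scores 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return GAP_PENALTY
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]

def sp_column(column):
    """Sum the scores of all k(k-1)/2 pairs in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_score(rows):
    """Total the column scores over the whole alignment."""
    return sum(sp_column(col) for col in zip(*rows))

alignment = ["-IK", "SIK", "SSE"]
print([sp_column(col) for col in zip(*alignment)])  # [-12, 0, 7]
print(sp_score(alignment))                          # -5
```

The first column reproduces the -12 worked out on the slide; the other two columns follow the same rule.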
Dynamic Programming
• The dynamic programming approach can be adapted to MSA
• For simplicity, assume k sequences of length n
• The dynamic programming array F is k-dimensional, with length n+1 in each dimension (including initial gaps)
• The entry F(i1, …, ik) represents the score of an optimal alignment for s1[1..i1], …, sk[1..ik]
Dynamic Programming
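Letting i represent the vector (i1, …, ik) and b represent a nonzero binary vector of length k, we fill in the array with the formula

\[ F(i) = \max_{b} \left[ F(i - b) + SP(Column(s, i, b)) \right] \]

where (selecting a column to score)

\[ Column(s, i, b) = (c_j)_{1 \le j \le k}, \qquad c_j = \begin{cases} s_j[i_j] & \text{if } b_j = 1 \\ \text{-} & \text{if } b_j = 0 \end{cases} \]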
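The recurrence can be tried out directly on very short sequences. The sketch below is an assumption-laden illustration rather than a practical tool: the function names are hypothetical, it uses a simple match +1, mismatch 0, gap -1 scheme in place of a substitution matrix, it skips any b that would make an index negative, and it uses memoized recursion instead of an explicit k-dimensional array.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"

def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Assumed toy scoring scheme; a pair of gaps scores 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def sp_column(column):
    """Sum-of-pairs score of a single column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def msa_score(seqs):
    """Optimal SP score of a multiple alignment of seqs (exponential in len(seqs))."""
    k = len(seqs)
    # All nonzero binary vectors b of length k.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]

    @lru_cache(maxsize=None)
    def F(i):
        if not any(i):          # F(0, ..., 0) = 0
            return 0
        best = None
        for b in moves:
            prev = tuple(ij - bj for ij, bj in zip(i, b))
            if min(prev) < 0:   # b would step before the start of a sequence
                continue
            # Column(s, i, b): the letter s_j[i_j] if b_j = 1, otherwise a gap.
            column = [seqs[j][i[j] - 1] if b[j] else GAP for j in range(k)]
            candidate = F(prev) + sp_column(column)
            if best is None or candidate > best:
                best = candidate
        return best

    return F(tuple(len(s) for s in seqs))

print(msa_score(["AC", "AC", "AC"]))  # 6
print(msa_score(["AC", "A"]))         # 0
```

With k = 2 this reduces to ordinary pairwise global alignment, which is a convenient sanity check.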
Example
s1: MPE
s2: MKE
s3: MSKE
s4: SKE

• Let i = (1,1,1,1), b = (1,0,0,0)
• Checking F(0,1,1,1), that is, F(i-b)
• Column(s,i,b) is M---
• SP-score is -24 (assuming gap penalty of -8)
Analysis
• O(n^k) entries to fill
• Each entry combines O(2^k) other entries
• Costs O(k^2) to calculate each SP score
• Overall cost is O(k^2 2^k n^k), or exponential in the number of sequences!
• MSA with SP-score has been shown to be NP-complete
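To get a feel for how quickly this bound grows, the short sketch below evaluates k^2 2^k n^k for a few values of k. The function name `exact_msa_cost` and the choice n = 100 are illustrative assumptions; the result is a rough operation count, not a running time.

```python
def exact_msa_cost(k, n):
    """Rough operation count k^2 * 2^k * n^k for exact MSA by dynamic programming."""
    return k**2 * 2**k * n**k

# Assumed sequence length of 100 residues.
for k in range(2, 7):
    print(k, exact_msa_cost(k, 100))
```

Going from two to three sequences already raises the count from 160,000 to 72,000,000.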
Star Alignments
• Heuristic method for multiple sequence alignments
• Select a sequence sc as the center of the star
• For each sequence si among s1, …, sk with index i ≠ c, perform a Needleman-Wunsch global alignment of si against sc
• Aggregate alignments with the principle “once a gap, always a gap.”
Star Alignments Example
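s1: MPE
s2: MKE
s3: MSKE
s4: SKE

Star: s2 is the center, joined to s1, s3, and s4.

Pairwise alignments with the center:

MPE
| |
MKE

MSKE
| ||
M-KE

SKE
 ||
MKE

Aggregating one sequence at a time:

MPE
MKE

M-PE
M-KE
MSKE

M-PE
M-KE
MSKE
S-KE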
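The star procedure can be sketched as follows for the four example sequences with s2 as the center. This is one possible reading of "once a gap, always a gap", under assumed scoring (match +1, mismatch 0, gap -1); the function names are hypothetical, and ties in the traceback are broken in favor of the diagonal.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment of a and b; returns (score, gapped a, gapped b)."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    ra, rb, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback from the bottom-right corner
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ra)), "".join(reversed(rb))

def merge(rows, center_aln, other_aln):
    """Add one pairwise alignment to rows; rows[0] is the gapped center so far."""
    cur, out, new, p, q = rows[0], [[] for _ in rows], [], 0, 0
    while p < len(cur) or q < len(center_aln):
        cur_gap = p < len(cur) and cur[p] == "-"
        new_gap = q < len(center_aln) and center_aln[q] == "-"
        if cur_gap and not new_gap:      # gap already in the center: keep it
            for r, o in zip(rows, out):
                o.append(r[p])
            new.append("-"); p += 1
        elif new_gap and not cur_gap:    # new gap in the center: open a column everywhere
            for o in out:
                o.append("-")
            new.append(other_aln[q]); q += 1
        else:                            # same center position in both
            for r, o in zip(rows, out):
                o.append(r[p])
            new.append(other_aln[q]); p += 1; q += 1
    return ["".join(o) for o in out] + ["".join(new)]

def star_align(seqs, c):
    """Star alignment of seqs around center index c, returned in input order."""
    rows, order = [seqs[c]], [c]
    for i, s in enumerate(seqs):
        if i != c:
            _, center_aln, other_aln = needleman_wunsch(seqs[c], s)
            rows = merge(rows, center_aln, other_aln)
            order.append(i)
    result = [None] * len(seqs)
    for idx, row in zip(order, rows):
        result[idx] = row
    return result

print(star_align(["MPE", "MKE", "MSKE", "SKE"], 1))
# ['M-PE', 'M-KE', 'MSKE', 'S-KE']
```

Once s3 forces a gap into the center, that column stays in every later row, which is why s4 ends up as S-KE even though its pairwise alignment with s2 had no gaps.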
Choosing a center
• Try them all and pick the one with the best score
• Calculate all O(k^2) alignments, and pick the sequence sc that maximizes

\[ \sum_{i \neq c} score(s_c, s_i) \]
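The second option can be sketched as below. The names `similarity` and `choose_center` are hypothetical, the scoring scheme (match +1, mismatch 0, gap -1) is an assumption, and ties are resolved in favor of the lowest index.

```python
def similarity(a, b, match=1, mismatch=0, gap=-1):
    """Needleman-Wunsch global alignment score, keeping only one row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def choose_center(seqs):
    """Return the index c that maximizes the sum of score(s_c, s_i) over i != c."""
    def total(c):
        return sum(similarity(seqs[c], s) for i, s in enumerate(seqs) if i != c)
    return max(range(len(seqs)), key=total)

seqs = ["MPE", "MKE", "MSKE", "SKE"]
print(choose_center(seqs))  # 1, i.e. s2 = MKE
```

Under this scoring the four totals are 4, 6, 5, and 5, so MKE is selected as the center.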
Analysis
• Assuming all sequences have length n
• O(n^2) to calculate a global alignment
• O(k) global alignments to calculate
• Using a reasonable data structure for joining alignments, each join is no worse than O(kl), where l is an upper bound on alignment lengths
• O(kn^2 + k^2 l) overall cost
Tree Alignments
• Model the k sequences with a tree having k leaves (1 to 1 correspondence)
• Compute a weight for each edge, which is the similarity score of the two sequences it joins
• Sum of all the weights is the score of the tree
• Find the tree with maximum score
Tree alignment example
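• Match +1, gap -1, mismatch 0
• Tree: leaves CAT and GT attach to internal node x; leaves CTG and CG attach to internal node y; x and y are joined by an edge
• If x=CT and y=CG, score of 6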
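The score of 6 can be checked by summing the similarity over the five edges of the tree. The sketch below assumes the tree shape with CAT and GT attached to x, and CTG and CG attached to y; the node names `l1` to `l4` and the function names are hypothetical.

```python
def similarity(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment score with match +1, mismatch 0, gap -1."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def tree_score(edges, labels):
    """Sum of the edge weights, each being the similarity of its two endpoints."""
    return sum(similarity(labels[u], labels[v]) for u, v in edges)

# Assumed tree shape: x joins CAT and GT, y joins CTG and CG, plus the edge x-y.
labels = {"l1": "CAT", "l2": "GT", "l3": "CTG", "l4": "CG", "x": "CT", "y": "CG"}
edges = [("x", "l1"), ("x", "l2"), ("y", "l3"), ("y", "l4"), ("x", "y")]
print(tree_score(edges, labels))  # 6
```

The edge weights are 1, 1, 1, 2, and 1 respectively; the hard part of tree alignment is choosing the internal labels x and y, which this sketch takes as given.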
Analysis
• The tree alignment problem is NP-complete
  • Hence, phylogenetic tree generation is NP-complete
• Again, likely only exponential-time solutions are available (for optimal answers)
Progressive Approaches
• CLUSTALW
  • Perform pairwise alignments
  • Construct a tree, joining most similar sequences first (guide tree)
  • Align sequences sequentially, using the phylogenetic tree
• PILEUP
  • Similar to CLUSTALW
  • Uses UPGMA to produce tree (chapter 6)
Problems with Progressive Alignments
• MSA depends on pairwise alignments
• If sequences are very distantly related, there is a much higher likelihood of errors
• Care must be taken in choosing scoring matrices and penalties
• Other approaches use Bayesian methods such as hidden Markov models
When Craig Talks Next
• Introduction to Bayesian Statistics
• Profile and Block analysis
• Expectation Maximization (MEME)
• Introduction to HMMs
• Multiple sequence alignments using HMMs