Blast Clustal


Transcript of Blast Clustal

Page 1: Blast Clustal

Bioinformática e Estrutura Molecular

A collection of sequences does not, by itself, increase the scientist's understanding of the biology of organisms. However, comparing new sequences with sequences of known function is one way of understanding the biology of the organism from which a new sequence comes. Thus, sequence analysis can be used to assign function to genes and proteins by studying the similarities between the compared sequences. Nowadays there are many tools and techniques that perform sequence comparisons (sequence alignment) and analyze the resulting alignments to understand the biology.

Sequence analysis in molecular biology and bioinformatics is an automated, computer-based examination of characteristic fragments, e.g. of a DNA strand. It includes the following topics:

1. The comparison of sequences in order to find similarities and dissimilarities between them (sequence alignment).
2. Identification of gene structures, reading frames, distributions of introns and exons, and regulatory elements.
3. Finding and comparing point mutations or single nucleotide polymorphisms (SNPs) in an organism in order to obtain genetic markers.
4. Revealing the evolution and genetic diversity of organisms.
5. Function annotation of genes.

Bioinformatics

Page 2: Blast Clustal

1st step: Are there other proteins with similar sequences?

To find out, we must search the existing protein sequence databases and compare our sequence with all the other sequences in the database.

>sp|P00974|BPT1_BOVIN Pancreatic trypsin inhibitor OS=Bos taurus PE=1 SV=2
MKMSRLCLSVALLVLLGTLAASTPGCDTSNQAKAQRPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGAIGPWENL
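The header line of the FASTA record above packs several fields into one string. A minimal Python sketch of pulling them apart is shown here; the function name `parse_uniprot_header` is made up for this example, and the layout `>db|accession|entry_name description OS=...` is assumed from this single record rather than from a format specification.

```python
def parse_uniprot_header(header):
    """Split a UniProt-style FASTA header line into its main fields."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    # Assumed layout: >db|accession|entry_name description OS=organism ...
    identifier, _, rest = header[1:].partition(" ")
    database, accession, entry_name = identifier.split("|")
    description = rest.split(" OS=")[0]
    return {
        "database": database,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


header = ">sp|P00974|BPT1_BOVIN Pancreatic trypsin inhibitor OS=Bos taurus PE=1 SV=2"
record = parse_uniprot_header(header)
print(record["accession"], record["entry_name"], record["description"])
```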

What we really want to be able to do is go from DNA sequence to protein function…!

http://blast.ncbi.nlm.nih.gov/Blast.cgi

Basic Local Alignment Search Tool

Bioinformatics - BLAST

Page 4: Blast Clustal

http://www.nature.com/scitable/topicpage/Basic-Local-Alignment-Search-Tool-BLAST-29096

Page 5: Blast Clustal

The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment. In general terms, this score is calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences.

The E-value gives an indication of the statistical significance of a given pairwise alignment and reflects the size of the database and the scoring system used. The lower the E-value, the more significant the hit. The E-value is the number of alignments with a score at least this high that are expected to occur by chance in a search of this size; for small values it is close to a probability, so an E-value of 0.05 corresponds to roughly a 5 in 100 (1 in 20) chance of the similarity occurring by chance alone.
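The link between bit score and E-value can be made concrete with the commonly quoted relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below uses plain lengths, which is a simplifying assumption (BLAST itself works with corrected "effective" lengths), and the search sizes are hypothetical.

```python
def expected_hits(bit_score, query_length, database_length):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 100-residue query against 10 million database residues.
for bits in (20, 30, 40, 50):
    print(bits, expected_hits(bits, 100, 10_000_000))
```

Each additional bit of score halves the E-value, and a larger database raises the E-value of the same alignment.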

Page 6: Blast Clustal

Each alignment reports the number of identical residues (Identities), the number of positions with a positive substitution score (Positives: identities plus conservative substitutions) and, if applicable, the number of gaps in the alignment.

The line between the two sequences indicates the similarities between the sequences:

- If the query and the subject have the same amino acid at a given location, the residue itself is shown.
- Conservative substitutions, as judged by the substitution matrix, are indicated with +.
- One or more dashes (-) within a sequence indicate insertions or deletions.
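The rules for the middle line can be sketched in a few lines of Python. The `SCORES` dictionary holds only two BLOSUM62 entries, enough for the toy pair used here; a real report would use the full matrix, and the function names are illustrative.

```python
# Illustrative subset of BLOSUM62 scores; a real search uses the full matrix.
SCORES = {("D", "E"): 2, ("G", "E"): -2}


def pair_score(a, b):
    """Look up a symmetric substitution score (0 if the pair is not listed)."""
    return SCORES.get((a, b), SCORES.get((b, a), 0))


def summarize(query, subject):
    """Return the midline plus identity, positive and gap counts."""
    midline, identities, positives, gaps = [], 0, 0, 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            gaps += 1
            midline.append(" ")
        elif q == s:
            identities += 1
            positives += 1  # identical residues also count as positives
            midline.append(q)
        elif pair_score(q, s) > 0:
            positives += 1
            midline.append("+")
        else:
            midline.append(" ")
    return "".join(midline), identities, positives, gaps


print(summarize("ANRGDFS", "ANR-EFS"))
```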

Page 7: Blast Clustal

What is a substitution matrix?

Page 8: Blast Clustal

BLOSUM (BLOcks of Amino Acid SUbstitution Matrix) is a substitution matrix used for sequence alignment of proteins.

Page 9: Blast Clustal

Why do we run sequence alignments?

- To observe patterns of conservation (or variability). To find the common motifs present in both sequences.

- To assess whether it is likely that two sequences evolved from the same sequence.

- To find out which sequences from the database are similar to the sequence at hand.

Alignment involves:

1. Construction of the best alignment between the sequences.
2. Assessment of the similarity from the alignment.

There are three different types of sequence alignment:

- Global alignment
- Local alignment
- Multiple sequence alignment

The Needleman-Wunsch algorithm for sequence alignment, 7th Melbourne Bioinformatics Course

Vladimir Likić, The University of Melbourne

Page 10: Blast Clustal

Why do we run sequence alignments?

- To infer homology from similarity measurements.
- If two sequences are homologous, then they should have regions of similarity.
- If two sequences are similar, then they are not necessarily homologous.

Homology

- Two genes share a common evolutionary history.
- Divergence from a common ancestral sequence.

Similarity

- An observable quantity.
- Two sequences have regions containing similar terms.

Homologies can be inferred from similarity measurements.

- The relationship is putative (more evidence is required).
- Example: for human ζ (zeta) crystallin and quinone oxidoreductase, sequence similarity suggests homology, while experimental evidence shows that they have different functions.

Page 11: Blast Clustal

Conserved regions in pairwise alignments:

- Conserved regions in distantly related organisms are usually the ones most responsible for structure, function or gene expression (e.g. substrate binding sites, disulfide bridges).
- For conserved regions between recently diverged organisms this may not be the case (e.g. mouse and rat, which have a higher number of conserved regions overall).

Page 12: Blast Clustal

Global alignment: the best alignment over the entire length of two sequences. It is suitable when the two sequences are of similar length, with a significant degree of similarity throughout. Example:

S I M I L A R I T Y

P I - L L A R - - -

Local alignment: involves stretches that are shorter than the entire sequences, possibly more than one. It is suitable when comparing substantially different sequences, which may differ significantly in length and have only short patches of similarity.

M I L A R

I L L A R

Page 13: Blast Clustal

Multiple alignment: Simultaneous alignment of more than two sequences. Suitable when searching for subtle conserved sequence patterns in a protein family, and when more than two sequences of the protein family are available.


S I M I L A R I T Y

P I - L L A R - - -

M O L A R I T Y

Page 14: Blast Clustal

Consider the "best" alignment of ATGGCGT and ATGAGT.

Intuitively we seek an alignment that maximizes the number of residue-to-residue matches.

Alignment by eye!

A T G G C G T
* * * - ! * *
A T G - A G T

Sequence alignment is the establishment of residue-to-residue correspondence between two or more sequences such that the order of residues in each sequence is preserved.

A gap, which indicates a residue-to-nothing match, may be introduced in either sequence.

A gap-to-gap match is meaningless and is not allowed.
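The three conditions of this definition (equal-length rows, residue order preserved, no gap-to-gap column) can be checked mechanically. The following is a small sketch; the function name `is_valid_alignment` and the use of "-" as the gap symbol are choices made for this example.

```python
def is_valid_alignment(row_a, row_b, seq_a, seq_b, gap="-"):
    """Check the definition: equal length, order preserved, no gap-to-gap column."""
    if len(row_a) != len(row_b):
        return False
    if any(a == gap and b == gap for a, b in zip(row_a, row_b)):
        return False
    # Removing the gaps must give back each original sequence, in order.
    return row_a.replace(gap, "") == seq_a and row_b.replace(gap, "") == seq_b


print(is_valid_alignment("ATGGCGT", "ATG-AGT", "ATGGCGT", "ATGAGT"))    # True
print(is_valid_alignment("ATGG-CGT", "ATG--AGT", "ATGGCGT", "ATGAGT"))  # False
```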

Page 15: Blast Clustal

Now we must introduce mathematics and a scoring system

Page 16: Blast Clustal

The scoring scheme

Given two sequences, we need a number to associate with each possible alignment (i.e. the alignment score = goodness of alignment).

The scoring scheme is a set of rules which assigns the alignment score to any given alignment of two sequences.

1. The scoring scheme is residue based: it consists of residue substitution scores (i.e. score for each possible residue alignment), plus penalties for gaps.

2. The alignment score is the sum of substitution scores and gap penalties.

Page 17: Blast Clustal

A simple scheme

Use +1 as a reward for a match, -1 as the penalty for a mismatch, and ignore gaps.

The best alignment "by eye" from before:

A T G G C G T
A T G - A G T

score: +1+1+1+0−1+1+1 = 4

An alternative alignment:

A T G G C G T
A - T G A G T

score: +1+0−1+1−1+1+1 = 2
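The two scores can be reproduced with a few lines of Python. This is only a sketch of the simple scheme (+1 match, -1 mismatch, 0 for a gap); the function name and default arguments are chosen for illustration.

```python
def simple_score(row_a, row_b, match=1, mismatch=-1, gap=0):
    """Score two aligned rows column by column with the simple scheme."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(simple_score("ATGGCGT", "ATG-AGT"))  # 4
print(simple_score("ATGGCGT", "A-TGAGT"))  # 2
```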

Page 18: Blast Clustal

The substitution matrix

A concise way to express the residue substitution costs is an N × N matrix (N is 4 for DNA and 20 for proteins).

The substitution matrix for the simple scoring scheme:

   C  T  A  G
C  1 -1 -1 -1
T -1  1 -1 -1
A -1 -1  1 -1
G -1 -1 -1  1

A better scheme

A and G are purines (a pyrimidine ring fused to an imidazole ring); T and C are pyrimidines (one six-membered ring). Assume we believe that, from an evolutionary standpoint, purine/pyrimidine mutations are less likely to occur than purine/purine or pyrimidine/pyrimidine mutations. Can we capture this in a substitution matrix?

   C  T  A  G
C  2  1 -1 -1
T  1  2 -1 -1
A -1 -1  2  1
G -1 -1  1  2
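The "better scheme" matrix follows from one rule: 2 for identical bases, 1 for a change within the same chemical class, -1 for a change between classes. A short sketch that regenerates it from that rule (names are illustrative):

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(a, b):
    """2 for a match, 1 within a chemical class, -1 between classes."""
    if a == b:
        return 2
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return 1 if same_class else -1


bases = "CTAG"
print("  " + "".join(f"{b:>3}" for b in bases))
for a in bases:
    print(a + " " + "".join(f"{substitution_score(a, b):>3}" for b in bases))
```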

Page 19: Blast Clustal


Protein substitution matrices are significantly more complex than DNA scoring matrices.

Proteins are composed of twenty amino acids, and physico-chemical properties of individual amino acids vary considerably.

A protein substitution matrix can be based on any property of amino acids: size, polarity, charge, hydrophobicity.

In practice the most important are evolutionary substitution matrices.

Page 20: Blast Clustal

PAM ("point accepted mutation") family: PAM250, PAM120, etc.

BLOSUM ("blocks substitution matrix") family: BLOSUM62, BLOSUM50, etc.

The substitution scores of both PAM and BLOSUM matrices are derived from the analysis of known alignments of related proteins: closely related proteins for PAM, and conserved blocks from more divergent proteins for BLOSUM.

The BLOSUM matrices are newer and are generally considered better.

Bioinformatics - Evolutionary substitution matrix

Page 21: Blast Clustal

Sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales.

For the BLOSUM matrices, Henikoff and Henikoff used multiple alignments of evolutionarily divergent proteins.

The probabilities used in the matrix calculation are computed by looking at "blocks" of conserved sequences found in multiple protein alignments. These conserved sequences are assumed to be of functional importance within related proteins.

To reduce bias from closely related sequences, segments in a block with a sequence identity above a certain threshold were clustered giving weight 1 to each such cluster. For the BLOSUM62 matrix, this threshold was set at 62%.

Pair frequencies were then counted between clusters; hence pairs were only counted between segments less than 62% identical.

It turns out that the BLOSUM62 matrix does an excellent job detecting similarities in distant sequences, and this is the matrix used by default in most recent alignment applications such as BLAST.
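As a sketch of how the counted pair frequencies become matrix entries, the usual log-odds form is given below. The symbols are introduced here for illustration: $p_{ij}$ is the observed frequency of the residue pair $(i, j)$ in the clustered blocks, and $q_i$, $q_j$ are the background frequencies of the two residues. The factor 2 reflects the half-bit units in which BLOSUM62 is usually stated.

$$ s_{ij} = \operatorname{round}\!\left( 2 \log_2 \frac{p_{ij}}{q_i \, q_j} \right) $$

A pair seen more often than expected by chance ($p_{ij} > q_i q_j$) gets a positive score, and a pair seen less often gets a negative one.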

Page 22: Blast Clustal

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

BLOSUM62 substitution matrix

Page 23: Blast Clustal

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R     5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N        6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D           6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C              9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q                 5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E                    5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G                       6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H                          8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I                             4  2 -3  1  0 -3 -2 -1 -3 -1  3
L                                4 -2  2  0 -3 -2 -1 -2 -1  1
K                                   5 -1 -3 -1  0 -1 -3 -2 -2
M                                      5  0 -2 -1 -1 -1 -1  1
F                                         6 -4 -2 -2  1  3 -1
P                                            7 -1 -1 -4 -3 -2
S                                               4  1 -3 -2 -2
T                                                  5 -2 -2  0
W                                                    11  2 -3
Y                                                        7 -1
V                                                           4

BLOSUM62 substitution matrix (upper triangle; the matrix is symmetric)

Page 24: Blast Clustal

Gaps

So far we have ignored gaps (which amounts to a gap penalty of 0).

A gap corresponds to an insertion or a deletion of a residue.

Conventional wisdom dictates that the penalty for a gap must be several times greater than the penalty for a mutation. That is because a gap or extra residue:

- Interrupts the entire polymer chain

- In DNA shifts the reading frame

Page 25: Blast Clustal

Gap initiation and extension

The conventional wisdom: the creation of a new gap should be strongly disfavored.

However, once a gap has been created, insertions/deletions of chunks of more than one residue should be much less expensive (insertion of whole domains often occurs).

A simple yet effective solution is affine gap penalties:

$$\gamma(n) = -o - (n - 1)\,e$$

where n is the length of the gap, o is the opening penalty and e is the extension penalty. Choosing e smaller than o encourages gap extension over the opening of new gaps.
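The affine penalty is easy to tabulate. In the sketch below the values o = 10 and e = 1 are arbitrary illustrative choices, not values prescribed by the slides.

```python
def affine_gap_score(length, opening=10, extension=1):
    """gamma(n) = -o - (n - 1) * e for a gap of n >= 1 residues."""
    if length < 1:
        raise ValueError("a gap has at least one residue")
    return -opening - (length - 1) * extension


for n in (1, 2, 5, 10):
    print(n, affine_gap_score(n))
```

With these values a gap of 10 residues costs 19, whereas ten separate one-residue gaps would cost 100.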

Page 26: Blast Clustal

Affine gaps: a physical insight

Affine gaps favor the alignment:

A T G T A G T G T A T A G T A C A T G C A
A T G T A G - - - - - - - T A C A T G C A

over the alignment:

A T G T A G T G T A T A G T A C A T G C A
A T G T A - - G - - T A - - - C A T G C A

Exactly what we want from the biological viewpoint.
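To see the preference numerically, one can total the affine penalty over every run of gaps in the lower row of each alignment. Both alignments pair the same 14 residues, so only the gap term differs. The penalties o = 10 and e = 1 are again illustrative assumptions.

```python
import re


def total_gap_score(aligned_row, opening=10, extension=1):
    """Sum gamma(n) = -o - (n - 1) * e over every run of '-' in one aligned row."""
    runs = re.findall(r"-+", aligned_row)
    return sum(-opening - (len(run) - 1) * extension for run in runs)


one_long_gap = "ATGTAG-------TACATGCA"
three_short_gaps = "ATGTA--G--TA---CATGCA"
print(total_gap_score(one_long_gap))      # -16
print(total_gap_score(three_short_gaps))  # -34
```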

Page 27: Blast Clustal

The alignment score with BLOSUM62

Consider two alternative alignments of ANRGDFS and ANREFS with a gap opening penalty of 10 (a score of -10):

A N R G D F S
A N R - E F S

score: 4+6+5−10+2+6+4 = 17

A N R G D F S
A N R E - F S

score: 4+6+5−2−10+6+4 = 13

The scoring scheme provides a quantitative measure of how good an alignment is relative to alternative alignments.

However, the scoring scheme does not tell us how to find the best alignment.
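Both totals can be checked with a short script. Only the BLOSUM62 entries needed for this example are included, and the helper names are illustrative.

```python
# Only the BLOSUM62 entries needed for this example.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("N", "N"): 6, ("R", "R"): 5, ("F", "F"): 6,
    ("S", "S"): 4, ("D", "E"): 2, ("G", "E"): -2,
}


def lookup(a, b):
    """Symmetric lookup; raises KeyError for pairs outside the subset."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def alignment_score(row_a, row_b, gap_penalty=10):
    """Sum of substitution scores minus a fixed penalty per gap column."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total -= gap_penalty
        else:
            total += lookup(a, b)
    return total


print(alignment_score("ANRGDFS", "ANR-EFS"))  # 17
print(alignment_score("ANRGDFS", "ANRE-FS"))  # 13
```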

Page 28: Blast Clustal

How do we find the best alignment?

Brute-force approach:

- Generate the list of all possible alignments between the two sequences and score them.
- Select the alignment with the best score.

The number of possible global alignments between two sequences of length N is approximately

$$\frac{2^{2N}}{\sqrt{\pi N}}$$

For two sequences of 250 residues this is approximately $10^{149}$; for comparison, the number of atoms in the universe is about $10^{78}$.
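The size of this number is easiest to check in logarithms, since the value itself is far too large to print usefully. A minimal sketch, assuming the estimate 2^(2N) / sqrt(pi N) quoted on this slide:

```python
import math


def log10_alignment_count(n):
    """log10 of 2**(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


print(round(log10_alignment_count(250), 1))  # 149.1
```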

Page 29: Blast Clustal


The Needleman-Wunsch algorithm

A smart way to reduce the massive number of possibilities that need to be considered while still guaranteeing that the best solution will be found (Saul Needleman and Christian Wunsch, 1970).

The basic idea is to build up the best alignment by using optimal alignments of smaller subsequences.

The Needleman-Wunsch algorithm is an example of dynamic programming, a discipline invented by Richard Bellman (an American mathematician) in 1953!

Page 30: Blast Clustal

How does dynamic programming work?

A divide-and-conquer strategy:

- Break the problem into smaller subproblems.
- Solve the smaller problems optimally.
- Use the sub-problem solutions to construct an optimal solution for the original problem.

Dynamic programming can be applied only to problems exhibiting the properties of optimal substructure and overlapping subproblems (e.g. finding the best chess move).

Page 31: Blast Clustal

The mathematics

A matrix D(i,j) indexed by residues of each sequence is built recursively, such that

$$D(i,j) = \max \begin{cases} D(i-1,j-1) + s(x_i, y_j) \\ D(i-1,j) + g \\ D(i,j-1) + g \end{cases}$$

subject to boundary conditions. Here $s(x_i, y_j)$ is the substitution score for residues $x_i$ and $y_j$, and g is the gap penalty.
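The recurrence can be transcribed almost literally as a memoized recursive function. The sketch below returns only the optimal score, not the alignment. It assumes the boundary conditions D(i,0) = i*g and D(0,j) = j*g with g = -10, and uses the handful of BLOSUM62 entries needed for the pair SEND and AND.

```python
from functools import lru_cache

# BLOSUM62 entries for SEND vs AND, keyed by the two residues in sorted order.
PAIRS = {"AS": 1, "AE": -1, "AN": -2, "AD": -2, "NS": 1, "EN": 0,
         "NN": 6, "DN": 1, "DS": 0, "DE": 2, "DD": 6}


def s(a, b):
    """Symmetric substitution score lookup."""
    return PAIRS["".join(sorted((a, b)))]


def best_score(x, y, g=-10):
    """Optimal global alignment score via the recurrence for D(i, j)."""
    @lru_cache(maxsize=None)
    def D(i, j):
        if i == 0:
            return j * g  # boundary: leading gaps only
        if j == 0:
            return i * g
        return max(D(i - 1, j - 1) + s(x[i - 1], y[j - 1]),
                   D(i - 1, j) + g,
                   D(i, j - 1) + g)
    return D(len(x), len(y))


print(best_score("SEND", "AND"))  # 3
```

The cache is what turns the exponential recursion into dynamic programming: each D(i,j) is computed once and reused.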

Page 32: Blast Clustal

A walk-through: an overview

We consider all possible pairs of residues from the two sequences (this gives rise to a 2D matrix representation).

We will have two matrices: the score matrix and the traceback matrix.

The Needleman-Wunsch algorithm consists of three steps:

1. Initialization of the score matrix
2. Calculation of scores and filling of the traceback matrix
3. Deducing the alignment from the traceback matrix
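The three steps can be sketched in Python as follows. This is an illustrative implementation rather than the course's own code: it uses a single linear gap penalty g = -10, resolves ties in the order diag, up, left (an arbitrary choice), and carries only the BLOSUM62 entries needed for SEND and AND.

```python
# BLOSUM62 entries for SEND vs AND, keyed by the two residues in sorted order.
PAIRS = {"AS": 1, "AE": -1, "AN": -2, "AD": -2, "NS": 1, "EN": 0,
         "NN": 6, "DN": 1, "DS": 0, "DE": 2, "DD": 6}


def score(a, b):
    return PAIRS["".join(sorted((a, b)))]


def needleman_wunsch(top, left, gap=-10):
    rows, cols = len(left) + 1, len(top) + 1
    # Step 1: initialization of the first row and column.
    C = [[0] * cols for _ in range(rows)]
    T = [[None] * cols for _ in range(rows)]
    T[0][0] = "done"
    for j in range(1, cols):
        C[0][j], T[0][j] = j * gap, "left"
    for i in range(1, rows):
        C[i][0], T[i][0] = i * gap, "up"
    # Step 2: fill the score and traceback matrices.
    for i in range(1, rows):
        for j in range(1, cols):
            options = [
                (C[i - 1][j - 1] + score(left[i - 1], top[j - 1]), "diag"),
                (C[i - 1][j] + gap, "up"),
                (C[i][j - 1] + gap, "left"),
            ]
            # max() keeps the first best option, so ties go to diag, then up.
            C[i][j], T[i][j] = max(options, key=lambda option: option[0])
    # Step 3: traceback from the bottom-right cell to the "done" cell.
    i, j = rows - 1, cols - 1
    top_row, left_row = [], []
    while T[i][j] != "done":
        if T[i][j] == "diag":
            top_row.append(top[j - 1])
            left_row.append(left[i - 1])
            i, j = i - 1, j - 1
        elif T[i][j] == "left":
            top_row.append(top[j - 1])
            left_row.append("-")  # gap in the left sequence
            j -= 1
        else:
            top_row.append("-")   # gap in the top sequence
            left_row.append(left[i - 1])
            i -= 1
    return C[-1][-1], "".join(reversed(top_row)), "".join(reversed(left_row))


print(needleman_wunsch("SEND", "AND"))  # (3, 'SEND', 'A-ND')
```

The two nested loops visit each cell once, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.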

Page 33: Blast Clustal

Consider the simple example

The alignment of two sequences SEND and AND with the BLOSUM62 substitution matrix and a gap opening penalty of 10 (no gap extension):

S E N D
- A N D    score: +1
A - N D    score: +3  ← the best score
A N - D    score: -3
A N D -    score: -8
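For such a small case the four candidate alignments can simply be enumerated: AND is one residue shorter than SEND, so each candidate places a single gap at one of four positions. A sketch, using only the BLOSUM62 entries needed here (helper names are illustrative):

```python
# BLOSUM62 entries for SEND vs AND, keyed by the two residues in sorted order.
PAIRS = {"AS": 1, "AE": -1, "AN": -2, "AD": -2, "NS": 1, "EN": 0,
         "NN": 6, "DN": 1, "DS": 0, "DE": 2, "DD": 6}


def alignment_score(row_a, row_b, gap=-10):
    """Sum substitution scores, charging the gap penalty for each gap column."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap
        else:
            total += PAIRS["".join(sorted((a, b)))]
    return total


top, short = "SEND", "AND"
for position in range(len(short) + 1):
    padded = short[:position] + "-" + short[position:]
    print(padded, alignment_score(top, padded))
```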

Page 34: Blast Clustal

The score and traceback matrices

The cells of the score matrix are labelled C(i,j), where i = 1,2,...,N and j = 1,2,...,M:

           S       E       N       D
   C(1,1)  C(1,2)  C(1,3)  C(1,4)  C(1,5)
A  C(2,1)  C(2,2)  C(2,3)  C(2,4)  C(2,5)
N  C(3,1)  C(3,2)  C(3,3)  C(3,4)  C(3,5)
D  C(4,1)  C(4,2)  C(4,3)  C(4,4)  C(4,5)

The traceback matrix has the same layout, with one cell for each cell of the score matrix.

Page 35: Blast Clustal

Initialization

The first row and column of the two matrices are filled in the initialization step.

Score matrix:

                S       E       N       D
        0     -10     -20     -30     -40
A     -10  C(2,2)  C(2,3)  C(2,4)  C(2,5)
N     -20  C(3,2)  C(3,3)  C(3,4)  C(3,5)
D     -30  C(4,2)  C(4,3)  C(4,4)  C(4,5)

Traceback matrix:

         S     E     N     D
   done  left  left  left  left
A  up
N  up
D  up

diag: the letters from the two sequences are aligned
left: a gap is introduced in the left sequence
up: a gap is introduced in the top sequence

For the first-row cell C(1,2), the cells C(0,1) and C(0,2) needed for qdiag and qup do not exist, so only qleft applies:

qleft = C(1,1) + g = 0 - 10 = -10

Page 36: Blast Clustal

Initialization (continued)

For the first-row cell C(1,3), again only qleft applies:

qleft = C(1,2) + g = -10 - 10 = -20

Page 37: Blast Clustal

Scoring

The cells are filled starting from cell C(2,2). The score is the maximum value of:

qdiag = C(i-1,j-1) + S(i,j)
qup = C(i-1,j) + g
qleft = C(i,j-1) + g

where S(i,j) is the substitution score for letters i and j, and g is the gap penalty.

Page 38: Blast Clustal

Scoring: a pictorial representation

The value of the cell C(i,j) depends only on the values of the immediately adjacent northwest diagonal, up, and left cells:

C(i-1,j-1)   C(i-1,j)
C(i,j-1)     C(i,j)

Page 39: Blast Clustal

The Needleman-Wunsch progression

The first step is to calculate the value of C(2,2), marked ? below.

Score matrix:

         S     E     N     D
     0   -10   -20   -30   -40
A  -10     ?
N  -20
D  -30

Traceback matrix:

         S     E     N     D
   done  left  left  left  left
A  up    ?
N  up
D  up

Page 40: Blast Clustal

For cell C(2,2):

qdiag = C(1,1) + S(S,A) = 0 + 1 = 1 (S(S,A) is taken from the BLOSUM62 matrix)
qup = C(1,2) + g = -10 + (-10) = -20
qleft = C(2,1) + g = -10 + (-10) = -20

Page 41: Blast Clustal

The highest score is 1, from the diag cell, so C(2,2) = 1 and the corresponding cell of the traceback matrix is set to diag.

Page 42: Blast Clustal

Go on to cell C(2,3):

qdiag = C(1,2) + S(E,A) = -10 + (-1) = -11 (S(E,A) is taken from the BLOSUM62 matrix)
qup = C(1,3) + g = -20 + (-10) = -30
qleft = C(2,2) + g = 1 + (-10) = -9

The highest score is -9, from the left cell.

Page 43: Blast Clustal

Finally, when all cells in the score and traceback matrices are filled in:

Score matrix:

         S     E     N     D
     0   -10   -20   -30   -40
A  -10     1    -9   -19   -29
N  -20    -9     1    -3   -13
D  -30   -19    -7     2     3

Traceback matrix:

         S     E     N     D
   done  left  left  left  left
A  up    diag  left  left  left
N  up    diag  diag  diag  left
D  up    up    diag  diag  diag

Page 44: Blast Clustal


The traceback

Traceback = the process of deduction of the best alignment from the traceback matrix.

The traceback always begins with the last cell to be filled with the score, i.e. the bottom right cell.

One moves according to the traceback value written in the cell.

There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left.

The traceback is completed when the first, top-left cell of the matrix is reached (the "done" cell).
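These rules translate directly into a loop over the traceback matrix. The sketch below takes the completed traceback matrix for SEND (top) and AND (left) as a list of rows; the function name `follow_traceback` is made up for this example.

```python
def follow_traceback(traceback, top, left):
    """Walk from the bottom-right cell to the 'done' cell and build the alignment."""
    i, j = len(left), len(top)
    top_row, left_row = [], []
    while traceback[i][j] != "done":
        move = traceback[i][j]
        if move == "diag":    # the two letters are aligned
            top_row.append(top[j - 1])
            left_row.append(left[i - 1])
            i, j = i - 1, j - 1
        elif move == "left":  # gap in the left sequence
            top_row.append(top[j - 1])
            left_row.append("-")
            j -= 1
        else:                 # "up": gap in the top sequence
            top_row.append("-")
            left_row.append(left[i - 1])
            i -= 1
    # The rows were collected backwards, so reverse them.
    return "".join(reversed(top_row)), "".join(reversed(left_row))


TRACEBACK = [
    ["done", "left", "left", "left", "left"],
    ["up", "diag", "left", "left", "left"],
    ["up", "diag", "diag", "diag", "left"],
    ["up", "up", "diag", "diag", "diag"],
]
print(follow_traceback(TRACEBACK, "SEND", "AND"))  # ('SEND', 'A-ND')
```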

Page 45: Blast Clustal

The traceback path

The traceback performed on the completed traceback matrix starts at the bottom-right cell:

         S     E     N     D
   done  left  left  left  left
A  up    diag  left  left  left
N  up    diag  diag  diag  left
D  up    up    diag  diag  diag

Page 46: Blast Clustal

The best alignment

The alignment is deduced from the cells along the traceback path, by taking into account the value of each cell in the traceback matrix:

diag: the letters from the two sequences are aligned
left: a gap is introduced in the left sequence
up: a gap is introduced in the top sequence

Sequences are aligned backwards.

Page 47: Blast Clustal

The traceback step by step

First is diag (aligned):

D
D

Page 48: Blast Clustal

Second is also diag (aligned):

N D
N D

Page 49: Blast Clustal

The traceback step by step

First is diag (aligned)
Second is also diag (aligned)
Third is left (gap in left sequence)

E N D
- N D

Page 50: Blast Clustal

The traceback step by step

First is diag (aligned)
Second is also diag (aligned)
Third is left (gap in left sequence)
Fourth is diag (aligned)

S E N D
A - N D

Page 51: Blast Clustal

The traceback step by step

First is diag (aligned)
Second is also diag (aligned)
Third is left (gap in left sequence)
Fourth is diag (aligned)
Fifth is done

S E N D
A - N D

Page 52: Blast Clustal

Comparison

So our exhaustive search gave:

S E N D
- A N D    score: +1
A - N D    score: +3 ← the best score
A N - D    score: -3
A N D -    score: -8

And our alignment using the NW algorithm:

           S      E      N      D
    done   left   left   left   left
A   up     diag   left   left   left
N   up     diag   diag   diag   left
D   up     up     diag   diag   diag

S E N D
A - N D
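The four scores can be checked by hand or with a few lines of code. The scoring scheme is not restated on this slide, so the sketch below assumes BLOSUM62 substitution scores and a linear gap penalty of -10; these assumptions reproduce the four values shown. The helper name `alignment_score` is illustrative.

```python
# BLOSUM62 entries needed for these four alignments (the matrix is symmetric).
BLOSUM62 = {
    ("S", "A"): 1, ("E", "A"): -1, ("E", "N"): 0,
    ("N", "N"): 6, ("N", "D"): 1, ("D", "D"): 6,
}
GAP = -10  # assumed linear gap penalty


def alignment_score(top, bottom):
    """Sum substitution scores column by column; a gap in either row costs GAP."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += GAP
        elif (a, b) in BLOSUM62:
            total += BLOSUM62[(a, b)]
        else:
            total += BLOSUM62[(b, a)]
    return total


candidates = ["-AND", "A-ND", "AN-D", "AND-"]
scores = {c: alignment_score("SEND", c) for c in candidates}
best = max(scores, key=scores.get)
print(scores)
print(best)
```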

Page 53: Blast Clustal

A few observations

It was much easier to align SEND and AND by the exhaustive search!

As we consider longer sequences the situation quickly turns against the exhaustive search:

Two 12 residue sequences would require considering ∼ 1 million alignments.

Two 150 residue sequences would require considering ∼ 10^88 alignments (∼ 10^78 is the estimated number of atoms in the Universe).

For two 150 residue sequences the Needleman-Wunsch algorithm requires filling a 150 × 150 matrix.

Assuming it takes 0.01 s to calculate the score and traceback for each possible alignment, the exhaustive search would still take ca. 3 × 10^79 years to complete.
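These orders of magnitude can be reproduced with a short calculation. The slide does not say how the alignments are counted; the sketch below assumes the central binomial coefficient C(2n, n), a common estimate for two sequences of length n, which is consistent with the figures quoted. The function name `n_alignments` is illustrative.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def n_alignments(n):
    """Assumed count of alignments of two length-n sequences: C(2n, n)."""
    return math.comb(2 * n, n)


count_12 = n_alignments(12)
count_150 = n_alignments(150)
years = count_150 * 0.01 / SECONDS_PER_YEAR   # 0.01 s per alignment, as on the slide
matrix_cells = 150 * 150                      # work for Needleman-Wunsch instead

print(count_12)                # 2704156, i.e. of the order of a million
print(f"{count_150:.1e}")
print(f"{years:.1e}")
print(matrix_cells)            # 22500
```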

Page 54: Blast Clustal

The summary

The alignment is over the entire length of two sequences: the traceback starts from the lower right corner of the traceback matrix, and completes in the upper left cell of the matrix.

The Needleman-Wunsch algorithm works in the same way regardless of the length or complexity of the sequences, and is guaranteed to find the best alignment under the chosen scoring scheme.

The Needleman-Wunsch algorithm is appropriate for finding the best alignment of two sequences which are (i) of similar length; (ii) similar across their entire lengths.
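As a recap, the matrix-filling step can be sketched in Python for the SEND and AND example. This is a minimal illustration under stated assumptions, not the course's own code: it assumes BLOSUM62 substitution scores, a linear gap penalty of -10, and ties resolved in the order diag, up, left. With these assumptions it reproduces the traceback matrix shown on the earlier slides.

```python
# BLOSUM62 entries for every (left residue, top residue) pair of AND and SEND.
SCORES = {
    ("A", "S"): 1, ("A", "E"): -1, ("A", "N"): -2, ("A", "D"): -2,
    ("N", "S"): 1, ("N", "E"): 0,  ("N", "N"): 6,  ("N", "D"): 1,
    ("D", "S"): 0, ("D", "E"): 2,  ("D", "N"): 1,  ("D", "D"): 6,
}
GAP = -10  # assumed linear gap penalty


def fill(top, left):
    """Fill the score matrix and record which neighbour gave each best score."""
    rows, cols = len(left) + 1, len(top) + 1
    score = [[0] * cols for _ in range(rows)]
    move = [["done"] * cols for _ in range(rows)]
    for j in range(1, cols):                 # first row: gaps only
        score[0][j], move[0][j] = j * GAP, "left"
    for i in range(1, rows):                 # first column: gaps only
        score[i][0], move[i][0] = i * GAP, "up"
    for i in range(1, rows):
        for j in range(1, cols):
            options = [
                (score[i - 1][j - 1] + SCORES[(left[i - 1], top[j - 1])], "diag"),
                (score[i - 1][j] + GAP, "up"),
                (score[i][j - 1] + GAP, "left"),
            ]
            # max() keeps the first of equal candidates, so ties go to diag.
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])
    return score, move


score, move = fill("SEND", "AND")
for row in move:
    print(" ".join(f"{m:5}" for m in row))
print("best score:", score[-1][-1])
```

The work grows with the product of the two sequence lengths, because each cell is computed once from its three neighbours.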

Page 55: Blast Clustal

Basic Local Alignment Search Tool

http://www.nature.com/scitable/topicpage/Basic-Local-Alignment-Search-Tool-BLAST-29096

Page 56: Blast Clustal

The BLAST Heuristic

BLAST increases the speed of alignment by decreasing the search space or number of comparisons it makes.

Instead of comparing every residue against every other, BLAST uses short "word" (w) segments to create alignment "seeds."

Requiring three residues to match in order to seed an alignment means that fewer sequence regions need to be compared. Larger word sizes usually mean that there are even fewer regions to evaluate.

BLAST is designed to create a word list from the query sequence with words of a specific length, as defined by the user.

Query: S E N D F I R S T T I M E

Example words with w = 4: DFIR, FIRS, IRST

Example words with w = 3: DFI, FIR, IRS
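The word list is every overlapping window of length w in the query. The following is a minimal Python sketch; the function name `word_list` is illustrative, and it lists only the exact words, whereas protein BLAST also considers similar "neighbourhood" words that score above the threshold T.

```python
def word_list(query, w):
    """Return every overlapping word of length w found in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


query = "SENDFIRSTTIME"
words3 = word_list(query, 3)
words4 = word_list(query, 4)
print(words3)
print(words4)
```

A query of length L yields L - w + 1 words, so the 13-residue example gives 11 words for w = 3 and 10 words for w = 4.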

Page 57: Blast Clustal

The BLAST Heuristic*

Once an alignment is seeded, BLAST extends the alignment according to a threshold (T) that is set by the user.

When performing a BLAST query, the computer extends words with a neighborhood score greater than T.

A cutoff score (S) is used to select alignments over the cutoff, which means the sequences share significant homologies.

If a hit is detected, then the algorithm checks whether w is contained within a longer aligned segment pair that has a cutoff score greater than or equal to S (Altschul et al., 1990).

When an alignment score starts to decrease past a lower threshold score (X), the alignment is terminated.
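The drop-off rule can be illustrated with a simplified, ungapped extension to the right of a seed. This is a sketch under loud assumptions: it scores +1 for a match and -1 for a mismatch instead of using a substitution matrix, it extends in one direction only, and the function name `extend_right` and the subject string are hypothetical. It is not BLAST's actual implementation.

```python
def extend_right(query, subject, q_start, s_start, w, x_drop, match=1, mismatch=-1):
    """Extend an exact seed of length w to the right without gaps.

    Stops when the running score falls more than x_drop below the best
    score seen so far, and returns that best score with the matching
    stretch of the query.
    """
    score = best = w * match          # the seed itself is an exact match
    best_len = w
    i = w
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        if score > best:
            best, best_len = score, i + 1
        elif best - score > x_drop:   # dropped too far below the best: terminate
            break
        i += 1
    return best, query[q_start:q_start + best_len]


# Seed "FIR" found at position 4 of the query and position 2 of the subject.
result = extend_right("SENDFIRSTTIME", "AAFIRSTAAAAA", 4, 2, w=3, x_drop=2)
print(result)
```

A larger `x_drop` lets the extension survive longer runs of mismatches, which raises sensitivity at the cost of speed.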

These and many other variables can be adjusted to either increase the speed of the algorithm or emphasize its sensitivity.

*A heuristic method is used to rapidly come to a solution that is hoped to be close to the best possible answer, or 'optimal solution'.

Page 58: Blast Clustal

Basic Local Alignment Search Tool

Number of identical residues in this alignment (Identities), the number of conservative substitutions (Positives), and, if applicable, the number of gaps in the alignment.

The line between the two sequences indicates the similarities between the sequences. If the query and the subject have the same amino acid at a given location, the residue itself is shown. Conservative substitutions, as judged by the substitution matrix, are indicated with +.

One or more dashes (–) within a sequence indicate insertions or deletions.

Page 59: Blast Clustal

Bioinformatics - Clustal alignment

Page 60: Blast Clustal

Page 61: Blast Clustal

>sp|P61626|LYSC_HUMAN Lysozyme C OS=Homo sapiens GN=LYZ PE=1 SV=1
MKALIVLGLVLLSVTVQGKVFERCELARTLKRLGMDGYRGISLANWMCLAKWESGYNTRATNYNAGDRSTDYGIFQINSRYWCNDGKTPGAVNACHLSCSALLQDNIADAVACAKRVVRDPQGIRAWVAWRNRCQNRDVRQYVQGCGV

>sp|P00698|LYSC_CHICK Lysozyme C OS=Gallus gallus GN=LYZ PE=1 SV=1
MRSLLILVLCFLPLAALGKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL

>sp|P05105|LYS_HYACE Lysozyme OS=Hyalophora cecropia PE=2 SV=2
MTKYVILLAVLAFALHCDAKRFTRCGLVQELRRLGFDETLMSNWVCLVENESGRFTDKIGKVNKNGSRDYGLFQINDKYWCSKGTTPGKDCNVTCNQLLTDDISVAATCAKKIYKRHKFDAWYGWKNHCQHGLPDISDC

>sp|Q7LZQ1|LYSC_TRISI Lysozyme C OS=Trionyx sinensis GN=LYZ PE=1 SV=3
GKIYEQCELAREFKRHGMDGYHGYSLGDWVCTAKHESNFNTAATNYNRGDQSTDYGILQINSRWWCNDGKTPKAKNACGIECSELLKADITAAVNCAKRIVRDPNGMGAWVAWTKYCKGKDVSQWIKGCKL

>sp|P09963|LYS_BPP22 Lysozyme OS=Enterobacteria phage P22 GN=19 PE=1 SV=1
MMQISSNGITRLKREEGERLKAYSDSRGIPTIGVGHTGKVDGNSVASGMTITAEKSSELLKEDLQWVEDAISSLVRVPLNQNQYDALCSLIFNIGKSAFAGSTVLRQLNLKNYQAAADAFLLWKKAGKDPDILLPRRRRERALFLS

>sp|P30201|LYSC_MACMU Lysozyme C OS=Macaca mulatta GN=LYZ PE=2 SV=1
MKAVIILGLVLLSVTVQGKIFERCELARTLKRLGLDGYRGISLANWVCLAKWESNYNTQATNYNPGDQSTDYGIFQINSHYWCNNGKTPGAVNACHISCNALLQDNIADAVTCAKRVVSDPQGIRAWVAWRNHCQNRDVSQYVQGCGV

Page 62: Blast Clustal

Page 63: Blast Clustal

Page 64: Blast Clustal

Color Residue

ORANGE GPST

BLUE FWY

RED HKR

GREEN ILMV

Page 65: Blast Clustal

Clustal X Default Colouring

Residue at position   Applied Colour   { Threshold, Residue group }
A,I,L,M,F,W,V         BLUE             {+60%, WLVIMAFCHP}
R,K                   RED              {+60%, KR}, {+80%, K,R,Q}
N                     GREEN            {+50%, N}, {+85%, N,Y}
C                     BLUE             {+60%, WLVIMAFCHP}
C                     PINK             {100%, C}
Q                     GREEN            {+60%, KR}, {+50%, QE}, {+85%, Q,E,K,R}
E                     MAGENTA          {+60%, KR}, {+50%, QE}, {+85%, E,Q,D}
D                     MAGENTA          {+60%, KR}, {+85%, K,R,Q}, {+50%, ED}
G                     ORANGE           {+0%, G}
H,Y                   CYAN             {+60%, WLVIMAFCHP}, {+85%, W,Y,A,C,P,Q,F,H,I,L,M,V}
P                     YELLOW           {+0%, P}
S,T                   GREEN            {+60%, WLVIMAFCHP}, {+50%, TS}, {+85%, S,T}

Page 66: Blast Clustal

Page 67: Blast Clustal

Page 68: Blast Clustal

Can also run using ignore gaps - this compares like with like

Page 69: Blast Clustal

BLAST v Clustal

ClustalW and BLAST are both designed to find and identify sequence similarities.

ClustalW measures the similarity between several input sequences. These similarities are used to construct a phylogenetic tree that depicts evolutionary relationships.

BLAST, on the other hand, identifies similarities between an input sequence and all sequences in a database. BLAST returns all matching sequences, regardless of their origin.

Page 70: Blast Clustal

The ClustalW algorithm

ClustalW is a sequence alignment method that builds a phylogenetic tree from pairwise sequence alignments, which are then used to construct a multiple sequence alignment (MSA).

The algorithm functions by:

➡ aligning all sequences in pairs

➡ placing the most similar sequence pairs (highest % sequence identity) close together in a phylogenetic tree and the most dissimilar farther apart

➡ building an MSA by first aligning the most similar sequence pair

➡ extending the MSA by aligning pairs in decreasing order of similarity

A phylogenetic tree or evolutionary tree is a tree showing the evolutionary relationships among various biological species or other entities that are known to have a common ancestor.

Page 71: Blast Clustal

Clustering analysis - the problem

Let us consider 4 sequences - they are homologous and come from 4 different species

Species A  ATSS
Species B  ATGS
Species C  TTSG
Species D  TSGG

By measuring the number of differences between each pair of species (Hamming distances) we can develop a clustering procedure to build an unrooted phylogenetic tree

Page 72: Blast Clustal

Determining pairwise Hamming distances

We need to first determine the number of differences between the sequences

                 ATSS  ATGS  TTSG  TSGG
Species A  ATSS
Species B  ATGS
Species C  TTSG
Species D  TSGG

Page 73: Blast Clustal

Determining pairwise Hamming distances

We need to first determine the number of differences between the sequences

                 ATSS  ATGS  TTSG  TSGG
Species A  ATSS  0
Species B  ATGS        0
Species C  TTSG              0
Species D  TSGG                    0

Page 74: Blast Clustal

Determining pairwise Hamming distances

We need to first determine the number of differences between the sequences

                 ATSS  ATGS  TTSG  TSGG
Species A  ATSS  0     1     2     4
Species B  ATGS        0
Species C  TTSG              0
Species D  TSGG                    0

Page 75: Blast Clustal

Determining pairwise Hamming distances

We need to first determine the number of differences between the sequences

                 ATSS  ATGS  TTSG  TSGG
Species A  ATSS  0     1     2     4
Species B  ATGS        0     3     3
Species C  TTSG              0
Species D  TSGG                    0

Page 76: Blast Clustal

Determining pairwise Hamming distances

We need to first determine the number of differences between the sequences

                 ATSS  ATGS  TTSG  TSGG
Species A  ATSS  0     1     2     4
Species B  ATGS        0     3     3
Species C  TTSG              0     2
Species D  TSGG                    0

Page 77: Blast Clustal

Determining pairwise Hamming distances

The smallest distance is 1, between ATSS and ATGS

                 ATSS  ATGS  TTSG  TSGG
Species A  ATSS  0     1     2     4
Species B  ATGS        0     3     3
Species C  TTSG              0     2
Species D  TSGG                    0
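The distance matrix can be reproduced with a few lines of Python. This is a minimal sketch; the function name `hamming` and the dictionary layout are illustrative choices.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


sequences = ["ATSS", "ATGS", "TTSG", "TSGG"]
distances = {(a, b): hamming(a, b) for a, b in combinations(sequences, 2)}
closest = min(distances, key=distances.get)

for pair, d in distances.items():
    print(pair, d)
print("closest pair:", closest)
```

Only the upper triangle is computed because the distance is symmetric and every sequence is at distance 0 from itself.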

Page 78: Blast Clustal

Forming a sequence cluster

Smallest distance is 1 between ATSS and ATGS

So our first cluster will be {ATSS, ATGS}

Our tree will contain:

ATSS ATGS

Page 79: Blast Clustal

Determining pairwise Hamming distances

Using the cluster {ATSS, ATGS} we now calculate a second distance matrix

              {ATSS, ATGS}  TTSG          TSGG
{ATSS, ATGS}
TTSG
TSGG

Page 80: Blast Clustal

Determining pairwise Hamming distances

Using the cluster {ATSS, ATGS} we now calculate a second distance matrix

              {ATSS, ATGS}  TTSG          TSGG
{ATSS, ATGS}  0
TTSG                        0
TSGG                                      0

Page 81: Blast Clustal

Determining pairwise Hamming distances

Using the cluster {ATSS, ATGS} we now calculate a second distance matrix

              {ATSS, ATGS}  TTSG          TSGG
{ATSS, ATGS}  0             0.5(2+3)=2.5  0.5(4+3)=3.5
TTSG                        0             2
TSGG                                      0

Page 82: Blast Clustal

Determining pairwise Hamming distances

The smallest distance is now 2 between TTSG and TSGG

              {ATSS, ATGS}  TTSG          TSGG
{ATSS, ATGS}  0             2.5           3.5
TTSG                        0             2
TSGG                                      0

Page 83: Blast Clustal

Forming another sequence cluster

Smallest distance is 2 between TTSG and TSGG

So our second cluster will be {TTSG, TSGG}

Our tree will contain the fragment:

TTSG TSGG

Page 84: Blast Clustal

Making the tree

Now we have 2 clusters and we can link them to produce a tree

[Tree figure: cluster {TTSG, TSGG} linked to cluster {ATSS, ATGS}]

Page 85: Blast Clustal

Branch lengths

Now we have to calculate the branch length:

The length between 2 sequences is half the Hamming distance between them

[Tree figure: cluster {TTSG, TSGG} linked to cluster {ATSS, ATGS}]

Page 86: Blast Clustal

Branch lengths

Now we have to calculate the branch length:

The length between 2 sequences is half the Hamming distance between them

                 ATSS  ATGS  TTSG  TSGG
Species A  ATSS  0     1     2     4
Species B  ATGS        0     3     3
Species C  TTSG              0     2
Species D  TSGG                    0

              {ATSS, ATGS}  TTSG          TSGG
{ATSS, ATGS}  0             2.5           3.5
TTSG                        0             2
TSGG                                      0

Branch lengths on the tree: 0.5 and 0.5 for ATSS and ATGS, 1 and 1 for TTSG and TSGG, and 1.5 and 1.5 for the link between the two clusters.
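The whole clustering procedure can be sketched in Python. This is an illustrative average-linkage implementation, not ClustalW's own code: the function names are hypothetical, and the distance between two clusters is taken as the mean Hamming distance over all pairs of members, which matches the averages on the slides because the clusters involved here are of equal size.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def average_distance(c1, c2):
    """Mean Hamming distance over all pairs drawn from the two clusters."""
    return sum(hamming(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def cluster(sequences):
    """Repeatedly merge the two closest clusters, recording each merge."""
    clusters = [(s,) for s in sequences]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        d = average_distance(c1, c2)
        merges.append((c1, c2, d, d / 2))   # d / 2 is the branch length
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


merges = cluster(["ATSS", "ATGS", "TTSG", "TSGG"])
for c1, c2, d, half in merges:
    print(c1, c2, "distance", d, "branch length", half)
```

The three merges reproduce the distances 1, 2 and 3 and the branch lengths 0.5, 1 and 1.5 derived on the slides.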

Page 87: Blast Clustal

Branch lengths

We have created an unrooted phylogenetic tree using a clustering method

Why is this useful?

Can predict evolutionary relationships

The procedure is similar to the start of a ClustalW search