Refresher: BLOSUM – Brief Overview

50

Transcript of Refresher: BLOSUM – Brief Overview

Page 1: Refresher: BLOSUM – Brief Overview
Page 2: Refresher: BLOSUM – Brief Overview

Refresher: BLOSUM – Brief OverviewBased purely on counts and alignmentBLOSUM65 < BLOSUM45 in evolutionary distance

1 k i i1 in

ij i

20

j i j i1 1 1

a set of sequences S = S ...S , where S = s ...s , s is the j-th amino acid in sequence S .

The probability of observing each amino acid X =p(X)

p(X) ( , S ) / ( , S )

where ( ,

k k

i j i

j

c X c X

c X= = =

= ∑ ∑∑

i

2

) is the count of amino acid in sequence S

( , | ) 2 ( ). ( ) ( ) , i j

l m l m l

S X

P X X random p X p X or p X if l m= =

Page 3: Refresher: BLOSUM – Brief Overview

Refresher: In general …log-odds for alignment

( , | ) # of , combinations/all possible pairwise combinations finally take the log2-odds likelihood ratio and multiply by 2 for easy representationNote: all possible pairwise combinations=k

l m l mP X X data X X=

{ }

{ }l m

l m

. ( 1) / 2 does use pseudo count to accomodate for combinations not observed:

add 1 in numerator, add 210 (20*19+20) in denominator( , )ie (X , X | )

. ( 1) / 2 210(X ,X | ).log

(X

l m

n nBlosum

count X XP datak n n

P dataP

λ

=− +

l m

. , X | )

being a scaling factor for representation

subs matrixrandom

λ

=

Page 4: Refresher: BLOSUM – Brief Overview

Example:

Orc: ACAGTGACGCCCCAAACGTElf: ACAGTGACGCTACAAACGTDwarf: CCTGTGACGTAACAAACGAHobbit: CCTGTGACGTAGCAAACGAHuman: CCTGTGACGTAGCAAACGA

The inference of evolutionary relationshipsThe inference of putative common ancestorsTrees (with branches and leaves!)

What is phylogeny?

Page 5: Refresher: BLOSUM – Brief Overview

Rooted vs Unrooted tree

Rooted tree Unrooted tree

C D B AB

A

D

C

OTUs – Operational taxonomic units (leaves)HTUs – Hypothetical taxonomic units (internal nodes)

Page 6: Refresher: BLOSUM – Brief Overview

Types of dendograms – diagramaticrepresentation of phylogenetic trees

A: phenogram (Overall similarity based)

Important: grouping of OTUs Meaningless: Vertical separation and branch lengths

B: a phylogram (similarity based)

important : Branch length and grouping

C: a cladogram (rooted)

Important: groupingMeaningless: Angles, lengths

D: a radial phylogram

Same as B

A B

C D

Page 7: Refresher: BLOSUM – Brief Overview

Calculating the number of unrooted trees

3

1 2

3 4

1 2

3

How many possibilities are there for adding leaf 3? 1

How many possibilities are there for adding leaf 4?

Page 8: Refresher: BLOSUM – Brief Overview

Calculating the number of unrooted trees

How many possibilities are there for leaf 5? For the 5th

leaf, there are 5 possibilitiesie to add the nth leaf, we can add at the immediate leaf branches+ the number of internal branches

ie (n-1) “terminal-branches” + (n-4) “internal branches”=2n-5

1 2

3

4

5

Page 9: Refresher: BLOSUM – Brief Overview

Total number of trees?

#unrooted trees for n taxa: (2n-5)*(2n-7)*...*3*1 = (2n-5)! / [2n-3*(n-3)!]

#rooted trees for n taxa: (2n-3)*(2n-5)*(2n-7)*...*3 = (2n-3)! / [2n-2*(n-2)!]

1 2

3

4

5

N = 10

#unrooted: 2,027,025#rooted: 34,459,425

N = 30#unrooted: 8.7 x 1036

#rooted: 4.95 x 1038

Commonly represented as 2n-5!! or 2n-3!!

Page 10: Refresher: BLOSUM – Brief Overview

Parsimony Approach to Evolutionary Tree Reconstruction

Assumes observed character differences resulted from the fewest possible mutations

Seeks the tree that yields lowest possible parsimony score - sum of cost of all mutations found in the tree

Page 11: Refresher: BLOSUM – Brief Overview

Example for calculating parsimony score

AGAGGA

AAAAAG

AAA AGA

AAA

11

1

Total #substitutions = 3

GGAAAA

AGAAAG

AAA AAA

AAA

11 2

Total #substitutions = 4The left tree is preferred over the right tree.

The total number of changes is called the parsimony score.

AGA GGAAAAAAG

Page 12: Refresher: BLOSUM – Brief Overview

1. A procedure to find the minimum number of changes needed to explain the data for a given tree topology, where species are assigned to leaves.

2. A search through the space of trees.3. Efficient algorithms for (1). (2) is hard, we can

use heuristic approaches.

Parsimony Based Tree Reconstruction

Page 13: Refresher: BLOSUM – Brief Overview

Sankoff algorithm to count evolutionary changes in a given tree

Step-1: Define a cost matrix [cij], representing changes from character state i to state j

Step-2: Starting from the leaves, we work down at each node, k to calculate a score Sk(i) for each state, i (i= 1.. 4 for DNA)

Sk is the minimal cost of events for all subtrees above that node, k

A C G TA 0 2.5 1 2.5C 2.5 0 2.5 1G 1 2.5 0 2.5T 2.5 1 2.5 0

Page 14: Refresher: BLOSUM – Brief Overview

Sankoff algo – cont’dRemember for each internal node (“parent”) we have two sub-nodes (“children”) So the scores need to be added up on both branchesFirst, consider say the left branch, with left child node at state i and we want to evaluate the score for parent node at state A.

If S(left child node) is known for each states, the S(parent node@j) will be

LeftChild node(C1) Right

Child node

Parent node

A

A/T/G/C

A C

1

1

1

1

( )( )

( ) | min( )( )

A A C

A T Cp LeftArm

A G C

A C C

c S Ac S T

S Ac S Gc S C

+⎧ ⎫⎪ ⎪+⎪ ⎪= ⎨ ⎬+⎪ ⎪⎪ ⎪+⎩ ⎭

Page 15: Refresher: BLOSUM – Brief Overview

Sankoff algo – cont’d

LeftChild node(C1) Right

Child node(C2)

{ } { }1 2/ / / / / /

( ) '( ) '( )

min ( ) min ( )

p p left p right

i j C i k Cj A T G C k A T G C

LeftBranch RightBranch

S i S i S i

c S j c S k→ →= =

= +

= + + +

Parent node

A

A/T/G/C

{ }

1

1

1

1

1/ / /

( )( )

( ) | min( )( )

min ( )

A A C

A T Cp LeftBranch

A G C

A C C

A j Cj A T G C

LeftBranch

c S Ac S T

S Ac S Gc S C

c S j

→=

+⎧ ⎫⎪ ⎪+⎪ ⎪= ⎨ ⎬+⎪ ⎪⎪ ⎪+⎩ ⎭

= +

Page 16: Refresher: BLOSUM – Brief Overview

Sankoff algorithm – Example{ } { }1 2/ / / / / /

( ) min ( ) min ( )p i j C i k Cj A T G C k A T G C

RightArmLeftArm

S i c S j c S k→ →= == + + +

A C G TA 0 2.5 1 2.5

C 2.5 0 2.5 1

G 1 2.5 0 2.5

T 2.5 1 2.5 0

{C} {A} {C} {A} {G}

A C G T0 ∞ ∞ ∞

A C G T∞ 0 ∞ ∞

A C G T1 5 1 5

A C G T∞ 0 ∞ ∞

A C G T0 ∞ ∞ ∞

A C G T∞ ∞ 0 ∞

A C G T3.5 3.5 3.5 4.5

A C G T6 6 7 8

Cost matrix

Page 17: Refresher: BLOSUM – Brief Overview

Assigning branch lengths using Least squares

2

1 1;

( ) ( )

Method Dependent Weight

n n

ij ij iji j j i

ij

Q T w D d

w= = ≠

= −

=

∑ ∑Observed distances denoted by Dij and the “real branch” lengths to be predicted as dij

We have an observed matrix of Distances between sequences from all possible pair-wise comparisons – Remember D and K in JC model?

C

A B

D

E0.05

0.10

0.070.03

0.05

0.08

0.06

A B C D EA 0 0.23 0.16 0.20 0.17B 0.23 0 0.23 0.17 0.24C 0.16 0.23 0 0.15 0.11D 0.20 0.17 0.15 0 0.21E 0.17 0.24 0.11 0.21 0

Page 18: Refresher: BLOSUM – Brief Overview

Each tree has branch lengths from which “predicted” set of distances can be computed: d(i,j) (small d, denotes the distance of the branches, unlike the observed pairwise distances D).

Human

Chimp

Gorilla0.3 0.41

0.25

d(Human,Chimp) = 0.55

d(Human,Gorilla) = 0.71

d(Chimp, Gorilla) = 0.66

Motivation: From a distance table to a tree

Page 19: Refresher: BLOSUM – Brief Overview

The question is can we find branch lengths, so that the d’s are equal to the D’s?

Human

Chimp

GorillaX Y

Z

D(Human,Chimp) = 0.3 =X + Z

D(Human,Gorilla) = 0.4 = X + Y

D(Chimp, Gorilla) = 0.5 = Y + Z

Motivation: From a distance table to a tree

X+Z = 0.3

X+Y = 0.4

Y+Z = 0.5Y-Z = 0.1

Y+Z = 0.5

Y = 0.3

Z = 0.2

X = 0.1

Page 20: Refresher: BLOSUM – Brief Overview

A

B

D

X Y

Z

5 Variables,

6 Equations,

It might be that there’s no solutionC

W

V

Is there always a solution??

2 2 3 for N > 3NC N> −

Page 21: Refresher: BLOSUM – Brief Overview

Least squares – Cont’d

,ij ij k kk

d x v= ∑

12 1 2 3 4 5 6 7

13 1 2 3 4 5 6 7

45 1 2 3 4 5 6 7

1 1 0 0 0 0 11 0 1 0 0 1 0

...0 0 0 1 1 1 1

d v v v v v v vd v v v v v v v

d v v v v v v v

= + + + + + += + + + + + +

= + + + + + +

introduce an indicator variable , which is 1 if branch lies in the path from species i to species j and 0 otherwise

,ij kxkv

A B

CD

Ev7

v2

v4v6

v3

v1v5

Page 22: Refresher: BLOSUM – Brief Overview

LS- Cont’d2

1 1;

2

1 1;

when the weights are 1.0 ( ) ( )

( )

n n

ij iji j j i

n n

ij ijk ki j j i k

Q T D d

D x v

= = ≠

= = ≠

= −

= −

∑ ∑

∑ ∑ ∑ , ,1 1;;

2 ( ) 0n n

ij k ij ij k ki j j i kk

dQ x D x vdv = = ≠

= − − =∑ ∑ ∑

( )T TX D X X v=1( )T Tv X X X D−=

XNo of rows=n(n-1)/2No of columns=2n-3 (eq to k)

Page 23: Refresher: BLOSUM – Brief Overview

LS - example

A

B

Cv2

v1v3

A B CA 0 10 12B 10 0 8

C 12 8 0

1 1 01 0 10 1 1

X⎛ ⎞⎜ ⎟= ⎜ ⎟⎜ ⎟⎝ ⎠

2 1 11 2 11 1 2

TX X⎛ ⎞⎜ ⎟= ⎜ ⎟⎜ ⎟⎝ ⎠

1

3 1 14 4 41 3 1( )4 4 41 1 34 4 4

TX X −

⎛ ⎞− −⎜ ⎟⎜ ⎟⎜ ⎟= − −⎜ ⎟⎜ ⎟⎜ ⎟− −⎝ ⎠

1( ) ...T Tv X X X D−= =

10128

D⎛ ⎞⎜ ⎟= ⎜ ⎟⎜ ⎟⎝ ⎠

Page 24: Refresher: BLOSUM – Brief Overview

When we have weighted LS, then previous equationscan be written:

where W is a diagonal matrix with distance weights on main diagonal.

( )T TX WD X WX v=

1( )T Tv X WX X WD−=

Page 25: Refresher: BLOSUM – Brief Overview

Evaluating a tree is good but how to build the tree? − Fast clustering-based algorithm for tree construction

UPGMA (unweighted pair group method using arithmetic averages)

Given two disjoint clusters Ci, Cj of sequences,

1dij = ––––––––– Σ{p ∈Ci, q ∈Cj}dpq

|Ci| × |Cj|

Note that if Ck = Ci ∪ Cj, then distance to another cluster Cl is:

dil |Ci| + djl |Cj|dkl = –––––––––––––– = UPGMA distance

|Ci| + |Cj| ( ) ( )(( ), ) ( ) ( , ) ( ) ( , )( ) ( ) ( ) ( )

n i n jD ij l D i l D j ln i n j n i n j

= ++ +

Page 26: Refresher: BLOSUM – Brief Overview

UPGMA algorithm

• Find i and j with smallest Dij

• Create new group X by joining nodes i & j

• Compute distances between X (new member) and others (old)

• Place node X at Dij/2

• Delete i and j; replace with X in D-matrix

Page 27: Refresher: BLOSUM – Brief Overview

The distance table

dog bear raccon weasel seal sea lion

cat chimp

dog 0 32 48 51 50 48 98 148bear 0 26 34 29 33 84 136raccon 0 42 44 44 92 152weasel 0 44 38 86 142seal 0 24 89 142sea lion 0 90 142cat 0 148chimp 0

Page 28: Refresher: BLOSUM – Brief Overview

seal sea lion

We call the parent node of seal and sea lion “ss”.

12 12

Distance between these two taxa was 24, so each branch has a length of 12.

ss

Page 29: Refresher: BLOSUM – Brief Overview

Removing the seal and sea-lion rows and columns,and adding the ss row and columns

dog bear raccon weasel ss cat chimp

dog 0 32 48 51 ? 98 148bear 0 26 34 ? 84 136raccon 0 42 ? 92 152weasel 0 ? 86 142ss 0 89 142cat 0 148chimp 0

Page 30: Refresher: BLOSUM – Brief Overview

Computing dog-ss distance

dog bear raccon weasel seal sea lion

cat chimp

dog 0 32 48 51 50 48 98 148

),())()(

)((),())()(

)(()),(( kjDjnin

jnkiDjnin

inkijD+

++

=

Here, i=seal, j=sea lion, k = dog.

n(i)=n(j)=1.

D(ss,dog) = 0.5D(sea lion,dog) + 0.5D(seal,dog) = 49.

Page 31: Refresher: BLOSUM – Brief Overview

The new table. Starting second iteration…

dog bear raccon weasel ss cat chimp

dog 0 32 48 51 49 98 148bear 0 26 34 31 84 136raccon 0 42 44 92 152weasel 0 41 86 142ss 0 89 142cat 0 148chimp 0

Page 32: Refresher: BLOSUM – Brief Overview

Inferring tree

We call the parent node of bear and raccoon “br”.

Distance between bear and raccoon was 26, so each branch has a length of 13.

seal sea lion

12 12

ss

bear raccoon

13 13

br

Page 33: Refresher: BLOSUM – Brief Overview

Computing br-ss distance

dog bear raccon weasel ss cat chimp

ss 49 31 44 41 0 89.5 142

Here, i=raccoon, j=bear, k = ss.

n(i)=n(j)=1. D(br,ss) = 0.5D(bear,ss)+0.5D(raccoon,ss)=37.5.

),())()(

)((),())()(

)(()),(( kjDjnin

jnkiDjnin

inkijD+

++

=

Page 34: Refresher: BLOSUM – Brief Overview

The new table. Starting next iteration…

dog br weasel ss cat chimp

dog 0 40 51 49 98 148br 0 38 37.5 88 144weasel 0 41 86 142ss 0 89 142cat 0 148chimp 0

Page 35: Refresher: BLOSUM – Brief Overview

Inferring tree

Distance between br and ss was 37.5, so each branch has a length of 18.75. But this is the distance from brss to the leaves. The distance brssto ss is 18.75-12=6.75. The distance between brss to br is 18.75-13=5.75

seal sea lion

12 12

ss

bear raccoon

6.75

13

brss

br

5.75

13

And so on…..

Page 36: Refresher: BLOSUM – Brief Overview

UPGMA’s Weakness

The algorithm produces an ultrametric tree : the distance from the root to any leaf is the same

• UPGMA assumes a constant molecular clock: all species represented by the leaves in the tree are assumed to accumulate mutations (and thus evolve) at the same rate. This is a major pitfall of UPGMA.

Page 37: Refresher: BLOSUM – Brief Overview

UPGMA’s Weakness: Example

23

41 1 4 32

Correct tree UPGMA

seal sea lion

12 12

ss

bear raccoon

6.75

13

brss

br

5.75

13

Page 38: Refresher: BLOSUM – Brief Overview

Clustering algorithm 2 – NJ (neighbor joining)

Tree reconstruction for any 3x3 matrix is straightforwardWe have 3 leaves i, j, k and a center vertex c

Observe D’s, infer d’s

dic + djc = Dij

dic + dkc = Dik

djc + dkc = Djk

Page 39: Refresher: BLOSUM – Brief Overview

NJ –Cont’d

2dic + Djk = Dij + Dik

dic = (Dij + Dik – Djk)/2

Similarly,djc = (Dij + Djk – Dik)/2dkc = (Dki + Dkj – Dij)/2

Page 40: Refresher: BLOSUM – Brief Overview

Trees with > 3 Leaves – NJ Cont’d

An unrooted tree with n leaves has 2n-3 branches

This means fitting a given tree to a distance matrix D requires solving a system of “n choose 2” equations with 2n-3 variables

This is not always easy to solve for large n

Page 41: Refresher: BLOSUM – Brief Overview

NJ algorithm

For each tip compute

Choose i and j for which, is smallest

Join nodes i and j to X. Compute branch length fro i to X and j to X

Compute the distance between X an remaining nodes

:/ ( 2)

n

i ijj j i

u D n≠

= −∑

ij i jD u u− −

( ) 2

( ) 2i X ij i j

j X ij j i

v D u u

v D u u→

= + −

= + −

( ) 2X k ik jk ijv D D D→ = + −

Page 42: Refresher: BLOSUM – Brief Overview

NJ algorithm – Cont’d

New node X is treated as a new tip and old nodes I, j are deleted

If more than two nodes remain go back to step-1, else connect the two nodes (l,m) by Dl,m

Page 43: Refresher: BLOSUM – Brief Overview

Bootstrapping to get the best trees

Main outline of algorithm

1. Select random columns from a multiple alignment – one column can then appear several times

2. Build a phylogenetic tree based on the random sample from (1)

3. Repeat (1), (2) many (say, 1000) times

4. Output the tree that is constructed most frequently or calculate a probability for each sub-tree topology

Page 44: Refresher: BLOSUM – Brief Overview

Pairwise alignment: calculation of distance matrix

Rooted NJ tree (guide tree) and calculation of sequence weights

Progressive alignment following the guide tree

Multiple sequence alignment using phylogenetic methods − Clustal-W

Page 45: Refresher: BLOSUM – Brief Overview

Step 1-Calculation of Distance Matrix using pairwise alignment

Use the Distance Matrix to create a Guide Tree todetermine the “order” of the sequences.

I =D = 1 – (I) D = Difference score

# of identical aa’s in pairwise global alignmenttotal number of aa’s in shortest sequence

Hbb-Hu 1 -

Hbb-Ho 2 .17 -

Hba-Hu 3 .59 .60 -

Hba-Ho 4 .59 .59 .13 -

Myg-Ph 5 .77 .77 .75 .75 -

Gib-Pe 6 .81 .82 .73 .74 .80 -

Lgb-Lu 7 .87 .86 .86 .88 .93 .90 -

1 2 3 4 5 6 7

Page 46: Refresher: BLOSUM – Brief Overview

Step 2-Create Rooted Tree and calculate weights

Weight

AlignmentOrder of alignment:1 Hba-Hu vs Hba-Ho2 Hbb-Hu vs Hbb-Ho3 A vs B4 Myg-Ph vs C5 Gib-Pe vs D6 Lgh-Lu vs E

Neighbor joining algorithm – simple, to be discussed later

Page 47: Refresher: BLOSUM – Brief Overview

Step 3-Progressive alignment

Page 48: Refresher: BLOSUM – Brief Overview

Step 3-Progressive alignmentScoring during progressive alignment

Page 49: Refresher: BLOSUM – Brief Overview

Recommended MSA Programs

MUSCLE (fast and accurate)MAVID (genome-scale alignment)SAM ( hidden markov, powerful and wide range of options)

Page 50: Refresher: BLOSUM – Brief Overview