Scoring of alignments, scoring matrices (2011-05-23)


Sequence comparison and alignment

• Maximum parsimony: choose the alignment requiring the least improbable combination of point mutations and insertions/deletions

Before alignment

Sequence1 AGGVLIIQVG
          ||||||
Sequence2 AGGVLIQVG

After alignment

Sequence1 AGGVLIIQVG
          |||||| |||
Sequence2 AGGVLI-QVG

Insertion in sequence 1 or deletion in sequence 2: gap

Comparing longer sequences

[Figure: comparison of sequences A and B]


Scoring of alignments, scoring matrices

• Unitary scoring matrix: identity = one point; otherwise no point

• Does not take into account that some mutations are more common than others

• Insensitive; difficult to detect distant relationships
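The unitary scheme above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function name `identity_score` is hypothetical, and gap positions are assumed to score zero.

```python
def identity_score(row1: str, row2: str) -> int:
    """Score two aligned rows with the unitary matrix.

    Each identical pair of residues gives one point; mismatches and
    gap positions ("-") give no point (assumption: gaps are not penalised).
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for a, b in zip(row1, row2) if a == b and a != "-")


# The aligned example from the previous slide: 9 identical positions.
print(identity_score("AGGVLIIQVG", "AGGVLI-QVG"))  # 9

# A conservative change (I -> L) and a drastic one (I -> W) score the same,
# which is why this matrix is insensitive.
print(identity_score("I", "L") == identity_score("I", "W"))  # True
```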


Mutation matrix summarises observed mutations

From comparison of sequences known to be related; see also Petsko & Ringe fig. 1-6

Mutation probability matrix for the evolutionary distance of 2 PAM

[Table: 20 × 20 matrix with rows and columns in the order Gly, Pro, Asp, Glu, Ala, Asn, Gln, Ser, Thr, Lys, Arg, His, Val, Ile, Met, Cys, Leu, Phe, Tyr, Trp, plus a final column of row sums. The individual entries are not legible in this transcript.]

Row sums (entries × 10,000):

Gly 10063   Pro  9942   Asp  9996   Glu  9981   Ala 10188
Asn  9927   Gln  9900   Ser 10003   Thr 10030   Lys 10125
Arg  9960   His  9975   Val 10188   Ile  9872   Met  9737
Cys  9964   Leu 10140   Phe 10047   Tyr  9981   Trp  9960

All entries are multiplied by 10,000. An element of this matrix, m(i,j), gives the probability that the amino acid in column j will be replaced by the amino acid in row i after an evolutionary interval of 2 PAM, i.e., 2 accepted point mutations per 100 amino acids. Thus, there is a probability of 0.0059 that Ala will be replaced by Ser, and a probability of 0.0099 that Ser will be replaced by Ala. The sum of each column is 1.0. The sum of a row represents the "growth factor per 2 PAM" of the corresponding amino acid residue; it ranges from 0.9737 (for Met) to 1.0188 (for Ala and Val).

(PAM = point accepted mutation, often glossed as percent accepted mutations; 1 PAM corresponds to one accepted point mutation per 100 amino acids)


Scoring matrix from mutation matrix

Each matrix element relates the probability of similarity due to conservation to the probability of chance similarity

"Log-odds" matrices: each element uses the logarithm of substitution probabilities

Gaps: gap creation penalty, gap extension penalty

Weighting with log-odds matrix

Scoring
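A minimal sketch of how one log-odds element could be computed from a substitution probability and a chance probability. The scaling (10 × log10) follows common PAM practice but is an assumption here, as are the function names, the affine form of the gap penalty and all numeric values.

```python
import math


def log_odds(p_related: float, p_chance: float, scale: float = 10.0) -> float:
    """Log-odds score: positive when a pairing is more likely in related
    sequences than by chance, negative when it is less likely."""
    return scale * math.log10(p_related / p_chance)


def gap_penalty(length: int, creation: float = 10.0, extension: float = 1.0) -> float:
    """Affine gap penalty: one creation cost plus a cost per gap position
    (hypothetical default values)."""
    return 0.0 if length == 0 else creation + extension * length


# Hypothetical numbers: a pairing seen in 12 % of related positions
# but expected in only 4 % of positions by chance.
print(round(log_odds(0.12, 0.04), 1))
print(gap_penalty(3))
```

Because logarithms turn products into sums, adding the scores along an alignment corresponds to multiplying the underlying probability ratios.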


Log-odds matrix for 250 PAM

C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2
G  -3   1   0  -1   1   5
N  -4   1   0  -1   0   0   2
D  -5   0   0  -1   0   1   2   4
E  -5   0   0  -1   0   0   1   3   4
Q  -5  -1  -1   0   0  -1   1   2   2   4
H  -3  -1  -1   0  -1  -2   2   1   1   3   6
R  -4   0  -1   0  -2  -3   0  -1  -1   1   2   8
K  -5   0   0  -1  -1  -2   1   0   0   1   0   3   5
M  -5  -2  -1  -2  -1  -3  -2  -3  -2  -1  -2   0   0   6
I  -2  -1   0  -2  -1  -3  -2  -2  -2  -2  -2  -2  -2   2   5
L  -8  -3  -2  -3  -2  -4  -3  -4  -3  -2  -2  -3  -3   4   2   8
V  -2  -1   0  -1   0  -1  -2  -2  -2  -2  -2  -2  -2   2   4   2   4
F  -4  -3  -3  -5  -4  -5  -4  -6  -5  -5  -2  -4  -5   0   1   2  -1   9
Y   0  -3  -3  -5  -3  -5  -2  -4  -4  -4   0  -4  -4  -2  -1  -1  -2   7  10
W  -8  -2  -5  -6  -6  -7  -4  -7  -7  -5  -3   2  -3  -4  -5  -2  -6   0   0  17
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W

Symmetric (only the lower triangle is shown)


Assumptions of the PAM model

Assumptions in the PAM model:
1. Replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model).
2. Sequences that are being compared have average amino acid composition.

Sources of error in the PAM model:
1. Many sequences depart from average composition.
2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements were observed!).
3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM.
4. The Markov process is an imperfect representation of evolution: distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over the entire sequence.

Must use sequences with a known relationship (> 85 % identity) and extrapolate to lower levels of similarity
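Under the Markov assumption, the extrapolation from 1 PAM to 250 PAM amounts to multiplying the 1 PAM matrix by itself 250 times. The sketch below illustrates this with a hypothetical 3-letter alphabet; the matrix values and the names `TOY_PAM1` and `matrix_power` are invented for illustration and are not Dayhoff's data.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = matmul(result, m)
    return result


# Hypothetical 1 PAM-like matrix for a toy 3-letter alphabet.
# Same convention as the mutation probability matrix: element [i][j] is the
# probability that the letter of column j is replaced by the letter of row i,
# so every column sums to 1.
TOY_PAM1 = [
    [0.990, 0.006, 0.002],
    [0.008, 0.990, 0.008],
    [0.002, 0.004, 0.990],
]

toy_pam250 = matrix_power(TOY_PAM1, 250)
column_sums = [sum(row[j] for row in toy_pam250) for j in range(3)]
print(column_sums)          # each column still sums to 1 (up to rounding)
print(toy_pam250[0][0])     # the chance of staying unchanged has dropped
```

Every entry of the extrapolated matrix depends on all entries of the 1 PAM matrix, which is why poorly sampled rare replacements affect the 250 PAM scores.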


Evolutionary rates

Rates of mutation acceptance (PAMs per 100 million years)

Ig kappa chain C region           37
Kappa casein                      33
Phospholipase A                   19
Prolactin                         17
Carbonic anhydrase C              16
Hemoglobin alpha chain            12
Lipid-binding protein A-II        10
Animal lysozyme                   9.8
Myoglobin                         8.9
Trypsin                           5.9
Alpha crystallin A chain          5.0
Cytochrome b                      4.5
Calcitonin                        4.3
Neurophysin 2                     3.6
Lactate dehydrogenase             3.4
Adenylate kinase                  3.2
Triosephosphate isomerase         2.8
Vasoactive intestinal peptide     2.6
Cytochrome c                      2.2
Plant ferredoxin                  1.9
Troponin C, skeletal muscle       1.5
Glutamate dehydrogenase           0.9
Histone H2B                       0.9
Histone H2A                       0.5
Histone H3                        0.14
Histone H4                        0.10
Ubiquitin                         0.00

Rates depend on the protein: different mutation rates and selection pressures
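As a reading aid for the rates above, a value in PAMs per 100 million years can be inverted to give the average time needed to accumulate 1 PAM. The helper name below is hypothetical; the three rates are taken from the slide.

```python
# Rates from the slide, in PAMs per 100 million years.
RATES = {
    "Hemoglobin alpha chain": 12.0,
    "Cytochrome c": 2.2,
    "Histone H4": 0.10,
}


def million_years_per_pam(rate_per_100_myr: float) -> float:
    """Average time (in million years) to accumulate 1 PAM at the given rate."""
    return 100.0 / rate_per_100_myr


for protein, rate in RATES.items():
    print(f"{protein}: about {million_years_per_pam(rate):.0f} million years per PAM")
```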


BLOSUM matrices

• Based on short local alignments (BLOCKS database); known relatedness not required

• A matrix derived at a suitable level of similarity can be used (e.g. BLOSUM62, built from blocks clustered at 62 % identity)


BLOSUM vs. PAM

It is best to use a scoring matrix derived from sequences with levels of similarity close to those of the sequences being investigated

Optimal pairwise alignment depends on efficient algorithms

• Problem: finding the alignment with the highest score
• Calculation of the score for all possibilities is not feasible; need an optimization method to find the best solution with minimum computation
• Dynamic programming method

[Figure: two dot plots comparing MNALSQLN (horizontal) with NALMSQNH (vertical)]

Illustration with dot plots: finding the best path


Real proteins

Needleman-Wunsch algorithm: implementation of the dynamic programming method for finding the optimal solution in pairwise alignment. Mathematically guaranteed: it can be proven that the best alignment for the given scoring scheme will be found.
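A compact sketch of the Needleman-Wunsch recurrence and traceback. To keep it short it assumes a simple match/mismatch score and a linear gap penalty (hypothetical values) instead of a log-odds matrix with separate creation and extension penalties.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming with a linear gap penalty.

    Returns (best score, aligned seq1, aligned seq2).
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: a prefix aligned against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill: each cell is the best of diagonal (pair), up (gap in seq2), left (gap in seq1).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner to recover one optimal alignment.
    row1, row2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row1.append(seq1[i - 1])
            row2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row1.append(seq1[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(seq2[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(row1)), "".join(reversed(row2))


best, aligned1, aligned2 = needleman_wunsch("AGGVLIIQVG", "AGGVLIQVG")
print(best)       # 7: nine matches and one gap
print(aligned1)
print(aligned2)
```

The gap may land opposite either of the two adjacent I residues, since both placements give the same score; the traceback simply reports one of the optimal alignments.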


Database searching

• How to find sequences related to a given sequence in a large database?
• Need a large number of sequence comparisons and scoring of results
• Need fast, approximate methods for sequence comparison
• Word methods (k-tuple): FASTA, BLAST
• Search for "words" (k-tuples), then investigate hits more closely
• Result: a number of optimised alignments (with gaps), ranked according to score


How BLAST works

Query sequence

"Words" (subsequences of the query sequence)

Query words are compared to the database (target sequences) and exact matches identified

For each word match, alignment is extended in both directions to find alignments that score greater than some threshold (maximal segment pairs, or MSPs) (Schneider and La Rota 2000)
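The word-and-extend idea can be sketched as follows. This is a simplified illustration, not BLAST's implementation: it uses exact word matches only, a match/mismatch score instead of a scoring matrix, ungapped extension, and a hypothetical `drop` cut-off that stops extension once the running score falls too far below its best value.

```python
from collections import defaultdict


def word_index(query, k=3):
    """Map every k-letter word of the query to its start positions."""
    index = defaultdict(list)
    for pos in range(len(query) - k + 1):
        index[query[pos:pos + k]].append(pos)
    return index


def find_seeds(query, target, k=3):
    """Return (query_pos, target_pos) for every exact word match."""
    index = word_index(query, k)
    seeds = []
    for t_pos in range(len(target) - k + 1):
        for q_pos in index.get(target[t_pos:t_pos + k], []):
            seeds.append((q_pos, t_pos))
    return seeds


def extend_seed(query, target, q_pos, t_pos, k=3, match=1, mismatch=-1, drop=2):
    """Extend a word match without gaps in both directions.

    Returns (score, query_start, target_start, length) of the best segment pair.
    """
    def extend(q, t, step):
        # Walk away from the seed; remember the best gain and how far it reached.
        gain = best_gain = best_len = length = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            gain += match if query[q] == target[t] else mismatch
            length += 1
            if gain > best_gain:
                best_gain, best_len = gain, length
            elif best_gain - gain > drop:
                break
            q += step
            t += step
        return best_gain, best_len

    right_gain, right_len = extend(q_pos + k, t_pos + k, 1)
    left_gain, left_len = extend(q_pos - 1, t_pos - 1, -1)
    score = k * match + left_gain + right_gain
    return score, q_pos - left_len, t_pos - left_len, k + left_len + right_len


# The two short sequences from the dot-plot illustration share the word "NAL".
seeds = find_seeds("MNALSQLN", "NALMSQNH")
print(seeds)
print([extend_seed("MNALSQLN", "NALMSQNH", q, t) for q, t in seeds])
```

Segment pairs whose score exceeds a chosen threshold would then be kept and ranked.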


BLAST

Optimize and rank HSPs


FASTA  


Statistics

Score (S): measure of similarity between the query sequence and a match

Expect (E) value: a parameter that describes the number of hits (with score ≥ S) one can expect to see by chance when searching a database of a particular size. It decreases exponentially as the score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.

K and λ are parameters for the database and scoring system; n' and m' relate to the sequence lengths; D is the size of the database
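The slide's own formula is not preserved in this transcript. The sketch below assumes the standard Karlin-Altschul form, E = K·m·n·exp(−λS), with m the query length and n the total database length; how the slide combines n', m' and D is not shown, so that mapping is an assumption. The K and λ values are illustrative, and the function names are hypothetical.

```python
import math


def expect_value(score, m, n, k, lam):
    """Karlin-Altschul estimate of chance hits: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


def bit_score(score, k, lam):
    """Normalised score S' = (lambda * S - ln K) / ln 2, so that E = m * n * 2**(-S')."""
    return (lam * score - math.log(k)) / math.log(2)


# Illustrative parameters only (assumed, not taken from the slide).
K, LAM = 0.041, 0.267
QUERY_LEN, DB_LEN = 250, 100_000_000

for raw in (40, 60, 80):
    e = expect_value(raw, QUERY_LEN, DB_LEN, K, LAM)
    print(f"S = {raw}: bit score {bit_score(raw, K, LAM):.1f}, E = {e:.2e}")
```

The loop makes the exponential dependence visible: each fixed increase in S divides E by the same factor.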


BLAST  output  1:  Overview  

Sequence motifs detected

BLAST  output  2:  List  


BLAST output 3: alignments

Best hit (number 1)

Number 25


Significance of sequence similarity

Practical definition: > 20 % of residues identical (after reasonable correction for insertions/deletions). What is the probability of 20 % identity by chance in 100-residue sequences?

After Schulz and Schirmer, Principles of Protein Structure

Let l = sequence length and i = number of identical amino acids

$$P = \left(\frac{1}{20}\right)^{i} \left(\frac{19}{20}\right)^{l-i} \frac{l!}{i!\,(l-i)!}$$

With l = 100 and i = 20 this gives

$$P = \left(\frac{1}{20}\right)^{20} \left(\frac{19}{20}\right)^{80} \frac{100!}{20!\,80!} \approx 10^{-7}$$
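The estimate can be checked numerically. The sketch below evaluates the same binomial expression, assuming (as the formula does) that all 20 amino acids are equally frequent; the function name is hypothetical.

```python
from math import comb


def chance_identity_probability(length: int, identical: int, alphabet: int = 20) -> float:
    """Probability of exactly `identical` matching positions out of `length`
    when each position matches by chance with probability 1/alphabet."""
    p = 1 / alphabet
    return comb(length, identical) * p ** identical * (1 - p) ** (length - identical)


p = chance_identity_probability(100, 20)
print(f"{p:.2e}")  # on the order of 1e-7
```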

Alignment becomes more difficult at lower levels of similarity


Sequence and functional similarity

Petsko & Ringe fig. 4.3

Single domain

Multiple domains


Translated BLAST

Method    Query                     Database
--------------------------------------------------------------------------------
BLASTP    protein                   protein
BLASTN    DNA                       DNA
BLASTX    DNA (6 reading frames)    protein
TBLASTN   protein                   DNA (6 reading frames)    Time consuming
TBLASTX   DNA (6 reading frames)    DNA (6 reading frames)    More time consuming
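The "6 reading frames" are the three codon offsets on each of the two DNA strands. A minimal sketch of such a translation, assuming the standard genetic code and hypothetical function names; incomplete trailing codons are dropped and unknown codons are written as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Complement every base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of the string."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    strands = {"+": dna, "-": reverse_complement(dna)}
    return {f"{sign}{offset + 1}": translate(strand[offset:])
            for sign, strand in strands.items()
            for offset in range(3)}


for frame, protein in six_frames("ATGGCCATTGTAATGGGCCGC").items():
    print(frame, protein)
```

Searching all six translations is what makes the translated variants in the table more time consuming than BLASTP or BLASTN.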


Multiple sequence alignment

Alignment of more than two sequences to produce the best global fit (optimal placement of gaps etc.)

Sequence no.   Position
               1  2  3  4  5  6  7  8  9  10
---------------------------------------------
I              Y  D  G  G  A  V  -  E  A  L
II             Y  D  G  G  -  -  -  E  A  L
III            F  E  G  G  I  L  V  E  A  L
IV             F  D  -  G  I  L  V  Q  A  V
V              Y  E  G  G  A  V  V  Q  A  L

• Computationally difficult; cost proportional to (sequence length)^(number of sequences)

• Not mathematically guaranteed to find the best solution

• Most methods start with pairwise alignments

• Clustal is a common program for MSA


Reconstruction of evolution from sequence alignments

Related sequences:

ACGH
DBGH
ADIJ
CBIJ

Phylogenetic tree:

ABGH -> ACGH (B->C)
ABGH -> DBGH (A->D)
ABIJ -> ADIJ (B->D)
ABIJ -> CBIJ (A->C)
ABGH <-> ABIJ (I<->G, J<->H)

Parsimony: most probable path?

Minimal number of mutations?
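The parsimony question can be made concrete by counting the mutations along each branch of the tree in the figure. The sketch below assumes the branch assignment read off the figure (ancestors ABGH and ABIJ, each with two descendants); the names `hamming` and `TREE_EDGES` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Branches of the tree as (one end, other end); assumed from the figure.
TREE_EDGES = [
    ("ABGH", "ACGH"),  # B->C
    ("ABGH", "DBGH"),  # A->D
    ("ABIJ", "ADIJ"),  # B->D
    ("ABIJ", "CBIJ"),  # A->C
    ("ABGH", "ABIJ"),  # I<->G, J<->H
]

total_mutations = sum(hamming(a, b) for a, b in TREE_EDGES)
print(total_mutations)  # 6
```

A parsimony method would compute this total for alternative trees and prefer the one with the smallest number.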


Multiple sequence alignment: example

Source: protein kinase domains from Pfam

Petsko & Ringe fig. 4.4

Catalytic loop

Insertions/deletions


Database search using multiple alignment: PSI-BLAST

• Position-specific iterated BLAST
• Useful for detection of related sequences with weak similarity
• Step 1. Identify close relatives and perform multiple sequence alignment. Generate a sequence profile (PSSM, position-specific scoring matrix) from the MSA
• Step 2. Query the database with the generated profile. Hits can be added to the alignment and the profile can be modified.
• Repeat step 2 until no more sequences are added to the alignment
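A minimal sketch of the profile-building part of step 1, using the five-sequence alignment from the multiple sequence alignment slide. It is not PSI-BLAST's actual weighting scheme: it assumes a uniform background frequency of 1/20, a simple pseudocount, and gaps that are ignored; all names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in AMINO_ACIDS})
    return pssm


def profile_score(pssm, segment):
    """Score an ungapped segment of the same length as the profile."""
    return sum(column[res] for column, res in zip(pssm, segment))


MSA = ["YDGGAV-EAL", "YDGG---EAL", "FEGGILVEAL", "FD-GILVQAV", "YEGGAVVQAL"]
pssm = build_pssm(MSA)
print(round(profile_score(pssm, "YDGGAVVEAL"), 2))   # consensus-like segment
print(round(profile_score(pssm, "WWWWWWWWWW"), 2))   # unrelated segment
```

Unlike a single substitution matrix, the score for a residue now depends on the column, so the fully conserved G in position 4 weighs more than variable positions.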


Sequence conservation and structure

Cellulose-binding domains