Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM...

43
Technical University of Denmark - DTU Department of systems biology CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS Biological Sequence Analysis Chapter 3

Transcript of Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM...

Page 1: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Biological Sequence Analysis

• Chapter 3

Page 2: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Protein Families

Organism 1

Organism 2

Enzyme 1

Enzyme 2

Closely related Same Function

MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS::::::.::::::::::::::::::::.::::::::::::::::::::::::::::::::::::::::::MSEKKQTVDLGLLEEDDEFEEFPAEDWTGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS

Related Sequences

Protein Family

Page 3: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Alignments

ACDEFGHIKLMNACEDFGHIPLMN

ACDEFGHIKLMNACACFGKIKLMN

Page 4: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Alignments

ACDEFGHIKLMNACEDFGHIPLMN

75%ID

ACDEFGHIKLMNACACFGKIKLMN

Page 5: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Alignments

ACDEFGHIKLMNACEDFGHIPLMN

75%ID

ACDEFGHIKLMNACACFGKIKLMN

75%ID

Page 6: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Substitutions

AlanineA

W Tryptophane

Page 7: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Substitutions

Glutamic acid

Aspartic acidD

E

Page 8: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Substitutions

T Threonine

S Serine

Page 9: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Deriving Substitution ScoresBLOSUM, Henikoff & Henikoff, 1992

Protein Family

Block A Block B

Page 10: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

BLOSUM MatricesHenikoff & Henikoff, 1992

...A...

...A...

...A......S......A...

...A...

...A...

...A...

...A...

...A...A 8 AA 1 AS

7 AA6 AA5 AA4 AA3 AA2 AA0 AA1 AA

1 AS1 AS1 AS1 AS1 AS2 SA0 AS

1 AS

36 9

45

s

w

ws(s-1)/2 = 1x10x9/2 =

f

Page 11: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

BLOSUM MatricesHenikoff & Henikoff, 1992

AA AC AD AE

0.8 0 0 0 .........AR AS AT AV AW AY CC CD

0 0.2 0 0 0 0 0 0

VY WW WY YY

0 0 0 0.........

qij =f ij

i=1

20

∑ fijj=1

i

fAA = 36,qAA = 36 /45 = 0.8fAS = 9,qAS = 9 /45 = 0.2

q

Raw Frequencies

Page 12: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

BLOSUM MatricesHenikoff & Henikoff, 1992

pi = qii +!

j !=i

qij/2

The probability of occurrence of the i’th amino acid in an i, j pair is:

45 pairs = 90 participants in pairs

A’s in pairs: 36x2 + 9x1 = 81AA AS

S’s in pairs: 0x2 + 9x1 = 9

Probability pA for encountering an A: 81/90 = 0.9Probability pS for encountering an S: 9/90 = 0.1

Page 13: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

BLOSUM MatricesHenikoff & Henikoff, 1992

Expected probability, e, of occurrence of pairs:

eAA = pApA = 0.9x0.9 = 0.81

eAS = pApS + pSpA = 0.9x0.1 + 0.1x0.9 = 2x(0.9x0.1) = 0.18

eSS = pSpS = 0.1x0.1 = 0.01

eij = pipj(i = j), eij = 2pipj(i != j)

Page 14: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

BLOSUM MatricesHenikoff & Henikoff, 1992

Odds and logodds:Odd ratio: qij/eij

logodd, s: sij = log2(qij/eij)

sij = 0 means that the observed frequencies are as expected

sij < 0 means that the observed frequencies are lower than expectedsij > 0 means that the observed frequencies are higher than expected

In the final BLOSUM matrices values are presented in half-bits, i.e., logodds are multiplied with 2 and rounded to nearest integer.

Page 15: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

BLOSUM MatricesHenikoff & Henikoff, 1992

Segment clusteringSequences with more than X% ID are represented as one average sequence (cluster)

Sequences are added to the cluster if it has more than X% ID to any of the sequences already in the cluster

If the clustering level is more than 50% ID, the final Matrix is a BLOSUM50, more than 62% leads to the BLOSUM62 matrix, etc.

Page 16: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

BLOSUM MatricesHenikoff & Henikoff, 1992

A R N D C Q E G H I L K M F P S T W Y V B Z X *A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4 * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1

ACDEFGHIKLMNACEDFGHIPLMN

ACDEFGHIKLMNACACFGKIKLMN

Page 17: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

BLOSUM MatricesHenikoff & Henikoff, 1992

A R N D C Q E G H I L K M F P S T W Y V B Z X *A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4 * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1

ACDEFGHIKLMNACEDFGHIPLMN

ACDEFGHIKLMNACACFGKIKLMN

4+9-2-4+6+6-1+4+5+4+5+6=42

Page 18: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

BLOSUM MatricesHenikoff & Henikoff, 1992

A R N D C Q E G H I L K M F P S T W Y V B Z X *A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4 * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1

ACDEFGHIKLMNACEDFGHIPLMN

ACDEFGHIKLMNACACFGKIKLMN

4+9-2-4+6+6-1+4+5+4+5+6=42 4+9+2+2+6+6+8+4-1+4+5+6=55

Page 19: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

A

10 20 30 40 50 60 70humanD -----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS

..: . : :.:. : . .. ...:. :::::::::::::::..::::.::::Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK------

10 20 30 40 50 60

B

10 20 30 40 50 60 70humanD ----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS

....:...:::::::::::::::::::::..:::..........::....:..::..........Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK-----

10 20 30 40 50 60

Figure 3.3: (A) The human proteasomal subunit aligned to the mosquito homolog using theBLOSUM50 matrix. (B) The human proteasomal subunit aligned to the mosquito homolog using

identity scores.

Gaps

Page 20: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

A

10 20 30 40 50 60 70humanD -----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS

..: . : :.:. : . .. ...:. :::::::::::::::..::::.::::Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK------

10 20 30 40 50 60

B

10 20 30 40 50 60 70humanD ----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS

....:...:::::::::::::::::::::..:::..........::....:..::..........Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK-----

10 20 30 40 50 60

Figure 3.3: (A) The human proteasomal subunit aligned to the mosquito homolog using theBLOSUM50 matrix. (B) The human proteasomal subunit aligned to the mosquito homolog using

identity scores.

Gaps

Page 21: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

A

10 20 30 40 50 60 70humanD -----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS

..: . : :.:. : . .. ...:. :::::::::::::::..::::.::::Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK------

10 20 30 40 50 60

B

10 20 30 40 50 60 70humanD ----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS

....:...:::::::::::::::::::::..:::..........::....:..::..........Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK-----

10 20 30 40 50 60

Figure 3.3: (A) The human proteasomal subunit aligned to the mosquito homolog using theBLOSUM50 matrix. (B) The human proteasomal subunit aligned to the mosquito homolog using

identity scores.

Gaps

Page 22: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

A

10 20 30 40 50 60 70humanD -----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS

..: . : :.:. : . .. ...:. :::::::::::::::..::::.::::Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK------

10 20 30 40 50 60

B

10 20 30 40 50 60 70humanD ----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAHVWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS

....:...:::::::::::::::::::::..:::..........::....:..::..........Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK-----

10 20 30 40 50 60

Figure 3.3: (A) The human proteasomal subunit aligned to the mosquito homolog using theBLOSUM50 matrix. (B) The human proteasomal subunit aligned to the mosquito homolog using

identity scores.

Gaps

10 20 30 40 50 60 70humanD ----MSEKKQPVDLGLLEEDDEFEEFPAEDWAGLDEDEDAH-VWEDNWDDDNVEDDFSNQLRAELEKHGYKMETS

....:...:::::::::::::::::::::..:::... :::::::::::::::..::::.::::Anophe MSDKENKDKPKLDLGLLEEDDEFEEFPAEDWAGNKEDEEELSVWEDNWDDDNVEDDFNQQLRAQLEKHK------

10 20 30 40 50 60

Page 23: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Gap PenaltiesA gap is a kind like a mismatch but...

Often the gap score (gap penalty) has an even lower value (higher penalty) than the lowest mismatch score

Having only one type of gap penalties is called a linear gap cost

Biologically gaps are often inserted/deleted as a one or more event

In most alignment algorithms is two gap penalties.

One for making the first gap

Another (higher score=less penalty) for making an additional gap

This is called affine gap penalty

Page 24: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Dynamic programming: example

A C G TA 1 -1 -1 -1C -1 1 -1 -1G -1 -1 1 -1T -1 -1 -1 1

Gaps: -2

Page 25: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Dynamic programming: example

Page 26: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Dynamic programming: example

Page 27: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Dynamic programming: example

Page 28: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Dynamic programming: example

T C G C A: : : :T C - C A1+1-2+1+1= 2

Page 29: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Global versus local alignmentsGlobal alignment: align full length of both sequences. (The

“Needleman-Wunsch” algorithm).

Local alignment: find best partial alignment of two sequences (the “Smith-Waterman” algorithm).

Global alignment

Seq 1

Seq 2

Local alignment

Page 30: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Local alignment overview

• The recursive formula is changed by adding a fourth possibility: zero. This means local alignment scores are never negative.

• Trace-back is started at the highest value rather than in lower right corner

• Trace-back is stopped as soon as a zero is encountered

score(x,y) = max

score(x,y-1) - gap-penalty

score(x-1,y-1) + substitution-score(x,y)

score(x-1,y) - gap-penalty

0

Page 31: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Local alignment: example

Page 32: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Substitution matrices and sequence similarity

• Substitution matrices come as series of matrices calculated for different degrees of sequence similarity (different evolutionary distances).

• ”Hard” matrices are designed for similar sequences– Hard matrices a designated by high numbers in the BLOSUM series

(e.g., BLOSUM80)– Hard matrices yield short, highly conserved alignments

• ”Soft” matrices are designed for less similar sequences– Soft matrices have low BLOSUM values (45)– Soft matrices yield longer, less well conserved alignments

Page 33: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Alignments: things to keep in mind

“Optimal alignment” means “having the highest possible score, given substitution matrix and set of gap penalties”.

This is NOT necessarily the biologically most meaningful alignment.

Specifically, the underlying assumptions are often wrong: substitutions are not equally frequent at all positions, affine gap penalties do not model insertion/deletion well, etc.

Pairwise alignment programs always produce an alignment - even when it does not make sense to align sequences.

Page 34: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Database searching

Using pairwise alignments to search

databases for similar sequences

Database

Query sequence

Page 35: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Database searchingMost common use of pairwise sequence alignments is to search databases for related sequences. For instance: find probable function of newly isolated protein by identifying similar proteins with known function.

Most often, local alignment ( “Smith-Waterman”) is used for database searching: you are interested in finding out if ANY domain in your protein looks like something that is known.

Often, full Smith-Waterman is too time-consuming for searching large databases, so heuristic methods are used (fasta, BLAST).

Page 36: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Database searching: heuristic search algorithms

FASTA (Pearson 1995)

Uses heuristics to avoid calculating the full dynamic programming matrix

Speed up searches by an order of magnitude compared to full Smith-Waterman

The statistical side of FASTA is still stronger than BLAST

BLAST (Altschul 1990, 1997)

Uses rapid word lookup methods to completely skip most of the database entries

Extremely fastOne order of magnitude faster

than FASTATwo orders of magnitude faster

than Smith-Waterman

Almost as sensitive as FASTA

Page 37: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

BLAST flavors

BLASTNNucleotide query sequenceNucleotide database

BLASTPProtein query sequenceProtein database

BLASTXNucleotide query sequenceProtein databaseCompares all six reading frames

with the database

TBLASTNProtein query sequenceNucleotide database”On the fly” six frame translation of

database

TBLASTXNucleotide query sequenceNucleotide databaseCompares all reading frames of query

with all reading frames of the database

Page 38: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Searching on the web: BLAST at NCBI

Very fast computer dedicated to running BLAST searches

Many databases that are always up to date

Nice simple web interface

But you still need knowledge about BLAST to use it properly

Page 39: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

When is a database hit significant?

• Problem:

• Even unrelated sequences can be aligned (yielding a low score)

• How do we know if a database hit is meaningful?

• When is an alignment score sufficiently high?

• Solution:

• Determine the range of alignment scores you would expect to get for random reasons (i.e., when aligning unrelated sequences).

• Compare actual scores to the distribution of random scores.

• Is the real score much higher than you’d expect by chance?

Page 40: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Random alignment scores follow extreme value distributions

The exact shape and location of the distribution depends on the exact nature of the database and the query sequence

Searching a database of unrelated sequences result in scores following an extreme value distribution

Page 41: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Significance of a hit: one possible solution

(1) Align query sequence to all sequences in database, note scores

(2) Fit actual scores to a mixture of two sub-distributions: (a) an extreme value distribution and (b) a normal distribution

(3) Use fitted extreme-value distribution to predict how many random hits to expect for any given score (the “E-value”)

Page 42: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Significance of a hit: exampleSearch against a database of 10,000 sequences.

An extreme-value distribution (blue) is fitted to the distribution of all scores.

It is found that 99.9% of the blue distribution has a score below 112.

This means that when searching a database of 10,000 sequences you’d expect to get 0.1% * 10,000 = 10 hits with a score of 112 or better for random reasons

10 is the E-value of a hit with score 112. You want E-values well below 1!

Page 43: Biological Sequenc e Anal ysis - CBS€¦ · CENTER FOR BIOLOGICAL SEQUENCE ANALYSIS BLOSUM Matrices Henikoff & Henikoff, 1992 Segment clustering Sequences with more than X% ID

Technical University of Denmark - DTUDepartment of systems biology

CE

NT

ER

FOR

BIO

LOG

ICA

L SE

QU

EN

CE

AN

ALY

SIS

Database searching: E-values in BLAST

BLAST uses precomputed extreme value distributions to calculate E-values from alignment scores

For this reason BLAST only allows certain combinations of substitution matrices and gap penalties

This also means that the fit is based on a different data set than the one you are working on

A word of caution: BLAST tends to overestimate the significance of its matches

E-values from BLAST are fine for identifying sure hitsOne should be careful using BLAST’s E-values to judge if a marginal hit can be

trusted (e.g., you may want to use E-values of 10-4 to 10-5).